What are the best practices for versioning and testing prompts in production?
I'm working on a production system where we use GPT-4 for various tasks. As we iterate on our prompts, I'm concerned about:
- Version control: How do you track prompt changes over time?
- A/B testing: What's the best way to test prompt variations?
- Regression testing: How do you ensure new prompts don't break existing functionality?
- Monitoring: What metrics should we track for prompt performance?
We currently just have prompts in our codebase, but I feel like we need a more robust system. What tools and processes do successful AI teams use for prompt management?
Comments
This is exactly what I needed! We just started using LLMs and had no idea how to manage prompts properly.
2 Answers
Great question! Here's how we handle prompt management in production:
1. Version Control
We store prompts in a dedicated prompts/ directory with semantic versioning:
```
prompts/
  customer-support/
    v1.0.0.txt
    v1.1.0.txt
    v2.0.0.txt
```
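A minimal loader for this layout might look like the sketch below. It resolves `"task/latest"` by sorting the semver filenames numerically; the function name matches the `load_prompt` call used later in this answer, but this particular implementation is an assumption, not the answerer's actual code:

```python
from pathlib import Path

def load_prompt(path: str, root: str = "prompts") -> str:
    """Load a prompt like 'customer-support/v2.0.0' from the prompts/
    directory; 'task/latest' picks the highest semantic version."""
    task, _, version = path.rpartition("/")
    task_dir = Path(root) / task
    if version == "latest":
        # Sort vX.Y.Z filenames numerically, not lexically (v1.10.0 > v1.9.0)
        candidates = sorted(
            task_dir.glob("v*.txt"),
            key=lambda p: tuple(int(n) for n in p.stem[1:].split(".")),
        )
        return candidates[-1].read_text()
    return (task_dir / f"{version}.txt").read_text()
```

Numeric sorting matters here: a plain string sort would rank `v1.9.0` above `v1.10.0`.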
2. A/B Testing Framework
We use a simple feature flag system:
```python
prompt_version = get_prompt_version(user_id, experiment="support-prompt")
prompt = load_prompt(f"customer-support/{prompt_version}")
```
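One common way to implement a `get_prompt_version` helper like the one above (the helper itself is hypothetical, not shown in the answer) is deterministic hash-based bucketing, so a given user always sees the same variant and experiments stay independent of each other:

```python
import hashlib

def get_prompt_version(user_id: str, experiment: str,
                       variants=("v1.1.0", "v2.0.0"),
                       rollout: float = 0.5) -> str:
    """Deterministically assign a user to a prompt variant.

    Hashing user_id together with the experiment name keeps the
    assignment stable across requests and uncorrelated between
    experiments (the variant list and rollout fraction are examples).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return variants[1] if bucket < rollout else variants[0]
```

Because assignment is a pure function of the inputs, you get consistent behavior without storing per-user state, and changing `rollout` lets you ramp the new version gradually.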
3. Regression Testing
We maintain a test suite with expected outputs:
```python
def test_prompt_v2():
    response = llm.complete(prompt_v2, test_inputs)
    assert "refund" in response.lower()
    assert len(response.split()) < 100  # keep replies under 100 words
```
4. Monitoring Metrics
- Response time (p50, p95, p99)
- Token usage (cost tracking)
- User satisfaction scores
- Task completion rate
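These metrics can be captured per call with a thin wrapper around the LLM client. The sketch below is an illustrative in-process collector (the class and its token-usage convention are assumptions, not a specific library):

```python
import time
from statistics import quantiles

class PromptMetrics:
    """Collect per-call latency and token usage for one prompt version."""

    def __init__(self):
        self.latencies_ms = []
        self.tokens = []

    def record(self, fn, *args, **kwargs):
        """Time a call to fn and record its latency (and tokens if exposed)."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        # Assumes the client's result object exposes a numeric .usage field
        usage = getattr(result, "usage", None)
        if usage is not None:
            self.tokens.append(usage)
        return result

    def p(self, pct: int) -> float:
        """Latency percentile in ms, e.g. p(95) for p95."""
        cuts = quantiles(self.latencies_ms, n=100)
        return cuts[pct - 1]
```

In production you would more likely export these to a metrics backend than compute percentiles in-process, but the shape of the data is the same.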
Tools we use:
- LangSmith for prompt versioning and tracing
- Weights & Biases for experiment tracking
- Custom dashboards for production monitoring
The key is treating prompts like code: version control, testing, and gradual rollouts.
Comments
LangSmith looks interesting. Does it work with non-OpenAI models like Anthropic Claude?
From a product perspective, I'd add:
Prompt Registry
Create a centralized prompt registry where teams can discover and reuse prompts. We use a simple YAML format:
```yaml
name: customer-support-v2
version: 2.1.0
model: gpt-4
temperature: 0.7
max_tokens: 150
prompt: |
  You are a helpful customer support agent...
test_cases:
  - input: "I want a refund"
    expected_contains: ["refund policy", "process"]
```
This makes it easy to audit what prompts are in production and roll back if needed.
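A registry entry like this can be validated in CI before deploy. The sketch below assumes the entry has already been parsed into a dict (e.g. with `yaml.safe_load`); the field names match the YAML above, but the runner itself is an illustrative assumption:

```python
REQUIRED_FIELDS = {"name", "version", "model", "prompt"}

def validate_entry(entry: dict) -> dict:
    """Check that a parsed registry entry has the required fields."""
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"registry entry missing fields: {sorted(missing)}")
    return entry

def run_test_cases(entry: dict, complete) -> list:
    """Run each test case through complete(prompt, user_input).

    Returns a list of (input, missing_substring) failures; an empty
    list means every expected_contains check passed.
    """
    failures = []
    for case in entry.get("test_cases", []):
        response = complete(entry["prompt"], case["input"]).lower()
        for needle in case.get("expected_contains", []):
            if needle.lower() not in response:
                failures.append((case["input"], needle))
    return failures
```

Running this against every registry entry on each deploy turns the `test_cases` block into a regression gate, not just documentation.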
Comments
Great answer! One addition: for code chunks, also consider keeping import statements with the functions that use them.