What are the best practices for versioning and testing prompts in production?
I'm working on a production system where we use GPT-4 for various tasks. As we iterate on our prompts, I'm concerned about:
- Version control: How do you track prompt changes over time?
- A/B testing: What's the best way to test prompt variations?
- Regression testing: How do you ensure new prompts don't break existing functionality?
- Monitoring: What metrics should we track for prompt performance?
We currently just have prompts in our codebase, but I feel like we need a more robust system. What tools and processes do successful AI teams use for prompt management?
Comments
This is exactly what I needed! We just started using LLMs and had no idea how to manage prompts properly.
2 Answers
Great question! Here's how we handle prompt management in production:
1. Version Control
We store prompts in a dedicated prompts/ directory with semantic versioning:
```
prompts/
  customer-support/
    v1.0.0.txt
    v1.1.0.txt
    v2.0.0.txt
```
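A minimal loader for this layout might look like the sketch below. It resolves `"task/latest"` by sorting the semver filenames numerically; the function name matches the `load_prompt` call used later in this answer, but this particular implementation is an assumption, not the answerer's actual code:

```python
from pathlib import Path

def load_prompt(path: str, root: str = "prompts") -> str:
    """Load a prompt like 'customer-support/v2.0.0' from the prompts/
    directory; 'task/latest' picks the highest semantic version."""
    task, _, version = path.rpartition("/")
    task_dir = Path(root) / task
    if version == "latest":
        # Sort vX.Y.Z filenames numerically, not lexically (v1.10.0 > v1.9.0)
        candidates = sorted(
            task_dir.glob("v*.txt"),
            key=lambda p: tuple(int(n) for n in p.stem[1:].split(".")),
        )
        return candidates[-1].read_text()
    return (task_dir / f"{version}.txt").read_text()
```

Numeric sorting matters here: a plain string sort would rank `v1.9.0` above `v1.10.0`.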
2. A/B Testing Framework
We use a simple feature flag system:
```python
prompt_version = get_prompt_version(user_id, experiment="support-prompt")
prompt = load_prompt(f"customer-support/{prompt_version}")
```
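One common way to implement a `get_prompt_version` helper like the one above (the helper itself is hypothetical, not shown in the answer) is deterministic hash-based bucketing, so a given user always sees the same variant and experiments stay independent of each other:

```python
import hashlib

def get_prompt_version(user_id: str, experiment: str,
                       variants=("v1.1.0", "v2.0.0"),
                       rollout: float = 0.5) -> str:
    """Deterministically assign a user to a prompt variant.

    Hashing user_id together with the experiment name keeps the
    assignment stable across requests and uncorrelated between
    experiments (the variant list and rollout fraction are examples).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return variants[1] if bucket < rollout else variants[0]
```

Because assignment is a pure function of the inputs, you get consistent behavior without storing per-user state, and changing `rollout` lets you ramp the new version gradually.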
3. Regression Testing
We maintain a test suite with expected outputs:
```python
def test_prompt_v2():
    response = llm.complete(prompt_v2, test_inputs)
    assert "refund" in response.lower()
    assert len(response.split()) < 100  # keep replies under 100 words
```
4. Monitoring Metrics
- Response time (p50, p95, p99)
- Token usage (cost tracking)
- User satisfaction scores
- Task completion rate
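These metrics can be captured per call with a thin wrapper around the LLM client. The sketch below is an illustrative in-process collector (the class and its token-usage convention are assumptions, not a specific library):

```python
import time
from statistics import quantiles

class PromptMetrics:
    """Collect per-call latency and token usage for one prompt version."""

    def __init__(self):
        self.latencies_ms = []
        self.tokens = []

    def record(self, fn, *args, **kwargs):
        """Time a call to fn and record its latency (and tokens if exposed)."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        # Assumes the client's result object exposes a numeric .usage field
        usage = getattr(result, "usage", None)
        if usage is not None:
            self.tokens.append(usage)
        return result

    def p(self, pct: int) -> float:
        """Latency percentile in ms, e.g. p(95) for p95."""
        cuts = quantiles(self.latencies_ms, n=100)
        return cuts[pct - 1]
```

In production you would more likely export these to a metrics backend than compute percentiles in-process, but the shape of the data is the same.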
Tools we use:
- LangSmith for prompt versioning and tracing
- Weights & Biases for experiment tracking
- Custom dashboards for production monitoring
The key is treating prompts like code: version control, testing, and gradual rollouts.
Comments
LangSmith looks interesting. Does it work with non-OpenAI models like Anthropic Claude?
From a product perspective, I'd add:
Prompt Registry
Create a centralized prompt registry where teams can discover and reuse prompts. We use a simple YAML format:
```yaml
name: customer-support-v2
version: 2.1.0
model: gpt-4
temperature: 0.7
max_tokens: 150
prompt: |
  You are a helpful customer support agent...
test_cases:
  - input: "I want a refund"
    expected_contains: ["refund policy", "process"]
```
This makes it easy to audit what prompts are in production and roll back if needed.
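A registry entry like this can be validated in CI before deploy. The sketch below assumes the entry has already been parsed into a dict (e.g. with `yaml.safe_load`); the field names match the YAML above, but the runner itself is an illustrative assumption:

```python
REQUIRED_FIELDS = {"name", "version", "model", "prompt"}

def validate_entry(entry: dict) -> dict:
    """Check that a parsed registry entry has the required fields."""
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"registry entry missing fields: {sorted(missing)}")
    return entry

def run_test_cases(entry: dict, complete) -> list:
    """Run each test case through complete(prompt, user_input).

    Returns a list of (input, missing_substring) failures; an empty
    list means every expected_contains check passed.
    """
    failures = []
    for case in entry.get("test_cases", []):
        response = complete(entry["prompt"], case["input"]).lower()
        for needle in case.get("expected_contains", []):
            if needle.lower() not in response:
                failures.append((case["input"], needle))
    return failures
```

Running this against every registry entry on each deploy turns the `test_cases` block into a regression gate, not just documentation.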
Comments
Great answer! One addition: for code chunks, also consider keeping import statements with the functions that use them.