What are the best practices for versioning and testing prompts in production?

Asked about 2 months ago · Viewed 310 times
18

I'm working on a production system where we use GPT-4 for various tasks. As we iterate on our prompts, I'm concerned about:

  1. Version control: How do you track prompt changes over time?
  2. A/B testing: What's the best way to test prompt variations?
  3. Regression testing: How do you ensure new prompts don't break existing functionality?
  4. Monitoring: What metrics should we track for prompt performance?

We currently just have prompts in our codebase, but I feel like we need a more robust system. What tools and processes do successful AI teams use for prompt management?

asked about 2 months ago

Comments

This is exactly what I needed! We just started using LLMs and had no idea how to manage prompts properly.

Sophie Anderson · 420 · about 2 months ago

2 Answers

3

Great question! Here's how we handle prompt management in production:

1. Version Control

We store prompts in a dedicated prompts/ directory with semantic versioning:

prompts/
  customer-support/
    v1.0.0.txt
    v1.1.0.txt
    v2.0.0.txt

2. A/B Testing Framework

We use a simple feature flag system:

prompt_version = get_prompt_version(user_id, experiment="support-prompt")
prompt = load_prompt(f"customer-support/{prompt_version}")
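
One common way to implement a `get_prompt_version` like the one above is deterministic hash bucketing, so a given user always lands in the same variant across requests. A sketch, with a hypothetical `EXPERIMENTS` table and hard-coded traffic shares purely for illustration:

```python
import hashlib

# Hypothetical experiment config: list of (variant, share of traffic)
EXPERIMENTS = {
    "support-prompt": [("v1.1.0", 0.9), ("v2.0.0", 0.1)],
}

def get_prompt_version(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to a prompt variant."""
    # Hash user_id together with the experiment name so assignments
    # are stable per experiment but independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform in [0, 1)
    cumulative = 0.0
    for version, share in EXPERIMENTS[experiment]:
        cumulative += share
        if bucket < cumulative:
            return version
    return EXPERIMENTS[experiment][-1][0]
```

Because assignment is a pure function of the inputs, no per-user state needs to be stored, and ramping a variant up or down is just a config change.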

3. Regression Testing

We maintain a test suite with expected outputs:

def test_prompt_v2():
    response = llm.complete(prompt_v2, test_inputs)
    # The reply should mention refunds and stay concise
    assert "refund" in response.lower()
    assert len(response.split()) < 100  # keep replies under ~100 words

4. Monitoring Metrics

  • Response time (p50, p95, p99)
  • Token usage (cost tracking)
  • User satisfaction scores
  • Task completion rate
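
For the latency percentiles above, a tracker can be sketched with just the standard library (in practice most teams export these to Prometheus, Datadog, or similar rather than computing them in-process; the class name here is illustrative):

```python
import statistics

class LatencyTracker:
    """Collect response times and report p50/p95/p99."""

    def __init__(self):
        self.samples: list[float] = []

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def percentiles(self) -> dict:
        # quantiles(n=100) returns the 99 cut points between percentile buckets
        q = statistics.quantiles(self.samples, n=100)
        return {"p50": q[49], "p95": q[94], "p99": q[98]}
```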

Tools we use:

  • LangSmith for prompt versioning and tracing
  • Weights & Biases for experiment tracking
  • Custom dashboards for production monitoring

The key is treating prompts like code: version control, testing, and gradual rollouts.

answered about 2 months ago

Comments

LangSmith looks interesting. Does it work with non-OpenAI models like Anthropic Claude?

Raj Patel · 1650 · about 2 months ago

50

From a product perspective, I'd add:

Prompt Registry

Create a centralized prompt registry where teams can discover and reuse prompts. We use a simple YAML format:

name: customer-support-v2
version: 2.1.0
model: gpt-4
temperature: 0.7
max_tokens: 150
prompt: |
  You are a helpful customer support agent...
test_cases:
  - input: "I want a refund"
    expected_contains: ["refund policy", "process"]

This makes it easy to audit what prompts are in production and roll back if needed.
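
Entries in a registry like this can be validated at load time. A sketch of such a check, operating on the already-parsed entry (YAML parsing itself, e.g. via PyYAML's `safe_load`, is assumed; the field names follow the example above, and the required-field set is a choice, not a standard):

```python
# Fields we choose to require in every registry entry (illustrative)
REQUIRED_FIELDS = {"name", "version", "model", "prompt"}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems with a registry entry (empty list = valid)."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - entry.keys()]
    # Semantic version check: three dot-separated integers
    version = entry.get("version", "")
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        errors.append(f"version is not semver: {version!r}")
    for i, case in enumerate(entry.get("test_cases", [])):
        if "input" not in case:
            errors.append(f"test_cases[{i}] has no input")
    return errors
```

Running this check in CI on every registry change catches malformed entries before they reach production.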

answered about 2 months ago

Comments

Great answer! One addition: for code chunks, also consider keeping import statements with the functions that use them.

Alex Rodriguez · 1920 · about 2 months ago

