How to implement RAG (Retrieval-Augmented Generation) with custom embeddings?
I want to build a RAG system for our internal documentation, but I'm confused about the embedding strategy.
Current setup:
- 500+ markdown documentation files
- Using OpenAI's text-embedding-3-small
- Storing in Pinecone vector database
Questions:
- Should I fine-tune embeddings on our domain-specific content?
- What chunk size works best for technical documentation?
- How do I handle code snippets vs prose differently?
- What's the best way to re-rank retrieved chunks before sending to LLM?
I've seen some teams use hybrid search (keyword + semantic). Is that worth the added complexity?
Comments
Have you considered using late chunking? It can improve retrieval quality significantly.
1 Answer
I've built several RAG systems. Here's what works:
**1. Chunk Size**

For technical docs, I recommend:
- Prose: 512-1024 tokens with 128 token overlap
- Code: Keep functions/classes intact (don't split mid-function)
- Tables: Treat as atomic units
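The prose rule above can be sketched as a simple sliding window. This sketch uses whitespace-separated words as a rough proxy for tokens (swap in a real tokenizer such as tiktoken in production); the function name and defaults are illustrative, not from the answer:

```python
def chunk_text(text, max_tokens=512, overlap=128):
    """Split prose into overlapping windows.

    Words stand in for tokens here; use a real tokenizer for
    accurate budgets. Consecutive chunks share `overlap` words.
    """
    words = text.split()
    if not words:
        return []
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covers the tail
    return chunks
```

Code and tables would bypass this function entirely and be emitted as single atomic chunks, per the rules above.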
**2. Hybrid Search**

Yes, it's worth it! Combine:
- Semantic search: For conceptual queries ("how to handle errors")
- Keyword search: For exact matches (function names, error codes)
```python
# Retrieve generously from both indexes, then re-rank down to a short list
semantic_results = vector_db.search(query_embedding, top_k=20)
keyword_results = bm25.search(query, top_k=20)
combined = rerank(semantic_results + keyword_results, top_k=5)
```
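The `rerank` call above is left abstract. One common, training-free way to merge the two result lists is reciprocal rank fusion (RRF); here's a minimal sketch, assuming each list is an ordered sequence of document IDs (the function name and the conventional `k=60` constant are my own choices, not from the answer):

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60, top_k=5):
    """Reciprocal rank fusion: each doc scores sum(1 / (k + rank))
    over every list it appears in; higher is better."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first, and keep the top_k
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

RRF needs no score normalization, which is handy because BM25 scores and cosine similarities live on different scales.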
**3. Re-ranking**

Use a cross-encoder for re-ranking:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
# Keep the highest-scoring chunks as LLM context
top_docs = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)[:5]]
```
**4. Fine-tuning Embeddings**

Only if you have >10k domain-specific examples. Otherwise, the generic embeddings work surprisingly well.
Pro tip: Add metadata filters (document type, date, author) to improve retrieval precision.
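To make that pro tip concrete, here's a hedged sketch of building a Pinecone-style metadata filter (Mongo-like operators such as `$eq` and `$gte`). The field names `doc_type`, `updated_at`, and `author` are illustrative; they must match whatever metadata you attached at upsert time:

```python
def build_filter(doc_type=None, updated_after=None, author=None):
    """Assemble a metadata filter dict in Pinecone's Mongo-like syntax.

    All field names are hypothetical examples -- use the keys you
    actually stored alongside your vectors.
    """
    clauses = {}
    if doc_type:
        clauses["doc_type"] = {"$eq": doc_type}
    if updated_after:
        clauses["updated_at"] = {"$gte": updated_after}
    if author:
        clauses["author"] = {"$eq": author}
    return clauses

# Passed at query time, e.g.:
# index.query(vector=query_embedding, top_k=5,
#             filter=build_filter(doc_type="api-reference"))
```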
Comments
The verification step is brilliant! Have you open-sourced this pattern anywhere?