How to implement RAG (Retrieval-Augmented Generation) with custom embeddings?
I want to build a RAG system for our internal documentation, but I'm confused about the embedding strategy.
Current setup:
- 500+ markdown documentation files
- Using OpenAI's text-embedding-3-small
- Storing in Pinecone vector database
Questions:
- Should I fine-tune embeddings on our domain-specific content?
- What chunk size works best for technical documentation?
- How do I handle code snippets vs prose differently?
- What's the best way to re-rank retrieved chunks before sending to LLM?
I've seen some teams use hybrid search (keyword + semantic). Is that worth the added complexity?
Comments
Have you considered using late chunking? It can improve retrieval quality significantly.
1 Answer
I've built several RAG systems. Here's what works:
**1. Chunk Size**

For technical docs, I recommend:
- Prose: 512-1024 tokens with 128 token overlap
- Code: Keep functions/classes intact (don't split mid-function)
- Tables: Treat as atomic units
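The prose rule above can be sketched as a simple sliding window. This sketch uses whitespace-separated words as a rough proxy for tokens (swap in a real tokenizer such as tiktoken in production); the function name and defaults are illustrative, not from the answer:

```python
def chunk_text(text, max_tokens=512, overlap=128):
    """Split prose into overlapping windows.

    Words stand in for tokens here; use a real tokenizer for
    accurate budgets. Consecutive chunks share `overlap` words.
    """
    words = text.split()
    if not words:
        return []
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covers the tail
    return chunks
```

Code and tables would bypass this function entirely and be emitted as single atomic chunks, per the rules above.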
**2. Hybrid Search**

Yes, it's worth it! Combine:
- Semantic search: For conceptual queries ("how to handle errors")
- Keyword search: For exact matches (function names, error codes)
```python
# Retrieve generously from both indexes, then re-rank down to a short list
semantic_results = vector_db.search(query_embedding, top_k=20)
keyword_results = bm25.search(query, top_k=20)
combined = rerank(semantic_results + keyword_results, top_k=5)
```
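The `rerank` call above is left abstract. One common, training-free way to merge the two result lists is reciprocal rank fusion (RRF); here's a minimal sketch, assuming each list is an ordered sequence of document IDs (the function name and the conventional `k=60` constant are my own choices, not from the answer):

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60, top_k=5):
    """Reciprocal rank fusion: each doc scores sum(1 / (k + rank))
    over every list it appears in; higher is better."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first, and keep the top_k
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

RRF needs no score normalization, which is handy because BM25 scores and cosine similarities live on different scales.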
**3. Re-ranking**

Use a cross-encoder for re-ranking:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
# Keep the highest-scoring chunks as LLM context
top_docs = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)[:5]]
```
**4. Fine-tuning Embeddings**

Only if you have >10k domain-specific examples. Otherwise, the generic embeddings work surprisingly well.
Pro tip: Add metadata filters (document type, date, author) to improve retrieval precision.
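To make that pro tip concrete, here's a hedged sketch of building a Pinecone-style metadata filter (Mongo-like operators such as `$eq` and `$gte`). The field names `doc_type`, `updated_at`, and `author` are illustrative; they must match whatever metadata you attached at upsert time:

```python
def build_filter(doc_type=None, updated_after=None, author=None):
    """Assemble a metadata filter dict in Pinecone's Mongo-like syntax.

    All field names are hypothetical examples -- use the keys you
    actually stored alongside your vectors.
    """
    clauses = {}
    if doc_type:
        clauses["doc_type"] = {"$eq": doc_type}
    if updated_after:
        clauses["updated_at"] = {"$gte": updated_after}
    if author:
        clauses["author"] = {"$eq": author}
    return clauses

# Passed at query time, e.g.:
# index.query(vector=query_embedding, top_k=5,
#             filter=build_filter(doc_type="api-reference"))
```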
Comments
The verification step is brilliant! Have you open-sourced this pattern anywhere?