Everyone demos RAG with clean PDFs. Nobody talks about the ugly parts.
The problem with toy examples
Most RAG tutorials go: chunk your docs → embed them → store in a vector DB → retrieve top-k → feed to LLM. Works great on Wikipedia articles. Falls apart in production.
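For concreteness, here is that tutorial baseline in a few lines. This is a sketch, not a real system: `embed` is a toy bag-of-words stand-in for an actual embedding model, and the "vector DB" is just a Python list.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    # Rank stored chunks by cosine similarity to the query, keep top-k.
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# chunk -> embed -> store
docs = ["cats are mammals", "dogs are mammals", "python is a language"]
store = [(d, embed(d)) for d in docs]
# retrieve top-k -> feed to LLM (the LLM call is omitted here)
print(retrieve("what language is python", store, k=1))
```

Every failure below is a place where one of these boxes hides real complexity.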
Here's what actually went wrong when I built one:
1. Chunk size matters more than anything
I started with fixed 512-token chunks, and retrieval was garbage. The issue: context in my documents spanned paragraph boundaries, so splitting mid-thought left the retrieved chunk meaningless without what came before it.
Switched to semantic chunking (splitting on paragraph boundaries rather than token count) and retrieval quality jumped immediately.
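A minimal version of that paragraph-boundary chunker, assuming paragraphs are separated by blank lines and using a crude whitespace token count (a real tokenizer would be more accurate):

```python
import re

def semantic_chunks(doc: str, max_tokens: int = 512) -> list[str]:
    # Split on blank lines (paragraph boundaries) instead of a fixed
    # token count, merging adjacent paragraphs until the budget is hit.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        n = len(para.split())  # crude whitespace token count
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs never get split internally; the budget only decides how many whole paragraphs travel together in one chunk.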
2. Embedding staleness
When users updated their data, the embeddings weren't being regenerated, so queries returned results based on stale state. I had to build an explicit invalidation layer: whenever a record updates, queue a re-embed job.
3. Retrieval ≠ relevance
Top-k cosine similarity returns the most similar chunks, not the most relevant ones. Sometimes the top 3 chunks all say the same thing from different angles. I added a re-ranking step using a cross-encoder. It's slower, but the LLM answers got dramatically better.
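The re-ranking step itself is tiny; all the work lives in the scoring function. In the sketch below, `score_fn` stands in for a cross-encoder that scores each (query, passage) pair jointly (e.g. `CrossEncoder.predict` from sentence-transformers); the `word_overlap` toy score is only there so the example runs without a model.

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 3) -> list[str]:
    # Re-score retrieved candidates with a pairwise model and keep the
    # best top_n. Slower than pure vector search, since score_fn runs
    # once per candidate, but the ordering is much better.
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def word_overlap(query: str, passage: str) -> int:
    # Toy stand-in score so the sketch runs without a model.
    return len(set(query.lower().split()) & set(passage.lower().split()))
```

Because `score_fn` sees query and passage together, it can penalize redundant chunks that cosine similarity ranks identically.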
What I'd do differently
- Start with hybrid search (BM25 + vectors) from day one
- Build an eval harness before you build the pipeline
- Log every retrieval with the final answer — you'll spot patterns fast
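On the hybrid-search point: one common way to combine a BM25 ranking with a vector ranking, without tuning score weights, is Reciprocal Rank Fusion. A minimal sketch, assuming you already have the two ranked ID lists:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank + 1)
    # per document; documents ranked well by both lists rise to the top.
    # k=60 is the value from the original RRF paper, not a tuned choice.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_ranking, vector_ranking])
```

It only needs ranks, not scores, so BM25 and cosine similarity never have to be put on a comparable scale.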
Still a work in progress. But it's in production and it works.