Everyone demos RAG with clean PDFs. Nobody talks about the ugly parts.
The problem with toy examples
Most RAG tutorials go: chunk your docs → embed them → store in a vector DB → retrieve top-k → feed to LLM. Works great on Wikipedia articles. Falls apart in production.
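For concreteness, here is that tutorial baseline in a few lines. This is a sketch, not a real system: `embed` is a toy bag-of-words stand-in for an actual embedding model, and the "vector DB" is just a Python list.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    # Rank stored chunks by cosine similarity to the query, keep top-k.
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# chunk -> embed -> store
docs = ["cats are mammals", "dogs are mammals", "python is a language"]
store = [(d, embed(d)) for d in docs]
# retrieve top-k -> feed to LLM (the LLM call is omitted here)
print(retrieve("what language is python", store, k=1))
```

Every failure below is a place where one of these boxes hides real complexity.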
Here's what actually went wrong when I built one:
1. Chunk size matters more than anything
I started with fixed 512-token chunks, and retrieval was garbage. The issue: context in my documents spanned paragraph boundaries, so splitting mid-thought left the retrieved chunk meaningless without what came before it.
Switched to semantic chunking (splitting on paragraph boundaries rather than token count) and retrieval quality jumped immediately.
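A minimal version of that paragraph-boundary chunker, assuming paragraphs are separated by blank lines and using a crude whitespace token count (a real tokenizer would be more accurate):

```python
import re

def semantic_chunks(doc: str, max_tokens: int = 512) -> list[str]:
    # Split on blank lines (paragraph boundaries) instead of a fixed
    # token count, merging adjacent paragraphs until the budget is hit.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        n = len(para.split())  # crude whitespace token count
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs never get split internally; the budget only decides how many whole paragraphs travel together in one chunk.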
2. Embedding staleness
When users updated their data, the embeddings weren't being regenerated, so queries returned results based on stale state. I had to build an explicit invalidation layer: whenever a record updates, queue a re-embed job.
3. Retrieval ≠ relevance
Top-k cosine similarity returns the most similar chunks, not the most relevant ones. Sometimes the top 3 chunks all say the same thing from different angles. I added a re-ranking step using a cross-encoder. It's slower, but the LLM answers got dramatically better.
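The re-ranking step itself is tiny; all the work lives in the scoring function. In the sketch below, `score_fn` stands in for a cross-encoder that scores each (query, passage) pair jointly (e.g. `CrossEncoder.predict` from sentence-transformers); the `word_overlap` toy score is only there so the example runs without a model.

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 3) -> list[str]:
    # Re-score retrieved candidates with a pairwise model and keep the
    # best top_n. Slower than pure vector search, since score_fn runs
    # once per candidate, but the ordering is much better.
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def word_overlap(query: str, passage: str) -> int:
    # Toy stand-in score so the sketch runs without a model.
    return len(set(query.lower().split()) & set(passage.lower().split()))
```

Because `score_fn` sees query and passage together, it can penalize redundant chunks that cosine similarity ranks identically.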
What I'd do differently
- Start with hybrid search (BM25 + vectors) from day one
- Build an eval harness before you build the pipeline
- Log every retrieval with the final answer — you'll spot patterns fast
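On the hybrid-search point: one common way to combine a BM25 ranking with a vector ranking, without tuning score weights, is Reciprocal Rank Fusion. A minimal sketch, assuming you already have the two ranked ID lists:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank + 1)
    # per document; documents ranked well by both lists rise to the top.
    # k=60 is the value from the original RRF paper, not a tuned choice.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_ranking, vector_ranking])
```

It only needs ranks, not scores, so BM25 and cosine similarity never have to be put on a comparable scale.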
Still a work in progress. But it's in production and it works.