
The Case for Reranking in Production RAG

Retrieval gets you candidates. Reranking gets you the right candidates. The distinction matters more than most teams realize when building production RAG systems.

The standard pipeline retrieves the top-k chunks by vector similarity, stuffs them into the context window, and hopes the LLM figures out which ones matter. This works until it doesn't. When retrieval returns marginally relevant chunks alongside highly relevant ones, the model's output quality degrades in ways that are hard to debug.

The Retrieval-Reranking Split

Retrieval and reranking solve different problems with different tools. Retrieval needs to be fast. It searches across thousands or millions of documents and must return results in milliseconds. This speed requirement forces architectural compromises. Vector similarity with approximate nearest neighbor search is fast but imprecise.

Reranking operates on a much smaller set—typically 20 to 50 candidates from the retrieval stage. With fewer items to process, reranking can use more expensive models that directly compare the query against each candidate. Cross-encoders process the query and document together, capturing interactions that bi-encoders miss.

A bi-encoder embeds the query and document separately, then compares the embeddings. It can't see how specific words in the query relate to specific words in the document. A cross-encoder processes both together, attending across the full query-document pair. This architectural difference translates directly to ranking quality.
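To make the interface difference concrete, here is a minimal sketch. The scoring functions are toy stand-ins, not real models: in practice the bi-encoder side is an embedding model and the cross-encoder is a trained pairwise transformer (the sentence-transformers family is a common source for both). What matters is the shape of the two interfaces.

```python
# Toy sketch of the two scoring interfaces. These are stand-ins, not
# real models; only the function signatures mirror production systems.

def bi_encoder_score(query_vec, doc_vec):
    # Bi-encoder: the document vector was computed offline, without
    # seeing the query. Scoring is a cheap vector comparison.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def cross_encoder_score(query, doc):
    # Cross-encoder: the model receives the query and document
    # together at query time. This toy version scores token overlap
    # as a crude stand-in for cross-attention over the pair.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)
```

The practical consequence: bi-encoder document vectors can be indexed ahead of time, which is what makes millisecond retrieval possible, while cross-encoder scores can only be computed per query, which is why they are reserved for the small candidate set.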

What Reranking Catches

Consider a query about "termination clauses in the 2024 vendor agreement." Retrieval might return chunks from multiple contracts, all discussing termination. Some are from the right document, some aren't. Vector similarity treats them as roughly equivalent—they're all semantically about termination clauses.

A cross-encoder reranker sees the full context. It recognizes that "2024 vendor agreement" is a specific reference and boosts chunks from that document. It demotes chunks that discuss termination generally but don't match the specific agreement. The reranker understands the query intent in a way that embedding similarity cannot.

Reranking also helps with diversity. If retrieval returns five chunks from the same section of one document, a reranker can identify redundancy and promote chunks from other relevant sections. This prevents the context window from being dominated by repetitive content.
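One common way to enforce that diversity is Maximal Marginal Relevance (MMR), applied as a post-processing step over reranker scores: greedily select candidates that score well against the query but overlap little with what has already been selected. A minimal sketch, assuming a `similarity(a, b)` helper (not defined here) that returns a 0-1 redundancy measure between two chunks:

```python
def mmr_select(candidates, scores, similarity, k=5, lam=0.7):
    # Greedy MMR: each round, pick the candidate with the best blend
    # of query relevance (scores[i]) and novelty relative to the
    # chunks already selected. lam=1.0 reduces to pure relevance.
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_i, best_val = None, float("-inf")
        for i in remaining:
            redundancy = max(
                (similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            val = lam * scores[i] - (1 - lam) * redundancy
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
        remaining.remove(best_i)
    return [candidates[i] for i in selected]
```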

The Latency Tradeoff

Reranking adds latency. A cross-encoder runs a full model forward pass for every query-document pair, sequentially or in small batches, so cost scales with the candidate count. For 30 candidates, this might add 200-500ms to the response time. In interactive applications, this latency is noticeable.

The tradeoff is usually worth it. A 300ms increase in latency that improves answer relevance by 15-20% is a good trade for most applications. Users notice wrong answers more than they notice slightly slower answers.

The key is keeping the candidate set small enough that reranking stays fast. Retrieve 30-50 candidates, rerank them, and pass the top 5-10 to the LLM. The retrieval stage casts a wide net; the reranker filters it down to what actually matters.
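The two-stage shape can be sketched as a single function. Here `vector_store.search`, `reranker.score`, and `llm.generate` are assumed interfaces for illustration, not any specific library's API:

```python
def answer(query, vector_store, reranker, llm, retrieve_k=50, rerank_k=8):
    # Stage 1: wide net. Fast ANN search over the whole collection.
    candidates = vector_store.search(query, top_k=retrieve_k)
    # Stage 2: narrow filter. Expensive pairwise scoring, but only
    # over a few dozen candidates.
    scores = reranker.score(query, candidates)
    ranked = [c for _, c in sorted(zip(scores, candidates),
                                   key=lambda pair: pair[0], reverse=True)]
    # Only the survivors reach the context window.
    return llm.generate(query, ranked[:rerank_k])
```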

Implementation Considerations

Production reranking requires attention to failure modes. If the reranker service is unavailable, the system should fall back to retrieval-only results rather than failing entirely. If reranking latency spikes, the system needs timeouts that preserve responsiveness.
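A minimal sketch of that degradation path, again assuming a `reranker.score` interface: run reranking under a hard timeout and fall back to the retrieval order on any failure.

```python
import concurrent.futures

def rerank_with_fallback(query, candidates, reranker, timeout_s=0.4):
    # Run reranking under a hard timeout. On timeout or error, return
    # the candidates in their original retrieval order: a degraded
    # answer beats no answer.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(reranker.score, query, candidates)
        scores = future.result(timeout=timeout_s)
        order = sorted(range(len(candidates)),
                       key=lambda i: scores[i], reverse=True)
        return [candidates[i] for i in order]
    except Exception:
        return list(candidates)
    finally:
        executor.shutdown(wait=False)
```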

Reranker models also need periodic evaluation. As document collections grow and query patterns shift, a reranker trained on old data may drift in effectiveness. Monitoring reranker performance—not just overall system performance—catches degradation early.

The investment in reranking infrastructure pays off in answer quality. For systems where accuracy matters, reranking isn't optional. It's the difference between a retrieval pipeline and a production-grade RAG system.
