Hybrid Retrieval vs Pure Embeddings
Vector embeddings transformed document retrieval. Instead of matching keywords, we could match meaning. A query about "company earnings" could find documents discussing "financial performance" or "quarterly results" without those exact words appearing. Semantic search felt like magic.
Then teams deployed it to production and discovered the edge cases.
Where Embeddings Fall Short
Embeddings encode semantic similarity, not lexical precision. When a user searches for invoice INV-2024-0847, they want that specific invoice. Vector search might return chunks containing similar invoice numbers, or documents that discuss invoices generally. The embedding captures "this is about invoices" but loses the specific identifier.
The same problem appears with names, dates, acronyms, and technical terms. A search for "HIPAA compliance" might surface documents about "healthcare regulations" that never mention HIPAA specifically. Semantically related, but not what the user needed.
Embeddings also struggle with negation and precise relationships. "Contracts that do not include arbitration clauses" is a specific query that vector similarity handles poorly. The embedding for this query will be similar to documents that do include arbitration clauses, because both are about arbitration.
The Keyword Baseline
Traditional full-text search solved these problems decades ago. BM25 and TF-IDF rank documents by term frequency and specificity. If you search for a specific invoice number, keyword search finds it. If you search for an exact phrase, keyword search matches it.
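To make the ranking concrete, here is a minimal BM25 scorer over tokenized documents. This is a from-scratch sketch for illustration, not a production index (real systems use Lucene-style inverted indexes); the parameter defaults k1=1.5 and b=0.75 are common conventions.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against the query terms with BM25.

    Rare terms get high IDF weight, so an exact identifier like an
    invoice number dominates the ranking of any document containing it.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency saturates via k1; b normalizes for doc length.
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    ["invoice", "inv-2024-0847", "paid"],
    ["invoice", "processing", "policy"],
    ["termination", "policy"],
]
scores = bm25_scores(["inv-2024-0847"], docs)
```

Only the first document contains the exact identifier, so it is the only one with a nonzero score; vector search offers no such guarantee.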
Keyword search fails at semantic understanding. A search for "termination policy" won't find documents that only use "ending employment" or "separation procedures." Users have to guess the exact terminology used in the documents.
Neither approach is sufficient alone. Both are necessary.
Reciprocal Rank Fusion
Hybrid retrieval runs both searches in parallel and combines the results. The standard approach uses Reciprocal Rank Fusion (RRF), which scores documents based on their rank in each result set rather than their raw scores.
A document that appears in position 2 of the vector results and position 5 of the keyword results gets a combined score based on both positions: each result list contributes 1/(k + rank), with the constant k typically set to 60. Documents that appear in only one result set still contribute, but documents that appear in both get a boost.
RRF handles the score normalization problem elegantly. Vector similarity scores and BM25 scores aren't directly comparable—their ranges and distributions differ. By converting to ranks first, RRF sidesteps the calibration issue entirely.
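The fusion itself is a few lines of code. The sketch below implements standard RRF with the conventional constant k=60 and replays the example above: a document at rank 2 in one list and rank 5 in the other outranks documents that appear in only one list.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) across ranked lists.

    Ranks are 1-based; documents absent from a list simply contribute
    nothing for it. Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_b", "doc_a", "doc_c"]                      # doc_a at rank 2
keyword_results = ["doc_d", "doc_e", "doc_f", "doc_g", "doc_a"]   # doc_a at rank 5
fused = rrf([vector_results, keyword_results])
```

Note that only ranks enter the computation, never the raw similarity or BM25 scores, which is exactly how RRF sidesteps calibration.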
Implementation Reality
Running hybrid search requires maintaining two indexes: a vector store for embeddings and a full-text search index. This adds operational complexity. You're syncing documents to two systems, managing two query paths, and combining results before passing them to the LLM.
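A sketch of that query path, assuming hypothetical `vector_index` and `text_index` clients that each expose a `search(query, limit)` method (stand-ins for whatever vector store and full-text engine you run):

```python
def hybrid_search(query, vector_index, text_index, top_k=10, k=60):
    """Fan the query out to both indexes, then fuse the ranked ids with RRF.

    `vector_index` and `text_index` are hypothetical clients; swap in your
    actual vector store and full-text search interfaces.
    """
    vec_ids = vector_index.search(query, limit=top_k * 2)  # semantic matches
    kw_ids = text_index.search(query, limit=top_k * 2)     # lexical matches
    scores = {}
    for ranking in (vec_ids, kw_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Over-fetching each list (here, 2× the final top_k) gives the fusion step more overlap to work with before truncating to the results passed to the LLM.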
The complexity is worth it. In production systems with real user queries, hybrid retrieval consistently outperforms pure vector search. The improvement is especially noticeable for queries involving specific identifiers, exact phrases, or technical terminology.
The architecture decision is straightforward. If your documents contain specific terms that users will search for exactly, you need hybrid retrieval. If your users ask questions using different vocabulary than your documents, you need hybrid retrieval. In practice, this means almost every production system needs both.