The semantic search hype cycle
When embedding-based search became accessible, the narrative was simple: semantic search understands meaning, keyword search matches strings. Therefore semantic search is always better. Like most simple narratives, this one is wrong.
Semantic search excels at understanding intent, handling paraphrases, and surfacing conceptually related content. But it also has failure modes that keyword search handles effortlessly — and pretending otherwise leads to search systems that are impressive in demos and frustrating in production.
Where semantic search wins
Conceptual queries
When a user searches for "how to reduce cloud spending" and the relevant document talks about "cost optimization strategies for AWS infrastructure," semantic search connects these concepts even though they share almost no keywords. This is the core promise, and it delivers.
Natural language questions
Users increasingly search by asking questions rather than typing keywords. Semantic search handles questions naturally because the embedding space captures semantic relationships between questions and answers.
Cross-lingual retrieval
Multilingual embedding models can match a query in one language with documents in another. This is transformative for organizations with multilingual content.
Where keyword search wins
Exact terms and identifiers
Search for "ERR-4021" or "invoice INV-2024-0847" and semantic search might return documents about errors or invoices in general. Keyword search finds the exact match instantly. For searches involving product codes, error messages, API endpoints, or any specific identifier, keyword search is vastly more reliable.
Rare or technical terms
Embedding models encode meaning based on training data distribution. Terms that are rare in the training data — niche technical jargon, brand names, newly coined terms — may not be well-represented in the embedding space. Keyword search doesn't care about frequency; it just matches the string.
Known-item search
When a user knows exactly what they're looking for — a specific document, a specific section, a specific piece of data — keyword search is faster and more precise. Semantic search adds latency and can introduce noise by returning conceptually similar but incorrect results.
The hybrid search pattern
The best production search systems combine both approaches. The implementation varies, but the core pattern is consistent:
- Run both a semantic search and a keyword search against the same query
- Normalize the scores from each system to a common scale
- Combine the scores using a weighted formula
- Re-rank the merged results
The weight between semantic and keyword components can be static (a fixed ratio like 70/30) or dynamic (adjusted based on query characteristics — more keyword weight for short queries with specific terms, more semantic weight for natural language questions).
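The steps above can be sketched as a small merge function. This is a minimal illustration, not a production implementation: `hybrid_merge` and `normalize` are hypothetical names, the inputs are assumed to be `{doc_id: score}` dicts from each retriever, and min-max normalization is one of several reasonable choices for putting the two score scales on common footing.

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} dict to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_merge(semantic, keyword, alpha=0.7):
    """Weighted combination: alpha * semantic + (1 - alpha) * keyword.

    Documents missing from one result list contribute 0 for that component.
    alpha=0.7 mirrors the static 70/30 split mentioned above; a dynamic
    system would choose alpha per query instead.
    """
    sem, kw = normalize(semantic), normalize(keyword)
    docs = set(sem) | set(kw)
    combined = {
        d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
        for d in docs
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
```

Note that the raw scales differ wildly (cosine similarities near 1 versus BM25 scores in the tens), which is exactly why the normalization step matters.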
Building effective hybrid search
Reciprocal rank fusion
Reciprocal rank fusion (RRF) is the simplest and most effective score-combination method for most use cases. It requires no score normalization because it operates on rank positions rather than raw scores: given a document's rank in each result list, RRF produces a combined score that balances both signals.
Query classification
A lightweight classifier that examines the query and adjusts the search strategy. Queries containing specific identifiers, code snippets, or error messages get routed to keyword-heavy search. Natural language questions get routed to semantic-heavy search. This adds a small amount of complexity but can significantly improve result quality.
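The "lightweight classifier" can start as nothing more than a few regex heuristics. The sketch below is illustrative: the `IDENTIFIER` pattern, the `choose_weights` name, and the specific thresholds (0.2 / 0.5 / 0.8 semantic weight, six-word cutoff) are all assumptions to be tuned against real query logs, not established values.

```python
import re

# Heuristic patterns for identifier-like tokens: error codes (ERR-4021),
# document IDs (INV-2024-0847), version strings (v1.2.3). Illustrative only.
IDENTIFIER = re.compile(r"[A-Z]{2,}-\d+|\bv?\d+\.\d+\.\d+\b")

def choose_weights(query: str) -> float:
    """Return the semantic weight alpha in [0, 1] based on query shape."""
    if IDENTIFIER.search(query):
        return 0.2  # identifier present: lean on keyword matching
    if query.rstrip().endswith("?") or len(query.split()) >= 6:
        return 0.8  # natural-language question: lean semantic
    return 0.5      # default: balanced
```

A trained classifier can replace the heuristics later; the routing interface (query in, weight out) stays the same.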
Re-ranking
After merging results from both methods, a cross-encoder re-ranker examines each (query, document) pair and produces a more accurate relevance score. This is the most computationally expensive step but often produces the biggest quality improvement. Modern re-ranking models are small and fast enough to re-rank the top 20–50 results with acceptable latency.
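The re-ranking step reduces to scoring each (query, document) pair and re-sorting. In this sketch, `score_pair` is a stand-in for a real cross-encoder model call (e.g. one loaded through a library such as sentence-transformers); the `token_overlap` scorer included here is a toy for illustration, not a trained model.

```python
def rerank(query, docs, score_pair, top_n=20):
    """Re-rank the top_n merged candidates by pairwise relevance score.

    score_pair(query, doc) -> float is the only external dependency;
    swap in a real cross-encoder for production use.
    """
    scored = [(doc, score_pair(query, doc)) for doc in docs[:top_n]]
    return [doc for doc, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

# Toy scorer: shared-token count. Stands in for a model during testing.
def token_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

Limiting the candidate set via `top_n` is what keeps the expensive pairwise scoring within the 20-50 document budget the text describes.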
Measuring search quality
The most common metrics for evaluating search quality are:
- Recall@K — Of the relevant documents, how many appear in the top K results?
- Precision@K — Of the top K results, how many are relevant?
- MRR (Mean Reciprocal Rank) — On average, how highly ranked is the first relevant result?
- NDCG (Normalized Discounted Cumulative Gain) — A measure that accounts for both graded relevance and position
For RAG systems specifically, the metric that matters most is recall: if the relevant document isn't retrieved, the LLM can't use it. Precision matters less because the LLM can usually ignore irrelevant context — but very low precision wastes tokens and can confuse the model.
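The four metrics above are each a few lines of code. A minimal sketch, assuming `retrieved` is a ranked list of doc IDs, `relevant` is a set of relevant IDs, and `gains` maps doc IDs to graded relevance scores:

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(queries):
    """Mean reciprocal rank over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

def ndcg_at_k(retrieved, gains, k):
    """NDCG with graded relevance; position i is discounted by log2(i + 2)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

For the RAG case, tracking Recall@K at the K you actually pass to the LLM is usually the single most informative number.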
Practical recommendations
If you're building a search system today: start with hybrid search. The additional complexity over pure semantic search is minimal, and the quality improvement on keyword-sensitive queries is substantial. Use BM25 for the keyword component (it's a solved problem), a good embedding model for the semantic component, and reciprocal rank fusion for combining scores. Add a re-ranker if latency permits.
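For reference, BM25 itself fits in a few lines. The sketch below implements the standard formula over pre-tokenized documents; in production you would use an existing engine (Lucene, Elasticsearch, OpenSearch) rather than this, and the parameter defaults k1=1.5, b=0.75 are the conventional starting values.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document in docs_tokens against a tokenized query.

    docs_tokens: list of token lists, one per document.
    Returns a list of BM25 scores aligned with docs_tokens.
    """
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of docs containing each term.
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores
```

Pair these scores with embedding similarities through RRF and the core of a hybrid system is in place.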
This combination handles the vast majority of production search needs without the failure modes that plague either approach in isolation.