Search Engineering

How Fuzzy Search Works: Trigrams, Edit Distance, and When to Use Each

April 10, 2026

Users make typos. They misspell brand names, swap characters, and forget double letters. A search engine that only matches exact tokens will fail these users. Fuzzy search algorithms solve this by finding approximate matches — but each approach has different tradeoffs in speed, accuracy, and implementation complexity.

Levenshtein (edit) distance

Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another. "kitten" to "sitting" has an edit distance of 3.

Pros: Intuitive, well-understood. An edit distance of 1 covers most single-character typos.

Cons: Computing edit distance against every token in a large index is expensive; the classic dynamic-programming algorithm takes O(n*m) time per pair, where n and m are the string lengths. You need data structures like BK-trees or Levenshtein automata to make it practical at scale.

Best for: Small vocabularies, dictionary lookups, or when you can pre-compute automata for common queries.
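To make the BK-tree idea mentioned above concrete, here is a minimal sketch (not a production structure): each edge is labeled with an edit distance, and the triangle inequality lets a search skip most of the tree.

```python
def levenshtein(a: str, b: str) -> int:
    """Compact edit-distance helper for the tree's metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """BK-tree over a metric: children are keyed by distance to the parent,
    so the triangle inequality prunes whole subtrees during search."""
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]          # descend along the existing edge
            else:
                node[1][d] = (word, {})    # attach a new leaf
                return

    def search(self, query, max_dist):
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_dist:
                results.append((word, d))
            # triangle inequality: only edges in [d - max_dist, d + max_dist]
            # can lead to matches, so everything else is skipped
            stack.extend(child for edge, child in children.items()
                         if d - max_dist <= edge <= d + max_dist)
        return results
```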

Trigram similarity

Trigram matching breaks strings into overlapping sequences of 3 characters. "search" becomes ["sea", "ear", "arc", "rch"]. Two strings are similar if they share many trigrams. Similarity is measured using the Jaccard coefficient: the size of the intersection divided by the size of the union.
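The trigram extraction and Jaccard coefficient described above can be sketched in a few lines:

```python
def trigrams(s: str) -> set:
    """Overlapping 3-character windows; strings shorter than 3 yield none."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a: str, b: str) -> float:
    """|intersection| / |union| of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

trigrams("search")             # {'sea', 'ear', 'arc', 'rch'}
jaccard("search", "searches")  # 4 shared trigrams out of 6 total
```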

Pros: Fast to compute using inverted indexes. You can build a trigram index at indexing time and look up candidate matches in O(1) per trigram. Scales well to large corpora.

Cons: Less precise for very short strings (few trigrams to compare). May produce false positives for strings that happen to share substrings but aren't actually relevant to the query.

Best for: General-purpose fuzzy search at scale. This is what LumoSearch uses — a trigram inverted index provides fuzzy recall candidates, then BM25F scoring ranks them by relevance.

Phonetic matching

Algorithms like Soundex, Metaphone, and Double Metaphone encode words by how they sound rather than how they're spelled. "Smith" and "Smyth" produce the same phonetic code.
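As an illustration, here is a compact implementation of classic American Soundex, the simplest of the three (Metaphone's rules are considerably longer):

```python
# Consonant classes: bfpv=1, cgjkqsxz=2, dt=3, l=4, mn=5, r=6
_CODES = {ch: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1) for ch in letters}

def soundex(word: str) -> str:
    """First letter plus three digits of consonant classes.
    h/w are transparent (don't break a run); vowels break runs."""
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = _CODES.get(word[0])
    for ch in word[1:]:
        if ch in "hw":
            continue            # transparent: adjacent same codes still merge
        if ch in "aeiouy":
            prev = None         # vowel separates consonant runs
            continue
        code = _CODES.get(ch)
        if code != prev:        # skip repeated codes in a run
            digits.append(str(code))
            prev = code
    return (first + "".join(digits) + "000")[:4]

soundex("Smith")  # 'S530'
soundex("Smyth")  # 'S530'
```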

Pros: Handles phonetic misspellings that edit distance misses. Great for names and proper nouns.

Cons: Language-specific (English-centric). High false positive rate — many unrelated words sound alike. Not useful for non-name text.

Best for: Name matching, people search, and contexts where phonetic similarity matters more than string similarity.

Combining approaches

Production search engines typically combine multiple strategies:

  1. Exact match — highest boost. The user typed exactly what's in the document.
  2. Prefix match — user is still typing. Match tokens that start with the query term.
  3. Trigram/fuzzy match — recall candidates that are approximately similar, then rank by BM25F score.

LumoSearch implements all three in this priority order. Exact matches get a 2x boost, prefix matches get 1.25x, and trigram candidates are scored by Jaccard similarity combined with BM25F relevance. This layered approach gives you both precision (exact matches rank first) and recall (typos still find results).
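A minimal sketch of that boost schedule follows. The 2x and 1.25x boosts are from the description above; multiplying the BM25F score by the Jaccard similarity for fuzzy candidates is an assumption about how "combined" works, not LumoSearch's actual formula.

```python
def trigrams(s: str) -> set:
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def layered_score(query: str, token: str, bm25f: float) -> float:
    """Exact match > prefix match > fuzzy candidate, in priority order."""
    if token == query:
        return bm25f * 2.0                 # exact match: highest boost
    if token.startswith(query):
        return bm25f * 1.25                # prefix match: user still typing
    return bm25f * jaccard(query, token)   # fuzzy: assumed multiplicative mix
```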

Implementation considerations

If you're building fuzzy search from scratch, consider these factors:

  • Index size: trigram indexes add ~3x the number of entries vs token indexes
  • Query latency: trigram lookup is O(1) per trigram; edit distance is O(n*m) per candidate
  • Minimum query length: fuzzy matching works poorly on 1-2 character queries
  • False positive rate: tune your similarity threshold to balance recall vs precision
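The factors above can be seen working together in a small trigram-index sketch. The `build_index` and `candidates` helpers are hypothetical illustrations, not LumoSearch's API; the `min_len` and `threshold` defaults are arbitrary starting points you would tune.

```python
from collections import defaultdict

def trigrams(s: str) -> set:
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(tokens):
    """Inverted index: trigram -> set of tokens containing it.
    Note the index-size cost: one entry per (token, trigram) pair."""
    index = defaultdict(set)
    for tok in tokens:
        for tri in trigrams(tok):
            index[tri].add(tok)
    return index

def candidates(query, index, min_len=3, threshold=0.3):
    """Hash lookup per query trigram, then Jaccard filter on the pool."""
    if len(query) < min_len:
        return []  # fuzzy matching is unreliable on 1-2 character queries
    qt = trigrams(query)
    pool = set().union(*(index.get(t, set()) for t in qt))
    scored = []
    for tok in pool:
        tt = trigrams(tok)
        sim = len(qt & tt) / len(qt | tt)
        if sim >= threshold:  # threshold trades recall against precision
            scored.append((tok, sim))
    return sorted(scored, key=lambda pair: -pair[1])
```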

Fuzzy search in LumoSearch

LumoSearch builds token, trigram, and prefix indexes automatically when you upload documents. No configuration needed — fuzzy matching works out of the box. See all features.