feat(search): prefix-aware fuzzy match for short tokens vs long words #10005
xnohat wants to merge 2 commits into
`fuzzyMatchWordWithResult` skips whole-word fuzzy comparisons when the
word is longer than the token by more than `MAX_EDIT_DISTANCE`, so a
short typo like "infa" never reaches a long word like "infrastructure"
even though the user clearly meant the same prefix.
Add a second strategy that, for longer words, computes the edit
distance against the word's leading prefix (sized
`token.length + maxDistance`). The whole-word path still handles
typos in similarly-sized words ("infra" ↔ "infa"); the new prefix path
covers the partial-word case ("infa" → "infrastructure").
Returns the matched prefix substring so the existing highlight
plumbing in `NoteFlatTextExp.smartMatch` highlights only the matched
portion of the longer word.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces a prefix-aware fuzzy matching strategy to improve search matching for short tokens with typos against longer words (e.g., matching 'infa' to 'infrastructure'), along with corresponding unit tests. The feedback suggests optimizing performance by adding a first-character match guard to avoid expensive edit distance calculations on unrelated long words, and simplifying a redundant Math.min calculation in the prefix length determination.
    if (normalizedToken.length >= 4 &&
        word.length > normalizedToken.length + maxDistance) {
Performance Optimization: First-Character Match Guard
In a typical document, there are many long words (length >= 7). Without a first-character match guard, Strategy 2 will run the relatively expensive `calculateOptimizedEditDistance` (which allocates arrays and performs DP) on almost every long word in the text for any query token of length >= 4.
Since this is a prefix-aware search, we can assume that the first character of the prefix is typed correctly in the vast majority of cases. Adding a simple check like `word[0] === normalizedToken[0]` will immediately filter out ~96% of non-matching long words, drastically reducing the number of edit distance calculations and preventing search lag on larger notes.
    - if (normalizedToken.length >= 4 &&
    -     word.length > normalizedToken.length + maxDistance) {
    + if (normalizedToken.length >= 4 &&
    +     word[0] === normalizedToken[0] &&
    +     word.length > normalizedToken.length + maxDistance) {
Done in 2d67854 — word[0] === normalizedToken[0] guard added before the prefix substring + edit-distance call.
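To make the effect of the guard concrete, here is a minimal TypeScript sketch that counts how many words would reach the edit-distance call with and without it. The word list, the `isPrefixCandidate` helper, and `maxDistance = 2` are illustrative assumptions, not code from this PR.

```typescript
// Hypothetical helper: mirrors the Strategy 2 entry condition discussed above.
// `useGuard` toggles the first-character check so both variants can be compared.
function isPrefixCandidate(
    token: string,
    word: string,
    maxDistance: number,
    useGuard: boolean
): boolean {
    return token.length >= 4 &&
        (!useGuard || word[0] === token[0]) &&
        word.length > token.length + maxDistance;
}

// Illustrative word list (an assumption, not taken from real notes).
const sampleWords = [
    "infrastructure", "information", "production",
    "deployment", "kubernetes", "monitoring", "infra"
];

const maxDistance = 2; // assumed value, as used in the review thread
const token = "infa";

// Each candidate would trigger one edit-distance computation.
const withoutGuard = sampleWords.filter((w) => isPrefixCandidate(token, w, maxDistance, false));
const withGuard = sampleWords.filter((w) => isPrefixCandidate(token, w, maxDistance, true));

console.log(withoutGuard.length); // 6
console.log(withGuard.length);    // 2
```

On this small list the guard leaves only the two words starting with `i`; the actual saving on real notes depends on the letter distribution of the text.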
    // Require a healthy length difference to avoid double-counting Strategy 1 hits.
    if (normalizedToken.length >= 4 &&
        word.length > normalizedToken.length + maxDistance) {
        const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);
Redundant Math.min Call
Because of the preceding `if` condition (`word.length > normalizedToken.length + maxDistance`), `word.length` is guaranteed to be strictly greater than `normalizedToken.length + maxDistance`. Therefore, `Math.min(word.length, normalizedToken.length + maxDistance)` will always evaluate to `normalizedToken.length + maxDistance`.
We can simplify this to avoid the redundant function call and make the code cleaner.
    - const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);
    + const prefixLen = normalizedToken.length + maxDistance;
Simplified in 2d67854 — prefixLen = normalizedToken.length + maxDistance directly.
Greptile Summary
This PR adds a prefix-aware second strategy to
Confidence Score: 5/5
The change is additive and self-contained: Strategy 2 fires only where Strategy 1 provably cannot (mutually exclusive by the word-length boundary), the returned prefix substring is a literal substring of the original word so downstream highlighting in `NoteFlatTextExp` works correctly, and all 17 tests pass with genuine-typo test cases that exercise the new code path.
Both strategies are mathematically mutually exclusive: when `word.length` satisfies Strategy 2's entry condition it cannot satisfy Strategy 1's, so there is no risk of double-triggering or mis-classifying a match. The prefix extraction and returned substring length are bounded by word length (guaranteed by the strict inequality guard), eliminating out-of-bounds slicing. No logic errors or incorrect behavior were found.
No files require special attention.
    if (normalizedToken.length >= 4 &&
        word.length > normalizedToken.length + maxDistance) {
        const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);
        const prefix = word.substring(0, prefixLen);
        const distance = calculateOptimizedEditDistance(normalizedToken, prefix, maxDistance);
        if (distance <= maxDistance) {
            return originalWord.substring(0, prefixLen);
        }
    }
Strategy 2 widens false-positive surface for semantically unrelated short words
Because Strategy 2 compares the token against the word's leading `tokenLength + maxDistance` prefix, any two words that happen to share a 2-edit prefix will match. Concretely, with `maxDistance = 2`: "word" (4 chars) → prefix of "worldwide" is "worldw" (6 chars), edit distance = 2 → match. "test" → prefix of "testimony" is "testim", edit distance = 2 → match. Neither is a typo of a prefix. You may want an additional constraint, e.g. requiring `token.length / prefixLen >= 0.6`, to reduce spurious matches.
You're right that the prefix strategy widens the candidate set: `test` will match `testimony`'s leading `testim` (edit distance 2, within budget). My take is to leave the constraint as-is and lean on the scoring layer: `SearchResult.computeScore` already applies a per-token fuzzy-match penalty, so prefix-aware hits land below exact and whole-word fuzzy hits in the ranking, and quick search only ever surfaces a handful of top-scored results. Tightening the ratio further would also drop the case this strategy exists for (`infa` → `infrastructure`, ratio 4/6 ≈ 0.67), so I'd rather keep the recall and let scoring de-prioritise the false positives. Happy to revisit if the noise turns out to be visible in practice.
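For reference, the ratio arithmetic in this thread can be checked with a short sketch. It assumes `maxDistance = 2`, a minimum token length of 4 (from the Strategy 2 entry condition), and a prefix length of exactly `token.length + maxDistance`; the `prefixRatio` helper is hypothetical and exists only for illustration.

```typescript
// Hypothetical helper: ratio of token length to the prefix length Strategy 2 compares against.
// Assumes prefixLen is always tokenLength + maxDistance, as in the simplified code above.
function prefixRatio(tokenLength: number, maxDistance: number): number {
    return tokenLength / (tokenLength + maxDistance);
}

const assumedMaxDistance = 2; // assumed value, as used in the review comment
const minTokenLength = 4;     // from the `normalizedToken.length >= 4` condition

// Under these assumptions the ratio depends only on the token length
// and grows as the token gets longer.
for (let len = minTokenLength; len <= 8; len++) {
    console.log(`${len} -> ${prefixRatio(len, assumedMaxDistance).toFixed(2)}`);
}
// 4 -> 0.67
// 5 -> 0.71
// 6 -> 0.75
// 7 -> 0.78
// 8 -> 0.80
```

Under these assumptions the smallest possible ratio is 4/6, so a `>= 0.6` threshold would not reject `word` → `worldw` or `test` → `testim` either, while any threshold above roughly 0.67 would start rejecting 4-character tokens such as `infa`.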
- Add a cheap first-character match guard before Strategy 2 runs the
edit-distance DP. Mistypes that change the leading letter are rare,
and the guard rejects ~96% of unrelated long words for the same query
before the more expensive call. (Suggested in review.)
- Drop the redundant `Math.min` around the prefix length. The preceding
`word.length > token.length + maxDistance` check already guarantees
the min always resolves to `token.length + maxDistance`.
- Fix the prefix-match test: `'instal'` is a verbatim substring of
`'installer'`, so the exact-substring fast-path returned true before
Strategy 2 was ever exercised. Use `'insall'` (a genuine missing-letter
typo of `install`) instead, plus an explicit first-character-guard
rejection case.
- Move the `fuzzyMatchWordWithResult` assertion out of the
`describe('fuzzyMatchWord')` block into its own sibling describe so the
test for the result-returning function is grouped under the function
it actually exercises.
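The test fix described in the third bullet can be illustrated with plain string checks. This is a sketch rather than the PR's test file: `passesFirstCharGuard` is a hypothetical stand-in for the guard, and `'unsall'` is an invented token used only to show a first-character rejection.

```typescript
const candidateWord = "installer";

// 'instal' appears verbatim inside 'installer', so an exact-substring
// fast path would match before any fuzzy strategy runs.
console.log(candidateWord.includes("instal")); // true

// 'insall' (missing 't') is not a substring, so a match can only come
// from the fuzzy strategies.
console.log(candidateWord.includes("insall")); // false

// Hypothetical stand-in for the first-character guard.
function passesFirstCharGuard(token: string, word: string): boolean {
    return word[0] === token[0];
}

console.log(passesFirstCharGuard("insall", candidateWord)); // true
console.log(passesFirstCharGuard("unsall", candidateWord)); // false
```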
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Split out from #9963 as requested in review — this change is independent of
the FTS5 work and stands on its own.
`fuzzyMatchWordWithResult` currently skips fuzzy comparisons whenever the word in the text is more than `MAX_EDIT_DISTANCE` characters longer than the search token. That means a short typo like `infa` never reaches a long word like `infrastructure`, even though the user clearly meant the same prefix.
This PR adds a second strategy: for words that are still longer than the token even after the existing length guard, compute the edit distance against the word's leading prefix (sized `token.length + maxDistance`).
The whole-word path is unchanged, so typos in similarly-sized words (`infra` ↔ `infa`) still match exactly as before; the new prefix path handles the partial-word case (`infa` → `infrastructure`).
The matched substring is returned so the existing highlight plumbing in `NoteFlatTextExp.smartMatch` only marks the matched portion of the longer word.
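The two strategies described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `editDistance` is a plain Levenshtein implementation standing in for `calculateOptimizedEditDistance`, `fuzzyMatchSketch` is a hypothetical name, lower-casing stands in for whatever normalization the real function applies, the exact-substring fast path is omitted, and `MAX_EDIT_DISTANCE = 2` is an assumed value.

```typescript
const MAX_EDIT_DISTANCE = 2; // assumed value; the review thread uses maxDistance = 2

// Plain Levenshtein distance using two rolling rows.
function editDistance(a: string, b: string): number {
    let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr: number[] = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

// Returns the matched portion of `word`, or null when neither strategy matches.
function fuzzyMatchSketch(
    token: string,
    word: string,
    maxDistance: number = MAX_EDIT_DISTANCE
): string | null {
    const t = token.toLowerCase();
    const w = word.toLowerCase();

    // Strategy 1: whole-word comparison for similarly sized words.
    if (Math.abs(w.length - t.length) <= maxDistance &&
        editDistance(t, w) <= maxDistance) {
        return word;
    }

    // Strategy 2: compare against the leading prefix of a much longer word.
    if (t.length >= 4 &&
        w[0] === t[0] &&
        w.length > t.length + maxDistance) {
        const prefixLen = t.length + maxDistance;
        if (editDistance(t, w.substring(0, prefixLen)) <= maxDistance) {
            return word.substring(0, prefixLen);
        }
    }

    return null;
}

console.log(fuzzyMatchSketch("infa", "infra"));          // "infra"
console.log(fuzzyMatchSketch("infa", "infrastructure")); // "infras"
console.log(fuzzyMatchSketch("infa", "production"));     // null
```

Returning the prefix substring rather than the whole word is what lets a caller highlight only `infras` inside `infrastructure`.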
Test plan
against the prefix of a longer word, and unrelated long words still
failing the match. Adds a returned-prefix assertion so highlight
behaviour is locked in.
- `pnpm --filter server test text_utils`: 17/17 pass.
- `pnpm typecheck` clean.