
feat(search): prefix-aware fuzzy match for short tokens vs long words #10005

Open
xnohat wants to merge 2 commits into TriliumNext:main from xnohat:feat/search-prefix-fuzzy

Conversation


@xnohat xnohat commented May 31, 2026


Summary

Split out from #9963 as requested in review — this change is independent of
the FTS5 work and stands on its own.

fuzzyMatchWordWithResult currently skips fuzzy comparisons whenever the
word in the text is more than MAX_EDIT_DISTANCE characters longer than
the search token. That means a short typo like infa never reaches a long
word like infrastructure, even though the user clearly meant the same
prefix.

This PR adds a second strategy: for words that are still longer than the
token even after the existing length guard, compute the edit distance
against the word's leading prefix (sized token.length + maxDistance).
The whole-word path is unchanged, so typos in similarly-sized words
(infra ↔ infa) still match exactly as before; the new prefix path
handles the partial-word case (infa → infrastructure).
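
The two strategies described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `editDistance` is a plain Levenshtein stand-in for `calculateOptimizedEditDistance`, `matchWord` is a hypothetical name, `MAX_DISTANCE = 2` is an assumed value, and the sketch assumes lowercasing does not change string length.

```typescript
// Hypothetical stand-in for calculateOptimizedEditDistance: plain Levenshtein
// distance using two rolling rows, with no early exit on a distance budget.
function editDistance(a: string, b: string): number {
    let prev: number[] = [];
    for (let j = 0; j <= b.length; j++) {
        prev.push(j);
    }
    for (let i = 1; i <= a.length; i++) {
        const curr: number[] = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
            curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
        }
        prev = curr;
    }
    return prev[b.length];
}

const MAX_DISTANCE = 2;     // assumed value for illustration
const MIN_TOKEN_LENGTH = 4; // mirrors the `normalizedToken.length >= 4` check

// Returns the matched portion of `word`, or null when neither strategy matches.
function matchWord(token: string, word: string): string | null {
    const t = token.toLowerCase();
    const w = word.toLowerCase();
    if (t.length < MIN_TOKEN_LENGTH) {
        return null;
    }
    // Strategy 1: lengths are close, so compare the token with the whole word.
    if (Math.abs(w.length - t.length) <= MAX_DISTANCE &&
        editDistance(t, w) <= MAX_DISTANCE) {
        return word;
    }
    // Strategy 2: the word is much longer, so compare with its leading prefix.
    if (w.length > t.length + MAX_DISTANCE) {
        const prefixLen = t.length + MAX_DISTANCE;
        if (editDistance(t, w.substring(0, prefixLen)) <= MAX_DISTANCE) {
            // Return only the matched prefix so a highlighter marks just that part.
            return word.substring(0, prefixLen);
        }
    }
    return null;
}

console.log(matchWord("infa", "infra"));          // infra
console.log(matchWord("infa", "infrastructure")); // infras
console.log(matchWord("infa", "documentation"));  // null
```

Returning the prefix rather than the whole word is what lets the caller highlight `infras` inside `infrastructure` instead of the full word.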

The matched substring is returned so the existing highlight plumbing in
NoteFlatTextExp.smartMatch only marks the matched portion of the longer
word.

Test plan

  • New unit tests covering both directions: a short token matching
    against the prefix of a longer word, and unrelated long words still
    failing the match. Adds a returned-prefix assertion so highlight
    behaviour is locked in.
  • pnpm --filter server test text_utils — 17/17 pass.
  • pnpm typecheck clean.

`fuzzyMatchWordWithResult` skips whole-word fuzzy comparisons when the
word is longer than the token by more than `MAX_EDIT_DISTANCE`, so a
short typo like "infa" never reaches a long word like "infrastructure"
even though the user clearly meant the same prefix.

Add a second strategy that, for longer words, computes the edit
distance against the word's leading prefix (sized
`token.length + maxDistance`). The whole-word path still handles
typos in similarly-sized words ("infra" ↔ "infa"); the new prefix path
covers the partial-word case ("infa" → "infrastructure").

Returns the matched prefix substring so the existing highlight
plumbing in `NoteFlatTextExp.smartMatch` highlights only the matched
portion of the longer word.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dosubot dosubot Bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) May 31, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a prefix-aware fuzzy matching strategy to improve search matching for short tokens with typos against longer words (e.g., matching 'infa' to 'infrastructure'), along with corresponding unit tests. The feedback suggests optimizing performance by adding a first-character match guard to avoid expensive edit distance calculations on unrelated long words, and simplifying a redundant Math.min calculation in the prefix length determination.

Comment on lines +337 to +338
if (normalizedToken.length >= 4 &&
word.length > normalizedToken.length + maxDistance) {

Contributor

medium

Performance Optimization: First-Character Match Guard

In a typical document, there are many long words (length >= 7). Without a first-character match guard, Strategy 2 will run the relatively expensive calculateOptimizedEditDistance (which allocates arrays and performs DP) on almost every single long word in the text for any query token of length >= 4.

Since this is a prefix-aware search, we can assume that the first character of the prefix is typed correctly in the vast majority of cases. Adding a simple check like word[0] === normalizedToken[0] will immediately filter out ~96% of non-matching long words, drastically reducing the number of edit distance calculations and preventing search lag on larger notes.

Suggested change
- if (normalizedToken.length >= 4 &&
-     word.length > normalizedToken.length + maxDistance) {
+ if (normalizedToken.length >= 4 &&
+     word[0] === normalizedToken[0] &&
+     word.length > normalizedToken.length + maxDistance) {
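
To make the effect of the guard concrete, here is a small sketch that counts how many words would reach the edit-distance call with and without the first-character check. The function name `countEditDistanceCalls` and the word list are hypothetical. The ~96% figure corresponds to 25/26, which assumes first letters are spread uniformly over a-z; real note text will differ.

```typescript
// Counts how many words would reach the edit-distance call in Strategy 2.
// `useGuard` toggles the first-character check suggested in the review.
function countEditDistanceCalls(
    token: string,
    words: string[],
    maxDistance: number,
    useGuard: boolean
): number {
    let calls = 0;
    for (const word of words) {
        const longEnough =
            token.length >= 4 && word.length > token.length + maxDistance;
        const sameFirstChar = word.charAt(0) === token.charAt(0);
        if (longEnough && (!useGuard || sameFirstChar)) {
            calls++;
        }
    }
    return calls;
}

// Hypothetical sample; not taken from the PR or its tests.
const sampleWords = [
    "infrastructure", "documentation", "installation",
    "performance", "information", "repository", "note"
];

const withoutGuard = countEditDistanceCalls("infa", sampleWords, 2, false);
const withGuard = countEditDistanceCalls("infa", sampleWords, 2, true);
console.log(withoutGuard, withGuard); // 6 3

// Share of words rejected if first letters were uniform over 26 letters.
console.log((25 / 26).toFixed(3)); // 0.962
```

In this sample only half the long words are rejected because three of them happen to start with "i", which shows how much the saving depends on the text.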

Author

Done in 2d67854: word[0] === normalizedToken[0] guard added before the prefix substring + edit-distance call.

// Require a healthy length difference to avoid double-counting Strategy 1 hits.
if (normalizedToken.length >= 4 &&
word.length > normalizedToken.length + maxDistance) {
const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);

Contributor

medium

Redundant Math.min Call

Because of the preceding if condition (word.length > normalizedToken.length + maxDistance), word.length is guaranteed to be strictly greater than normalizedToken.length + maxDistance. Therefore, Math.min(word.length, normalizedToken.length + maxDistance) will always evaluate to normalizedToken.length + maxDistance.

We can simplify this to avoid the redundant function call and make the code cleaner.

Suggested change
- const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);
+ const prefixLen = normalizedToken.length + maxDistance;

Author

Simplified in 2d67854: prefixLen = normalizedToken.length + maxDistance directly.

@greptile-apps

greptile-apps Bot commented May 31, 2026

Contributor

Greptile Summary

This PR adds a prefix-aware second strategy to fuzzyMatchWordWithResult so that short tokens with typos (e.g. infa) can fuzzy-match against the leading prefix of much longer words (e.g. infrastructure), a case the existing whole-word path skips due to the length guard. The strategy computes edit distance against a prefix of length tokenLen + maxDistance, returns the matched prefix substring so highlight plumbing in NoteFlatTextExp.smartMatch correctly marks only the matched portion, and is mathematically mutually exclusive with Strategy 1 — no word triggers both paths.

  • New Strategy 2 in fuzzyMatchWordWithResult: fires only when word.length > tokenLen + maxDistance (the exact boundary where Strategy 1 stops), with a cheap first-character guard before the edit-distance call.
  • Updated tests: prefix-match cases use genuine typos (insall, infa) that cannot satisfy the exact-substring fast-path, and the fuzzyMatchWordWithResult highlight-assertion test is moved to its own describe block.

Confidence Score: 5/5

The change is additive and self-contained: Strategy 2 fires only where Strategy 1 provably cannot (mutually exclusive by the word-length boundary), the returned prefix substring is a literal substring of the original word so downstream highlighting in NoteFlatTextExp works correctly, and all 17 tests pass with genuine-typo test cases that exercise the new code path.

Both strategies are mathematically mutually exclusive — when word.length satisfies Strategy 2's entry condition it cannot satisfy Strategy 1's — so there is no risk of double-triggering or mis-classifying a match. The prefix extraction and returned substring length are bounded by word length (guaranteed by the strict inequality guard), eliminating out-of-bounds slicing. No logic errors or incorrect behavior were found.
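
The mutual-exclusion claim can be checked by brute force. The sketch below encodes the two entry conditions as they appear in the flowchart further down (Strategy 1: length difference within `maxDistance`; Strategy 2: word longer than token length plus `maxDistance`) and counts any combination where both would fire or where the prefix slice would run past the word. The function names are hypothetical, and the first-character guard is left out because it only narrows Strategy 2.

```typescript
const MIN_TOKEN_LEN = 4;

// Entry condition for the whole-word comparison.
function strategy1Applies(tokenLen: number, wordLen: number, maxDistance: number): boolean {
    return tokenLen >= MIN_TOKEN_LEN && Math.abs(wordLen - tokenLen) <= maxDistance;
}

// Entry condition for the prefix comparison.
function strategy2Applies(tokenLen: number, wordLen: number, maxDistance: number): boolean {
    return tokenLen >= MIN_TOKEN_LEN && wordLen > tokenLen + maxDistance;
}

// Counts length combinations that break either property:
// (a) both strategies apply, or (b) the prefix would not fit inside the word.
function countViolations(maxLen: number): number {
    let violations = 0;
    for (let maxDistance = 0; maxDistance <= 3; maxDistance++) {
        for (let tokenLen = 1; tokenLen <= maxLen; tokenLen++) {
            for (let wordLen = 1; wordLen <= maxLen; wordLen++) {
                const s1 = strategy1Applies(tokenLen, wordLen, maxDistance);
                const s2 = strategy2Applies(tokenLen, wordLen, maxDistance);
                if (s1 && s2) {
                    violations++;
                }
                if (s2 && tokenLen + maxDistance >= wordLen) {
                    violations++;
                }
            }
        }
    }
    return violations;
}

console.log(countViolations(64)); // 0
```

The result follows directly from the inequalities: if `wordLen > tokenLen + maxDistance`, then `wordLen - tokenLen > maxDistance`, so the Strategy 1 condition cannot hold.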

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/trilium-core/src/services/search/utils/text_utils.ts | Refactors fuzzyMatchWordWithResult to add Strategy 2 prefix-aware fuzzy matching; mutual exclusion between strategies is mathematically guaranteed by the word-length condition, and the returned prefix substring aligns correctly with how note_flat_text.ts highlights results. |
| packages/trilium-core/src/services/search/utils/text_utils.spec.ts | Adds prefix-match tests using genuine typos ('insall', 'infa') that bypass the exact-substring fast-path, addresses prior feedback by separating the fuzzyMatchWordWithResult test into its own describe block, and locks in the returned prefix value for highlight coverage. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[fuzzyMatchWordWithResult] --> B{exact substring\nmatch?}
    B -- yes --> C[Return exact substring]
    B -- no --> D[Split into words, iterate]
    D --> F{Strategy 1:\ntoken len ge 4 AND\nabs diff le maxDist?}
    F -- yes --> G[Edit distance\ntoken vs whole word]
    G --> H{dist le maxDist?}
    H -- yes --> I[Return originalWord]
    H -- no --> J{Strategy 2:\ntoken len ge 4 AND\nword longer than token+maxDist\nAND same first char?}
    F -- no --> J
    J -- yes --> K[prefix = word first\ntokenLen+maxDist chars]
    K --> L[Edit distance\ntoken vs prefix]
    L --> M{dist le maxDist?}
    M -- yes --> N[Return prefix substring]
    M -- no --> D
    J -- no --> D
    D -- done --> O[Return null]

Reviews (2): Last reviewed commit: "fix(search): address review on prefix-aw..." | Re-trigger Greptile

Comment thread packages/trilium-core/src/services/search/utils/text_utils.spec.ts Outdated
Comment thread packages/trilium-core/src/services/search/utils/text_utils.spec.ts Outdated
Comment on lines +337 to 345
if (normalizedToken.length >= 4 &&
    word.length > normalizedToken.length + maxDistance) {
    const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);
    const prefix = word.substring(0, prefixLen);
    const distance = calculateOptimizedEditDistance(normalizedToken, prefix, maxDistance);
    if (distance <= maxDistance) {
        return originalWord.substring(0, prefixLen);
    }
}

Contributor

P2 Strategy 2 widens false-positive surface for semantically unrelated short words

Because Strategy 2 compares the token against the word's leading tokenLength + maxDistance prefix, any token that is within 2 edits of that prefix will match. Concretely, with maxDistance = 2: "word" (4 chars) → prefix of "worldwide" is "worldw" (6 chars), edit distance = 2 → match. "test" → prefix of "testimony" is "testim", edit distance = 2 → match. Neither is a typo of a prefix. You may want an additional constraint, e.g. requiring token.length / prefixLen >= 0.6, to reduce spurious matches.


Author

You're right that the prefix strategy widens the candidate set: test will match testimony's leading testim (edit distance 2, within budget). My take is to leave the constraint as-is and lean on the scoring layer. SearchResult.computeScore already applies a per-token fuzzy-match penalty, so prefix-aware hits land below exact and whole-word fuzzy hits in the ranking, and quick search only ever surfaces a handful of top-scored results. Tightening the ratio further would also drop the case this strategy exists for (infa → infrastructure, ratio 4/6 ≈ 0.67), so I'd rather keep the recall and let scoring de-prioritise the false positives. Happy to revisit if the noise turns out to be visible in practice.
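
The ratio arithmetic in this thread can be worked through with a short sketch. It assumes the prefix length is always `token.length + maxDistance` (as in the simplified code) and `maxDistance = 2`; `prefixRatio` and `passesRatio` are hypothetical helpers, not part of the PR.

```typescript
// Ratio of token length to prefix length, where the prefix is
// token.length + maxDistance characters long.
function prefixRatio(tokenLen: number, maxDistance: number): number {
    return tokenLen / (tokenLen + maxDistance);
}

// Would a token of this length survive a minimum-ratio constraint?
function passesRatio(tokenLen: number, maxDistance: number, threshold: number): boolean {
    return prefixRatio(tokenLen, maxDistance) >= threshold;
}

// Print the ratio for a few token lengths with maxDistance = 2.
for (let tokenLen = 4; tokenLen <= 8; tokenLen++) {
    console.log(
        tokenLen,
        prefixRatio(tokenLen, 2).toFixed(2),
        passesRatio(tokenLen, 2, 0.6),
        passesRatio(tokenLen, 2, 0.7)
    );
}
```

Under these assumptions the ratio depends only on the token length, so "word" vs "worldw", "test" vs "testim", and "infa" vs "infras" all sit at 4/6 ≈ 0.67. A 0.6 threshold lets all three through, and a threshold high enough to reject the first two (for example 0.7) rejects the third as well.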

- Add a cheap first-character match guard before Strategy 2 runs the
  edit-distance DP. Mistypes that change the leading letter are rare,
  and the guard rejects ~96% of unrelated long words for the same query
  before the more expensive call. (Suggested in review.)

- Drop the redundant `Math.min` around the prefix length. The preceding
  `word.length > token.length + maxDistance` check already guarantees
  the min always resolves to `token.length + maxDistance`.

- Fix the prefix-match test: `'instal'` is a verbatim substring of
  `'installer'`, so the exact-substring fast-path returned true before
  Strategy 2 was ever exercised. Use `'insall'` (a genuine missing-letter
  typo of `install`) instead, plus an explicit first-character-guard
  rejection case.

- Move the `fuzzyMatchWordWithResult` assertion out of the
  `describe('fuzzyMatchWord')` block into its own sibling describe so the
  test for the result-returning function is grouped under the function
  it actually exercises.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
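
The test fix described in the commit message above comes down to whether the token is a verbatim substring of the word. A minimal sketch, assuming the fast path is a case-insensitive substring check (`isExactSubstring` and `passesFirstCharGuard` are hypothetical helpers, and "xnsall" is an invented token for the guard-rejection case):

```typescript
// Assumed shape of the exact-substring fast path: case-insensitive containment.
function isExactSubstring(token: string, text: string): boolean {
    return text.toLowerCase().indexOf(token.toLowerCase()) !== -1;
}

// First-character guard applied before the prefix edit-distance call.
function passesFirstCharGuard(token: string, word: string): boolean {
    return word.charAt(0) === token.charAt(0);
}

// 'instal' is contained in 'installer', so the fuzzy path is never reached.
console.log(isExactSubstring("instal", "installer")); // true

// 'insall' is not contained, so the match has to come from Strategy 2.
console.log(isExactSubstring("insall", "installer")); // false

// A token with a different leading letter is rejected by the guard.
console.log(passesFirstCharGuard("xnsall", "installer")); // false
```

A test that is meant to cover the prefix strategy therefore needs a token for which the first check is false and the guard is true, which is the case for 'insall'.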
