feat(search): prefix-aware fuzzy match for short tokens vs long words by xnohat · Pull Request #10005 · TriliumNext/Trilium

xnohat · 2026-05-31T17:25:43Z

Summary

Split out from #9963 as requested in review — this change is independent of
the FTS5 work and stands on its own.

fuzzyMatchWordWithResult currently skips fuzzy comparisons whenever the
word in the text is more than MAX_EDIT_DISTANCE characters longer than
the search token. That means a short typo like infa never reaches a long
word like infrastructure, even though the user clearly meant the same
prefix.

This PR adds a second strategy: for words that are still longer than the
token even after the existing length guard, compute the edit distance
against the word's leading prefix (sized token.length + maxDistance).
The whole-word path is unchanged, so typos in similarly-sized words
(infra ↔ infa) still match exactly as before; the new prefix path
handles the partial-word case (infa → infrastructure).

The matched substring is returned so the existing highlight plumbing in
NoteFlatTextExp.smartMatch only marks the matched portion of the longer
word.
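The prefix strategy described above can be sketched in isolation. This is a minimal illustration, not the PR's code: `matchAgainstPrefix` and `editDistance` are hypothetical names, plain Levenshtein stands in for `calculateOptimizedEditDistance`, lowercasing stands in for whatever normalization the real function applies, and a maximum distance of 2 is assumed.

```typescript
// Assumed value; the real constant lives in text_utils.ts.
const MAX_DISTANCE = 2;

// Plain two-row Levenshtein distance (stand-in for calculateOptimizedEditDistance).
function editDistance(a: string, b: string): number {
    let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr: number[] = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

// Hypothetical helper: returns the matched leading slice of originalWord, or null.
function matchAgainstPrefix(
    token: string,
    originalWord: string,
    maxDistance: number = MAX_DISTANCE
): string | null {
    const normalizedToken = token.toLowerCase();
    const word = originalWord.toLowerCase();

    // Only words clearly longer than the token are handled by this path.
    if (normalizedToken.length < 4 || word.length <= normalizedToken.length + maxDistance) {
        return null;
    }

    const prefixLen = normalizedToken.length + maxDistance;
    const prefix = word.substring(0, prefixLen);
    if (editDistance(normalizedToken, prefix) > maxDistance) {
        return null;
    }
    // Slice the original word so a highlighter can mark exactly this portion.
    return originalWord.substring(0, prefixLen);
}

console.log(matchAgainstPrefix("infa", "Infrastructure")); // "Infras"
console.log(matchAgainstPrefix("infa", "documentation"));  // null
```

Returning a slice of the original word rather than the normalized one keeps the original casing, so the result can be located verbatim in the source text.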

Test plan

New unit tests covering both directions: a short token matching
against the prefix of a longer word, and unrelated long words still
failing the match. Adds a returned-prefix assertion so highlight
behaviour is locked in.
pnpm --filter server test text_utils — 17/17 pass.
pnpm typecheck clean.

`fuzzyMatchWordWithResult` skips whole-word fuzzy comparisons when the word is longer than the token by more than `MAX_EDIT_DISTANCE`, so a short typo like "infa" never reaches a long word like "infrastructure" even though the user clearly meant the same prefix.

Add a second strategy that, for longer words, computes the edit distance against the word's leading prefix (sized `token.length + maxDistance`). The whole-word path still handles typos in similarly-sized words ("infra" ↔ "infa"); the new prefix path covers the partial-word case ("infa" → "infrastructure").

Returns the matched prefix substring so the existing highlight plumbing in `NoteFlatTextExp.smartMatch` highlights only the matched portion of the longer word.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a prefix-aware fuzzy matching strategy to improve search matching for short tokens with typos against longer words (e.g., matching 'infa' to 'infrastructure'), along with corresponding unit tests. The feedback suggests optimizing performance by adding a first-character match guard to avoid expensive edit distance calculations on unrelated long words, and simplifying a redundant Math.min calculation in the prefix length determination.

gemini-code-assist · 2026-05-31T17:27:02Z

+            if (normalizedToken.length >= 4 &&
+                word.length > normalizedToken.length + maxDistance) {


Performance Optimization: First-Character Match Guard

In a typical document, there are many long words (length >= 7). Without a first-character match guard, Strategy 2 will run the relatively expensive calculateOptimizedEditDistance (which allocates arrays and performs DP) on almost every single long word in the text for any query token of length >= 4.

Since this is a prefix-aware search, we can assume that the first character of the prefix is typed correctly in the vast majority of cases. Adding a simple check like word[0] === normalizedToken[0] will immediately filter out ~96% of non-matching long words, drastically reducing the number of edit distance calculations and preventing search lag on larger notes.

Suggested change

-            if (normalizedToken.length >= 4 &&
-                word.length > normalizedToken.length + maxDistance) {
+            if (normalizedToken.length >= 4 &&
+                word[0] === normalizedToken[0] &&
+                word.length > normalizedToken.length + maxDistance) {

Done in 2d67854 — word[0] === normalizedToken[0] guard added before the prefix substring + edit-distance call.
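To make the effect of the guard concrete, here is a small sketch that counts how many words would reach the edit-distance call with and without the first-character check. The word list and the `prefixCandidates` helper are made up for illustration, and a maximum distance of 2 is assumed.

```typescript
// Hypothetical helper: which words would enter Strategy 2's edit-distance call?
function prefixCandidates(
    token: string,
    words: string[],
    maxDistance: number,
    useGuard: boolean
): string[] {
    return words.filter(
        (word) =>
            token.length >= 4 &&
            (!useGuard || word[0] === token[0]) &&
            word.length > token.length + maxDistance
    );
}

// Illustrative sample, not taken from real note content.
const sampleWords = [
    "infrastructure",
    "documentation",
    "performance",
    "information",
    "calculation",
    "note"
];

const withoutGuard = prefixCandidates("infa", sampleWords, 2, false);
const withGuard = prefixCandidates("infa", sampleWords, 2, true);

console.log(withoutGuard.length); // 5
console.log(withGuard.length);    // 2
```

The trade-off is that a typo in the leading letter (for example "onfa" for "infa") is rejected by the guard before any distance is computed.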

gemini-code-assist · 2026-05-31T17:27:02Z

+            // Require a healthy length difference to avoid double-counting Strategy 1 hits.
+            if (normalizedToken.length >= 4 &&
+                word.length > normalizedToken.length + maxDistance) {
+                const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);


Redundant Math.min Call

Because of the preceding if condition (word.length > normalizedToken.length + maxDistance), word.length is guaranteed to be strictly greater than normalizedToken.length + maxDistance. Therefore, Math.min(word.length, normalizedToken.length + maxDistance) will always evaluate to normalizedToken.length + maxDistance.

We can simplify this to avoid the redundant function call and make the code cleaner.

Suggested change

-                const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);
+                const prefixLen = normalizedToken.length + maxDistance;

Simplified in 2d67854 — prefixLen = normalizedToken.length + maxDistance directly.
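The redundancy argument can be checked by brute force over a range of lengths. This is only a sanity-check sketch; `minIsRedundant` is a hypothetical name and the bounds are arbitrary.

```typescript
// Returns true if, whenever the guard holds, Math.min picks tokenLen + maxDistance.
function minIsRedundant(maxTokenLen: number, maxWordLen: number, maxDistance: number): boolean {
    for (let tokenLen = 4; tokenLen <= maxTokenLen; tokenLen++) {
        for (let wordLen = 0; wordLen <= maxWordLen; wordLen++) {
            const guardHolds = wordLen > tokenLen + maxDistance;
            if (guardHolds && Math.min(wordLen, tokenLen + maxDistance) !== tokenLen + maxDistance) {
                return false;
            }
        }
    }
    return true;
}

console.log(minIsRedundant(20, 60, 2)); // true
```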

greptile-apps · 2026-05-31T17:31:28Z

Greptile Summary

This PR adds a prefix-aware second strategy to fuzzyMatchWordWithResult so that short tokens with typos (e.g. infa) can fuzzy-match against the leading prefix of much longer words (e.g. infrastructure), a case the existing whole-word path skips due to the length guard. The strategy computes edit distance against a prefix of length tokenLen + maxDistance, returns the matched prefix substring so highlight plumbing in NoteFlatTextExp.smartMatch correctly marks only the matched portion, and is mathematically mutually exclusive with Strategy 1 — no word triggers both paths.

- New Strategy 2 in fuzzyMatchWordWithResult: fires only when word.length > tokenLen + maxDistance (the exact boundary where Strategy 1 stops), with a cheap first-character guard before the edit-distance call.
- Updated tests: prefix-match cases use genuine typos (insall, infa) that cannot satisfy the exact-substring fast-path, and the fuzzyMatchWordWithResult highlight-assertion test is moved to its own describe block.

Confidence Score: 5/5

The change is additive and self-contained: Strategy 2 fires only where Strategy 1 provably cannot (mutually exclusive by the word-length boundary), the returned prefix substring is a literal substring of the original word so downstream highlighting in NoteFlatTextExp works correctly, and all 17 tests pass with genuine-typo test cases that exercise the new code path.

Both strategies are mathematically mutually exclusive — when word.length satisfies Strategy 2's entry condition it cannot satisfy Strategy 1's — so there is no risk of double-triggering or mis-classifying a match. The prefix extraction and returned substring length are bounded by word length (guaranteed by the strict inequality guard), eliminating out-of-bounds slicing. No logic errors or incorrect behavior were found.
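The mutual-exclusion claim follows from the two entry conditions and can be verified exhaustively over small lengths. The sketch below assumes the conditions are exactly those named in this review (an absolute length difference of at most `maxDistance` for Strategy 1, a word longer than `tokenLen + maxDistance` for Strategy 2); the function names are hypothetical.

```typescript
// Entry condition for the whole-word path, as described in the review.
function strategy1Applies(tokenLen: number, wordLen: number, maxDistance: number): boolean {
    return tokenLen >= 4 && Math.abs(wordLen - tokenLen) <= maxDistance;
}

// Entry condition for the prefix path (first-character guard left out:
// it only narrows the condition further).
function strategy2Applies(tokenLen: number, wordLen: number, maxDistance: number): boolean {
    return tokenLen >= 4 && wordLen > tokenLen + maxDistance;
}

// Counts (tokenLen, wordLen) pairs for which both strategies would run.
function countOverlaps(limit: number, maxDistance: number): number {
    let overlaps = 0;
    for (let tokenLen = 0; tokenLen <= limit; tokenLen++) {
        for (let wordLen = 0; wordLen <= limit; wordLen++) {
            if (
                strategy1Applies(tokenLen, wordLen, maxDistance) &&
                strategy2Applies(tokenLen, wordLen, maxDistance)
            ) {
                overlaps++;
            }
        }
    }
    return overlaps;
}

console.log(countOverlaps(50, 2)); // 0
```

Strategy 1 requires `wordLen - tokenLen <= maxDistance` while Strategy 2 requires `wordLen - tokenLen > maxDistance`, so no pair can satisfy both.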

No files require special attention.

Important Files Changed

Filename	Overview
packages/trilium-core/src/services/search/utils/text_utils.ts	Refactors fuzzyMatchWordWithResult to add Strategy 2 prefix-aware fuzzy matching; mutual exclusion between strategies is mathematically guaranteed by the word-length condition, and the returned prefix substring aligns correctly with how note_flat_text.ts highlights results.
packages/trilium-core/src/services/search/utils/text_utils.spec.ts	Adds prefix-match tests using genuine typos ('insall', 'infa') that bypass the exact-substring fast-path, addresses prior feedback by separating the fuzzyMatchWordWithResult test into its own describe block, and locks in the returned prefix value for highlight coverage.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[fuzzyMatchWordWithResult] --> B{exact substring\nmatch?}
    B -- yes --> C[Return exact substring]
    B -- no --> D[Split into words, iterate]
    D --> F{Strategy 1:\ntoken len ge 4 AND\nabs diff le maxDist?}
    F -- yes --> G[Edit distance\ntoken vs whole word]
    G --> H{dist le maxDist?}
    H -- yes --> I[Return originalWord]
    H -- no --> J{Strategy 2:\ntoken len ge 4 AND\nword longer than token+maxDist\nAND same first char?}
    F -- no --> J
    J -- yes --> K[prefix = word first\ntokenLen+maxDist chars]
    K --> L[Edit distance\ntoken vs prefix]
    L --> M{dist le maxDist?}
    M -- yes --> N[Return prefix substring]
    M -- no --> D
    J -- no --> D
    D -- done --> O[Return null]
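
Read as code, the flowchart corresponds roughly to the sketch below. It is an approximation under stated assumptions, not the function from `text_utils.ts`: `fuzzyMatchSketch` is a hypothetical name, words are split on whitespace, normalization is reduced to lowercasing, plain Levenshtein replaces `calculateOptimizedEditDistance`, and the maximum distance is assumed to be 2.

```typescript
const ASSUMED_MAX_EDIT_DISTANCE = 2;

// Plain two-row Levenshtein distance (stand-in for the optimized version).
function levenshtein(a: string, b: string): number {
    let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr: number[] = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

function fuzzyMatchSketch(
    token: string,
    text: string,
    maxDistance: number = ASSUMED_MAX_EDIT_DISTANCE
): string | null {
    const normalizedToken = token.toLowerCase();
    if (normalizedToken.length === 0) return null;

    // Fast path: exact substring match.
    const at = text.toLowerCase().indexOf(normalizedToken);
    if (at !== -1) return text.substring(at, at + normalizedToken.length);

    // Both fuzzy strategies require a token of at least 4 characters.
    if (normalizedToken.length < 4) return null;

    for (const originalWord of text.split(/\s+/)) {
        const word = originalWord.toLowerCase();

        // Strategy 1: whole-word comparison for similarly sized words.
        if (Math.abs(word.length - normalizedToken.length) <= maxDistance) {
            if (levenshtein(normalizedToken, word) <= maxDistance) return originalWord;
        }

        // Strategy 2: compare against the leading prefix of a much longer word.
        if (word.length > normalizedToken.length + maxDistance && word[0] === normalizedToken[0]) {
            const prefixLen = normalizedToken.length + maxDistance;
            if (levenshtein(normalizedToken, word.substring(0, prefixLen)) <= maxDistance) {
                return originalWord.substring(0, prefixLen);
            }
        }
    }
    return null;
}

console.log(fuzzyMatchSketch("infa", "Cloud Infrastructure notes")); // "Infras"
console.log(fuzzyMatchSketch("infa", "infra team"));                 // "infra"
console.log(fuzzyMatchSketch("infa", "documentation performance"));  // null
```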

Reviews (2): Last reviewed commit: "fix(search): address review on prefix-aw..."

greptile-apps · 2026-05-31T17:31:34Z

+            if (normalizedToken.length >= 4 &&
+                word.length > normalizedToken.length + maxDistance) {
+                const prefixLen = Math.min(word.length, normalizedToken.length + maxDistance);
+                const prefix = word.substring(0, prefixLen);
+                const distance = calculateOptimizedEditDistance(normalizedToken, prefix, maxDistance);
+                if (distance <= maxDistance) {
+                    return originalWord.substring(0, prefixLen);
+                }
            }


Strategy 2 widens false-positive surface for semantically unrelated short words

Because Strategy 2 compares the token against the word's leading tokenLength + maxDistance prefix, any token that is within 2 edits of a longer word's prefix will match. Concretely, with maxDistance = 2: "word" (4 chars) → prefix of "worldwide" is "worldw" (6 chars), edit distance = 2 → match. "test" → prefix of "testimony" is "testim", edit distance = 2 → match. Neither is a typo of a prefix. You may want an additional constraint, e.g. requiring token.length / prefixLen >= 0.6, to reduce spurious matches.

You're right that the prefix strategy widens the candidate set — test will match testimony's leading testim (edit distance 2 within budget). My take is to leave the constraint as-is and lean on the scoring layer: SearchResult.computeScore already applies a per-token fuzzy-match penalty so prefix-aware hits land below exact and whole-word fuzzy hits in the ranking, and quick search only ever surfaces a handful of top-scored results. Tightening the ratio further would also drop the case this strategy exists for (infa → infrastructure, ratio 4/6 ≈ 0.67), so I'd rather keep the recall and let scoring de-prioritise the false positives. Happy to revisit if the noise turns out to be visible in practice.
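The numbers in this exchange can be reproduced with a short sketch. It assumes `maxDistance = 2` as stated in the comment, uses plain Levenshtein in place of `calculateOptimizedEditDistance`, and `prefixCheck` is a hypothetical helper. Because the prefix length is always `token.length + maxDistance`, the ratio depends only on the token length: for a 4-character token it is 4/6 regardless of the word, so a 0.6 threshold would still accept all three examples.

```typescript
const maxDistance = 2; // as assumed in the review comment

// Plain two-row Levenshtein distance (stand-in for the optimized version).
function distance(a: string, b: string): number {
    let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr: number[] = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

interface PrefixCheck {
    prefix: string;
    editDistance: number;
    ratio: number;
}

// Hypothetical helper: what Strategy 2 would compare, plus the proposed ratio.
function prefixCheck(token: string, word: string): PrefixCheck {
    const prefixLen = token.length + maxDistance;
    const prefix = word.substring(0, prefixLen);
    return { prefix, editDistance: distance(token, prefix), ratio: token.length / prefixLen };
}

for (const [token, word] of [
    ["word", "worldwide"],
    ["test", "testimony"],
    ["infa", "infrastructure"]
]) {
    const result = prefixCheck(token, word);
    console.log(token, word, result.prefix, result.editDistance, result.ratio >= 0.6);
}
```

Under these assumptions, separating "infa" from "word" by ratio alone is not possible, since both tokens have the same length; any filter would have to look at something other than the length ratio.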

- Add a cheap first-character match guard before Strategy 2 runs the edit-distance DP. Mistypes that change the leading letter are rare, and the guard rejects ~96% of unrelated long words for the same query before the more expensive call. (Suggested in review.)
- Drop the redundant `Math.min` around the prefix length. The preceding `word.length > token.length + maxDistance` check already guarantees the min always resolves to `token.length + maxDistance`.
- Fix the prefix-match test: `'instal'` is a verbatim substring of `'installer'`, so the exact-substring fast-path returned true before Strategy 2 was ever exercised. Use `'insall'` (a genuine missing-letter typo of `install`) instead, plus an explicit first-character-guard rejection case.
- Move the `fuzzyMatchWordWithResult` assertion out of the `describe('fuzzyMatchWord')` block into its own sibling describe so the test for the result-returning function is grouped under the function it actually exercises.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
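The test fix in the third item comes down to whether the token is a literal substring of the word. A minimal sketch, assuming the fast path is a case-insensitive substring check (`hitsExactSubstringFastPath` is a hypothetical name, not the function under test):

```typescript
// Hypothetical stand-in for the exact-substring fast path.
function hitsExactSubstringFastPath(token: string, text: string): boolean {
    return text.toLowerCase().includes(token.toLowerCase());
}

// 'instal' is contained verbatim in 'installer', so no fuzzy strategy runs.
console.log(hitsExactSubstringFastPath("instal", "installer")); // true

// 'insall' drops the 't', so the fast path misses and fuzzy matching is exercised.
console.log(hitsExactSubstringFastPath("insall", "installer")); // false
```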

dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 31, 2026

xnohat mentioned this pull request May 31, 2026 in: perf(search): instant quick search via FTS5 + prefix-aware fuzzy #9963 (Open, 5 tasks)

gemini-code-assist Bot reviewed May 31, 2026

greptile-apps Bot reviewed May 31, 2026

xnohat wants to merge 2 commits into TriliumNext:main from xnohat:feat/search-prefix-fuzzy
