perf(search): instant quick search via FTS5 + prefix-aware fuzzy by xnohat · Pull Request #9963 · TriliumNext/Trilium

xnohat · 2026-05-27T16:11:46Z

Summary

Quick search (and any *=* / *= / =* / = content query) used to scan
every note's blob in JavaScript, dominating latency on large knowledge bases
— multi-second response times on databases with several thousand text-content
notes. This PR introduces a SQLite FTS5 inverted index over blob content,
dropping quick search latency by roughly two orders of magnitude while
maintaining strict-superset correctness for the substring/start/end/exact
operators.

The prefix-aware fuzzy improvement that originally rode along here is
independent of the FTS work and now lives in #10005, as suggested in
review. text_utils.ts is therefore not part of this PR's diff.

What changed

notes_fts virtual table (migration v239) indexes blob content with
FTS5's trigram tokenizer, which records every contiguous 3-codepoint
window. A phrase query like "ello" then matches any document containing
the literal substring, making the FTS candidate set a strict superset of
what findInText accepts for *=* / *= / =* / = — no silent drops.
(The earlier unicode61 + prefix-wildcard approach only matched
word-start occurrences, so e.g. ello would have missed hello. The
trigram switch addresses that review finding.)
The index is scoped to blobs that are currently referenced by a
non-deleted text-content note — blobs that only exist because they back
a historical revision or an attachment are skipped, since the search SQL
JOINs through notes and would never return them anyway. Maintenance
triggers fire on notes (insert when a new text note appears; remove +
re-insert when its blobId / type / isDeleted change; clean up on
hard-delete), with an AFTER DELETE ON blobs trigger as a safety net for
blob garbage collection.
NoteContentFulltextExp narrows candidates via notes_fts MATCH for
operators where trigram is a strict superset (*=*, =, *=, =*). The
existing JS findInText re-checks each candidate so the operators' exact
positional / boundary semantics are preserved. ~= / ~* fall back to
the legacy scan because typos can produce zero trigram overlap with the
target. != / %= continue to scan as before.
quickSearch route caps snippet extraction at 15 results — the
dropdown only ever displays a small window, and snippets are the
dominant per-result cost. Results beyond the snippet limit get
empty-string snippets so the API response shape stays uniform.

Trade-off: diacritic folding

The trigram tokenizer does not ship with FTS5 remove_diacritics folding,
so body-content searches where the user types the unaccented form of an
accented word (duong looking for đường) now miss via the FTS path.
Title and attribute matches still go through NoteFlatTextExp, which
normalizes diacritics in JS, so most user queries are unaffected. A
follow-up can layer diacritic-insensitive content search back on by
maintaining a pre-normalized indexed column.

Test plan

Unit tests covering buildFtsMatchQuery — every operator path,
token length / alphanumeric filtering, and FTS5 quote-escaping.
Full pnpm --filter server test search suite passes (348 tests).
pnpm typecheck clean.
Migration v239 runs cleanly under the server test harness.
Smoke-tested on a populated knowledge base of mixed text and
web-clipped notes: quick search returns results in ~200–300 ms,
substring queries surface body matches the prefix-only candidates
would have dropped, punctuation-only queries fall back to the
legacy scan rather than raising fts5: syntax error.

Migration / upgrade notes

Migration v239 runs inside the standard transactional migration flow,
after the automatic pre-migration backup. The backfill cost is one-time
and scales with the current working set rather than total blob count.
No client-side changes; existing search endpoints retain their shape.
The FTS5 trigram tokenizer ships with the SQLite version bundled by
better-sqlite3 12.x, no native dependency change.

Performance

Informal measurements on a large knowledge base of mixed text and
web-clipped notes:

Query shape	Before	After
Two-token query with title + body matches	seconds	<300 ms
Substring match against word body (e.g. ello)	missed	hit
Punctuation-only token (e.g. ++, ==)	crash	matches

Exact numbers depend on database size and content shape, but the order
of magnitude is consistent across the test databases checked.

gemini-code-assist

Code Review

This pull request introduces SQLite FTS5 full-text indexing over note blob content to optimize quick search performance, along with prefix-aware fuzzy matching for longer words. The review feedback highlights critical improvements to prevent crashes and runtime errors: filtering out punctuation-only tokens to avoid FTS5 syntax errors (with corresponding test coverage) and explicitly initializing snippet properties to empty strings for search results exceeding the snippet limit to prevent potential TypeErrors.

gemini-code-assist · 2026-05-27T16:14:29Z

+    const usableTokens = tokens
+        .map((t) => (t ?? "").trim())
+        .filter((t) => t.length >= 2);


Punctuation-only tokens (e.g., ++, ==, //) of length >= 2 will pass this filter but will be completely stripped by the FTS5 unicode61 tokenizer. Passing a query with only punctuation/empty terms to FTS5 (like ""*) results in an fts5: syntax error and crashes the search.

We should filter out tokens that do not contain any alphanumeric characters to prevent these syntax errors.

const unicodeAlphanumeric = /[\p{L}\p{N}]/u; const usableTokens = tokens .map((t) => (t ?? "").trim()) .filter((t) => t.length >= 2 && unicodeAlphanumeric.test(t));

Good catch — fixed in 62084f3. buildFtsMatchQuery now requires each token to contain at least one \p{L}/\p{N} code point (any unicode letter or number, so CJK/Cyrillic etc. still pass), and tokens like ++ / == are dropped before they reach FTS5. Regression test added.

gemini-code-assist · 2026-05-27T16:14:29Z

+    it("returns null when no usable tokens remain", () => {
+        expect(buildFtsMatchQuery("*=*", [])).toBeNull();
+        expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out
+        expect(buildFtsMatchQuery("*=*", ["", "  "])).toBeNull();
+    });


Add a test case to verify that punctuation-only tokens are correctly filtered out and do not cause FTS5 syntax errors.

Suggested change

it("returns null when no usable tokens remain", () => {

expect(buildFtsMatchQuery("*=*", [])).toBeNull();

expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out

expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull();

});

it("returns null when no usable tokens remain", () => {

expect(buildFtsMatchQuery("*=*", [])).toBeNull();

expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out

expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull();

expect(buildFtsMatchQuery("*=*", ["++", "=="])).toBeNull(); // punctuation-only filtered out

});

Added in 62084f3, plus an extra case asserting that mixed alphanumeric/punctuation tokens (v2.0) still flow through, so the new alphanumeric filter doesn't accidentally over-filter.

gemini-code-assist · 2026-05-27T16:14:29Z

+    const snippetCount = Math.min(trimmed.length, QUICK_SEARCH_SNIPPET_LIMIT);
+    for (let i = 0; i < snippetCount; i++) {
+        const result = trimmed[i];
        result.contentSnippet = searchService.extractContentSnippet(result.noteId, searchContext.highlightedTokens);
        result.attributeSnippet = searchService.extractAttributeSnippet(result.noteId, searchContext.highlightedTokens);
    }


For search results beyond the QUICK_SEARCH_SNIPPET_LIMIT, contentSnippet and attributeSnippet will remain undefined. If highlightSearchResults or the API mapping layer expects these properties to be strings, this could lead to runtime TypeErrors.

To ensure robustness, we should explicitly initialize these properties to empty strings for any results that exceed the snippet limit.

const snippetCount = Math.min(trimmed.length, QUICK_SEARCH_SNIPPET_LIMIT); for (let i = 0; i < trimmed.length; i++) { const result = trimmed[i]; if (i < snippetCount) { result.contentSnippet = searchService.extractContentSnippet(result.noteId, searchContext.highlightedTokens); result.attributeSnippet = searchService.extractAttributeSnippet(result.noteId, searchContext.highlightedTokens); } else { result.contentSnippet = ""; result.attributeSnippet = ""; } }

Done in 62084f3. contentSnippet and attributeSnippet are now explicitly set to "" for results beyond QUICK_SEARCH_SNIPPET_LIMIT. The downstream highlighter and client widget were already guarding the falsy case, but normalizing the response shape avoids the contract ambiguity.

contributor · 2026-05-29T15:10:02Z

 export const MIGRATIONS: (SqlMigration | JsMigration)[] = [
+    // Add FTS5 full-text index over blob content so quick search doesn't have to
+    // scan every note's body at query time. Triggers keep the index in sync.
+    // Blobs are content-addressed (immutable): new content -> new blobId, so we


Looks like it will it keep blobs from all note revisions in FTS index. While search only needs/uses the latest revision blob.

Good catch — addressed in ef126f2. The backfill now joins blobs against notes and only inserts rows for blobs reachable from a current, non-deleted text-content note, so revision-only and attachment-only blobs are skipped. The maintenance triggers also move from blobs to notes (insert when a text note appears; remove/re-insert on blobId/type/isDeleted changes; clean up on hard-delete), with the AFTER DELETE ON blobs trigger kept as a safety net for blob garbage collection. The search SQL is unchanged — it was already JOINing through notes — but the index it's matching against is now strictly the working set.

contributor · 2026-05-29T15:11:23Z

-            // Skip if word is too different in length for fuzzy matching
-            if (Math.abs(word.length - normalizedToken.length) > maxDistance) {
-                continue;
+            // Strategy 1: whole-word fuzzy match for similar-length words ("infra" ↔ "infa", typos).


I think fuzzy match is independent from FTS feature, and could/should be in separate PR?

Agreed — split out into #10005. text_utils.ts and its tests are reverted to upstream in ef126f2, so this PR is now purely the FTS5 work.

Quick search (and any *=* / ~= / ~* content query) previously scanned every note blob in JavaScript, dominating latency on large knowledge bases — multi-second on databases with several thousand text-content notes. - Add a SQLite FTS5 inverted index (notes_fts) over blob content, indexed with unicode61 + remove_diacritics for accent-insensitive matching, with 2/3-char prefix indexes for partial-word hits. - Sync the index via AFTER INSERT/DELETE triggers on blobs. Blobs are content-addressed (a content change writes a new row), so no UPDATE trigger is needed. - NoteContentFulltextExp narrows candidates through MATCH for operators FTS can express (*=*, ~=, ~*) and re-checks them with the existing JS scorer so fuzzy/normalize/operator semantics are unchanged. Operators that depend on positional anchors or negation (=, *=, =*, !=, %=) keep the legacy unfiltered scan. - Protected notes are always included as candidates so they remain searchable through findInText once the session is open. - Extend fuzzyMatchWordWithResult with prefix-aware matching: the previous whole-word check skipped words whose length differed from the token by more than MAX_EDIT_DISTANCE, so e.g. "infa" never reached "infrastructure". A second strategy now tries edit distance against a word's leading prefix. - Cap quickSearch snippet extraction to the first 15 results — the dropdown only displays a small window and snippets are the dominant per-result cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… fields - buildFtsMatchQuery now requires each token to contain at least one unicode letter or number (\p{L}/\p{N}). Tokens like "++" or "==" used to pass the length check but get stripped to an empty phrase by the unicode61 tokenizer, which FTS5 then rejects with a syntax error. Adds a regression test plus one verifying mixed alphanumeric tokens (e.g. "v2.0") still flow through. - quickSearch now explicitly assigns "" to contentSnippet and attributeSnippet for results beyond QUICK_SEARCH_SNIPPET_LIMIT instead of leaving them undefined. Downstream code already guarded against the falsy case, but the response shape is now consistent regardless of result position. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses two review comments on PR TriliumNext#9963: 1. The v239 backfill no longer indexes every text blob in the database. It now joins `blobs` against `notes` and keeps only blobs that are actually reachable from a current, non-deleted text-content note — blobs whose only references are historical revisions or attachments are skipped. The search SQL JOINs through `notes` anyway, so these blobs were never returned; indexing them just wasted space and slowed `MATCH`. 2. The maintenance triggers move from `blobs` (where, at INSERT time, no note yet references the row) to `notes`: insert into FTS when a text note appears, remove (and re-insert) on `blobId`/`type`/ `isDeleted` changes, and clean up on hard-delete. An `AFTER DELETE ON blobs` trigger stays as a safety net for blob garbage collection. The prefix-aware fuzzy match (text_utils.ts) is independent of the FTS work and is moved to a separate PR so each change can be reviewed on its own merits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-05-31T17:32:09Z

Greptile Summary

This PR introduces an FTS5 notes_fts virtual table (migration v239, trigram tokenizer) to accelerate note-content search by narrowing SQL candidates before the JavaScript findInText re-check, and caps quick-search snippet extraction to 15 items to avoid per-blob read overhead on the full 50-result window.

FTS5 index + triggers (migration v239): creates notes_fts with trigram tokenizer, backfills current non-deleted text-content blobs, and maintains the index through four triggers. Operators where FTS is not a safe superset fall back to a full blob scan.
buildFtsMatchQuery: new exported function translates operator + token list into a trigram MATCH phrase query with short-token and punctuation filtering; well-covered by new unit tests.
Quick-search snippet cap: limits blob reads for snippets to the first 15 results; results 16–50 receive empty strings so the API shape stays uniform.

Confidence Score: 4/5

Generally safe to merge with awareness that the FTS path can silently miss notes whose content contains HTML-encoded characters like &.

The FTS index is built over raw HTML blob bytes, but findInText operates on HTML-preprocessed text. A note containing R&D in raw HTML will not be found when searching for R&D because the trigrams of the decoded string do not appear contiguously in the encoded form — FTS excludes the note before findInText can run, silently dropping results for notes with HTML-encoded characters, which is routine in TipTap-produced content.

Both migrations.ts (backfill indexes raw HTML) and note_content_fulltext.ts (query uses raw FTS results) are involved in the superset mismatch.

Important Files Changed

Filename	Overview
packages/trilium-core/src/migrations/migrations.ts	Adds migration v239: FTS5 notes_fts virtual table with trigram tokenizer, backfill, and four maintenance triggers. The core concern is that raw HTML blob bytes are indexed rather than preprocessed text, breaking the superset guarantee for HTML entities.
packages/trilium-core/src/services/search/expressions/note_content_fulltext.ts	Introduces buildFtsMatchQuery translating operator + tokens into an FTS5 trigram MATCH query, returning null for operators where FTS is not a safe superset. The superset claim holds for plain-text but breaks for raw HTML blobs where HTML-encoded characters cause false negatives.
packages/trilium-core/src/routes/api/search.ts	Caps quick-search results to 50 and snippet extraction to the first 15, assigning empty strings for the remainder. Straightforward, low-risk optimization.
packages/trilium-core/src/services/search/expressions/note_content_fulltext.spec.ts	Adds unit tests for buildFtsMatchQuery covering operator gating, token length filtering, punctuation-only rejection, mixed-content retention, and quote-escaping. Well-structured and matches the implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    QS["quickSearch / NoteContentFulltextExp.execute()"] --> BFQ["buildFtsMatchQuery(operator, tokens)"]
    BFQ --> |"~=, ~*, !=, %=\nor tokens < 3 chars\nor punct-only"| NULL["returns null"]
    BFQ --> |"*=*, =, *=, =*\nwith valid tokens"| FTS_Q["FTS5 MATCH query\n(trigram phrases AND-ed)"]
    NULL --> FULL_SCAN["Full blob scan SQL"]
    FTS_Q --> FTS_SQL["SQL + blobId IN (SELECT FROM notes_fts MATCH ?)"]
    FTS_SQL --> PROTECTED["Always include isProtected=1 notes"]
    PROTECTED --> CANDIDATES["Candidate rows"]
    FULL_SCAN --> CANDIDATES
    CANDIDATES --> FIND["findInText: decrypt then preprocessContent then operator check"]
    FIND --> RESULT["resultNoteSet"]

_{Reviews (2): Last reviewed commit: "fix(search): switch FTS5 to trigram toke..." | Re-trigger Greptile}

Greptile flagged that `unicode61 + prefix wildcards` was not a strict superset of what `findInText` accepts for `*=*` / `*=` / `=*` / `=`: the FTS5 prefix syntax `"foo"*` only matches indexed words that *start with* "foo", so a substring query like `ello` would silently drop "hello" (whose tokenized word begins with `h`, not `e`). That made the fast path produce false negatives for the very pattern `*=*` is meant to handle. This commit: - Migration v239 now creates `notes_fts` with the `trigram` tokenizer, which indexes every contiguous 3-char window. A phrase query like `"ello"` matches every document containing the literal substring, making FTS a strict superset of substring/start/end/exact matching. The previous `prefix = '2 3'` is dropped — it was a unicode61 prefix-index option and isn't valid for trigram. - `buildFtsMatchQuery` switches operator handling accordingly: it now emits FTS queries for `*=*` / `=` / `*=` / `=*` (all substring supersets) and returns `null` for `~=` / `~*` (typos can produce zero trigram overlap with the target) plus `!=` / `%=` as before. Query syntax drops the `*` wildcard (trigram doesn't accept it) and raises the minimum token length to 3 codepoints — anything shorter has no trigram representation in the index. - The diacritic-folding behaviour from `remove_diacritics 1` is lost by switching tokenizers. Title and attribute matches still flow through `NoteFlatTextExp`, which normalizes diacritics in JS, so the practical impact is limited to body-content searches where the user types a non-diacritic form of an accented word; that case can be layered back on later via a pre-normalized indexed column. - Fixes the misleading `notes_fts_after_note_insert` comment that claimed UPDATE-style scenarios (blobId/type/isDeleted change, restoration from soft-delete) were handled there. They actually fire `notes_fts_after_note_update`; the INSERT trigger only runs for genuinely new note rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 27, 2026

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

eliandoran added the merge-conflicts label May 27, 2026

xnohat force-pushed the feat/search-fts5 branch from 62084f3 to 12118a4 Compare May 27, 2026 16:27

eliandoran removed the merge-conflicts label May 27, 2026

contributor reviewed May 29, 2026

View reviewed changes

xnohat and others added 3 commits June 1, 2026 00:20

xnohat force-pushed the feat/search-fts5 branch from 12118a4 to ef126f2 Compare May 31, 2026 17:25

xnohat mentioned this pull request May 31, 2026

feat(search): prefix-aware fuzzy match for short tokens vs long words #10005

Open

3 tasks

greptile-apps Bot reviewed May 31, 2026

View reviewed changes

Comment thread packages/trilium-core/src/services/search/expressions/note_content_fulltext.ts

Comment thread packages/trilium-core/src/services/search/expressions/note_content_fulltext.ts

Comment thread packages/trilium-core/src/migrations/migrations.ts Outdated

Uh oh!

Conversation

xnohat commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Trade-off: diacritic folding

Test plan

Migration / upgrade notes

Performance

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

xnohat May 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

xnohat May 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

xnohat May 27, 2026

Choose a reason for hiding this comment

Uh oh!

contributor May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xnohat May 31, 2026

Choose a reason for hiding this comment

Uh oh!

contributor May 29, 2026

Choose a reason for hiding this comment

Uh oh!

xnohat May 31, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xnohat commented May 27, 2026 •

edited

Loading

contributor May 29, 2026 •

edited

Loading

greptile-apps Bot commented May 31, 2026 •

edited

Loading