perf(search): instant quick search via FTS5 + prefix-aware fuzzy #9963

Open
xnohat wants to merge 4 commits into TriliumNext:main from xnohat:feat/search-fts5

@xnohat xnohat commented May 27, 2026

Summary

Quick search (and any *=* / *= / =* / = content query) used to scan
every note's blob in JavaScript, dominating latency on large knowledge bases
— multi-second response times on databases with several thousand text-content
notes. This PR introduces a SQLite FTS5 inverted index over blob content,
dropping quick search latency by roughly two orders of magnitude while
maintaining strict-superset correctness for the substring/start/end/exact
operators.

The prefix-aware fuzzy improvement that originally rode along here is
independent of the FTS work and now lives in #10005, as suggested in
review. text_utils.ts is therefore not part of this PR's diff.

What changed

  • notes_fts virtual table (migration v239) indexes blob content with
    FTS5's trigram tokenizer, which records every contiguous 3-codepoint
    window. A phrase query like "ello" then matches any document containing
    the literal substring, making the FTS candidate set a strict superset of
    what findInText accepts for *=* / *= / =* / = — no silent drops.
    (The earlier unicode61 + prefix-wildcard approach only matched
    word-start occurrences, so e.g. ello would have missed hello. The
    trigram switch addresses that review finding.)

  • The index is scoped to blobs that are currently referenced by a
    non-deleted text-content note. Blobs that only exist because they back
    a historical revision or an attachment are skipped, since the search SQL
    JOINs through notes and would never return them anyway. Maintenance
    triggers fire on notes (insert when a new text note appears; remove +
    re-insert when its blobId / type / isDeleted change; clean up on
    hard-delete), with an AFTER DELETE ON blobs trigger as a safety net for
    blob garbage collection.

  • NoteContentFulltextExp narrows candidates via notes_fts MATCH for
    operators where trigram is a strict superset (*=*, =, *=, =*). The
    existing JS findInText re-checks each candidate so the operators' exact
    positional / boundary semantics are preserved. ~= / ~* fall back to
    the legacy scan because typos can produce zero trigram overlap with the
    target. != / %= continue to scan as before.

  • quickSearch route caps snippet extraction at 15 results — the
    dropdown only ever displays a small window, and snippets are the
    dominant per-result cost. Results beyond the snippet limit get
    empty-string snippets so the API response shape stays uniform.
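
The narrow-then-recheck flow described in the list above can be illustrated with a small in-memory model. This is only a sketch, not the PR's implementation: `trigrams`, `candidates`, `matches` and `search` are hypothetical names, a plain `Map` stands in for notes_fts, and it assumes `=*` means starts-with and `*=` means ends-with. A real FTS5 phrase query is stricter than the set test used here (the trigrams must be adjacent), but either way the candidate set contains every true substring match.

```typescript
type Operator = "*=*" | "=" | "*=" | "=*";

// Every contiguous 3-codepoint window of the lowercased text.
function trigrams(text: string): Set<string> {
    const cps = Array.from(text.toLowerCase());
    const out = new Set<string>();
    for (let i = 0; i + 3 <= cps.length; i++) {
        out.add(cps.slice(i, i + 3).join(""));
    }
    return out;
}

// Stage 1 (stand-in for MATCH): keep documents that contain every trigram of the token.
function candidates(docs: Map<string, string>, token: string): string[] {
    const needed = Array.from(trigrams(token));
    const result: string[] = [];
    docs.forEach((body, id) => {
        const have = trigrams(body);
        // A token shorter than 3 codepoints has no trigrams, so nothing is filtered out.
        if (needed.every((t) => have.has(t))) result.push(id);
    });
    return result;
}

// Stage 2 (stand-in for findInText): exact operator semantics on each candidate.
function matches(operator: Operator, body: string, token: string): boolean {
    const b = body.toLowerCase();
    const t = token.toLowerCase();
    switch (operator) {
        case "*=*": return b.includes(t);
        case "=*": return b.startsWith(t);
        case "*=": return b.endsWith(t);
        case "=": return b === t;
    }
}

function search(docs: Map<string, string>, operator: Operator, token: string): string[] {
    return candidates(docs, token).filter((id) => matches(operator, docs.get(id) ?? "", token));
}
```

The recheck matters because stage 1 can over-approximate: a body such as "cabca" contains all trigrams of "abcab" without containing it as a substring, so it is a candidate that stage 2 rejects.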

Trade-off: diacritic folding

The trigram tokenizer does not ship with FTS5 remove_diacritics folding,
so body-content searches where the user types the unaccented form of an
accented word (duong looking for đường) now miss via the FTS path.
Title and attribute matches still go through NoteFlatTextExp, which
normalizes diacritics in JS, so most user queries are unaffected. A
follow-up can layer diacritic-insensitive content search back on by
maintaining a pre-normalized indexed column.
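
One possible shape for the pre-normalized column mentioned above is a folding function applied to both the indexed text and the query. The sketch below is an assumption about how that could look, not code from this PR: `foldDiacritics` and `EXTRA_FOLDS` are hypothetical names, and only đ/Đ are mapped explicitly because they have no Unicode canonical decomposition.

```typescript
// Letters without a canonical decomposition need an explicit mapping (assumed list, not exhaustive).
const EXTRA_FOLDS: Record<string, string> = { "đ": "d", "Đ": "D" };

function foldDiacritics(text: string): string {
    return text
        .normalize("NFD")            // split base letters from their combining marks
        .replace(/\p{M}+/gu, "")     // drop the combining marks
        .replace(/[đĐ]/g, (ch) => EXTRA_FOLDS[ch] ?? ch);
}
```

With this folding, the query duong and the stored word đường would reduce to the same string before trigram indexing.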

Test plan

  • Unit tests covering buildFtsMatchQuery — every operator path,
    token length / alphanumeric filtering, and FTS5 quote-escaping.
  • Full pnpm --filter server test search suite passes (348 tests).
  • pnpm typecheck clean.
  • Migration v239 runs cleanly under the server test harness.
  • Smoke-tested on a populated knowledge base of mixed text and
    web-clipped notes: quick search returns results in ~200–300 ms,
    substring queries surface body matches that the prefix-only candidates
    would have dropped, and punctuation-only queries fall back to the
    legacy scan rather than raising fts5: syntax error.

Migration / upgrade notes

  • Migration v239 runs inside the standard transactional migration flow,
    after the automatic pre-migration backup. The backfill cost is one-time
    and scales with the current working set rather than total blob count.
  • No client-side changes; existing search endpoints retain their shape.
  • The FTS5 trigram tokenizer ships with the SQLite version bundled by
    better-sqlite3 12.x, so there is no native dependency change.

Performance

Informal measurements on a large knowledge base of mixed text and
web-clipped notes:

| Query shape | Before | After |
| --- | --- | --- |
| Two-token query with title + body matches | seconds | <300 ms |
| Substring match against word body (e.g. ello) | missed | hit |
| Punctuation-only token (e.g. ++, ==) | crash | matches |

Exact numbers depend on database size and content shape, but the order
of magnitude is consistent across the test databases checked.

@dosubot (bot) added the size:L label (this PR changes 100-499 lines, ignoring generated files) on May 27, 2026

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request introduces SQLite FTS5 full-text indexing over note blob content to optimize quick search performance, along with prefix-aware fuzzy matching for longer words. The review feedback highlights critical improvements to prevent crashes and runtime errors: filtering out punctuation-only tokens to avoid FTS5 syntax errors (with corresponding test coverage) and explicitly initializing snippet properties to empty strings for search results exceeding the snippet limit to prevent potential TypeErrors.

Comment on lines +63 to +65
const usableTokens = tokens
    .map((t) => (t ?? "").trim())
    .filter((t) => t.length >= 2);

Contributor

high

Punctuation-only tokens (e.g., ++, ==, //) of length >= 2 will pass this filter but will be completely stripped by the FTS5 unicode61 tokenizer. Passing a query with only punctuation/empty terms to FTS5 (like ""*) results in an fts5: syntax error and crashes the search.

We should filter out tokens that do not contain any alphanumeric characters to prevent these syntax errors.

    const unicodeAlphanumeric = /[\p{L}\p{N}]/u;
    const usableTokens = tokens
        .map((t) => (t ?? "").trim())
        .filter((t) => t.length >= 2 && unicodeAlphanumeric.test(t));

Author

Good catch — fixed in 62084f3. buildFtsMatchQuery now requires each token to contain at least one \p{L}/\p{N} code point (any unicode letter or number, so CJK/Cyrillic etc. still pass), and tokens like ++ / == are dropped before they reach FTS5. Regression test added.
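
For readers following the thread, the filtering and quoting rules discussed here can be sketched roughly as follows. This is not the PR's `buildFtsMatchQuery`; `sketchFtsMatchQuery` is a hypothetical stand-in that assumes the 3-codepoint minimum and the operator set from the later trigram commit, joins phrases with AND, and escapes embedded double quotes by doubling them, as FTS5 string syntax expects.

```typescript
// Operators assumed to be safe to narrow through a trigram MATCH.
const SUPPORTED_OPERATORS = new Set(["*=*", "=", "*=", "=*"]);
const HAS_LETTER_OR_DIGIT = /[\p{L}\p{N}]/u;
const MIN_CODEPOINTS = 3; // shorter tokens have no trigram representation

function sketchFtsMatchQuery(operator: string, tokens: string[]): string | null {
    if (!SUPPORTED_OPERATORS.has(operator)) return null; // caller falls back to the full scan

    const usable = tokens
        .map((t) => (t ?? "").trim())
        // Count codepoints, not UTF-16 units, and drop punctuation-only tokens.
        .filter((t) => Array.from(t).length >= MIN_CODEPOINTS && HAS_LETTER_OR_DIGIT.test(t));

    if (usable.length === 0) return null;

    // Each token becomes a quoted phrase; embedded quotes are doubled.
    return usable.map((t) => `"${t.replace(/"/g, '""')}"`).join(" AND ");
}
```

Returning null rather than an empty query string keeps the "no usable tokens" case on the legacy scan path instead of handing FTS5 something it would reject.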

Comment on lines +36 to +40
it("returns null when no usable tokens remain", () => {
    expect(buildFtsMatchQuery("*=*", [])).toBeNull();
    expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out
    expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull();
});

Contributor

medium

Add a test case to verify that punctuation-only tokens are correctly filtered out and do not cause FTS5 syntax errors.

Suggested change
-it("returns null when no usable tokens remain", () => {
-    expect(buildFtsMatchQuery("*=*", [])).toBeNull();
-    expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out
-    expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull();
-});
+it("returns null when no usable tokens remain", () => {
+    expect(buildFtsMatchQuery("*=*", [])).toBeNull();
+    expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out
+    expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull();
+    expect(buildFtsMatchQuery("*=*", ["++", "=="])).toBeNull(); // punctuation-only filtered out
+});

Author

Added in 62084f3, plus an extra case asserting that mixed alphanumeric/punctuation tokens (v2.0) still flow through, so the new alphanumeric filter doesn't accidentally over-filter.

Comment on lines 69 to 74
const snippetCount = Math.min(trimmed.length, QUICK_SEARCH_SNIPPET_LIMIT);
for (let i = 0; i < snippetCount; i++) {
    const result = trimmed[i];
    result.contentSnippet = searchService.extractContentSnippet(result.noteId, searchContext.highlightedTokens);
    result.attributeSnippet = searchService.extractAttributeSnippet(result.noteId, searchContext.highlightedTokens);
}

Contributor

medium

For search results beyond the QUICK_SEARCH_SNIPPET_LIMIT, contentSnippet and attributeSnippet will remain undefined. If highlightSearchResults or the API mapping layer expects these properties to be strings, this could lead to runtime TypeErrors.

To ensure robustness, we should explicitly initialize these properties to empty strings for any results that exceed the snippet limit.

    const snippetCount = Math.min(trimmed.length, QUICK_SEARCH_SNIPPET_LIMIT);
    for (let i = 0; i < trimmed.length; i++) {
        const result = trimmed[i];
        if (i < snippetCount) {
            result.contentSnippet = searchService.extractContentSnippet(result.noteId, searchContext.highlightedTokens);
            result.attributeSnippet = searchService.extractAttributeSnippet(result.noteId, searchContext.highlightedTokens);
        } else {
            result.contentSnippet = "";
            result.attributeSnippet = "";
        }
    }

Author

Done in 62084f3. contentSnippet and attributeSnippet are now explicitly set to "" for results beyond QUICK_SEARCH_SNIPPET_LIMIT. The downstream highlighter and client widget were already guarding the falsy case, but normalizing the response shape avoids the contract ambiguity.

export const MIGRATIONS: (SqlMigration | JsMigration)[] = [
// Add FTS5 full-text index over blob content so quick search doesn't have to
// scan every note's body at query time. Triggers keep the index in sync.
// Blobs are content-addressed (immutable): new content -> new blobId, so we

@contributor contributor May 29, 2026

Contributor

Looks like it will keep blobs from all note revisions in the FTS index, while search only needs/uses the latest revision blob.

Author

Good catch — addressed in ef126f2. The backfill now joins blobs against notes and only inserts rows for blobs reachable from a current, non-deleted text-content note, so revision-only and attachment-only blobs are skipped. The maintenance triggers also move from blobs to notes (insert when a text note appears; remove/re-insert on blobId/type/isDeleted changes; clean up on hard-delete), with the AFTER DELETE ON blobs trigger kept as a safety net for blob garbage collection. The search SQL is unchanged — it was already JOINing through notes — but the index it's matching against is now strictly the working set.
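
The intended invariant (the index holds exactly the blobs reachable from a current, non-deleted text-content note) can be modelled in memory as below. This is a rough sketch under assumptions, not the migration's SQL: `WorkingSetIndex`, `NoteRow` and `TEXT_TYPES` are hypothetical, the set of text-content types is a guess, and the check for other notes sharing a blob is an assumption about how shared blobs would need to be handled.

```typescript
interface NoteRow {
    blobId: string;
    type: string;
    isDeleted: boolean;
}

// Assumption: which note types count as "text-content".
const TEXT_TYPES = new Set(["text", "code"]);

class WorkingSetIndex {
    private notes = new Map<string, NoteRow>();
    readonly indexed = new Set<string>(); // blobIds currently in the index

    private indexable(n: NoteRow): boolean {
        return !n.isDeleted && TEXT_TYPES.has(n.type);
    }

    private stillReferenced(blobId: string): boolean {
        return Array.from(this.notes.values()).some((n) => n.blobId === blobId && this.indexable(n));
    }

    // Models the insert and update triggers on notes: remove the old row, re-insert if eligible.
    upsertNote(noteId: string, row: NoteRow): void {
        const old = this.notes.get(noteId);
        this.notes.set(noteId, row);
        if (old && !this.stillReferenced(old.blobId)) this.indexed.delete(old.blobId);
        if (this.indexable(row)) this.indexed.add(row.blobId);
    }

    // Models cleanup on hard-delete of a note.
    deleteNote(noteId: string): void {
        const old = this.notes.get(noteId);
        this.notes.delete(noteId);
        if (old && !this.stillReferenced(old.blobId)) this.indexed.delete(old.blobId);
    }

    // Models the AFTER DELETE ON blobs safety net.
    deleteBlob(blobId: string): void {
        this.indexed.delete(blobId);
    }
}
```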

// Skip if word is too different in length for fuzzy matching
if (Math.abs(word.length - normalizedToken.length) > maxDistance) {
continue;
// Strategy 1: whole-word fuzzy match for similar-length words ("infra" ↔ "infa", typos).

Contributor

I think the fuzzy match is independent of the FTS feature and could/should be in a separate PR?

Author

Agreed — split out into #10005. text_utils.ts and its tests are reverted to upstream in ef126f2, so this PR is now purely the FTS5 work.

xnohat and others added 3 commits June 1, 2026 00:20
Quick search (and any *=* / ~= / ~* content query) previously scanned
every note blob in JavaScript, dominating latency on large knowledge
bases — multi-second on databases with several thousand text-content
notes.

- Add a SQLite FTS5 inverted index (notes_fts) over blob content,
  indexed with unicode61 + remove_diacritics for accent-insensitive
  matching, with 2/3-char prefix indexes for partial-word hits.
- Sync the index via AFTER INSERT/DELETE triggers on blobs. Blobs are
  content-addressed (a content change writes a new row), so no UPDATE
  trigger is needed.
- NoteContentFulltextExp narrows candidates through MATCH for operators
  FTS can express (*=*, ~=, ~*) and re-checks them with the existing JS
  scorer so fuzzy/normalize/operator semantics are unchanged. Operators
  that depend on positional anchors or negation (=, *=, =*, !=, %=)
  keep the legacy unfiltered scan.
- Protected notes are always included as candidates so they remain
  searchable through findInText once the session is open.
- Extend fuzzyMatchWordWithResult with prefix-aware matching: the
  previous whole-word check skipped words whose length differed from
  the token by more than MAX_EDIT_DISTANCE, so e.g. "infa" never
  reached "infrastructure". A second strategy now tries edit distance
  against a word's leading prefix.
- Cap quickSearch snippet extraction to the first 15 results — the
  dropdown only displays a small window and snippets are the dominant
  per-result cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… fields

- buildFtsMatchQuery now requires each token to contain at least one
  unicode letter or number (\p{L}/\p{N}). Tokens like "++" or "==" used
  to pass the length check but get stripped to an empty phrase by the
  unicode61 tokenizer, which FTS5 then rejects with a syntax error.
  Adds a regression test plus one verifying mixed alphanumeric tokens
  (e.g. "v2.0") still flow through.

- quickSearch now explicitly assigns "" to contentSnippet and
  attributeSnippet for results beyond QUICK_SEARCH_SNIPPET_LIMIT instead
  of leaving them undefined. Downstream code already guarded against
  the falsy case, but the response shape is now consistent regardless
  of result position.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses two review comments on PR TriliumNext#9963:

1. The v239 backfill no longer indexes every text blob in the database.
   It now joins `blobs` against `notes` and keeps only blobs that are
   actually reachable from a current, non-deleted text-content note —
   blobs whose only references are historical revisions or attachments
   are skipped. The search SQL JOINs through `notes` anyway, so these
   blobs were never returned; indexing them just wasted space and
   slowed `MATCH`.

2. The maintenance triggers move from `blobs` (where, at INSERT time,
   no note yet references the row) to `notes`: insert into FTS when a
   text note appears, remove (and re-insert) on `blobId`/`type`/
   `isDeleted` changes, and clean up on hard-delete. An `AFTER DELETE
   ON blobs` trigger stays as a safety net for blob garbage collection.

The prefix-aware fuzzy match (text_utils.ts) is independent of the
FTS work and is moved to a separate PR so each change can be reviewed
on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@greptile-apps

greptile-apps Bot commented May 31, 2026

Contributor

Greptile Summary

This PR introduces an FTS5 notes_fts virtual table (migration v239, trigram tokenizer) to accelerate note-content search by narrowing SQL candidates before the JavaScript findInText re-check, and caps quick-search snippet extraction to 15 items to avoid per-blob read overhead on the full 50-result window.

  • FTS5 index + triggers (migration v239): creates notes_fts with trigram tokenizer, backfills current non-deleted text-content blobs, and maintains the index through four triggers. Operators where FTS is not a safe superset fall back to a full blob scan.
  • buildFtsMatchQuery: new exported function translates operator + token list into a trigram MATCH phrase query with short-token and punctuation filtering; well-covered by new unit tests.
  • Quick-search snippet cap: limits blob reads for snippets to the first 15 results; results 16–50 receive empty strings so the API shape stays uniform.

Confidence Score: 4/5

Generally safe to merge with awareness that the FTS path can silently miss notes whose content contains HTML-encoded characters like &amp;.

The FTS index is built over raw HTML blob bytes, but findInText operates on HTML-preprocessed text. A note containing R&amp;D in raw HTML will not be found when searching for R&D because the trigrams of the decoded string do not appear contiguously in the encoded form. FTS excludes the note before findInText can run, silently dropping results for notes with HTML-encoded characters, which is routine in TipTap-produced content.

Both migrations.ts (backfill indexes raw HTML) and note_content_fulltext.ts (query uses raw FTS results) are involved in the superset mismatch.
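
The superset mismatch can be reproduced with a small trigram check. The sketch below is illustrative only: `trigramsOf`, `couldMatch` and `decodeBasicEntities` are hypothetical helpers, the decoder covers just four named entities, and decoding before indexing is shown as one conceivable remedy rather than what the PR does.

```typescript
// Every contiguous 3-codepoint window of the lowercased text, in order.
function trigramsOf(text: string): string[] {
    const cps = Array.from(text.toLowerCase());
    const out: string[] = [];
    for (let i = 0; i + 3 <= cps.length; i++) {
        out.push(cps.slice(i, i + 3).join(""));
    }
    return out;
}

// True if the indexed text contains every trigram of the query (necessary for a phrase hit).
function couldMatch(indexedText: string, query: string): boolean {
    const have = new Set(trigramsOf(indexedText));
    return trigramsOf(query).every((t) => have.has(t));
}

// Deliberately tiny decoder; real HTML preprocessing handles far more than this.
const ENTITIES: Record<string, string> = { amp: "&", lt: "<", gt: ">", quot: '"' };

function decodeBasicEntities(html: string): string {
    return html.replace(/&(amp|lt|gt|quot);/g, (whole: string, name: string) => ENTITIES[name] ?? whole);
}

const rawHtml = "<p>R&amp;D budget</p>";
const missedOnRaw = !couldMatch(rawHtml, "R&D");                       // raw bytes lack the "r&d" trigram
const foundOnDecoded = couldMatch(decodeBasicEntities(rawHtml), "R&D"); // decoded text contains it
```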

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/trilium-core/src/migrations/migrations.ts | Adds migration v239: FTS5 notes_fts virtual table with trigram tokenizer, backfill, and four maintenance triggers. The core concern is that raw HTML blob bytes are indexed rather than preprocessed text, breaking the superset guarantee for HTML entities. |
| packages/trilium-core/src/services/search/expressions/note_content_fulltext.ts | Introduces buildFtsMatchQuery translating operator + tokens into an FTS5 trigram MATCH query, returning null for operators where FTS is not a safe superset. The superset claim holds for plain-text but breaks for raw HTML blobs where HTML-encoded characters cause false negatives. |
| packages/trilium-core/src/routes/api/search.ts | Caps quick-search results to 50 and snippet extraction to the first 15, assigning empty strings for the remainder. Straightforward, low-risk optimization. |
| packages/trilium-core/src/services/search/expressions/note_content_fulltext.spec.ts | Adds unit tests for buildFtsMatchQuery covering operator gating, token length filtering, punctuation-only rejection, mixed-content retention, and quote-escaping. Well-structured and matches the implementation. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    QS["quickSearch / NoteContentFulltextExp.execute()"] --> BFQ["buildFtsMatchQuery(operator, tokens)"]
    BFQ --> |"~=, ~*, !=, %=\nor tokens < 3 chars\nor punct-only"| NULL["returns null"]
    BFQ --> |"*=*, =, *=, =*\nwith valid tokens"| FTS_Q["FTS5 MATCH query\n(trigram phrases AND-ed)"]
    NULL --> FULL_SCAN["Full blob scan SQL"]
    FTS_Q --> FTS_SQL["SQL + blobId IN (SELECT FROM notes_fts MATCH ?)"]
    FTS_SQL --> PROTECTED["Always include isProtected=1 notes"]
    PROTECTED --> CANDIDATES["Candidate rows"]
    FULL_SCAN --> CANDIDATES
    CANDIDATES --> FIND["findInText: decrypt then preprocessContent then operator check"]
    FIND --> RESULT["resultNoteSet"]

Reviews (2): Last reviewed commit: "fix(search): switch FTS5 to trigram toke..."

Comment thread packages/trilium-core/src/migrations/migrations.ts Outdated
Greptile flagged that `unicode61 + prefix wildcards` was not a strict
superset of what `findInText` accepts for `*=*` / `*=` / `=*` / `=`:
the FTS5 prefix syntax `"foo"*` only matches indexed words that *start
with* "foo", so a substring query like `ello` would silently drop
"hello" (whose tokenized word begins with `h`, not `e`). That made the
fast path produce false negatives for the very pattern `*=*` is meant
to handle.

This commit:

- Migration v239 now creates `notes_fts` with the `trigram` tokenizer,
  which indexes every contiguous 3-char window. A phrase query like
  `"ello"` matches every document containing the literal substring,
  making FTS a strict superset of substring/start/end/exact matching.
  The previous `prefix = '2 3'` is dropped — it was a unicode61
  prefix-index option and isn't valid for trigram.

- `buildFtsMatchQuery` switches operator handling accordingly: it now
  emits FTS queries for `*=*` / `=` / `*=` / `=*` (all substring
  supersets) and returns `null` for `~=` / `~*` (typos can produce
  zero trigram overlap with the target) plus `!=` / `%=` as before.
  Query syntax drops the `*` wildcard (trigram doesn't accept it) and
  raises the minimum token length to 3 codepoints — anything shorter
  has no trigram representation in the index.

- The diacritic-folding behaviour from `remove_diacritics 1` is lost
  by switching tokenizers. Title and attribute matches still flow
  through `NoteFlatTextExp`, which normalizes diacritics in JS, so the
  practical impact is limited to body-content searches where the user
  types a non-diacritic form of an accented word; that case can be
  layered back on later via a pre-normalized indexed column.

- Fixes the misleading `notes_fts_after_note_insert` comment that
  claimed UPDATE-style scenarios (blobId/type/isDeleted change,
  restoration from soft-delete) were handled there. They actually
  fire `notes_fts_after_note_update`; the INSERT trigger only runs
  for genuinely new note rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 participants