perf(search): instant quick search via FTS5 + prefix-aware fuzzy#9963
perf(search): instant quick search via FTS5 + prefix-aware fuzzy#9963xnohat wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces SQLite FTS5 full-text indexing over note blob content to optimize quick search performance, along with prefix-aware fuzzy matching for longer words. The review feedback highlights critical improvements to prevent crashes and runtime errors: filtering out punctuation-only tokens to avoid FTS5 syntax errors (with corresponding test coverage) and explicitly initializing snippet properties to empty strings for search results exceeding the snippet limit to prevent potential TypeErrors.
| const usableTokens = tokens | ||
| .map((t) => (t ?? "").trim()) | ||
| .filter((t) => t.length >= 2); |
There was a problem hiding this comment.
Punctuation-only tokens (e.g., ++, ==, //) of length >= 2 will pass this filter but will be completely stripped by the FTS5 unicode61 tokenizer. Passing a query with only punctuation/empty terms to FTS5 (like ""*) results in an fts5: syntax error and crashes the search.
We should filter out tokens that do not contain any alphanumeric characters to prevent these syntax errors.
const unicodeAlphanumeric = /[\p{L}\p{N}]/u;
const usableTokens = tokens
.map((t) => (t ?? "").trim())
.filter((t) => t.length >= 2 && unicodeAlphanumeric.test(t));There was a problem hiding this comment.
Good catch — fixed in 62084f3. buildFtsMatchQuery now requires each token to contain at least one \p{L}/\p{N} code point (any unicode letter or number, so CJK/Cyrillic etc. still pass), and tokens like ++ / == are dropped before they reach FTS5. Regression test added.
| it("returns null when no usable tokens remain", () => { | ||
| expect(buildFtsMatchQuery("*=*", [])).toBeNull(); | ||
| expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out | ||
| expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull(); | ||
| }); |
There was a problem hiding this comment.
Add a test case to verify that punctuation-only tokens are correctly filtered out and do not cause FTS5 syntax errors.
| it("returns null when no usable tokens remain", () => { | |
| expect(buildFtsMatchQuery("*=*", [])).toBeNull(); | |
| expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out | |
| expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull(); | |
| }); | |
| it("returns null when no usable tokens remain", () => { | |
| expect(buildFtsMatchQuery("*=*", [])).toBeNull(); | |
| expect(buildFtsMatchQuery("*=*", ["a"])).toBeNull(); // single char filtered out | |
| expect(buildFtsMatchQuery("*=*", ["", " "])).toBeNull(); | |
| expect(buildFtsMatchQuery("*=*", ["++", "=="])).toBeNull(); // punctuation-only filtered out | |
| }); |
There was a problem hiding this comment.
Added in 62084f3, plus an extra case asserting that mixed alphanumeric/punctuation tokens (v2.0) still flow through, so the new alphanumeric filter doesn't accidentally over-filter.
| const snippetCount = Math.min(trimmed.length, QUICK_SEARCH_SNIPPET_LIMIT); | ||
| for (let i = 0; i < snippetCount; i++) { | ||
| const result = trimmed[i]; | ||
| result.contentSnippet = searchService.extractContentSnippet(result.noteId, searchContext.highlightedTokens); | ||
| result.attributeSnippet = searchService.extractAttributeSnippet(result.noteId, searchContext.highlightedTokens); | ||
| } |
There was a problem hiding this comment.
For search results beyond the QUICK_SEARCH_SNIPPET_LIMIT, contentSnippet and attributeSnippet will remain undefined. If highlightSearchResults or the API mapping layer expects these properties to be strings, this could lead to runtime TypeErrors.
To ensure robustness, we should explicitly initialize these properties to empty strings for any results that exceed the snippet limit.
const snippetCount = Math.min(trimmed.length, QUICK_SEARCH_SNIPPET_LIMIT);
for (let i = 0; i < trimmed.length; i++) {
const result = trimmed[i];
if (i < snippetCount) {
result.contentSnippet = searchService.extractContentSnippet(result.noteId, searchContext.highlightedTokens);
result.attributeSnippet = searchService.extractAttributeSnippet(result.noteId, searchContext.highlightedTokens);
} else {
result.contentSnippet = "";
result.attributeSnippet = "";
}
}There was a problem hiding this comment.
Done in 62084f3. contentSnippet and attributeSnippet are now explicitly set to "" for results beyond QUICK_SEARCH_SNIPPET_LIMIT. The downstream highlighter and client widget were already guarding the falsy case, but normalizing the response shape avoids the contract ambiguity.
| export const MIGRATIONS: (SqlMigration | JsMigration)[] = [ | ||
| // Add FTS5 full-text index over blob content so quick search doesn't have to | ||
| // scan every note's body at query time. Triggers keep the index in sync. | ||
| // Blobs are content-addressed (immutable): new content -> new blobId, so we |
There was a problem hiding this comment.
Looks like it will it keep blobs from all note revisions in FTS index. While search only needs/uses the latest revision blob.
There was a problem hiding this comment.
Good catch — addressed in ef126f2. The backfill now joins blobs against notes and only inserts rows for blobs reachable from a current, non-deleted text-content note, so revision-only and attachment-only blobs are skipped. The maintenance triggers also move from blobs to notes (insert when a text note appears; remove/re-insert on blobId/type/isDeleted changes; clean up on hard-delete), with the AFTER DELETE ON blobs trigger kept as a safety net for blob garbage collection. The search SQL is unchanged — it was already JOINing through notes — but the index it's matching against is now strictly the working set.
| // Skip if word is too different in length for fuzzy matching | ||
| if (Math.abs(word.length - normalizedToken.length) > maxDistance) { | ||
| continue; | ||
| // Strategy 1: whole-word fuzzy match for similar-length words ("infra" ↔ "infa", typos). |
There was a problem hiding this comment.
I think fuzzy match is independent from FTS feature, and could/should be in separate PR?
Quick search (and any *=* / ~= / ~* content query) previously scanned every note blob in JavaScript, dominating latency on large knowledge bases — multi-second on databases with several thousand text-content notes. - Add a SQLite FTS5 inverted index (notes_fts) over blob content, indexed with unicode61 + remove_diacritics for accent-insensitive matching, with 2/3-char prefix indexes for partial-word hits. - Sync the index via AFTER INSERT/DELETE triggers on blobs. Blobs are content-addressed (a content change writes a new row), so no UPDATE trigger is needed. - NoteContentFulltextExp narrows candidates through MATCH for operators FTS can express (*=*, ~=, ~*) and re-checks them with the existing JS scorer so fuzzy/normalize/operator semantics are unchanged. Operators that depend on positional anchors or negation (=, *=, =*, !=, %=) keep the legacy unfiltered scan. - Protected notes are always included as candidates so they remain searchable through findInText once the session is open. - Extend fuzzyMatchWordWithResult with prefix-aware matching: the previous whole-word check skipped words whose length differed from the token by more than MAX_EDIT_DISTANCE, so e.g. "infa" never reached "infrastructure". A second strategy now tries edit distance against a word's leading prefix. - Cap quickSearch snippet extraction to the first 15 results — the dropdown only displays a small window and snippets are the dominant per-result cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fields
- buildFtsMatchQuery now requires each token to contain at least one
unicode letter or number (\p{L}/\p{N}). Tokens like "++" or "==" used
to pass the length check but get stripped to an empty phrase by the
unicode61 tokenizer, which FTS5 then rejects with a syntax error.
Adds a regression test plus one verifying mixed alphanumeric tokens
(e.g. "v2.0") still flow through.
- quickSearch now explicitly assigns "" to contentSnippet and
attributeSnippet for results beyond QUICK_SEARCH_SNIPPET_LIMIT instead
of leaving them undefined. Downstream code already guarded against
the falsy case, but the response shape is now consistent regardless
of result position.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses two review comments on PR TriliumNext#9963: 1. The v239 backfill no longer indexes every text blob in the database. It now joins `blobs` against `notes` and keeps only blobs that are actually reachable from a current, non-deleted text-content note — blobs whose only references are historical revisions or attachments are skipped. The search SQL JOINs through `notes` anyway, so these blobs were never returned; indexing them just wasted space and slowed `MATCH`. 2. The maintenance triggers move from `blobs` (where, at INSERT time, no note yet references the row) to `notes`: insert into FTS when a text note appears, remove (and re-insert) on `blobId`/`type`/ `isDeleted` changes, and clean up on hard-delete. An `AFTER DELETE ON blobs` trigger stays as a safety net for blob garbage collection. The prefix-aware fuzzy match (text_utils.ts) is independent of the FTS work and is moved to a separate PR so each change can be reviewed on its own merits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR introduces an FTS5
Confidence Score: 4/5Generally safe to merge with awareness that the FTS path can silently miss notes whose content contains HTML-encoded characters like &. The FTS index is built over raw HTML blob bytes, but findInText operates on HTML-preprocessed text. A note containing R&D in raw HTML will not be found when searching for R&D because the trigrams of the decoded string do not appear contiguously in the encoded form — FTS excludes the note before findInText can run, silently dropping results for notes with HTML-encoded characters, which is routine in TipTap-produced content. Both migrations.ts (backfill indexes raw HTML) and note_content_fulltext.ts (query uses raw FTS results) are involved in the superset mismatch. Important Files Changed
|
Greptile flagged that `unicode61 + prefix wildcards` was not a strict superset of what `findInText` accepts for `*=*` / `*=` / `=*` / `=`: the FTS5 prefix syntax `"foo"*` only matches indexed words that *start with* "foo", so a substring query like `ello` would silently drop "hello" (whose tokenized word begins with `h`, not `e`). That made the fast path produce false negatives for the very pattern `*=*` is meant to handle. This commit: - Migration v239 now creates `notes_fts` with the `trigram` tokenizer, which indexes every contiguous 3-char window. A phrase query like `"ello"` matches every document containing the literal substring, making FTS a strict superset of substring/start/end/exact matching. The previous `prefix = '2 3'` is dropped — it was a unicode61 prefix-index option and isn't valid for trigram. - `buildFtsMatchQuery` switches operator handling accordingly: it now emits FTS queries for `*=*` / `=` / `*=` / `=*` (all substring supersets) and returns `null` for `~=` / `~*` (typos can produce zero trigram overlap with the target) plus `!=` / `%=` as before. Query syntax drops the `*` wildcard (trigram doesn't accept it) and raises the minimum token length to 3 codepoints — anything shorter has no trigram representation in the index. - The diacritic-folding behaviour from `remove_diacritics 1` is lost by switching tokenizers. Title and attribute matches still flow through `NoteFlatTextExp`, which normalizes diacritics in JS, so the practical impact is limited to body-content searches where the user types a non-diacritic form of an accented word; that case can be layered back on later via a pre-normalized indexed column. - Fixes the misleading `notes_fts_after_note_insert` comment that claimed UPDATE-style scenarios (blobId/type/isDeleted change, restoration from soft-delete) were handled there. They actually fire `notes_fts_after_note_update`; the INSERT trigger only runs for genuinely new note rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Quick search (and any
*=*/*=/=*/=content query) used to scanevery note's blob in JavaScript, dominating latency on large knowledge bases
— multi-second response times on databases with several thousand text-content
notes. This PR introduces a SQLite FTS5 inverted index over blob content,
dropping quick search latency by roughly two orders of magnitude while
maintaining strict-superset correctness for the substring/start/end/exact
operators.
What changed
notes_ftsvirtual table (migration v239) indexes blob content withFTS5's
trigramtokenizer, which records every contiguous 3-codepointwindow. A phrase query like
"ello"then matches any document containingthe literal substring, making the FTS candidate set a strict superset of
what
findInTextaccepts for*=*/*=/=*/=— no silent drops.(The earlier
unicode61+ prefix-wildcard approach only matchedword-start occurrences, so e.g.
ellowould have missedhello. Thetrigram switch addresses that review finding.)
The index is scoped to blobs that are currently referenced by a
non-deleted text-content note — blobs that only exist because they back
a historical revision or an attachment are skipped, since the search SQL
JOINs through
notesand would never return them anyway. Maintenancetriggers fire on
notes(insert when a new text note appears; remove +re-insert when its
blobId/type/isDeletedchange; clean up onhard-delete), with an
AFTER DELETE ON blobstrigger as a safety net forblob garbage collection.
NoteContentFulltextExpnarrows candidates vianotes_fts MATCHforoperators where trigram is a strict superset (
*=*,=,*=,=*). Theexisting JS
findInTextre-checks each candidate so the operators' exactpositional / boundary semantics are preserved.
~=/~*fall back tothe legacy scan because typos can produce zero trigram overlap with the
target.
!=/%=continue to scan as before.quickSearchroute caps snippet extraction at 15 results — thedropdown only ever displays a small window, and snippets are the
dominant per-result cost. Results beyond the snippet limit get
empty-string snippets so the API response shape stays uniform.
Trade-off: diacritic folding
The trigram tokenizer does not ship with FTS5
remove_diacriticsfolding,so body-content searches where the user types the unaccented form of an
accented word (
duonglooking forđường) now miss via the FTS path.Title and attribute matches still go through
NoteFlatTextExp, whichnormalizes diacritics in JS, so most user queries are unaffected. A
follow-up can layer diacritic-insensitive content search back on by
maintaining a pre-normalized indexed column.
Test plan
buildFtsMatchQuery— every operator path,token length / alphanumeric filtering, and FTS5 quote-escaping.
pnpm --filter server test searchsuite passes (348 tests).pnpm typecheckclean.web-clipped notes: quick search returns results in ~200–300 ms,
substring queries surface body matches the prefix-only candidates
would have dropped, punctuation-only queries fall back to the
legacy scan rather than raising
fts5: syntax error.Migration / upgrade notes
after the automatic pre-migration backup. The backfill cost is one-time
and scales with the current working set rather than total blob count.
better-sqlite312.x, no native dependency change.Performance
Informal measurements on a large knowledge base of mixed text and
web-clipped notes:
Exact numbers depend on database size and content shape, but the order
of magnitude is consistent across the test databases checked.