Skip to content

fix(rag): reuse per-file embeddings during indexing#1306

Open
fallintoplace wants to merge 3 commits into
amd:mainfrom
fallintoplace:fix/rag-embedding-reuse-1305
Open

fix(rag): reuse per-file embeddings during indexing#1306
fallintoplace wants to merge 3 commits into
amd:mainfrom
fallintoplace:fix/rag-embedding-reuse-1305

Conversation

@fallintoplace
Copy link
Copy Markdown

Summary

  • stop rebuilding the global RAG embeddings from old_chunks + new_chunks every time a document is added
  • reuse the same per-file embedding pass for both the global FAISS index and the per-file FAISS index
  • add regressions that prove each indexed file is embedded once on both the fresh-index and cache-load paths

Why

The old indexing path did two expensive things in the normal multi-document flow:

  1. it rebuilt embeddings for all previously indexed chunks whenever a new document was added
  2. it then embedded the new document again to build the per-file index

The cache-load path had the same shape of duplication.

That made indexing cost grow much faster than it should as more documents were added, and it was directly visible in the code path behind #1305.

This change keeps the behavior the same from the outside, but switches the steady-state path to:

  • encode the new file once
  • append those embeddings to the existing global FAISS index
  • reuse that same embedding array for the per-file FAISS index

Closes #1305.

Testing

  • uv run pytest tests/test_rag.py -q -k 'indexing_two_documents_embeds_each_file_once or cache_load_embeds_each_cached_file_once or cache_load_tracks_access_times or cache_functionality or document_indexing'
  • uv run pytest tests/test_rag.py -q -k 'query_uses_consistent_snapshot_during_state_swap or remove_document_preserves_state_when_rebuild_fails'
  • uv run pytest tests/test_rag.py -q
  • uv run python -m compileall src/gaia/rag/sdk.py tests/test_rag.py

@fallintoplace fallintoplace marked this pull request as ready for review May 30, 2026 11:37
@github-actions github-actions Bot added rag RAG system changes tests Test changes performance Performance-critical changes labels May 30, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Solid fix with the right invariant verified by the right tests. The algorithm is correct and the optimization is meaningful — the steady-state multi-document path now encodes each file exactly once instead of re-encoding the entire accumulated corpus on every addition.

Summary

The change removes two sources of redundant embedding work: (1) rebuilding embeddings for all previous chunks every time a new document is added, and (2) encoding the new file a second time to build the per-file index. Both the fresh-index and cache-load paths are updated consistently, and two regression tests nail the exact invariant. No security concerns. One correctness subtlety and a pair of style nits worth addressing.

Issues Found

🟡 Misleading variable name conceals in-place mutation (sdk.py:2151, sdk.py:2321)

In the else branch (global index already exists), both the cache-load and fresh-indexing paths do:

rebuilt_index = self.index          # alias, not a copy
...
if self.index is not None:
    rebuilt_index.add(file_embeddings.astype("float32"))  # mutates self.index in place
self.index = rebuilt_index           # no-op — same object

rebuilt_index reads like a new object but is a reference to self.index. The self.index = rebuilt_index line becomes a silent no-op, which obscures intent and could mislead a future reader into thinking it guards against a failed build. Because FAISS IndexFlatL2.add is incremental and the whole block runs under _state_lock, the mutation is thread-safe — this is a readability concern, not a correctness bug. Suggest renaming to make the append-in-place semantics explicit:

                        self.index.add(file_embeddings.astype("float32"))

(Remove rebuilt_index = self.index and self.index = rebuilt_index in the else branch; keep rebuilt_index only for the if self.index is None branch where it really is a new object.)

🟢 Broad except Exception on _create_faiss_index in cache path (sdk.py:2155)

CLAUDE.md requires that touching a file with a pre-existing broad-except means fixing it in the same commit. The old block had the same pattern, and the new code carries it forward. _create_faiss_index raises only two distinct failure modes — dimension ≤ 0 (FAISS ValueError) or a malformed array (likely TypeError). Tightening the catch also prevents masking unexpected regressions:

                    except (ValueError, RuntimeError) as _e:

The file_index = None fallback with debug logging is otherwise acceptable degraded behaviour here (the global index still works; per-file search falls back to slower paths).

🟢 Multi-line comments should be single-line (sdk.py:2119, sdk.py:2286)

CLAUDE.md: "Keep WHY comments to one short line." Two new comment blocks are 2–3 lines. Examples:

                    # Encode once; reuse for both the global and per-file index.
                # Encode once; reuse for both the global and per-file index.

Strengths

  • Tests target the exact invariant. Both test_indexing_two_documents_embeds_each_file_once and test_cache_load_embeds_each_cached_file_once intercept _encode_texts at the right level and assert the call list explicitly — this is a clean way to prove an embedding-count guarantee without coupling to FAISS internals.
  • _create_faiss_index extraction is reusable and clean. Factoring the repeated three-line FAISS constructor + add into _create_faiss_index eliminates duplication across the fresh, cache, and per-file paths in a single logical unit.
  • Backward-compatible signature extension. Adding return_embeddings=False to _create_vector_index keeps the remove_document caller at line 1736 working without any change.

Verdict

Approve with suggestions. No blocking issues. The 🟡 variable naming nit is worth cleaning up before merge since it's a trap for the next person to read this path; the other two are minor. The algorithm, thread safety, and test coverage are all solid.

@github-actions
Copy link
Copy Markdown
Contributor

The embedding work for multi-document indexing drops from O(total chunks) to O(new chunks) per addition — a real win that's already visible in the #1305 complaint. The implementation is correct and the two new tests give precise regression coverage for both indexing paths.


Issues

🟡 Broad except Exception carried forward without fix (sdk.py:2152)

CLAUDE.md requires fixing except Exception: pass-style blocks when you touch the file that contains them. The try/except around _create_faiss_index in the cache-load path was reorganised by this PR but not tightened:

try:
    file_index = self._create_faiss_index(file_embeddings)
except Exception as _e:  # pylint: disable=broad-except
    self.log.debug(
        "Couldn't pre-build per-file index for %s: %s",
        file_path,
        _e,
    )
    file_index = None

_create_faiss_index only raises if embeddings is the wrong dtype/shape — a ValueError from numpy or a FAISS C-extension error. Catching Exception broadly also silently swallows memory errors, keyboard interrupts forwarded as exceptions, etc. Narrowing and upgrading the log level makes the fallback explicit:

        try:
            file_index = self._create_faiss_index(file_embeddings)
        except (ValueError, RuntimeError) as _e:
            self.log.warning(
                "Couldn't pre-build per-file index for %s: %s — retrieval will rebuild on demand",
                file_path,
                _e,
            )
            file_index = None

The same fix applies at the symmetric location in the fresh-index path — but the fresh-index path has no try/except at all (line 2322), so if _create_faiss_index ever fails there it propagates as an unhandled exception while the cache-load path silently degrades. If the intent is that per-file index failure should never abort indexing, the fresh-index path needs the same guard; if the intent is "fail loudly on fresh index", the cache-load path should re-raise too. Pick one and apply it consistently.


🟢 _create_faiss_index is missing a return type annotation (sdk.py:1603)

All nearby private helpers have return type annotations. The new method is missing one:

    def _create_faiss_index(self, embeddings: "np.ndarray") -> "faiss.IndexFlatL2":
        """Build a FAISS L2 index from precomputed embeddings."""

🟢 _create_vector_index return annotation is now stale (sdk.py:1612)

The signature says -> tuple but the overloaded return is now tuple[faiss.IndexFlatL2, list[str], np.ndarray] | tuple[faiss.IndexFlatL2, list[str]]. Even a minimal improvement is better than the bare tuple:

    def _create_vector_index(
        self, chunks: List[str], return_embeddings: bool = False
    ) -> "tuple[faiss.IndexFlatL2, list[str]] | tuple[faiss.IndexFlatL2, list[str], np.ndarray]":

Strengths

  • The core correctness is solid. FAISS IndexFlatL2.add() preserves existing vector IDs, so the appended vectors stay aligned with self.chunks = old_chunks + new_chunks. The per-file FAISS index stores only the new file's embeddings, so local index lookups in _retrieve_chunks_from_file remain correct.
  • _create_faiss_index extraction removes three copy-pasted blocks in a file that already had too many. Clean and worth doing on its own.
  • The two new tests are well-targeted. Patching _encode_texts and asserting on encode_calls is exactly the right level of abstraction — verifies the invariant without coupling to FAISS internals. The seeded-cache test correctly uses AssertionError on _extract_text_from_file to make the cache-hit assertion self-proving.

Verdict

Approve with suggestions. The performance win is real, the logic is correct, and the tests are solid. The 🟡 issue (error-handling asymmetry / broad-except) should be addressed before merge — it's a short fix and CLAUDE.md explicitly requires it when touching a file that already has a broad-except block. The two 🟢 items can go in the same pass.

@itomek itomek self-assigned this Jun 1, 2026
@itomek
Copy link
Copy Markdown
Collaborator

itomek commented Jun 1, 2026

Thanks @fallintoplace — this is a clean, well-targeted fix for #1305, and the two new regression tests are the right way to lock it in: asserting encode_calls == [["doc1 chunk"], ["doc2 chunk"]] proves the O(N) → O(1-per-file) claim by counting embedding passes directly, rather than just checking that indexing succeeds. Verified locally: black + isort clean, 42/42 tests pass including test_indexing_two_documents_embeds_each_file_once and test_cache_load_embeds_each_cached_file_once. Heads up that this is a fork PR, so the full pytest/lint suite did not run in CI until now — the verification above is from a local run of the PR branch, and CI has now completed green.

I'm not approving yet — there is one cleanup I'd like resolved before merge. I'm deliberately not restating the items the two automated reviews already covered; the notes below are only what they missed or got stale on.

Resolved — the stale rebuilt_index alias comment can be ignored. The first bot comment flagged a misleading rebuilt_index alias. The second commit removed it entirely: both else branches now call self.index.add(file_embeddings.astype("float32")) directly. No action needed — just flagging so the earlier bot comment doesn't send you looking for something that's already fixed.

Required before merge — make the two _create_faiss_index call sites symmetric. The cache-load path wraps the call in a guard that degrades to None:

try:
    file_index = self._create_faiss_index(file_embeddings)
except Exception as _e:  # pylint: disable=broad-except
    self.log.debug("Couldn't pre-build per-file index for %s: %s", file_path, _e)
    file_index = None

while the fresh-index path (sdk.py:2322) calls it bare with no guard at all. Both call sites are introduced/reshaped by this PR, so this falls under CLAUDE.md's "when you touch a file with a broad-except, fix it in the same commit" rule. The same _create_faiss_index failure currently produces two different outcomes depending on path taken: silent file_index = None on cache load, unhandled propagation on fresh index.

My recommendation is to resolve this by removing the broad guard, not by adding a second one — that matches the fail-loudly rule. If _create_faiss_index can fail in a way you genuinely want to tolerate, catch that specific condition explicitly; otherwise let it raise on both paths so a real FAISS failure surfaces instead of silently dropping the per-file index on one path only.

Once the call sites are symmetric, this is good to merge from my side.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

The optimization is correct and safe — IndexFlatL2 supports incremental .add() calls with identical query results to a full rebuild, so appending new vectors instead of re-encoding all prior chunks is semantically equivalent. The regression tests are well-targeted and directly verify the embedding-count invariant that matters.


Issues

🟢 Minor — new_index is unbound in the else branch (sdk.py ~line 2130 & ~line 2325)

Both the cache-load path and the new-file path declare new_index only inside if self.index is None:, then reference it again in a second if self.index is None: check below. Python won't raise at runtime because the two conditions are correlated, but pylint / mypy will flag new_index as potentially undefined — and a reader has to carry that correlation in their head.

One clean fix: collapse the two if self.index is None checks into one by moving the self.index assignment directly into each branch:

                if self.index is None:
                    new_index, rebuilt_chunks, file_embeddings = (
                        self._create_vector_index(
                            list(cached_chunks), return_embeddings=True
                        )
                    )
                    file_chunk_indices = list(range(len(cached_chunks)))
                    rebuilt_chunk_to_file = {
                        idx: file_path for idx in file_chunk_indices
                    }
                    file_index = self._create_faiss_index(file_embeddings)
                    self.index = new_index
                else:

(Same pattern for the new-file path.) This removes the second if self.index is None: block entirely and makes the control flow obvious.

🟢 Minor — return_embeddings=False default is never exercised (sdk.py:1612)

Every call site now passes return_embeddings=True, so the False default is dead code within the file. Since _create_vector_index is a private method, no external caller can use the default. Consider dropping the parameter and always returning the triple — or at minimum add a brief docstring note explaining the default exists for backward compatibility with any unknown callers.

🟢 Minor — seed phase of test_cache_load_embeds_each_cached_file_once silently relies on mock_dependencies for encoding (test_rag.py ~line 294)

The seed block patches _extract_text_from_file and _split_text_into_chunks but not _encode_texts. It works correctly because mock_dependencies mocks SentenceTransformer with mock_embedder.encode.return_value = np.array([[0.1, 0.2, 0.3, 0.4]]) — but a reader has to trace the fixture to know that. A one-line comment on the seed block would make the intent explicit:

            with (
                patch.object(
                    seed_rag,
                    "_extract_text_from_file",
                    side_effect=[("doc1 text", {}), ("doc2 text", {})],
                ),
                patch.object(
                    seed_rag,
                    "_split_text_into_chunks",
                    side_effect=[["cached doc1 chunk"], ["cached doc2 chunk"]],
                ),
                # _encode_texts uses mock_dependencies' SentenceTransformer mock — no patch needed here
            ):

Strengths

  • Correct optimization: IndexFlatL2.add() is commutative — appending vectors produces the same exhaustive search results as a full rebuild. The change is safe given the current index type.
  • Silent-failure removal: The old try/except Exception as _e: log.debug(...) block that silently ate per-file index errors is gone, replaced by direct calls that will raise loudly. This is exactly what CLAUDE.md asks for.
  • Clean helper extraction: _create_faiss_index eliminates three copies of the IndexFlatL2 / .add() boilerplate and gives the operation a name.
  • Targeted regression tests: Both new tests spy on _encode_texts call arguments rather than just checking success, so they'll catch a regression back to double-encoding rather than just confirming the code path ran.

Verdict

Approve. No blocking issues. The two structural nits (collapsed if self.index is None and dead default) are optional cleanup — ship as-is or address in a follow-up.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 1, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3ed27c3). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/gaia/rag/sdk.py 0.00% 13 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1306   +/-   ##
=======================================
  Coverage        ?   39.91%           
=======================================
  Files           ?      308           
  Lines           ?    52319           
  Branches        ?        0           
=======================================
  Hits            ?    20881           
  Misses          ?    31438           
  Partials        ?        0           
Flag Coverage Δ
unit 39.91% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance-critical changes rag RAG system changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: RAG indexing rebuilds old embeddings on every document add and re-embeds the new file

4 participants