Skip to content

fix(rag): reuse per-file embeddings during indexing#1306

Open
fallintoplace wants to merge 2 commits into
amd:mainfrom
fallintoplace:fix/rag-embedding-reuse-1305
Open

fix(rag): reuse per-file embeddings during indexing#1306
fallintoplace wants to merge 2 commits into
amd:mainfrom
fallintoplace:fix/rag-embedding-reuse-1305

Conversation

@fallintoplace
Copy link
Copy Markdown

Summary

  • stop rebuilding the global RAG embeddings from old_chunks + new_chunks every time a document is added
  • reuse the same per-file embedding pass for both the global FAISS index and the per-file FAISS index
  • add regressions that prove each indexed file is embedded once on both the fresh-index and cache-load paths

Why

The old indexing path did two expensive things in the normal multi-document flow:

  1. it rebuilt embeddings for all previously indexed chunks whenever a new document was added
  2. it then embedded the new document again to build the per-file index

The cache-load path had the same shape of duplication.

That made indexing cost grow much faster than it should as more documents were added, and it was directly visible in the code path behind #1305.

This change keeps the behavior the same from the outside, but switches the steady-state path to:

  • encode the new file once
  • append those embeddings to the existing global FAISS index
  • reuse that same embedding array for the per-file FAISS index

Closes #1305.

Testing

  • uv run pytest tests/test_rag.py -q -k 'indexing_two_documents_embeds_each_file_once or cache_load_embeds_each_cached_file_once or cache_load_tracks_access_times or cache_functionality or document_indexing'
  • uv run pytest tests/test_rag.py -q -k 'query_uses_consistent_snapshot_during_state_swap or remove_document_preserves_state_when_rebuild_fails'
  • uv run pytest tests/test_rag.py -q
  • uv run python -m compileall src/gaia/rag/sdk.py tests/test_rag.py

@fallintoplace fallintoplace marked this pull request as ready for review May 30, 2026 11:37
@github-actions github-actions Bot added rag RAG system changes tests Test changes performance Performance-critical changes labels May 30, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Solid fix with the right invariant verified by the right tests. The algorithm is correct and the optimization is meaningful — the steady-state multi-document path now encodes each file exactly once instead of re-encoding the entire accumulated corpus on every addition.

Summary

The change removes two sources of redundant embedding work: (1) rebuilding embeddings for all previous chunks every time a new document is added, and (2) encoding the new file a second time to build the per-file index. Both the fresh-index and cache-load paths are updated consistently, and two regression tests nail the exact invariant. No security concerns. One correctness subtlety and a pair of style nits worth addressing.

Issues Found

🟡 Misleading variable name conceals in-place mutation (sdk.py:2151, sdk.py:2321)

In the else branch (global index already exists), both the cache-load and fresh-indexing paths do:

rebuilt_index = self.index          # alias, not a copy
...
if self.index is not None:
    rebuilt_index.add(file_embeddings.astype("float32"))  # mutates self.index in place
self.index = rebuilt_index           # no-op — same object

rebuilt_index reads like a new object but is a reference to self.index. The self.index = rebuilt_index line becomes a silent no-op, which obscures intent and could mislead a future reader into thinking it guards against a failed build. Because FAISS IndexFlatL2.add is incremental and the whole block runs under _state_lock, the mutation is thread-safe — this is a readability concern, not a correctness bug. Suggest renaming to make the append-in-place semantics explicit:

                        self.index.add(file_embeddings.astype("float32"))

(Remove rebuilt_index = self.index and self.index = rebuilt_index in the else branch; keep rebuilt_index only for the if self.index is None branch where it really is a new object.)

🟢 Broad except Exception on _create_faiss_index in cache path (sdk.py:2155)

CLAUDE.md requires that touching a file with a pre-existing broad-except means fixing it in the same commit. The old block had the same pattern, and the new code carries it forward. _create_faiss_index raises only two distinct failure modes — dimension ≤ 0 (FAISS ValueError) or a malformed array (likely TypeError). Tightening the catch also prevents masking unexpected regressions:

                    except (ValueError, RuntimeError) as _e:

The file_index = None fallback with debug logging is otherwise acceptable degraded behaviour here (the global index still works; per-file search falls back to slower paths).

🟢 Multi-line comments should be single-line (sdk.py:2119, sdk.py:2286)

CLAUDE.md: "Keep WHY comments to one short line." Two new comment blocks are 2–3 lines. Examples:

                    # Encode once; reuse for both the global and per-file index.
                # Encode once; reuse for both the global and per-file index.

Strengths

  • Tests target the exact invariant. Both test_indexing_two_documents_embeds_each_file_once and test_cache_load_embeds_each_cached_file_once intercept _encode_texts at the right level and assert the call list explicitly — this is a clean way to prove an embedding-count guarantee without coupling to FAISS internals.
  • _create_faiss_index extraction is reusable and clean. Factoring the repeated three-line FAISS constructor + add into _create_faiss_index eliminates duplication across the fresh, cache, and per-file paths in a single logical unit.
  • Backward-compatible signature extension. Adding return_embeddings=False to _create_vector_index keeps the remove_document caller at line 1736 working without any change.

Verdict

Approve with suggestions. No blocking issues. The 🟡 variable naming nit is worth cleaning up before merge since it's a trap for the next person to read this path; the other two are minor. The algorithm, thread safety, and test coverage are all solid.

@github-actions
Copy link
Copy Markdown
Contributor

The embedding work for multi-document indexing drops from O(total chunks) to O(new chunks) per addition — a real win that's already visible in the #1305 complaint. The implementation is correct and the two new tests give precise regression coverage for both indexing paths.


Issues

🟡 Broad except Exception carried forward without fix (sdk.py:2152)

CLAUDE.md requires fixing except Exception: pass-style blocks when you touch the file that contains them. The try/except around _create_faiss_index in the cache-load path was reorganised by this PR but not tightened:

try:
    file_index = self._create_faiss_index(file_embeddings)
except Exception as _e:  # pylint: disable=broad-except
    self.log.debug(
        "Couldn't pre-build per-file index for %s: %s",
        file_path,
        _e,
    )
    file_index = None

_create_faiss_index only raises if embeddings is the wrong dtype/shape — a ValueError from numpy or a FAISS C-extension error. Catching Exception broadly also silently swallows memory errors, keyboard interrupts forwarded as exceptions, etc. Narrowing and upgrading the log level makes the fallback explicit:

        try:
            file_index = self._create_faiss_index(file_embeddings)
        except (ValueError, RuntimeError) as _e:
            self.log.warning(
                "Couldn't pre-build per-file index for %s: %s — retrieval will rebuild on demand",
                file_path,
                _e,
            )
            file_index = None

The same fix applies at the symmetric location in the fresh-index path — but the fresh-index path has no try/except at all (line 2322), so if _create_faiss_index ever fails there it propagates as an unhandled exception while the cache-load path silently degrades. If the intent is that per-file index failure should never abort indexing, the fresh-index path needs the same guard; if the intent is "fail loudly on fresh index", the cache-load path should re-raise too. Pick one and apply it consistently.


🟢 _create_faiss_index is missing a return type annotation (sdk.py:1603)

All nearby private helpers have return type annotations. The new method is missing one:

    def _create_faiss_index(self, embeddings: "np.ndarray") -> "faiss.IndexFlatL2":
        """Build a FAISS L2 index from precomputed embeddings."""

🟢 _create_vector_index return annotation is now stale (sdk.py:1612)

The signature says -> tuple but the overloaded return is now tuple[faiss.IndexFlatL2, list[str], np.ndarray] | tuple[faiss.IndexFlatL2, list[str]]. Even a minimal improvement is better than the bare tuple:

    def _create_vector_index(
        self, chunks: List[str], return_embeddings: bool = False
    ) -> "tuple[faiss.IndexFlatL2, list[str]] | tuple[faiss.IndexFlatL2, list[str], np.ndarray]":

Strengths

  • The core correctness is solid. FAISS IndexFlatL2.add() preserves existing vector IDs, so the appended vectors stay aligned with self.chunks = old_chunks + new_chunks. The per-file FAISS index stores only the new file's embeddings, so local index lookups in _retrieve_chunks_from_file remain correct.
  • _create_faiss_index extraction removes three copy-pasted blocks in a file that already had too many. Clean and worth doing on its own.
  • The two new tests are well-targeted. Patching _encode_texts and asserting on encode_calls is exactly the right level of abstraction — verifies the invariant without coupling to FAISS internals. The seeded-cache test correctly uses AssertionError on _extract_text_from_file to make the cache-hit assertion self-proving.

Verdict

Approve with suggestions. The performance win is real, the logic is correct, and the tests are solid. The 🟡 issue (error-handling asymmetry / broad-except) should be addressed before merge — it's a short fix and CLAUDE.md explicitly requires it when touching a file that already has a broad-except block. The two 🟢 items can go in the same pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance-critical changes rag RAG system changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: RAG indexing rebuilds old embeddings on every document add and re-embeds the new file

1 participant