fix(rag): reuse per-file embeddings during indexing by fallintoplace · Pull Request #1306 · amd/gaia

fallintoplace · 2026-05-30T11:37:13Z

Summary

stop rebuilding the global RAG embeddings from old_chunks + new_chunks every time a document is added
reuse the same per-file embedding pass for both the global FAISS index and the per-file FAISS index
add regressions that prove each indexed file is embedded once on both the fresh-index and cache-load paths

Why

The old indexing path did two expensive things in the normal multi-document flow:

it rebuilt embeddings for all previously indexed chunks whenever a new document was added
it then embedded the new document again to build the per-file index

The cache-load path had the same shape of duplication.

That made indexing cost grow much faster than it should as more documents were added, and it was directly visible in the code path behind #1305.

This change keeps the behavior the same from the outside, but switches the steady-state path to:

encode the new file once
append those embeddings to the existing global FAISS index
reuse that same embedding array for the per-file FAISS index

Closes #1305.

Testing

uv run pytest tests/test_rag.py -q -k 'indexing_two_documents_embeds_each_file_once or cache_load_embeds_each_cached_file_once or cache_load_tracks_access_times or cache_functionality or document_indexing'
uv run pytest tests/test_rag.py -q -k 'query_uses_consistent_snapshot_during_state_swap or remove_document_preserves_state_when_rebuild_fails'
uv run pytest tests/test_rag.py -q
uv run python -m compileall src/gaia/rag/sdk.py tests/test_rag.py

github-actions · 2026-05-30T11:42:18Z

Solid fix with the right invariant verified by the right tests. The algorithm is correct and the optimization is meaningful — the steady-state multi-document path now encodes each file exactly once instead of re-encoding the entire accumulated corpus on every addition.

Summary

The change removes two sources of redundant embedding work: (1) rebuilding embeddings for all previous chunks every time a new document is added, and (2) encoding the new file a second time to build the per-file index. Both the fresh-index and cache-load paths are updated consistently, and two regression tests nail the exact invariant. No security concerns. One correctness subtlety and a pair of style nits worth addressing.

Issues Found

🟡 Misleading variable name conceals in-place mutation (`sdk.py:2151`, `sdk.py:2321`)

In the else branch (global index already exists), both the cache-load and fresh-indexing paths do:

rebuilt_index = self.index          # alias, not a copy
...
if self.index is not None:
    rebuilt_index.add(file_embeddings.astype("float32"))  # mutates self.index in place
self.index = rebuilt_index           # no-op — same object

rebuilt_index reads like a new object but is a reference to self.index. The self.index = rebuilt_index line becomes a silent no-op, which obscures intent and could mislead a future reader into thinking it guards against a failed build. Because FAISS IndexFlatL2.add is incremental and the whole block runs under _state_lock, the mutation is thread-safe — this is a readability concern, not a correctness bug. Suggest renaming to make the append-in-place semantics explicit:

                        self.index.add(file_embeddings.astype("float32"))

(Remove rebuilt_index = self.index and self.index = rebuilt_index in the else branch; keep rebuilt_index only for the if self.index is None branch where it really is a new object.)

🟢 Broad `except Exception` on `_create_faiss_index` in cache path (`sdk.py:2155`)

CLAUDE.md requires that touching a file with a pre-existing broad-except means fixing it in the same commit. The old block had the same pattern, and the new code carries it forward. _create_faiss_index raises only two distinct failure modes — dimension ≤ 0 (FAISS ValueError) or a malformed array (likely TypeError). Tightening the catch also prevents masking unexpected regressions:

                    except (ValueError, RuntimeError) as _e:

The file_index = None fallback with debug logging is otherwise acceptable degraded behaviour here (the global index still works; per-file search falls back to slower paths).

🟢 Multi-line comments should be single-line (`sdk.py:2119`, `sdk.py:2286`)

CLAUDE.md: "Keep WHY comments to one short line." Two new comment blocks are 2–3 lines. Examples:

                    # Encode once; reuse for both the global and per-file index.

                # Encode once; reuse for both the global and per-file index.

Strengths

Tests target the exact invariant. Both test_indexing_two_documents_embeds_each_file_once and test_cache_load_embeds_each_cached_file_once intercept _encode_texts at the right level and assert the call list explicitly — this is a clean way to prove an embedding-count guarantee without coupling to FAISS internals.
_create_faiss_index extraction is reusable and clean. Factoring the repeated three-line FAISS constructor + add into _create_faiss_index eliminates duplication across the fresh, cache, and per-file paths in a single logical unit.
Backward-compatible signature extension. Adding return_embeddings=False to _create_vector_index keeps the remove_document caller at line 1736 working without any change.

Verdict

Approve with suggestions. No blocking issues. The 🟡 variable naming nit is worth cleaning up before merge since it's a trap for the next person to read this path; the other two are minor. The algorithm, thread safety, and test coverage are all solid.

github-actions · 2026-05-30T12:03:27Z

The embedding work for multi-document indexing drops from O(total chunks) to O(new chunks) per addition — a real win that's already visible in the #1305 complaint. The implementation is correct and the two new tests give precise regression coverage for both indexing paths.

Issues

🟡 Broad `except Exception` carried forward without fix (`sdk.py:2152`)

CLAUDE.md requires fixing except Exception: pass-style blocks when you touch the file that contains them. The try/except around _create_faiss_index in the cache-load path was reorganised by this PR but not tightened:

try:
    file_index = self._create_faiss_index(file_embeddings)
except Exception as _e:  # pylint: disable=broad-except
    self.log.debug(
        "Couldn't pre-build per-file index for %s: %s",
        file_path,
        _e,
    )
    file_index = None

_create_faiss_index only raises if embeddings is the wrong dtype/shape — a ValueError from numpy or a FAISS C-extension error. Catching Exception broadly also silently swallows memory errors, keyboard interrupts forwarded as exceptions, etc. Narrowing and upgrading the log level makes the fallback explicit:

        try:
            file_index = self._create_faiss_index(file_embeddings)
        except (ValueError, RuntimeError) as _e:
            self.log.warning(
                "Couldn't pre-build per-file index for %s: %s — retrieval will rebuild on demand",
                file_path,
                _e,
            )
            file_index = None

The same fix applies at the symmetric location in the fresh-index path — but the fresh-index path has no try/except at all (line 2322), so if _create_faiss_index ever fails there it propagates as an unhandled exception while the cache-load path silently degrades. If the intent is that per-file index failure should never abort indexing, the fresh-index path needs the same guard; if the intent is "fail loudly on fresh index", the cache-load path should re-raise too. Pick one and apply it consistently.

🟢 `_create_faiss_index` is missing a return type annotation (`sdk.py:1603`)

All nearby private helpers have return type annotations. The new method is missing one:

    def _create_faiss_index(self, embeddings: "np.ndarray") -> "faiss.IndexFlatL2":
        """Build a FAISS L2 index from precomputed embeddings."""

🟢 `_create_vector_index` return annotation is now stale (`sdk.py:1612`)

The signature says -> tuple but the overloaded return is now tuple[faiss.IndexFlatL2, list[str], np.ndarray] | tuple[faiss.IndexFlatL2, list[str]]. Even a minimal improvement is better than the bare tuple:

    def _create_vector_index(
        self, chunks: List[str], return_embeddings: bool = False
    ) -> "tuple[faiss.IndexFlatL2, list[str]] | tuple[faiss.IndexFlatL2, list[str], np.ndarray]":

Strengths

The core correctness is solid. FAISS IndexFlatL2.add() preserves existing vector IDs, so the appended vectors stay aligned with self.chunks = old_chunks + new_chunks. The per-file FAISS index stores only the new file's embeddings, so local index lookups in _retrieve_chunks_from_file remain correct.
_create_faiss_index extraction removes three copy-pasted blocks in a file that already had too many. Clean and worth doing on its own.
The two new tests are well-targeted. Patching _encode_texts and asserting on encode_calls is exactly the right level of abstraction — verifies the invariant without coupling to FAISS internals. The seeded-cache test correctly uses AssertionError on _extract_text_from_file to make the cache-hit assertion self-proving.

Verdict

Approve with suggestions. The performance win is real, the logic is correct, and the tests are solid. The 🟡 issue (error-handling asymmetry / broad-except) should be addressed before merge — it's a short fix and CLAUDE.md explicitly requires it when touching a file that already has a broad-except block. The two 🟢 items can go in the same pass.

fix(rag): reuse per-file embeddings during indexing

bc8b6b6

fallintoplace marked this pull request as ready for review May 30, 2026 11:37

fallintoplace requested a review from kovtcharov-amd as a code owner May 30, 2026 11:37

github-actions Bot added rag RAG system changes tests Test changes performance Performance-critical changes labels May 30, 2026

refactor(rag): clarify incremental index updates

7245138

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rag): reuse per-file embeddings during indexing#1306

fix(rag): reuse per-file embeddings during indexing#1306
fallintoplace wants to merge 2 commits into
amd:mainfrom
fallintoplace:fix/rag-embedding-reuse-1305

fallintoplace commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fallintoplace commented May 30, 2026

Summary

Why

Testing

Uh oh!

github-actions Bot commented May 30, 2026

Summary

Issues Found

🟡 Misleading variable name conceals in-place mutation (sdk.py:2151, sdk.py:2321)

🟢 Broad except Exception on _create_faiss_index in cache path (sdk.py:2155)

🟢 Multi-line comments should be single-line (sdk.py:2119, sdk.py:2286)

Strengths

Verdict

Uh oh!

github-actions Bot commented May 30, 2026

Issues

🟡 Broad except Exception carried forward without fix (sdk.py:2152)

🟢 _create_faiss_index is missing a return type annotation (sdk.py:1603)

🟢 _create_vector_index return annotation is now stale (sdk.py:1612)

Strengths

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🟡 Misleading variable name conceals in-place mutation (`sdk.py:2151`, `sdk.py:2321`)

🟢 Broad `except Exception` on `_create_faiss_index` in cache path (`sdk.py:2155`)

🟢 Multi-line comments should be single-line (`sdk.py:2119`, `sdk.py:2286`)

🟡 Broad `except Exception` carried forward without fix (`sdk.py:2152`)

🟢 `_create_faiss_index` is missing a return type annotation (`sdk.py:1603`)

🟢 `_create_vector_index` return annotation is now stale (`sdk.py:1612`)