feat: add Codebase Intelligence — repo map with PageRank-ranked structural summaries #543

Closed
gnanam1990 wants to merge 3 commits into main from feat/codebase-intelligence-repo-map

Collaborator

@gnanam1990 gnanam1990 commented Apr 9, 2026

Summary

  • Adds a new module that builds a structural map of the repository by parsing source files with tree-sitter, building a cross-file reference graph weighted by IDF, ranking files with PageRank, and rendering a token-budgeted summary of the most important files and their signatures
  • Registers a RepoMap tool the model can call on-demand during sessions, with support for focus_files and focus_symbols to narrow the ranking
  • Adds a /repomap slash command for users to inspect, tune, and invalidate the map
  • Wires auto-injection of the map into the session system context behind a REPO_MAP feature flag (off by default)
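
The token-budgeted rendering mentioned in the first bullet can be sketched roughly as follows. This is an illustration only, not the PR's code: `RankedFile`, `approxTokens`, and `renderWithinBudget` are hypothetical names, and the 4-characters-per-token counter is a crude stand-in for the js-tiktoken encoder the PR uses.

```typescript
// Hypothetical shapes; the PR's real types are not shown in this thread.
interface RankedFile {
  path: string
  score: number
  signatures: string[]
}

// Crude stand-in for a real tokenizer: assume roughly 4 characters per token.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Greedily append the highest-ranked files and stop as soon as the next
// file's block would push the running total past the budget.
function renderWithinBudget(
  files: RankedFile[],
  maxTokens: number,
  countTokens: (text: string) => number = approxTokens,
): string {
  const sorted = [...files].sort((a, b) => b.score - a.score)
  const blocks: string[] = []
  let used = 0
  for (const file of sorted) {
    const lines = [`${file.path}:`, ...file.signatures.map(s => `  ${s}`)]
    const block = lines.join('\n')
    const cost = countTokens(block + '\n')
    if (used + cost > maxTokens) break
    blocks.push(block)
    used += cost
  }
  return blocks.join('\n')
}
```

Stopping at the first file that does not fit keeps the output a strict prefix of the ranking; a variant could instead skip oversized files and continue with smaller ones.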

What's included

| Component | Files | Purpose |
| --- | --- | --- |
| Core module | src/context/repoMap/ (12 files) | Symbol extraction, graph building, PageRank, token-budgeted rendering, disk cache |
| Tree-sitter queries | src/context/repoMap/queries/ (3 .scm files) | Tag queries for TypeScript, JavaScript, Python (from Aider, MIT licensed) |
| Test fixtures | src/context/repoMap/__fixtures__/mini-repo/ (5 files) | 5-file TypeScript fixture with known import graph for deterministic test assertions |
| RepoMap tool | src/tools/RepoMapTool/ (4 files) | buildTool wrapper registered in src/tools.ts, read-only, concurrency-safe |
| Slash command | src/commands/repomap/ (3 files) | /repomap, --tokens, --focus, --stats, --invalidate |
| Context injection | src/context.ts | getRepoMapContext() memoized, gated behind feature('REPO_MAP') |
| Feature flag | scripts/build.ts | REPO_MAP: false (off by default) |
| Documentation | docs/repo-map.md, README.md | Full user-facing docs and README blurb |

How it works

git ls-files → tree-sitter WASM parse → extract defs/refs → IDF-weighted directed graph → PageRank → render top files with signatures → stop at token budget

Files imported by many others rank highest. Common symbol names (get, set, map, value) are down-weighted via IDF. Results are cached to disk keyed by (path, mtime, size) — only changed files are re-parsed.
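
To make the ranking step concrete, here is a minimal, dependency-free sketch of IDF-weighted edges followed by a PageRank power iteration. It is not the PR's implementation (which uses graphology and graphology-pagerank). `FileSymbols`, `buildWeightedEdges`, and `pageRank` are hypothetical names, and the IDF formula is an assumption: document frequency is taken to be the number of files that define a symbol.

```typescript
// Hypothetical input shape: the symbols each file defines and references.
interface FileSymbols {
  path: string
  defs: string[]
  refs: string[]
}

type Edges = Map<string, Map<string, number>>

// Build edges from a referencing file to each defining file, weighted by IDF.
// Assumption: document frequency = number of files that define the symbol.
function buildWeightedEdges(files: FileSymbols[]): Edges {
  const definers = new Map<string, string[]>()
  for (const f of files) {
    for (const sym of Array.from(new Set(f.defs))) {
      definers.set(sym, [...(definers.get(sym) ?? []), f.path])
    }
  }
  const edges: Edges = new Map()
  for (const f of files) {
    for (const sym of f.refs) {
      const all = definers.get(sym) ?? []
      if (all.length === 0) continue
      // Symbols defined in many files (get, set, ...) receive a smaller weight.
      const idf = Math.log(1 + files.length / all.length)
      const out = edges.get(f.path) ?? new Map<string, number>()
      for (const target of all) {
        if (target === f.path) continue // ignore self-references
        out.set(target, (out.get(target) ?? 0) + idf)
      }
      if (out.size > 0) edges.set(f.path, out)
    }
  }
  return edges
}

// Plain power-iteration PageRank over the weighted edges.
function pageRank(
  nodes: string[],
  edges: Edges,
  damping = 0.85,
  iterations = 50,
): Map<string, number> {
  const n = nodes.length
  let rank = new Map<string, number>()
  for (const p of nodes) rank.set(p, 1 / n)
  for (let i = 0; i < iterations; i++) {
    const next = new Map<string, number>()
    for (const p of nodes) next.set(p, (1 - damping) / n)
    let dangling = 0
    for (const src of nodes) {
      const r = rank.get(src) ?? 0
      const out = edges.get(src)
      if (!out) {
        dangling += r // no outgoing references
        continue
      }
      let total = 0
      out.forEach(w => { total += w })
      out.forEach((w, dst) => {
        next.set(dst, (next.get(dst) ?? 0) + damping * r * (w / total))
      })
    }
    // Spread the rank of dangling files evenly so the ranks still sum to 1.
    for (const p of nodes) next.set(p, (next.get(p) ?? 0) + (damping * dangling) / n)
    rank = next
  }
  return rank
}
```

In this sketch a file referenced by many others accumulates rank from each of them, which matches the "files imported by many others rank highest" behavior described above.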

Configuration

# Slash command (always available, no flag needed)
/repomap                        # Default 2048 token budget
/repomap --tokens 4096          # Larger map
/repomap --focus src/tools/     # Boost specific paths
/repomap --stats                # Cache info
/repomap --invalidate           # Clear cache and rebuild

# Auto-injection into session context (requires flag)
# Set REPO_MAP: true in scripts/build.ts and rebuild
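
For illustration, the flag handling above could be parsed along these lines. This is a sketch under assumptions: `RepoMapArgs` and `parseRepoMapArgs` are hypothetical names, the PR's own `parseArgs()` is not shown in this thread, and the error behavior for malformed input is invented for the example. Only the flag names and the 2048 default come from the listing above.

```typescript
// Hypothetical option shape; the PR's real parseArgs() is not shown here.
interface RepoMapArgs {
  maxTokens: number
  focus: string[]
  stats: boolean
  invalidate: boolean
}

const DEFAULT_MAX_TOKENS = 2048

// Parse the text that follows "/repomap" into structured options.
function parseRepoMapArgs(input: string): RepoMapArgs {
  const args: RepoMapArgs = {
    maxTokens: DEFAULT_MAX_TOKENS,
    focus: [],
    stats: false,
    invalidate: false,
  }
  const tokens = input.trim().split(/\s+/).filter(Boolean)
  for (let i = 0; i < tokens.length; i++) {
    const t = tokens[i]
    if (t === '--stats') {
      args.stats = true
    } else if (t === '--invalidate') {
      args.invalidate = true
    } else if (t === '--tokens') {
      const n = Number.parseInt(tokens[++i] ?? '', 10)
      if (!Number.isInteger(n) || n <= 0) {
        throw new Error('--tokens expects a positive integer')
      }
      args.maxTokens = n
    } else if (t === '--focus') {
      const path = tokens[++i]
      if (!path) throw new Error('--focus expects a path')
      args.focus.push(path)
    } else {
      throw new Error(`Unknown option: ${t}`)
    }
  }
  return args
}
```

Keeping the default in a single named constant makes it easier for docs and tests to stay aligned with the parser.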

Supported languages

TypeScript, JavaScript, Python. Additional grammars are planned for a follow-up.

Dependencies added

web-tree-sitter, tree-sitter-wasms, graphology, graphology-pagerank, graphology-operators, js-tiktoken (~80MB in node_modules)

Test plan

  • bun install — clean
  • bun test — 621 pass, 0 fail (32 new tests)
  • bun run build — success
  • bun run smoke — 0.1.8 (Open Claude)
  • Manual CLI verification of /repomap, --tokens, --focus, --stats, --invalidate
  • Manual verification of flag-on auto-injection and flag-off regression

Known limitations

  • Cold build ~25s on 2100-file repos (WASM parsing). Warm cache <100ms.
  • TypeScript query captures type refs but not function calls — ranking favors type-heavy hub files
  • Feature flag defaults to off — flip to true after internal validation
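
The gap between cold and warm builds comes from skipping unchanged files. A minimal in-memory sketch of a cache validated by the (path, mtime, size) triple might look like the following. `FileStamp` and `ParseCache` are hypothetical names, and disk persistence (which the PR's cache has) is left out for brevity.

```typescript
// Hypothetical stamp for one file: the (path, mtime, size) triple.
interface FileStamp {
  path: string
  mtimeMs: number
  size: number
}

interface CacheEntry<T> {
  stamp: FileStamp
  value: T
}

// In-memory stand-in for the on-disk cache: re-parse only when a file's
// mtime or size differs from the stamp stored at the last parse.
class ParseCache<T> {
  private entries = new Map<string, CacheEntry<T>>()

  getOrParse(stamp: FileStamp, parse: () => T): T {
    const hit = this.entries.get(stamp.path)
    if (hit && hit.stamp.mtimeMs === stamp.mtimeMs && hit.stamp.size === stamp.size) {
      return hit.value // unchanged file: skip the expensive parse
    }
    const value = parse()
    this.entries.set(stamp.path, { stamp, value })
    return value
  }

  // Drop everything, forcing a full rebuild on the next run.
  invalidate(): void {
    this.entries.clear()
  }
}
```

In practice the stamp would be filled from a file-system stat call before each lookup; an edit that preserves both mtime and size would go undetected, which is the usual trade-off of this kind of key.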

Comment thread src/context/repoMap/gitFiles.ts Fixed
Comment thread src/context/repoMap/pagerank.ts Fixed
Comment thread src/context/repoMap/repoMap.test.ts Fixed
Comment thread src/context.repoMap.test.ts Fixed
Comment thread src/context/repoMap/repoMap.test.ts Fixed
Comment thread src/tools/RepoMapTool/RepoMapTool.test.ts Fixed
Comment thread src/tools/RepoMapTool/RepoMapTool.test.ts Fixed
Comment thread src/tools/RepoMapTool/RepoMapTool.test.ts Fixed
Comment thread src/tools/RepoMapTool/RepoMapTool.test.ts Fixed
Collaborator

@Vasanthdev2004 Vasanthdev2004 left a comment

Thanks, this is a genuinely interesting addition, and I like that you kept the auto-injected path behind a flag while also adding a direct /repomap command and tool surface. The tests and green CI help here too.

I do still see one blocker on the current head though:

  • The PR documents and implies that repo-map auto-injection can be enabled with REPO_MAP=1 openclaude, but the actual gating in the open build is still compile-time via feature('REPO_MAP') from bun:bundle, and scripts/build.ts still hardcodes REPO_MAP: false. On the current head, setting the runtime env var alone will not make getRepoMapContext() start injecting anything into session context. So right now the user-facing docs and the shipped behavior disagree.

Concretely, the current surface looks like this:

  • src/context.ts only enables auto-injection when feature('REPO_MAP') is true
  • scripts/build.ts still sets REPO_MAP: false
  • docs/repo-map.md tells users to enable it with a runtime env var

I think this needs one of two fixes before approval:

  1. either wire the feature so the documented runtime enablement actually works in the open build, or
  2. narrow the docs and PR messaging so they clearly say only /repomap and the tool are available for now, and that auto-injection is not user-enableable in the current open build yet.

Once that mismatch is fixed on the current head, I’m happy to re-review.

@gnanam1990
Collaborator Author

@Vasanthdev2004 Good catch — you're right, the docs and the actual gate disagreed. Fixed in 5919dde.

getRepoMapContext now enables auto-injection when either the compile-time feature('REPO_MAP') flag is true or the runtime REPO_MAP env var is truthy. Chose option 1 (wire the runtime enablement) since it keeps the documented UX working without requiring users to edit scripts/build.ts and rebuild.

const runtimeEnabled = isEnvTruthy(process.env.REPO_MAP)
if (!feature('REPO_MAP') && !runtimeEnabled) return null

scripts/build.ts still keeps REPO_MAP: false so the compile-time default stays off, but users running the open build can now flip it on with REPO_MAP=1 openclaude as the docs advertise. Verified locally that auto-injection fires with the env var set and does nothing without it. 653 tests pass.
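
As a rough illustration of the combined gate, the sketch below shows one way the two checks could compose. The body of `isEnvTruthy` here is a guess (the project's real helper is not shown in this thread and may accept a different set of values), and `repoMapEnabled` is a hypothetical wrapper in which a boolean parameter stands in for `feature('REPO_MAP')` from `bun:bundle`.

```typescript
// Assumed set of truthy spellings; the real helper may differ.
const TRUTHY_VALUES = new Set(['1', 'true', 'yes', 'on'])

// Hypothetical implementation of an env-var truthiness check.
function isEnvTruthy(value: string | undefined): boolean {
  if (value === undefined) return false
  return TRUTHY_VALUES.has(value.trim().toLowerCase())
}

// compileTimeFlag stands in for feature('REPO_MAP'); env stands in for process.env.
function repoMapEnabled(
  compileTimeFlag: boolean,
  env: Record<string, string | undefined>,
): boolean {
  return compileTimeFlag || isEnvTruthy(env.REPO_MAP)
}
```

With this shape, the compile-time default can stay off while `REPO_MAP=1` still opts a single run in.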

@gnanam1990 gnanam1990 force-pushed the feat/codebase-intelligence-repo-map branch from 5919dde to 43886cc on April 10, 2026 08:59
Collaborator

@Vasanthdev2004 Vasanthdev2004 left a comment

Thanks for the follow-up here. I rechecked the current head 43886ccbf8ab1605ea8d948e5218bd7c5af386e9 against the actual GitHub PR surface, the latest commits, the earlier review thread, and the current check state.

This is a targeted re-review of the earlier blocker around the repo-map enablement path.

What I rechecked:

  • src/context.ts now enables repo-map auto-injection when either the compile-time feature('REPO_MAP') flag is on or the runtime REPO_MAP env var is truthy
  • scripts/build.ts still keeps the compile-time default off (REPO_MAP: false)
  • docs/repo-map.md now matches the shipped open-build behavior for runtime enablement
  • current checks are green on this head

That fixes the blocker I raised earlier. The documented REPO_MAP=1 openclaude path now actually matches the gate in the open build, instead of being a no-op.

Verdict: Approve-ready

I do not see a remaining blocker on the current head.

Vasanthdev2004 previously approved these changes Apr 10, 2026
@kevincodex1
Contributor

hello bro @gnanam1990 please fix conflicts when you can so we can merge this

Collaborator

@Vasanthdev2004 Vasanthdev2004 left a comment

Thanks for the follow-up. I rechecked the current head cf32497730ace65028d897e91ef4638a3f582306 against the earlier review state, the commits added since the earlier repo-map re-review, the current PR surface, and the current check status.

This is a targeted re-review of the stale head, not a fresh full review of the entire PR.

Verdict: Needs changes

Blocking issue:

  1. src/tools/WebSearchTool/providers/duckduckgo.ts — since the earlier repo-map re-review, the branch has picked up a new commit that changes DuckDuckGo web-search error handling. That is outside the stated scope of this PR, which is still framed as the repo-map feature, and it touches a higher-scrutiny network/tool-behavior surface. I do not want to re-approve the current head under the earlier repo-map-only approval context while that unrelated change is bundled in here.

Non-blocking notes:

  • The earlier repo-map blocker still appears fixed on the current head.
  • Current GitHub checks are green.
  • If the DDG change is split out or dropped from this branch, I would expect the repo-map PR to be back in approve-ready shape from my side.

gnanam1990 and others added 3 commits April 28, 2026 11:57
…tural summaries

Add a new module that builds a structural map of the repository by parsing
source files with tree-sitter, building a cross-file reference graph
weighted by IDF, ranking files with PageRank, and rendering a
token-budgeted summary of the most important files and their signatures.

Stage 1 — Core module (src/context/repoMap/):
  Symbol extraction via web-tree-sitter WASM, IDF-weighted reference graph
  via graphology, PageRank ranking, token-budgeted rendering via js-tiktoken
  cl100k_base, disk cache with mtime invalidation. Supports TypeScript,
  JavaScript, and Python. 10 tests.

Stage 2 — RepoMap tool (src/tools/RepoMapTool/):
  buildTool wrapper registered in src/tools.ts. Read-only, concurrency-safe.
  Supports focus_files, focus_symbols, and max_tokens parameters. 9 tests.

Stage 3 — Integration:
  Auto-injection into session context behind REPO_MAP feature flag (off by
  default). /repomap slash command with --tokens, --focus, --stats, and
  --invalidate flags. User-facing docs in docs/repo-map.md. 13 tests.

With the flag off, the system context is byte-identical to previous behavior.

Dependencies: web-tree-sitter, tree-sitter-wasms, graphology,
graphology-pagerank, graphology-operators, js-tiktoken

Tests: 32 new, 621 total passing, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses review feedback from @Vasanthdev2004: the docs advertised
REPO_MAP=1 openclaude as the enablement path, but the gate in
getRepoMapContext only checked feature('REPO_MAP'), which is compile-time
and hardcoded to false in the open build. The env var was effectively
a no-op.

Now getRepoMapContext enables auto-injection when EITHER the compile-time
flag is true OR the runtime env var REPO_MAP is truthy. This makes the
documented enablement path actually work without requiring users to edit
scripts/build.ts and rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gnanam1990 gnanam1990 force-pushed the feat/codebase-intelligence-repo-map branch from cf32497 to 8c8ec7c on April 28, 2026 06:29
@gnanam1990
Collaborator Author

@Vasanthdev2004 ready for re-review.

Addressed your April 24 feedback by rebasing the branch onto current main:

  • Dropped the unrelated DuckDuckGo error-handling commit (cf32497). It had already merged separately as PR #834, "fix(web-search): surface actionable error when DDG is rate-limited" (3c4d843), so the rebase detected it as already applied and skipped it. The branch now contains only repo-map commits:
    8c8ec7c fix: honor REPO_MAP runtime env var in addition to compile-time flag
    a366390 fix: remove unused imports and variables flagged by CodeQL
    8189661 feat: add Codebase Intelligence — repo map with PageRank-ranked structural summaries
    
  • Resolved rebase conflicts in scripts/build.ts (kept main's modern feature-flag layout + added REPO_MAP: false under a new "Disabled by default, opt-in via runtime env var" section) and src/tools.ts (kept the RepoMapTool import; dropped the TungstenTool import that main already removed).

Verified locally on Linux:

  • bun run build passes (v0.7.0)
  • bun test src/tools/RepoMapTool/ src/context/repoMap/ src/commands/repomap/ src/context.repoMap.test.ts: 32/32 pass
  • bun test (full) — 1664 pass; the 4 remaining failures (StartupScreen.test.ts, thinking.test.ts) reproduce on main and are unrelated to this PR
  • Bundled dist/cli.mjs includes isEnvTruthy(process.env.REPO_MAP) runtime gate, RepoMap tool registration, and ~/.openclaude/repomap-cache cache path

The runtime REPO_MAP=1 openclaude enablement still works as documented (your earlier blocker check confirmed this on the prior head; preserved through the rebase). Build flag stays false so users opt in explicitly.

@gnanam1990
Collaborator Author

@Vasanthdev2004 Thanks for the re-review. Quick status update on the stale-head concern:

The DDG change is no longer in this PR's diff. Current head is 8c8ec7c, and git diff origin/main...8c8ec7c is repo-map-only (35 files, all under src/context/repoMap/, src/commands/repomap/, src/tools/RepoMapTool/, plus docs/repo-map.md and a few wiring touchpoints in src/context.ts / src/tools.ts / src/commands.ts).

src/tools/WebSearchTool/providers/duckduckgo.ts now lives on main (landed separately via #834), so it shows up on the branch tip but is not part of this PR's diff anymore. Verified:

  • gh pr view 543 --json files lists no WebSearchTool paths
  • git diff origin/main...pr-543-head -- '*duckduckgo*' is empty

CI is green. Could you take another look when you have a moment?

Collaborator

@Vasanthdev2004 Vasanthdev2004 left a comment

Thanks for the rebase and cleanup. I rechecked current head 8c8ec7c.

Scope: Targeted re-review of the latest head after the stale DuckDuckGo concern.

Verdict: Needs changes

Good news: the earlier stale-head blocker is cleared. The current PR file list is repo-map-only, I do not see any WebSearchTool / DuckDuckGo paths in the diff, and GitHub checks are green.

Blocking issue:

  1. The repo-map feature appears to work from the source checkout, but it will not work correctly from the published npm package because the runtime tree-sitter query files are not shipped. src/context/repoMap/parser.ts reads src/context/repoMap/queries/*-tags.scm at runtime, but package.json only publishes bin/, dist/cli.mjs, and README.md. I verified this with npm pack --dry-run; the tarball contents are only LICENSE, README.md, bin/import-specifier.mjs, bin/import-specifier.test.mjs, bin/openclaude, dist/cli.mjs, and package.json. So after npm install -g @gitlawb/openclaude, /repomap and auto-injection would not have the query files available and symbol extraction would silently return empty results.

What I checked:

  • gh pr view 543 current head: 8c8ec7c
  • gh pr diff 543 --name-only: no WebSearchTool / DuckDuckGo files
  • bun install --frozen-lockfile
  • bun test ./src/tools/RepoMapTool ./src/context/repoMap ./src/commands/repomap ./src/context.repoMap.test.ts passed 32/32
  • bun run build passed
  • npm pack --dry-run confirmed the query assets are not included in the package

Suggested fix: either embed the .scm query text into the bundle, or ship the query files in the package and resolve them from a packaged path. Please also add a small packaging test/check so this does not regress.

Non-blocking note:

  • The /repomap docs say the default is 1024 tokens, but parseArgs() defaults to 2048 and the test name says 1024 while asserting 2048. Worth aligning while touching this area, but the packaging issue above is the blocker.

@gnanam1990
Collaborator Author

Superseded by #966.

Re-opened on a fresh branch (feat/repo-map-bundle-queries) off current main with the npm-package shipping fix:

  • Tree-sitter tag queries are now inlined as string constants in src/context/repoMap/queries.ts. loadQuery() reads from those constants instead of readFileSync-ing the .scm files at runtime.
  • The .scm files stay in-repo as canonical source (Aider attribution preserved) and a drift-guard test asserts byte-equality between the inlined strings and the .scm files.
  • Verified with npm pack --dry-run that the tarball still contains no .scm files, and that none are needed at runtime because the queries are embedded in the dist/cli.mjs bundle.
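
A drift guard of the kind described in the bullets above could be sketched as follows. The names are hypothetical (`findDriftedQueries`, `QueryReader`), and the file reader is injected so the check itself stays free of I/O. Comparing decoded strings approximates the byte-equality check mentioned above for valid UTF-8 input.

```typescript
// Reads the canonical .scm text for a given query file name.
type QueryReader = (fileName: string) => string

// Return the names of inlined queries whose text no longer matches the
// canonical .scm source. An empty array means there is no drift.
function findDriftedQueries(
  inlined: Record<string, string>,
  readScm: QueryReader,
): string[] {
  return Object.keys(inlined)
    .filter(name => inlined[name] !== readScm(name))
    .sort()
}
```

In a real test the reader would likely wrap a file read rooted at the queries directory, and the assertion would be that the returned array is empty.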

Closing this in favor of #966 to keep the review thread clean. Thanks @Vasanthdev2004 for the careful catch on the package contents — that was a real bug that would have shipped a no-op /repomap to npm users.

@gnanam1990 gnanam1990 closed this May 1, 2026
3 participants