feat: Codebase Intelligence — repo map with PageRank (queries bundled in dist)#966

Open
gnanam1990 wants to merge 1 commit into main from feat/repo-map-bundle-queries

@gnanam1990
Collaborator

Summary

Re-opens the repo-map feature from #543 with the npm-package shipping fix that surfaced in @Vasanthdev2004's last review.

What changed vs #543

The blocker on #543 was that src/context/repoMap/parser.ts read tree-sitter tag queries via readFileSync('./queries/*-tags.scm') at runtime, but package.json's files allowlist only ships bin/, dist/cli.mjs, and README.md. npm pack --dry-run confirmed the .scm files were missing from the tarball, so symbol extraction would silently return empty results after npm install -g @gitlawb/openclaude.

Fix: the queries are now inlined as string constants in src/context/repoMap/queries.ts and loadQuery() reads from those constants instead of the filesystem. The .scm files remain in the repo as the canonical source-of-truth (preserving the Aider MIT attribution and keeping them readable as standalone tree-sitter queries), and a drift-guard test (queries.test.ts) asserts byte-for-byte equality between the inlined strings and the .scm source files. If anyone edits a .scm and forgets to mirror the change, that test fails.
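A minimal sketch of the inlining approach described above. The names `Language` and `INLINED_QUERIES` and the abbreviated query strings are assumptions for illustration; the actual contents of queries.ts are not reproduced in this description.

```typescript
// Hypothetical sketch: names and query text are illustrative, not the PR's actual code.
type Language = 'typescript' | 'javascript' | 'python';

// Abbreviated tag queries. In the PR, the inlined strings are said to mirror
// the .scm source files byte-for-byte.
const INLINED_QUERIES: Record<Language, string> = {
  typescript: '(function_signature name: (identifier) @name.definition.function)\n',
  javascript: '(function_declaration name: (identifier) @name.definition.function)\n',
  python: '(class_definition name: (identifier) @name.definition.class)\n',
};

// Returns the query text from bundled constants, so no filesystem read is
// needed at runtime after `npm install`.
function loadQuery(lang: Language): string {
  return INLINED_QUERIES[lang];
}
```

Because the strings live in a regular module, the bundler carries them into dist/cli.mjs along with the rest of the code, which is what the grep check below is meant to confirm.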

Verified the queries now ship inside the bundle:

$ bun run build
✓ Built openclaude v0.7.0 → dist/cli.mjs
$ npm pack --dry-run
npm notice 27.0MB dist/cli.mjs
$ grep -c 'function_signature\|class_definition\|generator_function' dist/cli.mjs
4

No .scm files are required at runtime. readFileSync/existsSync imports and the getQueryPath() helper are removed from parser.ts.

Why a new PR instead of pushing to #543

#543's branch carried a stale-merge concern in an earlier review and the iteration history was getting hard to follow. Cleaner to land this as a fresh branch off current main with a single squashed commit. Closing #543 in favor of this once it's reviewed.

Surface (unchanged from #543)

| Component | Files | Purpose |
| --- | --- | --- |
| Core module | src/context/repoMap/ (13 files incl. queries.ts) | Symbol extraction, graph building, PageRank, token-budgeted rendering, disk cache |
| Tree-sitter queries | queries/*.scm (canonical) + queries.ts (inlined) | Tag queries for TypeScript, JavaScript, Python (from Aider, MIT licensed) |
| Test fixtures | __fixtures__/mini-repo/ (5 files) | 5-file TypeScript fixture with known import graph |
| RepoMap tool | src/tools/RepoMapTool/ (4 files) | Read-only, concurrency-safe, registered in src/tools.ts |
| Slash command | src/commands/repomap/ (3 files) | /repomap, --tokens, --focus, --stats, --invalidate |
| Context injection | src/context.ts | getRepoMapContext() memoized; gated by feature('REPO_MAP') OR process.env.REPO_MAP truthy |
| Feature flag | scripts/build.ts | REPO_MAP: false (compile-time off); users opt in with REPO_MAP=1 openclaude |
| Documentation | docs/repo-map.md, README.md | Full user-facing docs |
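
The gating rule for context injection (compile-time flag OR a truthy environment variable) could look roughly like the sketch below. `isRepoMapEnabled` is a hypothetical helper, and the set of values treated as falsy is an assumption; the PR only says the variable must be "truthy".

```typescript
// Hypothetical helper: the PR's real check lives in src/context.ts and may differ.
// Assumption: empty string, "0", "false", "no", and "off" count as falsy.
function isRepoMapEnabled(
  compileTimeFlag: boolean,
  envValue: string | undefined,
): boolean {
  const falsy = new Set(['', '0', 'false', 'no', 'off']);
  const envOn =
    envValue !== undefined && !falsy.has(envValue.trim().toLowerCase());
  // Either source can turn the feature on; neither can force it off.
  return compileTimeFlag || envOn;
}
```

With the compile-time flag off, `isRepoMapEnabled(false, process.env.REPO_MAP)` is true only when the user starts the CLI with something like `REPO_MAP=1`.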

How it works

git ls-files → tree-sitter WASM parse → extract defs/refs → IDF-weighted directed graph → PageRank → render top files with signatures → stop at token budget

Files imported by many others rank highest. Common symbol names (get, set, map, value) are down-weighted via IDF. Results are cached to disk keyed by (path, mtime, size) — only changed files are re-parsed.
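
The ranking step can be sketched in plain TypeScript as follows. This is an illustration under stated assumptions, not the PR's code: the PR uses graphology and graphology-pagerank, whereas this sketch uses Maps and a hand-written power iteration, and the IDF formula `log(1 + N / df)` and the `FileTags` shape are assumptions.

```typescript
// Illustrative only: hypothetical types and a hand-rolled PageRank.
interface FileTags {
  path: string;
  defs: string[]; // symbols defined in the file
  refs: string[]; // symbols referenced in the file
}

// Builds weighted edges (referencing file -> defining file). Each reference
// contributes the symbol's IDF, so names that appear in many files count less.
function buildEdges(files: FileTags[]): Map<string, Map<string, number>> {
  const definedIn = new Map<string, string[]>();
  const docFreq = new Map<string, number>();
  for (const f of files) {
    for (const d of f.defs) definedIn.set(d, (definedIn.get(d) ?? []).concat(f.path));
    new Set(f.defs.concat(f.refs)).forEach((s) => docFreq.set(s, (docFreq.get(s) ?? 0) + 1));
  }
  const edges = new Map<string, Map<string, number>>();
  for (const f of files) {
    for (const ref of f.refs) {
      const idf = Math.log(1 + files.length / (docFreq.get(ref) ?? 1));
      for (const target of definedIn.get(ref) ?? []) {
        if (target === f.path) continue; // ignore self-references
        const out = edges.get(f.path) ?? new Map<string, number>();
        out.set(target, (out.get(target) ?? 0) + idf);
        edges.set(f.path, out);
      }
    }
  }
  return edges;
}

// Weighted PageRank by power iteration. Rank held by files with no outgoing
// edges is redistributed uniformly so the scores keep summing to 1.
function pageRank(
  nodes: string[],
  edges: Map<string, Map<string, number>>,
  damping = 0.85,
  iterations = 50,
): Map<string, number> {
  const n = nodes.length;
  let rank = new Map<string, number>();
  for (const id of nodes) rank.set(id, 1 / n);
  for (let i = 0; i < iterations; i++) {
    const next = new Map<string, number>();
    for (const id of nodes) next.set(id, (1 - damping) / n);
    let dangling = 0;
    for (const src of nodes) {
      const r = rank.get(src) ?? 0;
      const out = edges.get(src);
      let total = 0;
      if (out) out.forEach((w) => { total += w; });
      if (!out || total === 0) { dangling += r; continue; }
      out.forEach((w, dst) => {
        next.set(dst, (next.get(dst) ?? 0) + damping * r * (w / total));
      });
    }
    for (const id of nodes) next.set(id, (next.get(id) ?? 0) + (damping * dangling) / n);
    rank = next;
  }
  return rank;
}
```

In a three-file example where `a.ts` and `b.ts` both reference a symbol defined in `util.ts`, `util.ts` receives both incoming edges and ends up with the highest score, matching the "files imported by many others rank highest" behaviour described above.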

Configuration

# Slash command (always available, no flag needed)
/repomap                        # Default 2048 token budget
/repomap --tokens 4096          # Larger map
/repomap --focus src/tools/     # Boost specific paths
/repomap --stats                # Cache info
/repomap --invalidate           # Clear cache and rebuild

# Auto-injection into session context
REPO_MAP=1 openclaude           # Runtime opt-in

Dependencies added

web-tree-sitter, tree-sitter-wasms, graphology, graphology-pagerank, graphology-operators, js-tiktoken (~80MB in node_modules; only dist/cli.mjs ships).

Test plan

  • bun install — clean
  • bun test src/context/repoMap src/tools/RepoMapTool src/commands/repomap src/context.repoMap.test.ts — 36 pass / 0 fail
  • bun test src/context/repoMap/queries.test.ts — 4 pass (drift guard verifies inlined strings match .scm files byte-for-byte)
  • bun test (full suite) — 1749 pass / 2 fail; the 2 failures (detectProvider — modelOverride from --model flag) reproduce on main and are unrelated to this PR
  • bun run build — success
  • npm pack --dry-run — confirmed only dist/cli.mjs ships, no .scm files needed
  • Verified queries embedded: grep -c 'function_signature\|class_definition\|generator_function' dist/cli.mjs returns 4
  • Manual CLI verification of /repomap, --tokens, --focus, --stats, --invalidate
  • Manual verification of flag-on auto-injection and flag-off regression

Supported languages

TypeScript, JavaScript, Python. Additional grammars in a follow-up.

Known limitations

  • Cold build ~25s on 2100-file repos (WASM parsing). Warm cache <100ms.
  • TypeScript query captures type refs but not function calls — ranking favors type-heavy hub files.
  • Compile-time feature('REPO_MAP') defaults to off; users opt in via REPO_MAP=1.

Closes

Supersedes #543 (will close that one once this is reviewed).

…tural summaries

Adds a new module that builds a structural map of the repository by parsing
source files with tree-sitter, building a cross-file reference graph weighted
by IDF, ranking files with PageRank, and rendering a token-budgeted summary
of the most important files and their signatures.

Surface:
- RepoMap tool the model can call on-demand, with focus_files / focus_symbols
- /repomap slash command with --tokens, --focus, --stats, --invalidate
- Auto-injection into session system context, gated by REPO_MAP=1 env var
  (compile-time feature('REPO_MAP') flag stays off in scripts/build.ts)

How it works:
  git ls-files → tree-sitter WASM parse → extract defs/refs →
  IDF-weighted directed graph → PageRank → render top files until token budget

Files imported by many others rank highest. Common symbol names (get, set,
map, value) are down-weighted via IDF. Results cached to disk keyed by
(path, mtime, size) — only changed files are re-parsed.

Supported languages: TypeScript, JavaScript, Python.

Tree-sitter tag queries are inlined as string constants in queries.ts so
they ship inside dist/cli.mjs and work after npm install — the .scm source
files are kept for readability/Aider attribution but are not required at
runtime. A drift-guard test (queries.test.ts) asserts byte-equality between
the inlined strings and the .scm source files.

Dependencies added: web-tree-sitter, tree-sitter-wasms, graphology,
graphology-pagerank, graphology-operators, js-tiktoken.
Collaborator

@Vasanthdev2004 Vasanthdev2004 left a comment

Thanks for reopening this cleanly after #543. The packaging fix direction is good: inlining the .scm queries into queries.ts does address the npm tarball/runtime asset issue I previously blocked on. I did a current-head review at d469b76 and found two blockers before this should merge.

Verdict: Needs changes

Blocking issues:

  1. The rendered repo-map cache can return stale maps after file edits. computeMapHash() only includes the file list, token budget, and focus files, and buildRepoMap() checks the rendered __rendered__${mapHash} entry before validating per-file mtimes/sizes. That means if a source file changes but the file list stays the same, /repomap can return the previous rendered map forever until manual --invalidate.

Minimal repro on current head:

bun --eval "import { mkdtempSync, writeFileSync, rmSync } from 'fs'; import { tmpdir } from 'os'; import { join } from 'path'; import { buildRepoMap, invalidateCache } from './src/context/repoMap/index.ts'; const root = mkdtempSync(join(tmpdir(), 'repomap-stale-')); try { writeFileSync(join(root, 'main.ts'), 'export function oldName(): void {}\n'); invalidateCache(root); const first = await buildRepoMap({ root, maxTokens: 1024 }); writeFileSync(join(root, 'main.ts'), 'export function newName(): void {}\n'); const second = await buildRepoMap({ root, maxTokens: 1024 }); console.log(JSON.stringify({ firstCacheHit: first.cacheHit, secondCacheHit: second.cacheHit, secondHasOld: second.map.includes('oldName'), secondHasNew: second.map.includes('newName') }, null, 2)); } finally { invalidateCache(root); rmSync(root, { recursive: true, force: true }); }"

Current output:

{
  "firstCacheHit": false,
  "secondCacheHit": true,
  "secondHasOld": true,
  "secondHasNew": false
}

The rendered-cache key needs to include a source fingerprint/metadata fingerprint, or the rendered cache should be validated after per-file cache checks rather than before them. Please add a regression test that edits a file and confirms the second map reflects the new symbol without requiring manual invalidation.
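
One way to fold a metadata fingerprint into the rendered-cache key is sketched below. The function name, the `FileMeta` shape, and the parameter list are hypothetical; the real `computeMapHash()` signature is not shown in this thread.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape: per-file metadata as it might come from fs.statSync().
interface FileMeta {
  path: string;
  mtimeMs: number;
  size: number;
}

// Sketch of a rendered-cache key that changes whenever any file's mtime or
// size changes, in addition to the file list, token budget, and focus files.
function computeMapHashWithFingerprint(
  files: FileMeta[],
  maxTokens: number,
  focusFiles: string[],
): string {
  const hash = createHash('sha256');
  hash.update(`tokens:${maxTokens}\n`);
  for (const focus of focusFiles.slice().sort()) hash.update(`focus:${focus}\n`);
  // Sort by path so the key does not depend on enumeration order.
  const sorted = files
    .slice()
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
  for (const f of sorted) hash.update(`file:${f.path}:${f.mtimeMs}:${f.size}\n`);
  return hash.digest('hex');
}
```

An edit that preserves both mtime and size would still slip past a key like this; hashing file contents would close that gap at the cost of reading every file on each call.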

  2. src/context/repoMap/queries.test.ts fails on Windows because the byte-for-byte drift guard is line-ending sensitive. On my Windows checkout, the .scm files are read with CRLF while the inlined constants are LF, so all three language drift checks fail even though the visible content is the same.

Local result:

bun test src/context/repoMap/queries.test.ts
# 1 pass / 3 fail

Please normalize line endings in the test before comparison, or enforce LF for the .scm query files via .gitattributes. Since OpenClaude has active Windows users, the drift guard should pass on Windows checkouts too.
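
A minimal sketch of the first option, normalizing line endings before the comparison. `normalizeEol` and `queriesMatch` are hypothetical helper names, not functions from the PR.

```typescript
// Converts CRLF and lone CR line endings to LF.
function normalizeEol(text: string): string {
  return text.replace(/\r\n?/g, '\n');
}

// Drift check that ignores line-ending differences but nothing else.
function queriesMatch(inlined: string, scmSource: string): boolean {
  return normalizeEol(inlined) === normalizeEol(scmSource);
}
```

The second option needs no test change: a `.gitattributes` entry such as `*.scm text eol=lf` asks Git to check those files out with LF on every platform, which keeps the comparison strictly byte-for-byte.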

What I checked:

  • Current head d469b76
  • parser.ts / queries.ts / queries.test.ts packaging fix
  • buildRepoMap() cache path and rendered-cache keying
  • /repomap command and RepoMapTool surfaces
  • bun test src/context.repoMap.test.ts passed 4/4 isolated
  • bun test src/context/repoMap/queries.test.ts failed 3/4 on Windows as described

Happy to re-review once those two are fixed. The overall feature shape still looks useful; these are correctness/test-portability issues rather than objections to the direction.

3kin0x added a commit to 3kin0x/openclaude that referenced this pull request May 2, 2026
…est line endings

- Update computeMapHash to include file mtime and size in the hash key. This ensures that editing a file invalidates the rendered repo-map cache even if the file list remains the same.
- Normalize line endings (\r\n -> \n) in queries.test.ts before comparison to ensure drift guards pass on Windows checkouts.

Addresses reviewer blockers for PR Gitlawb#966.
@3kin0x 3kin0x mentioned this pull request May 2, 2026
3 tasks
@3kin0x
Contributor

3kin0x commented May 2, 2026

Hello, I just helped out with this: #989

Best regards,
Chris

3kin0x added a commit to 3kin0x/openclaude that referenced this pull request May 3, 2026
…est line endings

- Update computeMapHash to include file mtime and size in the hash key. This ensures that editing a file invalidates the rendered repo-map cache even if the file list remains the same.
- Normalize line endings (\r\n -> \n) in queries.test.ts before comparison to ensure drift guards pass on Windows checkouts.

Addresses reviewer blockers for PR Gitlawb#966.

3 participants