diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0d01379b1..f2bf99e92 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,21 @@
 
 Full release notes with details on each version: [GitHub Releases](https://github.com/safishamsi/graphify/releases)
 
+## 0.9.2 (2026-06-29)
+
+- Feat: type-aware Ruby member-call resolution (#1499, thanks @vamsipavanmahesh). `p.run` is now resolved by the inferred type of the receiver (`p = Processor.new` ⇒ `Processor#run`) instead of by globally-unique method name, so the edge survives name collisions (an unrelated `Worker#run` no longer makes it ambiguous) and never points at the wrong method. Introduces a small resolver-registry framework that the existing Swift (#1356) and Python (#1446) cross-file passes register into. Receiver types are inferred only from unambiguous local `var = ClassName.new` bindings; a call whose receiver type can't be proven resolves to nothing rather than to a guess — a deliberate precision-over-recall change for Ruby member calls.
+- Feat: resolve workspace imports through the package's `exports` map (#1308, thanks @guyoron1). A subpath import like `import { x } from "@scope/pkg/browser"` now resolves through the package.json `exports` map (string values, condition objects, nested conditions, and `./*` wildcard patterns) instead of falling back to a bare path string, falling back to the existing bare-path/index resolution when there's no exports map or no match. `default` is consulted last (Node's catch-all), and an export target that escapes the package directory is rejected.
+- Fix: import edges silently dropped on codebases using tsconfig path aliases or workspace packages (#1529), a regression from the 0.9.0 full-repo-relative node-ID change. Relative imports resolve to repo-relative paths and matched fine, but alias (`@/lib/utils`) and workspace imports resolve to absolute paths, so the import-target ID baked in the on-disk prefix and no longer matched the repo-relative definition node — the edge was dropped at build (common on Next.js/SvelteKit). The id-remap post-pass now also registers the absolute-resolved form, so alias/workspace import targets land on the real node again.
+- Fix: tsconfig `compilerOptions.paths` fallback targets are now honored (#1531, thanks @oleksii-tumanov). A `paths` value is an ordered list (`"@app/*": ["src/app/*", "lib/app/*"]`) that `tsc` tries in turn; graphify kept only the first entry, so an import whose file lived at a later target was dropped or misresolved. Each target is now tried in order and the first that resolves to a real file wins (no false edge when none exist).
+- Fix: the semantic (LLM) extraction cache is now pruned (#1527, thanks @mwolter805). The AST cache was version-swept but the content-hash-keyed semantic cache had no cleanup, so every content change or file deletion left an orphan entry and `graphify-out/cache/semantic/` grew unbounded. Orphan entries are now removed at the end of `extract`, computed against the full live document set (not the incremental changed subset, which would have evicted still-valid entries) and only touching `cache/semantic/`; the cache stays unversioned so releases never re-bill LLM extraction.
+- Fix: three Objective-C extractor bugs (#1475, thanks @JabberYQ for the detailed report and test repo). (1) `.h` headers using `NS_ASSUME_NONNULL_BEGIN` before `@interface` produced no class node — tree-sitter-objc can't expand the argument-less macro and fails to emit a `class_interface` node at all, so the macro is now blanked (offset-preserving) before parsing. (2) Quoted `#import "X.h"` edges dangled once a `.h`/`.m` pair existed (the bare-stem target was salted away during id-disambiguation); imports now resolve to the real header file node, fixing the equivalent latent C `#include` bug too. (3) `[[Foo alloc] init]` now emits a `references` edge to the allocated class, resolved only to an unambiguous class (no false edges). Dot-syntax property accesses and `@selector(...)` target-action edges remain follow-ups.
+- Fix: Swift type-qualified static calls now resolve as EXTRACTED rather than INFERRED (#1533, thanks @JabberYQ). `SessionType.staticMethod()` / `Singleton.shared.method()` name the receiver type explicitly in source, so the resolved edge is an exact reference, matching the Python qualified-class-method pass; instance calls typed via local inference (`obj.method()`) stay INFERRED.
+- Fix: enforce the API timeout in the secondary LLM dispatch path (#1442, thanks @DhruvTilva). `_call_llm` (used by the dedup LLM tiebreaker) built its Anthropic/OpenAI clients without `timeout`, so requests there ignored `GRAPHIFY_API_TIMEOUT` and could hang — it now passes the timeout like the primary extraction paths.
+- Fix: `to_graphml` no longer raises `ValueError` on a node/edge with a `None` attribute value — null fields are coerced to `""` before writing (#1502, thanks @antonioscarinci).
+- Feat: `graphify save-result` accepts `--answer-file` as an alternative to `--answer`, so a long or multi-line answer can be read from a file instead of an inline shell argument (#1502, thanks @antonioscarinci).
+- Fix: generated install/skill guidance is now host-generic (#1530, thanks @ari-mitophane). The wording no longer tells agents to invoke a literal `skill` tool with `skill: "graphify"` (host-specific and invalid in many environments); it now points to the installed graphify skill or instructions.
+- Security: bump `msgpack` to 1.2.1 (GHSA-6v7p-g79w-8964) and `pydantic-settings` to 2.14.2 (GHSA-4xgf-cpjx-pc3j), and drop the unused `safety` dev dependency, which only pulled in `nltk` (an unpatched HIGH advisory). All transitive; the two HIGH-severity ones were dev-tooling only and never in the published wheel. `pip-audit` (already run in CI) continues to provide dependency-CVE scanning.
+
 ## 0.9.1 (2026-06-28)
 
 - Fix: rate-limited (HTTP 429) extraction chunks are now retried instead of dropped (#1523, thanks @bercedev). The provider SDKs back off and honor `Retry-After`, but the SDK default of 2 retries was too low for strict per-org concurrency/RPM caps (e.g. Moonshot/kimi), so a parallel `extract` 429'd, each chunk logged `chunk N failed`, and was silently lost (incomplete graph + console spam). The OpenAI-compatible, Azure, and Anthropic clients are now built with a higher `max_retries` (default 6, override via `GRAPHIFY_MAX_RETRIES`). For very tight accounts, `--max-concurrency 1` further reduces the concurrency that triggers org-level limits.
diff --git a/graphify/extract.py b/graphify/extract.py
index 3ee27dd59..7d45a6329 100644
--- a/graphify/extract.py
+++ b/graphify/extract.py
@@ -7883,6 +7883,24 @@ def _disambiguate_colliding_node_ids(
         if len(candidates) == 1:
             unambiguous_remaps[old_id] = next(iter(candidates))
 
+    # A C/ObjC/C++ `#include "foo.h"` / `#import "foo.h"` resolves to the header's
+    # file node, but `foo.h` and its sibling `foo.c`/`foo.m`/`foo.cpp` collapse to
+    # the same `foo` file id, so disambiguation salts them apart by path. A
+    # cross-file import edge from a THIRD file carries neither salt's source_key, so
+    # the (target, edge_source_key) lookup misses and the edge dangles on the now
+    # dead `foo` id. Repoint those import edges to the HEADER variant (the include
+    # always targeted the header), keyed by the original colliding id (#1475).
+    _HEADER_SUFFIXES = (".h", ".hpp", ".hh", ".hxx")
+    header_remaps: dict[str, str] = {}
+    for old_id in ambiguous_ids:
+        for node in by_id.get(old_id, []):
+            sk = _node_disambiguation_source_key(node, root)
+            if sk and Path(sk).suffix.lower() in _HEADER_SUFFIXES:
+                new_id = remap.get((old_id, sk))
+                if new_id:
+                    header_remaps[old_id] = new_id
+                    break
+
     for edge in edges:
         edge_source_key = _source_key(str(edge.get("source_file", "")), root)
         source_key = (edge.get("source", ""), edge_source_key)
@@ -7891,7 +7909,15 @@ def _disambiguate_colliding_node_ids(
             edge["source"] = remap[source_key]
         elif edge.get("source") in unambiguous_remaps:
             edge["source"] = unambiguous_remaps[str(edge["source"])]
-        if target_key in remap:
+        # imports/imports_from always target a header file, so they must resolve to
+        # the header variant BEFORE the same-source-file salt is considered. Keying
+        # the import target by the importer's own source file mis-points a `.m`
+        # importing its own `.h` back at itself (self-loop), and is wrong for any
+        # cross-file import whose importer shares the colliding id (#1475).
+        if (edge.get("relation") in ("imports", "imports_from")
+                and edge.get("target") in header_remaps):
+            edge["target"] = header_remaps[str(edge["target"])]
+        elif target_key in remap:
             edge["target"] = remap[target_key]
         elif edge.get("target") in unambiguous_remaps:
             edge["target"] = unambiguous_remaps[str(edge["target"])]
@@ -9466,9 +9492,10 @@ def _resolve_swift_member_calls(
     (#543/#1219). Swift extractors record the receiver of each member call and a
     per-file ``name -> type`` table (``swift_type_table``); this pass uses them to
     type the receiver, then emits an edge ONLY when that type name resolves to
-    exactly one definition. Everything it adds is INFERRED (type inference, not an
-    explicit import), and the line-12503 drop stays intact: this is purely
-    additive and fires only on receiver-typed Swift calls.
+    exactly one definition. A type-qualified call (``Type.staticMethod()``) is
+    EXTRACTED (the type is named explicitly in source); an instance call typed via
+    local inference (``obj.method()``) is INFERRED. The shared-pass member-call drop
+    stays intact: this is purely additive and fires only on receiver-typed Swift calls.
 
     Must run after id-disambiguation so node ids and caller_nids are final.
     """
@@ -9525,8 +9552,10 @@ def _key(label: str) -> str:
         # declaring file's local type table.
         if receiver[:1].isupper():
             type_name = receiver
+            type_qualified = True
         else:
             type_name = type_table_by_file.get(rc.get("source_file", ""), {}).get(receiver)
+            type_qualified = False
         if not type_name:
             continue
         type_defs = type_def_nids.get(_key(type_name), [])
@@ -9542,13 +9571,17 @@ def _key(label: str) -> str:
             if target == caller or (caller, target) in existing_pairs:
                 continue
             existing_pairs.add((caller, target))
+            # A type-qualified call (`Type.staticMethod()`) names the receiver type
+            # explicitly in source, so it is an exact reference — EXTRACTED, matching
+            # the Python qualified-class-method pass (#1533). An instance call whose
+            # receiver type came from local inference (`obj.method()`) stays INFERRED.
             all_edges.append({
                 "source": caller,
                 "target": target,
                 "relation": relation,
                 "context": "call",
-                "confidence": "INFERRED",
-                "confidence_score": 0.8,
+                "confidence": "EXTRACTED" if type_qualified else "INFERRED",
+                "confidence_score": 1.0 if type_qualified else 0.8,
                 "source_file": rc.get("source_file", ""),
                 "source_location": rc.get("source_location"),
                 "weight": 1.0,
@@ -9673,6 +9706,13 @@ def extract_objc(path: Path) -> dict:
         language = Language(tsobjc.language())
         parser = Parser(language)
         source = path.read_bytes()
+        # tree-sitter-objc cannot expand these argument-less annotation macros (no
+        # trailing ';'), and their presence before @interface makes the parser fail to
+        # emit a class_interface node (#1475). Blank them to equal-length spaces so byte
+        # offsets / line numbers are preserved and the interface parses.
+        _OBJC_BLANK_MACROS = (b"NS_ASSUME_NONNULL_BEGIN", b"NS_ASSUME_NONNULL_END")
+        for _m in _OBJC_BLANK_MACROS:
+            source = source.replace(_m, b" " * len(_m))
         tree = parser.parse(source)
         root = tree.root_node
     except Exception as e:
@@ -9765,10 +9805,18 @@ def walk(node, parent_nid: str | None = None) -> None:
                 for sub in child.children:
                     if sub.type == "string_content":
                         raw = _read(sub)
-                        module = raw.split("/")[-1].replace(".h", "")
-                        if module:
-                            tgt_nid = _make_id(module)
-                            add_edge(file_nid, tgt_nid, "imports", line, context="import")
+                        # Resolve the quoted include to a real file so the target id
+                        # matches the (possibly disambiguated) node id _make_id gives
+                        # that file; the bare-stem id never survives
+                        # _disambiguate_colliding_node_ids when a .h/.m pair exists,
+                        # so the edge dangled and was dropped (#1475).
+                        resolved = _resolve_c_include_path(raw, str_path)
+                        if resolved is not None:
+                            add_edge(file_nid, _make_id(str(resolved)), "imports", line, context="import")
+                        else:
+                            module = raw.split("/")[-1].replace(".h", "")
+                            if module:
+                                add_edge(file_nid, _make_id(module), "imports", line, context="import")
             return
 
         if t == "module_import":
@@ -9903,6 +9951,23 @@ def walk(node, parent_nid: str | None = None) -> None:
     for caller_nid, body_node in method_bodies:
         def walk_calls(n) -> None:
             if n.type == "message_expression":
+                # `[[Foo alloc] init]` is a message_expression whose method is the
+                # identifier `alloc` and whose receiver is the bare class identifier
+                # `Foo`; resolve that class name and emit a `references` edge so the
+                # allocating method links to the allocated type. ensure_named_node
+                # emits a sourceless stub for unknown names, which the corpus rewire
+                # collapses ONLY when exactly one real class of that name exists, so an
+                # unknown/ambiguous class produces no false resolved edge (#1475).
+                meth = n.child_by_field_name("method")
+                recv = n.child_by_field_name("receiver")
+                if (meth is not None and meth.type == "identifier" and _read(meth) == "alloc"
+                        and recv is not None and recv.type == "identifier"):
+                    tname = _read(recv)
+                    ref_line = n.start_point[0] + 1
+                    type_nid = ensure_named_node(tname, ref_line)
+                    if type_nid != caller_nid:
+                        edges.append(_semantic_reference_edge(
+                            caller_nid, type_nid, "type", str_path, ref_line))
                 # [receiver sel] and [receiver kw1:a kw2:b] both parse to a
                 # message_expression whose selector parts carry the field name
                 # "method" (one for a simple selector, several for a compound one);
diff --git a/pyproject.toml b/pyproject.toml
index e6c68b83d..7e76e4f6f 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "graphifyy"
-version = "0.9.1"
+version = "0.9.2"
 description = "AI coding assistant skill (Claude Code, CodeBuddy, Codex, OpenCode, Kilo Code, Cursor, Gemini CLI, Aider, OpenClaw, Factory Droid, Trae, Hermes, Kiro, Pi, Devin CLI, Google Antigravity) - turn any folder of code, docs, papers, images, or videos into a queryable knowledge graph"
 readme = "README.md"
 license = { file = "LICENSE" }
diff --git a/tests/test_languages.py b/tests/test_languages.py
index c76933d04..a7d3ea9aa 100644
--- a/tests/test_languages.py
+++ b/tests/test_languages.py
@@ -1102,6 +1102,118 @@ def test_objc_header_dispatch_routes_objc_not_c(tmp_path):
     assert _get_extractor(c_h) is _ec
 
 
+def test_objc_ns_assume_nonnull_macro_does_not_break_parsing(tmp_path):
+    """`NS_ASSUME_NONNULL_BEGIN` before `@interface` made tree-sitter-objc fail to
+    emit a class_interface node, swallowing the whole interface; blanking the
+    argument-less macro restores it (#1475)."""
+    p = tmp_path / "AlertManager.h"
+    p.write_text(
+        "#import \n"
+        "NS_ASSUME_NONNULL_BEGIN\n"
+        "@class Other;\n"
+        "@interface AlertManager : NSObject\n"
+        "- (void)show;\n"
+        "@end\n"
+        "NS_ASSUME_NONNULL_END\n"
+    )
+    r = extract_objc(p)
+    labels = {n["label"] for n in r["nodes"]}
+    assert "AlertManager" in labels
+    assert ("AlertManager", "NSObject") in _edge_labels(r, "inherits")
+    # `@class Other;` is only a forward declaration; it must not mint a class node.
+    assert "Other" not in labels
+
+
+def test_objc_macro_free_header_unchanged(tmp_path):
+    """A macro-free header still parses exactly as before (regression)."""
+    p = tmp_path / "Plain.h"
+    p.write_text(
+        "@interface Plain : NSObject\n"
+        "- (void)go;\n"
+        "@end\n"
+    )
+    r = extract_objc(p)
+    labels = {n["label"] for n in r["nodes"]}
+    assert "Plain" in labels
+    assert ("Plain", "NSObject") in _edge_labels(r, "inherits")
+
+
+def test_objc_quoted_import_edges_resolve_to_real_nodes(tmp_path):
+    """Quoted `#import "X.h"` edges must target the real (disambiguated) file node id,
+    not the bare stem, which gets salted away when a `.h`/`.m` pair exists and left
+    the import edge dangling (#1475)."""
+    from graphify.extract import extract
+    (tmp_path / "Product.h").write_text("@interface Product : NSObject\n@end\n")
+    (tmp_path / "Product.m").write_text("#import \"Product.h\"\n@implementation Product\n@end\n")
+    (tmp_path / "Order.h").write_text("@interface Order : NSObject\n@end\n")
+    (tmp_path / "Order.m").write_text("#import \"Order.h\"\n@implementation Order\n@end\n")
+    consumer_a = tmp_path / "ConsumerA.m"
+    consumer_a.write_text("#import \"Product.h\"\n@implementation ConsumerA\n@end\n")
+    consumer_b = tmp_path / "ConsumerB.m"
+    consumer_b.write_text("#import \"Order.h\"\n@implementation ConsumerB\n@end\n")
+    files = [
+        tmp_path / "Product.h", tmp_path / "Product.m",
+        tmp_path / "Order.h", tmp_path / "Order.m",
+        consumer_a, consumer_b,
+    ]
+    r = extract(files, parallel=False)
+    node_ids = {n["id"] for n in r["nodes"]}
+    id_to_label = {n["id"]: n.get("label", "") for n in r["nodes"]}
+    import_edges = [e for e in r["edges"] if e["relation"] in ("imports", "imports_from")]
+    assert import_edges
+    for e in import_edges:
+        # No dangling targets...
+        assert e["target"] in node_ids, f"dangling import target: {e}"
+        # ...and no self-loops: a `.m` importing its own `.h` must resolve to the
+        # header file node, not get salted back to the importing `.m` (#1475).
+        assert e["source"] != e["target"], f"self-loop import edge: {e}"
+        # every quoted import targets a header (.h) file node
+        assert str(id_to_label.get(e["target"], "")).endswith(".h"), (
+            f"import target is not a header file node: {e} -> {id_to_label.get(e['target'])}"
+        )
+    # the self-import (Product.m -> Product.h) specifically lands on the .h variant
+    prod_imports = [e for e in import_edges if id_to_label.get(e["source"], "").endswith("Product.m")]
+    assert prod_imports and all(id_to_label.get(e["target"]) == "Product.h" for e in prod_imports), (
+        f"Product.m should import the Product.h node, got {[(id_to_label.get(e['source']), id_to_label.get(e['target'])) for e in prod_imports]}"
+    )
+
+
+def test_objc_alloc_init_emits_type_reference(tmp_path):
+    """`[[Foo alloc] init]` must emit a `references` edge to the project class Foo (#1475)."""
+    from graphify.extract import extract
+    (tmp_path / "Foo.h").write_text("@interface Foo : NSObject\n@end\n")
+    (tmp_path / "Foo.m").write_text("#import \"Foo.h\"\n@implementation Foo\n@end\n")
+    user = tmp_path / "User.m"
+    user.write_text(
+        "#import \"Foo.h\"\n"
+        "@implementation User\n"
+        "- (void)build { Foo *x = [[Foo alloc] init]; }\n"
+        "@end\n"
+    )
+    r = extract([tmp_path / "Foo.h", tmp_path / "Foo.m", user], parallel=False)
+    assert ("-build", "Foo") in _edge_labels(r, "references")
+
+
+def test_objc_alloc_init_unknown_class_no_resolved_edge(tmp_path):
+    """`[[Unknown alloc] init]` with no such class must not produce a resolved
+    reference edge (the sourceless stub is collapsed only when a real class exists)."""
+    p = tmp_path / "Caller.m"
+    p.write_text(
+        "@implementation Caller\n"
+        "- (void)build { id x = [[Unknown alloc] init]; }\n"
+        "- (void)other { [self build]; [x doStuff]; }\n"
+        "@end\n"
+    )
+    r = extract_objc(p)
+    # The single-file extractor emits the edge to a sourceless stub; assert there is
+    # no resolved reference to a *real* (sourced) Unknown node and that ordinary
+    # selector sends ([self build] / [x doStuff]) produce no alloc reference.
+    sourced_ids = {n["id"] for n in r["nodes"] if n.get("source_file")}
+    refs = [e for e in r["edges"] if e["relation"] == "references"]
+    for e in refs:
+        assert e["target"] not in sourced_ids, f"unexpected resolved ref: {e}"
+
+
 # ---------------------------------------------------------------------------
 # Go
 # ---------------------------------------------------------------------------
diff --git a/tests/test_swift_cross_file_calls.py b/tests/test_swift_cross_file_calls.py
index 464fc5cef..552581998 100644
--- a/tests/test_swift_cross_file_calls.py
+++ b/tests/test_swift_cross_file_calls.py
@@ -70,32 +70,42 @@ def test_swift_cross_file_member_calls_resolve(tmp_path: Path):
     assert (".go()", "calls", ".method()") in edges  # Singleton.shared.method()
 
 
-def test_swift_cross_file_member_calls_are_inferred_and_resolve_to_real_nodes(tmp_path: Path):
-    # The new edges must be INFERRED (type inference, not an explicit import) and
-    # land on real definition nodes so build_from_json keeps them.
+def test_swift_cross_file_member_calls_have_correct_confidence_and_resolve(tmp_path: Path):
+    # Instance calls typed via local inference (vm.update(), self.svc.fetch()) are
+    # INFERRED; type-qualified static calls (SessionType.staticMethod(),
+    # Singleton.shared.method()) name the receiver type explicitly in source, so
+    # they are EXTRACTED, matching the Python qualified-class-method pass (#1533).
+    # All must land on real definition nodes so build_from_json keeps them.
     files = _issue_fixture(tmp_path / "src")
     result = extract(files, cache_root=tmp_path / "cache")
     node_ids = {n["id"] for n in result["nodes"]}
     src_by_id = {n["id"]: n.get("source_file") for n in result["nodes"]}
 
-    member_targets = {".update()", ".fetch()", ".staticMethod()", ".method()"}
-    seen_targets: set[str] = set()
+    inferred_targets = {".update()", ".fetch()"}
+    extracted_targets = {".staticMethod()", ".method()"}
+    seen_inferred: set[str] = set()
+    seen_extracted: set[str] = set()
     for e in result["edges"]:
         tgt_label = _label(result, e["target"])
-        if e.get("relation") == "calls" and tgt_label in member_targets:
-            assert e["confidence"] == "INFERRED"
-            assert e["confidence_score"] == 0.8
-            assert e["target"] in node_ids
-            assert src_by_id.get(e["target"])  # resolved to a real, source-backed def
-            seen_targets.add(tgt_label)
-    assert seen_targets == member_targets
+        if e.get("relation") != "calls":
+            continue
+        if tgt_label in inferred_targets:
+            assert e["confidence"] == "INFERRED" and e["confidence_score"] == 0.8
+            assert e["target"] in node_ids and src_by_id.get(e["target"])
+            seen_inferred.add(tgt_label)
+        elif tgt_label in extracted_targets:
+            assert e["confidence"] == "EXTRACTED" and e["confidence_score"] == 1.0
+            assert e["target"] in node_ids and src_by_id.get(e["target"])
+            seen_extracted.add(tgt_label)
+    assert seen_inferred == inferred_targets
+    assert seen_extracted == extracted_targets
 
     # Edges survive graph construction (no dangling targets pruned).
     g = build_from_json(result)
     surviving = sum(
         1 for _, _, d in g.edges(data=True)
-        if d.get("confidence") == "INFERRED" and d.get("relation") == "calls"
+        if d.get("relation") == "calls" and d.get("confidence") in ("INFERRED", "EXTRACTED")
     )
     assert surviving >= 5
 
diff --git a/uv.lock b/uv.lock
index c7f712f67..c9fe20699 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1090,7 +1090,7 @@ wheels = [
 
 [[package]]
 name = "graphifyy"
-version = "0.9.1"
+version = "0.9.2"
 source = { editable = "." }
 dependencies = [
     { name = "networkx", version = "3.4.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" },