feat: add shortest path return to BFS and SSSP algorithms #574
Open
longbinlai wants to merge 9 commits into
Add `path` output column to BFS and SSSP GDS algorithms, allowing users
to retrieve the actual shortest path (vertex + edge sequence) via the
standard Cypher YIELD clause.
## Cypher API
```cypher
// Distance only (backward compatible, zero overhead)
CALL bfs('g', {source: '0'}) YIELD node, distance RETURN node.id, distance;
// With path return (path computation is triggered by YIELDing path)
CALL bfs('g', {source: '0'}) YIELD node, distance, path
RETURN node.id, distance, path;
```
## Key Design Decisions
- **Pure YIELD-based API**: no config options; path computation only
happens when the `path` column is explicitly YIELDed
- **Zero overhead when not requested**: predecessor array is not
allocated when path is not in the YIELD clause
- **Native kPath type**: returns standard NeuG Path objects with real
edge data pointers looked up from the CSR graph view, supporting all
Cypher path functions (nodes(), relationships(), length())
- **Struct-wrapped DataType**: the path column uses a properly
constructed StructTypeInfo (with _NODES and _RELS fields) to match
the type converter's expectations, via a bind-time wrapper
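The "zero overhead when not requested" decision can be illustrated with a small Python model. This is a sketch under stated assumptions, not the NeuG code: the function `bfs`, the `want_path` flag (standing in for "`path` appears in the YIELD clause"), and the adjacency-list graph are all hypothetical, and the real implementation is C++ over a CSR view.

```python
from collections import deque


def bfs(adj, source, want_path=False):
    """BFS over an adjacency list.

    `want_path` is a hypothetical stand-in for "path was YIELDed".
    """
    n = len(adj)
    dist = [-1] * n
    # The predecessor array is allocated only when a path was requested.
    pred = [-1] * n if want_path else None
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == -1:  # first visit wins
                dist[v] = dist[u] + 1
                if pred is not None:
                    pred[v] = u
                queue.append(v)
    return dist, pred


adj = [[1, 2], [3], [3], []]
dist_only, no_pred = bfs(adj, 0)                 # distance-only call
dist_with, pred = bfs(adj, 0, want_path=True)    # path requested
```

Both calls return the same distances; only the second pays for the O(V) predecessor array.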
## Implementation
- Adds `predecessors_[]` array to BFS/BFSPred/SSSP/SSSPPred (O(V) memory)
- BFS parallel: predecessor write inside CAS-success branch (no extra sync)
- SSSP parallel: predecessor write after successful distance relaxation
- Sequential variants (pred): plain writes, no atomics needed
- Path reconstruction in sink() walks predecessor chain, looks up real
edge data from CSR, builds PathColumn via PathColumnBuilder
- New `path_utils.h` with `reconstruct_path()`, `build_path_from_chain()`,
`buildPathDataType()`, and `wrapTableBindFuncWithPathFix()`
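The predecessor bookkeeping and the chain walk listed above can be sketched sequentially in Python. This is an illustrative model only: the name `reconstruct_path` is borrowed from `path_utils.h`, but the signature, the `(neighbor, weight, edge_id)` adjacency format, and the use of a binary-heap Dijkstra are assumptions made for the example, not the PR's actual C++ code.

```python
import heapq


def sssp_with_pred(adj, source):
    """Dijkstra; adj[u] is a list of (v, weight, edge_id) tuples (assumed format)."""
    n = len(adj)
    dist = [float("inf")] * n
    pred = [None] * n  # pred[v] = (u, edge_id) of the edge that last relaxed v
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w, eid in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                pred[v] = (u, eid)  # written only after a successful relaxation
                heapq.heappush(heap, (nd, v))
    return dist, pred


def reconstruct_path(pred, source, target):
    """Walk the predecessor chain from target back to source, then reverse."""
    if target != source and pred[target] is None:
        return None  # target is unreachable
    nodes, edges = [target], []
    cur = target
    while cur != source:
        u, eid = pred[cur]
        edges.append(eid)
        nodes.append(u)
        cur = u
    nodes.reverse()
    edges.reverse()
    return nodes, edges


# Edges: e0 = 0->1 (4.0), e1 = 0->2 (1.0), e2 = 2->1 (2.0), e3 = 1->3 (1.0); node 4 is isolated.
adj = [[(1, 4.0, 0), (2, 1.0, 1)], [(3, 1.0, 3)], [(1, 2.0, 2)], [], []]
dist, pred = sssp_with_pred(adj, 0)
path_to_3 = reconstruct_path(pred, 0, 3)
path_to_4 = reconstruct_path(pred, 0, 4)
```

Storing the edge id next to the predecessor vertex is one way to avoid a second lookup during reconstruction; the PR instead describes looking edge data up from the CSR while walking the chain.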
## Testing
- 16 new path return tests (test_gds_path.py): basic, backward compat,
predicates, custom graphs, edge cases
- All 76 tests pass (24 Graphalytics + 36 GDS + 16 new path tests)
- LDBC Graphalytics conformance verified with and without path return
The convert_path_to_json function was serializing each vertex/edge to a JSON string, then parsing it back into a rapidjson Document, then copying it into the parent array: three operations where one suffices.

Add build_vertex_json_value and build_edge_json_value helpers that construct rapidjson objects directly into the provided allocator, and have convert_path_to_json use them instead of the string-based convert_vertex_to_json / convert_edge_to_json.

The original string-returning functions are kept unchanged for their standalone callers in add_column (kVertex/kEdge cases).

Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths):
BFS with path return: 1.63s → 1.22s (25% faster)
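As a rough analogy for the change described above, the following Python sketch contrasts the serialize/parse/copy round-trip with building the value directly. It uses the standard `json` module in place of rapidjson, and every function name here is hypothetical; it only shows that the two approaches produce the same result while the second skips two steps per element.

```python
import json


def vertex_to_json_str(v):
    """String-returning encoder, analogous to a standalone JSON serializer."""
    return json.dumps({"_ID": v["id"], "_LABEL": v["label"]})


def path_to_json_roundtrip(vertices):
    """Old shape: serialize, parse back, then copy into the parent array."""
    out = []
    for v in vertices:
        s = vertex_to_json_str(v)   # 1. serialize to a string
        doc = json.loads(s)         # 2. parse the string back into an object
        out.append(dict(doc))       # 3. copy the object into the parent
    return out


def build_vertex_value(v):
    """New shape: construct the value once, with no intermediate string."""
    return {"_ID": v["id"], "_LABEL": v["label"]}


def path_to_json_direct(vertices):
    return [build_vertex_value(v) for v in vertices]


vertices = [{"id": 0, "label": "Person"}, {"id": 5, "label": "Person"}]
roundtrip_result = path_to_json_roundtrip(vertices)
direct_result = path_to_json_direct(vertices)
```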
…ncoding
Two optimizations to convert_path_to_json in sink.cc:
1. Direct rapidjson build: replace the serialize→parse→copy string
round-trip with build_vertex_json_value / build_edge_json_value
helpers that construct rapidjson objects directly into the allocator.
2. Lightweight path encoding: path output now encodes only _ID, _LABEL,
and PK for nodes (plus _SRC_ID, _DST_ID for edges), skipping all
non-PK property lookups. Users can retrieve full node/edge details
with a separate MATCH query when needed.
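The lightweight encoding in item 2 amounts to selecting a fixed set of structural fields and never touching the remaining properties. A minimal Python sketch follows; the record layout (`props` dictionaries, a `pk_name` argument) and the function names are assumptions for illustration and do not mirror the C++ helpers.

```python
def encode_node_light(node, pk_name):
    """Keep only _ID, _LABEL and the primary key; other properties are not read."""
    return {
        "_ID": node["_ID"],
        "_LABEL": node["_LABEL"],
        pk_name: node["props"][pk_name],
    }


def encode_edge_light(edge):
    """Keep only the structural edge fields."""
    return {key: edge[key] for key in ("_ID", "_LABEL", "_SRC_ID", "_DST_ID")}


node = {"_ID": 7, "_LABEL": "Person", "props": {"id": 42, "name": "Ada", "age": 36}}
edge = {"_ID": 3, "_LABEL": "KNOWS", "_SRC_ID": 7, "_DST_ID": 9, "props": {"since": 2020}}

light_node = encode_node_light(node, "id")
light_edge = encode_edge_light(edge)
```

The `name`, `age`, and `since` properties never appear in the output, which is the property I/O the commit message says it skips.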
Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths, 3-round avg):
BFS yield+return path: 1.63s → 0.50s (3.3x faster, ↓ 70%)
Breakdown of improvement:
- String round-trip elimination: 1.63s → 1.31s (↓ 20%)
- Lightweight encoding (skip property I/O): 1.31s → 0.50s (↓ 62%)
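The relationship between the figures above can be checked with a few lines of Python. The timings are the rounded values quoted in this message; the helper name `reduction` is hypothetical. With these rounded inputs the overall reduction comes out near 69%, consistent with the quoted figure of about 70%.

```python
baseline = 1.63      # seconds, before either optimization
after_direct = 1.31  # seconds, after removing the string round-trip
after_light = 0.50   # seconds, after lightweight encoding as well


def reduction(before, after):
    """Fraction of run time removed when going from `before` to `after`."""
    return (before - after) / before


speedup = baseline / after_light
stage1 = reduction(baseline, after_direct)
stage2 = reduction(after_direct, after_light)
overall = reduction(baseline, after_light)

# The stage reductions compose multiplicatively; they do not add.
composed = 1 - (1 - stage1) * (1 - stage2)

print(f"speedup {speedup:.2f}x, stage1 {stage1:.1%}, stage2 {stage2:.1%}, overall {overall:.1%}")
```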
All 94 GDS tests pass. No regressions in other test suites.
…path encoding
Add configurable path encoding mode to control performance vs completeness:
- **Lightweight mode (default)**: Only encodes structural info (_ID, _LABEL, PK for nodes; _ID, _LABEL, _SRC_ID, _DST_ID for edges)
- Performance: ~0.54s on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths)
- **Full mode**: Encodes all vertex and edge properties
- Performance: ~1.26s (2.34x slower than lightweight)
Usage:
CALL bfs('graph', {source: '0'})  // lightweight (default)
CALL bfs('graph', {source: '0', path_properties: 'full'})  // full
Implementation:
- Add thread-local flag in sink.cc to control encoding mode
- Add set_path_full_encoding()/get_path_full_encoding() API in sink.h
- Add configure_path_encoding() helper in path_utils.h
- Parse path_properties option in bfs.cc and sssp.cc
- Update build_vertex_json_value() and build_edge_json_value() to check flag
- Add 5 new tests in test_gds_path.py::TestPathEncodingModes
- Update spec document with configuration details
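The thread-local flag and option parsing listed above can be modeled in Python as follows. The function names are taken from this commit message, but the bodies are a hypothetical sketch: accepting the literal value `'lightweight'` and raising on unknown values are assumptions, and `threading.local` stands in for a C++ `thread_local` variable.

```python
import threading

_state = threading.local()  # each thread sees its own `full` attribute


def set_path_full_encoding(enabled):
    _state.full = bool(enabled)


def get_path_full_encoding():
    # Lightweight is the default when the flag was never set on this thread.
    return getattr(_state, "full", False)


def configure_path_encoding(options):
    """Parse the `path_properties` option from an algorithm's config map."""
    mode = options.get("path_properties", "lightweight")
    if mode not in ("lightweight", "full"):
        raise ValueError(f"unknown path_properties value: {mode!r}")  # assumed behavior
    set_path_full_encoding(mode == "full")


# Enable full encoding on the main thread only.
configure_path_encoding({"source": "0", "path_properties": "full"})

seen_in_worker = []
worker = threading.Thread(target=lambda: seen_in_worker.append(get_path_full_encoding()))
worker.start()
worker.join()
```

A thread-local flag keeps one query's setting from leaking into queries running on other threads, but it also means the flag must be set on the same thread that later performs the serialization.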
All 57 GDS tests pass. No regressions in other test suites.
- Add 'Shortest Path Return' section to doc/source/extensions/load_gds.md
- Document path_properties option (lightweight/full modes)
- Update BFS and SSSP sections with path examples
- Update Algorithm Summary table
- Remove technical spec file (not for users)
9b83082 to 0262596
….1 + clang-format 10.0.1)
- Remove build_vertex_json_value_light and build_edge_json_value_light (defined but not used, causes -Werror=unused-function in CI)
- Re-run isort with pinned version 5.10.1 (CI uses 5.10.1, not 6.x)
- Re-run black on Python test files
f394613 to a59f3d6
Summary
Add shortest path return to GDS BFS and SSSP algorithms, plus path serialization optimizations that benefit all path-returning queries.
Changes
Commit 1: feat(gds): add shortest path return to BFS and SSSP algorithms
- `path` output column via YIELD clause (pure YIELD-based, no config options)
- `extension/gds/`: no system code modified
Commit 2-3: perf: optimize path JSON serialization
- `convert_path_to_json` (`sink.cc`)
- `build_vertex_json_value` / `build_edge_json_value`
- `convert_vertex_to_json` / `convert_edge_to_json` unchanged for standalone callers
Commit 4: feat(gds): add path_properties configuration
- Full mode (`path_properties: 'full'`): all vertex and edge properties
Usage
Testing
Performance (LDBC SNB SF10: 65K nodes, 1.9M edges, 62K paths)
Files Changed
- `extension/gds/`: 12 files (path return implementation + configuration)
- `include/neug/execution/common/operators/retrieve/sink.h`: encoding mode API
- `src/execution/common/operators/retrieve/sink.cc`: serialization optimization
- `tools/python_bind/tests/`: 3 new test files, 21 new tests