feat: add shortest path return to BFS and SSSP algorithms #574

Open
longbinlai wants to merge 9 commits into alibaba:main from longbinlai:feat/gds-shortest-path-return
@longbinlai

Collaborator

Summary

Add shortest path return to GDS BFS and SSSP algorithms, plus path serialization optimizations that benefit all path-returning queries.

Changes

Commit 1: feat(gds): add shortest path return to BFS and SSSP algorithms

  • path output column via YIELD clause (pure YIELD-based, no config options)
  • Predecessor tracking in BFS/SSSP/BFSPred/SSSPPred (O(V) memory)
  • Real edge data lookup from CSR in path reconstruction
  • Struct-wrapped DataType for kPath compatibility with type converter
  • Zero overhead when path is not YIELDed
  • All changes confined to extension/gds/ — no system code modified

Commit 2-3: perf: optimize path JSON serialization

  • Eliminate serialize→parse→copy string round-trip in convert_path_to_json (sink.cc)
  • Add direct rapidjson build helpers: build_vertex_json_value / build_edge_json_value
  • Backward compatible: convert_vertex_to_json / convert_edge_to_json unchanged for standalone callers

Commit 4: feat(gds): add path_properties configuration

  • Lightweight mode (default): Only _ID, _LABEL, PK for nodes; _ID, _LABEL, _SRC_ID, _DST_ID for edges
    • Performance: ~0.54s on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths)
  • Full mode (path_properties: 'full'): All vertex and edge properties
    • Performance: ~1.26s (2.34x slower)
  • Thread-local encoding flag in sink.cc with public API in sink.h

Usage

// Distance only (backward compatible, zero overhead)
CALL bfs('g', {source: '0'}) YIELD node, distance RETURN node.id, distance;

// With path return (lightweight, default)
CALL bfs('g', {source: '0'}) YIELD node, distance, path RETURN node.id, distance, path;

// With path return (full properties)
CALL bfs('g', {source: '0', path_properties: 'full'}) YIELD node, distance, path RETURN node.id, distance, path;

Testing

| Suite | Tests | Result |
|---|---|---|
| GDS path return (test_gds_path.py) | 21 | ✅ all pass |
| GDS path encoding modes | 5 | ✅ all pass |
| LDBC SF10 path tests (test_gds_path_ldbc.py) | 18 | ✅ all pass |
| Existing GDS tests (test_gds.py) | 36 | ✅ all pass |
| LDBC Graphalytics conformance (test_graphalytics.py) | 24 | ✅ all pass |
| Other test suites (test_db_cases, test_query, etc.) | 300+ | ✅ no regressions |

Performance (LDBC SNB SF10: 65K nodes, 1.9M edges, 62K paths)

| Configuration | Time | vs Baseline |
|---|---|---|
| BFS no path | 0.025s | |
| BFS + path (lightweight) | 0.54s | 6.5x from no-path |
| BFS + path (full) | 1.26s | 2.34x from lightweight |

Files Changed

  • extension/gds/ — 12 files (path return implementation + configuration)
  • include/neug/execution/common/operators/retrieve/sink.h — encoding mode API
  • src/execution/common/operators/retrieve/sink.cc — serialization optimization
  • tools/python_bind/tests/ — 3 new test files, 21 new tests

Add `path` output column to BFS and SSSP GDS algorithms, allowing users
to retrieve the actual shortest path (vertex + edge sequence) via the
standard Cypher YIELD clause.

## Cypher API

```cypher
// Distance only (backward compatible, zero overhead)
CALL bfs('g', {source: '0'}) YIELD node, distance RETURN node.id, distance;

// With path return (path computation triggered by YIELDing path)
CALL bfs('g', {source: '0'}) YIELD node, distance, path
RETURN node.id, distance, path;
```

## Key Design Decisions

- **Pure YIELD-based API**: no config options; path computation only
  happens when the `path` column is explicitly YIELDed
- **Zero overhead when not requested**: predecessor array is not
  allocated when path is not in the YIELD clause
- **Native kPath type**: returns standard NeuG Path objects with real
  edge data pointers looked up from the CSR graph view, supporting all
  Cypher path functions (nodes(), relationships(), length())
- **Struct-wrapped DataType**: the path column uses a properly
  constructed StructTypeInfo (with _NODES and _RELS fields) to match
  the type converter's expectations, via a bind-time wrapper
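
The "zero overhead when not requested" decision can be illustrated with a minimal sequential sketch. It assumes a CSR graph given as two plain lists (`offsets`, `targets`) and a list of YIELDed column names; the function name `bfs` and its signature are hypothetical and do not mirror the actual code in extension/gds/.

```python
from collections import deque
from typing import List, Optional, Sequence, Tuple


def bfs(
    offsets: Sequence[int],
    targets: Sequence[int],
    source: int,
    yield_columns: Sequence[str],
) -> Tuple[List[int], Optional[List[int]]]:
    """Sequential BFS over a CSR graph (hypothetical helper, not the PR's code).

    The predecessor array is allocated only when "path" is among the
    YIELDed columns; otherwise None is returned in its place.
    """
    num_vertices = len(offsets) - 1
    distance = [-1] * num_vertices  # -1 marks "not reached"
    predecessor = [-1] * num_vertices if "path" in yield_columns else None

    distance[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        # Out-edges of u occupy targets[offsets[u]:offsets[u + 1]].
        for e in range(offsets[u], offsets[u + 1]):
            v = targets[e]
            if distance[v] == -1:  # first visit wins
                distance[v] = distance[u] + 1
                if predecessor is not None:
                    predecessor[v] = u
                queue.append(v)
    return distance, predecessor


if __name__ == "__main__":
    # Edges: 0->1, 0->2, 1->3, 2->3
    offsets, targets = [0, 2, 3, 4, 4], [1, 2, 3, 3]
    print(bfs(offsets, targets, 0, ["node", "distance"]))
    print(bfs(offsets, targets, 0, ["node", "distance", "path"]))
```

In this sketch the only per-edge cost of the path feature is one `is not None` check, and the O(V) array exists only for queries that ask for `path`.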

## Implementation

- Adds `predecessors_[]` array to BFS/BFSPred/SSSP/SSSPPred (O(V) memory)
- BFS parallel: predecessor write inside CAS-success branch (no extra sync)
- SSSP parallel: predecessor write after successful distance relaxation
- Sequential variants (pred): plain writes, no atomics needed
- Path reconstruction in sink() walks predecessor chain, looks up real
  edge data from CSR, builds PathColumn via PathColumnBuilder
- New `path_utils.h` with `reconstruct_path()`, `build_path_from_chain()`,
  `buildPathDataType()`, and `wrapTableBindFuncWithPathFix()`

## Testing

- 16 new path return tests (test_gds_path.py): basic, backward compat,
  predicates, custom graphs, edge cases
- All 76 tests pass (24 Graphalytics + 36 existing GDS + 16 new path tests)
- LDBC Graphalytics conformance verified with and without path return

The convert_path_to_json function was serializing each vertex/edge to a
JSON string, then parsing it back into a rapidjson Document, then copying
it into the parent array — three operations where one suffices.

Add build_vertex_json_value and build_edge_json_value helpers that
construct rapidjson objects directly into the provided allocator, and
have convert_path_to_json use them instead of the string-based
convert_vertex_to_json / convert_edge_to_json.

The original string-returning functions are kept unchanged for their
standalone callers in add_column (kVertex/kEdge cases).
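
The difference between the two strategies can be sketched in Python, with the standard `json` module standing in for rapidjson and plain dicts standing in for rapidjson values. The function bodies and the `_NODES` output layout used here are illustrative assumptions, not the code in sink.cc.

```python
import json
from typing import Dict, List


def convert_vertex_to_json(vertex: Dict) -> str:
    """String-returning converter, kept for standalone callers."""
    return json.dumps(vertex, sort_keys=True)


def path_to_json_roundtrip(vertices: List[Dict]) -> str:
    """Old shape: serialize each vertex, parse it back, copy it into the parent array."""
    nodes = []
    for vertex in vertices:
        text = convert_vertex_to_json(vertex)  # 1. serialize to a string
        parsed = json.loads(text)              # 2. parse the string again
        nodes.append(dict(parsed))             # 3. copy into the parent array
    return json.dumps({"_NODES": nodes}, sort_keys=True)


def build_vertex_json_value(vertex: Dict, parent: List[Dict]) -> None:
    """New shape: build the value once, directly inside the parent array."""
    parent.append({key: vertex[key] for key in vertex})


def path_to_json_direct(vertices: List[Dict]) -> str:
    nodes: List[Dict] = []
    for vertex in vertices:
        build_vertex_json_value(vertex, nodes)
    return json.dumps({"_NODES": nodes}, sort_keys=True)


if __name__ == "__main__":
    sample = [{"_ID": 0, "_LABEL": "person"}, {"_ID": 7, "_LABEL": "person"}]
    print(path_to_json_roundtrip(sample) == path_to_json_direct(sample))
```

Both functions produce the same document; the direct variant simply skips the intermediate string and the second parse for every element of the path.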

Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths):
  BFS with path return: 1.63s → 1.22s (25% faster)
…ncoding

Two optimizations to convert_path_to_json in sink.cc:

1. Direct rapidjson build: replace the serialize→parse→copy string
   round-trip with build_vertex_json_value / build_edge_json_value
   helpers that construct rapidjson objects directly into the allocator.

2. Lightweight path encoding: path output now encodes only _ID, _LABEL,
   and PK for nodes (plus _SRC_ID, _DST_ID for edges), skipping all
   non-PK property lookups. Users can retrieve full node/edge details
   with a separate MATCH query when needed.
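
A minimal sketch of the lightweight field selection follows, assuming vertices and edges are plain dicts and the primary-key property name is passed in. `encode_vertex` and `encode_edge` are hypothetical names, and the `_ID` on edges follows the field list given in the PR summary.

```python
from typing import Dict


def encode_vertex(vertex: Dict, primary_key: str, full: bool = False) -> Dict:
    """Lightweight: keep only _ID, _LABEL and the primary key; full: keep everything."""
    if full:
        return dict(vertex)
    return {key: vertex[key] for key in ("_ID", "_LABEL", primary_key)}


def encode_edge(edge: Dict, full: bool = False) -> Dict:
    """Lightweight: keep only the structural fields of the edge."""
    if full:
        return dict(edge)
    return {key: edge[key] for key in ("_ID", "_LABEL", "_SRC_ID", "_DST_ID")}


if __name__ == "__main__":
    person = {"_ID": 3, "_LABEL": "person", "id": 933, "firstName": "Ada"}
    knows = {"_ID": 12, "_LABEL": "knows", "_SRC_ID": 3, "_DST_ID": 5,
             "creationDate": "2010-01-01"}
    print(encode_vertex(person, primary_key="id"))
    print(encode_edge(knows))
```

The saving in this sketch comes from never touching non-key properties such as `firstName` or `creationDate`, which corresponds to the skipped property lookups described in item 2.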

Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths, 3-round avg):

  BFS yield+return path:  1.63s → 0.50s  (3.3x faster, ↓ 70%)

  Breakdown of improvement:
    - String round-trip elimination:  1.63s → 1.31s  (↓ 20%)
    - Lightweight encoding (skip property I/O):  1.31s → 0.50s  (↓ 62%)
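
The percentages in the breakdown can be re-derived from the three timings quoted above. The helper name `reduction` is only for illustration.

```python
def reduction(before: float, after: float) -> float:
    """Fraction of the original time that was removed."""
    return (before - after) / before


# Timings in seconds, copied from the measurements above.
baseline, direct_build, lightweight = 1.63, 1.31, 0.50

if __name__ == "__main__":
    print(f"round-trip elimination: {reduction(baseline, direct_build):.1%}")
    print(f"lightweight encoding:   {reduction(direct_build, lightweight):.1%}")
    print(f"overall reduction:      {reduction(baseline, lightweight):.1%}")
    print(f"overall speedup:        {baseline / lightweight:.2f}x")
```

Each stage is measured against the time left by the previous one, so the two stage percentages do not add up to the overall figure; the overall reduction comes out just under 70%.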

All 94 GDS tests pass. No regressions in other test suites.
…path encoding

Add configurable path encoding mode to control performance vs completeness:

- **Lightweight mode (default)**: Only encodes structural info (_ID, _LABEL, PK for nodes; _ID, _LABEL, _SRC_ID, _DST_ID for edges)
  - Performance: ~0.54s on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths)

- **Full mode**: Encodes all vertex and edge properties
  - Performance: ~1.26s (2.34x slower than lightweight)

Usage:
  CALL bfs('graph', {source: '0'})  // lightweight (default)
  CALL bfs('graph', {source: '0', path_properties: 'full'})  // full

Implementation:
- Add thread-local flag in sink.cc to control encoding mode
- Add set_path_full_encoding()/get_path_full_encoding() API in sink.h
- Add configure_path_encoding() helper in path_utils.h
- Parse path_properties option in bfs.cc and sssp.cc
- Update build_vertex_json_value() and build_edge_json_value() to check flag
- Add 5 new tests in test_gds_path.py::TestPathEncodingModes
- Update spec document with configuration details
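
The flag-plus-helper arrangement can be approximated in Python with `threading.local`. The function names mirror those listed above, but the bodies, the accepted `'lightweight'` value, and the error handling are assumptions rather than the actual C++ implementation.

```python
import threading

# Per-thread storage for the encoding mode (stand-in for a C++ thread_local flag).
_encoding_state = threading.local()


def set_path_full_encoding(enabled: bool) -> None:
    _encoding_state.full = enabled


def get_path_full_encoding() -> bool:
    # Threads that never set the flag see the lightweight default.
    return getattr(_encoding_state, "full", False)


def configure_path_encoding(options: dict) -> None:
    """Map the path_properties option onto the flag (accepted values are assumed)."""
    mode = options.get("path_properties", "lightweight")
    if mode not in ("lightweight", "full"):
        raise ValueError(f"unsupported path_properties value: {mode!r}")
    set_path_full_encoding(mode == "full")


def flag_seen_by_new_thread() -> bool:
    """Read the flag from a freshly started thread."""
    seen = []
    worker = threading.Thread(target=lambda: seen.append(get_path_full_encoding()))
    worker.start()
    worker.join()
    return seen[0]


if __name__ == "__main__":
    configure_path_encoding({"source": "0", "path_properties": "full"})
    print(get_path_full_encoding())
    print(flag_seen_by_new_thread())
```

One consequence of thread-local state, at least in this sketch, is that a thread which did not call the setter sees the default mode, so the flag has to be set on whichever thread performs the serialization.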

All 57 GDS tests pass. No regressions in other test suites.
@CLAassistant

CLAassistant commented Jun 20, 2026


CLA assistant check
All committers have signed the CLA.

@longbinlai longbinlai changed the title from "feat(gds): add shortest path return to BFS and SSSP algorithms" to "feat: add shortest path return to BFS and SSSP algorithms" Jun 20, 2026
- Add 'Shortest Path Return' section to doc/source/extensions/load_gds.md
- Document path_properties option (lightweight/full modes)
- Update BFS and SSSP sections with path examples
- Update Algorithm Summary table
- Remove technical spec file (not for users)
@longbinlai longbinlai force-pushed the feat/gds-shortest-path-return branch from 9b83082 to 0262596 Compare June 20, 2026 04:54
….1 + clang-format 10.0.1)

- Remove build_vertex_json_value_light and build_edge_json_value_light
  (defined but not used, causes -Werror=unused-function in CI)
- Re-run isort with pinned version 5.10.1 (CI uses 5.10.1, not 6.x)
- Re-run black on Python test files
@longbinlai longbinlai force-pushed the feat/gds-shortest-path-return branch from f394613 to a59f3d6 Compare June 20, 2026 08:54

2 participants