feat(duckdb,sqlite): add invalidate_instance for snapshot-driven cache invalidation#635

Merged
phillipleblanc merged 2 commits into spiceai-52 from
phillip/invalidate-instance
May 3, 2026

Conversation

@phillipleblanc
Collaborator

Summary

DuckDBTableProviderFactory and SqliteTableProviderFactory cache pool instances by DbInstanceKey (file path or memory). There is currently no way to evict an entry from that registry, so a caller that legitimately needs to swap the underlying database file out-of-band — for example, after restoring it from a snapshot in object storage — cannot force the next provider build to open a fresh pool. Subsequent providers reuse the prior pool's already-open connections, which keep observing the previously-opened file inode rather than the replacement on disk.

This PR adds a small, targeted invalidation API on both factories:

  • DuckDBTableProviderFactory::invalidate_instance(&DbInstanceKey) -> Option<DuckDbConnectionPool>
  • DuckDBTableProviderFactory::invalidate_file_instance(impl Into<Arc<str>>) -> Option<DuckDbConnectionPool>
  • SqliteTableProviderFactory::invalidate_instance(&DbInstanceKey) -> Option<SqliteConnectionPool>
  • SqliteTableProviderFactory::invalidate_file_instance(impl Into<Arc<str>>) -> Option<SqliteConnectionPool>

Semantics:

  • Removes the cached pool entry and returns it (so the caller can hold the prior pool alive for in-flight queries to drain) without disturbing other entries.
  • Re-invalidating an absent key is a no-op (returns None).
  • Existing connections held by other clones of the previously returned pool keep operating against the file descriptor they opened; once they are dropped the OS releases the prior inode. New providers built after invalidation will open the file fresh and observe the replacement contents.

Motivation

Used by spiceai/spiceai#10565 (refresh_mode: snapshot) to allow accelerator providers to be cleanly reloaded after a snapshot is restored to the local file. Manual end-to-end testing without this API confirmed that the swap path produces a fresh provider but queries continue to read stale data from the cached connection pool.

Tests

  • duckdb::tests::invalidate_instance_drops_cached_pool — insertion, idempotent reuse, eviction, no-op missing-key, repopulation.
  • sqlite::tests::invalidate_instance_drops_cached_pool — same coverage including the file-path convenience wrapper.

Both pass: cargo test --lib --features duckdb,sqlite,duckdb-federation invalidate_instance. Clippy clean.

Note on backporting

Branched from the spiceai-52 rev currently consumed by Spice (d7fae5f0). Targeting main here as the canonical upstream landing; will request a backport to spiceai-52 once accepted.

phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 28, 2026
…eshes

Address Copilot review comments and bugs found via manual testing
against an S3 snapshot bucket:

- runtime-acceleration: download_to_file now uses temp-file + atomic
  rename so concurrent readers in refresh_mode: snapshot never observe
  a partially written file (download is also fsync'd before rename for
  crash safety).
- runtime-acceleration: download_if_newer uses strict >, never regresses
  to an older remote snapshot id; warns when remote is older than local.
- runtime: SnapshotRefreshState::current_snapshot_id is now
  Arc<Mutex<Option<u64>>> rather than AtomicU64+sentinel-0, so snapshot
  id 0 (a valid first-snapshot id) is no longer indistinguishable from
  "no snapshot loaded".
- build_snapshot_refresh_state: bootstrap_loaded_id is passed through
  directly without sentinel collapsing.
- swappable: doc comments now describe poison-recovery behavior
  accurately.
- New test: download_if_newer rejects regression to older remote id.

Bugs surfaced via manual S3 testing:

- build_snapshot_refresh_state was constructing SnapshotManager without
  a checkpointer factory, so refresh-time download_latest_snapshot
  failed with "Dataset checkpointer factory not set". Now wires the
  same factory used by the bootstrap path.
- refresh_from_snapshot validated snapshot schema against the federated
  schema. For snapshot mode the federated source is intentionally not
  consulted at refresh time and may diverge (e.g. empty stub source).
  Now validates against the SwappableTableProvider's cached schema.
- Schema validation was using strict arrow Schema equality, which
  rejected logically-identical schemas that differ only in
  engine-attached metadata or nullability flags. Replaced with
  schemas_compatible() that compares field count, ordered names, and
  data types only.

Pool cache invalidation:

- DuckDB and SQLite reload_from_snapshot now call the new upstream
  invalidate_file_instance / invalidate_instance APIs on the
  TableProviderFactory before rebuilding. Without this eviction the
  newly-built provider would reuse the prior pool's open connections
  (which keep observing the previously-opened file inode) and queries
  would return stale data even after the file is replaced on disk.
- Cargo.toml bumped to consume the upstream PR
  (datafusion-contrib/datafusion-table-providers#635).

Original architecture:

A new refresh mode that, when paired with snapshots, drives accelerator
data exclusively from snapshots in object storage. The federated source
is never queried for refreshes; instead the runtime polls the snapshot
store on a configurable cadence and reloads the accelerator only when
a strictly newer snapshot is available.

- spicepod / runtime: new RefreshMode::Snapshot enum variant.
- runtime-acceleration: SnapshotDownloadInfo carries snapshot_id; new
  remote_current_snapshot_id() and download_if_newer().
- DataAccelerator trait: new supports_snapshot_reload() and
  reload_from_snapshot(source, previous_provider, factory). Implemented
  for DuckDB, SQLite, Cayenne, Turso.
- New SwappableTableProvider wrapper for atomic provider swap.
- AcceleratedTable / Refresher / RefreshTaskRunner / RefreshTask carry
  optional SnapshotRefreshState. RefreshTask::refresh_from_snapshot()
  short-circuits the federated fetch path.
- create_accelerated_table validates configuration and auto-applies a
  default refresh_check_interval (1m) for snapshot mode.

Manual end-to-end test against an S3 snapshot bucket confirms the
full flow: writer creates snapshots, reader polls, detects newer ids,
downloads atomically, evicts the cached pool, rebuilds the provider,
swaps under SwappableTableProvider, and serves the new data on
subsequent queries.
@phillipleblanc phillipleblanc force-pushed the phillip/invalidate-instance branch from 6bf50e6 to 577d85f Compare April 28, 2026 13:38
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 28, 2026
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 28, 2026
…eshes

Address Copilot review comments and bugs found via manual testing
against an S3 snapshot bucket:

- runtime-acceleration: download_to_file now uses temp-file + atomic
  rename so concurrent readers in refresh_mode: snapshot never observe
  a partially written file (download is also fsync'd before rename for
  crash safety). Adds an explicit AlreadyExists fallback for older
  Windows configurations where MoveFileExW does not honor the
  REPLACE_EXISTING flag.
- runtime-acceleration: download_if_newer uses strict >, never regresses
  to an older remote snapshot id; warns when remote is older than local.
- runtime: SnapshotRefreshState::current_snapshot_id is now
  Arc<Mutex<Option<u64>>> rather than AtomicU64+sentinel-0, so snapshot
  id 0 (a valid first-snapshot id) is no longer indistinguishable from
  "no snapshot loaded".
- build_snapshot_refresh_state: bootstrap_loaded_id is passed through
  directly without sentinel collapsing.
- swappable: doc comments now describe poison-recovery behavior
  accurately.
- New test: download_if_newer rejects regression to older remote id.

Bugs surfaced via manual S3 testing:

- build_snapshot_refresh_state was constructing SnapshotManager without
  a checkpointer factory, so refresh-time download_latest_snapshot
  failed with "Dataset checkpointer factory not set". Now wires the
  same factory used by the bootstrap path.
- refresh_from_snapshot validated snapshot schema against the federated
  schema. For snapshot mode the federated source is intentionally not
  consulted at refresh time and may diverge (e.g. empty stub source).
  Now validates against the SwappableTableProvider's cached schema.
- Schema validation was using strict arrow Schema equality, which
  rejected logically-identical schemas that differ only in
  engine-attached metadata or nullability flags. Replaced with
  schemas_compatible() that compares field count, ordered names, and
  data types only.

Pool cache invalidation:

- DuckDB and SQLite reload_from_snapshot now call the new upstream
  invalidate_file_instance / invalidate_instance APIs on the
  TableProviderFactory before rebuilding. Without this eviction the
  newly-built provider would reuse the prior pool's open connections
  (which keep observing the previously-opened file inode) and queries
  would return stale data even after the file is replaced on disk.
- Cargo.toml bumped to consume the upstream PR
  (datafusion-contrib/datafusion-table-providers#635).

Error context preservation:

- build_snapshot_refresh_state no longer swallows FilePathError from
  get_acceleration_layout; a new SnapshotRefreshModeLayoutUnavailable
  variant carries the original error so snapshot-mode config issues
  surface with their root cause (which engine, which path).

Original architecture:

A new refresh mode that, when paired with snapshots, drives accelerator
data exclusively from snapshots in object storage. The federated source
is never queried for refreshes; instead the runtime polls the snapshot
store on a configurable cadence and reloads the accelerator only when
a strictly newer snapshot is available.

- spicepod / runtime: new RefreshMode::Snapshot enum variant.
- runtime-acceleration: SnapshotDownloadInfo carries snapshot_id; new
  remote_current_snapshot_id() and download_if_newer().
- DataAccelerator trait: new supports_snapshot_reload() and
  reload_from_snapshot(source, previous_provider, factory). Implemented
  for DuckDB, SQLite, Cayenne, Turso. Engines that opt out (e.g.
  TableModePartitionedDuckDB) are explicitly rejected at config
  validation with a clear error.
- New SwappableTableProvider wrapper for atomic provider swap.
- AcceleratedTable / Refresher / RefreshTaskRunner / RefreshTask carry
  optional SnapshotRefreshState. RefreshTask::refresh_from_snapshot()
  short-circuits the federated fetch path.
- create_accelerated_table validates configuration and auto-applies a
  default refresh_check_interval (1m) for snapshot mode.

Manual end-to-end test against an S3 snapshot bucket confirms the
full flow: writer creates snapshots, reader polls, detects newer ids,
downloads atomically, evicts the cached pool, rebuilds the provider,
swaps under SwappableTableProvider, and serves the new data on
subsequent queries.
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 28, 2026
@phillipleblanc phillipleblanc changed the base branch from main to spiceai-52 April 29, 2026 00:54
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 29, 2026
@phillipleblanc phillipleblanc marked this pull request as draft April 30, 2026 01:04
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 30, 2026
@phillipleblanc phillipleblanc force-pushed the phillip/invalidate-instance branch from 577d85f to e2ecafa Compare April 30, 2026 08:58
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 30, 2026
@phillipleblanc phillipleblanc force-pushed the phillip/invalidate-instance branch from e2ecafa to 346381c on April 30, 2026 15:26
…e invalidation

DuckDBTableProviderFactory and SqliteTableProviderFactory cache pool
instances by DbInstanceKey (file path or memory). Without an
invalidation API, callers that legitimately need to swap the underlying
database file out-of-band — for example, after restoring it from a
snapshot in object storage — have no way to force the next provider
build to open a fresh pool. Subsequent providers reuse the prior pool's
already-open connections, which keep observing the previously-opened
file inode rather than the replacement on disk.

Add:

- DuckDBTableProviderFactory::invalidate_instance(&DbInstanceKey)
- DuckDBTableProviderFactory::invalidate_file_instance(path)
- SqliteTableProviderFactory::invalidate_instance(&DbInstanceKey)
- SqliteTableProviderFactory::invalidate_file_instance(path)

Each removes the cached pool entry and returns it (so the caller can
hold the prior pool alive for in-flight queries to drain) without
disturbing other entries. Re-invalidating an absent key is a no-op.

Tests cover: insertion, idempotent reuse, eviction, no-op missing-key,
and re-population.
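
The eviction semantics described above can be sketched with std types only. This is a minimal illustration, not the crate's actual implementation: `DbInstanceKey` here mirrors the file/memory split, and `Pool` is a simplified stand-in for the real connection pool.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, PartialEq, Eq, Hash)]
pub enum DbInstanceKey {
    File(Arc<str>),
    Memory,
}

#[derive(Clone)]
pub struct Pool {
    pub opened_path: Arc<str>, // stand-in for the real connection pool
}

#[derive(Default)]
pub struct Factory {
    instances: Mutex<HashMap<DbInstanceKey, Pool>>,
}

impl Factory {
    /// Cache a pool under `key`, reusing an existing entry if present.
    pub fn get_or_create(&self, key: DbInstanceKey, path: &str) -> Pool {
        self.instances
            .lock()
            .unwrap()
            .entry(key)
            .or_insert_with(|| Pool { opened_path: Arc::from(path) })
            .clone()
    }

    /// Remove and return the cached pool so the caller can drain it;
    /// an absent key is a no-op (`None`). Other entries are untouched.
    pub fn invalidate_instance(&self, key: &DbInstanceKey) -> Option<Pool> {
        self.instances.lock().unwrap().remove(key)
    }

    pub fn invalidate_file_instance(&self, path: impl Into<Arc<str>>) -> Option<Pool> {
        self.invalidate_instance(&DbInstanceKey::File(path.into()))
    }
}
```

The returned `Option<Pool>` is what lets the caller keep the prior pool alive while in-flight queries drain; a subsequent `get_or_create` opens fresh.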
@phillipleblanc phillipleblanc force-pushed the phillip/invalidate-instance branch from 346381c to 96482d2 on April 30, 2026 15:27
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Apr 30, 2026
…eshes

Address Copilot review comments and bugs found via manual testing
against an S3 snapshot bucket:

- runtime-acceleration: download_to_file now uses temp-file + atomic
  rename so concurrent readers in refresh_mode: snapshot never observe
  a partially written file (download is also fsync'd before rename for
  crash safety). Adds an explicit AlreadyExists fallback for older
  Windows configurations where MoveFileExW does not honor the
  REPLACE_EXISTING flag.
- runtime-acceleration: download_if_newer uses strict >, never regresses
  to an older remote snapshot id; warns when remote is older than local.
- runtime: SnapshotRefreshState::current_snapshot_id is now
  Arc<Mutex<Option<u64>>> rather than AtomicU64+sentinel-0, so snapshot
  id 0 (a valid first-snapshot id) is no longer indistinguishable from
  "no snapshot loaded".
- build_snapshot_refresh_state: bootstrap_loaded_id is passed through
  directly without sentinel collapsing.
- swappable: doc comments now describe poison-recovery behavior
  accurately.
- New test: download_if_newer rejects regression to older remote id.
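
Two of the fixes above, the `Option<u64>` id (no zero sentinel) and the strict `>` comparison, can be sketched together. The function name is illustrative, not the runtime's real API:

```rust
/// Returns the remote snapshot id only when it should be downloaded.
pub fn should_download(local: Option<u64>, remote: u64) -> Option<u64> {
    match local {
        // Nothing loaded yet: any remote snapshot is an upgrade,
        // including id 0, which a zero sentinel could not represent.
        None => Some(remote),
        // Strict `>`: equal or older remote ids are ignored, so the
        // accelerator never regresses to an older snapshot.
        Some(current) if remote > current => Some(remote),
        Some(_) => None,
    }
}
```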

Bugs surfaced via manual S3 testing:

- build_snapshot_refresh_state was constructing SnapshotManager without
  a checkpointer factory, so refresh-time download_latest_snapshot
  failed with "Dataset checkpointer factory not set". Now wires the
  same factory used by the bootstrap path.
- refresh_from_snapshot validated snapshot schema against the federated
  schema. For snapshot mode the federated source is intentionally not
  consulted at refresh time and may diverge (e.g. empty stub source).
  Now validates against the SwappableTableProvider's cached schema.
- Schema validation was using strict arrow Schema equality, which
  rejected logically-identical schemas that differ only in
  engine-attached metadata or nullability flags. Replaced with
  schemas_compatible() that compares field count, ordered names, and
  data types only.
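
The relaxed comparison above can be sketched over simplified field tuples rather than real arrow types; `Field` here is a stand-in, with `data_type` as a plain string:

```rust
pub struct Field {
    pub name: String,
    pub data_type: String, // stand-in for arrow's DataType
    pub nullable: bool,    // intentionally ignored by the comparison
}

/// Field count, ordered names, and data types must match;
/// nullability and metadata do not participate.
pub fn schemas_compatible(a: &[Field], b: &[Field]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| x.name == y.name && x.data_type == y.data_type)
}
```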

Pool cache invalidation:

- DuckDB and SQLite reload_from_snapshot now call the new upstream
  invalidate_file_instance / invalidate_instance APIs on the
  TableProviderFactory before rebuilding. Without this eviction the
  newly-built provider would reuse the prior pool's open connections
  (which keep observing the previously-opened file inode) and queries
  would return stale data even after the file is replaced on disk.
- Cargo.toml bumped to consume the upstream PR
  (datafusion-contrib/datafusion-table-providers#635).

Error context preservation:

- build_snapshot_refresh_state no longer swallows FilePathError from
  get_acceleration_layout; a new SnapshotRefreshModeLayoutUnavailable
  variant carries the original error so snapshot-mode config issues
  surface with their root cause (which engine, which path).

Original architecture:

A new refresh mode that, when paired with snapshots, drives accelerator
data exclusively from snapshots in object storage. The federated source
is never queried for refreshes; instead the runtime polls the snapshot
store on a configurable cadence and reloads the accelerator only when
a strictly newer snapshot is available.

- spicepod / runtime: new RefreshMode::Snapshot enum variant.
- runtime-acceleration: SnapshotDownloadInfo carries snapshot_id; new
  remote_current_snapshot_id() and download_if_newer().
- DataAccelerator trait: new supports_snapshot_reload() and
  reload_from_snapshot(source, previous_provider, factory). Implemented
  for DuckDB, SQLite, Cayenne, Turso. Engines that opt out (e.g.
  TableModePartitionedDuckDB) are explicitly rejected at config
  validation with a clear error.
- New SwappableTableProvider wrapper for atomic provider swap.
- AcceleratedTable / Refresher / RefreshTaskRunner / RefreshTask carry
  optional SnapshotRefreshState. RefreshTask::refresh_from_snapshot()
  short-circuits the federated fetch path.
- create_accelerated_table validates configuration and auto-applies a
  default refresh_check_interval (1m) for snapshot mode.
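
The atomic-swap wrapper's core shape can be sketched with a `RwLock` over an `Arc`; `T` stands in for `Arc<dyn TableProvider>`, and the real wrapper additionally caches schema, constraints, and table type:

```rust
use std::sync::{Arc, RwLock};

pub struct Swappable<T> {
    inner: RwLock<Arc<T>>,
}

impl<T> Swappable<T> {
    pub fn new(provider: Arc<T>) -> Self {
        Self { inner: RwLock::new(provider) }
    }

    /// Queries grab a clone of the current provider; the lock is held
    /// only long enough to copy the Arc.
    pub fn current(&self) -> Arc<T> {
        Arc::clone(&self.inner.read().unwrap())
    }

    /// Atomically install the rebuilt provider and return the prior
    /// one; in-flight queries keep the Arc they already cloned.
    pub fn swap(&self, next: Arc<T>) -> Arc<T> {
        std::mem::replace(&mut *self.inner.write().unwrap(), next)
    }
}
```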

Manual end-to-end test against an S3 snapshot bucket confirms the
full flow: writer creates snapshots, reader polls, detects newer ids,
downloads atomically, evicts the cached pool, rebuilds the provider,
swaps under SwappableTableProvider, and serves the new data on
subsequent queries.
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request May 2, 2026
@phillipleblanc phillipleblanc self-assigned this May 3, 2026
@phillipleblanc phillipleblanc marked this pull request as ready for review May 3, 2026 12:29
@phillipleblanc phillipleblanc merged commit 97ecd00 into spiceai-52 May 3, 2026
12 checks passed
@phillipleblanc phillipleblanc deleted the phillip/invalidate-instance branch May 3, 2026 14:58
pull Bot pushed a commit to TheRakeshPurohit/spiceai that referenced this pull request May 4, 2026
…tore slice (stacked) (spiceai#10651)

* fix(snapshot): flush SQLite/Turso WAL before snapshotting

SQLite and Turso accelerators run with WAL journal mode. The pre-existing
snapshot path performed `fs::copy(live_db, temp_copy)` directly, which
captured only the durable pages — every uncheckpointed write (typically
all writes since the last automatic checkpoint) lived only in the
`-wal` sidecar and was silently lost in the upload.

This bug was invisible under `refresh_mode: full|append|changes`
because the federated source repopulates on bootstrap. It surfaces as
data loss with `refresh_mode: snapshot` (separate change) where the
snapshot is the only data source.

Fixes by:

  1. Adding `SnapshotEngine::checkpoint_live` (default no-op) — invoked
     by `SnapshotManager::create_file_snapshot` *before* `fs::copy`
     while the accelerator write lock is held.

  2. New `SqliteSnapshotEngine` overrides `checkpoint_live` to run
     `PRAGMA wal_checkpoint(TRUNCATE)` against the live file via a
     short-lived rusqlite connection. WAL mode allows multiple
     connections; the existing accelerator write lock guarantees no
     concurrent writers race the checkpoint.

  3. `SqliteSnapshotEngine::prepare_for_upload` switches the *copied*
     file to `journal_mode=DELETE` as defense-in-depth: even if the
     copy ever ended up with stale `-wal`/`-shm` sidecars next to it,
     the uploaded file is guaranteed to be self-contained.

  4. New `TursoSnapshotEngine` reuses the SqliteSnapshotEngine logic
     (libsql is on-disk-compatible with SQLite). The `turso` feature
     now pulls in rusqlite for this purpose.

  5. `create_snapshot_engine` now returns the engine-specific impls
     for `AccelerationEngine::Sqlite` and `::Turso` instead of
     `DefaultSnapshotEngine`.
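
The hook shape from step 1 (a default no-op on the trait, with an engine-specific override) can be sketched without a real database call; the return value here is just a label so the dispatch is observable, not the actual signature:

```rust
use std::path::Path;

pub trait SnapshotEngine {
    /// Called before `fs::copy` while the accelerator write lock is held.
    fn checkpoint_live(&self, _live_db: &Path) -> &'static str {
        "no-op" // default: engines without a WAL need no flush
    }
}

pub struct DefaultSnapshotEngine;
impl SnapshotEngine for DefaultSnapshotEngine {}

pub struct SqliteSnapshotEngine;
impl SnapshotEngine for SqliteSnapshotEngine {
    fn checkpoint_live(&self, _live_db: &Path) -> &'static str {
        // Real impl: open a short-lived connection to the live file and
        // run `PRAGMA wal_checkpoint(TRUNCATE)` before the copy.
        "wal_checkpoint(TRUNCATE)"
    }
}
```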

Closes spiceai#10643

* feat(acceleration): add refresh_mode: snapshot for snapshot-only refreshes

Adds a new refresh mode that loads accelerated tables only by polling
the snapshot store for newer snapshots, never querying the federated
source. Use cases: read replicas, expensive/slow sources, air-gapped
propagation.

Behavior:

  * `RefreshMode::Snapshot` is a new variant accepted in spicepod
    `acceleration.refresh_mode` (and is rejected by the validator
    unless `acceleration.snapshots` is enabled).

  * On startup, the dataset bootstraps from the snapshot store as
    today.

  * The refresh task polls the snapshot store on a configurable
    interval (default 60 s) via the new
    `SnapshotManager::download_if_newer(current_local_id, validate_schema)`
    API. The schema validator runs against the snapshot's
    `metadata.json` *before* download so a mismatched snapshot can
    never overwrite the primary file.

  * Per-engine reload is performed via a new
    `Accelerator::reload_from_snapshot` trait method. DuckDB/SQLite/
    Turso impls evict pool connections and reopen at the same primary
    path; the live federated query path serves the prior inode until
    the swap completes via `SwappableTableProvider` (RwLock-backed
    `Arc<dyn TableProvider>` that caches schema/constraints/table_type).

  * `INSERT INTO <dataset>` is rejected with a clear error when the
    accelerator's refresh_mode is Snapshot.

  * Reload + swap is serialized against concurrent accelerator writes
    via the existing accelerator write mutex.

  * Atomic file write on download (temp + fsync + rename + parent
    fsync).
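
The atomic-write pattern above can be sketched with std alone: write a sibling temp file, fsync it, then rename over the destination, so readers see either the old file or the new one, never a partial write. The parent-directory fsync and the Windows AlreadyExists fallback mentioned elsewhere in this stack are omitted here for brevity:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

pub fn atomic_write(dest: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = dest.with_extension("tmp-download");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // flush to disk before the rename for crash safety
    fs::rename(&tmp, dest) // atomic replacement on POSIX, same filesystem
}
```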

Integration tests:

  * `snapshot_refresh::duckdb` — bootstrap → writer change →
    reader picks up new snapshot → swap → query reflects change.
  * `snapshot_refresh::sqlite` — same shape; passes thanks to the
    SQLite WAL flush in the parent commit.
  * `snapshot_refresh::turso` — same shape; same WAL flush
    dependency.

CI: new `Snapshot Refresh Mode Integration Tests` job in
integration.yml runs DuckDB/SQLite/Turso variants against MinIO.

Cayenne is intentionally not covered in this commit: portable Cayenne
snapshots require a per-dataset metastore export/import refactor
(addressed in the next commit). The snapshot_refresh test directory
omits Cayenne until then.

Depends on the SQLite/Turso WAL flush from the previous commit; without
it the SQLite/Turso integration tests cannot pass.

Pulls in the upstream pool-eviction APIs from
datafusion-contrib/datafusion-table-providers#635 (`invalidate_instance`,
`invalidate_file_instance`).

* feat(cayenne): add per-dataset metastore export/import (snapshot foundation)

Adds the foundation for portable, per-dataset Cayenne metastore snapshots:
a versioned-JSON 'slice' format that captures only one dataset's rows
from cayenne_table and its dependent tables, with path columns rewritten
to be relative to a configurable anchor.

Why:
The legacy Cayenne snapshot format archived the entire cayenne.db SQLite
file. That approach forced a one-dataset-per-metadata-directory limit
(see validate_cayenne_snapshot_consistency) because two snapshots would
clobber each other's metastore rows on extract, and produced snapshots
that were not portable across nodes whose data directories did not
match the writer's absolute paths (spiceai#10642). It also raced with the
reader process's eager metastore initialization (spiceai#10649): SQLite journal
sidecars created during init triggered checksum mismatches against the
archive on extract.

This commit lays the groundwork to fix all three issues by introducing:

  * `crates/cayenne/src/metastore/snapshot.rs` — the slice format and
    export/import logic:

    - `DatasetMetastoreSlice` (versioned JSON, format_version: 1)
      with one entry per metastore table from EXPECTED_TABLES.

    - `SliceValue` mirrors MetastoreValue but JSON-friendly (blobs
      base64-encoded under the 'x' tag).

    - `export_dataset(metastore, dataset_name, anchor)` selects every
      row that belongs to the dataset (the cayenne_table row plus all
      rows in dependent tables matching the same table_id) and emits a
      slice. Path columns (cayenne_table.path, cayenne_delete_file.path,
      cayenne_partition.path) are rewritten relative to `anchor` so the
      slice contains no absolute filesystem paths.

    - `import_dataset(metastore, slice, anchor)` runs inside a single
      transaction: deletes any existing rows for the same dataset_name
      (FK ON DELETE CASCADE removes all dependent rows), then inserts
      the slice's rows with paths rewritten back to absolute form
      anchored at `anchor`. Unsupported format_version or engine
      mismatch are rejected up front.

  * `MetastoreRow::get_value(index) -> MetastoreValue` — a new trait
    method that returns the raw column value without type coercion, so
    generic export logic can serialize columns without knowing each
    column's expected type at compile time. SQLite and Turso impls add
    a one-line implementation that clones from their existing values
    Vec.
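
The path rewrite described above is, at its core, a strip-prefix on export and a re-join on import. A sketch with plain `Path` operations (the real slice carries these as JSON columns):

```rust
use std::path::{Path, PathBuf};

/// Export side: strip the writer's anchor so the slice holds no
/// absolute filesystem paths. `None` if the path is outside the anchor.
pub fn relativize(abs: &Path, anchor: &Path) -> Option<PathBuf> {
    abs.strip_prefix(anchor).ok().map(Path::to_path_buf)
}

/// Import side: re-anchor the relative path under the reader's root.
pub fn absolutize(rel: &Path, anchor: &Path) -> PathBuf {
    anchor.join(rel)
}
```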

Tests in `crates/cayenne/src/metastore/snapshot.rs::tests`:

  * `round_trip_preserves_rows_and_relocates_paths` — exports a
    dataset from one metastore, imports into a fresh metastore at a
    different anchor, verifies all partitions resolve to the new anchor.

  * `import_replaces_prior_dataset_rows_wholesale` — confirms the
    DuckDB-style snapshot-refresh semantic that import wipes any prior
    rows for the same dataset_name before inserting (no leftover
    partitions).

  * `import_leaves_other_datasets_untouched` — confirms slicing is
    correctly scoped: importing dataset A does not affect dataset B's
    rows in the same shared metastore.

  * `rejects_unsupported_format_version` — both wrong format_version
    and wrong engine identifier are rejected with clear error messages
    *before* any DML runs.

  * `json_round_trip` — slice serializes and parses back to an
    equivalent slice.

Follow-up (separate commit / PR): wire `CayenneSnapshotEngine` into
`SnapshotManager`'s directory create/extract paths so the new format
is actually used by snapshot upload/download. With that follow-up:
  * the cayenne.db file leaves the archive (replaces spiceai#10649),
  * paths become portable (closes spiceai#10642),
  * the single-dataset-per-metadata-dir validation can be lifted,
  * Cayenne refresh_mode: snapshot becomes a passing integration test.

* feat(cayenne): wire CayenneSnapshotEngine into snapshot pipeline

This is Commit 4 of the refresh_mode: snapshot stack. Commit 3 added the
metastore export/import API; this commit wires it into the
SnapshotManager upload/download pipeline so Cayenne snapshots actually
use the per-dataset slice format instead of shipping raw cayenne.db.

Closes:
  - spiceai#10642 (Cayenne snapshots not portable across data dirs)
  - spiceai#10649 (Cayenne metastore-init races with checksum)

Architecture:

  * Trait extension in runtime-acceleration::snapshot::engine:
    Adds two new SnapshotEngine methods, each with a no-op default so
    DuckDB/SQLite/Turso continue to behave exactly as before:

      - prepare_directory_snapshot(dirs, dataset_name)
          -> DirectorySnapshotPlan
        Returns (a) a list of file paths *relative to each source dir*
        that should be excluded from the tar, and (b) extra in-memory
        entries to append at the end of the tar.

      - finalize_directory_snapshot(dirs, dataset_name, extras)
        Engine-specific post-extract hook.

    Plus a new SnapshotEngineError::Custom { message } variant and a
    SnapshotEngineError::from_display() constructor so engines defined
    outside runtime-acceleration (CayenneSnapshotEngine in the runtime
    crate) can surface their own rich errors without needing a
    feature-gated variant in the trait crate.

  * directory_archive: new archive_directories_with_plan() that takes
    skip_relative_paths + extras. The original archive_directories() is
    preserved as a thin wrapper. The new add_directory_to_archive_filtered
    helper handles the recursive walk while honoring the skip set. The
    module remains engine-agnostic — the skip predicate is just a HashSet
    passed in.

  * SnapshotManager:
    - create_directory_snapshot now calls prepare_directory_snapshot
      first, then archive_directories_with_plan with the engine's
      skip list and extras.
    - download_to_directories now calls finalize_directory_snapshot
      after extract.
    - New with_snapshot_engine(self, Arc<dyn SnapshotEngine>) builder
      so accelerators can inject a custom engine.

  * download_snapshot_if_needed and snapshot_before_recreate gain an
    Option<Arc<dyn SnapshotEngine>> parameter. DuckDB/SQLite/Turso pass
    None (default behavior). Cayenne builds and passes a
    CayenneSnapshotEngine.

  * Cayenne accelerator (crates/runtime/src/dataaccelerator/cayenne):
    - New module snapshot_engine.rs implements CayenneSnapshotEngine.
      Holds an Arc<dyn MetadataCatalog>, dataset_name, and data_dir_anchor.
      prepare_directory_snapshot exports the per-dataset slice via
      MetadataCatalog::export_dataset_slice and instructs the archiver
      to skip cayenne.db / -wal / -shm. finalize_directory_snapshot
      reads the slice JSON back from the extracted directory and calls
      MetadataCatalog::import_dataset_slice.
    - The cayenne accelerator's bootstrap path constructs a
      CayenneSnapshotEngine using its existing get_or_create_catalog
      machinery and threads it through download_snapshot_if_needed.

  * cayenne::MetadataCatalog: new export_dataset_slice and
    import_dataset_slice trait methods (default impls return
    InvalidOperationNoSource); CayenneCatalog overrides both with
    dispatch to the underlying SQLite or libsql metastore.

  * Lifted validate_cayenne_snapshot_consistency's
    SharedAcceleration restriction. With per-dataset slices, multiple
    Cayenne datasets sharing a metastore directory can each ship their
    own snapshot without clobbering each other's metastore rows on
    extract. The InconsistentSnapshotSettings (mixed enabled/disabled)
    check stays.

  * Re-enabled the Cayenne snapshot_refresh integration test
    (snapshot_refresh::cayenne::snapshot_refresh_cayenne_bootstrap_then_refresh).
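
The filtered walk behind `archive_directories_with_plan` can be sketched as a recursive directory scan that drops any entry whose path relative to the root is in the skip set, the shape used to keep cayenne.db and its WAL sidecars out of the tar. Names here are illustrative, not the module's real ones:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

pub fn collect_filtered(
    root: &Path,
    skip_relative: &HashSet<PathBuf>,
) -> io::Result<Vec<PathBuf>> {
    fn walk(
        root: &Path,
        dir: &Path,
        skip: &HashSet<PathBuf>,
        out: &mut Vec<PathBuf>,
    ) -> io::Result<()> {
        for entry in fs::read_dir(dir)? {
            let path = entry?.path();
            let rel = path.strip_prefix(root).expect("under root").to_path_buf();
            if skip.contains(&rel) {
                continue; // excluded from the archive (e.g. cayenne.db, -wal, -shm)
            }
            if path.is_dir() {
                walk(root, &path, skip, out)?;
            } else {
                out.push(rel);
            }
        }
        Ok(())
    }
    let mut out = Vec::new();
    walk(root, root, skip_relative, &mut out)?;
    out.sort(); // deterministic order for the archive
    Ok(out)
}
```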

Tests:
  - 3 new unit tests in cayenne::snapshot_engine::tests:
      * create_directory_snapshot_skips_cayenne_db_and_emits_slice
      * refuses_mismatched_dataset
      * finalize_missing_slice_returns_clear_error
  - validate_snapshots tests updated; the
    test_cayenne_shared_acceleration_with_snapshots_errors test is
    renamed to ..._now_supported and asserts Ok.
  - Existing 91 runtime-acceleration::snapshot lib tests still pass.
  - Cayenne integration test runs against real S3 in CI (same harness
    as the DuckDB/SQLite/Turso refresh-mode tests).

* fix(snapshot): bootstrap dataset checkpoint from snapshot metadata

When a downloaded snapshot doesn't carry a populated _dataset_checkpoint
row (the steady state for Cayenne, where the on-disk archive ships the
per-dataset metastore slice JSON instead of the raw cayenne.db), the
post-extract Checkpointer::get_schema() returns None and bootstrap was
failing with MissingSchema.

The metadata.json carries the dataset's verified schema; after import we
now materialize that schema into the local checkpoint via
Checkpointer::checkpoint(metadata_schema, None) so the spice_sys side
matches the snapshot. For DuckDB / SQLite / Turso this branch is
unreachable in steady state (their archives ship _dataset_checkpoint
already); on a corrupted snapshot the self-heal is harmless because the
metadata schema is the same one we'd otherwise have validated against.

This unblocks refresh_mode: snapshot for Cayenne and re-enables the
snapshot_refresh::cayenne integration test by default. Closes spiceai#10658.