feat(duckdb,sqlite): add invalidate_instance for snapshot-driven cache invalidation #635
Merged
phillipleblanc merged 2 commits into spiceai-52 on May 3, 2026
Conversation
phillipleblanc
added a commit
to spiceai/spiceai
that referenced
this pull request
Apr 28, 2026
…eshes

Address Copilot review comments and bugs found via manual testing against an S3 snapshot bucket:

- runtime-acceleration: download_to_file now uses temp-file + atomic rename, so concurrent readers in refresh_mode: snapshot never observe a partially written file (the download is also fsync'd before the rename for crash safety).
- runtime-acceleration: download_if_newer uses strict > and never regresses to an older remote snapshot id; it warns when the remote is older than the local copy.
- runtime: SnapshotRefreshState::current_snapshot_id is now Arc<Mutex<Option<u64>>> rather than an AtomicU64 with 0 as a sentinel, so snapshot id 0 (a valid first-snapshot id) is no longer indistinguishable from "no snapshot loaded".
- build_snapshot_refresh_state: bootstrap_loaded_id is passed through directly, without sentinel collapsing.
- swappable: doc comments now describe the poison-recovery behavior accurately.
- New test: download_if_newer rejects regression to an older remote id.

Bugs surfaced via manual S3 testing:

- build_snapshot_refresh_state constructed SnapshotManager without a checkpointer factory, so refresh-time download_latest_snapshot failed with "Dataset checkpointer factory not set". It now wires in the same factory used by the bootstrap path.
- refresh_from_snapshot validated the snapshot schema against the federated schema. In snapshot mode the federated source is intentionally not consulted at refresh time and may diverge (e.g. an empty stub source). It now validates against the SwappableTableProvider's cached schema.
- Schema validation used strict arrow Schema equality, which rejected logically identical schemas that differ only in engine-attached metadata or nullability flags. Replaced with schemas_compatible(), which compares field count, ordered names, and data types only.

Pool cache invalidation:

- DuckDB and SQLite reload_from_snapshot now call the new upstream invalidate_file_instance / invalidate_instance APIs on the TableProviderFactory before rebuilding. Without this eviction, the newly built provider would reuse the prior pool's open connections (which keep observing the previously opened file inode), and queries would return stale data even after the file is replaced on disk.
- Cargo.toml bumped to consume the upstream PR (datafusion-contrib/datafusion-table-providers#635).

Original architecture: a new refresh mode that, when paired with snapshots, drives accelerator data exclusively from snapshots in object storage. The federated source is never queried for refreshes; instead, the runtime polls the snapshot store on a configurable cadence and reloads the accelerator only when a strictly newer snapshot is available.

- spicepod / runtime: new RefreshMode::Snapshot enum variant.
- runtime-acceleration: SnapshotDownloadInfo carries snapshot_id; new remote_current_snapshot_id() and download_if_newer().
- DataAccelerator trait: new supports_snapshot_reload() and reload_from_snapshot(source, previous_provider, factory). Implemented for DuckDB, SQLite, Cayenne, Turso.
- New SwappableTableProvider wrapper for atomic provider swaps.
- AcceleratedTable / Refresher / RefreshTaskRunner / RefreshTask carry an optional SnapshotRefreshState. RefreshTask::refresh_from_snapshot() short-circuits the federated fetch path.
- create_accelerated_table validates the configuration and auto-applies a default refresh_check_interval (1m) for snapshot mode.

Manual end-to-end testing against an S3 snapshot bucket confirms the full flow: the writer creates snapshots; the reader polls, detects newer ids, downloads atomically, evicts the cached pool, rebuilds the provider, swaps under SwappableTableProvider, and serves the new data on subsequent queries.
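The temp-file + fsync + atomic-rename pattern described above can be sketched as follows. This is a minimal illustration, not the actual runtime-acceleration API: `write_atomic`, its signature, and the file names are hypothetical stand-ins.

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Hypothetical sketch of atomic file replacement: write to a temp file,
// fsync it, then rename over the destination.
fn write_atomic(dest: &Path, bytes: &[u8]) -> std::io::Result<()> {
    // The temp file lives next to the destination so the rename stays on
    // one filesystem (rename is only atomic within a single filesystem).
    let tmp = dest.with_extension("partial");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(bytes)?;
        // fsync before rename: after a crash, readers see either the old
        // file or the fully written new one, never a truncated download.
        f.sync_all()?;
    }
    // Atomic replace: concurrent readers never observe a partial file.
    fs::rename(&tmp, dest)
}

fn main() -> std::io::Result<()> {
    let dest = std::env::temp_dir().join("snapshot_sketch.duckdb");
    write_atomic(&dest, b"old snapshot")?;
    write_atomic(&dest, b"new snapshot")?;
    assert_eq!(fs::read(&dest)?, b"new snapshot");
    println!("ok");
    Ok(())
}
```

Note that `std::fs::rename` replaces an existing destination on Unix; Windows behavior differs across configurations, which is what the later AlreadyExists fallback in this PR series addresses.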
Force-pushed from 6bf50e6 to 577d85f.
phillipleblanc
added a commit
to spiceai/spiceai
that referenced
this pull request
Apr 28, 2026
…eshes

Address Copilot review comments and bugs found via manual testing against an S3 snapshot bucket:

- runtime-acceleration: download_to_file now uses temp-file + atomic rename, so concurrent readers in refresh_mode: snapshot never observe a partially written file (the download is also fsync'd before the rename for crash safety). Adds an explicit AlreadyExists fallback for older Windows configurations where MoveFileExW does not honor the REPLACE_EXISTING flag.
- runtime-acceleration: download_if_newer uses strict > and never regresses to an older remote snapshot id; it warns when the remote is older than the local copy.
- runtime: SnapshotRefreshState::current_snapshot_id is now Arc<Mutex<Option<u64>>> rather than an AtomicU64 with 0 as a sentinel, so snapshot id 0 (a valid first-snapshot id) is no longer indistinguishable from "no snapshot loaded".
- build_snapshot_refresh_state: bootstrap_loaded_id is passed through directly, without sentinel collapsing.
- swappable: doc comments now describe the poison-recovery behavior accurately.
- New test: download_if_newer rejects regression to an older remote id.

Bugs surfaced via manual S3 testing:

- build_snapshot_refresh_state constructed SnapshotManager without a checkpointer factory, so refresh-time download_latest_snapshot failed with "Dataset checkpointer factory not set". It now wires in the same factory used by the bootstrap path.
- refresh_from_snapshot validated the snapshot schema against the federated schema. In snapshot mode the federated source is intentionally not consulted at refresh time and may diverge (e.g. an empty stub source). It now validates against the SwappableTableProvider's cached schema.
- Schema validation used strict arrow Schema equality, which rejected logically identical schemas that differ only in engine-attached metadata or nullability flags. Replaced with schemas_compatible(), which compares field count, ordered names, and data types only.

Pool cache invalidation:

- DuckDB and SQLite reload_from_snapshot now call the new upstream invalidate_file_instance / invalidate_instance APIs on the TableProviderFactory before rebuilding. Without this eviction, the newly built provider would reuse the prior pool's open connections (which keep observing the previously opened file inode), and queries would return stale data even after the file is replaced on disk.
- Cargo.toml bumped to consume the upstream PR (datafusion-contrib/datafusion-table-providers#635).

Error context preservation:

- build_snapshot_refresh_state no longer swallows FilePathError from get_acceleration_layout; a new SnapshotRefreshModeLayoutUnavailable variant carries the original error, so snapshot-mode configuration issues surface with their root cause (which engine, which path).

Original architecture: a new refresh mode that, when paired with snapshots, drives accelerator data exclusively from snapshots in object storage. The federated source is never queried for refreshes; instead, the runtime polls the snapshot store on a configurable cadence and reloads the accelerator only when a strictly newer snapshot is available.

- spicepod / runtime: new RefreshMode::Snapshot enum variant.
- runtime-acceleration: SnapshotDownloadInfo carries snapshot_id; new remote_current_snapshot_id() and download_if_newer().
- DataAccelerator trait: new supports_snapshot_reload() and reload_from_snapshot(source, previous_provider, factory). Implemented for DuckDB, SQLite, Cayenne, Turso. Engines that opt out (e.g. TableModePartitionedDuckDB) are explicitly rejected at config validation with a clear error.
- New SwappableTableProvider wrapper for atomic provider swaps.
- AcceleratedTable / Refresher / RefreshTaskRunner / RefreshTask carry an optional SnapshotRefreshState. RefreshTask::refresh_from_snapshot() short-circuits the federated fetch path.
- create_accelerated_table validates the configuration and auto-applies a default refresh_check_interval (1m) for snapshot mode.

Manual end-to-end testing against an S3 snapshot bucket confirms the full flow: the writer creates snapshots; the reader polls, detects newer ids, downloads atomically, evicts the cached pool, rebuilds the provider, swaps under SwappableTableProvider, and serves the new data on subsequent queries.
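The relaxed schema check can be sketched with plain structs standing in for arrow fields. The real schemas_compatible() operates on arrow Schema values; this Field type, its string-typed data_type, and the metadata representation are hypothetical simplifications used only to show which properties the comparison ignores.

```rust
// Hypothetical stand-in for an arrow field: name and data type, plus the
// nullability flag and engine-attached metadata that strict equality
// would also compare.
#[derive(Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,                  // ignored by the relaxed check
    metadata: Vec<(String, String)>, // ignored by the relaxed check
}

// Relaxed compatibility: field count, ordered names, and data types only.
fn schemas_compatible(a: &[Field], b: &[Field]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| x.name == y.name && x.data_type == y.data_type)
}

fn main() {
    // Same logical schema, differing only in nullability and metadata.
    let snapshot = [Field {
        name: "id".into(),
        data_type: "Int64".into(),
        nullable: true,
        metadata: vec![("engine".into(), "duckdb".into())],
    }];
    let cached = [Field {
        name: "id".into(),
        data_type: "Int64".into(),
        nullable: false,
        metadata: vec![],
    }];
    assert!(snapshot != cached);                     // strict equality rejects
    assert!(schemas_compatible(&snapshot, &cached)); // relaxed check accepts
    println!("ok");
}
```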
Force-pushed from 577d85f to e2ecafa.
Force-pushed from e2ecafa to 346381c.
…e invalidation

DuckDBTableProviderFactory and SqliteTableProviderFactory cache pool instances by DbInstanceKey (file path or memory). Without an invalidation API, callers that legitimately need to swap the underlying database file out of band (for example, after restoring it from a snapshot in object storage) have no way to force the next provider build to open a fresh pool. Subsequent providers reuse the prior pool's already-open connections, which keep observing the previously opened file inode rather than the replacement on disk.

Add:

- DuckDBTableProviderFactory::invalidate_instance(&DbInstanceKey)
- DuckDBTableProviderFactory::invalidate_file_instance(path)
- SqliteTableProviderFactory::invalidate_instance(&DbInstanceKey)
- SqliteTableProviderFactory::invalidate_file_instance(path)

Each removes the cached pool entry and returns it (so the caller can hold the prior pool alive while in-flight queries drain) without disturbing other entries. Invalidating an absent key is a no-op.

Tests cover insertion, idempotent reuse, eviction, the missing-key no-op, and re-population.
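The cache-plus-invalidation shape described above can be sketched as follows. This is an illustrative stand-in, not the upstream implementation: the Pool and Factory types are hypothetical, while DbInstanceKey and the invalidate method names mirror the commit text.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Cache key: a database file path, or the in-memory instance.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum DbInstanceKey {
    File(String),
    Memory,
}

// Stand-in for a connection pool holding open file handles.
#[derive(Debug)]
struct Pool {
    #[allow(dead_code)]
    path: String,
}

#[derive(Default)]
struct Factory {
    instances: HashMap<DbInstanceKey, Arc<Pool>>,
}

impl Factory {
    // Build-or-reuse: without invalidation, this reuse is exactly what
    // kept serving the old file inode after an out-of-band swap.
    fn get_or_create(&mut self, key: DbInstanceKey) -> Arc<Pool> {
        let path = match &key {
            DbInstanceKey::File(p) => p.clone(),
            DbInstanceKey::Memory => ":memory:".into(),
        };
        Arc::clone(
            self.instances
                .entry(key)
                .or_insert_with(|| Arc::new(Pool { path })),
        )
    }

    // Remove the cached entry and return it, so the caller can keep the
    // prior pool alive while in-flight queries drain. Absent key: no-op.
    fn invalidate_instance(&mut self, key: &DbInstanceKey) -> Option<Arc<Pool>> {
        self.instances.remove(key)
    }

    fn invalidate_file_instance(&mut self, path: &str) -> Option<Arc<Pool>> {
        self.invalidate_instance(&DbInstanceKey::File(path.to_string()))
    }
}

fn main() {
    let mut f = Factory::default();
    let key = DbInstanceKey::File("/data/accel.db".into());
    let p1 = f.get_or_create(key.clone());
    let p2 = f.get_or_create(key.clone());
    assert!(Arc::ptr_eq(&p1, &p2)); // idempotent reuse of the cached pool
    let evicted = f.invalidate_file_instance("/data/accel.db");
    assert!(evicted.is_some()); // caller holds the old pool to drain it
    assert!(f.invalidate_instance(&key).is_none()); // missing key: no-op
    let p3 = f.get_or_create(key);
    assert!(!Arc::ptr_eq(&p1, &p3)); // next build opens a fresh pool
    println!("ok");
}
```

Returning the evicted `Arc<Pool>` rather than dropping it is the key design choice: eviction only unlinks the pool from the cache, and the pool itself closes when the last in-flight query releases its reference.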
Force-pushed from 346381c to 96482d2.
phillipleblanc
added a commit
to spiceai/spiceai
that referenced
this pull request
Apr 30, 2026
…eshes Address Copilot review comments and bugs found via manual testing against an S3 snapshot bucket: - runtime-acceleration: download_to_file now uses temp-file + atomic rename so concurrent readers in refresh_mode: snapshot never observe a partially written file (download is also fsync'd before rename for crash safety). Adds an explicit AlreadyExists fallback for older Windows configurations where MoveFileExW does not honor the REPLACE_EXISTING flag. - runtime-acceleration: download_if_newer uses strict >, never regresses to an older remote snapshot id; warns when remote is older than local. - runtime: SnapshotRefreshState::current_snapshot_id is now Arc<Mutex<Option<u64>>> rather than AtomicU64+sentinel-0, so snapshot id 0 (a valid first-snapshot id) is no longer indistinguishable from "no snapshot loaded". - build_snapshot_refresh_state: bootstrap_loaded_id is passed through directly without sentinel collapsing. - swappable: doc comments now describe poison-recovery behavior accurately. - New test: download_if_newer rejects regression to older remote id. Bugs surfaced via manual S3 testing: - build_snapshot_refresh_state was constructing SnapshotManager without a checkpointer factory, so refresh-time download_latest_snapshot failed with "Dataset checkpointer factory not set". Now wires the same factory used by the bootstrap path. - refresh_from_snapshot validated snapshot schema against the federated schema. For snapshot mode the federated source is intentionally not consulted at refresh time and may diverge (e.g. empty stub source). Now validates against the SwappableTableProvider's cached schema. - Schema validation was using strict arrow Schema equality, which rejected logically-identical schemas that differ only in engine-attached metadata or nullability flags. Replaced with schemas_compatible() that compares field count, ordered names, and data types only. 
Pool cache invalidation:

- DuckDB and SQLite reload_from_snapshot now call the new upstream invalidate_file_instance / invalidate_instance APIs on the TableProviderFactory before rebuilding. Without this eviction the newly-built provider would reuse the prior pool's open connections (which keep observing the previously-opened file inode) and queries would return stale data even after the file is replaced on disk.
- Cargo.toml bumped to consume the upstream PR (datafusion-contrib/datafusion-table-providers#635).

Error context preservation:

- build_snapshot_refresh_state no longer swallows FilePathError from get_acceleration_layout; a new SnapshotRefreshModeLayoutUnavailable variant carries the original error so snapshot-mode config issues surface with their root cause (which engine, which path).

Original architecture: A new refresh mode that, when paired with snapshots, drives accelerator data exclusively from snapshots in object storage. The federated source is never queried for refreshes; instead the runtime polls the snapshot store on a configurable cadence and reloads the accelerator only when a strictly newer snapshot is available.

- spicepod / runtime: new RefreshMode::Snapshot enum variant.
- runtime-acceleration: SnapshotDownloadInfo carries snapshot_id; new remote_current_snapshot_id() and download_if_newer().
- DataAccelerator trait: new supports_snapshot_reload() and reload_from_snapshot(source, previous_provider, factory). Implemented for DuckDB, SQLite, Cayenne, Turso. Engines that opt out (e.g. TableModePartitionedDuckDB) are explicitly rejected at config validation with a clear error.
- New SwappableTableProvider wrapper for atomic provider swap.
- AcceleratedTable / Refresher / RefreshTaskRunner / RefreshTask carry optional SnapshotRefreshState. RefreshTask::refresh_from_snapshot() short-circuits the federated fetch path.
- create_accelerated_table validates configuration and auto-applies a default refresh_check_interval (1m) for snapshot mode.
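The eviction semantics that the invalidate_file_instance / invalidate_instance calls rely on can be illustrated with a minimal, self-contained registry. This is a sketch only: the real factories key pools by DbInstanceKey and hold actual connection pools, whereas `PoolRegistry` and its `Arc<String>` stand-ins below are invented for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified stand-in for a file-path-keyed pool cache.
#[derive(Default)]
struct PoolRegistry {
    // Arc<String> stands in for a shared connection pool.
    pools: Mutex<HashMap<String, Arc<String>>>,
}

impl PoolRegistry {
    /// Return the cached pool for `path`, creating one on first use.
    fn get_or_create(&self, path: &str) -> Arc<String> {
        self.pools
            .lock()
            .unwrap()
            .entry(path.to_string())
            .or_insert_with(|| Arc::new(format!("pool for {path}")))
            .clone()
    }

    /// Evict the entry for `path`, returning the evicted pool if present.
    /// The next `get_or_create` call opens a fresh pool.
    fn invalidate_file_instance(&self, path: &str) -> Option<Arc<String>> {
        self.pools.lock().unwrap().remove(path)
    }
}
```

Eviction does not close in-flight users of the old pool (they still hold their own `Arc`); it only guarantees the next provider build starts fresh.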
Manual end-to-end test against an S3 snapshot bucket confirms the full flow: writer creates snapshots, reader polls, detects newer ids, downloads atomically, evicts the cached pool, rebuilds the provider, swaps under SwappableTableProvider, and serves the new data on subsequent queries.
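The inode-pinning behavior that motivates the pool eviction above can be demonstrated with the standard library on a POSIX filesystem (an illustrative sketch, not runtime code; file names are made up):

```rust
use std::fs::{self, File};
use std::io::Read;
use std::path::Path;

/// A handle opened before the file is replaced keeps reading the
/// original inode, not the replacement, which is exactly why cached
/// pool connections serve stale data after a snapshot restore.
fn read_via_stale_handle(dir: &Path) -> (String, String) {
    let db = dir.join("acc.db");
    fs::write(&db, "old-snapshot").unwrap();

    // A "cached pool" connection: opened against the original inode.
    let mut stale = File::open(&db).unwrap();

    // Out-of-band replacement, as a snapshot restore would do.
    let tmp = dir.join("acc.db.tmp");
    fs::write(&tmp, "new-snapshot").unwrap();
    fs::rename(&tmp, &db).unwrap();

    // The old handle still reads the replaced (now-unlinked) inode...
    let mut via_stale = String::new();
    stale.read_to_string(&mut via_stale).unwrap();

    // ...while a fresh open (what pool eviction forces) sees the new file.
    let via_fresh = fs::read_to_string(&db).unwrap();
    (via_stale, via_fresh)
}
```

On Windows the failure mode differs in detail, but the conclusion is the same: only a freshly opened handle is guaranteed to observe the replacement.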
phillipleblanc
added a commit
to spiceai/spiceai
that referenced
this pull request
May 2, 2026
sgrebnov
approved these changes
May 3, 2026
pull Bot
pushed a commit
to TheRakeshPurohit/spiceai
that referenced
this pull request
May 4, 2026
…tore slice (stacked) (spiceai#10651)

* fix(snapshot): flush SQLite/Turso WAL before snapshotting

SQLite and Turso accelerators run with WAL journal mode. The pre-existing snapshot path performed `fs::copy(live_db, temp_copy)` directly, which captured only the durable pages — every uncheckpointed write (typically all writes since the last automatic checkpoint) lived only in the `-wal` sidecar and was silently lost in the upload. This bug was invisible under `refresh_mode: full|append|changes` because the federated source repopulates on bootstrap. It surfaces as data loss with `refresh_mode: snapshot` (separate change), where the snapshot is the only data source.

Fixes by:

1. Adding `SnapshotEngine::checkpoint_live` (default no-op) — invoked by `SnapshotManager::create_file_snapshot` *before* `fs::copy` while the accelerator write lock is held.
2. New `SqliteSnapshotEngine` overrides `checkpoint_live` to run `PRAGMA wal_checkpoint(TRUNCATE)` against the live file via a short-lived rusqlite connection. WAL mode allows multiple connections; the existing accelerator write lock guarantees no concurrent writers race the checkpoint.
3. `SqliteSnapshotEngine::prepare_for_upload` switches the *copied* file to `journal_mode=DELETE` as defense-in-depth: even if the copy ever ended up with stale `-wal`/`-shm` sidecars next to it, the uploaded file is guaranteed to be self-contained.
4. New `TursoSnapshotEngine` reuses the SqliteSnapshotEngine logic (libsql is on-disk-compatible with SQLite). The `turso` feature now pulls in rusqlite for this purpose.
5. `create_snapshot_engine` now returns the engine-specific impls for `AccelerationEngine::Sqlite` and `::Turso` instead of `DefaultSnapshotEngine`.

Closes spiceai#10643

* feat(acceleration): add refresh_mode: snapshot for snapshot-only refreshes

Adds a new refresh mode that loads accelerated tables only by polling the snapshot store for newer snapshots, never querying the federated source.
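The core of that polling contract — reload only when the remote snapshot id is strictly newer than the locally loaded one — can be sketched as follows. The function name and the `remote_ids` slice are illustrative; `Option<u64>` mirrors the "no snapshot loaded yet" state described earlier, so snapshot id 0 remains a valid first id.

```rust
/// Run the strictly-newer decision over a sequence of poll results.
/// Returns the final loaded id and how many reloads were performed.
fn run_poll_cycles(mut local: Option<u64>, remote_ids: &[u64]) -> (Option<u64>, u32) {
    let mut reloads = 0;
    for &remote in remote_ids {
        let strictly_newer = match local {
            None => true,              // bootstrap: take whatever is remote
            Some(cur) => remote > cur, // strict >: never regress to an older id
        };
        if strictly_newer {
            // here the runtime would download atomically, evict the cached
            // pool, rebuild the provider, and swap it in
            local = Some(remote);
            reloads += 1;
        }
    }
    (local, reloads)
}
```

Equal or older remote ids are ignored, so a lagging replica of the snapshot store can never roll an accelerator backwards.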
Use cases: read replicas, expensive/slow sources, air-gapped propagation.

Behavior:

* `RefreshMode::Snapshot` is a new variant accepted in spicepod `acceleration.refresh_mode` (and is rejected by the validator unless `acceleration.snapshots` is enabled).
* On startup, the dataset bootstraps from the snapshot store as today.
* The refresh task polls the snapshot store on a configurable interval (default 60 s) via the new `SnapshotManager::download_if_newer(current_local_id, validate_schema)` API. The schema validator runs against the snapshot's `metadata.json` *before* download so a mismatched snapshot can never overwrite the primary file.
* Per-engine reload is performed via a new `Accelerator::reload_from_snapshot` trait method. DuckDB/SQLite/Turso impls evict pool connections and reopen at the same primary path; the live federated query path serves the prior inode until the swap completes via `SwappableTableProvider` (an RwLock-backed `Arc<dyn TableProvider>` that caches schema/constraints/table_type).
* `INSERT INTO <dataset>` is rejected with a clear error when the accelerator's refresh_mode is Snapshot.
* Reload + swap is serialized against concurrent accelerator writes via the existing accelerator write mutex.
* Atomic file write on download (temp + fsync + rename + parent fsync).

Integration tests:

* `snapshot_refresh::duckdb` — bootstrap → writer change → reader picks up new snapshot → swap → query reflects change.
* `snapshot_refresh::sqlite` — same shape; passes thanks to the SQLite WAL flush in the parent commit.
* `snapshot_refresh::turso` — same shape; same WAL flush dependency.

CI: new `Snapshot Refresh Mode Integration Tests` job in integration.yml runs DuckDB/SQLite/Turso variants against MinIO.

Cayenne is intentionally not covered in this commit: portable Cayenne snapshots require a per-dataset metastore export/import refactor (addressed in the next commit). The snapshot_refresh test directory omits Cayenne until then.
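The RwLock-backed swap described for SwappableTableProvider can be sketched with a generic wrapper. This is a minimal sketch, not the real provider (which also caches schema/constraints/table_type and handles lock poisoning); the type and method names below are illustrative.

```rust
use std::sync::{Arc, RwLock};

/// Queries clone the current Arc under a read lock and keep using that
/// provider even if a reload swaps in a replacement mid-flight.
struct Swappable<T> {
    inner: RwLock<Arc<T>>,
}

impl<T> Swappable<T> {
    fn new(provider: Arc<T>) -> Self {
        Self { inner: RwLock::new(provider) }
    }

    /// Cheap snapshot of the current provider for a query to hold.
    fn current(&self) -> Arc<T> {
        self.inner.read().unwrap().clone()
    }

    /// Atomically install a rebuilt provider, returning the old one.
    fn swap(&self, next: Arc<T>) -> Arc<T> {
        std::mem::replace(&mut *self.inner.write().unwrap(), next)
    }
}
```

Because readers only hold the lock long enough to clone the `Arc`, a swap never blocks behind a long-running query, and in-flight queries finish against the provider (and inode) they started with.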
Depends on the SQLite/Turso WAL flush from the previous commit; without it the SQLite/Turso integration tests cannot pass. Pulls in the upstream pool-eviction APIs from datafusion-contrib/datafusion-table-providers#635 (`invalidate_instance`, `invalidate_file_instance`).

* feat(cayenne): add per-dataset metastore export/import (snapshot foundation)

Adds the foundation for portable, per-dataset Cayenne metastore snapshots: a versioned-JSON 'slice' format that captures only one dataset's rows from cayenne_table and its dependent tables, with path columns rewritten to be relative to a configurable anchor.

Why: The legacy Cayenne snapshot format archived the entire cayenne.db SQLite file. That approach forced a one-dataset-per-metadata-directory limit (see validate_cayenne_snapshot_consistency) because two snapshots would clobber each other's metastore rows on extract, and produced snapshots that were not portable across nodes whose data directories did not match the writer's absolute paths (spiceai#10642). It also raced with the reader process's eager metastore initialization (spiceai#10649): SQLite journal sidecars created during init triggered checksum mismatches against the archive on extract.

This commit lays the groundwork to fix all three issues by introducing:

* `crates/cayenne/src/metastore/snapshot.rs` — the slice format and export/import logic:
  - `DatasetMetastoreSlice` (versioned JSON, format_version: 1) with one entry per metastore table from EXPECTED_TABLES.
  - `SliceValue` mirrors MetastoreValue but JSON-friendly (blobs base64-encoded under the 'x' tag).
  - `export_dataset(metastore, dataset_name, anchor)` selects every row that belongs to the dataset (the cayenne_table row plus all rows in dependent tables matching the same table_id) and emits a slice. Path columns (cayenne_table.path, cayenne_delete_file.path, cayenne_partition.path) are rewritten relative to `anchor` so the slice contains no absolute filesystem paths.
  - `import_dataset(metastore, slice, anchor)` runs inside a single transaction: deletes any existing rows for the same dataset_name (FK ON DELETE CASCADE removes all dependent rows), then inserts the slice's rows with paths rewritten back to absolute form anchored at `anchor`. Unsupported format_version or engine mismatch are rejected up front.
* `MetastoreRow::get_value(index) -> MetastoreValue` — a new trait method that returns the raw column value without type coercion, so generic export logic can serialize columns without knowing each column's expected type at compile time. SQLite and Turso impls add a one-line implementation that clones from their existing values Vec.

Tests in `crates/cayenne/src/metastore/snapshot.rs::tests`:

* `round_trip_preserves_rows_and_relocates_paths` — exports a dataset from one metastore, imports into a fresh metastore at a different anchor, verifies all partitions resolve to the new anchor.
* `import_replaces_prior_dataset_rows_wholesale` — confirms the DuckDB-style snapshot-refresh semantic that import wipes any prior rows for the same dataset_name before inserting (no leftover partitions).
* `import_leaves_other_datasets_untouched` — confirms slicing is correctly scoped: importing dataset A does not affect dataset B's rows in the same shared metastore.
* `rejects_unsupported_format_version` — both wrong format_version and wrong engine identifier are rejected with clear error messages *before* any DML runs.
* `json_round_trip` — slice serializes and parses back to an equivalent slice.

Follow-up (separate commit / PR): wire `CayenneSnapshotEngine` into `SnapshotManager`'s directory create/extract paths so the new format is actually used by snapshot upload/download. With that follow-up:

* the cayenne.db file leaves the archive (replaces spiceai#10649),
* paths become portable (closes spiceai#10642),
* the single-dataset-per-metadata-dir validation can be lifted,
* Cayenne refresh_mode: snapshot becomes a passing integration test.
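The anchor-relative path rewriting at the heart of the slice format can be sketched with `std::path`. The function names are illustrative; the real export/import applies this per path column inside the transaction described above.

```rust
use std::path::{Path, PathBuf};

/// Export: store a path relative to the writer's anchor so the slice
/// carries no absolute filesystem paths. Returns None for paths outside
/// the anchor (which a real exporter would treat as an error).
fn relativize(abs: &Path, anchor: &Path) -> Option<PathBuf> {
    abs.strip_prefix(anchor).ok().map(PathBuf::from)
}

/// Import: re-anchor the stored relative path under the reader's
/// (possibly different) data directory.
fn absolutize(rel: &Path, anchor: &Path) -> PathBuf {
    anchor.join(rel)
}
```

This is what makes a slice portable: the writer's and reader's data directories never have to match, only their layouts beneath the anchor.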
* feat(cayenne): wire CayenneSnapshotEngine into snapshot pipeline

This is Commit 4 of the refresh_mode: snapshot stack. Commit 3 added the metastore export/import API; this commit wires it into the SnapshotManager upload/download pipeline so Cayenne snapshots actually use the per-dataset slice format instead of shipping raw cayenne.db.

Closes:
- spiceai#10642 (Cayenne snapshots not portable across data dirs)
- spiceai#10649 (Cayenne metastore-init races with checksum)

Architecture:

* Trait extension in runtime-acceleration::snapshot::engine: Adds two new SnapshotEngine methods, each with a no-op default so DuckDB/SQLite/Turso continue to behave exactly as before:
  - prepare_directory_snapshot(dirs, dataset_name) -> DirectorySnapshotPlan — returns (a) a list of file paths *relative to each source dir* that should be excluded from the tar, and (b) extra in-memory entries to append at the end of the tar.
  - finalize_directory_snapshot(dirs, dataset_name, extras) — an engine-specific post-extract hook.
  Plus a new SnapshotEngineError::Custom { message } variant and a SnapshotEngineError::from_display() constructor so engines defined outside runtime-acceleration (CayenneSnapshotEngine in the runtime crate) can surface their own rich errors without needing a feature-gated variant in the trait crate.
* directory_archive: new archive_directories_with_plan() that takes skip_relative_paths + extras. The original archive_directories() is preserved as a thin wrapper. The new add_directory_to_archive_filtered helper handles the recursive walk while honoring the skip set. The module remains engine-agnostic — the skip predicate is just a HashSet passed in.
* SnapshotManager:
  - create_directory_snapshot now calls prepare_directory_snapshot first, then archive_directories_with_plan with the engine's skip list and extras.
  - download_to_directories now calls finalize_directory_snapshot after extract.
  - New with_snapshot_engine(self, Arc<dyn SnapshotEngine>) builder so accelerators can inject a custom engine.
* download_snapshot_if_needed and snapshot_before_recreate gain an Option<Arc<dyn SnapshotEngine>> parameter. DuckDB/SQLite/Turso pass None (default behavior). Cayenne builds and passes a CayenneSnapshotEngine.
* Cayenne accelerator (crates/runtime/src/dataaccelerator/cayenne):
  - New module snapshot_engine.rs implements CayenneSnapshotEngine. Holds an Arc<dyn MetadataCatalog>, dataset_name, and data_dir_anchor. prepare_directory_snapshot exports the per-dataset slice via MetadataCatalog::export_dataset_slice and instructs the archiver to skip cayenne.db / -wal / -shm. finalize_directory_snapshot reads the slice JSON back from the extracted directory and calls MetadataCatalog::import_dataset_slice.
  - The cayenne accelerator's bootstrap path constructs a CayenneSnapshotEngine using its existing get_or_create_catalog machinery and threads it through download_snapshot_if_needed.
* cayenne::MetadataCatalog: new export_dataset_slice and import_dataset_slice trait methods (default impls return InvalidOperationNoSource); CayenneCatalog overrides both with dispatch to the underlying SQLite or libsql metastore.
* Lifted validate_cayenne_snapshot_consistency's SharedAcceleration restriction. With per-dataset slices, multiple Cayenne datasets sharing a metastore directory can each ship their own snapshot without clobbering each other's metastore rows on extract. The InconsistentSnapshotSettings (mixed enabled/disabled) check stays.
* Re-enabled the Cayenne snapshot_refresh integration test (snapshot_refresh::cayenne::snapshot_refresh_cayenne_bootstrap_then_refresh).
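The skip-list + extras plan that prepare_directory_snapshot hands to the archiver can be sketched as a pure function over relative paths. This is illustrative only: the function name is invented, and the real archiver walks the filesystem recursively and writes tar entries rather than returning names.

```rust
use std::collections::HashSet;

/// Compute the tar entry list for a directory snapshot: drop paths on
/// the engine's skip list, then append the engine's extra in-memory
/// entries (e.g. the exported metastore slice) at the end.
fn plan_archive_entries(
    files: Vec<String>,
    skip_relative_paths: &HashSet<String>,
    extras: &[(String, Vec<u8>)],
) -> Vec<String> {
    let mut entries: Vec<String> = files
        .into_iter()
        .filter(|f| !skip_relative_paths.contains(f))
        .collect();
    entries.extend(extras.iter().map(|(name, _)| name.clone()));
    entries
}
```

For Cayenne, the skip set holds cayenne.db and its `-wal`/`-shm` sidecars while the extras carry the per-dataset slice JSON, so the archive never ships the shared metastore file.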
Tests:

- 3 new unit tests in cayenne::snapshot_engine::tests:
  * create_directory_snapshot_skips_cayenne_db_and_emits_slice
  * refuses_mismatched_dataset
  * finalize_missing_slice_returns_clear_error
- validate_snapshots tests updated; the test_cayenne_shared_acceleration_with_snapshots_errors test is renamed to ..._now_supported and asserts Ok.
- Existing 91 runtime-acceleration::snapshot lib tests still pass.
- Cayenne integration test runs against real S3 in CI (same harness as the DuckDB/SQLite/Turso refresh-mode tests).

* fix(snapshot): bootstrap dataset checkpoint from snapshot metadata

When a downloaded snapshot doesn't carry a populated _dataset_checkpoint row (the steady state for Cayenne, where the on-disk archive ships the per-dataset metastore slice JSON instead of the raw cayenne.db), the post-extract Checkpointer::get_schema() returns None and bootstrap was failing with MissingSchema. The metadata.json carries the dataset's verified schema; after import we now materialize that schema into the local checkpoint via Checkpointer::checkpoint(metadata_schema, None) so the spice_sys side matches the snapshot.

For DuckDB / SQLite / Turso this branch is unreachable in steady state (their archives ship _dataset_checkpoint already); on a corrupted snapshot the self-heal is harmless because the metadata schema is the same one we'd otherwise have validated against.

This unblocks refresh_mode: snapshot for Cayenne and re-enables the snapshot_refresh::cayenne integration test by default. Closes spiceai#10658.
Summary
`DuckDBTableProviderFactory` and `SqliteTableProviderFactory` cache pool instances by `DbInstanceKey` (file path or memory). There is currently no way for a caller to evict an entry from that registry, so a caller that legitimately needs to swap the underlying database file out-of-band — for example, after restoring it from a snapshot in object storage — has no way to force the next provider build to open a fresh pool. Subsequent providers reuse the prior pool's already-open connections, which keep observing the previously-opened file inode rather than the replacement on disk.

This PR adds a small, targeted invalidation API on both factories:
- `DuckDBTableProviderFactory::invalidate_instance(&DbInstanceKey) -> Option<DuckDbConnectionPool>`
- `DuckDBTableProviderFactory::invalidate_file_instance(impl Into<Arc<str>>) -> Option<DuckDbConnectionPool>`
- `SqliteTableProviderFactory::invalidate_instance(&DbInstanceKey) -> Option<SqliteConnectionPool>`
- `SqliteTableProviderFactory::invalidate_file_instance(impl Into<Arc<str>>) -> Option<SqliteConnectionPool>`

Semantics: each call removes the cached entry for the key and returns the evicted pool if one was present (otherwise `None`); the next provider build for that key opens a fresh pool.

Motivation
Used by spiceai/spiceai#10565 (`refresh_mode: snapshot`) to allow accelerator providers to be cleanly reloaded after a snapshot is restored to the local file. Manual end-to-end testing without this API confirmed that the swap path produces a fresh provider, but queries continue to read stale data from the cached connection pool.

Tests
- `duckdb::tests::invalidate_instance_drops_cached_pool` — insertion, idempotent reuse, eviction, no-op on a missing key, repopulation.
- `sqlite::tests::invalidate_instance_drops_cached_pool` — same coverage, including the file-path convenience wrapper.

Both pass: `cargo test --lib --features duckdb,sqlite,duckdb-federation invalidate_instance`. Clippy clean.

Note on backporting
Branched from the `spiceai-52` rev currently consumed by Spice (d7fae5f0). Targeting `main` here as the canonical upstream landing; will request a backport to `spiceai-52` once accepted.