Deploy workflow workers as an ephemeral ECS Fargate fleet — Closes #33 by conradbzura · Pull Request #34 · abdenlab/cfdb

conradbzura · 2026-05-01T18:01:21Z

Summary

Add the runtime application substrate the workflow subsystem needs to run on the ECS Fargate fleet outlined in #32: an EcsProvisioner that launches ephemeral worker containers per workflow via RunTask, an EcsDiscovery poll-and-diff loop over ListTasks + DescribeTasks that surfaces healthy tasks to Wool's worker pool, an S3Cache backend that mirrors LocalFsCache semantics over boto3, a worker-container entrypoint, and a workflow heartbeat that lets multi-hour preprocessing runs survive the stale-reclaim predicate. The same boto3-backed code targets real AWS in production and LocalStack in dev — only AWS_ENDPOINT_URL differs.

Two WoolExecutor defaults move with the multi-hour profile: DEFAULT_WORKFLOW_DURATION_CAP_SECONDS rises from 20 min to 4 h (env-driven), and _NO_WORKERS_RETRY_ATTEMPTS widens from 5 to 60 to cover the Fargate cold-start budget. STALE_WORKFLOW_THRESHOLD drops from 1 h to 15 min. Paired with a 5-min heartbeat sidecar in the executor, the shorter threshold catches genuinely dead workers quickly without falsely reclaiming legitimate long-running sorts.

Scope is intentionally narrowed to the runtime application code so it lands self-contained with full unit coverage. The following are deferred to follow-up PRs, once the runtime classes have integration coverage and a deploy pipeline exists: LocalStack docker-compose.yml, CloudFormation worker task-def updates, multi-stage worker Dockerfile + SOCI index, README documentation, and LocalStack-backed integration tests.

Closes #33

Proposed changes

S3 cache backend

Add S3Cache (src/cfdb/workflows/cache.py) as a CacheBackend over boto3 with head_object / get_object (with Range) / upload_file / delete_object. Support an optional key prefix for sharing a single bucket across environments, reject path-traversal segments, and yield an empty iterator on missing objects to match LocalFsCache semantics. Centralize client construction in build_s3_client so production, LocalStack-backed dev, and unit tests all produce the same shape of client — only the endpoint differs.
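The key handling and range semantics described above can be sketched roughly as follows. This is a minimal illustration, not the code in cache.py: the helper names `build_object_key` and `range_header` are hypothetical, and the exact validation rules (for example, also rejecting `.` segments) are assumptions.

```python
from typing import Optional, Tuple


def build_object_key(key: str, prefix: str = "") -> str:
    """Join an optional prefix and a cache key, rejecting traversal segments."""
    segments = [s for s in key.split("/") if s]
    # Assumption: "." and ".." segments are both treated as traversal attempts.
    if not segments or any(s in (".", "..") for s in segments):
        raise ValueError(f"invalid cache key: {key!r}")
    cleaned = "/".join(segments)
    prefix = prefix.strip("/")
    return f"{prefix}/{cleaned}" if prefix else cleaned


def range_header(byte_range: Optional[Tuple[int, int]]) -> Optional[str]:
    """Format an inclusive (start, end) pair as an HTTP Range header value."""
    if byte_range is None:
        return None  # no Range header: the whole object is requested
    start, end = byte_range
    if start < 0 or end < start:
        raise ValueError(f"invalid byte range: {byte_range!r}")
    return f"bytes={start}-{end}"
```

The string returned by `range_header` is in the form that boto3's `get_object` accepts for its `Range` parameter; an inclusive `(0, 9)` pair selects the first ten bytes.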

ECS provisioner and worker discovery

Add EcsProvisioner (src/cfdb/workflows/provisioner.py) wrapping RunTask with awsvpc network configuration. Dedupe concurrent calls keyed on the workflow mutex key so a burst of ensure_workflow calls on the same source does not fan out into multiple tasks, and guard the ~20 req/s RunTask rate limit with a semaphore. CapacityException covers both ClientError- and failures[].reason-shaped capacity / ENI errors so the executor can surface them as retryable rather than hanging.
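A rough sketch of the in-flight dedup idea, under stated assumptions: the class name `InFlightDedup` and the injected `launch` coroutine are hypothetical stand-ins for the RunTask call, and the semaphore here bounds concurrent launches, which only approximates a per-second rate limit.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class InFlightDedup:
    """Share one launch per dedup key among concurrent callers."""

    def __init__(
        self,
        launch: Callable[[str], Awaitable[str]],
        max_concurrent: int = 4,  # assumption: illustrative bound only
    ) -> None:
        self._launch = launch
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._in_flight: Dict[str, "asyncio.Future[str]"] = {}

    async def request(self, dedup_key: str) -> str:
        task = self._in_flight.get(dedup_key)
        if task is None:
            # First caller for this key starts the launch; later callers attach.
            task = asyncio.ensure_future(self._run(dedup_key))
            self._in_flight[dedup_key] = task
            task.add_done_callback(
                lambda _done: self._in_flight.pop(dedup_key, None)
            )
        # shield() keeps the shared launch alive if one caller is cancelled.
        return await asyncio.shield(task)

    async def _run(self, dedup_key: str) -> str:
        async with self._semaphore:
            return await self._launch(dedup_key)
```

Once a launch completes, its key is removed from the in-flight map, so a later request for the same key starts a new launch rather than reusing a stale result.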

Add EcsDiscovery (src/cfdb/workflows/discovery.py) implementing Wool's DiscoveryLike protocol with a poll-and-diff loop over list_tasks + describe_tasks. Filter on healthStatus: HEALTHY, extract the awsvpc IP, and emit worker-added / worker-dropped events to non-blocking subscribers via per-subscriber asyncio.Queue. Replay state on subscribe so subscribers attached after startup observe the existing healthy fleet.
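The poll-and-diff step can be illustrated with two pure functions over DescribeTasks-shaped dictionaries. This is a sketch, not the EcsDiscovery implementation: the function names and the tuple event shape are hypothetical, and it assumes the awsvpc IP is read from the `privateIPv4Address` attachment detail.

```python
from typing import Dict, List, Tuple


def healthy_addresses(tasks: List[dict]) -> Dict[str, str]:
    """Map taskArn to private IPv4 for tasks whose healthStatus is HEALTHY."""
    result: Dict[str, str] = {}
    for task in tasks:
        if task.get("healthStatus") != "HEALTHY":
            continue  # skip UNHEALTHY and UNKNOWN tasks
        for attachment in task.get("attachments", []):
            for detail in attachment.get("details", []):
                if detail.get("name") == "privateIPv4Address":
                    result[task["taskArn"]] = detail["value"]
    return result


def diff_fleet(
    previous: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (event, task_arn, ip) tuples describing the change between polls."""
    events: List[Tuple[str, str, str]] = []
    for arn, ip in current.items():
        if arn not in previous:
            events.append(("worker-added", arn, ip))
    for arn, ip in previous.items():
        if arn not in current:
            events.append(("worker-dropped", arn, ip))
    return events
```

Diffing against the previous snapshot means an unchanged fleet yields no events, which matches the idempotent steady-state behavior listed in the test cases.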

Worker container entrypoint

Add worker_main.py (src/cfdb/workflows/worker_main.py) starting wool.LocalWorker, exposing a tiny aiohttp /health endpoint that returns 503 during drain so ECS marks the task unhealthy before stop_task kills gRPC, installing SIGTERM/SIGINT handlers, and self-terminating after a configurable idle timeout. Deliberately omit discovery registration — EcsDiscovery polls ECS directly, so the worker only needs to bind its gRPC port and stay HEALTHY.
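One way to make settings such as the idle timeout configurable, with CLI flags taking precedence over environment variables and environment variables over defaults, is sketched below. The flag names, environment variable names, and default values are all hypothetical and are not taken from worker_main.py.

```python
import argparse
from typing import Mapping, Sequence


def parse_args(argv: Sequence[str], env: Mapping[str, str]) -> argparse.Namespace:
    """Resolve worker settings with precedence: CLI flag, then env, then default."""
    parser = argparse.ArgumentParser(description="worker entrypoint sketch")
    parser.add_argument(
        "--port",
        type=int,
        # Hypothetical env var and default port.
        default=int(env.get("WORKER_PORT", "50051")),
    )
    parser.add_argument(
        "--idle-timeout",
        type=float,
        # Hypothetical env var; seconds of idleness before self-termination.
        default=float(env.get("WORKER_IDLE_TIMEOUT_SECONDS", "300")),
    )
    return parser.parse_args(list(argv))
```

Reading the environment into the argparse defaults keeps the precedence rule in one place: an explicit flag always overrides whatever the environment supplied.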

Workflow heartbeat

Add heartbeat_workflow (src/cfdb/workflows/lock.py) bumping updated_at on an active job record so a multi-hour run does not trip claim_workflow's stale-reclaim predicate. Fence the update on an active status so a heartbeat against a terminal or stale-reclaimed record is a silent no-op — the executor uses the False return as a stop signal. Shorten STALE_WORKFLOW_THRESHOLD from 1 h to 15 min and expose DEFAULT_HEARTBEAT_INTERVAL_SECONDS as the cadence shared between executor and tests.
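The interaction between the fenced heartbeat and the stale-reclaim predicate can be modeled in memory. This is a sketch under assumptions: job records are plain dictionaries rather than database documents, the set of active statuses is a guess, and the staleness comparison (strictly older than the threshold) is assumed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: which statuses count as "active" is illustrative only.
ACTIVE_STATUSES = {"PENDING", "RUNNING"}
STALE_WORKFLOW_THRESHOLD = timedelta(minutes=15)


def heartbeat(record: dict, now: datetime) -> bool:
    """Bump updated_at only while the record is active; False is a stop signal."""
    if record["status"] not in ACTIVE_STATUSES:
        return False  # terminal or reclaimed record: silent no-op
    record["updated_at"] = now
    return True


def is_stale(record: dict, now: datetime) -> bool:
    """An active record is reclaimable once it has gone a full threshold without a bump."""
    return (
        record["status"] in ACTIVE_STATUSES
        and now - record["updated_at"] > STALE_WORKFLOW_THRESHOLD
    )
```

With a 5-minute cadence and a 15-minute threshold, a live worker has to miss three consecutive heartbeats before its record becomes reclaimable.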

Executor integration

Extend WoolExecutor with provisioner and heartbeat_interval_seconds ctor args. Invoke the provisioner from ensure_workflow on a fresh claim and surface CapacityException as a terminal FAILED job with a retryable error string. Spawn a heartbeat sidecar task that bumps updated_at while the workflow body runs, and cancel it in finally before the terminal release so the next tick cannot observe an active record while status is flipping. Raise DEFAULT_WORKFLOW_DURATION_CAP_SECONDS from 1200 s (20 min) to 14400 s (4 h) and widen _NO_WORKERS_RETRY_ATTEMPTS from 5 to 60 (~60 s window) for Fargate cold-start. Drop the LambdaExecutor mention from the JobExecutor ABC docstring.
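The sidecar pattern described here might look roughly like the following. It is a sketch rather than the executor's code: `run_with_heartbeat`, `body`, and `beat` are hypothetical names, with `beat` standing in for a call that returns False once the record is no longer active.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_heartbeat(
    body: Callable[[], Awaitable[T]],
    beat: Callable[[], Awaitable[bool]],
    interval_seconds: float,
) -> T:
    """Run body() while a sidecar task calls beat() every interval_seconds."""

    async def _sidecar() -> None:
        while True:
            await asyncio.sleep(interval_seconds)
            if not await beat():
                return  # record is no longer active: stop beating

    sidecar = asyncio.ensure_future(_sidecar())
    try:
        return await body()
    finally:
        # Stop the sidecar before the caller performs the terminal release.
        sidecar.cancel()
        # return_exceptions=True absorbs the sidecar's CancelledError.
        await asyncio.gather(sidecar, return_exceptions=True)
```

Awaiting the cancelled sidecar inside `finally` guarantees no further heartbeat can land after this function returns, which is the ordering the terminal release depends on.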

API lifespan wiring

Add env vars to src/cfdb/api/__init__.py: AWS_ENDPOINT_URL, AWS_REGION, WORKFLOW_S3_BUCKET, WORKFLOW_S3_PREFIX, ECS_CLUSTER, ECS_WORKER_TASK_DEFINITION, ECS_WORKER_SUBNETS, ECS_WORKER_SECURITY_GROUPS, ECS_WORKER_ASSIGN_PUBLIC_IP, WORKFLOW_DURATION_CAP_SECONDS. Have the lifespan pick S3Cache over LocalFsCache when WORKFLOW_S3_BUCKET is set and build an EcsProvisioner when ECS env config is present (helper _maybe_build_provisioner). Leave the bare PoC profile unchanged: with none of the new env vars set, the API continues to use the local filesystem cache plus the in-process WorkerPool.
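The env-driven selection can be sketched as two small functions over an environment mapping. The function names are hypothetical, the sketch returns backend names and plain settings rather than constructing the real classes, and it assumes list-valued variables are comma-separated.

```python
from typing import Dict, List, Mapping, Optional


def select_cache_backend(env: Mapping[str, str]) -> str:
    """Name the cache backend the lifespan would pick for this environment."""
    return "S3Cache" if env.get("WORKFLOW_S3_BUCKET") else "LocalFsCache"


def _split_csv(value: str) -> List[str]:
    # Assumption: list-valued settings are comma-separated.
    return [item.strip() for item in value.split(",") if item.strip()]


def provisioner_settings(env: Mapping[str, str]) -> Optional[Dict[str, object]]:
    """Return provisioner settings when the ECS config is complete, else None."""
    cluster = env.get("ECS_CLUSTER", "")
    task_definition = env.get("ECS_WORKER_TASK_DEFINITION", "")
    subnets = _split_csv(env.get("ECS_WORKER_SUBNETS", ""))
    if not (cluster and task_definition and subnets):
        return None  # bare PoC profile: no provisioner is built
    return {
        "cluster": cluster,
        "task_definition": task_definition,
        "subnets": subnets,
        "security_groups": _split_csv(env.get("ECS_WORKER_SECURITY_GROUPS", "")),
    }
```

Passing the environment in as a mapping, instead of reading `os.environ` inside the helpers, keeps the selection logic easy to unit test.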

Build dependencies

Add boto3 to runtime deps and moto[ecs,s3] to dev deps so the unit tests can exercise boto3 paths without LocalStack or real AWS.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | `TestS3Cache` | An `S3Cache` constructed with empty bucket | The constructor runs | `ValueError` is raised | Constructor input validation |
| 2 | `TestS3Cache` | A bucket containing no object for the key | `head` is awaited | `None` is returned | Cache-miss head |
| 3 | `TestS3Cache` | A small payload uploaded via `put` | `head` is awaited for the same key | The reported size matches the payload | Round-trip put/head |
| 4 | `TestS3Cache` | A cached object | `get` is iterated without a byte range | The full payload is yielded | Full-object streaming |
| 5 | `TestS3Cache` | A cached object | `get` is iterated with an inclusive byte range | Only the slice is yielded | Range-aware streaming |
| 6 | `TestS3Cache` | A bucket with no object for the key | `get` is iterated | An empty stream is produced | Cache-miss get |
| 7 | `TestS3Cache` | A cached object | `delete` is awaited | `True` is returned and the object is gone | Existing-key deletion |
| 8 | `TestS3Cache` | A bucket with no object for the key | `delete` is awaited | `False` is returned | Absent-key deletion |
| 9 | `TestS3Cache` | A key containing a `..` segment | `put` is awaited | `ValueError` is raised | Path-traversal rejection |
| 10 | `TestS3Cache` | An `S3Cache` constructed with a key prefix | Round-trip put/head is awaited | Object lands under the prefix | Prefix application |
| 11 | `TestEcsProvisioner` | A provisioner constructed without a cluster | The constructor runs | `ValueError` is raised | Constructor input validation |
| 12 | `TestEcsProvisioner` | A provisioner constructed without a task definition | The constructor runs | `ValueError` is raised | Constructor input validation |
| 13 | `TestEcsProvisioner` | A provisioner constructed without subnets | The constructor runs | `ValueError` is raised | Constructor input validation |
| 14 | `TestEcsProvisioner` | A configured provisioner | `request` is awaited once | `RunTask` is called once with the expected payload | Single-caller dispatch |
| 15 | `TestEcsProvisioner` | A provisioner with concurrent calls sharing a dedup key | Many `request` calls await in parallel | `RunTask` is invoked once and all callers receive the same ARN | In-flight dedup |
| 16 | `TestEcsProvisioner` | A provisioner with concurrent calls under distinct dedup keys | Many `request` calls await in parallel | `RunTask` is invoked once per key | Dedup boundary |
| 17 | `TestEcsProvisioner` | A boto3 client raising `ClientError` for capacity | `request` is awaited | `CapacityException` is raised | Capacity error path |
| 18 | `TestEcsProvisioner` | A `RunTask` response with capacity in `failures[]` | `request` is awaited | `CapacityException` is raised | Failure-payload capacity path |
| 19 | `TestEcsDiscovery` | A discovery constructed without a cluster | The constructor runs | `ValueError` is raised | Constructor input validation |
| 20 | `TestEcsDiscovery` | A discovery constructed without a task family | The constructor runs | `ValueError` is raised | Constructor input validation |
| 21 | `TestEcsDiscovery` | An ECS client returning healthy tasks | `_poll_once` is awaited | `worker-added` events are emitted with the expected addresses | Initial healthy fleet |
| 22 | `TestEcsDiscovery` | An ECS client returning a non-healthy task | `_poll_once` is awaited | The unhealthy task is filtered out | Health filter |
| 23 | `TestEcsDiscovery` | A previously seen worker absent on a later poll | `_poll_once` is awaited again | `worker-dropped` is emitted for the gone worker | Drop diff |
| 24 | `TestEcsDiscovery` | A steady-state ECS client | `_poll_once` is awaited twice | No duplicate events are emitted | Idempotent steady state |
| 25 | `TestEcsDiscovery` | A subscriber with a filter rejecting a worker | The discovery emits an `add` event | The filtered subscriber receives nothing | Subscriber filter |
| 26 | `TestParseArgs` | A clean environment | `_parse_args` runs without args | Documented defaults are returned | Default arg parsing |
| 27 | `TestParseArgs` | Worker env vars set | `_parse_args` runs without args | Env values override defaults | Env override |
| 28 | `TestParseArgs` | Worker env vars set and CLI flags passed | `_parse_args` runs with the flags | CLI values override env | CLI precedence |
| 29 | `TestHeartbeatWorkflow` | An active job whose `updated_at` has been backdated | `heartbeat_workflow` is awaited | `True` is returned and `updated_at` advances | Heartbeat happy path |
| 30 | `TestHeartbeatWorkflow` | A job already released to `COMPLETED` | `heartbeat_workflow` is awaited | `False` is returned | Heartbeat stop signal |
| 31 | `TestStaleWorkflowThreshold` | The current `STALE_WORKFLOW_THRESHOLD` constant | Its value is read | It equals 15 minutes | Threshold guard |
| 32 | `TestWoolExecutorWithProvisioner` | An executor with a recording provisioner | `ensure_workflow` is awaited for a fresh key | The provisioner observes one request keyed by the workflow mutex key | Provisioner dispatch |
| 33 | `TestWoolExecutorWithProvisioner` | A provisioner that always raises `CapacityException` | `ensure_workflow` is awaited | The job is `FAILED` with a "capacity"-prefixed error | Capacity-failure handling |
| 34 | `TestWoolExecutorHeartbeat` | An executor with a sub-second heartbeat cadence and a blocking processor | `ensure_workflow` is awaited and the test polls during the run | `updated_at` advances past the original claim timestamp | In-flight heartbeat |

The ECS Fargate worker fleet needs boto3 at runtime for the S3 cache backend, the ECS provisioner, and the worker discovery loop. moto is added as a dev-only dependency so the unit tests can exercise the boto3 code paths without standing up LocalStack or hitting real AWS.

S3Cache is a CacheBackend implementation over boto3 with the same range-aware semantics as LocalFsCache (head_object, get_object with a Range header, upload_file, delete_object). Production points it at real S3; LocalStack-backed dev points it at the LocalStack endpoint via AWS_ENDPOINT_URL — only the endpoint differs. The backend supports an optional key prefix for sharing a single bucket across environments, rejects path-traversal segments, and yields an empty iterator on missing objects so router code can treat cache misses uniformly across backends.

EcsProvisioner is a thin boto3 RunTask wrapper that launches an ephemeral worker container per workflow, with awsvpc network configuration, concurrent-call dedup keyed on the workflow mutex key (so a burst of ensure_workflow calls on the same source doesn't fan out into multiple tasks), and a semaphore guarding the ~20 req/s RunTask rate limit. CapacityException covers both ClientError- and failures[].reason-shaped capacity / ENI errors so the executor can retry rather than hang. EcsDiscovery is a Wool DiscoveryLike poll-and-diff over list_tasks + describe_tasks: it filters on healthStatus HEALTHY, extracts the awsvpc IP, and emits worker-added / worker-dropped events to non-blocking subscribers via per-subscriber asyncio.Queue. State replay on subscribe means subscribers attached after startup observe the existing healthy fleet.

worker_main.py is the entrypoint baked into the worker container image. It starts a wool.LocalWorker so the API can dispatch routines to the task, exposes a tiny aiohttp /health endpoint that returns 503 during drain — so ECS marks the task unhealthy before stop_task kills the gRPC port — and installs SIGTERM/SIGINT handlers so a stop_task issued by the API or the Fargate scheduler shuts down cleanly. The idle-shutdown timeout is configurable so tasks self-terminate when their workflow completes. The entrypoint deliberately does not register itself with discovery; EcsDiscovery polls ECS directly and surfaces healthy tasks to the pool, so the worker only needs to be listening and HEALTHY.

Multi-hour preprocessing runs (samtools sort + index on a multi-GB BAM, tabix on a large interval file) routinely exceed the previous 1200 s (20 min) default and trip the asyncio.timeout in _run_workflow. The new 14400 s (4 h) default sizes the cap for the real workload profile without requiring every operator to set CFDB_WORKFLOW_DURATION_CAP_S explicitly. The cap remains env-driven so fixture-bound dev setups can lower it to keep test runs snappy. The accompanying docstring in executor.py is updated to match the new rationale.

ensure_workflow now requests a worker from an externally-injected EcsProvisioner on a fresh claim, dedup-keyed on the workflow mutex so two concurrent claims for the same source file share one RunTask and one worker. The request is awaited in _run_workflow between mark_running and the routine-stream open so a capacity / ENI / throttling failure (surfaced as RetryableProvisionerError) routes through the same FAILED terminal path as a stream-open failure, with a "provisioner:" prefix preserved on the persisted error so the operator can tell the two apart in /jobs/{id}. The provisioner ctor arg defaults to None — the PoC dev profile that relies on manually-started wool workers via LanDiscovery is unchanged. The EcsProvisioner type is imported under TYPE_CHECKING so executor.py imports stay boto3-free when the provisioner isn't in use. TestWoolExecutorWithProvisioner covers the three observable shapes: request issued on fresh claim with the workflow_key dedup_key, request suppressed on attach to an already-claimed workflow, and capacity failures landing as FAILED with the provisioner error preserved.

Add the AWS / ECS env config (AWS_ENDPOINT_URL, AWS_REGION, WORKFLOW_S3_BUCKET, WORKFLOW_S3_PREFIX, ECS_CLUSTER, ECS_WORKER_TASK_DEFINITION, ECS_WORKER_TASK_FAMILY, ECS_WORKER_SUBNETS, ECS_WORKER_SECURITY_GROUPS, ECS_WORKER_ASSIGN_PUBLIC_IP) to cfdb.api and switch the lifespan to pick the right runtime backends per env state. Three helpers gate the selection so each concern stays separable: _build_cache returns S3Cache when WORKFLOW_S3_BUCKET is set and LocalFsCache otherwise; _maybe_build_provisioner returns an EcsProvisioner when ECS_CLUSTER / task-def / subnets are all set and None otherwise; _build_discovery wraps EcsDiscovery in its async context (or yields LanDiscovery unchanged) so the lifespan's WorkerPool block opens against either discovery with the same shape. ECS_WORKER_TASK_FAMILY defaults to ECS_WORKER_TASK_DEFINITION with any :revision suffix stripped — ListTasks only accepts the family, RunTask accepts family[:revision]. The bare PoC profile (no AWS env set) keeps producing LocalFsCache + LanDiscovery + no provisioner, identical to the path before this change. The EXDEV cross-filesystem check is now gated on the LocalFsCache branch since S3 has no rename-atomicity precondition. Startup log reports the resolved cache / discovery / provisioner types instead of the LAN namespace so operators can see at a glance which profile activated.
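The family default described above amounts to stripping a trailing numeric revision. A minimal sketch, assuming the task definition is given as `family` or `family:revision` (full ARNs are not handled here), with a hypothetical helper name and a hypothetical family name in the usage note:

```python
def default_task_family(task_definition: str) -> str:
    """Derive the family name by dropping a trailing ':<revision>' suffix."""
    family, separator, revision = task_definition.rpartition(":")
    if separator and revision.isdigit():
        return family
    # No numeric revision suffix: the value is already a bare family name.
    return task_definition
```

For example, `default_task_family("cfdb-worker:7")` yields `"cfdb-worker"`, and a bare `"cfdb-worker"` is returned unchanged.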

conradbzura self-assigned this May 1, 2026

conradbzura linked an issue May 1, 2026 that may be closed by this pull request

Deploy workflow workers as an ephemeral ECS Fargate fleet #33

Open

14 tasks

conradbzura changed the title ~~Deploy workflow workers as an ephemeral ECS Fargate fleet with Mongo-backed discovery — Closes #33~~ Deploy workflow workers as an ephemeral ECS Fargate fleet — Closes #33 May 1, 2026

conradbzura force-pushed the 33-ecs-fargate-discovery-deployment branch 3 times, most recently from 906c063 to 89f4685 Compare May 2, 2026 23:59

conradbzura force-pushed the 30-add-preprocessing-indexing-workflow branch 2 times, most recently from c8585f1 to b92c47a Compare May 4, 2026 01:55

conradbzura force-pushed the 30-add-preprocessing-indexing-workflow branch from abc625c to 94535ff Compare May 19, 2026 20:59

conradbzura force-pushed the 33-ecs-fargate-discovery-deployment branch 2 times, most recently from a92e44c to b47e8cd Compare May 20, 2026 17:58

conradbzura added 7 commits May 20, 2026 16:37

conradbzura force-pushed the 33-ecs-fargate-discovery-deployment branch from b47e8cd to 950686b Compare May 20, 2026 20:37

conradbzura wants to merge 7 commits into 30-add-preprocessing-indexing-workflow from 33-ecs-fargate-discovery-deployment
