
Deploy workflow workers as an ephemeral ECS Fargate fleet — Closes #33 #34

Draft
conradbzura wants to merge 7 commits into 30-add-preprocessing-indexing-workflow from 33-ecs-fargate-discovery-deployment



Summary

Add the runtime application substrate the workflow subsystem needs to run on the ECS Fargate fleet outlined in #32: an EcsProvisioner that launches ephemeral worker containers per workflow via RunTask, an EcsDiscovery poll-and-diff loop over ListTasks + DescribeTasks that surfaces healthy tasks to Wool's worker pool, an S3Cache backend that mirrors LocalFsCache semantics over boto3, a worker-container entrypoint, and a workflow heartbeat that lets multi-hour preprocessing runs survive the stale-reclaim predicate. The same boto3-backed code targets real AWS in production and LocalStack in dev — only AWS_ENDPOINT_URL differs.

Two WoolExecutor defaults move with the multi-hour profile: DEFAULT_WORKFLOW_DURATION_CAP_SECONDS rises from 20 min to 4 h (env-driven), and _NO_WORKERS_RETRY_ATTEMPTS widens from 5 to 60 to cover the Fargate cold-start budget. STALE_WORKFLOW_THRESHOLD drops from 1 h to 15 min. Paired with a 5-min heartbeat sidecar in the executor, the shorter threshold catches genuinely dead workers quickly without falsely reclaiming legitimate long-running sorts.

Scope intentionally narrows to the runtime application code so it lands self-contained with full unit coverage. The following items are deferred to follow-up PRs, once the runtime classes have integration coverage and a deploy pipeline exists: the LocalStack docker-compose.yml, CloudFormation worker task-def updates, the multi-stage worker Dockerfile + SOCI index, README documentation, and LocalStack-backed integration tests.

Closes #33

Proposed changes

S3 cache backend

Add S3Cache (src/cfdb/workflows/cache.py) as a CacheBackend over boto3 with head_object / get_object (with Range) / upload_file / delete_object. Support an optional key prefix for sharing a single bucket across environments, reject path-traversal segments, and yield an empty iterator on missing objects to match LocalFsCache semantics. Centralize client construction in build_s3_client so production, LocalStack-backed dev, and unit tests all produce the same shape of client — only the endpoint differs.
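
The key-prefix handling and path-traversal check described above could look roughly like the sketch below. It is an illustration only, not the PR's code: `resolve_key` and `range_header` are hypothetical helper names, and rejecting empty and `.` segments is an assumption that goes beyond what the description states.

```python
def resolve_key(key: str, prefix: str = "") -> str:
    """Map a cache key to an S3 object key under an optional prefix."""
    segments = key.split("/")
    # Assumption: empty, "." and ".." segments are all treated as invalid;
    # the PR description only mentions path-traversal segments.
    if any(segment in ("", ".", "..") for segment in segments):
        raise ValueError(f"invalid cache key: {key!r}")
    cleaned = prefix.strip("/")
    return f"{cleaned}/{key}" if cleaned else key


def range_header(start: int, end: int) -> str:
    """Build an HTTP Range value for an inclusive byte range."""
    if start < 0 or end < start:
        raise ValueError(f"invalid byte range: {start}-{end}")
    return f"bytes={start}-{end}"


print(resolve_key("sources/a.bam", prefix="dev/"))
print(range_header(0, 1023))
```

A string of the form produced by `range_header` is what boto3's `get_object` accepts through its `Range` parameter; HTTP byte ranges are inclusive on both ends, which matches the inclusive range exercised in the test table.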

ECS provisioner and worker discovery

Add EcsProvisioner (src/cfdb/workflows/provisioner.py) wrapping RunTask with awsvpc network configuration. Dedupe concurrent calls keyed on the workflow mutex key so a burst of ensure_workflow calls on the same source does not fan out into multiple tasks, and guard the ~20 req/s RunTask rate limit with a semaphore. CapacityException covers both ClientError- and failures[].reason-shaped capacity / ENI errors so the executor can surface them as retryable rather than hanging.
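
As a rough sketch of the in-flight dedup plus semaphore pattern, the following uses plain asyncio with a fake launcher standing in for RunTask. `DedupingLauncher`, `fake_run_task`, and the `max_concurrent` default are hypothetical names and values chosen for illustration; the real EcsProvisioner may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class DedupingLauncher:
    """Share one in-flight launch per dedup key and bound concurrency."""

    def __init__(self, launch: Callable[[str], Awaitable[str]],
                 max_concurrent: int = 4) -> None:
        self._launch = launch
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._in_flight: Dict[str, asyncio.Task] = {}

    async def request(self, dedup_key: str) -> str:
        task = self._in_flight.get(dedup_key)
        if task is None:
            task = asyncio.create_task(self._run(dedup_key))
            self._in_flight[dedup_key] = task
            # Forget the entry once the launch settles, success or failure.
            task.add_done_callback(
                lambda _task, key=dedup_key: self._in_flight.pop(key, None))
        # shield() keeps the shared launch alive if one caller is cancelled.
        return await asyncio.shield(task)

    async def _run(self, dedup_key: str) -> str:
        async with self._semaphore:
            return await self._launch(dedup_key)


async def demo():
    calls = []

    async def fake_run_task(key: str) -> str:  # stand-in for ECS RunTask
        calls.append(key)
        await asyncio.sleep(0.01)
        return f"arn:example:task/{key}"

    launcher = DedupingLauncher(fake_run_task)
    arns = await asyncio.gather(
        *(launcher.request("source-a") for _ in range(5)))
    await asyncio.gather(
        launcher.request("source-b"), launcher.request("source-c"))
    return calls, arns


calls, arns = asyncio.run(demo())
print(calls, set(arns))
```

Note that a semaphore bounds the number of concurrent calls rather than the request rate itself, so in this sketch it only approximates a rate guard.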

Add EcsDiscovery (src/cfdb/workflows/discovery.py) implementing Wool's DiscoveryLike protocol with a poll-and-diff loop over list_tasks + describe_tasks. Filter on healthStatus: HEALTHY, extract the awsvpc IP, and emit worker-added / worker-dropped events to non-blocking subscribers via per-subscriber asyncio.Queue. Replay state on subscribe so subscribers attached after startup observe the existing healthy fleet.
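
The poll-and-diff step can be illustrated with two pure functions over DescribeTasks-shaped dictionaries. This is a sketch under assumptions: `healthy_addresses`, `diff_fleet`, and `make_task` are hypothetical names, the event tuples are a simplification of whatever Wool's event types look like, and the IP is read from the `privateIPv4Address` attachment detail.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str, str]  # (event type, task ARN, private IP)


def healthy_addresses(tasks: List[dict]) -> Dict[str, str]:
    """Map task ARN to private IPv4 for tasks reporting HEALTHY."""
    addresses: Dict[str, str] = {}
    for task in tasks:
        if task.get("healthStatus") != "HEALTHY":
            continue
        for attachment in task.get("attachments", []):
            for detail in attachment.get("details", []):
                if detail.get("name") == "privateIPv4Address":
                    addresses[task["taskArn"]] = detail["value"]
    return addresses


def diff_fleet(previous: Dict[str, str],
               current: Dict[str, str]) -> List[Event]:
    """Emit add events for new workers and drop events for vanished ones."""
    added = [("worker-added", arn, ip)
             for arn, ip in current.items() if arn not in previous]
    dropped = [("worker-dropped", arn, ip)
               for arn, ip in previous.items() if arn not in current]
    return added + dropped


def make_task(arn: str, ip: str, health: str) -> dict:
    """Hypothetical fixture shaped like one DescribeTasks entry."""
    return {
        "taskArn": arn,
        "healthStatus": health,
        "attachments": [
            {"details": [{"name": "privateIPv4Address", "value": ip}]}
        ],
    }


first = healthy_addresses([
    make_task("arn:a", "10.0.0.1", "HEALTHY"),
    make_task("arn:b", "10.0.0.2", "UNKNOWN"),
])
second = healthy_addresses([make_task("arn:c", "10.0.0.3", "HEALTHY")])
print(diff_fleet({}, first))
print(diff_fleet(first, second))
print(diff_fleet(second, second))
```

Diffing an empty previous state against the current fleet yields only add events, which is also one simple way to think about the replay a late subscriber receives.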

Worker container entrypoint

Add worker_main.py (src/cfdb/workflows/worker_main.py) starting wool.LocalWorker, exposing a tiny aiohttp /health endpoint that returns 503 during drain so ECS marks the task unhealthy before stop_task kills gRPC, installing SIGTERM/SIGINT handlers, and self-terminating after a configurable idle timeout. Deliberately omit discovery registration — EcsDiscovery polls ECS directly, so the worker only needs to bind its gRPC port and stay HEALTHY.
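
The drain behaviour can be reduced to a small state flag that the health handler consults. The sketch below deliberately leaves out aiohttp and wool so it stays standard-library only; `DrainState`, `health_status`, and `install_signal_handlers` are hypothetical names, and the actual entrypoint presumably does more during shutdown.

```python
import asyncio
import signal


class DrainState:
    """Tracks whether the worker has started draining."""

    def __init__(self) -> None:
        self.draining = False

    def begin_drain(self) -> None:
        self.draining = True


def health_status(state: DrainState) -> int:
    """HTTP status a /health handler would return for this state."""
    return 503 if state.draining else 200


def install_signal_handlers(loop: asyncio.AbstractEventLoop,
                            state: DrainState) -> None:
    """Flip the drain flag on SIGTERM or SIGINT (Unix event loops only)."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, state.begin_drain)


state = DrainState()
print(health_status(state))
state.begin_drain()
print(health_status(state))
```

Returning 503 as soon as draining starts gives the ECS health check a chance to mark the task unhealthy before the gRPC port goes away.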

Workflow heartbeat

Add heartbeat_workflow (src/cfdb/workflows/lock.py) bumping updated_at on an active job record so a multi-hour run does not trip claim_workflow's stale-reclaim predicate. Fence the update on an active status so a heartbeat against a terminal or stale-reclaimed record is a silent no-op — the executor uses the False return as a stop signal. Shorten STALE_WORKFLOW_THRESHOLD from 1 h to 15 min and expose DEFAULT_HEARTBEAT_INTERVAL_SECONDS as the cadence shared between executor and tests.
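
The interaction between the fenced heartbeat and the stale-reclaim predicate can be sketched against an in-memory record. Everything here is illustrative: the status names in `ACTIVE_STATUSES`, the 300 s interval value, and the `heartbeat` / `is_stale` helpers are assumptions, and a real store would need the status check and the write to happen as one conditional update.

```python
from datetime import datetime, timedelta, timezone

STALE_WORKFLOW_THRESHOLD = timedelta(minutes=15)
DEFAULT_HEARTBEAT_INTERVAL_SECONDS = 300  # assumed value (5 min)
ACTIVE_STATUSES = frozenset({"PENDING", "RUNNING"})  # assumed status names


def heartbeat(job: dict, now: datetime) -> bool:
    """Bump updated_at only while the job is active; False means stop."""
    if job["status"] not in ACTIVE_STATUSES:
        return False
    job["updated_at"] = now
    return True


def is_stale(job: dict, now: datetime) -> bool:
    """An active job is reclaimable once it has gone quiet for too long."""
    quiet_for = now - job["updated_at"]
    return job["status"] in ACTIVE_STATUSES and quiet_for > STALE_WORKFLOW_THRESHOLD


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
job = {"status": "RUNNING", "updated_at": now - timedelta(minutes=20)}

stale_before = is_stale(job, now)
beat_ok = heartbeat(job, now)
stale_after = is_stale(job, now)

job["status"] = "COMPLETED"
beat_after_release = heartbeat(job, now + timedelta(minutes=5))
print(stale_before, beat_ok, stale_after, beat_after_release)
```

With a 5-minute cadence and a 15-minute threshold, a live worker can miss two consecutive beats before its record becomes reclaimable.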

Executor integration

Extend WoolExecutor with provisioner and heartbeat_interval_seconds ctor args. Invoke the provisioner from ensure_workflow on a fresh claim and surface CapacityException as a terminal FAILED job with a retryable error string. Spawn a heartbeat sidecar task that bumps updated_at while the workflow body runs, and cancel it in finally before the terminal release so the next tick cannot observe an active record while status is flipping. Raise DEFAULT_WORKFLOW_DURATION_CAP_SECONDS from 1200 s (20 min) to 14400 s (4 h) and widen _NO_WORKERS_RETRY_ATTEMPTS from 5 to 60 (~60 s window) for Fargate cold-start. Drop the LambdaExecutor mention from the JobExecutor ABC docstring.
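
The sidecar lifecycle (start before the body, cancel in `finally`, stop early when the heartbeat reports False) might be sketched as below. `heartbeat_loop`, `run_with_heartbeat`, and the demo callables are hypothetical; the real executor wires this into its claim and release logic.

```python
import asyncio
import contextlib
from typing import Awaitable, Callable


async def heartbeat_loop(beat: Callable[[], Awaitable[bool]],
                         interval: float) -> None:
    """Call beat() every interval seconds until it reports False."""
    while True:
        await asyncio.sleep(interval)
        if not await beat():
            return


async def run_with_heartbeat(body: Callable[[], Awaitable[str]],
                             beat: Callable[[], Awaitable[bool]],
                             interval: float) -> str:
    sidecar = asyncio.create_task(heartbeat_loop(beat, interval))
    try:
        return await body()
    finally:
        # Stop the sidecar before the caller releases the job record.
        sidecar.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await sidecar


async def demo():
    beats = 0

    async def beat() -> bool:
        nonlocal beats
        beats += 1
        return True

    async def body() -> str:
        await asyncio.sleep(0.2)  # stands in for the workflow body
        return "done"

    result = await run_with_heartbeat(body, beat, interval=0.02)
    beats_at_release = beats
    await asyncio.sleep(0.06)  # no further beats should arrive
    return result, beats_at_release, beats


result, beats_at_release, beats_final = asyncio.run(demo())
print(result, beats_at_release, beats_final)
```

Awaiting the cancelled sidecar inside `finally` ensures no heartbeat is still in flight when the terminal status is written.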

API lifespan wiring

Add env vars to src/cfdb/api/__init__.py: AWS_ENDPOINT_URL, AWS_REGION, WORKFLOW_S3_BUCKET / WORKFLOW_S3_PREFIX, ECS_CLUSTER, ECS_WORKER_TASK_DEFINITION, ECS_WORKER_SUBNETS, ECS_WORKER_SECURITY_GROUPS, ECS_WORKER_ASSIGN_PUBLIC_IP, WORKFLOW_DURATION_CAP_SECONDS. Have the lifespan pick S3Cache over LocalFsCache when WORKFLOW_S3_BUCKET is set and build an EcsProvisioner when ECS env config is present (helper _maybe_build_provisioner). Leave the bare PoC profile unchanged: with none of the new env vars set, the API continues to use the local filesystem cache plus the in-process WorkerPool.
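
The env-driven selection could be approximated with two small functions over an environment mapping. This sketch is hypothetical: `select_cache`, `provisioner_config`, and `split_csv` are illustrative names, the required-variable set follows the commit message for the lifespan wiring (cluster, task definition, subnets), and comma-separated list values are an assumption about the env format.

```python
from typing import List, Mapping, Optional

REQUIRED_ECS_VARS = (
    "ECS_CLUSTER",
    "ECS_WORKER_TASK_DEFINITION",
    "ECS_WORKER_SUBNETS",
)


def split_csv(value: str) -> List[str]:
    """Assumption: list-valued env vars are comma-separated."""
    return [item.strip() for item in value.split(",") if item.strip()]


def select_cache(env: Mapping[str, str]) -> str:
    """Name of the cache backend the lifespan would construct."""
    return "S3Cache" if env.get("WORKFLOW_S3_BUCKET") else "LocalFsCache"


def provisioner_config(env: Mapping[str, str]) -> Optional[dict]:
    """Return provisioner settings, or None when ECS config is incomplete."""
    if not all(env.get(name) for name in REQUIRED_ECS_VARS):
        return None
    return {
        "cluster": env["ECS_CLUSTER"],
        "task_definition": env["ECS_WORKER_TASK_DEFINITION"],
        "subnets": split_csv(env["ECS_WORKER_SUBNETS"]),
        "security_groups": split_csv(env.get("ECS_WORKER_SECURITY_GROUPS", "")),
    }


poc_env: dict = {}
ecs_env = {
    "WORKFLOW_S3_BUCKET": "example-bucket",
    "ECS_CLUSTER": "example-cluster",
    "ECS_WORKER_TASK_DEFINITION": "example-worker:3",
    "ECS_WORKER_SUBNETS": "subnet-1, subnet-2",
}
print(select_cache(poc_env), provisioner_config(poc_env))
print(select_cache(ecs_env), provisioner_config(ecs_env))
```

In practice the mapping passed in would be `os.environ`; taking it as a parameter keeps the selection logic easy to unit-test.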

Build dependencies

Add boto3 to runtime deps and moto[ecs,s3] to dev deps so the unit tests can exercise boto3 paths without LocalStack or real AWS.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestS3Cache | An S3Cache constructed with empty bucket | The constructor runs | ValueError is raised | Constructor input validation |
| 2 | TestS3Cache | A bucket containing no object for the key | head is awaited | None is returned | Cache-miss head |
| 3 | TestS3Cache | A small payload uploaded via put | head is awaited for the same key | The reported size matches the payload | Round-trip put/head |
| 4 | TestS3Cache | A cached object | get is iterated without a byte range | The full payload is yielded | Full-object streaming |
| 5 | TestS3Cache | A cached object | get is iterated with an inclusive byte range | Only the slice is yielded | Range-aware streaming |
| 6 | TestS3Cache | A bucket with no object for the key | get is iterated | An empty stream is produced | Cache-miss get |
| 7 | TestS3Cache | A cached object | delete is awaited | True is returned and the object is gone | Existing-key deletion |
| 8 | TestS3Cache | A bucket with no object for the key | delete is awaited | False is returned | Absent-key deletion |
| 9 | TestS3Cache | A key containing a .. segment | put is awaited | ValueError is raised | Path-traversal rejection |
| 10 | TestS3Cache | An S3Cache constructed with a key prefix | Round-trip put/head is awaited | Object lands under the prefix | Prefix application |
| 11 | TestEcsProvisioner | A provisioner constructed without a cluster | The constructor runs | ValueError is raised | Constructor input validation |
| 12 | TestEcsProvisioner | A provisioner constructed without a task definition | The constructor runs | ValueError is raised | Constructor input validation |
| 13 | TestEcsProvisioner | A provisioner constructed without subnets | The constructor runs | ValueError is raised | Constructor input validation |
| 14 | TestEcsProvisioner | A configured provisioner | request is awaited once | RunTask is called once with the expected payload | Single-caller dispatch |
| 15 | TestEcsProvisioner | A provisioner with concurrent calls sharing a dedup key | Many request calls await in parallel | RunTask is invoked once and all callers receive the same ARN | In-flight dedup |
| 16 | TestEcsProvisioner | A provisioner with concurrent calls under distinct dedup keys | Many request calls await in parallel | RunTask is invoked once per key | Dedup boundary |
| 17 | TestEcsProvisioner | A boto3 client raising ClientError for capacity | request is awaited | CapacityException is raised | Capacity error path |
| 18 | TestEcsProvisioner | A RunTask response with capacity in failures[] | request is awaited | CapacityException is raised | Failure-payload capacity path |
| 19 | TestEcsDiscovery | A discovery constructed without a cluster | The constructor runs | ValueError is raised | Constructor input validation |
| 20 | TestEcsDiscovery | A discovery constructed without a task family | The constructor runs | ValueError is raised | Constructor input validation |
| 21 | TestEcsDiscovery | An ECS client returning healthy tasks | _poll_once is awaited | worker-added events are emitted with the expected addresses | Initial healthy fleet |
| 22 | TestEcsDiscovery | An ECS client returning a non-healthy task | _poll_once is awaited | The unhealthy task is filtered out | Health filter |
| 23 | TestEcsDiscovery | A previously seen worker absent on a later poll | _poll_once is awaited again | worker-dropped is emitted for the gone worker | Drop diff |
| 24 | TestEcsDiscovery | A steady-state ECS client | _poll_once is awaited twice | No duplicate events are emitted | Idempotent steady state |
| 25 | TestEcsDiscovery | A subscriber with a filter rejecting a worker | The discovery emits an add event | The filtered subscriber receives nothing | Subscriber filter |
| 26 | TestParseArgs | A clean environment | _parse_args runs without args | Documented defaults are returned | Default arg parsing |
| 27 | TestParseArgs | Worker env vars set | _parse_args runs without args | Env values override defaults | Env override |
| 28 | TestParseArgs | Worker env vars set and CLI flags passed | _parse_args runs with the flags | CLI values override env | CLI precedence |
| 29 | TestHeartbeatWorkflow | An active job whose updated_at has been backdated | heartbeat_workflow is awaited | True is returned and updated_at advances | Heartbeat happy path |
| 30 | TestHeartbeatWorkflow | A job already released to COMPLETED | heartbeat_workflow is awaited | False is returned | Heartbeat stop signal |
| 31 | TestStaleWorkflowThreshold | The current STALE_WORKFLOW_THRESHOLD constant | Its value is read | It equals 15 minutes | Threshold guard |
| 32 | TestWoolExecutorWithProvisioner | An executor with a recording provisioner | ensure_workflow is awaited for a fresh key | The provisioner observes one request keyed by the workflow mutex key | Provisioner dispatch |
| 33 | TestWoolExecutorWithProvisioner | A provisioner that always raises CapacityException | ensure_workflow is awaited | The job is FAILED with a "capacity"-prefixed error | Capacity-failure handling |
| 34 | TestWoolExecutorHeartbeat | An executor with a sub-second heartbeat cadence and a blocking processor | ensure_workflow is awaited and the test polls during the run | updated_at advances past the original claim timestamp | In-flight heartbeat |

@conradbzura conradbzura self-assigned this May 1, 2026
@conradbzura conradbzura linked an issue May 1, 2026 that may be closed by this pull request
14 tasks
@conradbzura conradbzura changed the title from "Deploy workflow workers as an ephemeral ECS Fargate fleet with Mongo-backed discovery — Closes #33" to "Deploy workflow workers as an ephemeral ECS Fargate fleet — Closes #33" May 1, 2026
@conradbzura conradbzura force-pushed the 33-ecs-fargate-discovery-deployment branch 3 times, most recently from 906c063 to 89f4685 Compare May 2, 2026 23:59
@conradbzura conradbzura force-pushed the 30-add-preprocessing-indexing-workflow branch 2 times, most recently from c8585f1 to b92c47a Compare May 4, 2026 01:55
@conradbzura conradbzura force-pushed the 30-add-preprocessing-indexing-workflow branch from abc625c to 94535ff Compare May 19, 2026 20:59
@conradbzura conradbzura force-pushed the 33-ecs-fargate-discovery-deployment branch 2 times, most recently from a92e44c to b47e8cd Compare May 20, 2026 17:58
The ECS Fargate worker fleet needs boto3 at runtime for the S3 cache
backend, the ECS provisioner, and the worker discovery loop. moto is
added as a dev-only dependency so the unit tests can exercise the
boto3 code paths without standing up LocalStack or hitting real AWS.

S3Cache is a CacheBackend implementation over boto3 with the same
range-aware semantics as LocalFsCache (head_object, get_object with a
Range header, upload_file, delete_object). Production points it at
real S3; LocalStack-backed dev points it at the LocalStack endpoint
via AWS_ENDPOINT_URL — only the endpoint differs.

The backend supports an optional key prefix for sharing a single
bucket across environments, rejects path-traversal segments, and
yields an empty iterator on missing objects so router code can treat
cache misses uniformly across backends.

EcsProvisioner is a thin boto3 RunTask wrapper that launches an
ephemeral worker container per workflow, with awsvpc network
configuration, concurrent-call dedup keyed on the workflow mutex key
(so a burst of ensure_workflow calls on the same source doesn't fan
out into multiple tasks), and a semaphore guarding the ~20 req/s
RunTask rate limit. CapacityException covers both ClientError- and
failures[].reason-shaped capacity / ENI errors so the executor can
retry rather than hang.

EcsDiscovery is a Wool DiscoveryLike poll-and-diff over list_tasks +
describe_tasks: it filters on healthStatus HEALTHY, extracts the
awsvpc IP, and emits worker-added / worker-dropped events to
non-blocking subscribers via per-subscriber asyncio.Queue. State replay
on subscribe means subscribers attached after startup observe the
existing healthy fleet.

worker_main.py is the entrypoint baked into the worker container
image. It starts a wool.LocalWorker so the API can dispatch routines
to the task, exposes a tiny aiohttp /health endpoint that returns 503
during drain — so ECS marks the task unhealthy before stop_task kills
the gRPC port — and installs SIGTERM/SIGINT handlers so a stop_task
issued by the API or the Fargate scheduler shuts down cleanly. The
idle-shutdown timeout is configurable so tasks self-terminate when
their workflow completes.

The entrypoint deliberately does not register itself with discovery;
EcsDiscovery polls ECS directly and surfaces healthy tasks to the
pool, so the worker only needs to be listening and HEALTHY.

Multi-hour preprocessing runs (samtools sort + index on a multi-GB BAM,
tabix on a large interval file) routinely exceed the previous 1200 s
(20 min) default and trip the asyncio.timeout in _run_workflow. The
new 14400 s (4 h) default sizes the cap for the real workload profile
without requiring every operator to set CFDB_WORKFLOW_DURATION_CAP_S
explicitly.

The cap remains env-driven so fixture-bound dev setups can lower it
to keep test runs snappy. The accompanying docstring in executor.py
is updated to match the new rationale.

ensure_workflow now requests a worker from an externally-injected
EcsProvisioner on a fresh claim, dedup-keyed on the workflow mutex so
two concurrent claims for the same source file share one RunTask and
one worker. The request is awaited in _run_workflow between
mark_running and the routine-stream open so a capacity / ENI /
throttling failure (surfaced as RetryableProvisionerError) routes
through the same FAILED terminal path as a stream-open failure, with
a "provisioner:" prefix preserved on the persisted error so the
operator can tell the two apart in /jobs/{id}.

The provisioner ctor arg defaults to None — the PoC dev profile that
relies on manually-started wool workers via LanDiscovery is unchanged.
The EcsProvisioner type is imported under TYPE_CHECKING so executor.py
imports stay boto3-free when the provisioner isn't in use.

TestWoolExecutorWithProvisioner covers the three observable shapes:
request issued on fresh claim with the workflow_key dedup_key, request
suppressed on attach to an already-claimed workflow, and capacity
failures landing as FAILED with the provisioner error preserved.

Add the AWS / ECS env config (AWS_ENDPOINT_URL, AWS_REGION,
WORKFLOW_S3_BUCKET, WORKFLOW_S3_PREFIX, ECS_CLUSTER,
ECS_WORKER_TASK_DEFINITION, ECS_WORKER_TASK_FAMILY,
ECS_WORKER_SUBNETS, ECS_WORKER_SECURITY_GROUPS,
ECS_WORKER_ASSIGN_PUBLIC_IP) to cfdb.api and switch the lifespan to
pick the right runtime backends per env state.

Three helpers gate the selection so each concern stays separable:
_build_cache returns S3Cache when WORKFLOW_S3_BUCKET is set and
LocalFsCache otherwise; _maybe_build_provisioner returns an
EcsProvisioner when ECS_CLUSTER / task-def / subnets are all set and
None otherwise; _build_discovery wraps EcsDiscovery in its async
context (or yields LanDiscovery unchanged) so the lifespan's
WorkerPool block opens against either discovery with the same shape.
ECS_WORKER_TASK_FAMILY defaults to ECS_WORKER_TASK_DEFINITION with
any :revision suffix stripped — ListTasks only accepts the family,
RunTask accepts family[:revision].
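
Deriving the family from a task-definition value might look like the sketch below. `task_family` is a hypothetical helper name, and accepting a full task-definition ARN in addition to `family[:revision]` is an assumption added for illustration.

```python
def task_family(task_definition: str) -> str:
    """Strip any :revision suffix (and ARN path) to get the family name."""
    # Assumption: a full ARN ends in ".../family:revision", so keep only
    # the part after the last slash before splitting off the revision.
    name = task_definition.rsplit("/", 1)[-1]
    family, _, _revision = name.partition(":")
    return family


print(task_family("cfdb-worker:7"))
print(task_family("cfdb-worker"))
print(task_family(
    "arn:aws:ecs:us-east-1:123456789012:task-definition/cfdb-worker:7"))
```

The family string is what would be passed as the `family` filter to ListTasks, while the original value can go to RunTask unchanged.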

The bare PoC profile (no AWS env set) keeps producing LocalFsCache +
LanDiscovery + no provisioner, identical to the path before this
change. The EXDEV cross-filesystem check is now gated on the
LocalFsCache branch since S3 has no rename-atomicity precondition.
Startup log reports the resolved cache / discovery / provisioner
types instead of the LAN namespace so operators can see at a glance
which profile activated.
@conradbzura conradbzura force-pushed the 33-ecs-fargate-discovery-deployment branch from b47e8cd to 950686b Compare May 20, 2026 20:37