feat(processing_services): v2 worker mode for minimal stub by mihow · Pull Request #1252 · RolnickLab/antenna

mihow · 2026-04-17T20:54:14Z

Summary

Extends processing_services/minimal/ to support Antenna's v2 pull/async/worker paradigm alongside the existing v1 push mode, so agents and CI can exercise the NATS/Redis/Celery lifecycle without spinning up the external ami-data-companion (ADC) worker.

One container, three modes via MODE env var:

MODE=api — v1 FastAPI only (unchanged; CI default)
MODE=worker — v2 HTTP poll loop
MODE=api+worker — both (local dev default)
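
As a rough illustration of the three modes, the sketch below splits a MODE value into the components that would be started. The real dispatch lives in start.sh (shell); `parse_mode` and `VALID_COMPONENTS` are hypothetical names used only for this example.

```python
# Hypothetical helper: mirrors the three MODE values listed above.
VALID_COMPONENTS = {"api", "worker"}


def parse_mode(mode: str) -> set[str]:
    """Split a MODE string such as 'api+worker' into its components."""
    parts = {p.strip() for p in mode.split("+") if p.strip()}
    unknown = parts - VALID_COMPONENTS
    if not parts or unknown:
        # Reject empty values and anything other than api / worker.
        raise ValueError(f"unsupported MODE: {mode!r}")
    return parts
```

Failing on an unknown value, rather than silently defaulting, keeps a typo in MODE from quietly starting the wrong processes.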

Intention & context

Antenna supports two processing-service paradigms:

v1 (push) — synchronous HTTP: Antenna POSTs PipelineRequest to /process. Good for interactive single-image inference, /api/v2/docs/ schema exposure, and admin smoke tests.
v2 (pull/async/worker) — workers poll POST /api/v2/jobs/{id}/tasks/, process, and POST /api/v2/jobs/{id}/result/. Antenna proxies NATS internally so workers can live behind firewalls.

Today the only v2 implementation is the external ami-data-companion (heavyweight: conda + torch + CUDA warmup). This PR adds a deterministic stub analogous to what minimal/ already is for v1.

Design doc: docs/claude/planning/2026-04-17-minimal-worker-design.md — includes explicit comparison to the ADC worker (what this mirrors vs. diverges) and to PR #1011 (the earlier Celery-direct attempt, and why NATS/API was chosen over it).

What changed

Processing service

processing_services/minimal/start.sh — MODE-dispatching orchestrator with signal forwarding for TERM/INT. In api+worker mode: FastAPI first (so /process stays available while register.py retries), then register.py, then the worker loop.
processing_services/minimal/register.py — self-registration. Auth priority: ANTENNA_API_KEY → ANTENNA_API_AUTH_TOKEN → login with ANTENNA_USER/ANTENNA_PASSWORD. Pipelines are imported directly from the api module (no HTTP round-trip to self). Sends both processing_service_name (main's serializer expects it as the label for ProcessingService.get_or_create(name=...)) and client_info (read by PR feat: API key auth and identity for processing services #1194).
processing_services/minimal/worker_main.py — entry for MODE=worker. Reads the cached auth header that register.py drops at /tmp/antenna_auth_header, falls back to env-based login if absent.
processing_services/minimal/worker/ — three small modules: client.py (HTTP wrapper with retry), loop.py (per-slug poll), runner.py (task → result via the existing api/pipelines.py stubs). No duplicated pipeline or schema code.
processing_services/minimal/worker/schemas.py — v2-only worker classes (PipelineProcessingTask, PipelineTaskResult, PipelineResultsError, ProcessingServiceClientInfo, AsyncPipelineRegistrationRequest), importing the shared wire-format types (PipelineResultsResponse, PipelineConfigResponse, ...) from api/schemas.py. Keeping the pull-mode types out of api/schemas.py means the v1 FastAPI side stays free of v2 types.
processing_services/minimal/.env.dev — dev defaults for MODE, Antenna target, auth fallback, and worker tuning. Loaded via env_file: in processing_services/docker-compose.yml. No more duplicated defaults in Python code or in environment: blocks.
processing_services/minimal/Dockerfile — ENV MODE=api (CI default), CMD ["/app/start.sh"].
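
The cached-header handoff between register.py and worker_main.py could look roughly like the sketch below. It assumes the cache file holds the Authorization header value verbatim, which the description does not specify; `load_auth_header` and the `login` callable are hypothetical.

```python
from pathlib import Path
from typing import Callable


def load_auth_header(cache_path: Path, login: Callable[[], str]) -> str:
    """Return the cached Authorization value, or call login() if no cache exists."""
    try:
        cached = cache_path.read_text().strip()
    except FileNotFoundError:
        cached = ""
    # An empty or missing cache falls through to the env-based login.
    return cached or login()
```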

Out-of-the-box sequencing

ami/main/management/commands/ensure_default_project.py — idempotent bootstrap. Creates the default superuser (createsuperuser from DJANGO_SUPERUSER_* env vars) and a named project (default "Default Project"). Looks up by name, not PK, so long-lived dev DBs that already have PK 1 claimed are safe.
compose/local/django/start — runs the command when ENSURE_DEFAULT_PROJECT=1.
.envs/.local/.django, .envs/.ci/.django — set ENSURE_DEFAULT_PROJECT=1 and (for CI) DJANGO_SUPERUSER_*.
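
To show why the lookup is keyed on name rather than PK, here is a small Django-free sketch. `ProjectStore` is a hypothetical in-memory stand-in for the project table, not the actual management command.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ProjectStore:
    """In-memory stand-in for the project table (hypothetical)."""
    rows: dict[int, str] = field(default_factory=dict)
    _next_pk: count = field(default_factory=lambda: count(1))

    def ensure(self, name: str) -> tuple[int, bool]:
        """Return (pk, created); match on name so the pk can be anything."""
        for pk, existing in self.rows.items():
            if existing == name:
                return pk, False
        pk = next(self._next_pk)
        self.rows[pk] = name
        return pk, True
```

If PK 1 is already taken by another project, the default project simply lands on the next free PK, and a second call finds it again by name.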

Review-round changes (2026-04-17)

Applied @mihow's review feedback:

Schema typing. client.py, runner.py, and register.py take/return real Pydantic types rather than dicts. The v2-only worker classes live in worker/schemas.py (importing the shared wire-format types from api/schemas.py), so the v1 FastAPI side stays free of pull-mode types.
Env file. All dev defaults centralized in processing_services/minimal/.env.dev. os.environ.get(..., "<default>") fallbacks in register.py and worker_main.py replaced with os.environ[...]; inline environment: duplication in docker-compose.yml removed.
Per-slug polling. loop.py iterates for slug in my_slugs at the outer level so runner.process_task(task, slug) gets the slug as a loop variable. The prior _slug_for_job reverse-lookup is gone.
register.py docstring. Rewritten to describe current-main identity (PS identified by the Authorization header's user; processing_service_name just labels the DB row via get_or_create(name=...)) and the #1194 transition (the API key identifies the PS directly).
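
The switch from `os.environ.get(..., "<default>")` to `os.environ[...]` amounts to failing fast when the env file is not loaded. A minimal sketch of that idea follows; `read_required` is a hypothetical helper, and the variable list is an illustrative subset rather than the actual required set.

```python
import os
from typing import Mapping

# Illustrative subset of names mentioned in this PR; not the definitive list.
REQUIRED_VARS = ("MODE", "ANTENNA_USER", "ANTENNA_PASSWORD")


def read_required(env: Mapping[str, str] = os.environ, names=REQUIRED_VARS) -> dict[str, str]:
    """Return the required settings, raising KeyError if any are missing."""
    missing = [n for n in names if n not in env]
    if missing:
        raise KeyError(f"missing required env vars: {', '.join(missing)}")
    return {n: env[n] for n in names}
```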

The flow on a cold boot

Postgres/RabbitMQ/Redis/NATS come up.
Django runs migrations, then ensure_default_project creates/finds a superuser + "Default Project".
Minimal container starts. start.sh runs FastAPI in the background.
register.py imports its pipelines from api.api, logs in to Antenna (or uses ANTENNA_API_AUTH_TOKEN), resolves the project by name, and POSTs pipelines to /api/v2/projects/{id}/pipelines/.
worker_main.py starts polling GET /api/v2/jobs/?pipeline__slug=<slug>&status=STARTED&ids_only=true once per slug (the plain pipeline filter expects a DB id, not a slug). When a STARTED async_api job matches, it reserves a batch via POST /jobs/{id}/tasks/, runs the stub pipeline, and POSTs PipelineTaskResult items back.

Errors during stub execution are captured and sent as PipelineResultsError so the NATS ACK path still fires — important for exercising retry / stale-job-cutoff paths.
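
One pass of the per-slug poll loop, including the error capture described above, can be sketched as follows. The `Client` protocol, `poll_once`, and the plain-dict payloads are hypothetical stand-ins for client.py, loop.py, and the Pydantic types; this is not the PR's code.

```python
from typing import Callable, Iterable, Protocol


class Client(Protocol):
    def list_started_jobs(self, slug: str) -> Iterable[int]: ...
    def reserve_tasks(self, job_id: int) -> Iterable[dict]: ...
    def post_result(self, job_id: int, result: dict) -> None: ...


def poll_once(client: Client, pipelines: dict[str, Callable[[dict], dict]]) -> int:
    """Run one pass over every pipeline slug; return the number of results posted."""
    posted = 0
    for slug, run in pipelines.items():  # the loop variable is the slug
        for job_id in client.list_started_jobs(slug):
            for task in client.reserve_tasks(job_id):
                try:
                    result = {"ok": True, "data": run(task)}
                except Exception as exc:
                    # Report the failure instead of dropping the task,
                    # so the server side can still acknowledge it.
                    result = {"ok": False, "error": str(exc)}
                client.post_result(job_id, result)
                posted += 1
    return posted
```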

What's verified vs. what still needs verification

docker compose -f processing_services/docker-compose.yml build ml_backend_minimal — builds clean
MODE=api container — backward compatible with existing CI (same /process, /info, /livez, /readyz)
MODE=api+worker container — boots, runs register.py, retries login when Antenna unreachable
python -m py_compile — all new Python compiles
Pre-commit (black, isort, flake8) passes
End-to-end on a live stack — submit an async_api job for constant or random-detection-random-species, confirm worker picks it up, processes it, and process_nats_pipeline_result fires. Not yet run.
ensure_default_project idempotence across repeated Django boots
CI still green — docker-compose.ci.yml unchanged but the image now defaults to MODE=api via ENV (same behavior, different surface)
Confirm PR interop with #1194 when it lands (one-line removal of processing_service_name)

Test plan (for reviewers)

# Bring up Antenna core
docker compose up -d

# Wait for django to apply migrations + run ensure_default_project
docker compose logs django | grep -E 'ensure_default_project|Default Project'

# Bring up the minimal container in api+worker mode
docker compose -f processing_services/docker-compose.yml up -d ml_backend_minimal
docker compose -f processing_services/docker-compose.yml logs -f ml_backend_minimal

# Expected: FastAPI up on :2000, register.py says "Registered N pipelines",
# worker loop says "Polling for jobs on pipelines: ['constant', 'random-detection-random-species']"

# Submit a sync job (v1 path still works)
curl -X POST http://localhost:8000/api/v2/jobs/ ...

# Submit an async_api job (v2 path, exercises the new worker)
curl -X POST 'http://localhost:8000/api/v2/jobs/?start_now=true' ...

Out of scope (called out in design doc)

Real ML in the stub — stays random/constant.
Renaming example/ to global_moths / complete — deferred follow-up.
API-key-only auth: depends on #1194.
Competing-consumer / multi-worker tests.
Direct-NATS workers.

Related

#987 — job queue HTTP API (merged)
#1011 — earlier Celery-direct PSv2 attempt (never merged); design doc explains why NATS/API wins
#1194 — API key auth & identity (open; this PR is forward-compatible)
#952, #992 — example/ service evolution
ami-data-companion#94 — ADC worker (merged)
ami-data-companion#136, #141 — ADC evolution

What has been tested

Unit (ami/main/tests.py::TestEnsureDefaultProjectSeed, 3 passed): the seed creates a non-empty Default Collection, is idempotent (re-running does not re-seed — keyed on the collection), and --skip-seed creates no images. Worker modules compile; lint green.

Local end-to-end (deployed on this branch, MODE=api+worker, no ami-data-companion): the worker registered both stub pipelines to the seeded Default Project and polled. One async_api job (constant, 6 images) ran the full pull lifecycle — STARTED → NATS task claim (POST /jobs/{id}/tasks/) → stub inference → POST /jobs/{id}/result/ ×6 → process_nats_pipeline_result ×6 → NATS ACK + consumer/stream teardown → SUCCESS in ~4s, producing 6 detections and 6 occurrences with progress.errors == []. No code changes were needed.

Auth used in this run: the worker authenticates via the Authorization header (DRF). register.py resolves it in priority order — ANTENNA_API_KEY (Authorization: Api-Key …, the #1194 path) → ANTENNA_API_AUTH_TOKEN (Authorization: Token …) → fallback login with ANTENNA_USER/ANTENNA_PASSWORD at /api/v2/auth/token/login/ to obtain a DRF token. With no key/token set, this run used the user/password → Token fallback; the resolved header is cached and reused by the poll loop.
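
The priority order above can be expressed as a short function. This is a sketch under stated assumptions: `resolve_auth_header` is a hypothetical name, and `login` stands in for the call to /api/v2/auth/token/login/ that returns a DRF token.

```python
from typing import Callable, Mapping


def resolve_auth_header(env: Mapping[str, str], login: Callable[[str, str], str]) -> str:
    """Pick the Authorization value in the priority order described above."""
    if env.get("ANTENNA_API_KEY"):
        return f"Api-Key {env['ANTENNA_API_KEY']}"
    if env.get("ANTENNA_API_AUTH_TOKEN"):
        return f"Token {env['ANTENNA_API_AUTH_TOKEN']}"
    # Fallback: exchange user/password for a token (login is hypothetical).
    token = login(env["ANTENNA_USER"], env["ANTENNA_PASSWORD"])
    return f"Token {token}"
```

Treating an empty string as unset means a blank `ANTENNA_API_KEY=` line in an env file still falls through to the next option.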

netlify · 2026-04-17T20:54:19Z

✅ Deploy Preview for antenna-preview ready!

Name	Link
🔨 Latest commit	`1e0da75`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a3a1229c3609800088832d8
😎 Deploy Preview	https://deploy-preview-1252--antenna-preview.netlify.app
Lighthouse	1 path audited. Performance: 57 (🔴 down 8 from production); Accessibility: 81 (🔴 down 8 from production); Best Practices: 92 (🔴 down 8 from production); SEO: 92 (no change from production); PWA: 80 (no change from production)

netlify · 2026-04-17T20:54:19Z

✅ Deploy Preview for antenna-ssec ready!

Name	Link
🔨 Latest commit	`1e0da75`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/6a3a1229d0c31c000810ba44
😎 Deploy Preview	https://deploy-preview-1252--antenna-ssec.netlify.app

coderabbitai · 2026-04-17T20:54:22Z

Important

Review skipped

Draft detected.

mihow

See my inline comments. Make it clear in the PR description and/or design doc what mirrors or diverges from the implementation in the AMI Data Companion worker. Also review this previous implementation for adding a worker to the docker compose stack, but for connecting to Celery directly. We will stick with the NATS / API version, but I'm curious for you to read the details of this implementation and write up a comparison. #1011

…, simpler loop
Apply feedback from PR #1252 review:
- Consolidate schemas into `processing_services/minimal/api/schemas.py`. v1 push and v2 worker now share a single source of truth; the separate `worker/schemas.py` mirror is removed. Client/runner/register/loop use real Pydantic types in signatures (PipelineProcessingTask, PipelineTaskResult, ProcessingServiceClientInfo, AsyncPipelineRegistrationRequest).
- Move defaults into `processing_services/minimal/.env.dev`. Strip the hard-coded `os.environ.get(..., "<default>")` fallbacks in register.py and worker_main.py; replace with `os.environ[...]`. The inline environment block in docker-compose.yml becomes `env_file:`.
- Replace the loop's job→slug reverse-lookup with per-slug iteration: the outer loop variable IS the slug, so `_slug_for_job` is gone.
- Fix register.py docstring: the PS is identified by the Authorization header's user (main) or the API key (#1194), not by `processing_service_name`. That field just labels the DB row `get_or_create(name=...)` lands on.
- Import pipelines directly in register.py instead of HTTP-GETting /info from the co-located FastAPI. Removes the FastAPI readiness wait entirely, and means register.py works in MODE=worker (where FastAPI isn't running at all).
Co-Authored-By: Claude <noreply@anthropic.com>

mihow · 2026-04-18T00:05:16Z

Claude says: Follow-up in 58b247a9 per your note that the schemas concern was about using Pydantic at all, not about where the v2 classes live:

Re-split schemas. v2-only classes moved back out of api/schemas.py into worker/schemas.py (PipelineProcessingTask, PipelineTaskResult, PipelineResultsError, ProcessingServiceClientInfo, AsyncPipelineRegistrationRequest, TasksResponse). The v1 push side no longer imports any pull-mode types. Shared wire-format classes (PipelineResultsResponse, PipelineConfigResponse, SourceImageResponse, ...) stay in api/schemas.py and are imported by worker/schemas.py. Typed signatures throughout client.py / runner.py / loop.py / register.py / worker_main.py are preserved.
Dropped my_slugs. It was a redundant local for list(pipeline_choices.keys()). Loop now iterates pipeline_choices directly — for slug in pipeline_choices — so there's no intermediate and no awkward possessive name.

Live-stack e2e is up next.

Adds a design document for extending processing_services/minimal/ with a v2 worker mode so agents and CI can exercise the pull path without spinning up the heavyweight ami-data-companion worker. Co-Authored-By: Claude <noreply@anthropic.com>

Extends processing_services/minimal/ to support Antenna's v2 pull/async/worker processing-service paradigm alongside the existing v1 push mode. One container image, three modes via the MODE env var:
MODE=api         v1 FastAPI only (unchanged, CI default)
MODE=worker      v2 HTTP poll loop (reserve tasks -> run stub pipeline -> post results)
MODE=api+worker  both (local dev default)
The worker talks HTTP-only to Antenna's job queue API (POST /jobs/{id}/tasks/, POST /jobs/{id}/result/), matching the ADC contract. Pipelines are the same stubs used by /process (ConstantPipeline, RandomDetectionRandomSpeciesPipeline), so v1 and v2 produce identical detections.
Out-of-the-box automation:
- ami/main/management/commands/ensure_default_project.py: idempotent bootstrap that creates the default superuser + "Default Project" on Django startup. Guarded by ENSURE_DEFAULT_PROJECT=1, set in .envs/.local/.django and .envs/.ci/.django.
- compose/local/django/start runs it before starting gunicorn.
- processing_services/minimal/register.py self-provisions a ProcessingService and registers the stub pipelines with the default project. Supports user token auth (current main) and has TODOs for API-key auth (PR #1194).
- processing_services/docker-compose.yml sets MODE=api+worker by default.
This unblocks agent-driven e2e work on the async_api path without needing to stand up the external ami-data-companion worker.
Design doc: docs/claude/planning/2026-04-17-minimal-worker-design.md
Co-Authored-By: Claude <noreply@anthropic.com>

- Add "Comparison with prior/external implementations" section covering what the stub mirrors vs. diverges from the ADC worker, and why the NATS/API path was picked over the Celery-direct path in PR #1011.
- Update the env-var table to reflect the `.env.dev` file as the single source of defaults (no code fallbacks).
- Update directory layout and poll-loop pseudocode to match the current per-slug iteration (no job→slug reverse-lookup).
Co-Authored-By: Claude <noreply@anthropic.com>

…name my_slugs Follow-up to the previous review round: v2-only classes go in worker/schemas.py so the v1 push path doesn't have to know about pull-mode types. The v1 schemas and v1/v2 shared types (PipelineResultsResponse, PipelineConfigResponse, SourceImageResponse, ...) stay in api/schemas.py. Also rename the awkward `my_slugs` in loop.py. The list comes directly from `pipeline_choices` (the dict of stub pipelines exported by api.api), so iterate it directly instead of copying keys into a local variable. Co-Authored-By: Claude <noreply@anthropic.com>

The /jobs/ endpoint's `pipeline` filter is a ModelChoiceFilter expecting a DB id, so slug values get rejected with 400 "Select a valid choice". The slug-based alias exposed by JobFilterSet is `pipeline__slug`. Caught by running the minimal worker end-to-end against a live Antenna stack — the poll loop was logging 400 warnings every WORKER_POLL_INTERVAL. After the fix, two async_api test jobs (pipelines `constant` and `random-detection-random-species`) flow through worker → reserve tasks → process → submit results → NATS ACK → Job.SUCCESS. Co-Authored-By: Claude <noreply@anthropic.com>
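
For reference, a poll URL using the slug-based filter could be built as in the sketch below. The query parameter names (`pipeline__slug`, `status`, `ids_only`) come from the descriptions in this PR; `jobs_poll_url` itself is a hypothetical helper.

```python
from urllib.parse import urlencode


def jobs_poll_url(base_url: str, slug: str) -> str:
    """Build the job-listing URL for one pipeline slug."""
    query = urlencode({"pipeline__slug": slug, "status": "STARTED", "ids_only": "true"})
    return f"{base_url.rstrip('/')}/api/v2/jobs/?{query}"
```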

…ng [skip ci] The docstring referenced processing_services/.env.example, which does not exist; the real file is processing_services/minimal/.env.dev. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…efault_project Make the local/CI bootstrap genuinely zero-config for the minimal v2 worker. The worker opens each image and reads its pixel dimensions, so it needs reachable images, not path-only rows. ensure_default_project now seeds a small object-store-backed "Default Collection" into the default project by reusing the test fixtures (generate images -> object store -> sync captures). Idempotent on the collection (skips when a non-empty "Default Collection" already exists — keyed on the collection, not on any source images, since a project may have images from other sources yet still lack the collection the worker needs). Best-effort (a failure such as no object store is logged and swallowed so it never breaks startup), and opt-out via --skip-seed. Adds tests for the seed, idempotency, and the skip flag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mihow · 2026-06-23T04:57:11Z

Claude says: Rebased onto current main (was 56 behind; clean except one ami/main/tests.py conflict from #1312 having merged — resolved by keeping both), added a zero-config seed, and ran a full local async e2e through the minimal worker.

Zero-config seed. ensure_default_project now seeds a small object-store-backed Default Collection into the default project (the worker opens each image and reads pixel dimensions, so it needs reachable images, not path-only rows). Idempotent — keyed on the collection, not on "any source image", since a project can have images from other sources yet still lack the collection the worker needs. Best-effort (swallows + logs a failure such as no object store so it never breaks startup), opt-out via --skip-seed. Tests added.
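
The seed's three properties (idempotent on the collection, best-effort, opt-out) can be sketched without Django as follows. `seed_default_collection`, the `collections` dict, and `make_images` are hypothetical stand-ins for the management command, the database, and the fixture generator.

```python
import logging
from typing import Callable

logger = logging.getLogger("seed_sketch")


def seed_default_collection(
    collections: dict[str, list[str]],
    make_images: Callable[[], list[str]],
    skip_seed: bool = False,
    name: str = "Default Collection",
) -> bool:
    """Seed the named collection once; return True only when images were added."""
    if skip_seed or collections.get(name):
        return False  # opted out, or a non-empty collection already exists
    try:
        collections[name] = make_images()
    except Exception:
        # Best-effort: log and carry on so startup never fails on seeding.
        logger.exception("seeding %s failed; continuing startup", name)
        return False
    return bool(collections[name])
```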

E2E result (local, MODE=api+worker, no ADC). Worker registered both stub pipelines to the seeded project and polled. One async_api job (constant, 6 images):

STARTED → worker reserved tasks via NATS → ran the stub → POST /jobs/{id}/result/ ×6 → process_nats_pipeline_result ×6 → NATS ACK → consumer/stream auto-teardown → SUCCESS in ~4s.
6 detections + 6 occurrences created; progress.errors == []; all three stages at 100%.
No code changes were needed — the worker ran as-is. Networking worked out of the box via the external antenna_network.

So the full v2 pull lifecycle is exercised end-to-end without the heavyweight ADC worker. With the seed, a fresh stack boots straight into a processable state. This is ready to come out of draft for review whenever you like; the [ ] e2e item in the description can be ticked.

mihow · 2026-06-23T21:53:54Z

Claude says: Review for merging this as the foundational step — a correct, copy-able reference for the v2 (pull/async) API contract via a worker in the app stack. (Deeper lifecycle hardening — failure injection, concurrent-load + chaos — is intentionally out of scope here and tracked separately.)

Should-fix before merge (cheap; they protect the contract this PR exists to demonstrate):

1. Heartbeat filter is out of step with main. The poll lists jobs with pipeline__slug=<slug> (singular exact). Main's "service-live" heartbeat now fires off pipeline__slug__in (plural) plus a Celery task, so the worker never updates ProcessingService.last_seen between registrations; it reads as offline in the admin even while it's processing jobs. Since this file is meant to be the reference for talking to the v2 API, switching the poll to pipeline__slug__in keeps it in step (the filter already exists on JobFilterSet). Doesn't block processing; it's contract drift in a file people will copy.
2. Schema parity has no guard. worker/schemas.py mirrors ami/ml/schemas.py field-for-field, protected only by a "keep these in sync" docstring. For a contract reference, silent drift makes it quietly wrong. A lightweight parity test (import both, assert field names/types match, fail CI on drift) would protect exactly the thing this PR demonstrates. If you'd rather defer the test, point the docstring at the canonical module + line range so it can be diffed by hand.
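
A parity check of the kind suggested above might look like the following sketch. It uses stdlib dataclasses as a stand-in so the example is self-contained; a real test would read the declared fields of the Pydantic models in ami/ml/schemas.py and worker/schemas.py instead. `field_signature` and `schema_drift` are hypothetical names.

```python
from dataclasses import dataclass, fields


def field_signature(cls) -> dict[str, object]:
    """Map each declared field name to its declared type."""
    return {f.name: f.type for f in fields(cls)}


def schema_drift(canonical, mirror) -> dict[str, tuple]:
    """Return {field: (canonical_type, mirror_type)} for every mismatch."""
    a, b = field_signature(canonical), field_signature(mirror)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

A CI test would then assert that `schema_drift(...)` is empty for each mirrored pair, so a renamed or retyped field fails the build and names the offending field.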

Notes / small doc adds (prevent misuse of the reference):

Auth caveat. Worth stating in .env.dev / register.py that the user-password→Token and static-token paths are for local/CI; a real processing service should use the API key (#1194) so there is no stored password. Otherwise someone copies password auth into a deployment.
MODE=api+worker liveness coupling. start.sh's wait -n takes the whole container down if either child dies. A one-line note that this is intentional (harness simplicity) keeps a reader from reading it as a bug.
No re-auth on 401 (minor). The poll loop catches errors and continues (good) and there is a urllib3 retry, but a rotated/revoked token would 401-loop until restart. DRF authtoken does not expire by default, so it is low-risk; worth a one-line note that token refresh is a #1194-era concern.
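
For context on the 401 point, a single-retry re-auth wrapper is sketched below. It is not part of this PR; `call_with_reauth` and its callables are hypothetical, and `request` is assumed to return an HTTP status code.

```python
from typing import Callable


def call_with_reauth(
    request: Callable[[str], int],
    get_header: Callable[[], str],
    refresh_header: Callable[[], str],
) -> int:
    """Call request(header); on HTTP 401, refresh the header once and retry."""
    status = request(get_header())
    if status == 401:
        # One retry only, so a permanently bad credential cannot loop forever.
        status = request(refresh_header())
    return status
```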

Already addressed: rebased onto current main (the ami/main/tests.py conflict from #1312 having merged is resolved); the stale "worker/schemas.py deleted" body claim + register.py's .env.example path corrected; zero-config seed added so a fresh stack boots into a processable state; happy-path async e2e ran clean end-to-end.

Net: mergeable as the foundation. The only two I would gate on are #1 and #2 (the heartbeat filter and the schema parity guard); both are cheap, and both protect the contract-reference value.