Split oversized result uploads so wide-taxonomy batches don't get rejected by mihow · Pull Request #149 · RolnickLab/ami-data-companion

mihow · 2026-06-11T03:58:05Z

Summary

For models with very large taxonomies — for example the ~29,000-class global_moths_2024 classifier — a single batch's result upload can reach 110–142 MB, because every detection carries full per-class scores and logits arrays. In a production deployment those uploads were rejected by the reverse proxy with HTTP 413 (Request Entity Too Large), so the worker could never record its results: the upload failed, the NATS task was never acknowledged, the task eventually exhausted its redelivery budget, and the job failed with nothing stored. Smaller-taxonomy jobs were unaffected, which is why this stayed invisible until a wide-taxonomy job ran on a dense capture set.

This change makes the worker split a batch's results across multiple smaller uploads, each kept under a configurable byte cap, so wide-taxonomy jobs complete regardless of the proxy's request-body limit. It is a client-side fix that needs no server change.

List of Changes

#	Change (user/operator effect)	How (implementation)
1	Result uploads for large-taxonomy models now succeed instead of being rejected for size.	New `chunk_results_by_size()` greedy byte-bounded packer in `client.py`; `post_batch_results()` serializes each result once, packs results into chunks under `max_bytes`, posts each chunk, and reports success only if every chunk lands. A single result that exceeds the cap on its own is sent alone and logged.
2	Operators can tune the per-upload size cap per deployment.	New setting `antenna_result_post_max_bytes` → env var `AMI_ANTENNA_RESULT_POST_MAX_BYTES` (default 25 MB), threaded through `ResultPoster` into the worker.
3	Regression coverage for the splitting behaviour.	`trapdata/antenna/tests/test_result_chunking.py` — 7 tests: packing under cap, splitting a large batch across multiple uploads, over-cap single result, and all-chunks-must-succeed semantics.
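
The greedy packing described in row 1 can be sketched roughly as follows. This is an illustration only, not the PR's `chunk_results_by_size()`: the names `json_size` and `chunk_by_size` are hypothetical, and the sketch assumes each result is already a JSON-serializable dict.

```python
import json
from typing import Any


def json_size(obj: Any) -> int:
    """Size in bytes of obj once serialized as UTF-8 JSON."""
    return len(json.dumps(obj).encode("utf-8"))


def chunk_by_size(results: list[dict], max_bytes: int) -> list[list[dict]]:
    """Greedily pack results, in order, into chunks whose summed size stays within max_bytes.

    Sizes are summed per result, so the request envelope and the separators
    between items add a few extra bytes that this sketch ignores.
    """
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_bytes = 0
    for result in results:
        size = json_size(result)
        # Close the current chunk if adding this result would cross the cap.
        if current and current_bytes + size > max_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        # A result that is over the cap on its own still goes out, alone.
        current.append(result)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks
```

Because the packer never reorders or drops results, flattening the returned chunks gives back the original list. A production version would probably also budget for the envelope bytes so that the serialized request body, not just the sum of its items, stays under the cap.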

Diagnosis — why one batch reaches ~140 MB

Each ClassificationResponse carries scores and logits, each an array of length = number of model classes. At ~29k classes that is ~1.1 MB per detection for those two arrays alone.
The default batch is 24 images, and trap images routinely contain many moths each, so a batch of ~130 detections lands at roughly 120–140 MB — matching the observed rejected upload sizes.
This is payload width, not duplication: the worker builds each upload from only the current batch's detections (worker.py), and the classifier/detector are reset per batch, so there is no cross-batch accumulation.
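
A rough way to sanity-check these figures is to serialize synthetic arrays of the same width. The values below are random stand-ins at full float precision, so the result is only an order-of-magnitude estimate; the real width depends on how the worker rounds and serializes its scores and logits.

```python
import json
import random

NUM_CLASSES = 29_000        # approximate class count cited above
DETECTIONS_PER_BATCH = 130  # approximate batch density cited above

rng = random.Random(0)
# Stand-ins for one classification's two full-width arrays (assumed shapes).
classification = {
    "scores": [rng.random() for _ in range(NUM_CLASSES)],
    "logits": [rng.uniform(-20.0, 20.0) for _ in range(NUM_CLASSES)],
}

bytes_per_detection = len(json.dumps(classification).encode("utf-8"))
bytes_per_batch = bytes_per_detection * DETECTIONS_PER_BATCH

print(f"per detection: {bytes_per_detection / 1e6:.2f} MB")
print(f"per batch:     {bytes_per_batch / 1e6:.0f} MB")
```

Each float costs on the order of 20 bytes as JSON text, which is why two 29k-wide arrays land near a megabyte per detection.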

Notes and follow-ups (out of scope for this PR)

Directions to discuss, ordered by leverage. Important context: result payloads are expected to grow, not shrink. logits are needed and will be kept, and per-crop vector embeddings are planned for each detection. That makes compressing the upload more valuable over time, and argues against tightening the proxy's body-size limit.

Request gzip is the highest-leverage lever, and will matter more as payloads grow. The worker posts raw JSON, and the API server compresses responses but has no request-body decompression (no Content-Encoding: gzip handling). The big arrays — scores, logits, and soon per-crop embeddings — are numeric and compress roughly 5–10×, so gzipping uploads could cut every request at the source. It requires the API server to accept and decompress gzipped request bodies first, and the reverse proxy to pass them through. With embeddings incoming this is worth prioritizing as a cross-repo follow-up.
Keep generous size headroom. Because logits stay and embeddings are coming, neither the per-upload cap nor the proxy body-size limit should be tightened. Chunking keeps individual uploads bounded regardless of how large a full batch's results become.
labels is already omittable for large models. The per-classification labels array is list[str] | None and documented as "omitted if the model has too many labels" on both the worker and server schemas, so for a 29k-class model it is most likely already dropped — a minor saving already realized. The bulk is scores + logits (and soon embeddings).
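
As a sketch of what the request-gzip follow-up might involve on the client side, the helper below compresses a JSON body and returns the headers that describe it. The function names are hypothetical, and the second helper only stands in for the decompression step the API server would need to add; nothing here reflects existing code in either repository.

```python
import gzip
import json


def gzip_json_body(payload: dict) -> tuple[bytes, dict[str, str]]:
    """Return a gzip-compressed JSON body plus the headers that describe it."""
    raw = json.dumps(payload).encode("utf-8")
    body = gzip.compress(raw)
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
    }
    return body, headers


def read_gzip_json_body(body: bytes) -> dict:
    """What a receiving server would have to do before parsing the body."""
    return json.loads(gzip.decompress(body).decode("utf-8"))
```

With `requests`, the compressed body would be sent with `session.post(url, data=body, headers=headers)` rather than `json=payload`. The achievable ratio depends on the actual values, so it is worth measuring on real result payloads before relying on a particular figure.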

Not yet verified

No live-worker run. Chunking is verified by unit and integration tests against the serialization path, not against a real GPU job. This should be confirmed on a real large-taxonomy job, checking that each upload lands under the deployment's proxy limit.
If a single image produces enough detections to exceed the cap on its own (~13+ dense detections at ~29k classes), that one result is still sent in a single upload — it cannot be split below one image without an API-contract change. The configurable cap plus a generous proxy limit covers this for now.

Test status

uv run python -m pytest trapdata/antenna/tests/test_result_chunking.py -q → 7 passed. TDD-verified: disabling chunking made the split test fail, restoring it passed. black, isort, and flake8 clean on changed files.

Summary by CodeRabbit

New Features
- Large result batches are now automatically split into smaller chunks for more reliable posting.
- Added configurable size limit for result POST requests (default: 25 MiB).
Bug Fixes
- Result posting now continues processing remaining chunks if an individual chunk submission fails, improving resilience.
Tests
- Added comprehensive test coverage for result batch chunking functionality.

…y limits

Wide-taxonomy pipelines (e.g. the global moths model with ~29k classes) emit roughly 2 MB per detection because each classification carries full-length labels, scores, and logits arrays. A single processed batch of two dozen images with several detections each therefore serialized to 110-140 MB, which reverse proxies rejected with HTTP 413 even after the body limit was raised to 512 MB. The results for one batch were already scoped to the current batch only (no accumulation across batches), so the size came purely from payload width times detection count.

This change splits the results for a batch across multiple POST requests, each kept at or below a configurable byte cap, so no single request body exceeds the proxy limit.

- Add chunk_results_by_size() and make post_batch_results() serialize each result once, greedily pack them into byte-bounded chunks, and POST each chunk; it now returns True only if every chunk succeeds. A single result that exceeds the cap on its own is sent alone (and logged) rather than dropped.
- Add AMI_ANTENNA_RESULT_POST_MAX_BYTES setting (default 25 MB) and thread it through ResultPoster to post_batch_results.
- Add tests asserting each POST body stays under the cap, no results are dropped, and the unsplit baseline would have exceeded the cap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-11T03:58:19Z

📝 Walkthrough

Walkthrough

This PR implements byte-bounded chunking of Antenna API result POST bodies to prevent HTTP 413 (payload too large) errors when result batches exceed reverse-proxy limits. Results are now split into sequential POSTs, each constrained by a configurable per-POST byte cap (default 25 MiB).

Changes

Antenna API Result Chunking

Layer / File(s)	Summary
Configuration and byte-limit constants `trapdata/settings.py`, `trapdata/antenna/client.py`	New `antenna_result_post_max_bytes` setting in `Settings` (default 25 MiB) with GUI/INI configuration metadata; `DEFAULT_RESULT_POST_MAX_BYTES` constant defined in client module.
Core result chunking implementation `trapdata/antenna/client.py`	New `_result_json_size()` and `chunk_results_by_size()` helpers to greedily pack serialized results into byte-bounded chunks; `post_batch_results()` rewritten to serialize once, chunk, and POST each chunk sequentially while aggregating success/failure across chunks.
ResultPoster integration with chunking limits `trapdata/antenna/result_posting.py`	`ResultPoster.__init__` now accepts `max_post_bytes` parameter (defaulting to `DEFAULT_RESULT_POST_MAX_BYTES`); `_post_with_timing()` explicitly passes the limit to `post_batch_results()`.
Worker configuration and ResultPoster instantiation `trapdata/antenna/worker.py`	Job worker now passes `settings.antenna_result_post_max_bytes` when constructing `ResultPoster`, threading the byte limit from configuration through to the posting logic.
Comprehensive test suite for chunking `trapdata/antenna/tests/test_result_chunking.py`	New test module with helpers to generate large payloads and utilities to measure serialized request sizes; covers `chunk_results_by_size` packing (empty input, per-chunk caps, data preservation, oversize entries) and `post_batch_results` integration (multi-POST splitting, baseline validation, empty-result no-op).
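
For operators, the effect of the new setting can be illustrated with a standalone sketch that resolves the cap from the environment. The project threads this through its own `Settings` class, so `resolve_max_post_bytes` is a hypothetical helper, and the default below assumes the 25 MiB reading (the PR text uses both "25 MB" and "25 MiB").

```python
import os
from collections.abc import Mapping

ENV_VAR = "AMI_ANTENNA_RESULT_POST_MAX_BYTES"
DEFAULT_MAX_BYTES = 25 * 1024 * 1024  # assumed: 25 MiB


def resolve_max_post_bytes(env: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper: read the per-POST byte cap from the environment."""
    raw = env.get(ENV_VAR)
    if raw is None or not raw.strip():
        return DEFAULT_MAX_BYTES
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{ENV_VAR} must be a positive byte count, got {value}")
    return value
```

A deployment behind a proxy with a 10 MB body limit might, for example, export `AMI_ANTENNA_RESULT_POST_MAX_BYTES=8000000` to leave some headroom for the request envelope and headers.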

Sequence Diagram

sequenceDiagram
  participant Worker
  participant ResultPoster
  participant post_batch_results
  participant chunk_results_by_size
  participant HTTP
  Worker->>ResultPoster: __init__(max_post_bytes=settings.antenna_result_post_max_bytes)
  Note over ResultPoster: stores self.max_post_bytes
  Worker->>ResultPoster: post_batch(results)
  ResultPoster->>post_batch_results: post_batch_results(..., max_bytes=self.max_post_bytes)
  post_batch_results->>post_batch_results: serialize to JSON dicts
  post_batch_results->>chunk_results_by_size: JSON dicts, max_bytes
  chunk_results_by_size-->>post_batch_results: list of chunks
  loop for each chunk
    post_batch_results->>HTTP: POST {results: chunk}
    HTTP-->>post_batch_results: 200/validation error
    post_batch_results->>post_batch_results: log per-chunk status
  end
  post_batch_results-->>ResultPoster: all_ok flag
  ResultPoster-->>Worker: success status
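
The loop in the diagram aggregates per-chunk outcomes into a single flag. A minimal sketch of that semantics, with the HTTP call abstracted behind a hypothetical `post_one` callable, might look like the following. The broad `except Exception` is one possible choice and is an assumption of this sketch, not a statement about what the PR catches.

```python
from typing import Callable


def post_all_chunks(
    chunks: list[list[dict]],
    post_one: Callable[[list[dict]], None],
) -> bool:
    """Post every chunk in order; return True only if none of them failed."""
    all_ok = True
    for chunk in chunks:
        try:
            post_one(chunk)  # expected to raise on any transport or parsing failure
        except Exception:
            # Remember the failure but keep going so later chunks are still sent.
            all_ok = False
    return all_ok
```

Continuing after a failure means a retry of the whole batch will re-send chunks that already landed, so this design leans on the server tolerating duplicate results for the same batch.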

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

RolnickLab/ami-data-companion#134: Both PRs directly modify post_batch_results implementation and request handling in trapdata/antenna/client.py.
RolnickLab/ami-data-companion#113: Both PRs modify the async result-posting pipeline through ResultPoster and its integration in worker.py.

Poem

🐰 A batchy new day for results so wide,
Split by bytes and posted with pride!
No 413s shall spoil our quest,
Chunked and posted, each one blessed.
The Antenna sings—results take flight! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main objective of the PR: splitting oversized result uploads to prevent rejection for wide-taxonomy batches.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.



coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

trapdata/antenna/result_posting.py (1)
68-76: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Update class docstring to document the new max_post_bytes parameter.

The __init__ method now accepts a max_post_bytes parameter, but the class docstring (lines 52-66) doesn't document it in the Args section. This makes the parameter undiscoverable for users reading the API documentation.
📝 Proposed documentation update

Add to the docstring's Args section (after line 59):
     Args:
         max_pending: Maximum number of concurrent posts before blocking (default: 5)
+        max_post_bytes: Maximum size in bytes of a single POST body (default: 25 MiB)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@trapdata/antenna/result_posting.py` around lines 68 - 76, Update the class
docstring (the Args section in the class above __init__) to document the new
__init__ parameter max_post_bytes: explain it's the per-POST body size cap in
bytes, include the default value DEFAULT_RESULT_POST_MAX_BYTES, and place this
entry alongside max_pending and future_timeout so users can discover the
parameter when reading the API docs; reference the parameter name max_post_bytes
and the constructor method __init__ when adding the brief description.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@trapdata/antenna/client.py`:
- Around line 190-207: The loop that posts chunks uses response.json() and
AntennaResultPostResponse.model_validate(), but the except currently only
catches requests.RequestException so JSON decoding or Pydantic validation errors
will leak and abort processing; update the exception handling around the
session.post/response parsing/model_validate block (the code that calls
session.post, response.json(), and AntennaResultPostResponse.model_validate) to
also catch json.JSONDecodeError and pydantic.ValidationError (or ValueError if
pydantic isn't imported) in the same except (or use a broad Exception as a last
resort), log the error via logger.error including chunk_idx/job_id/url, set
all_ok = False, and continue to the next chunk so remaining chunks are still
posted.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c2b7005-4ad0-4b52-aee5-febb971e4e3e

📥 Commits

Reviewing files that changed from the base of the PR and between a33746a and fce66be.

📒 Files selected for processing (5)

trapdata/antenna/client.py
trapdata/antenna/result_posting.py
trapdata/antenna/tests/test_result_chunking.py
trapdata/antenna/worker.py
trapdata/settings.py

coderabbitai · 2026-06-11T04:09:45Z

+        for chunk_idx, chunk in enumerate(chunks):
+            payload = {"results": chunk}
+            try:
+                response = session.post(url, json=payload, timeout=60)
+                response.raise_for_status()
+                result = AntennaResultPostResponse.model_validate(response.json())
+                logger.debug(
+                    f"Posted chunk {chunk_idx + 1}/{len(chunks)} "
+                    f"({len(chunk)} results) to job {job_id}: "
+                    f"{result.results_queued} queued"
+                )
+            except requests.RequestException as e:
+                logger.error(
+                    f"Failed to post result chunk {chunk_idx + 1}/{len(chunks)} "
+                    f"to {url}: {e}"
+                )
+                all_ok = False
+    return all_ok


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Catch broader exception types to handle response validation errors.

The current except clause only catches requests.RequestException (line 201), which won't catch Pydantic validation errors from AntennaResultPostResponse.model_validate() (line 195) or JSON decode errors from response.json() (line 195). If the server returns malformed JSON or a valid JSON response that doesn't match the expected schema, these exceptions would propagate uncaught, preventing remaining chunks from being posted and causing the entire batch to fail.

🛡️ Proposed fix to catch all exceptions

-            except requests.RequestException as e:
+            except Exception as e:
                 logger.error(
                     f"Failed to post result chunk {chunk_idx + 1}/{len(chunks)} "
                     f"to {url}: {e}"
                 )
                 all_ok = False


Suggested change

        for chunk_idx, chunk in enumerate(chunks):
            payload = {"results": chunk}
            try:
                response = session.post(url, json=payload, timeout=60)
                response.raise_for_status()
                result = AntennaResultPostResponse.model_validate(response.json())
                logger.debug(
                    f"Posted chunk {chunk_idx + 1}/{len(chunks)} "
                    f"({len(chunk)} results) to job {job_id}: "
                    f"{result.results_queued} queued"
                )
            except Exception as e:
                logger.error(
                    f"Failed to post result chunk {chunk_idx + 1}/{len(chunks)} "
                    f"to {url}: {e}"
                )
                all_ok = False
    return all_ok


coderabbitai Bot reviewed Jun 11, 2026


mihow added the Pipeline API Updates to the requests & responses to/from processing service workers for ML pipelines label Jun 24, 2026

mihow wants to merge 1 commit into main from fix/result-post-chunking
