Split oversized result uploads so wide-taxonomy batches don't get rejected #149

Open
mihow wants to merge 1 commit into main from fix/result-post-chunking

@mihow mihow commented Jun 11, 2026

Collaborator

Summary

For models with very large taxonomies — for example the ~29,000-class global_moths_2024 classifier — a single batch's result upload can reach 110–142 MB, because every detection carries full per-class scores and logits arrays. In a production deployment those uploads were rejected by the reverse proxy with HTTP 413 (Request Entity Too Large), so the worker could never record its results: the upload failed, the NATS task was never acknowledged, the task eventually exhausted its redelivery budget, and the job failed with nothing stored. Smaller-taxonomy jobs were unaffected, which is why this stayed invisible until a wide-taxonomy job ran on a dense capture set.

This change makes the worker split a batch's results across multiple smaller uploads, each kept under a configurable byte cap, so wide-taxonomy jobs complete regardless of the proxy's request-body limit. It is a client-side fix that needs no server change.

List of Changes

| # | Change (user/operator effect) | How (implementation) |
|---|---|---|
| 1 | Result uploads for large-taxonomy models now succeed instead of being rejected for size. | New chunk_results_by_size() greedy byte-bounded packer in client.py; post_batch_results() serializes each result once, packs results into chunks under max_bytes, posts each chunk, and reports success only if every chunk lands. A single result that exceeds the cap on its own is sent alone and logged. |
| 2 | Operators can tune the per-upload size cap per deployment. | New setting antenna_result_post_max_bytes → env var AMI_ANTENNA_RESULT_POST_MAX_BYTES (default 25 MB), threaded through ResultPoster into the worker. |
| 3 | Regression coverage for the splitting behaviour. | trapdata/antenna/tests/test_result_chunking.py, 7 tests: packing under cap, splitting a large batch across multiple uploads, over-cap single result, and all-chunks-must-succeed semantics. |
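
For readers who want the packing idea in isolation, here is a minimal sketch of a greedy byte-bounded packer. It is not the code in client.py: the names json_size and chunk_by_size, the compact JSON separators, and the {"results": [...]} envelope accounting are assumptions made for illustration. A real implementation would need to measure sizes with the same serialization the HTTP client uses.

```python
import json
from typing import Any


def json_size(obj: Any) -> int:
    """Size in bytes of obj serialized as compact UTF-8 JSON (assumed format)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def chunk_by_size(items: list[dict], max_bytes: int) -> list[list[dict]]:
    """Greedily pack items, in order, into chunks whose body fits max_bytes."""
    # Fixed cost of the assumed {"results":[...]} wrapper around each chunk.
    envelope = json_size({"results": []})
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_bytes = envelope
    for item in items:
        item_bytes = json_size(item)
        # One comma separates this item from the previous one in the array.
        extra = item_bytes + (1 if current else 0)
        if current and current_bytes + extra > max_bytes:
            # Adding the item would overflow: close this chunk, start a new one.
            chunks.append(current)
            current, current_bytes, extra = [], envelope, item_bytes
        # An item larger than the cap still lands here, alone in its chunk.
        current.append(item)
        current_bytes += extra
    if current:
        chunks.append(current)
    return chunks


# Ten small stand-in results packed under a 1500-byte cap.
results = [{"id": i, "scores": [0.5] * 100} for i in range(10)]
chunks = chunk_by_size(results, max_bytes=1500)
```

Greedy in-order packing keeps results in their original order and needs only one pass, at the cost of sometimes leaving slack at the end of a chunk.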

Diagnosis — why one batch reaches ~140 MB

  • Each ClassificationResponse carries scores and logits, each an array with one entry per model class. At ~29k classes, those two arrays alone come to ~1.1 MB per detection.
  • The default batch is 24 images, and trap images routinely contain many moths each, so a batch of ~130 detections lands at roughly 120–140 MB — matching the observed rejected upload sizes.
  • This is payload width, not duplication: the worker builds each upload from only the current batch's detections (worker.py), and the classifier/detector are reset per batch, so there is no cross-batch accumulation.
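
The per-detection figure can be sanity-checked with a rough back-of-the-envelope script. This is only a sketch: the values are random stand-ins, not real model output, and the field names and default json.dumps formatting are assumptions, so the exact byte count will differ from a real ClassificationResponse.

```python
import json
import random

NUM_CLASSES = 29_000        # approximate class count quoted above
DETECTIONS_PER_BATCH = 130  # approximate detection count quoted above

rng = random.Random(0)
# Stand-in values: real scores and logits will have different digit patterns.
detection = {
    "scores": [rng.random() for _ in range(NUM_CLASSES)],
    "logits": [rng.gauss(0.0, 5.0) for _ in range(NUM_CLASSES)],
}

# Full-precision floats cost roughly 20 bytes each once serialized as JSON text.
bytes_per_detection = len(json.dumps(detection).encode("utf-8"))
bytes_per_batch = bytes_per_detection * DETECTIONS_PER_BATCH

print(f"per detection: {bytes_per_detection / 1e6:.2f} MB")
print(f"per batch:     {bytes_per_batch / 1e6:.0f} MB")
```

The point is that the size comes from two arrays of about 29,000 full-precision floats per detection, multiplied by the detection count, with nothing else needed to reach three-digit megabytes.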

Notes and follow-ups (out of scope for this PR)

Directions to discuss, ordered by leverage. Important context: result payloads are expected to grow, not shrink. logits are needed and will be kept, and per-crop vector embeddings are planned for each detection. That makes compressing the upload more valuable over time, and argues against tightening the proxy's body-size limit.

  1. Gzipping request bodies is the highest-leverage option, and it will matter more as payloads grow. The worker posts raw JSON, and the API server compresses responses but does not decompress request bodies (it has no Content-Encoding: gzip handling). The big arrays (scores, logits, and soon per-crop embeddings) are numeric and compress roughly 5–10×, so gzipping uploads could shrink every request at the source. This requires the API server to accept and decompress gzipped request bodies first, and the reverse proxy to pass them through. With embeddings incoming, this is worth prioritizing as a cross-repo follow-up.
  2. Keep generous size headroom. Because logits stay and embeddings are coming, neither the per-upload cap nor the proxy body-size limit should be tightened. Chunking keeps individual uploads bounded regardless of how large a full batch's results become.
  3. labels is already omittable for large models. The per-classification labels array is list[str] | None and documented as "omitted if the model has too many labels" on both the worker and server schemas, so for a 29k-class model it is most likely already dropped — a minor saving already realized. The bulk is scores + logits (and soon embeddings).
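
As a rough illustration of the gzip direction in item 1, the client side could look something like the sketch below. The helper names gzip_json_body and gunzip_json_body are hypothetical, the server-side step does not exist today (as noted above), and the sketch shows only the mechanics; it makes no claim about the compression ratio on real payloads.

```python
import gzip
import json


def gzip_json_body(payload: dict) -> tuple[bytes, dict[str, str]]:
    """Return a gzip-compressed JSON body plus the headers that describe it."""
    raw = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",  # tells the receiver to decompress first
    }
    return gzip.compress(raw), headers


def gunzip_json_body(body: bytes) -> dict:
    """What a server-side decompression step would need to do (hypothetical)."""
    return json.loads(gzip.decompress(body).decode("utf-8"))


payload = {"results": [{"scores": [0.25, 0.75], "logits": [-1.0, 1.0]}]}
body, headers = gzip_json_body(payload)
# A client would then send the bytes as-is, for example:
#   session.post(url, data=body, headers=headers, timeout=60)
```

Chunking and compression are complementary: the byte cap would then apply to the compressed body, so the packer would need to measure compressed sizes rather than raw JSON sizes.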

Not yet verified

  • No live-worker run. Chunking is verified by unit and integration tests against the serialization path, not against a real GPU job. This should be confirmed on a real large-taxonomy job, checking that each upload lands under the deployment's proxy limit.
  • If a single image produces enough detections to exceed the cap on its own (~13+ dense detections at ~29k classes), that one result is still sent in a single upload — it cannot be split below one image without an API-contract change. The configurable cap plus a generous proxy limit covers this for now.

Test status

uv run python -m pytest trapdata/antenna/tests/test_result_chunking.py -q → 7 passed. TDD-verified: disabling chunking made the split test fail, and restoring it made the test pass again. black, isort, and flake8 are clean on the changed files.

Summary by CodeRabbit

  • New Features

    • Large result batches are now automatically split into smaller chunks for more reliable posting.
    • Added configurable size limit for result POST requests (default: 25 MiB).
  • Bug Fixes

    • Result posting now continues processing remaining chunks if an individual chunk submission fails, improving resilience.
  • Tests

    • Added comprehensive test coverage for result batch chunking functionality.

…y limits

Wide-taxonomy pipelines (e.g. the global moths model with ~29k classes) emit
roughly 2 MB per detection because each classification carries full-length
labels, scores, and logits arrays. A single processed batch of two dozen images
with several detections each therefore serialized to 110-140 MB, which reverse
proxies rejected with HTTP 413 even after the body limit was raised to 512 MB.

The results for one batch were already scoped to the current batch only (no
accumulation across batches), so the size came purely from payload width times
detection count. This change splits the results for a batch across multiple
POST requests, each kept at or below a configurable byte cap, so no single
request body exceeds the proxy limit.

- Add chunk_results_by_size() and make post_batch_results() serialize each
  result once, greedily pack them into byte-bounded chunks, and POST each chunk;
  it now returns True only if every chunk succeeds. A single result that exceeds
  the cap on its own is sent alone (and logged) rather than dropped.
- Add AMI_ANTENNA_RESULT_POST_MAX_BYTES setting (default 25 MB) and thread it
  through ResultPoster to post_batch_results.
- Add tests asserting each POST body stays under the cap, no results are
  dropped, and the unsplit baseline would have exceeded the cap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Walkthrough

This PR implements byte-bounded chunking of Antenna API result POST bodies to prevent HTTP 413 (payload too large) errors when result batches exceed reverse-proxy limits. Results are now split into sequential POSTs, each constrained by a configurable per-POST byte cap (default 25 MiB).

Changes

Antenna API Result Chunking

| Layer / File(s) | Summary |
|---|---|
| Configuration and byte-limit constants: trapdata/settings.py, trapdata/antenna/client.py | New antenna_result_post_max_bytes setting in Settings (default 25 MiB) with GUI/INI configuration metadata; DEFAULT_RESULT_POST_MAX_BYTES constant defined in client module. |
| Core result chunking implementation: trapdata/antenna/client.py | New _result_json_size() and chunk_results_by_size() helpers to greedily pack serialized results into byte-bounded chunks; post_batch_results() rewritten to serialize once, chunk, and POST each chunk sequentially while aggregating success/failure across chunks. |
| ResultPoster integration with chunking limits: trapdata/antenna/result_posting.py | ResultPoster.__init__ now accepts max_post_bytes parameter (defaulting to DEFAULT_RESULT_POST_MAX_BYTES); _post_with_timing() explicitly passes the limit to post_batch_results(). |
| Worker configuration and ResultPoster instantiation: trapdata/antenna/worker.py | Job worker now passes settings.antenna_result_post_max_bytes when constructing ResultPoster, threading the byte limit from configuration through to the posting logic. |
| Comprehensive test suite for chunking: trapdata/antenna/tests/test_result_chunking.py | New test module with helpers to generate large payloads and utilities to measure serialized request sizes; covers chunk_results_by_size packing (empty input, per-chunk caps, data preservation, oversize entries) and post_batch_results integration (multi-POST splitting, baseline validation, empty-result no-op). |

Sequence Diagram

sequenceDiagram
  participant Worker
  participant ResultPoster
  participant post_batch_results
  participant chunk_results_by_size
  participant HTTP
  Worker->>ResultPoster: __init__(max_post_bytes=settings.antenna_result_post_max_bytes)
  Note over ResultPoster: stores self.max_post_bytes
  Worker->>ResultPoster: post_batch(results)
  ResultPoster->>post_batch_results: post_batch_results(..., max_bytes=self.max_post_bytes)
  post_batch_results->>post_batch_results: serialize to JSON dicts
  post_batch_results->>chunk_results_by_size: JSON dicts, max_bytes
  chunk_results_by_size-->>post_batch_results: list of chunks
  loop for each chunk
    post_batch_results->>HTTP: POST {results: chunk}
    HTTP-->>post_batch_results: 200/validation error
    post_batch_results->>post_batch_results: log per-chunk status
  end
  post_batch_results-->>ResultPoster: all_ok flag
  ResultPoster-->>Worker: success status

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main objective of the PR: splitting oversized result uploads to prevent rejection for wide-taxonomy batches. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
trapdata/antenna/result_posting.py (1)

68-76: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Update class docstring to document the new max_post_bytes parameter.

The __init__ method now accepts a max_post_bytes parameter, but the class docstring (lines 52-66) doesn't document it in the Args section. This makes the parameter undiscoverable for users reading the API documentation.

📝 Proposed documentation update

Add to the docstring's Args section (after line 59):

     Args:
         max_pending: Maximum number of concurrent posts before blocking (default: 5)
+        max_post_bytes: Maximum size in bytes of a single POST body (default: 25 MiB)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@trapdata/antenna/result_posting.py` around lines 68 - 76, Update the class
docstring (the Args section in the class above __init__) to document the new
__init__ parameter max_post_bytes: explain it's the per-POST body size cap in
bytes, include the default value DEFAULT_RESULT_POST_MAX_BYTES, and place this
entry alongside max_pending and future_timeout so users can discover the
parameter when reading the API docs; reference the parameter name max_post_bytes
and the constructor method __init__ when adding the brief description.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c2b7005-4ad0-4b52-aee5-febb971e4e3e

📥 Commits

Reviewing files that changed from the base of the PR and between a33746a and fce66be.

📒 Files selected for processing (5)
  • trapdata/antenna/client.py
  • trapdata/antenna/result_posting.py
  • trapdata/antenna/tests/test_result_chunking.py
  • trapdata/antenna/worker.py
  • trapdata/settings.py

Comment on lines +190 to +207
        for chunk_idx, chunk in enumerate(chunks):
            payload = {"results": chunk}
            try:
                response = session.post(url, json=payload, timeout=60)
                response.raise_for_status()
                result = AntennaResultPostResponse.model_validate(response.json())
                logger.debug(
                    f"Posted chunk {chunk_idx + 1}/{len(chunks)} "
                    f"({len(chunk)} results) to job {job_id}: "
                    f"{result.results_queued} queued"
                )
            except requests.RequestException as e:
                logger.error(
                    f"Failed to post result chunk {chunk_idx + 1}/{len(chunks)} "
                    f"to {url}: {e}"
                )
                all_ok = False
    return all_ok


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Catch broader exception types to handle response validation errors.

The current except clause only catches requests.RequestException (line 201), which won't catch Pydantic validation errors from AntennaResultPostResponse.model_validate() (line 195) or JSON decode errors from response.json() (line 195). If the server returns malformed JSON or a valid JSON response that doesn't match the expected schema, these exceptions would propagate uncaught, preventing remaining chunks from being posted and causing the entire batch to fail.

🛡️ Proposed fix to catch all exceptions
-            except requests.RequestException as e:
+            except Exception as e:
                 logger.error(
                     f"Failed to post result chunk {chunk_idx + 1}/{len(chunks)} "
                     f"to {url}: {e}"
                 )
                 all_ok = False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        for chunk_idx, chunk in enumerate(chunks):
-            payload = {"results": chunk}
-            try:
-                response = session.post(url, json=payload, timeout=60)
-                response.raise_for_status()
-                result = AntennaResultPostResponse.model_validate(response.json())
-                logger.debug(
-                    f"Posted chunk {chunk_idx + 1}/{len(chunks)} "
-                    f"({len(chunk)} results) to job {job_id}: "
-                    f"{result.results_queued} queued"
-                )
-            except requests.RequestException as e:
-                logger.error(
-                    f"Failed to post result chunk {chunk_idx + 1}/{len(chunks)} "
-                    f"to {url}: {e}"
-                )
-                all_ok = False
-    return all_ok
+        for chunk_idx, chunk in enumerate(chunks):
+            payload = {"results": chunk}
+            try:
+                response = session.post(url, json=payload, timeout=60)
+                response.raise_for_status()
+                result = AntennaResultPostResponse.model_validate(response.json())
+                logger.debug(
+                    f"Posted chunk {chunk_idx + 1}/{len(chunks)} "
+                    f"({len(chunk)} results) to job {job_id}: "
+                    f"{result.results_queued} queued"
+                )
+            except Exception as e:
+                logger.error(
+                    f"Failed to post result chunk {chunk_idx + 1}/{len(chunks)} "
+                    f"to {url}: {e}"
+                )
+                all_ok = False
+    return all_ok
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@trapdata/antenna/client.py` around lines 190 - 207, The loop that posts
chunks uses response.json() and AntennaResultPostResponse.model_validate(), but
the except currently only catches requests.RequestException so JSON decoding or
Pydantic validation errors will leak and abort processing; update the exception
handling around the session.post/response parsing/model_validate block (the code
that calls session.post, response.json(), and
AntennaResultPostResponse.model_validate) to also catch json.JSONDecodeError and
pydantic.ValidationError (or ValueError if pydantic isn't imported) in the same
except (or use a broad Exception as a last resort), log the error via
logger.error including chunk_idx/job_id/url, set all_ok = False, and continue to
the next chunk so remaining chunks are still posted.
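
A narrower alternative to a bare `except Exception`, as the prompt above allows, might look like the following stdlib-only sketch. The function post_all_chunks and the injected send callable are hypothetical stand-ins for the session.post and model_validate path, not the PR's code. The choice of base classes rests on two assumptions worth checking against the installed versions: that requests.RequestException derives from OSError, and that pydantic v2's ValidationError derives from ValueError.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def post_all_chunks(chunks: list[list[dict]], send: Callable[[dict], str]) -> bool:
    """Send every chunk; return True only if each one was sent and parsed."""
    all_ok = True
    for idx, chunk in enumerate(chunks, start=1):
        try:
            # send() stands in for the HTTP call and returns the response text.
            reply = json.loads(send({"results": chunk}))
            queued = int(reply["results_queued"])
            logger.debug("chunk %d/%d: %d queued", idx, len(chunks), queued)
        except (OSError, ValueError, KeyError, TypeError) as exc:
            # OSError: transport failures. ValueError: malformed JSON.
            # KeyError/TypeError: well-formed JSON with an unexpected shape.
            logger.error("chunk %d/%d failed: %s", idx, len(chunks), exc)
            all_ok = False  # remember the failure, keep posting the rest
    return all_ok
```

Listing the expected failure types keeps genuine programming errors (for example an AttributeError from a typo) visible instead of logging them as upload failures.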

@mihow mihow added the Pipeline API Updates to the requests & responses to/from processing service workers for ML pipelines label Jun 24, 2026