
fix(performance): stream remote pdf downloads to reduce memory usage #551

Open
Namraa310806 wants to merge 4 commits into FireFistisDead:master from Namraa310806:fix/process-from-url-streaming

@Namraa310806

Contributor

Summary

This PR improves the /process-from-url document ingestion pipeline by replacing memory-intensive PDF buffering with a streaming-based download and processing workflow.

Previously, remote PDF files were downloaded into memory using an ArrayBuffer and converted into a Buffer before processing. Under concurrent workloads, multiple large PDF uploads could significantly increase memory consumption, resulting in excessive garbage collection, degraded performance, and potential service instability.

This update introduces a streaming pipeline that minimizes memory usage, improves scalability, and strengthens resilience against resource exhaustion attacks.


Changes Made

Streaming-Based PDF Processing

  • Replaced in-memory PDF buffering with a streaming download pipeline.
  • Eliminated the need to load the entire PDF into memory before processing.
  • Reduced peak memory consumption during document ingestion.
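
A minimal sketch of what such a streaming download could look like, assuming Node 18 or later (global `fetch`, `Readable.fromWeb`). The helper names `streamToTempFile` and `downloadPdf`, the `pdf-` temp directory prefix, and the `remote.pdf` file name are hypothetical and are not taken from this PR's diff:

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { Readable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: copy a readable stream to a temp file chunk by chunk,
// so the whole PDF is never held in memory at once.
async function streamToTempFile(source, fileName) {
  const dir = await fs.promises.mkdtemp(path.join(os.tmpdir(), 'pdf-'));
  const destPath = path.join(dir, fileName);
  // pipeline() handles backpressure and destroys both streams on failure.
  await pipeline(source, fs.createWriteStream(destPath));
  return destPath;
}

// Hypothetical wrapper: adapt the fetch() response body (a web stream)
// to a Node.js Readable and stream it to disk.
async function downloadPdf(url) {
  const response = await fetch(url);
  if (!response.ok || !response.body) {
    throw new Error(`Download failed with status ${response.status}`);
  }
  return streamToTempFile(Readable.fromWeb(response.body), 'remote.pdf');
}
```

Writing to a temp file is one possible sink; if the downstream parser accepts a stream directly, the file step could be skipped.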

Large File Handling Improvements

  • Added streaming-aware file size validation.
  • Prevented oversized documents from being fully downloaded before rejection.
  • Improved handling of large PDF uploads.
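
One way the streaming-aware size check could work is a `Transform` that counts bytes and fails as soon as the limit is crossed, paired with an early `Content-Length` check. This is a sketch only: `createSizeLimiter`, `exceedsDeclaredSize`, the `FILE_TOO_LARGE` code, and the 20 MiB limit are assumptions, not values from this PR:

```javascript
const { Transform } = require('node:stream');

// Assumed limit for illustration; the real limit in server.js may differ.
const MAX_PDF_BYTES = 20 * 1024 * 1024;

// Hypothetical helper: pass chunks through unchanged until the running
// byte count exceeds maxBytes, then fail the stream.
function createSizeLimiter(maxBytes = MAX_PDF_BYTES) {
  let seen = 0;
  return new Transform({
    transform(chunk, _encoding, callback) {
      seen += chunk.length;
      if (seen > maxBytes) {
        const err = new Error(`File exceeds ${maxBytes} bytes`);
        err.code = 'FILE_TOO_LARGE';
        callback(err); // errors the stream; pipeline() then tears down the rest
        return;
      }
      callback(null, chunk);
    },
  });
}

// Hypothetical pre-check: reject early when the server declares a size
// above the limit. A missing or invalid header falls through to the
// streaming check, since Content-Length cannot be trusted on its own.
function exceedsDeclaredSize(headers, maxBytes = MAX_PDF_BYTES) {
  const declared = Number(headers.get('content-length'));
  return Number.isFinite(declared) && declared > maxBytes;
}
```

Placed between the download stream and the sink in a `pipeline()` call, the limiter stops the transfer after at most one chunk past the limit instead of after the full download.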

Resource Management

  • Reduced memory pressure caused by concurrent downloads.
  • Improved garbage collection behavior under load.
  • Prevented memory growth proportional to the combined size of active uploads.

Error Handling Enhancements

  • Added safer handling for interrupted downloads.
  • Improved cleanup behavior for failed processing operations.
  • Prevented resource leaks during stream failures.
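
The cleanup behavior described above could be sketched roughly as follows, assuming the download is written to a file on disk. `writeOrCleanup` is a hypothetical name and the actual implementation in this PR may differ:

```javascript
const fs = require('node:fs');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: stream `source` to `destPath`; if the stream fails
// partway (interrupted download, size limit hit), delete the partial file.
async function writeOrCleanup(source, destPath) {
  try {
    await pipeline(source, fs.createWriteStream(destPath));
  } catch (err) {
    // pipeline() has already destroyed both streams at this point,
    // so only the partially written file is left to remove.
    await fs.promises.rm(destPath, { force: true });
    throw err; // let the route handler turn this into an error response
  }
}
```

Rethrowing keeps the failure visible to the caller while guaranteeing that no partial PDF is left behind.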

Test Coverage

Added tests covering:

  • Large PDF processing
  • Oversized file rejection
  • Streaming download behavior
  • Concurrent upload scenarios
  • Resource cleanup after failures

Performance Impact

Before

Remote PDF → ArrayBuffer → Buffer → Processing
  • Entire PDF stored in memory.
  • Memory usage scaled with file size.
  • Concurrent uploads increased RAM consumption significantly.

After

Remote PDF → Stream Pipeline → Processing
  • Memory footprint during downloads bounded by chunk size rather than file size.
  • Reduced garbage collection pressure.
  • Improved scalability under concurrent workloads.
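
To make the before/after contrast concrete, here is a small illustrative sketch, assuming Node 18 or later where `Response` bodies are async-iterable web streams. `readWholeBody` and `consumeInChunks` are hypothetical names, not functions from server.js:

```javascript
// Before (pattern described in this PR): the full body is materialised
// as an ArrayBuffer, so memory use equals the file size per request.
async function readWholeBody(response) {
  const arrayBuffer = await response.arrayBuffer();
  return Buffer.from(arrayBuffer); // a view over the same, file-sized memory
}

// After (sketch): chunks are handed to a consumer one at a time, so only
// the current chunk needs to be resident.
async function consumeInChunks(response, onChunk) {
  let total = 0;
  for await (const chunk of response.body) {
    total += chunk.byteLength;
    await onChunk(chunk); // e.g. write to disk or feed a parser
  }
  return total;
}
```

With N concurrent requests, the first pattern holds roughly the sum of all file sizes in memory, while the second holds roughly N chunks.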

Security Impact

This change reduces the risk of:

  • Memory exhaustion attacks
  • Resource exhaustion through large file uploads
  • Performance degradation under concurrent workloads
  • Denial-of-service scenarios caused by excessive memory allocation

Files Modified

  • server.js
  • Upload processing utilities
  • Upload test suite

Verification Checklist

  • Removed in-memory PDF buffering from active processing path
  • Implemented streaming-based download workflow
  • Added large file protection
  • Added resource cleanup safeguards
  • Added regression tests
  • Existing document processing functionality preserved
  • No breaking API changes introduced

Related Issue

Fixes: #502

@vercel

vercel Bot commented Jun 15, 2026


@Namraa310806 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Contributor

Warning

Review limit reached

@Namraa310806, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 48 minutes and 14 seconds. Learn how PR review limits work.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ac6eae60-d477-4ab9-8af5-38ee77493015

📥 Commits

Reviewing files that changed from the base of the PR and between 5590b87 and 0290dd3.

📒 Files selected for processing (3)
  • server.js
  • server.test.js
  • src/data/users.json

@github-actions github-actions Bot added the backend, bug, enhancement, feature, fix, frontend, rag-service, type:security, and type:testing labels on Jun 15, 2026

Labels

  • backend: Express or API gateway work
  • bug: Something isn't working
  • enhancement: New feature or request
  • feature: A new feature or improvement
  • fix: A targeted fix or cleanup
  • frontend: Frontend-related work
  • rag-service: FastAPI / model service work
  • type:security
  • type:testing


Development

Successfully merging this pull request may close these issues.

[Bug]: /process-from-url Loads Entire Remote PDF Into Memory Causing Potential Memory Exhaustion Under Concurrent Requests

1 participant