wip feat: improve middleware performance by 50%#1209

Open
sansyrox wants to merge 1 commit into main from fix/middleware-performance

@sansyrox sansyrox commented Jul 11, 2025

Member

Description

This PR fixes #

Summary

This PR does....

PR Checklist

Please ensure that:

  • The PR contains a descriptive title
  • The PR contains a descriptive summary of the changes
  • You build and test your changes before submitting a PR.
  • You have added relevant documentation
  • You have added relevant tests. We prefer integration tests wherever possible

Pre-Commit Instructions:

Summary by CodeRabbit

  • Refactor
    • Optimized middleware chain execution to enhance request and response processing performance.

vercel Bot commented Jul 11, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| robyn | Ready | Preview, Comment | Mar 28, 2026 7:33pm |

recurseml Bot commented Jul 11, 2025

✨ No issues found! Your code is sparkling clean! ✨

Need help? Join our Discord for support!
https://discord.gg/qEjHQk64Z9

codspeed-hq Bot commented Jul 11, 2025

Merging this PR will not alter performance

✅ 189 untouched benchmarks


Comparing fix/middleware-performance (cdd6889) with main (3e04c65)


@sansyrox sansyrox force-pushed the fix/middleware-performance branch from 72ede91 to cdd6889 Compare March 28, 2026 19:31
coderabbitai Bot commented Mar 28, 2026

Contributor

Walkthrough

This PR introduces middleware chain execution functions in src/executors/mod.rs that batch-process before and after middleware sequences with optimizations for synchronous-only chains, and refactors src/server.rs to use single-call chain executors instead of per-middleware loops.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Middleware Chain Executors: src/executors/mod.rs | Added execute_before_middleware_chain() and execute_after_middleware_chain() async functions. Both optimize by detecting all-sync middleware to execute in a single Python::with_gil block; async chains fall back to sequential per-middleware execution. Before chain returns early on Response; after chain constrains extraction to Response and errors on request returns. |
| Server Middleware Integration: src/server.rs | Updated imports to use new chain executors. Refactored before/after middleware control flows: replaced per-middleware iteration loops with single-call chain execution. Before middleware chain immediately returns on early response; after middleware applies response updates and errors internally. Removed early-response pre-check before const-router cache selection. |

Sequence Diagram

sequenceDiagram
    actor Client
    participant Server as Server
    participant BeforeChain as Before Chain
    participant Middleware as Middleware(s)
    participant AfterChain as After Chain
    participant HTTP as HTTP Handler

    Client->>Server: Request
    Server->>BeforeChain: execute_before_middleware_chain()
    
    rect rgba(100, 150, 200, 0.5)
        Note over BeforeChain,Middleware: Detect all_sync?
        alt All Synchronous
            BeforeChain->>BeforeChain: Single GIL block
            BeforeChain->>Middleware: Execute all
        else Has Async
            loop Each Middleware
                BeforeChain->>Middleware: await execute_middleware_function()
            end
        end
    end
    
    alt Early Response from Chain
        Middleware-->>BeforeChain: Response
        BeforeChain-->>Server: MiddlewareReturn::Response
        Server-->>Client: Response
    else Request Continues
        Middleware-->>BeforeChain: Request
        BeforeChain-->>Server: MiddlewareReturn::Request
        Server->>HTTP: execute_http_function()
        HTTP-->>Server: Response
        Server->>AfterChain: execute_after_middleware_chain()
        
        rect rgba(150, 150, 100, 0.5)
            AfterChain->>AfterChain: Detect all_sync?
            alt All Synchronous
                AfterChain->>Middleware: Execute all in GIL
            else Has Async
                loop Each Middleware
                    AfterChain->>Middleware: await execution
                end
            end
        end
        
        Middleware-->>AfterChain: Response
        AfterChain-->>Server: Updated Response
        Server-->>Client: Response
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Middleware chains, now bundled with care,
GIL optimized through sync-detection's flair,
Before and after in one swift embrace,
Early responses exit with perfect grace!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is entirely a placeholder template with no actual content filled in; issue reference and summary remain unfilled. | Complete the description template by filling in the issue number, providing a detailed summary of the middleware chain refactoring changes, and confirming the PR checklist items. |
| Title check | ❓ Inconclusive | The title mentions a performance improvement but uses vague terminology ('wip' and 'improve') and lacks specificity about what changed. | Replace with a more specific title like 'Refactor middleware execution to use batched chain processing' that clearly describes the actual implementation change. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/executors/mod.rs`:
- Around line 263-298: The after-middleware chain currently only forwards a
Response, breaking the (request, response) contract used by
execute_after_middleware_function; update execute_after_middleware_chain to
accept and pass both &Request and &Response to middleware invocations (use
get_function_output and execute_middleware_function with both references so
after_request(req, resp) callbacks can access request-scoped data), ensure you
handle the returned MiddlewareReturn the same way (only accept
MiddlewareReturn::Response) and update the caller in server.rs to pass &request
into this helper accordingly.

In `@src/server.rs`:
- Around line 517-520: The current early return in index() when
execute_before_middleware_chain returns MiddlewareReturn::Response(r) skips
global response header application and the after-middleware chain; instead,
change the handling so index() continues down the common response path: if
execute_before_middleware_chain yields MiddlewareReturn::Response, store that
Response as the "short-circuit" response while also preserving the most-recent
mutated Request (from MiddlewareReturn::Request or a sidecar), then proceed
through the same global header application and execute_after_middleware_chain on
the response before returning; update the match on
execute_before_middleware_chain (and any variables around before_middlewares) to
propagate both the latest Request and the Response into the standard response
flow rather than returning ResponseType::Standard immediately.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a573752f-50fa-4ab3-a2d6-3721ec7f5e1f

📥 Commits

Reviewing files that changed from the base of the PR and between 3e04c65 and cdd6889.

📒 Files selected for processing (2)
  • src/executors/mod.rs
  • src/server.rs

Comment thread src/executors/mod.rs
Comment on lines +263 to +298
pub async fn execute_after_middleware_chain(
    input: &Response,
    middlewares: &[FunctionInfo],
) -> Result<MiddlewareReturn> {
    let mut current_response = input.clone();

    // Check if all middlewares are sync to optimize GIL usage
    let all_sync = middlewares.iter().all(|m| !m.is_async);

    if all_sync {
        // Execute all sync middlewares in a single GIL acquisition
        Python::with_gil(|py| -> Result<MiddlewareReturn> {
            for middleware in middlewares {
                let output = get_function_output(middleware, py, &current_response)?;

                // After middleware should return Response
                match output.extract::<Response>() {
                    Ok(response) => current_response = response,
                    Err(e) => return Err(e.into()),
                }
            }

            Ok(MiddlewareReturn::Response(current_response))
        })
    } else {
        // Fall back to individual execution for mixed sync/async middlewares
        for middleware in middlewares {
            current_response = match execute_middleware_function(&current_response, middleware).await? {
                MiddlewareReturn::Response(r) => r,
                MiddlewareReturn::Request(_) => {
                    return Err(anyhow::anyhow!("After middleware returned a request"))
                }
            };
        }

        Ok(MiddlewareReturn::Response(current_response))

@coderabbitai coderabbitai Bot Mar 28, 2026

Contributor

⚠️ Potential issue | 🔴 Critical

Preserve the (request, response) contract for after middleware.

This chain runner only passes a Response: the sync path calls get_function_output(...), and the async path calls execute_middleware_function(...). That regresses the behavior implemented by execute_after_middleware_function at Lines 181-217, which passes both request and response. Any after_request(req, resp) middleware will now fail at runtime, and those callbacks can no longer inspect request-scoped data such as path params.

🔧 Suggested fix
-pub async fn execute_after_middleware_chain(
-    input: &Response,
+pub async fn execute_after_middleware_chain(
+    request: &Request,
+    input: &Response,
     middlewares: &[FunctionInfo],
 ) -> Result<MiddlewareReturn> {
     let mut current_response = input.clone();
@@
     if all_sync {
         Python::with_gil(|py| -> Result<MiddlewareReturn> {
             for middleware in middlewares {
-                let output = get_function_output(middleware, py, &current_response)?;
+                let output = get_function_output_with_two_args(
+                    middleware,
+                    py,
+                    request,
+                    &current_response,
+                )?;
@@
     } else {
         for middleware in middlewares {
-            current_response = match execute_middleware_function(&current_response, middleware).await? {
+            current_response = match execute_after_middleware_function(
+                request,
+                &current_response,
+                middleware,
+            )
+            .await? {
                 MiddlewareReturn::Response(r) => r,
                 MiddlewareReturn::Request(_) => {
                     return Err(anyhow::anyhow!("After middleware returned a request"))
                 }
             };
         }

src/server.rs will need to pass &request into this helper as well.


Member Author

Resolve — noted, WIP PR. Will fix before merge.


Comment thread src/server.rs
Comment on lines +517 to +520
request = match execute_before_middleware_chain(&request, &before_middlewares).await {
    Ok(MiddlewareReturn::Request(r)) => r,
    Ok(MiddlewareReturn::Response(r)) => {
        return ResponseType::Standard(r);

@coderabbitai coderabbitai Bot Mar 28, 2026

Contributor

⚠️ Potential issue | 🟠 Major

Avoid returning short-circuited before-middleware responses here.

This exits index() before the common response path at Lines 567-609, so responses produced by before middleware no longer receive global response headers or the after-middleware chain. That will show up on auth/preflight-style short circuits as missing CORS/security headers. If you keep these responses on the normal path, the chain result also needs to preserve the latest mutated Request.


Member Author

Resolve — noted, will address before merge.


@thePromger

Contributor

@coderabbitai

Please check whether the performance fix is really complete. Also run the benchmark tests to measure the improvement, and explain where the speedup comes from: what exactly caused it?

coderabbitai Bot commented May 29, 2026

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== New chain executor functions ==="
cat src/executors/mod.rs

echo ""
echo "=== Server middleware usage ==="
cat src/server.rs

echo ""
echo "=== Look for benchmark files ==="
fd -e rs bench
fd -e toml bench
rg -l "criterion|bench|hyperfine|wrk|vegeta" --type toml
rg -l "criterion|bench|hyperfine" --type rust

echo ""
echo "=== Cargo.toml for bench deps ==="
cat Cargo.toml

echo ""
echo "=== Existing test files ==="
fd -e rs test
rg --type rust -l "#\[bench\]"

echo ""
echo "=== Any benchmark scripts at repo root ==="
fd bench
fd benchmark
ls -la

echo ""
echo "=== Check for profiling or perf scripts ==="
fd -e sh
fd -e py bench

Length of output: 42633


🏁 Script executed:

#!/bin/bash
echo "=== benchmark.sh ==="
cat benchmark.sh

echo ""
echo "=== unit_tests contents ==="
ls -la unit_tests/
find unit_tests -type f | head -30
cat unit_tests/*.rs 2>/dev/null || true
find unit_tests -name "*.py" -exec cat {} \;

echo ""
echo "=== integration_tests structure ==="
find integration_tests -type f | head -40

echo ""
echo "=== middleware-related integration tests ==="
rg -l "middleware" integration_tests/

echo ""
echo "=== Any before/after middleware test files ==="
find integration_tests -name "*middleware*" | xargs cat 2>/dev/null | head -200

Length of output: 13247



@thePromger Here's a thorough analysis of the PR — covering what the optimization does, whether it's complete, and whether the 50% claim holds up.


What causes the speedup

The core idea is in execute_before_middleware_chain / execute_after_middleware_chain:

let all_sync = middlewares.iter().all(|m| !m.is_async);

if all_sync {
    // ✅ ONE GIL acquisition for the entire chain
    Python::with_gil(|py| {
        for middleware in middlewares { ... }
    })
} else {
    // ❌ N GIL acquisitions — one per middleware (same as before)
    for middleware in middlewares {
        execute_middleware_function(...).await?;
    }
}

Before: Even synchronous middlewares went through execute_middleware_function() one at a time — N separate Python::with_gil() calls → N GIL mutex lock/unlock cycles.
After: If every middleware in the chain is synchronous, the whole chain runs under a single GIL acquisition → O(1) lock overhead instead of O(N).

This is the real source of the claimed speedup.
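The difference in lock cycles can be illustrated without PyO3. The sketch below is a toy model, not Robyn's code: `FakeGil`, `run_per_middleware`, `run_batched`, and the two example middlewares are hypothetical names, and a plain `std::sync::Mutex` stands in for the interpreter lock so the number of acquisitions can be counted.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Stand-in for the interpreter lock. This is NOT PyO3; `FakeGil` is a
/// hypothetical type used only to count lock cycles.
pub struct FakeGil {
    lock: Mutex<()>,
    acquisitions: AtomicUsize,
}

impl FakeGil {
    pub fn new() -> Self {
        FakeGil { lock: Mutex::new(()), acquisitions: AtomicUsize::new(0) }
    }

    /// Mirrors the shape of `Python::with_gil`: lock, run the closure, unlock.
    pub fn with_gil<R>(&self, f: impl FnOnce() -> R) -> R {
        let _guard = self.lock.lock().unwrap();
        self.acquisitions.fetch_add(1, Ordering::Relaxed);
        f()
    }

    pub fn acquisitions(&self) -> usize {
        self.acquisitions.load(Ordering::Relaxed)
    }
}

/// A middleware is modelled as a plain function over an integer "request".
pub type Middleware = fn(i64) -> i64;

/// Example middlewares (illustrative only).
pub fn add_one(x: i64) -> i64 { x + 1 }
pub fn double(x: i64) -> i64 { x * 2 }

/// Per-middleware shape: one lock cycle for every middleware in the chain.
pub fn run_per_middleware(gil: &FakeGil, chain: &[Middleware], input: i64) -> i64 {
    let mut value = input;
    for mw in chain {
        value = gil.with_gil(|| mw(value));
    }
    value
}

/// Batched shape for all-sync chains: one lock cycle for the whole chain.
pub fn run_batched(gil: &FakeGil, chain: &[Middleware], input: i64) -> i64 {
    gil.with_gil(|| chain.iter().fold(input, |value, mw| mw(value)))
}
```

Both functions produce the same result for the same chain; only `acquisitions()` differs (N versus 1). Whether that difference yields a measurable speedup in Robyn depends on contention and on how much time the middlewares themselves take, which this model does not capture.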


Is the improvement actually complete? Several gaps:

1. ❌ No benchmark numbers exist in the PR

The PR description is a placeholder ("This PR does...."). The "50%" claim is completely unsubstantiated. No before/after numbers were provided.

2. ❌ benchmark.sh tests the wrong endpoint

oha -n "$number" http://localhost:8080/sync   # ← a plain route, NO middleware

This hits a route with no middleware at all, so it measures zero benefit from this change.

3. ❌ Mixed async/sync chains get no benefit

If even one middleware in the chain is async, the code falls back to per-middleware execution with individual GIL acquisitions — same cost as before. The optimization is narrowly scoped to all-sync chains only.

4. ❌ Inconsistent error handling between sync and async paths in execute_after_middleware_chain

| Path | Middleware returns Request |
| --- | --- |
| Sync | Raw PyO3 extraction error (cryptic) |
| Async | Err(anyhow!("After middleware returned a request")) (clear) |

The sync path only tries extract::<Response>() and returns a PyO3 error on failure — it never produces the descriptive message.
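One way to make the two paths agree is to route both through a single check. The following is a minimal sketch under stated assumptions: `Request`, `Response`, and `MiddlewareReturn` are simplified stand-ins for Robyn's types, `expect_response` is a hypothetical helper, and a `String` error replaces `anyhow::Error` to keep the example dependency-free.

```rust
/// Simplified stand-ins for Robyn's types; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub path: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Response {
    pub status: u16,
}

#[derive(Debug, Clone, PartialEq)]
pub enum MiddlewareReturn {
    Request(Request),
    Response(Response),
}

/// Hypothetical shared check. If both the sync and the async branch call it,
/// a misbehaving after-middleware produces the same message on either path.
pub fn expect_response(output: MiddlewareReturn) -> Result<Response, String> {
    match output {
        MiddlewareReturn::Response(r) => Ok(r),
        MiddlewareReturn::Request(_) => {
            Err("After middleware returned a request".to_string())
        }
    }
}
```

In the sync branch this would mean classifying the Python return value as a `MiddlewareReturn` first, then calling the helper, instead of attempting only the `Response` extraction.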

5. ❌ No unit tests for the new chain functions

unit_tests/ and integration_tests/ contain no tests that specifically exercise execute_before_middleware_chain or execute_after_middleware_chain, particularly the all_sync batching path. The existing test_middlewares.py is a functional correctness test, not a performance or path-coverage test.

6. ❌ execute_after_middleware_chain async fallback passes a Response to execute_middleware_function

execute_middleware_function<T> tries both extract::<Response> and extract::<Request>. When T = Response, trying to extract a Request from a Response object is a no-op failure that gets silently swallowed — this is latent behavior inconsistency vs the sync path.


Summary

| Question | Answer |
| --- | --- |
| What causes the speedup? | Reducing N Python::with_gil lock acquire/release cycles to 1 for all-sync middleware chains |
| Is the 50% claim proven? | ❌ No: no benchmark numbers, and benchmark.sh doesn't even test middleware routes |
| Does it help async middleware users? | ❌ No: async chains still execute one-by-one |
| Is the implementation complete? | ⚠️ Functionally mostly yes, but with edge-case inconsistencies and no new tests |
| Is the PR description complete? | ❌ No: it's a placeholder with no linked issue, no description, no data |

The PR author (@sansyrox) should be asked to:

  1. Add actual benchmark numbers (run benchmark.sh against a middleware-heavy route, before and after)
  2. Add a benchmark route to benchmark.sh or a dedicated middleware benchmark script
  3. Fix the inconsistent Request-returned-from-after-middleware error handling in the sync path
  4. Add unit/integration tests specifically for the chain batching path

@thePromger

Contributor

@coderabbitai

Can we also optimize the async path and mixed sync/async chains? If so, how? If not, why not? Please give a detailed explanation.

coderabbitai Bot commented May 29, 2026

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== Full executors/mod.rs ==="
cat src/executors/mod.rs

echo ""
echo "=== execute_middleware_function implementation ==="
rg -n "execute_middleware_function|execute_before_middleware|execute_after_middleware|with_gil|is_async|FunctionInfo|MiddlewareReturn" src/executors/mod.rs

echo ""
echo "=== FunctionInfo struct ==="
rg -n "struct FunctionInfo|is_async|pub fn call|fn call" src/ -A5

echo ""
echo "=== MiddlewareReturn definition ==="
rg -n "MiddlewareReturn|enum Middleware" src/ -A5

Length of output: 13850



@thePromger Here's a deep dive into each scenario.


1. Why pure async middleware chains cannot be further batched

For every async Python function, execute_middleware_function does this:

// Step 1: GIL acquire #1, convert the Python coroutine into a Rust future
let output: Py<PyAny> = Python::with_gil(|py| {
    pyo3_async_runtimes::tokio::into_future(get_function_output(function, py, input)?)
})?
// Step 2: the Rust task does not hold the GIL while it awaits;
// the coroutine itself runs on the Python event loop
.await?;

// Step 3: GIL acquire #2, extract the Rust value from the Python result
Python::with_gil(|py| { output.extract::<Response>(py) ... })

The .await? in Step 2 is the hard boundary. Python::with_gil(|py| { ... }) takes a synchronous closure, so .await cannot appear inside it. Holding the GIL while the Rust task waits would also starve the Python event loop, which needs the GIL to run the coroutine, and could deadlock. Each async middleware therefore costs at least 2 with_gil calls on the Rust side, and this executor design cannot reduce that.

N async middlewares → minimum 2N GIL acquire/release cycles (irreducible)

2. Why concurrent/parallel execution of async middlewares is also impossible

You might think: "run async1, async2, async3 with tokio::join! in parallel."
That doesn't work because middlewares form a sequential pipeline — each middleware receives the output of the previous one as its input. async2 cannot start until async1 has finished and transformed the request.

input → [async1] → modified_input → [async2] → modified_input → [async3] → output

No two middlewares in a chain can be parallelized since they are data-dependent on each other.
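The data dependency can be shown with a small fold. This is a sketch with hypothetical names (`Request`, `Step`, `add_trace_id`, `copy_trace_to_log`, `run_chain`), not Robyn's types: the second step reads a header that exists only if the first step has already run, so the steps cannot be started at the same time on the same input.

```rust
/// Illustrative request type; not Robyn's actual `Request`.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub headers: Vec<(String, String)>,
}

/// A middleware step consumes a request and returns the transformed request.
pub type Step = fn(Request) -> Request;

/// First step: attaches a trace id header.
pub fn add_trace_id(mut req: Request) -> Request {
    req.headers.push(("x-trace-id".to_string(), "abc".to_string()));
    req
}

/// Second step: reads the header written by `add_trace_id`.
/// If that step has not run yet, the value falls back to "missing".
pub fn copy_trace_to_log(mut req: Request) -> Request {
    let trace = req
        .headers
        .iter()
        .find(|(k, _)| k.as_str() == "x-trace-id")
        .map(|(_, v)| v.clone())
        .unwrap_or_else(|| "missing".to_string());
    req.headers.push(("x-log-tag".to_string(), trace));
    req
}

/// Each step receives the output of the previous one, so execution is
/// inherently sequential.
pub fn run_chain(input: Request, chain: &[Step]) -> Request {
    chain.iter().fold(input, |req, step| step(req))
}
```

Running the chain in the order `[add_trace_id, copy_trace_to_log]` tags the log with the trace id, while the reversed order yields the fallback value, which is why the order (and therefore sequential execution) is part of the chain's semantics.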


3. Mixed (sync + async) chains can be partially optimized

This is where there is real untapped improvement. The idea is run-length batching: group consecutive sync middlewares into one with_gil block, and only break out of it when you hit an async one.

Diagram

Chain: [sync1, sync2, sync3, async4, sync5, sync6, async7, sync8]

Current code (no optimization for mixed):  10 GIL acquisitions
                                           (1 per sync + 2 per async middleware in fallback path)

Proposed run-length batching:
  ┌─── 1 GIL ───┐          ┌─ 1 GIL ─┐          ┌─ 1 GIL ─┐
  sync1 sync2 sync3  async4  sync5 sync6  async7   sync8
                    └──2 GIL──┘        └──2 GIL──┘
  = 1 + 2 + 1 + 2 + 1 = 7 GIL acquisitions  (still better for larger batches)

For a more extreme case: [sync×100, async1, sync×100] goes from 200 + 2 = 202 GIL ops to 1 + 2 + 1 = 4 GIL ops.

What the implementation would look like

pub async fn execute_before_middleware_chain(
    input: &Request,
    middlewares: &[FunctionInfo],
) -> Result<MiddlewareReturn> {
    let mut current_request = input.clone();
    let mut i = 0;

    while i < middlewares.len() {
        // Find the end of the next consecutive sync run
        let sync_run_end = middlewares[i..]
            .iter()
            .position(|m| m.is_async)   // first async index relative to i
            .map(|pos| i + pos)
            .unwrap_or(middlewares.len()); // no async → rest are all sync

        // Batch-execute all consecutive sync middlewares in ONE GIL acquisition
        if i < sync_run_end {
            let sync_slice = &middlewares[i..sync_run_end];
            let result = Python::with_gil(|py| -> Result<MiddlewareReturn> {
                let mut req = current_request.clone();
                for mw in sync_slice {
                    let output = get_function_output(mw, py, &req)?;
                    match output.extract::<Response>() {
                        Ok(resp) => return Ok(MiddlewareReturn::Response(resp)),
                        Err(_)   => req = output.extract::<Request>()
                                         .map_err(anyhow::Error::from)?,
                    }
                }
                Ok(MiddlewareReturn::Request(req))
            })?;
            match result {
                MiddlewareReturn::Response(r) => return Ok(MiddlewareReturn::Response(r)),
                MiddlewareReturn::Request(r)  => current_request = r,
            }
            i = sync_run_end;
        }

        // Execute exactly ONE async middleware (cannot batch; GIL must be released for .await)
        if i < middlewares.len() {
            // middlewares[i] is guaranteed async here
            current_request = match execute_middleware_function(&current_request, &middlewares[i]).await? {
                MiddlewareReturn::Request(r)  => r,
                MiddlewareReturn::Response(r) => return Ok(MiddlewareReturn::Response(r)),
            };
            i += 1;
        }
    }

    Ok(MiddlewareReturn::Request(current_request))
}

4. Summary table

| Chain composition | Current PR (GIL ops) | Run-length batching (GIL ops) | Improvable? |
| --- | --- | --- | --- |
| All sync | 1 | 1 | Already optimal |
| All async (N middlewares) | 2N | 2N | ❌ Irreducible |
| Mixed (s sync middlewares in k runs, m async) | s + 2m | k + 2m | ✅ Yes, especially when sync runs are long |
| Single async middleware | 2 | 2 | ❌ Already minimal |
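The counts can be reproduced with a small model. The sketch below assumes the accounting used in this thread (one acquisition per sync call or per sync run, two per async call) and counts only Rust-side lock cycles; `gil_ops` is a hypothetical helper, and the unbatched figure describes the per-middleware fallback path rather than the all-sync fast path.

```rust
/// One entry per middleware: `true` means the middleware is async.
/// Returns `(unbatched, batched)` acquisition counts, where
/// - unbatched: 1 per sync middleware, 2 per async middleware
/// - batched:   1 per run of consecutive sync middlewares, 2 per async one
pub fn gil_ops(is_async: &[bool]) -> (usize, usize) {
    let mut unbatched = 0;
    let mut batched = 0;
    let mut in_sync_run = false;

    for &flag in is_async {
        if flag {
            // Async middleware: launch + extract, and it ends any sync run.
            unbatched += 2;
            batched += 2;
            in_sync_run = false;
        } else {
            unbatched += 1;
            // Only the first middleware of a sync run opens a new lock cycle.
            if !in_sync_run {
                batched += 1;
                in_sync_run = true;
            }
        }
    }

    (unbatched, batched)
}
```

For the eight-element chain in the diagram, `gil_ops(&[false, false, false, true, false, false, true, false])` gives a batched count of 7, and an all-async chain returns the same value for both counts.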

5. Bottom line

  • Pure async → You cannot do better than 2 with_gil calls per async middleware on the Rust side. The await boundary is a hard constraint of how PyO3 and Tokio work together.
  • Mixed chains → Run-length batching of consecutive sync segments is implementable and worth doing. It degrades gracefully (all-sync = current PR's behaviour; all-async = no change; mixed = significant improvement proportional to sync density).

The PR as written handles only the all-sync fast path. Adding run-length batching for mixed chains would extend the benefit to more real-world middleware configurations, although the "50%" headline still needs benchmark data to back it up.

thePromger commented May 29, 2026
Contributor

@coderabbitai

So does using one GIL acquisition for all sync tasks only apply to Python 3.14+?

Is there any possibility of doing the same for async tasks? If we write our own Rust layer to handle async tasks instead of relying on PyO3, would that help? If so, how, and in which areas?

coderabbitai Bot commented May 29, 2026

Contributor

@thePromger Great questions — let me address both precisely.


1. Is the single-GIL-for-sync optimization Python 3.14+ only?

No. It works on any Python version supported by the PyO3 release in use.

You may be thinking of these Python initiatives, which are related but different:

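| Python feature | Version | What it does |
| --- | --- | --- |
| Per-interpreter GIL (PEP 684) | 3.12 | Lets each subinterpreter have its own GIL; the main interpreter still has one |
| Free-threaded build (PEP 703, --disable-gil) | 3.13+ | Experimental opt-in build; multiple threads can run Python bytecode simultaneously |
| Free-threaded build, officially supported (PEP 779) | 3.14 | Continued refinement of the 3.13 free-threaded build; still optional, not the default |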

The optimization in this PR doesn't remove the GIL; it reduces how many times you acquire and release it. Each acquisition locks a mutex and switches the Python thread state, and it can block when another thread holds the GIL, so doing it N times per request adds up. Doing it once for N sync middlewares instead of N times is a saving on any Python version that has a GIL.


2. Can a custom Rust async layer bypass PyO3 and optimize async tasks?

Fundamentally no — and here is the exact reason.

The GIL is a CPython implementation constraint, not a PyO3 constraint. PyO3 is just a safe Rust binding; it did not invent the GIL. Any code path that runs Python bytecode inside CPython must hold the GIL — no matter what Rust library you use.

Your code → Custom Rust Layer → CPython C API → GIL (still required)
Your code → PyO3             → CPython C API → GIL (same cost)

What pyo3_async_runtimes::tokio::into_future already does is close to the minimum on the Rust side:

GIL acquire #1 → call Python fn → get coroutine object → GIL RELEASE
                       ↑ Rust task awaits here without holding the GIL ↑
GIL acquire #2 → extract Rust result from Python return value → GIL RELEASE

The coroutine body itself still runs on the Python event loop, which takes the GIL whenever it executes Python code. On the Rust side you cannot do it in fewer than 2 GIL acquire/release cycles for any async Python function. A custom Rust layer would arrive at the same floor.


3. Where a custom Rust layer WOULD actually help

Even though it can't eliminate the GIL cost, a custom Rust layer or different architecture could help in specific ways:

a) Deferred extraction — reduce with_gil overhead for pipelines

For a chain of async middlewares [a1, a2, a3], today you extract the Rust Request/Response value after each step and then convert it back to Python for the next. With a smarter design:

Keep intermediate values as raw Py<PyAny> (Python objects)
→ pass them directly to the next middleware without extracting to Rust
→ only extract to Rust at the very end of the chain

This trades away type-safe intermediate handling but could save the per-step extraction (the extract + re-wrap cycle). For a chain of 5 async middlewares, counting only Rust-side with_gil calls:

| Strategy | GIL operations |
| --- | --- |
| Current | 2×5 = 10 |
| Deferred extraction | 5 × 1 (launch) + 1 (final extract) = 6 |

This count ignores the GIL time the event loop needs to run each coroutine, and each launch would still have to check under the GIL whether the previous step returned a Response so the chain can exit early. In practice this is only a win if extraction itself is expensive (e.g., deep struct clone).

b) Move hot middleware to Rust entirely (the real win)

If common middleware — e.g., JWT validation, rate limiting, CORS checks — is rewritten as native Rust middleware rather than Python functions, the GIL disappears entirely for those steps:

sequenceDiagram
    participant R as Rust Executor
    participant P as Python (GIL)

    Note over R: auth check (Rust native) — NO GIL
    Note over R: rate limit (Rust native) — NO GIL
    R->>P: call user Python middleware (GIL acquired)
    P-->>R: result
    Note over R: response transform (Rust native) — NO GIL

This is architecturally the most impactful path — move infrastructure middleware to Rust, keep business-logic middleware in Python.

c) Python 3.13+ free-threaded mode (no-GIL build)

If Robyn adds support for Python 3.13+ compiled with --disable-gil, multiple with_gil calls no longer serialize on a global lock, so their cost drops sharply. This could bring the async case close to the current PR's sync case with little or no change to the executor.

Standard CPython:    N async middlewares → 2N serial GIL acquisitions
Free-threaded 3.13+: N async middlewares → 2N attach calls that no longer serialize on a global lock
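
As a rough illustration of how a process could tell which of the two modes it is running in, here is a minimal Python sketch. The helper name `describe_gil_mode` is hypothetical; it relies on `sysconfig.get_config_var("Py_GIL_DISABLED")` and on `sys._is_gil_enabled()`, which only exists on Python 3.13 and later, so older interpreters are treated as having the GIL active.

```python
import sys
import sysconfig


def describe_gil_mode():
    """Report whether this interpreter is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() was added in Python 3.13; assume the GIL is on if it is missing.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
    gil_active = True if is_gil_enabled is None else bool(is_gil_enabled())

    return {"free_threaded_build": free_threaded_build, "gil_active": gil_active}


if __name__ == "__main__":
    print(describe_gil_mode())
```

Even on a free-threaded build the GIL can be re-enabled at runtime (for example when an extension module does not declare free-threaded support), which is why the sketch reports both values separately.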

Summary

| Scenario | Can a custom Rust layer help? | Real solution |
| --- | --- | --- |
| Sync-only chains | ✅ Already done in this PR | Already optimal |
| Async-only chains | ❌ GIL is CPython-level, not PyO3-level | Free-threaded Python 3.13+, or rewrite in Rust |
| Mixed chains | ⚠️ Run-length batching (sync segments) | Already described in previous comment |
| Infrastructure middleware (auth, rate-limit) | ✅ Rewrite those in native Rust | Big win, no GIL at all |

The real 10x improvement path for Robyn is: native Rust middleware for common cross-cutting concerns + free-threaded Python 3.13+ support for user async middleware.
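
For the mixed-chain row, the run-length batching idea can be sketched in Python as a planning step. This is only an illustration: `plan_segments` and `count_gil_cycles` are hypothetical helpers, and the cost model (one GIL cycle per sync run, two per async middleware) is the one assumed in this discussion, not a measurement of Robyn's executor.

```python
import inspect
from itertools import groupby


def plan_segments(middlewares):
    """Group consecutive middlewares of the same kind into ("sync" | "async", [fns]) runs."""
    return [
        ("async" if is_async else "sync", list(group))
        for is_async, group in groupby(middlewares, key=inspect.iscoroutinefunction)
    ]


def count_gil_cycles(segments):
    """Assumed cost model: 1 cycle per sync run, 2 cycles per async middleware."""
    return sum(1 if kind == "sync" else 2 * len(fns) for kind, fns in segments)


# Hypothetical middlewares, used only to exercise the planner.
def log_request(request):
    return request


def add_header(request):
    return request


async def check_session(request):
    return request


def strip_cookies(request):
    return request


chain = [log_request, add_header, check_session, strip_cookies]
segments = plan_segments(chain)
kinds = [kind for kind, _ in segments]
cycles = count_gil_cycles(segments)
```

Under this model the four-middleware chain needs 4 cycles (1 + 2 + 1) instead of the 5 it would need if every sync middleware acquired the GIL on its own.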

(ᓀ‸ᓂ)

@coderabbitai

coderabbitai Bot commented May 29, 2026

Contributor

@thePromger Great questions — let me address both precisely.


1. Is the "one GIL for all sync tasks" optimization Python 3.14+ only?

No — it works on every Python version (3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14, …).

The optimization is entirely on the Rust side. Python::with_gil(|py| { ... }) is a PyO3 API that acquires CPython's interpreter mutex, runs a closure, then releases it. The PR simply reduces how many times that mutex is locked/unlocked:

Before: N sync middlewares → N lock/unlock cycles  (one Python::with_gil per middleware)
After:  N sync middlewares → 1 lock/unlock cycle   (one Python::with_gil for all of them)

This is pure Rust-side scheduling logic. Python itself doesn't change — you just hold the lock for longer in one stretch instead of repeatedly asking for it.
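
The difference can be simulated without PyO3. The Python sketch below uses a counting lock as a stand-in for the GIL; `CountingLock`, `run_per_middleware`, `run_batched`, and `add_header` are hypothetical names for illustration and are not part of Robyn or PyO3.

```python
import threading


class CountingLock:
    """Stand-in for the GIL that counts acquire/release cycles."""

    def __init__(self):
        self._lock = threading.Lock()
        self.cycles = 0

    def __enter__(self):
        self._lock.acquire()
        self.cycles += 1
        return self

    def __exit__(self, *exc_info):
        self._lock.release()
        return False


def run_per_middleware(lock, middlewares, request):
    """Before: take the lock once per middleware."""
    for middleware in middlewares:
        with lock:
            request = middleware(request)
    return request


def run_batched(lock, middlewares, request):
    """After: take the lock once and run the whole sync chain inside it."""
    with lock:
        for middleware in middlewares:
            request = middleware(request)
    return request


def add_header(name):
    def middleware(request):
        return {**request, name: "1"}
    return middleware


chain = [add_header("x-a"), add_header("x-b"), add_header("x-c")]
before, after = CountingLock(), CountingLock()
result_before = run_per_middleware(before, chain, {})
result_after = run_batched(after, chain, {})
```

Both strategies produce the same request; only the number of lock cycles changes (3 versus 1 for this chain).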

Where Python 3.13/3.14 is relevant: Python 3.13 introduced free-threaded mode (python3.13t, built with --disable-gil, PEP 703). In that mode the GIL is disabled, so batching GIL acquisitions no longer matters, but it requires a special Python build and is not yet the default.


2. Can a custom Rust async layer eliminate GIL overhead for async Python functions?

Short answer: No — not without fundamentally changing what Robyn allows users to write.

Here's why, broken into layers:

Why the constraint is fundamental

When a user writes an async def middleware(request): ... in Python, CPython compiles it to a coroutine function, and calling it returns a coroutine object. Driving that coroutine (calling .send() on it, receiving yielded values, resuming after await) requires Python bytecode execution, and on standard builds CPython's bytecode interpreter always requires the GIL.

Each async Python middleware step:
  GIL acquire #1 → schedule coroutine on asyncio → release GIL
                  ← Python event loop runs (may release GIL during I/O) →
  GIL acquire #2 → extract result from completed coroutine → release GIL
                   ↑
            This minimum cost is irreducible as long as the
            middleware is written in Python async

pyo3_async_runtimes::tokio::into_future() already implements this bridge as efficiently as possible. There is no shortcut — any custom Rust layer you write would face exactly the same constraint.
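
To make the "every resume runs Python bytecode" point concrete, here is a minimal pure-Python sketch that drives a coroutine by hand with .send(). The `suspend`, `middleware`, and `drive` names are hypothetical; a real event loop does considerably more, but each resume goes through the same coroutine protocol.

```python
import types


@types.coroutine
def suspend():
    """Yield control once, the way an awaited I/O operation would."""
    yield "pending"


async def middleware(request):
    await suspend()  # suspension point: control returns to whoever is driving
    return {**request, "seen": True}


def drive(coro):
    """Resume the coroutine until it finishes; return (result, number of resumes)."""
    resumes = 0
    while True:
        resumes += 1
        try:
            coro.send(None)  # executes Python bytecode up to the next suspension
        except StopIteration as stop:
            return stop.value, resumes


value, resumes = drive(middleware({"path": "/"}))
```

Here the coroutine is resumed twice: once to reach the suspension point and once to run to completion. Whatever layer performs those resumes, Rust or Python, has to execute interpreter bytecode to do so.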

What a custom Rust layer could and could not do

| Approach | What it helps | Hard limit |
| --- | --- | --- |
| Custom coroutine scheduler in Rust | Avoids pyo3_async_runtimes overhead, tighter Tokio integration | Still needs GIL for every Python .send() call; same minimum 2N GIL ops |
| Batch-scheduling multiple coroutines with asyncio.gather | Run N independent async tasks concurrently (not a middleware chain) | ❌ Middlewares are sequential by design; each needs the prior's output |
| Move middleware logic to pure Rust | Zero GIL cost, full Tokio parallelism | ❌ Users can no longer write middlewares in Python |
| Sub-interpreters (per-interpreter GIL, PEP 684, Python 3.12+) | Each sub-interpreter has its own GIL; true parallel Python | PyO3 support is limited; objects can't be shared across sub-interpreters easily; high overhead per-call |
| Free-threaded Python 3.13+ | GIL removed; Python threads run truly in parallel | Requires python3.13t build; not default; thread-safety for middleware state becomes user's problem |

The one genuine opportunity: Python 3.13 free-threaded mode

graph LR
    A["Standard CPython (3.x)"] -->|"GIL = 1 global mutex"| B["Sequential Python execution\nin any thread"]
    C["Free-threaded Python 3.13t"] -->|"Per-object locks instead"| D["True parallel Python\nacross threads"]
    D --> E["N async middlewares could overlap\nif they're I/O-bound"]

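If Robyn adds a build-time option to target python3.13t and builds against a PyO3 release with free-threaded support (added in the 0.23 series; the abi3 stable-ABI feature does not apply to free-threaded 3.13 builds), async middlewares that do I/O (DB queries, HTTP calls) could genuinely run concurrently without the GIL bottleneck. But: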

  • Still sequential for the chain (data dependency); concurrency only helps if a single middleware does internal concurrent I/O
  • Most users won't be on free-threaded Python yet
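
The first caveat can be sketched with plain asyncio: the chain itself runs one middleware after another, and overlap only appears inside a middleware that fans out its own I/O. All names here (`fetch`, `enrich`, `audit`, `run_chain`) are hypothetical, and `asyncio.sleep(0)` stands in for real I/O.

```python
import asyncio


async def fetch(name, log):
    """Pretend I/O call that records when it starts and finishes."""
    log.append(f"start:{name}")
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    log.append(f"end:{name}")
    return name


async def enrich(request, log):
    # Concurrency inside one middleware: both lookups are in flight together.
    user, quota = await asyncio.gather(fetch("user", log), fetch("quota", log))
    return {**request, "user": user, "quota": quota}


async def audit(request, log):
    log.append("audit")
    return {**request, "audited": True}


async def run_chain(request, middlewares, log):
    # The chain is sequential: each middleware needs the previous one's output.
    for middleware in middlewares:
        request = await middleware(request, log)
    return request


log = []
final_request = asyncio.run(run_chain({"path": "/"}, [enrich, audit], log))
```

In the recorded log, the second lookup starts before the first one ends, while `audit` only runs after `enrich` has fully completed.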

Summary

| Question | Answer |
| --- | --- |
| Is sync batching Python 3.14+ specific? | No: it works on all Python versions today |
| Can async middleware GIL cost be eliminated with a custom Rust layer? | No: driving Python coroutines always requires the GIL in standard CPython |
| Is there any async optimization path? | Yes: (1) run-length batching of mixed chains (previous reply), (2) free-threaded Python 3.13+ as an opt-in build target |
| Would writing our own async runtime help? | No: the bottleneck is CPython's interpreter, not PyO3 or Tokio's scheduling overhead |

🐇 🤔
