Merged
102 changes: 51 additions & 51 deletions docs/perf.md
@@ -12,44 +12,44 @@ The main tables are reproducible with `./bb -b release perf --duration 60s --war

| numjobs | iodepth | mode | IOPS | BW | avg | p50 | p95 | p99 | p99.9 |
|---|---|---|---|---|---|---|---|---|---|
-| 1 | 1 | randwrite | 175k | 684 MiB/s | 6 µs | 3 µs | 13 µs | 15 µs | 23 µs |
-| 1 | 16 | randwrite | 526k | 2056 MiB/s | 30 µs | 28 µs | 40 µs | 46 µs | 59 µs |
-| 16 | 1 | randwrite | 893k | 3488 MiB/s | 18 µs | 19 µs | 28 µs | 38 µs | 54 µs |
-| 16 | 16 | randwrite | 789k | 3080 MiB/s | 325 µs | 264 µs | 875 µs | 1387 µs | 1970 µs |
-| 1 | 1 | randread | 214k | 837 MiB/s | 5 µs | 3 µs | 12 µs | 14 µs | 22 µs |
-| 1 | 16 | randread | 655k | 2557 MiB/s | 24 µs | 26 µs | 32 µs | 40 µs | 54 µs |
-| 16 | 1 | randread | 2592k | 10125 MiB/s | 6 µs | 4 µs | 16 µs | 33 µs | 120 µs |
-| 16 | 16 | randread | 5818k | 22726 MiB/s | 44 µs | 40 µs | 70 µs | 83 µs | 113 µs |
+| 1 | 1 | randwrite | 165k | 644 MiB/s | 6 µs | 4 µs | 13 µs | 15 µs | 23 µs |
+| 1 | 16 | randwrite | 518k | 2023 MiB/s | 31 µs | 29 µs | 41 µs | 47 µs | 62 µs |
+| 16 | 1 | randwrite | 928k | 3626 MiB/s | 17 µs | 18 µs | 28 µs | 36 µs | 52 µs |
+| 16 | 16 | randwrite | 829k | 3238 MiB/s | 309 µs | 259 µs | 794 µs | 1304 µs | 1883 µs |
+| 1 | 1 | randread | 197k | 769 MiB/s | 5 µs | 3 µs | 13 µs | 15 µs | 22 µs |
+| 1 | 16 | randread | 640k | 2501 MiB/s | 25 µs | 25 µs | 32 µs | 40 µs | 55 µs |
+| 16 | 1 | randread | 2369k | 9254 MiB/s | 7 µs | 4 µs | 17 µs | 36 µs | 117 µs |
+| 16 | 16 | randread | 6019k | 23513 MiB/s | 43 µs | 45 µs | 65 µs | 81 µs | 112 µs |

-**Best throughput** (`numjobs=16 iodepth=16 randread`): 5818k IOPS, 22.2 GiB/s.
+**Best throughput** (`numjobs=16 iodepth=16 randread`): 6019k IOPS, 23.0 GiB/s.

**Best latency** (`numjobs=1 iodepth=1`): 3-4 µs p50 for both read and write.

-At `iodepth=16`, SQEs are batched per fiber suspension (one `io_uring_submit` per fiber run instead of one per SQE).
+**Note on batching**: The default `Options::ioUringFlushThreshold = 64` defers `io_uring_submit` until the SQ ring has accumulated enough work to amortize the syscall -- the right trade for network/HTTP/S3 workloads where completion latency dwarfs the few-µs batching delay (see net-perf below for the resulting p99 win). On tmpfs the kernel completes reads inline at submit time, so any deferral pushes submissions off the inline-completion fast path. `file-perf` therefore initializes the scheduler with `ioUringFlushThreshold = 1`, equivalent to per-fiber submit. Measured under the default threshold (64), `16/1 randread` lands at ~1.6M IOPS and `16/16 randread` at ~4.2M -- the override recovers full throughput without any kernel or scheduler change.
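
For concreteness, a minimal sketch of that override, using only the `Options` / `initialize` API added in `include/silk/fibers/fiber.h` below -- the actual file-perf wiring is not part of this diff, so the function here is hypothetical:

```cpp
#include "silk/fibers/fiber.h"

// Hypothetical file-perf setup: opt out of deferred batching so tmpfs reads
// stay on the inline-completion fast path.
void initForTmpfs()
{
    silk::FiberScheduler::Options opts;   // defaults: ioUringFlushThreshold = 64
    opts.ioUringFlushThreshold = 1;       // submit per fiber run, no deferral
    silk::FiberScheduler::initialize(&opts);
}
```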

---

## fio comparison (io_uring, /dev/shm, bs=4k, size=1 GiB)

| numjobs | iodepth | mode | IOPS | BW | avg | p50 | p95 | p99 | p99.9 |
|---|---|---|---|---|---|---|---|---|---|
-| 1 | 1 | randwrite | 63k | 246 MiB/s | 14 µs | 13 µs | 18 µs | 26 µs | 39 µs |
-| 1 | 16 | randwrite | 797k | 3114 MiB/s | 19 µs | 19 µs | 22 µs | 28 µs | 34 µs |
-| 16 | 1 | randwrite | 723k | 2824 MiB/s | 20 µs | 20 µs | 27 µs | 34 µs | 45 µs |
-| 16 | 16 | randwrite | 800k | 3124 MiB/s | 317 µs | 301 µs | 354 µs | 1778 µs | 5341 µs |
-| 1 | 1 | randread | 71k | 279 MiB/s | 12 µs | 13 µs | 16 µs | 25 µs | 38 µs |
-| 1 | 16 | randread | 973k | 3801 MiB/s | 16 µs | 15 µs | 22 µs | 29 µs | 46 µs |
-| 16 | 1 | randread | 1178k | 4601 MiB/s | 12 µs | 13 µs | 19 µs | 32 µs | 44 µs |
-| 16 | 16 | randread | 10464k | 40876 MiB/s | 23 µs | 20 µs | 39 µs | 75 µs | 127 µs |
+| 1 | 1 | randwrite | 63k | 247 MiB/s | 14 µs | 13 µs | 17 µs | 26 µs | 38 µs |
+| 1 | 16 | randwrite | 753k | 2943 MiB/s | 20 µs | 19 µs | 27 µs | 31 µs | 37 µs |
+| 16 | 1 | randwrite | 719k | 2808 MiB/s | 20 µs | 20 µs | 27 µs | 34 µs | 45 µs |
+| 16 | 16 | randwrite | 766k | 2991 MiB/s | 332 µs | 313 µs | 395 µs | 2040 µs | 5145 µs |
+| 1 | 1 | randread | 65k | 254 MiB/s | 14 µs | 13 µs | 18 µs | 26 µs | 38 µs |
+| 1 | 16 | randread | 967k | 3777 MiB/s | 16 µs | 14 µs | 23 µs | 30 µs | 46 µs |
+| 16 | 1 | randread | 1196k | 4670 MiB/s | 11 µs | 13 µs | 19 µs | 32 µs | 44 µs |
+| 16 | 16 | randread | 10778k | 42100 MiB/s | 23 µs | 20 µs | 38 µs | 68 µs | 102 µs |

-At `iodepth=1`, the fiber scheduler outperforms fio (2-3x): fio uses one OS thread per job, so each IO incurs a full OS scheduler round-trip. At `iodepth=16`, fio wins; the fiber scheduler batches all SQEs the fiber enqueued during a run into one `io_uring_submit` per fiber suspension, the same principle fio uses.
+At `iodepth=1`, the fiber scheduler outperforms fio (2-3x): fio uses one OS thread per job, so each IO incurs a full OS scheduler round-trip. At `iodepth=16`, fio batches all SQEs the worker enqueued into one submit per round-trip and wins; the fiber scheduler does the same per fiber (with deferred batching disabled -- file-perf opts out, see the batching note above) but pays additional dispatch overhead for the IoFuture ring.

| config | fiber IOPS | fio IOPS | ratio |
|---|---|---|---|
-| 1 job, iodepth=1, randread | 214k | 71k | 3.0x |
-| 16 jobs, iodepth=1, randread | 2592k | 1178k | 2.2x |
-| 1 job, iodepth=16, randread | 655k | 973k | 0.67x |
-| 16 jobs, iodepth=16, randread | 5818k | 10464k | 0.56x |
+| 1 job, iodepth=1, randread | 197k | 65k | 3.0x |
+| 16 jobs, iodepth=1, randread | 2369k | 1196k | 1.98x |
+| 1 job, iodepth=16, randread | 640k | 967k | 0.66x |
+| 16 jobs, iodepth=16, randread | 6019k | 10778k | 0.56x |

---

@@ -59,12 +59,12 @@ Loopback TCP, 64 B messages, 60 s measurement, 10 s warmup. Socket I/O uses `Fib

| connections | RPS | BW | avg | p50 | p95 | p99 | p99.9 |
|---|---|---|---|---|---|---|---|
-| 1 | 42k | 3 MiB/s | 24 µs | 27 µs | 31 µs | 36 µs | 43 µs |
-| 256 | 1854k | 113 MiB/s | 138 µs | 122 µs | 319 µs | 338 µs | 364 µs |
-| 512 | 1870k | 114 MiB/s | 274 µs | 111 µs | 1086 µs | 1244 µs | 1305 µs |
-| 1024 | 1917k | 117 MiB/s | 534 µs | 154 µs | 3493 µs | 3607 µs | 3816 µs |
+| 1 | 39k | 2 MiB/s | 26 µs | 27 µs | 32 µs | 37 µs | 44 µs |
+| 256 | 1951k | 119 MiB/s | 131 µs | 111 µs | 296 µs | 367 µs | 405 µs |
+| 512 | 2162k | 132 MiB/s | 237 µs | 226 µs | 412 µs | 457 µs | 856 µs |
+| 1024 | 2223k | 136 MiB/s | 461 µs | 447 µs | 671 µs | 861 µs | 1787 µs |

-Throughput plateaus at ~1.85-1.92M req/s by 256 connections -- the server is fully saturated. The large gap between p50 and avg at high concurrency (e.g. 154 µs vs 534 µs at 1024 conns) reflects a bimodal distribution: most requests are served promptly but a tail stalls.
+Throughput scales from ~1.95M req/s at 256 conns to ~2.22M at 1024 conns. p50 and avg track closely at all concurrency levels (447 µs vs 461 µs at 1024 conns) -- the distribution is roughly unimodal; tail spikes (p99.9 1.79 ms at 1024 conns) reflect occasional stalls but stay within ~4x of the median. Submission is bounded-batched at the dispatch boundary (see Latency profiler below); without that bound, this workload's p50 is lower but p95/p99/p99.9 inflate by 5-10x.

---

@@ -83,14 +83,14 @@ Same workload as net-perf above, reimplemented with Boost.Asio C++20 coroutines

| connections | net-perf RPS | net-perf-asio RPS | ratio |
|---|---|---|---|
-| 1 | 42k | 3k | **~14x** |
-| 256 | 1854k | 377k | **~4.9x** |
-| 512 | 1870k | 383k | **~4.9x** |
-| 1024 | 1917k | 380k | **~5.0x** |
+| 1 | 39k | 3k | **~13x** |
+| 256 | 1951k | 377k | **~5.2x** |
+| 512 | 2162k | 383k | **~5.6x** |
+| 1024 | 2223k | 380k | **~5.9x** |

Two structural differences explain most of the gap. First, net-perf uses io_uring for all socket I/O while Asio uses epoll; io_uring avoids the per-operation `epoll_ctl` + `epoll_wait` + `recv`/`send` syscall chain. Second, the fiber scheduler's per-CPU pinned scheduler threads pick up completions via `io_uring_enter`, while Asio's reactor threads block in `epoll_wait` and resume via a pthread wakeup.
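
For concreteness, the per-operation syscall shape each engine implies -- a schematic sketch, not code from either benchmark (the fds, buffer, and ring setup are assumed):

```cpp
#include <liburing.h>
#include <sys/epoll.h>
#include <sys/socket.h>

// epoll path: readiness first, then the IO -- up to three syscalls per receive.
void epollRecv(int epfd, int fd, char * buf, size_t len)
{
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLONESHOT;        // oneshot forces the per-op re-arm
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);   // re-arm interest
    epoll_event ready;
    epoll_wait(epfd, &ready, 1, -1);           // block for readiness
    recv(fd, buf, len, 0);                     // perform the IO
}

// io_uring path: one SQE staged in user space; the submit syscall is shared
// by every SQE staged in the same batch.
void uringRecv(io_uring & ring, int fd, char * buf, size_t len)
{
    io_uring_sqe * sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    io_uring_submit(&ring);
}
```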

-The gap is largest at 1 connection (~14x) where per-operation scheduling overhead dominates with no parallelism to hide it, and narrows to ~5x at high connection counts where the server CPU half is the bottleneck.
+The gap is largest at 1 connection (~13x) where per-operation scheduling overhead dominates with no parallelism to hide it, and stays around 5-6x at high connection counts where the server CPU half is the bottleneck.

---

@@ -100,21 +100,21 @@ Same workload as net-perf above, reimplemented as the simplest efficient epoll l

| connections | RPS | BW | avg | p50 | p95 | p99 | p99.9 |
|---|---|---|---|---|---|---|---|
-| 1 | 40k | 2 MiB/s | 25 µs | 25 µs | 29 µs | 34 µs | 41 µs |
-| 256 | 2540k | 155 MiB/s | 101 µs | 97 µs | 155 µs | 171 µs | 189 µs |
-| 512 | 2545k | 155 MiB/s | 201 µs | 196 µs | 276 µs | 298 µs | 328 µs |
-| 1024 | 2411k | 147 MiB/s | 425 µs | 428 µs | 520 µs | 552 µs | 611 µs |
+| 1 | 42k | 3 MiB/s | 24 µs | 25 µs | 29 µs | 35 µs | 41 µs |
+| 256 | 2530k | 154 MiB/s | 101 µs | 98 µs | 150 µs | 169 µs | 285 µs |
+| 512 | 2532k | 155 MiB/s | 202 µs | 196 µs | 265 µs | 288 µs | 914 µs |
+| 1024 | 2507k | 153 MiB/s | 408 µs | 401 µs | 510 µs | 548 µs | 1972 µs |

**Comparison with net-perf (fibers + io_uring), same-run measurements:**

| connections | net-perf RPS | net-perf-epoll RPS | RPS ratio | net-perf p99 | net-perf-epoll p99 | p99 ratio |
|---|---|---|---|---|---|---|
-| 1 | 42k | 40k | 0.95x | 36 µs | 34 µs | 0.94x |
-| 256 | 1854k | 2540k | **1.37x** | 338 µs | 171 µs | **0.51x** |
-| 512 | 1870k | 2545k | **1.36x** | 1244 µs | 298 µs | **0.24x** |
-| 1024 | 1917k | 2411k | **1.26x** | 3607 µs | 552 µs | **0.15x** |
+| 1 | 39k | 42k | 1.08x | 37 µs | 35 µs | 0.95x |
+| 256 | 1951k | 2530k | **1.30x** | 367 µs | 169 µs | **0.46x** |
+| 512 | 2162k | 2532k | **1.17x** | 457 µs | 288 µs | **0.63x** |
+| 1024 | 2223k | 2507k | **1.13x** | 861 µs | 548 µs | **0.64x** |

-At 1 connection both are equivalent -- the host has spare CPU and engine overhead is invisible. Past saturation raw epoll wins ~25-40% on throughput and 2-7x on p99 tail latency. Per-CPU rate at saturation (256 conns, 16 server CPUs): fibers ≈ 116k req/cpu (8.6 µs CPU/req), epoll ≈ 159k req/cpu (6.3 µs CPU/req); the 2.3 µs/req gap is the cost of the fiber abstraction in this workload -- fiber suspend/resume + io_uring SQE/CQE submission + ready-queue bookkeeping per round-trip. The epoll loop services its connections in round-robin within each worker, so per-connection treatment is uniform -- p99 stays close to p50 (552 µs vs 428 µs at 1024 conns), while net-perf shows a wide gap (3607 µs p99 vs 154 µs p50 at the same conn count).
+At 1 connection both are equivalent -- the host has spare CPU and engine overhead is invisible. Past saturation raw epoll wins 13-30% on throughput and 1.5-2.2x on p99 tail latency. Per-CPU rate at saturation (256 conns, 16 server CPUs): fibers ≈ 122k req/cpu (8.2 µs CPU/req), epoll ≈ 158k req/cpu (6.3 µs CPU/req); the 1.9 µs/req gap is the cost of the fiber abstraction in this workload -- fiber suspend/resume + io_uring SQE/CQE submission + ready-queue bookkeeping per round-trip. The epoll loop services its connections in round-robin within each worker, so per-connection treatment is uniform -- p99 stays close to p50 (548 µs vs 401 µs at 1024 conns); net-perf with bounded-batched submission also keeps a tight ratio (861 µs p99 vs 447 µs p50, ~1.9x).
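
(The per-CPU figures are derived directly from the table: RPS / 16 server CPUs gives the per-CPU rate, and its inverse gives CPU time per request -- e.g. 1951k / 16 ≈ 122k req/s per CPU, so 1 / 122k s ≈ 8.2 µs of CPU per request.)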

What raw epoll gives up: composability. The state machine can't naturally accommodate sleeps (no `--delay` support), multi-step protocols, or branching control flow without growing into a small interpreter. net-perf-epoll is the throughput floor; net-perf is the structure you'd actually program against.

Expand Down Expand Up @@ -214,26 +214,26 @@ Per-CPU profiler (opted in via `--print-counters`) emits log2 histograms for fiv
| event | interval |
|---|---|
| `suspend_wait` | suspended -> next `enqueueReady` (blocked-on-condition latency) |
-| `io_submit` | `io_uring_submit` syscall per fiber-suspend flush |
+| `io_submit` | `io_uring_submit` syscall (one per dispatch batch) |
| `io_wait` | `enqueueIo` -> CQE handled (kernel IO latency) |
| `ready_wait` | `enqueueReady` -> dispatch (ready-queue dwell) |
| `fiber_run` | `switchToFiberContext` -> return (on-CPU time per slice) |

-### Per-IO breakdown (net-perf, 1000 connections, 60 s, 10 s warmup, 1856k RPS)
+### Per-IO breakdown (net-perf, 1000 connections, 60 s, 10 s warmup, 2164k RPS)

Reproduced with `./bb -b release net-perf --connections 1000 --duration 60s --warmup 10s --print-counters`.

| event | p50 | p90 | p99 | p99.9 |
|---|---|---|---|---|
-| `suspend_wait` | 37.3 µs | 593 µs | 3.6 ms | 4.1 ms |
-| `io_submit` | 4.1 µs | 7.9 µs | 15.5 µs | 24.5 µs |
-| `io_wait` | 42.6 µs | 597 µs | 3.6 ms | 4.1 ms |
-| `ready_wait` | 26.5 µs | 130 µs | 661 µs | 1.0 ms |
-| `fiber_run` | 199 ns | 347 ns | 500 ns | 3.3 µs |
+| `suspend_wait` | 189 µs | 396 µs | 817 µs | 1.0 ms |
+| `io_submit` | 124 µs | 234 µs | 260 µs | 446 µs |
+| `io_wait` | 189 µs | 396 µs | 817 µs | 1.0 ms |
+| `ready_wait` | 12.2 µs | 32.4 µs | 450 µs | 797 µs |
+| `fiber_run` | 193 ns | 376 ns | 1.2 µs | 5.5 µs |

-`fiber_run` p50 = 199 ns confirms the dispatch loop itself is essentially free; this workload is entirely IO-bound. `SchedulerSystemTime` totals 1060 CPU-s (55% of 32 cores x 60 s) -- almost entirely `io_uring_submit`: 258 M syscalls x 4.1 µs = 1058 s. User-mode fiber work consumes 51 s (2.7%); idle time is 165 s (8.6%).
+`fiber_run` p50 = 193 ns confirms the dispatch loop itself remains essentially free; this workload is IO-bound. `SchedulerSystemTime` totals 1052 CPU-s (55% of 32 cores x 60 s); the dominant cost is `io_uring_submit`, but it now amortizes over batched SQEs: 7.7 M syscalls averaging ~42 SQEs each (vs. ~258 M per-fiber syscalls before batching). User-mode fiber work consumes 67 s (3.5%); idle time is 194 s (10%).

-The profile pinpoints `io_uring_submit` as the dominant lever. SQPOLL is not enabled (the per-CPU pinned scheduler shares the CPU with any kernel poller). The next optimization to consider is batching submits at the `handleReadyQueue` boundary instead of per-fiber-suspend.
+Submission is bounded-batched at the dispatch boundary: `runFiber` calls `submitIo(false)` after each fiber, which threshold-gates the syscall (it fires once the SQ ring holds at least `Options::ioUringFlushThreshold` SQEs, 64 by default); at end of drain, `handleReadyQueue` and `runStealLoop` call `submitIo(true)` to force-flush partial batches. The threshold caps both per-syscall cost (linear in SQE count) and SQ ring overflow. SQPOLL is not enabled because the per-CPU pinned scheduler design would put the kernel poller in contention with the user-space scheduler thread. The residual cost is the kernel's per-SQE work, which is already amortized as well as user-space batching can manage.
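
A minimal sketch of that gating, under assumptions: the scheduler's ring member and exact call sites are not shown in this diff, so the function shape here is illustrative; the liburing calls and `FiberScheduler::getOptions` (added in `fiber.h` below) are real.

```cpp
#include <liburing.h>
#include "silk/fibers/fiber.h"

// force=false: per-fiber call from runFiber (threshold-gated).
// force=true:  end-of-drain flush from handleReadyQueue / runStealLoop.
void submitIo(io_uring & ring, bool force) noexcept
{
    unsigned pending = io_uring_sq_ready(&ring);   // staged, unsubmitted SQEs
    if (pending == 0)
        return;
    if (force || pending >= silk::FiberScheduler::getOptions().ioUringFlushThreshold)
        io_uring_submit(&ring);                    // one syscall for the whole batch
}
```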

### Profiler overhead (net-perf, 1000 connections, 60 s, 10 s warmup)

37 changes: 36 additions & 1 deletion include/silk/fibers/fiber.h
@@ -89,6 +89,35 @@ class FiberScheduler
*/
struct Options
{
+// Per-fiber stack size in bytes. Must be a multiple of the system page size.
+// The pool also reserves two guard pages adjacent to each stack.
+uint64_t fiberStackSize = 64 * 1024;
+
+// Per-CPU ready queue capacity (fibers). Must be a power of two and >= 2.
+// Sized to absorb dispatch bursts without falling back to the global queue.
+uint64_t readyQueueCapacity = 1024;
+
+// Per-CPU io_uring SQ ring capacity. Must be a power of two; the kernel
+// rounds up to the nearest supported size.
+uint64_t ioUringQueueSize = 256;
+
+// Mid-drain submit threshold: submitIo defers io_uring_enter until the SQ
+// ring holds at least this many entries. Lower values approach per-fiber
+// submit (better latency on inline-completion workloads); higher values
+// amortize syscall cost across more SQEs (better throughput on networked
+// workloads). Must be <= ioUringQueueSize.
+uint32_t ioUringFlushThreshold = 64;
+
+// Hash-table size for futex-style waiter lookups. Must be a power of two.
+uint64_t waiterTableSize = 4096;
+
+// Scheduler park backoff (nanoseconds). The dispatch loop spins for up to
+// spinThresholdNs after going idle; past that it parks on the eventfd with
+// an exponential backoff starting at initialWaitNs and capped at maxWaitNs.
+uint64_t initialWaitNs = 1'000;
+uint64_t maxWaitNs = 10'000'000;
+uint64_t spinThresholdNs = 20'000;
+
// Allocate per-CPU latency profilers.
bool enableProfiler = false;
};
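
A sketch of the park policy those last three knobs describe -- the dispatch loop itself is not part of this diff, so `outOfWork`, `cpuRelax`, and `parkOnEventfd` are hypothetical stand-ins:

```cpp
#include <algorithm>
#include <cstdint>

void idleLoop(const silk::FiberScheduler::Options & opts)
{
    uint64_t idleNs = 0;
    uint64_t waitNs = opts.initialWaitNs;
    while (outOfWork())                        // hypothetical: queues and CQ empty
    {
        if (idleNs < opts.spinThresholdNs)
        {
            cpuRelax();                        // hypothetical: pause, stay hot
            idleNs += 100;                     // illustrative per-spin estimate
        }
        else
        {
            parkOnEventfd(waitNs);             // hypothetical: sleep until wake/timeout
            waitNs = std::min(2 * waitNs, opts.maxWaitNs);
        }
    }
}
```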
@@ -100,6 +129,11 @@
*/
static void initialize(const Options * options = nullptr) noexcept;

+/**
+ * Return the active configuration. Set by initialize; immutable thereafter.
+ */
+static const Options & getOptions() noexcept { return options; }

/**
* Stop all scheduler threads and release all resources.
* No fibers may be running or scheduled when this is called.
Expand Down Expand Up @@ -482,7 +516,8 @@ class FiberScheduler
// State.
//

-inline static SchedulerState * scheduler;
+static Options options;
+static SchedulerState * scheduler;
};

} // namespace silk
29 changes: 24 additions & 5 deletions include/silk/util/bounded-queue.h
@@ -25,8 +25,23 @@ template <typename T>
class BoundedQueue
{
public:
-explicit BoundedQueue(uint64_t capacity) noexcept
+/**
+ * Default-construct an empty, unusable queue. Call initialize before any
+ * enqueue/dequeue/empty access. Useful when the queue is a member of a
+ * larger structure whose capacity is known only after construction.
+ */
+BoundedQueue() noexcept = default;
+
+// Convenience: construct and initialize in one step.
+explicit BoundedQueue(uint64_t capacity) noexcept { initialize(capacity); }
+
+/**
+ * Allocate the slot array. The queue must be in its default-constructed
+ * state (no slots yet). Capacity must be a power of two and >= 2.
+ */
+void initialize(uint64_t capacity) noexcept
{
+SILK_ASSERT(!slots);
SILK_ASSERT(capacity >= 2 && (capacity & (capacity - 1)) == 0);
mask = capacity - 1;
slots = std::make_unique<Slot[]>(capacity);
@@ -36,8 +51,10 @@ class BoundedQueue
}
}

-/** Append a value to the tail of the queue. Returns false if the queue is full or
- * a concurrent dequeue is in progress on the target slot (see class caveat). */
+/**
+ * Append a value to the tail of the queue. Returns false if the queue is full or
+ * a concurrent dequeue is in progress on the target slot (see class caveat).
+ */
[[nodiscard]] bool enqueue(T value) noexcept
{
uint64_t pos = enqueuePos.load(std::memory_order_relaxed);
Expand Down Expand Up @@ -66,8 +83,10 @@ class BoundedQueue
}
}

-/** Write the head value into @p value. Returns false if the queue is empty or
- * a concurrent enqueue is in progress on the target slot (see class caveat). */
+/**
+ * Write the head value into @p value. Returns false if the queue is empty or
+ * a concurrent enqueue is in progress on the target slot (see class caveat).
+ */
[[nodiscard]] bool dequeue(T * value) noexcept
{
uint64_t pos = dequeuePos.load(std::memory_order_relaxed);
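
A usage sketch of the two-phase initialization this file adds -- the `silk` namespace and the `Fiber *` element type are assumed for illustration:

```cpp
#include "silk/util/bounded-queue.h"

struct Fiber;   // illustrative element type

struct PerCpuState
{
    silk::BoundedQueue<Fiber *> ready;   // capacity not known at construction
};

void setUp(PerCpuState & cpu, uint64_t readyQueueCapacity)
{
    // Two-phase init: default-constructed member, sized once options are read.
    cpu.ready.initialize(readyQueueCapacity);   // power of two, >= 2 (asserted)

    Fiber * f = nullptr;
    if (!cpu.ready.enqueue(f))
    {
        // Full (or a concurrent dequeue raced): fall back, e.g. to a global queue.
    }
    (void)cpu.ready.dequeue(&f);   // non-blocking; false when empty
}
```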