Skip to content

Stability and perf improvements (credit: ChannyAnh)#25

Merged
robinbraemer merged 10 commits into
connectfrom
fix/perf-and-stability-channyanh
May 12, 2026
Merged

Stability and perf improvements (credit: ChannyAnh)#25
robinbraemer merged 10 commits into
connectfrom
fix/perf-and-stability-channyanh

Conversation

@robinbraemer
Copy link
Copy Markdown
Member

@robinbraemer robinbraemer commented May 11, 2026

ChannyAnh sent a detailed diagnosis on 2026-05-10 covering 10 issues in the core module that surface as ping spikes, random disconnects, broken commands, and one latent mass-disconnect crash. This PR applies all of them as separate commits so each can be reviewed (and reverted) in isolation, plus adds the test coverage that didn't previously exist for this module.

Source fixes

# Commit Type What it fixes
1 fix(core): collect usernames in CommandUtil.getOnlineUsernames correctness forEach(this::method) discarded return values, so getOnlineUsernames() was always empty — broke tab completion and player lookup.
2 fix(core): guard against null bytes in Tunneler.onMessage crash defense getCastedValue could return null on either missing-field or runtime IllegalAccessException; Unpooled.wrappedBuffer(null) then NPE'd the OkHttp dispatcher thread and dropped every tunnel on it.
3 fix(core): use single ScheduledExecutorService instead of per-reconnect Timer thread leak new Timer() per reconnect leaked OS threads. Lazy single ScheduledExecutorService keyed by a Netty DefaultThreadFactory daemon thread; start() / stop() reuse cleanly; late callbacks null-guard the scheduler; scheduler/retryFuture/ws marked volatile; stop() is fully CAS-gated.
4 fix(core): bound HttpUtils async pool to 4 daemon threads perf newSingleThreadExecutor serialized skin/health/endpoint calls; threads were non-daemon.
5 fix(core): use concurrent collections for handler registries concurrency HashSet/HashMap in CommonPlatformInjector and PacketHandlersImpl raced under Netty I/O threads → CME → channel close → kicks. deregister() also null-guards the registerAll case (ConcurrentHashMap rejects null keys) and uses computeIfPresent to avoid the empty-set-removal race.
6 fix(core): reference inequality in packet-handler "replaced?" check semantic !res.equals(msg) on every packet; identity is the correct + cheaper semantics. Mirror change applied to outbound.
7 perf(core): batch flush in TunnelHandler.onReceive perf writeAndFlush per packet = one syscall per packet (100–500/s). Coalesced via voidPromise + AtomicBoolean. CAS lives inside the write task so flush is always enqueued after the write that needs it. onClose() now flushes before close to avoid losing the final payload. Both paths defend against RejectedExecutionException during JVM shutdown.
8 perf(core): reduce LocalSession connect timeout from 30s to 10s perf LocalChannel is in-process; 30s held EventLoop slots on failed attempts.

Test coverage

Bootstraps the test suite for the core module (previously zero tests). 38 unit tests, all green, covering each fix above.

Test infra: JUnit Jupiter 5.10.5 + Mockito 4.11.0 + Awaitility 4.2.2 + Netty test support + OkHttp MockWebServer 4.9.3 — all pinned to versions compatible with the Java 8 source target.

Notable patterns:

  • TunnelerTest drives a real OkHttp WebSocket via MockWebServer and uses sun.misc.Unsafe to force Tunneler.DATA = null to verify the fallback path.
  • TunnelHandlerTest uses a Mockito Channel with a real DefaultEventLoop to exercise CAS-based flush coalescing — EmbeddedChannel's pipeline re-enters the task queue mid-write and can't be used here.
  • PacketHandlersImplTest includes a stress loop for the concurrent-register-during-deregister race.

Verification

  • ./gradlew :core:test — 38 tests, all pass.
  • ./gradlew compileJava clean across all 5 modules.
  • Reviewed across multiple rounds — independent agent + three parallel simplify reviewers + six rounds of codex review --base connect. Each surfaced real issues that were folded back into the relevant commit via --fixup/--autosquash.

Caveats

  • Bug-fix → user-visible-symptom attribution comes from ChannyAnh's analysis and matches the code, but production impact for the perf fixes hasn't been measured. Frame as "stability/perf hardening" rather than "fixes X user report."
  • Built jars (connect-velocity.jar, connect-spigot.jar, connect-bungee.jar) available under */build/libs/ for local end-to-end testing.

Credit

Diagnoses and original patches by ChannyAnh.

Latest update

  • Ported Floodgate Spigot injector race fix: the temporary child-channel injector is now attached before the accepted channel continues through the pipeline, and Connect addon injection runs on child channel activation.
  • Re-verified locally with the Spigot compile plus core and Velocity tests; CI build is green.

forEach(this::getUsernameFromSource) discards the return value, so the
returned collection was always empty. ProfileAudienceParser depends on
this list for tab-completion and player lookup, silently breaking both.

Diagnosis credit to ChannyAnh.
ReflectionUtils.getCastedValue(bytes, DATA) can return null in two cases:
the DATA field is missing (e.g. after Okio relocation/shading), or
field.get() throws IllegalAccessException at runtime, which
ReflectionUtils.getValue swallows. Either way, null then flowed into
handler.onReceive(null) → Unpooled.wrappedBuffer(null) → NPE, which
crashes the OkHttp dispatcher thread and drops every tunnel sharing it.

Fall back to the safe copying path (bytes.toByteArray()) when the
zero-copy reflection is unavailable.

Diagnosis credit to ChannyAnh.
…ct Timer

java.util.Timer spawns one OS thread per instance. WatcherRegister
created a fresh Timer on every WatcherImpl.startResetBackOffTimer() call
and never released the prior one (cancel() ends scheduled tasks but
relies on the timer thread observing the cancel). Under reconnect storms
this leaked daemon threads, adding GC pressure and stop-the-world pauses
visible as server-wide ping spikes.

Replace both Timer fields (retry + reset-backoff) with a single
single-thread ScheduledExecutorService keyed by daemon thread, and
cancel ScheduledFutures instead of cancelling Timers.

Diagnosis credit to ChannyAnh.
newSingleThreadExecutor() serialized every async HTTP call (skin lookups,
health checks, endpoint-name lookups). A slow Mojang skin response would
back up the queue and delay anything else. The executor also used
non-daemon threads, blocking JVM shutdown.

Replace with newFixedThreadPool(4) of named daemon threads.

Diagnosis credit to ChannyAnh.
CommonPlatformInjector.injectedClients/addons and PacketHandlersImpl's
three maps/sets are read on Netty I/O threads (per-channel) and mutated
from registration calls that can originate on any thread. HashSet/HashMap
under concurrent access produces ConcurrentModificationException, which
Netty propagates to the uncaught-exception handler and closes the
channel — visible as random disconnects under load.

Swap injectedClients to a ConcurrentHashMap-backed set, addons to a
ConcurrentHashMap, and the three registries in PacketHandlersImpl to
ConcurrentHashMap / CopyOnWriteArraySet / CopyOnWriteArrayList.

Diagnosis credit to ChannyAnh.
ChannelIn/OutPacketHandler called !res.equals(msg) to detect whether a
handler had replaced the packet. The PacketHandler contract is "return
the same object or a new one", so identity is the correct semantics.
Using .equals() also risks an expensive deep field comparison on every
inbound/outbound packet if the underlying packet types ever override
equals.

Diagnosis credit to ChannyAnh. (ChannelOutPacketHandler change inferred
from the bug description — the same line existed verbatim.)
writeAndFlush on every incoming packet issued one kernel flush syscall
per packet. With normal game traffic (100-500 packets/s per session)
this is the dominant CPU cost on the receive path and shows up as ping
spikes under burst load.

Schedule each write on the channel's EventLoop with voidPromise() to
avoid ChannelFuture allocation, and coalesce all writes within an
EventLoop tick into a single flush guarded by an AtomicBoolean.

Diagnosis credit to ChannyAnh.
CONNECTION_TIMEOUT governs the Netty LocalChannel bootstrap, which is
an in-process connection that should complete in microseconds. The
original 30s held an EventLoop slot for half a minute on every failed
attempt; under congestion this starved healthy sessions on the same
EventLoop and surfaced as ping spikes. 10s is still well above any
plausible legitimate latency.

Diagnosis credit to ChannyAnh.
Bootstraps the test suite for the core module (previously had zero tests)
and adds coverage for each of the 8 fixes in this branch:

- CommandUtilTest — verifies getOnlineUsernames returns one entry per
  online player (was always empty).
- TunnelerTest — drives a real OkHttp WebSocket via MockWebServer; verifies
  the null-bytes fallback by forcing the reflected ByteString.data field
  to null and asserting the safe bytes.toByteArray() path delivers the
  correct payload.
- WatcherRegisterTest — verifies the lazy scheduler lifecycle: never
  created at field init, recreated on start/stop reuse, late-callback
  schedule sites tolerate a null scheduler, and Timer fields are gone.
- HttpUtilsTest — verifies the bounded daemon pool: at least 4 concurrent
  tasks, threads named connect-http-worker, all daemons.
- PacketHandlersImplTest — verifies register/deregister/registerAll
  including the null-key contract for global handlers and the atomic
  computeIfPresent guard against concurrent-register-during-deregister.
- CommonPlatformInjectorTest — concurrent add/remove of injected clients
  with parallel iteration; addon map mutation while iterated.
- ChannelIn/OutPacketHandlerTest — verifies the != identity check
  forwards new packet references that .equals() the original.
- TunnelHandlerTest — uses a Mockito Channel with a real DefaultEventLoop
  to faithfully exercise the CAS-based flush coalescing (EmbeddedChannel
  re-enters the task queue mid-write and can't be used here); verifies
  one flush per burst and FLUSH-before-CLOSE ordering.

Test deps pinned to versions still compatible with the Java 8 source
target: JUnit Jupiter 5.10.5, Mockito 4.11.0, Awaitility 4.2.2.
@robinbraemer robinbraemer merged commit d9ae896 into connect May 12, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant