Skip to content

connmgr: Implement exponential backoff with jitter.#3700

Open
davecgh wants to merge 31 commits into
decred:masterfrom
davecgh:connmgr_jitter
Open

connmgr: Implement exponential backoff with jitter.#3700
davecgh wants to merge 31 commits into
decred:masterfrom
davecgh:connmgr_jitter

Conversation

@davecgh
Copy link
Copy Markdown
Member

@davecgh davecgh commented May 24, 2026

This requires #3699.

The current code uses a simple deterministic linear backoff with a maximum capacity. While it works well enough, it has a big downside in that all of the attempts will tend to coalesce over time. This is especially true during a network outage since all connections will drop at the same time.

Also, exponential backoffs are preferred over linear to reduce the retry frequency more quickly and give the remote more time to come back online.

This addresses both of those points by changing the linear backoff for persistent peers to use an exponential backoff with jitter instead.

Due to the introduction of nondeterministic behavior, and considering there are various other aspects of the connection manager that would benefit from making use of randomness, this also includes a separate commit to introduce full infrastructure and support for deterministic reproducibility when running the tests.

See the individual commit descriptions for more details.

davecgh added 30 commits May 21, 2026 23:47
This adds a new context-aware semaphore type with Acquire and Release
methods for use in upcoming changes that aim to simplify connection
limiting by making use of semaphores for blocking until permits become
available.
This adds tests for the new context-aware semaphore to ensure the
acquire, release, and context cancel semantics work as expected.
The existing connection manager code was written well before contexts
were introduced.  Further, due to the old async model that has now been
converted to a synchronous model, it is based around connection requests
that have their state atomically updated asynchronously as various
things happen.

While it has undoubtedly worked well enough for over a decade, it has
always been a challenge to add new functionality to it and requires the
use of a lot of less than ideal and highly outdated techniques such as
polling for state changes.  It is also rather brittle in terms of
requiring output connections to be manually disconnected in the
connection manager after they've been closed to avoid things like
leaking goroutines and failing to update target outbound counts.

Moreover, it only tracks outgoing connections which ultimately forces a
lot of connection-related tasks to be split across different layers
instead of residing in the connection manager itself where they more
naturally belong.  Notably, that split, for all intents and purposes,
prevents implementing some desirable more advanced features such as
immediate connection shedding, different connection types, and listeners
tied to specific network types.

With the primary goal of addressing all of the aforementioned points and
providing a solid base to work on for adding new features moving
forward, this significantly reworks the connection manager to completely
get rid of the notion of exposed connection requests in favor of a new
custom connection type that wraps the underlying net.Conn.

The new wrapped connections automatically handle cleanup when closed and
have an associated connection type enum that allows easily
distinguishing inbound, outbound, and manual connections as well as
supporting new connection types in the future.

Another nice feature of the new wrapped connections is they provide
efficient access to concrete parsed address types which paves the way
for avoiding a lot of constant reparsing, repeated host/port splitting
and joining, and generally much more ergonomic immutable address types.

Since changing to wrapped connections basically required a rather
significant rewrite of large portions of the connection manager anyway,
this also takes the opportunity to improve several other aspects of the
connection manager in the process such as implementing full context
support, full tracking of all connection types by the manager itself,
much more robust semaphore-based automatic connection limiting, cleaner
persistent connection handling with independent limits, prevention of
multiple connections of any type to the same address:port, more useful
debug logging, and cleanly closing all connections during shutdown.

It is also important to note that the following overall semantics have
intentionally been changed versus the existing connection manager:

- A maximum of 8 persistent connections is now imposed and they no
  longer count toward the configured target number of automatic outbound
  peers to maintain
- Duplicate addresses (host:port) are now rejected by the connection
  manager for all types (inbound, outbound, manual, persistent)
  - Note that inbound conns from the same IP will necessarily have
    different ports, so the same max IP limits apply in that case
- RPC 'node connect' for all connection attempts now:
  - Supports the RPC connection and server contexts
  - Properly handles duplicate address rejection including pending
    attempts
- RPC 'node connect' for non-persistent conn attempts now:
  - Waits for the connection attempt result before returning
  - Returns an error if the connection attempt fails
  - Cancels the connection attempt if the RPC connection is closed
    before it succeeds
- RPC 'node remove' now supports removing a pending connection by its
  persistent connection ID (since no peer ID exists before a valid
  connection is established)
- It is no longer possible for state transitions to allow things like
  duplicate addresses or failed cancellation
The max retry duration is currently an unexported global variable that
the tests override at init time.  At least one of the tests also
additionally overrides it for that specified test too.  While this
works, it is somewhat brittle and prevents the tests from being run in
parallel.

This improves the situation by making the max retry duration a field on
the connection manager instead of a global variable and adding a test
helper for creating a new connection manager that overrides it by
default.  Then any tests that need a different value can simply override
it on their local instance.

It also makes the tests parallel since they can no longer clobber one
another.
This updates the test for checking the connection manager cleanly shuts
down with failed conns to actualy test what it is intended to. Manual
connections do not automatically retry, only persistent connections.
This adds tests to ensure closing a connection multiple times works as
intended.
This adds tests to ensure duplication connections are rejected for all
possible states.
This adds tests to ensure attempts to add more than the maximum allowed
number of persistent are rejected.
This adds tests to ensure the Disconnect method properly disconnects
pending and established connections for both non-persistent and
persistent connections.
This adds tests to ensure the Remove method properly disconnects and
removes pending and established connections for both non-persistent and
persistent connections.
This updates the connmgr package README.md to match the new design and
capabilities.
This adds a couple of test helpers for asserting the internal state of
the connection manager updates all tests to call the new helpers
throughout.

The first one asserts the internal maps are all coherent and do not
violate any preconditions.  The second one asserts clean shutdown.
Currently the whitelisting logic happens in the server which makes it
inaccessible to the connection manager.

In order to pave the way for supporting various connection-related logic
that currently happens in the server, but ideally should be happening in
the connection manager, this adds basic support for whitelisting CIDR
prefixes to the connection manager.

The connection manager config struct now accepts a slice of prefixes and
a new method named IsWhitelisted is added.

Note that this only adds support .  It does not update anything to use
the new functionality yet.
This adds tests to ensure the new whitelist detection method works as
expected.
This modifies the server to pass in the parsed whitelist entries to
the connection manager config and the relevant code to make use of the
new method it exposes.

Finally, it removes the no longer used local isWhitelisted method.
This adds a new TryAcquire method to the context-aware semaphore.  As
the name implies, the method supports conditionally acquiring the
semaphore only when resources are immediately available.  In other
words, it will not block when there are no resources immediately
available.
This adds tests for the new TryAcquire method on the context-aware
semaphore to ensure the semantics work as expected.
The current overall total connection limits are enforced by the server
rather than the connection manager.  This is not ideal for many reasons,
but one of the most important consequences is that it makes DoS attacks
easier.  Another example of some less than ideal behavior that it allows
is that some rare combinations of events can lead to temporary extra
connection churn.

It is much more robust and natural to perform the limiting in the
connection manager itself via semaphores.  That approach not only
significantly hardens the server against DoS attacks and solves various
edge cases present in the current code, it also paves the way for even
more advanced features such as traffic shaping in the future.

To that end, this adds semaphore-based limiting for the total overall
number of normal connections to the connection manager and removes the
relevant current limiting for it from the server.

Normal connections are the automatic outbound, manual outbound, and
inbound connections.  Persistent connections, on the other hand, are not
subject to the limit since they have their own limiting.  This is
consistent with them not being subject to the automatic target outbound
limit either.
This adds tests to ensure that the new max normal connection limiting
properly enforces the limit including automatic outbound, manual
outbound, and inbound connections.

It also ensures that it not applied to persistent connections.
Similar to the recent total normal connection limiting, the current
per-host connection limits are enforced by the server.  For similar
reasons, it is much more robust and natural to perform the limiting
early in the connection manager.

With that in mind, this implements the per-host connection limiting in
the connection manager and removes the relevant current limiting for it
from the server.  The limiting is applied to inbound, outbound, and
persistent connections.

The new limiting is handled early in both the inbound and outbound paths
now which allows it to take advantage of fast connection shedding for
inbound connections and to preemptively prevent all outbound attempts
that would exceed the limit regardless of source.  It also provides the
flexibility to apply independent special permissions in the future.

This also slightly changes the semantics to exempt whitelisted addresses
for both inbound and outbound connections as opposed to only inbound
connections.
This adds tests to ensure that the new max connections per host limiting
properly enforces the limit including automatic outbound, manual
outbound, inbound, and persistent connections.  It also tests
whitelisted addresses are exempt.
This moves the code related to mock addresses, connections, and
listeners to a separate file.  This helps keep it from cluttering up the
main test code and also makes it easier to reuse in other packages.
Most of the tests in this package were written well before some of the
more modern test conveniences like t.Cleanup were added.

As a result, almost all of the tests repeat the code related to waiting
for clean shutdown and asserting clean shutdown.

This updates the tests so that the main method used to run the
connection manager in all of the tests now registers a cleanup func via
t.Cleanup to wait for clean shutdown and assert the internal state is
clean as expected.

This approach is much less error prone and convenient.
This makes the name of the connection manager instances in the test code
match the naming used in the implmentation code.
This converts the max retry duration test over to use the synctest which
makes it more robust and no longer reliant on real time.
There are various aspects of the connection manager that would benefit
from making use of randomness.  For example, adding random jitter to
connection backoffs.

However, by default, randomness often makes reproducing test failures
extremely difficult since the next run will not not necessarily be
following the same code branches due to different random values.

In order to provide deterministic reproducibility, this adds a csprng
interface that makes use of the dcrd crypto/rand module by default.  All
code that sources randomness moving forward is expected to use the
interface.

It also adds test infrastructure to automatically generate a new seed
on each test iteration that is logged in the event of failure along with
the ability to specify the seed via the -seed parameter.

When running tests, the seed is then used to create a deterministic
math/v2/chacha8 instance to implement the csprng interface.
The current code uses a simple deterministic linear backoff with a
maximum capacity.  While it works well enough, it has a big downside in
that all of the attempts will tend to coalesce over time.  This is
especially true during a network outage since all connections will drop
at the same time.

Also, exponential backoffs are preferred over linear to reduce the retry
frequency more quickly and give the remote more time to come back
online.

This addresses both of those points by changing the linear backoff for
persistent peers to use an exponential backoff with jitter instead.
@davecgh davecgh added this to the 2.2.0 milestone May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant