srt: auto-restart listener on transient UDP errors #5712

Open
JohanG-LAS wants to merge 1 commit into bluenviron:main from JohanG-LAS:srt-listener-auto-restart

@JohanG-LAS
Contributor

Cloud providers regularly perform VM live migrations, which freeze VMs for a couple of seconds. See Azure and Google.
TCP-based transport protocols usually handle these short outages well, but UDP-based protocols can struggle.

The MediaMTX SRT server would permanently shut down whenever the underlying gosrt listener returned any non-deadline error from its ReadFrom loop (e.g. a transient read udp: network is unreachable).
gosrt marks the listener as done on such errors and Accept2() returns the error. MediaMTX then logged it and broke out of its server loop, leaving SRT unavailable until the process was restarted manually.

Fix

Treat any non-ErrListenerClosed listener failure as recoverable: close the dead listener, retry srt.Listen with bounded exponential backoff plus jitter, and re-spawn the listener goroutine on success. ErrListenerClosed (returned during graceful shutdown) and context cancellation continue to exit the loop cleanly. Existing live connections, the server goroutine, the connection map, the API, and metrics are untouched: only the accept side is recreated.
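
The recovery loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `listener` interface, `errListenerClosed`, `serve`, `backoff`, the 25% jitter ratio, and the delay constants are all hypothetical stand-ins and do not reflect the gosrt or MediaMTX APIs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Hypothetical delays, chosen small so the demo finishes quickly.
const (
	baseDelay = time.Millisecond
	maxDelay  = 10 * time.Millisecond
)

// errListenerClosed is a stand-in for the library's "closed on purpose" error.
var errListenerClosed = errors.New("listener closed")

// listener is a hypothetical minimal interface, not the gosrt API.
type listener interface {
	Accept() error // returns nil per handled connection, or a failure
	Close()
}

// backoff returns the wait before retry number attempt (0-based): base doubled
// per attempt, capped at limit, plus up to 25% random jitter.
func backoff(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d + time.Duration(rand.Int63n(int64(d)/4+1))
}

// serve accepts until the listener is closed on purpose or ctx is cancelled.
// Any other error is treated as transient: the dead listener is closed and
// listen is retried after a backoff delay.
func serve(ctx context.Context, listen func() (listener, error)) error {
	attempt := 0
	for {
		ln, err := listen()
		if err == nil {
			attempt = 0 // a successful listen resets the backoff
			for err == nil {
				err = ln.Accept()
			}
			ln.Close()
			if errors.Is(err, errListenerClosed) {
				return nil // graceful shutdown
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff(attempt, baseDelay, maxDelay)):
			attempt++
		}
	}
}

// scripted is a fake listener whose Accept returns the next queued result.
type scripted struct{ errs *[]error }

func (s scripted) Accept() error {
	e := (*s.errs)[0]
	*s.errs = (*s.errs)[1:]
	return e
}

func (s scripted) Close() {}

// demo simulates one transient fault followed by a graceful close and
// reports how many times listen was called.
func demo() (int, error) {
	errs := []error{nil, errors.New("read udp: network is unreachable"), nil, errListenerClosed}
	listens := 0
	err := serve(context.Background(), func() (listener, error) {
		listens++
		return scripted{&errs}, nil
	})
	return listens, err
}

func main() {
	n, err := demo()
	fmt.Println(n, err) // prints "2 <nil>"
}
```

In this sketch the attempt counter resets as soon as a listen call succeeds, so the delay only grows while listen itself keeps failing. A real server would also need to close the listener when the context is cancelled so that a blocked accept call returns.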

Test

A package-level srtListen indirection allows tests to substitute a fake listener; three new tests cover the restart-on-transient, no-restart-on-closed, and give-up-on-context-cancel paths.
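
As an illustration of the indirection pattern, here is a minimal sketch of a package-level function variable that a test can swap for a fake. The names `listenPacket`, `open`, `openWithFault`, and `errUnreachable` are hypothetical, and the standard library's `net.ListenPacket` stands in for `srt.Listen`; the actual signature of `srtListen` in this PR may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// listenPacket is a package-level indirection. Production code keeps the
// default; tests replace it. Name and signature are hypothetical.
var listenPacket = func(addr string) (net.PacketConn, error) {
	return net.ListenPacket("udp", addr)
}

// open goes through the indirection, so a test controls what it sees.
func open(addr string) (string, error) {
	conn, err := listenPacket(addr)
	if err != nil {
		return "", fmt.Errorf("listen %s: %w", addr, err)
	}
	defer conn.Close()
	return conn.LocalAddr().String(), nil
}

// errUnreachable simulates a transient socket fault.
var errUnreachable = errors.New("network is unreachable")

// openWithFault swaps in a failing listen function, calls open, and restores
// the original afterwards, as a test helper typically would.
func openWithFault(addr string) error {
	orig := listenPacket
	defer func() { listenPacket = orig }()
	listenPacket = func(string) (net.PacketConn, error) {
		return nil, errUnreachable
	}
	_, err := open(addr)
	return err
}

func main() {
	err := openWithFault("127.0.0.1:9000")
	fmt.Println(err, errors.Is(err, errUnreachable))
	// prints "listen 127.0.0.1:9000: network is unreachable true"
}
```

Because the variable is package-level state, tests that swap it should restore the original and should not run in parallel with each other.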

Made-with: Cursor

The MediaMTX SRT server would permanently shut down whenever the
underlying gosrt listener returned any non-deadline error from its
ReadFrom loop (e.g. a transient `read udp: network is unreachable`
caused by an Azure Accelerated Networking VF flap or any other
short-lived UDP socket fault). gosrt marks the listener as done on
such errors and Accept2() returns the error; MediaMTX then logged
it and broke out of its server loop, leaving SRT unavailable until
the process was restarted manually.

@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.23%. Comparing base (cae9920) to head (0235ab3).
⚠️ Report is 28 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5712      +/-   ##
==========================================
+ Coverage   62.08%   63.23%   +1.14%     
==========================================
  Files         214      217       +3     
  Lines       17602    18280     +678     
==========================================
+ Hits        10929    11560     +631     
- Misses       5766     5783      +17     
- Partials      907      937      +30     

