Fix #2086 - Recover TB connection after it is lost #2140
Open
apachler wants to merge 1 commit into
A gateway that lost its connection to ThingsBoard could get permanently stuck disconnected and never reconnect. Reconnection was split between two mechanisms with a gap: TBClient.connect()'s retry loop ran only once at startup, and after the first successful connect all reconnection was delegated to paho's network-loop thread. Nothing supervised that thread, so once it stopped (clean server disconnect, or an unhandled error in the loop / a callback) the gateway never tried again; TBClient.run() just idled.

- Harden _on_connect/_on_disconnect: wrap their bodies (and each service subscription callback) in try/except. paho invokes these from its network-loop thread and re-raises any exception that escapes them (suppress_exceptions defaults to False), which would terminate the loop thread and stop reconnection for good.
- Add a reconnection supervisor in TBClient.run(): once the client has connected at least once, re-drive connect() whenever paho's loop thread is dead and we are not stopped/paused/already connecting. paho keeps owning reconnection while its loop is alive, so we never fight its backoff.
- Make connect() robust: serialize attempts with a per-client lock, catch per-iteration errors so an exception can never break out of the retry loop and silently return, and keep looping until we are both connected AND the network loop is alive (covers a loop thread that died leaving the client's internal connected flag stale).
- Fix a units bug in RemoteConfigurator._apply_connection_config: apply_start was reset in milliseconds while compared against time() in seconds, which broke the 30s retry budget and stopped the new/old configuration fallback from ever toggling.
Summary
Fixes #2086. A gateway that lost its connection to ThingsBoard could become
permanently stuck disconnected and never reconnect — most visibly when
ThingsBoard is unavailable for a while after the gateway starts, or after
intermittent connectivity drops.
Root cause
Reconnection was split between two mechanisms with a gap between them:
- TBClient.connect() has a retry loop, but it's called exactly once, synchronously, at startup (tb_gateway_service.py). After the first successful connect that loop exits.
- After that, all reconnection is delegated to paho's loop_start() background thread.

Nothing supervised that thread. TBClient.run() (despite TBClient being a Thread) just idled on a stop event. So once paho's network-loop thread stopped, the gateway never tried again. The thread can stop on:
- a clean server disconnect (loop_forever returns and clears its thread), or
- exceptions thrown by on_connect/on_disconnect, which paho re-raises (suppress_exceptions defaults to False), which terminates the loop thread.

Additionally, connect()'s outer try/except wrapped the whole retry loop, so
a single unexpected exception during the startup down-phase would abandon the
loop and return, with nothing to re-invoke it.
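The difference between an outer try/except and a per-iteration one can be sketched roughly as follows. This is an illustrative sketch only, not the gateway's code: connect_fragile, connect_robust, make_flaky_attempt and the max_tries bound are hypothetical names introduced here so the example terminates, whereas the real loop keeps retrying until the client is stopped.

```python
import time


def connect_fragile(attempt, max_tries=5, delay=0.0):
    """Outer try/except: the first unexpected error ends the whole loop."""
    tries = 0
    try:
        while tries < max_tries:
            tries += 1
            if attempt():
                return True, tries
            time.sleep(delay)
    except Exception:
        pass  # the loop is abandoned and nothing re-invokes it
    return False, tries


def connect_robust(attempt, max_tries=5, delay=0.0):
    """Per-iteration try/except: an error only costs one attempt."""
    tries = 0
    while tries < max_tries:
        tries += 1
        try:
            if attempt():
                return True, tries
        except Exception:
            pass  # a real client would log here, then try again
        time.sleep(delay)
    return False, tries


def make_flaky_attempt():
    """Hypothetical attempt: raises once, fails once, then succeeds."""
    outcomes = iter([OSError("network unreachable"), False, True])

    def attempt():
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return attempt


fragile_result = connect_fragile(make_flaky_attempt())
robust_result = connect_robust(make_flaky_attempt())
```

With the same flaky sequence, the fragile variant gives up on the first exception while the robust variant reaches the successful third attempt.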
Changes
- Harden _on_connect/_on_disconnect (tb_client.py): wrap their bodies, and each service-subscription callback, in try/except so a callback exception can never propagate back into paho and kill the network-loop thread.
- Add a reconnection supervisor in TBClient.run(): once the client has connected at least once, re-drive connect() whenever paho's loop thread is dead and the client is not stopped/paused/already connecting. paho keeps owning reconnection while its loop is alive, so the supervisor never fights paho's own backoff, and it never races the blocking startup connect or the remote-configurator's connect management.
- Make connect() robust: serialize attempts with a per-client lock; catch per-iteration errors so an exception can never break out of the retry loop and silently return; and keep looping until the client is both connected and the network loop is alive (covers a loop thread that died leaving the client's internal connected flag stale).
- Fix a units bug in RemoteConfigurator._apply_connection_config: apply_start was reset in milliseconds (time() * 1000) while compared against time() in seconds, which broke the 30s retry budget and stopped the new/old-configuration fallback from ever toggling.
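To illustrate why the milliseconds/seconds mix-up prevents the 30s budget from ever expiring, here is a small sketch. It assumes the budget check has the shape "now minus apply_start exceeds 30"; the function budget_expired, the constant RETRY_BUDGET_S and the fixed timestamp are hypothetical and only stand in for whatever the configurator actually does.

```python
RETRY_BUDGET_S = 30  # seconds, matching the 30s retry budget described above


def budget_expired(apply_start, now):
    """Assumed shape of the check: has the retry budget elapsed?"""
    return now - apply_start > RETRY_BUDGET_S


# Fixed, made-up epoch timestamp (seconds) so the sketch is deterministic.
now_s = 1_700_000_000.0

apply_start_buggy = now_s * 1000  # reset in milliseconds (the bug)
apply_start_fixed = now_s         # reset in seconds (the fix)

later_s = now_s + 31  # 31 seconds after the reset

expired_fixed = budget_expired(apply_start_fixed, later_s)
# With a millisecond start, the difference is hugely negative,
# so the budget never appears to be spent.
expired_buggy = budget_expired(apply_start_buggy, later_s)
```

Under these assumptions the millisecond-based start time lies far in the "future" relative to time() in seconds, which matches the described symptom of the fallback never toggling.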
Behavior after the fix
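- An unexpected exception during the startup down-phase can no longer abandon the retry loop.
- If paho's network-loop thread dies, the supervisor detects it and reconnects, instead of the gateway sitting disconnected forever.
- While paho's network loop is alive, paho keeps owning reconnection.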
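The callback hardening listed under Changes wraps callback bodies directly in try/except. A decorator is one alternative way to express the same idea, sketched below; guarded, on_disconnect_sketch and the logger name are hypothetical and are not part of the gateway or of paho.

```python
import logging
from functools import wraps

log = logging.getLogger("tb_connection_sketch")  # hypothetical logger name


def guarded(callback):
    """Log and swallow any exception raised by `callback`.

    Keeps an error in a callback from propagating into the caller,
    which for paho would be the network-loop thread.
    """
    @wraps(callback)
    def wrapper(*args, **kwargs):
        try:
            return callback(*args, **kwargs)
        except Exception:
            log.exception("Error in %s", callback.__name__)
            return None
    return wrapper


@guarded
def on_disconnect_sketch(client, userdata, rc):
    # Simulates a bug in a callback or in a service subscription callback.
    raise RuntimeError("bug in a service subscription callback")


# The exception is logged, not raised, so the caller keeps running.
result = on_disconnect_sketch(None, None, 0)
```

The trade-off is that a swallowed exception is only visible in the logs, so the log call matters as much as the except clause.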
Reproduction / Verification
Repro (on the unfixed code):
- Block the MQTT port: sudo iptables -A OUTPUT -p tcp --dport 1883 -j DROP (use 8883 for TLS).
- Restore connectivity (sudo iptables -D OUTPUT -p tcp --dport 1883 -j DROP / restart the broker).

Observed (before): the gateway logs the disconnect but never reconnects:
no further "Connecting to ThingsBoard…" attempts, and telemetry never resumes
even after TB is back. A startup variant reproduces the same dead state: start
the gateway while TB is down, leave it down for a while, then bring it up.

Expected (after): the supervisor detects the dead network loop and re-drives
connect(); the gateway reconnects once TB is reachable again and telemetry resumes. In the logs you'll see the supervisor kick in:
MQTT network loop is not running while connection is expected - restarting connection to ThingsBoard.
followed by the usual connect/CONNACK lines.
Testing
Exercised the supervisor predicate (__should_supervise_reconnect) and the new
connect() loop-exit condition across all states (dead/alive loop thread, never-connected, paused, stopped, in-progress, stale-connected-flag);
all pass.
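The states listed above can be sketched as a pure predicate over a small state record. This is a guess at the shape of the logic based on the description in this PR, not the actual __should_supervise_reconnect implementation; ClientState and all of its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    # All fields are hypothetical stand-ins for TBClient's internal state.
    ever_connected: bool = True
    loop_alive: bool = False
    stopped: bool = False
    paused: bool = False
    connecting: bool = False
    connected_flag: bool = False


def should_supervise_reconnect(s):
    """True when the supervisor should re-drive connect()."""
    return (
        s.ever_connected
        and not s.loop_alive
        and not (s.stopped or s.paused or s.connecting)
    )


def connect_loop_done(s):
    """The retry loop may exit only when connected AND the loop is alive."""
    return s.connected_flag and s.loop_alive


# Each case pairs a state with the expected supervisor decision.
cases = {
    "dead loop": (ClientState(), True),
    "alive loop": (ClientState(loop_alive=True), False),
    "never connected": (ClientState(ever_connected=False), False),
    "paused": (ClientState(paused=True), False),
    "stopped": (ClientState(stopped=True), False),
    "in progress": (ClientState(connecting=True), False),
    "stale connected flag": (ClientState(connected_flag=True), True),
}

results = {
    name: should_supervise_reconnect(state)
    for name, (state, _expected) in cases.items()
}
```

Keeping the decision free of side effects makes each state cheap to check in isolation, without a broker or a running thread.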
Notes
No changes to remote-configuration behavior beyond the timing-bug fix above; the
tb_gateway_remote_configurator.py edit is purely the seconds/milliseconds correction.
Closes #2086