Fix #2086 — Recover TB connection after it is lost by apachler · Pull Request #2140 · thingsboard/thingsboard-gateway

apachler · 2026-06-10T13:15:39Z

Summary

Fixes #2086. A gateway that lost its connection to ThingsBoard could become
permanently stuck disconnected and never reconnect — most visibly when
ThingsBoard is unavailable for a while after the gateway starts, or after
intermittent connectivity drops.

Root cause

Reconnection was split between two mechanisms with a gap between them:

- TBClient.connect() has a retry loop, but it is called exactly once,
  synchronously, at startup (tb_gateway_service.py). After the first
  successful connect that loop exits.
- From then on, all reconnection is delegated to paho-mqtt's loop_start()
  background thread.

Nothing supervised that thread. TBClient.run() (despite TBClient being a
Thread) just idled on a stop event. So once paho's network-loop thread
stopped, the gateway never tried again. The thread can stop on:

- a clean/server-initiated disconnect (paho's loop_forever returns and clears
  its thread), or
- an unhandled exception inside the loop or a callback: paho re-raises
  exceptions thrown by on_connect/on_disconnect (suppress_exceptions
  defaults to False), which terminates the loop thread.

Additionally, connect()'s outer try/except wrapped the whole retry loop, so
a single unexpected exception during the startup down-phase would abandon the
loop and return — with nothing to re-invoke it.
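
The difference between guarding the whole retry loop and guarding each attempt can be sketched as follows. This is an illustration only, not the gateway's code: the function names, the `flaky_attempt` helper, and the `max_tries` bound (added so the sketch terminates) are all hypothetical.

```python
def connect_outer_guard(attempt, max_tries=5):
    """Pre-fix shape: one try/except around the whole retry loop."""
    tries = 0
    try:
        while tries < max_tries:
            tries += 1
            if attempt(tries):
                return True, tries
    except Exception:
        # The first unexpected error leaves the loop for good.
        pass
    return False, tries


def connect_per_iteration_guard(attempt, max_tries=5):
    """Post-fix shape: each attempt is guarded, so the loop keeps going."""
    tries = 0
    while tries < max_tries:
        tries += 1
        try:
            if attempt(tries):
                return True, tries
        except Exception:
            # A real client would log here; the loop moves on to the next try.
            continue
    return False, tries


def flaky_attempt(n):
    """Hypothetical connect attempt: errors on try 2, succeeds on try 4."""
    if n == 2:
        raise OSError("unexpected socket error")
    return n == 4


outer = connect_outer_guard(flaky_attempt)          # gives up on try 2
inner = connect_per_iteration_guard(flaky_attempt)  # survives and connects on try 4
print(outer, inner)
```

With the outer guard, the error on the second attempt ends the retry loop while the broker is still down, which matches the stuck state described above.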

Changes

- Harden _on_connect / _on_disconnect (tb_client.py): wrap their
  bodies, and each service-subscription callback, in try/except so a callback
  exception can never propagate back into paho and kill the network-loop thread.
- Add a reconnection supervisor in TBClient.run(): once the client has
  connected at least once, re-drive connect() whenever paho's loop thread is
  dead and the client is not stopped/paused/already connecting. paho keeps
  owning reconnection while its loop is alive, so the supervisor never fights
  paho's own backoff, and it never races the blocking startup connect or the
  remote-configurator's connect management.
- Make connect() robust: serialize attempts with a per-client lock; catch
  per-iteration errors so an exception can never break out of the retry loop and
  silently return; and keep looping until the client is both connected
  and the network loop is alive (covers a loop thread that died leaving the
  client's internal connected flag stale).
- Fix a units bug in RemoteConfigurator._apply_connection_config:
  apply_start was reset in milliseconds (time() * 1000) while compared
  against time() in seconds, which broke the 30s retry budget and stopped the
  new/old-configuration fallback from ever toggling.
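
As a rough picture of the supervisor idea, here is a toy model that uses only the standard library. It is not the TBClient implementation: `SupervisedClient`, its attributes, and the short-lived `_network_loop` stand-in are assumptions made for the sketch, and no MQTT traffic is involved.

```python
import threading
import time


class SupervisedClient(threading.Thread):
    """Toy client whose 'network loop' runs in a worker thread that can die."""

    def __init__(self, poll_interval=0.01):
        super().__init__(daemon=True)
        self._stop_event = threading.Event()
        self._poll_interval = poll_interval
        self._loop_thread = None
        self.connected_once = False
        self.paused = False
        self.connecting = False
        self.reconnects = 0

    def _network_loop(self, lifetime):
        # Stand-in for the MQTT network loop: it simply ends after `lifetime`.
        time.sleep(lifetime)

    def connect(self, lifetime=0.02):
        self.connecting = True
        try:
            self._loop_thread = threading.Thread(
                target=self._network_loop, args=(lifetime,), daemon=True)
            self._loop_thread.start()
            self.connected_once = True
        finally:
            self.connecting = False

    def _loop_alive(self):
        return self._loop_thread is not None and self._loop_thread.is_alive()

    def _should_reconnect(self):
        # Act only after a first successful connect, and only if the loop is dead.
        return (self.connected_once and not self.paused
                and not self.connecting and not self._loop_alive())

    def run(self):
        # Supervisor: wake periodically until stop() is called.
        while not self._stop_event.wait(self._poll_interval):
            if self._should_reconnect():
                self.reconnects += 1
                self.connect()

    def stop(self):
        self._stop_event.set()


client = SupervisedClient()
client.connect()   # initial startup connect
client.start()     # supervisor starts watching the loop thread
time.sleep(0.2)    # the toy loop dies every 0.02 s and gets restarted
client.stop()
client.join(timeout=1)
print("reconnects:", client.reconnects)
```

The supervisor does nothing while the loop thread is alive, which is how the real change leaves paho in charge of its own reconnect backoff.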

Behavior after the fix

- TB unavailable at startup → keeps retrying (unchanged), and a transient error
  can no longer abandon the retry loop.
- Connection lost after being established → the supervisor restarts paho's loop
  and reconnects, instead of the gateway sitting disconnected forever.
- A throwing MQTT callback degrades gracefully (logged) instead of killing
  reconnection.
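
The "degrades gracefully" behavior for callbacks can be illustrated with a small guard. The PR wraps the callback bodies directly in try/except; the decorator below is a substitute technique chosen for brevity, and `guarded_callback` and `fake_network_loop` are hypothetical names that stand in for the real client and paho's loop.

```python
import functools
import logging

log = logging.getLogger("tb_client_sketch")


def guarded_callback(func):
    """Log and swallow any exception so it cannot reach the calling loop."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            log.exception("Error in MQTT callback %s", func.__name__)
            return None
    return wrapper


calls = []


@guarded_callback
def on_connect(client, userdata, flags, rc):
    calls.append(rc)
    raise RuntimeError("subscriber blew up")


def fake_network_loop(callback, events):
    """Stand-in for the network loop: it would stop at the first raised error."""
    handled = 0
    for rc in events:
        callback(None, None, {}, rc)
        handled += 1
    return handled


handled = fake_network_loop(on_connect, [0, 0, 0])
print("events handled:", handled)
```

Every event is still delivered even though the callback raises each time; without the guard the first exception would end the loop.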

Reproduction / Verification

Repro (on the unfixed code):

1. Start the gateway with ThingsBoard reachable and let it connect.
2. Make ThingsBoard unreachable for a while, either by stopping the broker or
   by blocking the MQTT port:
   sudo iptables -A OUTPUT -p tcp --dport 1883 -j DROP (use 8883 for TLS).
3. Wait past the MQTT keepalive so the gateway notices the drop.
4. Restore connectivity
   (sudo iptables -D OUTPUT -p tcp --dport 1883 -j DROP / restart the broker).

Observed (before): the gateway logs the disconnect but never reconnects —
no further "Connecting to ThingsBoard…" attempts, and telemetry never resumes
even after TB is back. A startup variant reproduces the same dead state: start
the gateway while TB is down, leave it down for a while, then bring it up.

Expected (after): the supervisor detects the dead network loop and re-drives
connect(); the gateway reconnects once TB is reachable again and telemetry
resumes. In the logs you'll see the supervisor kick in:
MQTT network loop is not running while connection is expected - restarting connection to ThingsBoard.
followed by the usual connect/CONNACK lines.

Testing

- Both changed files byte-compile.
- Standalone logic test of the supervisor decision (__should_supervise_reconnect)
  and the new connect() loop-exit condition across all states (dead/alive loop
  thread, never-connected, paused, stopped, in-progress, stale-connected-flag):
  all pass.
- Manual port-block/restore cycle confirms reconnection.
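
A standalone check of this kind might look roughly like the sketch below. It is a reconstruction from the description above, not the test that was actually run: the `ClientState` fields and both predicates are assumed shapes of the decision and the loop-exit condition.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ClientState:
    connected_once: bool   # has the client ever connected?
    loop_alive: bool       # is the network-loop thread running?
    stopped: bool
    paused: bool
    connecting: bool       # is a connect() call already in progress?
    connected_flag: bool   # the client's internal "connected" flag


def should_supervise_reconnect(s):
    """Assumed supervisor rule: act only on a dead loop after a first connect."""
    return (s.connected_once and not s.loop_alive
            and not s.stopped and not s.paused and not s.connecting)


def connect_loop_done(s):
    """Assumed exit rule: connected AND the network loop is alive."""
    return s.connected_flag and s.loop_alive


# Enumerate every combination of the six flags (64 states).
all_states = [ClientState(*bits) for bits in product([False, True], repeat=6)]
triggering = [s for s in all_states if should_supervise_reconnect(s)]

# The stale-flag case: the flag says "connected" but the loop thread is dead.
stale = ClientState(connected_once=True, loop_alive=False, stopped=False,
                    paused=False, connecting=False, connected_flag=True)

print(len(triggering), should_supervise_reconnect(stale), connect_loop_done(stale))
```

Under these assumed rules only two of the 64 states trigger the supervisor, and the stale-flag state is one of them while also failing the loop-exit condition, so connect() would keep retrying.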

Notes

No changes to remote-configuration behavior beyond the timing-bug fix above; the
tb_gateway_remote_configurator.py edit is purely the seconds/milliseconds
correction.
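
To see why the units mismatch matters, consider this small sketch. The exact comparison used in _apply_connection_config is not shown in this PR text, so the `now - apply_start > budget` form, the `retry_budget_expired` helper, and the fixed timestamps are assumptions.

```python
def retry_budget_expired(apply_start, now, budget_s=30):
    """Assumed form of the check: has more than `budget_s` seconds elapsed?"""
    return now - apply_start > budget_s


now = 1_750_000_000.0          # a fixed "time()" value in seconds
later = now + 45               # 45 seconds later, past the 30 s budget

buggy_start = now * 1000       # reset in milliseconds (the bug)
fixed_start = now              # reset in seconds (the fix)

buggy = retry_budget_expired(buggy_start, later)   # elapsed time is hugely negative
fixed = retry_budget_expired(fixed_start, later)   # 45 s > 30 s

print(buggy, fixed)
```

With the millisecond start value the elapsed time never exceeds the budget, which is consistent with the fallback never toggling.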

Closes #2086

A gateway that lost its connection to ThingsBoard could get permanently stuck disconnected and never reconnect. Reconnection was split between two mechanisms with a gap: TBClient.connect()'s retry loop ran only once at startup, and after the first successful connect all reconnection was delegated to paho's network-loop thread. Nothing supervised that thread, so once it stopped (clean server disconnect, or an unhandled error in the loop / a callback) the gateway never tried again; TBClient.run() just idled.

- Harden _on_connect/_on_disconnect: wrap their bodies (and each service subscription callback) in try/except. paho invokes these from its network-loop thread and re-raises any exception that escapes them (suppress_exceptions defaults to False), which would terminate the loop thread and stop reconnection for good.
- Add a reconnection supervisor in TBClient.run(): once the client has connected at least once, re-drive connect() whenever paho's loop thread is dead and we are not stopped/paused/already connecting. paho keeps owning reconnection while its loop is alive, so we never fight its backoff.
- Make connect() robust: serialize attempts with a per-client lock, catch per-iteration errors so an exception can never break out of the retry loop and silently return, and keep looping until we are both connected AND the network loop is alive (covers a loop thread that died leaving the client's internal connected flag stale).
- Fix a units bug in RemoteConfigurator._apply_connection_config: apply_start was reset in milliseconds while compared against time() in seconds, which broke the 30s retry budget and stopped the new/old configuration fallback from ever toggling.

github-actions Bot added bug core labels Jun 10, 2026

apachler wants to merge 1 commit into thingsboard:master from apachler:bugfix/stuck-tb-connection
