Fix #2086 — Recover TB connection after it is lost#2140

Open
apachler wants to merge 1 commit into thingsboard:master from apachler:bugfix/stuck-tb-connection

apachler (Contributor) commented Jun 10, 2026

Summary

Fixes #2086. A gateway that lost its connection to ThingsBoard could become
permanently stuck disconnected and never reconnect — most visibly when
ThingsBoard is unavailable for a while after the gateway starts, or after
intermittent connectivity drops.

Root cause

Reconnection was split between two mechanisms with a gap between them:

  • TBClient.connect() has a retry loop, but it's called exactly once,
    synchronously, at startup (tb_gateway_service.py). After the first
    successful connect that loop exits.
  • From then on, all reconnection is delegated to paho-mqtt's loop_start()
    background thread.

Nothing supervised that thread. TBClient.run() (despite TBClient being a
Thread) just idled on a stop event. So once paho's network-loop thread
stopped, the gateway never tried again. The thread can stop on:

  • a clean/server-initiated disconnect (paho's loop_forever returns and clears
    its thread), or
  • an unhandled exception inside the loop or a callback — paho re-raises
    exceptions thrown by on_connect/on_disconnect (suppress_exceptions
    defaults to False), which terminates the loop thread.

Additionally, connect()'s outer try/except wrapped the whole retry loop, so
a single unexpected exception while ThingsBoard was still down at startup would
abandon the loop and return, with nothing to re-invoke it.
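
The failure mode described above can be reproduced without paho at all. The following is a minimal sketch using only the standard library: `network_loop` and `fragile_callback` are hypothetical stand-ins for paho's loop thread and an MQTT callback, not the gateway's or paho's actual code. It shows that an exception escaping a callback ends the thread that dispatched it, and that nothing restarts the thread afterwards.

```python
import threading

handled = []   # events the callback was asked to handle
uncaught = []  # exception types that escaped the loop thread


def network_loop(callback, events):
    """Stand-in for a network-loop thread: dispatches events to a callback."""
    for event in events:
        callback(event)  # an exception here propagates out and ends the thread


def fragile_callback(event):
    """Stand-in for an MQTT callback with a bug on the disconnect path."""
    handled.append(event)
    if event == "disconnect":
        raise RuntimeError("bug in callback")


# Record the uncaught exception instead of printing a traceback.
# Note: this replaces the process-wide hook, which is fine for a demo only.
threading.excepthook = lambda args: uncaught.append(args.exc_type)

loop_thread = threading.Thread(
    target=network_loop,
    args=(fragile_callback, ["connect", "disconnect", "connect"]),
    daemon=True,
)
loop_thread.start()
loop_thread.join(timeout=5)

# The thread is gone and the final "connect" event was never dispatched.
loop_is_dead = not loop_thread.is_alive()
```

After the run, `loop_is_dead` is true and the last event is never seen by the callback, which mirrors the "never tried again" symptom: without a supervisor, a dead loop thread stays dead.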

Changes

  • Harden _on_connect / _on_disconnect (tb_client.py): wrap their
    bodies, and each service-subscription callback, in try/except so a callback
    exception can never propagate back into paho and kill the network-loop thread.
  • Add a reconnection supervisor in TBClient.run(): once the client has
    connected at least once, re-drive connect() whenever paho's loop thread is
    dead and the client is not stopped/paused/already connecting. paho keeps
    owning reconnection while its loop is alive, so the supervisor never fights
    paho's own backoff, and it never races the blocking startup connect or the
    remote-configurator's connect management.
  • Make connect() robust: serialize attempts with a per-client lock; catch
    per-iteration errors so an exception can never break out of the retry loop and
    silently return; and keep looping until the client is both connected
    and the network loop is alive (covers a loop thread that died leaving the
    client's internal connected flag stale).
  • Fix a units bug in RemoteConfigurator._apply_connection_config:
    apply_start was reset in milliseconds (time() * 1000) while compared
    against time() in seconds, which broke the 30s retry budget and stopped the
    new/old-configuration fallback from ever toggling.
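
As an illustration of the connect() changes listed above, here is a minimal sketch of a retry loop with a per-client lock, per-iteration error handling, and a "connected and loop alive" exit condition. `SketchClient`, `attempt`, and `loop_alive` are hypothetical names introduced for this example; the real TBClient may structure this differently (for instance, whether a second caller blocks on the lock or returns immediately is an assumption here).

```python
import threading
import time


class SketchClient:
    """Hypothetical stand-in for a client; not the gateway's TBClient."""

    def __init__(self, attempt, loop_alive, retry_delay=0.0):
        self._attempt = attempt        # callable: returns True once connected, may raise
        self._loop_alive = loop_alive  # callable: True while the network loop runs
        self._retry_delay = retry_delay
        self._connect_lock = threading.Lock()
        self.stopped = threading.Event()
        self.connected = False
        self.errors = []

    def connect(self):
        # Serialize attempts: a second caller returns instead of racing the first.
        if not self._connect_lock.acquire(blocking=False):
            return False
        try:
            while not self.stopped.is_set():
                try:
                    self.connected = self._attempt()
                except Exception as exc:  # per-iteration catch keeps the loop going
                    self.errors.append(exc)
                    self.connected = False
                # Exit only when connected AND the network loop is alive,
                # so a stale "connected" flag cannot end the loop early.
                if self.connected and self._loop_alive():
                    return True
                time.sleep(self._retry_delay)
            return False
        finally:
            self._connect_lock.release()


# Usage: the first attempt raises, the second fails, the third succeeds.
outcomes = iter([OSError("connection refused"), False, True])


def attempt():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


client = SketchClient(attempt, loop_alive=lambda: True)
result = client.connect()
```

The exception on the first attempt is recorded in `client.errors` rather than ending the loop, which is the property the old outer try/except lacked.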

Behavior after the fix

  • TB unavailable at startup → keeps retrying (unchanged), and a transient error
    can no longer abandon the retry loop.
  • Connection lost after being established → the supervisor restarts paho's loop
    and reconnects, instead of the gateway sitting disconnected forever.
  • A throwing MQTT callback degrades gracefully (logged) instead of killing
    reconnection.
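
The "degrades gracefully" behavior for callbacks can be sketched as follows. The PR wraps the callback bodies in try/except directly; the decorator form below is an alternative chosen for brevity, and `never_raise` is a hypothetical helper name, not something from the gateway or paho.

```python
import functools
import logging

log = logging.getLogger("callback_sketch")


def never_raise(callback):
    """Wrap a callback so exceptions are logged instead of propagating."""
    @functools.wraps(callback)
    def wrapper(*args, **kwargs):
        try:
            return callback(*args, **kwargs)
        except Exception:
            # Log with traceback; the caller (the network loop) keeps running.
            log.exception("Error in MQTT callback %s", callback.__name__)
            return None
    return wrapper


@never_raise
def on_disconnect(client, userdata, rc):
    # Simulated bug inside the callback body.
    raise ValueError("unexpected result code")


# The call returns normally even though the body raised.
result = on_disconnect(None, None, 7)
```

Catching `Exception` rather than `BaseException` is deliberate in this sketch, so that `KeyboardInterrupt` and `SystemExit` still propagate.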

Reproduction / Verification

Repro (on the unfixed code):

  1. Start the gateway with ThingsBoard reachable and let it connect.
  2. Make ThingsBoard unreachable for a while — stop the broker, or block the
    MQTT port:
    sudo iptables -A OUTPUT -p tcp --dport 1883 -j DROP (use 8883 for TLS).
  3. Wait past the MQTT keepalive so the gateway notices the drop.
  4. Restore connectivity
    (sudo iptables -D OUTPUT -p tcp --dport 1883 -j DROP / restart the broker).

Observed (before): the gateway logs the disconnect but never reconnects —
no further "Connecting to ThingsBoard…" attempts, and telemetry never resumes
even after TB is back. A startup variant reproduces the same dead state: start
the gateway while TB is down, leave it down for a while, then bring it up.

Expected (after): the supervisor detects the dead network loop and re-drives
connect(); the gateway reconnects once TB is reachable again and telemetry
resumes. In the logs you'll see the supervisor kick in:
MQTT network loop is not running while connection is expected - restarting connection to ThingsBoard.
followed by the usual connect/CONNACK lines.

Testing

  • Both changed files byte-compile.
  • Standalone logic test of the supervisor decision (__should_supervise_reconnect)
    and the new connect() loop-exit condition across all states (dead/alive loop
    thread, never-connected, paused, stopped, in-progress, stale-connected-flag) —
    all pass.
  • Manual port-block/restore cycle confirms reconnection.
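
A standalone check of the supervisor decision might look like the sketch below. The function name echoes the one mentioned above, but its signature and the keyword names are assumptions made for this example; only the rule itself (connected at least once, loop thread dead, not stopped, paused, or already connecting) comes from the description.

```python
def should_supervise_reconnect(*, ever_connected, loop_alive, stopped, paused, connecting):
    """Return True only when the supervisor should re-drive connect()."""
    return ever_connected and not loop_alive and not (stopped or paused or connecting)


# Baseline: connected before, loop thread dead, nothing else in the way.
base = dict(ever_connected=True, loop_alive=False,
            stopped=False, paused=False, connecting=False)

# Each case flips exactly one input away from the baseline.
cases = {
    "dead loop": (base, True),
    "alive loop": ({**base, "loop_alive": True}, False),
    "never connected": ({**base, "ever_connected": False}, False),
    "paused": ({**base, "paused": True}, False),
    "stopped": ({**base, "stopped": True}, False),
    "connect in progress": ({**base, "connecting": True}, False),
}

results = {name: should_supervise_reconnect(**state) for name, (state, _) in cases.items()}
expected = {name: want for name, (_, want) in cases.items()}
```

Keeping the decision a pure function of a few booleans is what makes it testable without a broker or a running paho loop.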

Notes

No changes to remote-configuration behavior beyond the timing-bug fix above; the
tb_gateway_remote_configurator.py edit is purely the seconds/milliseconds
correction.
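
To make the seconds/milliseconds mix-up concrete, here is a small numeric sketch. The comparison direction and the names `budget_exhausted` and `RETRY_BUDGET_S` are assumptions for illustration; the actual code in _apply_connection_config may be written differently. Fixed timestamps are used instead of time() so the arithmetic is reproducible.

```python
RETRY_BUDGET_S = 30  # retry budget in seconds, as described in the PR


def budget_exhausted(now_s, apply_start):
    """True once more than RETRY_BUDGET_S has elapsed since apply_start."""
    return now_s - apply_start > RETRY_BUDGET_S


now = 1_700_000_000.0   # a fixed "current time" in seconds
later = now + 45        # 45 seconds afterwards

buggy_start = now * 1000  # reset in milliseconds (the bug)
fixed_start = now         # reset in seconds (the fix)

# With the millisecond value the elapsed time is hugely negative,
# so the budget never appears to run out.
buggy = budget_exhausted(later, buggy_start)
fixed = budget_exhausted(later, fixed_start)
```

Under these assumptions `buggy` stays false no matter how long the wait, while `fixed` becomes true after 30 seconds, which is consistent with the fallback never toggling before the correction.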

Closes #2086

A gateway that lost its connection to ThingsBoard could get permanently
stuck disconnected and never reconnect. Reconnection was split between two
mechanisms with a gap: TBClient.connect()'s retry loop ran only once at
startup, and after the first successful connect all reconnection was
delegated to paho's network-loop thread. Nothing supervised that thread, so
once it stopped (clean server disconnect, or an unhandled error in the loop /
a callback) the gateway never tried again - TBClient.run() just idled.

- Harden _on_connect/_on_disconnect: wrap their bodies (and each service
  subscription callback) in try/except. paho invokes these from its
  network-loop thread and re-raises any exception that escapes them
  (suppress_exceptions defaults to False), which would terminate the loop
  thread and stop reconnection for good.
- Add a reconnection supervisor in TBClient.run(): once the client has
  connected at least once, re-drive connect() whenever paho's loop thread is
  dead and we are not stopped/paused/already connecting. paho keeps owning
  reconnection while its loop is alive, so we never fight its backoff.
- Make connect() robust: serialize attempts with a per-client lock, catch
  per-iteration errors so an exception can never break out of the retry loop
  and silently return, and keep looping until we are both connected AND the
  network loop is alive (covers a loop thread that died leaving the client's
  internal connected flag stale).
- Fix a units bug in RemoteConfigurator._apply_connection_config: apply_start
  was reset in milliseconds while compared against time() in seconds, which
  broke the 30s retry budget and stopped the new/old configuration fallback
  from ever toggling.
