Skip to content

fix(mesh): radio + router reliability hardening#10424

Closed
DatanoiseTV wants to merge 8 commits into
meshtastic:developfrom
DatanoiseTV:fix/radio-mesh-reliability
Closed

fix(mesh): radio + router reliability hardening#10424
DatanoiseTV wants to merge 8 commits into
meshtastic:developfrom
DatanoiseTV:fix/radio-mesh-reliability

Conversation

@DatanoiseTV
Copy link
Copy Markdown
Contributor

@DatanoiseTV DatanoiseTV commented May 8, 2026

Summary

A bundle of small, independently-revertable fixes in the radio and routing layers. Each commit addresses one specific class of bug uncovered while reading the recent reconfigure-return-type churn (#10407 / #10411) and tracing the surrounding error paths.

Fixes (one logical commit per item)

  • radio: return real status from reconfigure() and stop panicking on SPI errors — extends Fix mesh reconfigure methods to return true instead of error code #10407. SX126x/SX128x/LR11x0/LR20x0/RF95 all returned true unconditionally even when individual SPI setters failed, so the reboot-on-bad-reconfigure path in RadioInterface.cpp:515 never tripped. Track failures in a local bool ok and return ok. Same drivers also assert(err == RADIOLIB_ERR_NONE) on standby/setSyncWord/setCurrentLimit/setPreambleLength/setOutputPower — converted to log-and-fail so a momentary SPI glitch doesn't panic-reset the firmware. LR20x0 (added in LR2021 radio on NRF_Promicro #10401) had the same bug as the others.
  • mesh: null-check getMeshNode in handleFromRadioMeshService.cpp:97 dereferenced nodeDB->getMeshNode(mp->from)->has_user without a null check. getMeshNode legitimately returns nullptr under heap pressure (see NodeDB::isFull and updateFrom early-out), so every received decoded packet from a new node could crash on a full device. Restructure with an init-statement so the predicate short-circuits.
  • radio: drop oversized RX packets instead of assertinghandleReceiveInterrupt bounded the encrypted-payload memcpy with assert((uint32_t)payloadLen <= sizeof(mp->encrypted.bytes)). assert vanishes in NDEBUG, and even when on, panicking on a malformed (or simply oversized future-modem) packet is the wrong response. Replace with a runtime length check that releases the pool entry, bumps rxBad, and returns.
  • mesh: add exponential backoff and jitter to retransmitsNextHopRouter::setNextTx scheduled every retry at the same offset returned by getRetransmissionMsec (driven by channel utilization, not attempt history). Two nodes that simultaneously NAK each other re-collide on near-identical schedules. Introduce 1×/2×/4× backoff capped at 8× plus random half-base jitter.
  • router: skip pre-encryption packet copy when MQTT does not need it — extends the spirit of [feat] Skip MQTT allocation when disabled #10411. Router::send unconditionally allocCopy/release'd a clone of every relayed packet to feed MQTT::onSend, even when MQTT is excluded at compile time, runtime-disabled, or the packet wasn't from us. Hoist the alloc behind the same predicate the call site already used. Halves the pool-alloc rate on the hot path for non-MQTT devices.
  • router: don't evict ACKs when a background-priority packet arrives fullfromRadioQueue is a plain FIFO of size 4. When full, the prior behaviour dropped the oldest entry — usually a routing ACK or reply — to make room for whatever just arrived, even if it was BACKGROUND priority. Reverse that for the low-priority case: BACKGROUND or below is dropped at the door; higher priorities still evict. PointerQueue doesn't expose iteration so a full priority refactor is bigger than needed here.
  • radio: poll TX_DONE and stuck-TX timeout from main loop — the 60 s stuck-TX guard in canSendImmediately only fires when more packets enter the queue, because that's its only call site. If the queue empties after the radio wedges, the device sits in busyTx forever. pollMissedIrqs is already invoked unconditionally from the main loop — extend it to poll TX_DONE when sendingPacket != NULL and trip the same reboot path independently of queue activity.
  • mesh: bound the user-facing notification sprintf calls — two sprintf(cn->message, …) sites in Router.cpp and NodeDB.cpp have no bound check. Today's format strings fit, but a later edit without re-checking the buffer would silently overrun. Switch to snprintf with sizeof.

Build verification

  • pio run -e t-deck-tft — SUCCESS, 38.4% RAM / 55.3% flash. No new warnings introduced (LSP diagnostics seen during dev are local PlatformIO-include-path misconfig, not real).
  • Did not run on hardware — please request community testing before merge. The reconfigure() and TX_DONE-poll changes especially deserve a packet-capture eye.

Attestations

  • I have tested that my proposed changes behave as described — review-level / static-analysis only; not on-air.
  • I have tested that my proposed changes do not cause any obvious regressions on the following devices:
    • Heltec (Lora32) V3
    • LilyGo T-Deck
    • LilyGo T-Beam
    • RAK WisBlock 4631
    • Seeed Studio T-1000E tracker card
    • Other — build-verified t-deck-tft only

Notes for reviewers

  • Each commit is independent; happy to split into 8 PRs if you'd prefer atomic review. Kept together because they share the radio/router context and were uncovered as a set.
  • Edits-by-maintainers should be enabled.

…I errors

reconfigure() in all five chip drivers (SX126x, SX128x, LR11x0, LR20x0,
RF95) was returning true unconditionally, even when individual SPI
setters failed. The caller in RadioInterface::cppInit() reboots the
device when reconfigure returns false, so the failure-recovery path was
silently disabled. Track failures in a local bool and return its
accumulated state.

Same drivers also asserted on remote SPI calls in reconfigure() and
setStandby() — a momentary glitch on the SPI bus would panic-reset the
firmware. Replaced with log-and-continue; the new return value lets the
caller make the reboot decision instead of crashing inline.
NodeDB::getMeshNode returns nullptr legitimately under heap pressure
(see NodeDB::isFull and updateFrom early-out paths). The condition at
MeshService.cpp:97 dereferenced the result directly to read has_user,
so every received decoded packet from a new node could crash on a
device near pool exhaustion. Restructure with an init-statement so the
node is fetched once and the rest of the predicate short-circuits on
null.
handleReceiveInterrupt bounded the encrypted-payload memcpy with an
assert, which a future modem with a longer max payload would turn into
a panic-reset on a malformed (or simply larger) air packet. Replace
with a runtime length check that releases the pool entry, bumps rxBad,
and returns. assert() also vanishes in NDEBUG builds, so there was no
actual bound check in release.
setNextTx scheduled every retry at the same base offset returned by
getRetransmissionMsec, which is driven by channel utilization rather
than per-attempt history. Two nodes that simultaneously NAK each other
re-collide on near-identical schedules and waste airtime. Use a 1x/2x/4x
backoff (with the multiplier capped at 8x for safety) plus random
half-base jitter so concurrent attempts spread out.
Router::send unconditionally allocCopy/release'd a pre-encryption clone
of every relayed packet, even on builds where MQTT was excluded
(MESHTASTIC_EXCLUDE_MQTT) or runtime-disabled (mqtt module nullptr or
moduleConfig.mqtt.enabled=false) or when the packet was a relay rather
than from us. The copy is only fed to MQTT::onSend, so guard the alloc
on the same predicate the call site already used. On nodes near pool
exhaustion this halves the alloc rate on the hot path.
fromRadioQueue is a plain FIFO of size 4. When it fills, the prior
behaviour was to drop the oldest packet — usually a routing ACK or
reply — to make room for whatever just arrived, even if the new packet
was BACKGROUND priority. Reverse that for the low-priority case: if
the incoming packet's priority is at or below BACKGROUND, drop it
instead. Higher-priority arrivals still evict to fit. PointerQueue
doesn't expose iteration, so a full priority-queue refactor would be a
larger change; this addresses the worst case.
The 60s stuck-TX detector in canSendImmediately only fires when more
packets enter the TX queue, because that's the only path that calls it.
If the queue empties after the radio wedges, no IRQ ever fires, no
poll runs, and the device sits in busyTx forever. pollMissedIrqs is
already invoked unconditionally from the main loop — extend it to
poll TX_DONE when sendingPacket is set, and trip the same reboot
path independently of queue activity.
Two sites built ClientNotification messages with sprintf into a
fixed-size proto buffer with no length cap. The current format strings
fit comfortably, but a future caller editing either format string
without rechecking the buffer size would get a silent stack/heap
overrun. Switch to snprintf with sizeof so the bound is enforced at
the call site.
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 8, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions Bot added the bugfix Pull request that fixes bugs label May 8, 2026
@thebentern
Copy link
Copy Markdown
Contributor

Hi. Please don't include so many unrelated changes in one PR

@thebentern thebentern closed this May 8, 2026
@NomDeTom
Copy link
Copy Markdown
Collaborator

NomDeTom commented May 9, 2026

I would check the jitter on packets - I believe there is already some random delay on rebroadcast.

Other question: how does a node know the reason for failure of its own packet?

@DatanoiseTV
Copy link
Copy Markdown
Contributor Author

Thanks @thebentern — fair point. Re-opened as 8 atomic PRs, one per fix:

Each is independent and revertable; happy to revise / drop any of them on their own thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants