fix(mesh): radio + router reliability hardening by DatanoiseTV · Pull Request #10424 · meshtastic/firmware

DatanoiseTV · 2026-05-08T20:13:38Z

Summary

A bundle of small, independently-revertable fixes in the radio and routing layers. Each commit addresses one specific class of bug uncovered while reading the recent reconfigure-return-type churn (#10407 / #10411) and tracing the surrounding error paths.

Fixes (one logical commit per item)

radio: return real status from reconfigure() and stop panicking on SPI errors — extends Fix mesh reconfigure methods to return true instead of error code #10407. SX126x/SX128x/LR11x0/LR20x0/RF95 all returned true unconditionally even when individual SPI setters failed, so the reboot-on-bad-reconfigure path in RadioInterface.cpp:515 never tripped. Track failures in a local bool ok and return ok. Same drivers also assert(err == RADIOLIB_ERR_NONE) on standby/setSyncWord/setCurrentLimit/setPreambleLength/setOutputPower — converted to log-and-fail so a momentary SPI glitch doesn't panic-reset the firmware. LR20x0 (added in LR2021 radio on NRF_Promicro #10401) had the same bug as the others.
mesh: null-check getMeshNode in handleFromRadio — MeshService.cpp:97 dereferenced nodeDB->getMeshNode(mp->from)->has_user without a null check. getMeshNode legitimately returns nullptr under heap pressure (see NodeDB::isFull and updateFrom early-out), so every received decoded packet from a new node could crash on a full device. Restructure with an init-statement so the predicate short-circuits.
radio: drop oversized RX packets instead of asserting — handleReceiveInterrupt bounded the encrypted-payload memcpy with assert((uint32_t)payloadLen <= sizeof(mp->encrypted.bytes)). assert vanishes in NDEBUG, and even when on, panicking on a malformed (or simply oversized future-modem) packet is the wrong response. Replace with a runtime length check that releases the pool entry, bumps rxBad, and returns.
mesh: add exponential backoff and jitter to retransmits — NextHopRouter::setNextTx scheduled every retry at the same offset returned by getRetransmissionMsec (driven by channel utilization, not attempt history). Two nodes that simultaneously NAK each other re-collide on near-identical schedules. Introduce 1×/2×/4× backoff capped at 8× plus random half-base jitter.
router: skip pre-encryption packet copy when MQTT does not need it — extends the spirit of [feat] Skip MQTT allocation when disabled #10411. Router::send unconditionally allocCopy/release'd a clone of every relayed packet to feed MQTT::onSend, even when MQTT is excluded at compile time, runtime-disabled, or the packet wasn't from us. Hoist the alloc behind the same predicate the call site already used. Halves the pool-alloc rate on the hot path for non-MQTT devices.
router: don't evict ACKs when a background-priority packet arrives full — fromRadioQueue is a plain FIFO of size 4. When full, the prior behaviour dropped the oldest entry — usually a routing ACK or reply — to make room for whatever just arrived, even if it was BACKGROUND priority. Reverse that for the low-priority case: BACKGROUND or below is dropped at the door; higher priorities still evict. PointerQueue doesn't expose iteration so a full priority refactor is bigger than needed here.
radio: poll TX_DONE and stuck-TX timeout from main loop — the 60 s stuck-TX guard in canSendImmediately only fires when more packets enter the queue, because that's its only call site. If the queue empties after the radio wedges, the device sits in busyTx forever. pollMissedIrqs is already invoked unconditionally from the main loop — extend it to poll TX_DONE when sendingPacket != NULL and trip the same reboot path independently of queue activity.
mesh: bound the user-facing notification sprintf calls — two sprintf(cn->message, …) sites in Router.cpp and NodeDB.cpp have no bound check. Today's format strings fit, but a later edit without re-checking the buffer would silently overrun. Switch to snprintf with sizeof.

Build verification

pio run -e t-deck-tft — SUCCESS, 38.4% RAM / 55.3% flash. No new warnings introduced (LSP diagnostics seen during dev are local PlatformIO-include-path misconfig, not real).
Did not run on hardware — please request community testing before merge. The reconfigure() and TX_DONE-poll changes especially deserve a packet-capture eye.

Attestations

I have tested that my proposed changes behave as described — review-level / static-analysis only; not on-air.
I have tested that my proposed changes do not cause any obvious regressions on the following devices:
- Heltec (Lora32) V3
- LilyGo T-Deck
- LilyGo T-Beam
- RAK WisBlock 4631
- Seeed Studio T-1000E tracker card
- Other — build-verified t-deck-tft only

Notes for reviewers

Each commit is independent; happy to split into 8 PRs if you'd prefer atomic review. Kept together because they share the radio/router context and were uncovered as a set.
Edits-by-maintainers should be enabled.

…I errors reconfigure() in all five chip drivers (SX126x, SX128x, LR11x0, LR20x0, RF95) was returning true unconditionally, even when individual SPI setters failed. The caller in RadioInterface::cppInit() reboots the device when reconfigure returns false, so the failure-recovery path was silently disabled. Track failures in a local bool and return its accumulated state. Same drivers also asserted on remote SPI calls in reconfigure() and setStandby() — a momentary glitch on the SPI bus would panic-reset the firmware. Replaced with log-and-continue; the new return value lets the caller make the reboot decision instead of crashing inline.

NodeDB::getMeshNode returns nullptr legitimately under heap pressure (see NodeDB::isFull and updateFrom early-out paths). The condition at MeshService.cpp:97 dereferenced the result directly to read has_user, so every received decoded packet from a new node could crash on a device near pool exhaustion. Restructure with an init-statement so the node is fetched once and the rest of the predicate short-circuits on null.

handleReceiveInterrupt bounded the encrypted-payload memcpy with an assert, which a future modem with a longer max payload would turn into a panic-reset on a malformed (or simply larger) air packet. Replace with a runtime length check that releases the pool entry, bumps rxBad, and returns. assert() also vanishes in NDEBUG builds, so there was no actual bound check in release.

setNextTx scheduled every retry at the same base offset returned by getRetransmissionMsec, which is driven by channel utilization rather than per-attempt history. Two nodes that simultaneously NAK each other re-collide on near-identical schedules and waste airtime. Use a 1x/2x/4x backoff (with the multiplier capped at 8x for safety) plus random half-base jitter so concurrent attempts spread out.

Router::send unconditionally allocCopy/release'd a pre-encryption clone of every relayed packet, even on builds where MQTT was excluded (MESHTASTIC_EXCLUDE_MQTT) or runtime-disabled (mqtt module nullptr or moduleConfig.mqtt.enabled=false) or when the packet was a relay rather than from us. The copy is only fed to MQTT::onSend, so guard the alloc on the same predicate the call site already used. On nodes near pool exhaustion this halves the alloc rate on the hot path.

fromRadioQueue is a plain FIFO of size 4. When it fills, the prior behaviour was to drop the oldest packet — usually a routing ACK or reply — to make room for whatever just arrived, even if the new packet was BACKGROUND priority. Reverse that for the low-priority case: if the incoming packet's priority is at or below BACKGROUND, drop it instead. Higher-priority arrivals still evict to fit. PointerQueue doesn't expose iteration, so a full priority-queue refactor would be a larger change; this addresses the worst case.

The 60s stuck-TX detector in canSendImmediately only fires when more packets enter the TX queue, because that's the only path that calls it. If the queue empties after the radio wedges, no IRQ ever fires, no poll runs, and the device sits in busyTx forever. pollMissedIrqs is already invoked unconditionally from the main loop — extend it to poll TX_DONE when sendingPacket is set, and trip the same reboot path independently of queue activity.

Two sites built ClientNotification messages with sprintf into a fixed-size proto buffer with no length cap. The current format strings fit comfortably, but a future caller editing either format string without rechecking the buffer size would get a silent stack/heap overrun. Switch to snprintf with sizeof so the bound is enforced at the call site.

CLAassistant · 2026-05-08T20:13:45Z

All committers have signed the CLA.

thebentern · 2026-05-08T20:34:33Z

Hi. Please don't include so many unrelated changes in one PR

NomDeTom · 2026-05-09T01:36:21Z

I would check the jitter on packets - I believe there is already some random delay on rebroadcast.

Other question: how does a node know the reason for failure of its own packet?

DatanoiseTV · 2026-05-09T01:59:32Z

Thanks @thebentern — fair point. Re-opened as 8 atomic PRs, one per fix:

fix(mesh): return real status from reconfigure() and stop panicking on SPI errors #10430 — radio: return real status from reconfigure() and stop panicking on SPI errors (extends Fix mesh reconfigure methods to return true instead of error code #10407)
fix(mesh): null-check getMeshNode in handleFromRadio #10431 — mesh: null-check getMeshNode in handleFromRadio
fix(mesh): drop oversized RX packets instead of asserting #10432 — radio: drop oversized RX packets instead of asserting
fix(mesh): add exponential backoff and jitter to retransmits #10433 — mesh: add exponential backoff and jitter to retransmits
fix(mesh): skip pre-encryption packet copy when MQTT does not need it #10434 — router: skip pre-encryption packet copy when MQTT does not need it (extends spirit of [feat] Skip MQTT allocation when disabled #10411)
fix(mesh): don't evict ACKs when a background-priority packet arrives full #10435 — router: don't evict ACKs when a background-priority packet arrives full
fix(mesh): poll TX_DONE and stuck-TX timeout from main loop #10436 — radio: poll TX_DONE and stuck-TX timeout from main loop
fix(mesh): bound the user-facing notification sprintf calls #10437 — mesh: bound the user-facing notification sprintf calls

Each is independent and revertable; happy to revise / drop any of them on their own thread.

DatanoiseTV added 8 commits May 8, 2026 21:32

github-actions Bot added the bugfix Pull request that fixes bugs label May 8, 2026

thebentern closed this May 8, 2026

DatanoiseTV mentioned this pull request May 9, 2026

fix(power): sleep/battery reliability hardening #10425

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(mesh): radio + router reliability hardening#10424

fix(mesh): radio + router reliability hardening#10424
DatanoiseTV wants to merge 8 commits into
meshtastic:developfrom
DatanoiseTV:fix/radio-mesh-reliability

DatanoiseTV commented May 8, 2026 •

edited

Loading

Uh oh!

CLAassistant commented May 8, 2026 •

edited

Loading

Uh oh!

thebentern commented May 8, 2026

Uh oh!

NomDeTom commented May 9, 2026

Uh oh!

DatanoiseTV commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

DatanoiseTV commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fixes (one logical commit per item)

Build verification

Attestations

Notes for reviewers

Uh oh!

CLAassistant commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thebentern commented May 8, 2026

Uh oh!

NomDeTom commented May 9, 2026

Uh oh!

DatanoiseTV commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DatanoiseTV commented May 8, 2026 •

edited

Loading

CLAassistant commented May 8, 2026 •

edited

Loading