fix(mesh): radio + router reliability hardening #10424
Closed
DatanoiseTV wants to merge 8 commits into
…I errors reconfigure() in all five chip drivers (SX126x, SX128x, LR11x0, LR20x0, RF95) was returning true unconditionally, even when individual SPI setters failed. The caller in RadioInterface::cppInit() reboots the device when reconfigure returns false, so the failure-recovery path was silently disabled. Track failures in a local bool and return its accumulated state. Same drivers also asserted on remote SPI calls in reconfigure() and setStandby() — a momentary glitch on the SPI bus would panic-reset the firmware. Replaced with log-and-continue; the new return value lets the caller make the reboot decision instead of crashing inline.
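The accumulate-and-return pattern described above might look roughly like the following. This is a minimal sketch, not the driver code: `reconfigureSketch`, `ERR_NONE`, and the list of setter callbacks are hypothetical stand-ins for the per-chip SPI setters and the RadioLib status code.

```cpp
#include <cstdio>
#include <functional>
#include <vector>

// Hypothetical stand-in for a RadioLib-style status code; 0 means success.
constexpr int ERR_NONE = 0;

// Runs every setter, logs each failure, and returns false if any of them failed.
// The setters are illustrative placeholders, not the firmware's real calls.
bool reconfigureSketch(const std::vector<std::function<int()>> &setters)
{
    bool ok = true;
    for (const auto &setter : setters) {
        int err = setter();
        if (err != ERR_NONE) {
            std::fprintf(stderr, "radio setter failed, err=%d\n", err);
            ok = false; // remember the failure but keep applying the rest
        }
    }
    return ok;
}
```

Returning the accumulated flag leaves the reboot decision with the caller, which is the behaviour the commit message asks for.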
NodeDB::getMeshNode returns nullptr legitimately under heap pressure (see NodeDB::isFull and updateFrom early-out paths). The condition at MeshService.cpp:97 dereferenced the result directly to read has_user, so every received decoded packet from a new node could crash on a device near pool exhaustion. Restructure with an init-statement so the node is fetched once and the rest of the predicate short-circuits on null.
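For illustration, a C++17 if-with-initializer fetches the node once and short-circuits on null. The types below (`NodeInfo`, `NodeTable`) and the predicate itself are simplified assumptions; the real condition at `MeshService.cpp:97` tests more than this.

```cpp
#include <cstdint>
#include <map>

// Simplified, hypothetical stand-ins for the firmware's node types.
struct NodeInfo {
    bool has_user;
};

struct NodeTable {
    std::map<uint32_t, NodeInfo> nodes;

    // Returns nullptr when the node is unknown (or could not be allocated).
    const NodeInfo *getMeshNode(uint32_t id) const
    {
        auto it = nodes.find(id);
        return it == nodes.end() ? nullptr : &it->second;
    }
};

// True only when the sender is known and has no user record yet.
bool senderLacksUserInfo(const NodeTable &db, uint32_t from)
{
    // Fetch once, then test for null before dereferencing.
    if (const NodeInfo *node = db.getMeshNode(from); node && !node->has_user) {
        return true;
    }
    return false;
}
```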
handleReceiveInterrupt bounded the encrypted-payload memcpy with an assert, which a future modem with a longer max payload would turn into a panic-reset on a malformed (or simply larger) air packet. Replace with a runtime length check that releases the pool entry, bumps rxBad, and returns. assert() also vanishes in NDEBUG builds, so there was no actual bound check in release.
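A runtime bound check of this kind could be sketched as follows. `Packet`, `RxStats`, and `copyPayload` are hypothetical and the 16-byte buffer is an arbitrary illustrative size; in the real handler the caller would also release the pool entry on failure.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified packet and counters; the buffer size is illustrative.
struct Packet {
    uint8_t bytes[16];
    size_t size = 0;
};

struct RxStats {
    uint32_t rxGood = 0;
    uint32_t rxBad = 0;
};

// Copies the payload only if it fits. Otherwise counts it as bad and reports
// failure so the caller can release the packet back to its pool.
bool copyPayload(Packet &out, const uint8_t *payload, size_t payloadLen, RxStats &stats)
{
    if (payloadLen > sizeof(out.bytes)) {
        stats.rxBad++;
        return false;
    }
    std::memcpy(out.bytes, payload, payloadLen);
    out.size = payloadLen;
    stats.rxGood++;
    return true;
}
```

Unlike `assert`, this check is still compiled in when `NDEBUG` is defined.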
setNextTx scheduled every retry at the same base offset returned by getRetransmissionMsec, which is driven by channel utilization rather than per-attempt history. Two nodes that simultaneously NAK each other re-collide on near-identical schedules and waste airtime. Use a 1x/2x/4x backoff (with the multiplier capped at 8x for safety) plus random half-base jitter so concurrent attempts spread out.
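One possible reading of that schedule is base × min(2^attempt, 8) plus uniform jitter in [0, base/2]. The sketch below assumes that reading; `retryDelayMsec` and its parameters are hypothetical, and the firmware's own random source and attempt counter will differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Delay before retry number `attempt` (0 = first retry).
// Assumed scheme: base * min(2^attempt, 8) plus jitter in [0, base / 2].
uint32_t retryDelayMsec(uint32_t baseMsec, unsigned attempt, std::mt19937 &rng)
{
    uint32_t multiplier = 1u << std::min(attempt, 3u); // 1, 2, 4, then capped at 8
    std::uniform_int_distribution<uint32_t> jitter(0, baseMsec / 2);
    return baseMsec * multiplier + jitter(rng);
}
```

With a 1000 ms base, the first three retries land in [1000, 1500], [2000, 2500], and [4000, 4500] ms, so two nodes that start in lockstep are unlikely to stay there.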
Router::send unconditionally allocCopy/release'd a pre-encryption clone of every relayed packet, even on builds where MQTT was excluded (MESHTASTIC_EXCLUDE_MQTT) or runtime-disabled (mqtt module nullptr or moduleConfig.mqtt.enabled=false) or when the packet was a relay rather than from us. The copy is only fed to MQTT::onSend, so guard the alloc on the same predicate the call site already used. On nodes near pool exhaustion this halves the alloc rate on the hot path.
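The idea of guarding the copy with the same predicate as its only consumer can be sketched like this. `Packet`, `MqttSink`, and `sendSketch` are hypothetical simplifications; the real code uses the packet pool rather than `std::unique_ptr`, and the compile-time `MESHTASTIC_EXCLUDE_MQTT` case is not modelled.

```cpp
#include <memory>
#include <string>

struct Packet {
    std::string payload;
};

// Hypothetical MQTT consumer that wants both the encrypted and decoded forms.
struct MqttSink {
    bool enabled = false;
    int published = 0;
    std::string lastDecoded;

    void onSend(const Packet &encrypted, const Packet &decoded)
    {
        (void)encrypted;
        lastDecoded = decoded.payload;
        published++;
    }
};

// Returns true if a pre-encryption copy was allocated.
bool sendSketch(Packet &p, MqttSink *mqtt, bool isFromUs)
{
    // The same predicate guards both the copy and the later onSend call.
    const bool wantCopy = mqtt != nullptr && mqtt->enabled && isFromUs;

    std::unique_ptr<Packet> decoded;
    if (wantCopy)
        decoded = std::make_unique<Packet>(p);

    p.payload = "<encrypted>"; // stand-in for the real encryption step

    if (wantCopy)
        mqtt->onSend(p, *decoded);
    return wantCopy;
}
```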
fromRadioQueue is a plain FIFO of size 4. When it fills, the prior behaviour was to drop the oldest packet — usually a routing ACK or reply — to make room for whatever just arrived, even if the new packet was BACKGROUND priority. Reverse that for the low-priority case: if the incoming packet's priority is at or below BACKGROUND, drop it instead. Higher-priority arrivals still evict to fit. PointerQueue doesn't expose iteration, so a full priority-queue refactor would be a larger change; this addresses the worst case.
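A minimal sketch of that admission rule on a bounded FIFO follows. `BoundedFifo`, `Item`, and the numeric priority values are assumptions for illustration only; the real queue is a `PointerQueue` and its priorities come from the protobuf definitions.

```cpp
#include <cstddef>
#include <deque>

// Illustrative priority levels; the numeric values are assumptions.
enum Priority { PRIORITY_BACKGROUND = 10, PRIORITY_DEFAULT = 64, PRIORITY_ACK = 120 };

struct Item {
    int id;
    int priority;
};

class BoundedFifo
{
  public:
    explicit BoundedFifo(size_t capacity) : capacity_(capacity) {}

    // Returns false if the incoming item was dropped at the door.
    bool enqueue(const Item &incoming)
    {
        if (queue_.size() >= capacity_) {
            if (incoming.priority <= PRIORITY_BACKGROUND)
                return false;   // low priority: drop the newcomer instead
            queue_.pop_front(); // higher priority: evict the oldest entry
        }
        queue_.push_back(incoming);
        return true;
    }

    size_t size() const { return queue_.size(); }
    const Item &front() const { return queue_.front(); }

  private:
    size_t capacity_;
    std::deque<Item> queue_;
};
```

The eviction branch still removes the oldest entry regardless of its priority, which matches the limitation the commit message acknowledges.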
The 60s stuck-TX detector in canSendImmediately only fires when more packets enter the TX queue, because that's the only path that calls it. If the queue empties after the radio wedges, no IRQ ever fires, no poll runs, and the device sits in busyTx forever. pollMissedIrqs is already invoked unconditionally from the main loop — extend it to poll TX_DONE when sendingPacket is set, and trip the same reboot path independently of queue activity.
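A queue-independent watchdog of this kind might be structured as below. `TxState`, `TxPoll`, and `pollTx` are hypothetical names; the sketch only classifies the state and leaves the reboot (or other recovery) to the caller.

```cpp
#include <cstdint>

constexpr uint32_t STUCK_TX_TIMEOUT_MS = 60000; // the 60 s guard described above

enum class TxPoll { Idle, InFlight, Done, Stuck };

// Hypothetical stand-in for the "sendingPacket is set" state plus its start time.
struct TxState {
    bool sending = false;
    uint32_t startedMs = 0;
};

// Intended to be called from the main loop on every pass, not only on enqueue.
TxPoll pollTx(TxState &tx, bool txDoneFlag, uint32_t nowMs)
{
    if (!tx.sending)
        return TxPoll::Idle;
    if (txDoneFlag) {
        tx.sending = false; // completion that the IRQ path missed
        return TxPoll::Done;
    }
    // Unsigned subtraction stays correct across millisecond-counter wraparound.
    if (nowMs - tx.startedMs > STUCK_TX_TIMEOUT_MS)
        return TxPoll::Stuck;
    return TxPoll::InFlight;
}
```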
Two sites built ClientNotification messages with sprintf into a fixed-size proto buffer with no length cap. The current format strings fit comfortably, but a future caller editing either format string without rechecking the buffer size would get a silent stack/heap overrun. Switch to snprintf with sizeof so the bound is enforced at the call site.
Contributor
Hi. Please don't include so many unrelated changes in one PR.
Collaborator
I would check the jitter on packets - I believe there is already some random delay on rebroadcast. Other question: how does a node know the reason for failure of its own packet?
This was referenced May 9, 2026
Summary
A bundle of small, independently-revertable fixes in the radio and routing layers. Each commit addresses one specific class of bug uncovered while reading the recent reconfigure-return-type churn (#10407 / #10411) and tracing the surrounding error paths.
Fixes (one logical commit per item)
radio: return real status from reconfigure() and stop panicking on SPI errors - extends "Fix mesh reconfigure methods to return true instead of error code" (#10407). SX126x/SX128x/LR11x0/LR20x0/RF95 all returned true unconditionally even when individual SPI setters failed, so the reboot-on-bad-reconfigure path in RadioInterface.cpp:515 never tripped. Track failures in a local bool ok and return ok. Same drivers also assert(err == RADIOLIB_ERR_NONE) on standby/setSyncWord/setCurrentLimit/setPreambleLength/setOutputPower; converted to log-and-fail so a momentary SPI glitch doesn't panic-reset the firmware. LR20x0 (added in "LR2021 radio on NRF_Promicro", #10401) had the same bug as the others.

mesh: null-check getMeshNode in handleFromRadio - MeshService.cpp:97 dereferenced nodeDB->getMeshNode(mp->from)->has_user without a null check. getMeshNode legitimately returns nullptr under heap pressure (see NodeDB::isFull and updateFrom early-out), so every received decoded packet from a new node could crash on a full device. Restructure with an init-statement so the predicate short-circuits.

radio: drop oversized RX packets instead of asserting - handleReceiveInterrupt bounded the encrypted-payload memcpy with assert((uint32_t)payloadLen <= sizeof(mp->encrypted.bytes)). assert vanishes in NDEBUG, and even when on, panicking on a malformed (or simply oversized future-modem) packet is the wrong response. Replace with a runtime length check that releases the pool entry, bumps rxBad, and returns.

mesh: add exponential backoff and jitter to retransmits - NextHopRouter::setNextTx scheduled every retry at the same offset returned by getRetransmissionMsec (driven by channel utilization, not attempt history). Two nodes that simultaneously NAK each other re-collide on near-identical schedules. Introduce 1×/2×/4× backoff capped at 8× plus random half-base jitter.

router: skip pre-encryption packet copy when MQTT does not need it - extends the spirit of "[feat] Skip MQTT allocation when disabled" (#10411). Router::send unconditionally allocCopy/release'd a clone of every relayed packet to feed MQTT::onSend, even when MQTT is excluded at compile time, runtime-disabled, or the packet wasn't from us. Hoist the alloc behind the same predicate the call site already used. Halves the pool-alloc rate on the hot path for non-MQTT devices.

router: don't evict ACKs when a background-priority packet arrives full - fromRadioQueue is a plain FIFO of size 4. When full, the prior behaviour dropped the oldest entry (usually a routing ACK or reply) to make room for whatever just arrived, even if it was BACKGROUND priority. Reverse that for the low-priority case: BACKGROUND or below is dropped at the door; higher priorities still evict. PointerQueue doesn't expose iteration, so a full priority refactor is bigger than needed here.

radio: poll TX_DONE and stuck-TX timeout from main loop - the 60 s stuck-TX guard in canSendImmediately only fires when more packets enter the queue, because that's its only call site. If the queue empties after the radio wedges, the device sits in busyTx forever. pollMissedIrqs is already invoked unconditionally from the main loop; extend it to poll TX_DONE when sendingPacket != NULL and trip the same reboot path independently of queue activity.

mesh: bound the user-facing notification sprintf calls - two sprintf(cn->message, …) sites in Router.cpp and NodeDB.cpp have no bound check. Today's format strings fit, but a later edit without re-checking the buffer would silently overrun. Switch to snprintf with sizeof.

Build verification
pio run -e t-deck-tft - SUCCESS, 38.4% RAM / 55.3% flash. No new warnings introduced (LSP diagnostics seen during dev are local PlatformIO-include-path misconfig, not real).

Attestations
t-deck-tft only

Notes for reviewers