BLE OTA: keepalive + longer supervision timeout to prevent mid-flash disconnect #113

Open
PaulDWhite wants to merge 6 commits into master from ota-ble-keepalive-supervision

Conversation

@PaulDWhite

Collaborator

Problem

BLE OTA firmware updates abort partway through (~20–30%). The controller logs Device disconnected reason=531 mid-flash — HCI error 0x13, a central-initiated graceful disconnect (not RF dropout, not a supervision timeout, not a controller crash). Root cause: during OTA the firmware intentionally stops sending FastLink telemetry to give the flash full bandwidth, so the phone sees the link as idle and tears it down (its telemetry-stall watchdog, or an OS-level GATT idle teardown).
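The mapping from the logged reason=531 to HCI 0x13 can be checked with a small decoder. This is a sketch that assumes the logged value follows the NimBLE host convention of reporting HCI status codes as 0x200 plus the status (531 = 0x200 + 0x13); the function and constant names are hypothetical and not taken from the firmware.

```cpp
#include <cstdint>

// Assumption: the reason is encoded NimBLE-style, with HCI status codes
// reported as 0x200 + status. kHciErrorBase is an illustrative name.
constexpr int kHciErrorBase = 0x200;

// Returns the raw HCI status (0x00..0xFF) if `reason` lies in the HCI range,
// or -1 if the value is not an HCI-derived code.
constexpr int hciStatusFromReason(int reason) {
    return (reason >= kHciErrorBase && reason <= kHciErrorBase + 0xFF)
               ? reason - kHciErrorBase
               : -1;
}

// Human-readable names for the few statuses relevant to this PR
// (names per the Bluetooth Core Specification error code list).
inline const char* describeHciStatus(int status) {
    switch (status) {
        case 0x08: return "connection timeout (supervision timeout)";
        case 0x13: return "remote user terminated connection";
        case 0x16: return "connection terminated by local host";
        default:   return "other";
    }
}
```

Under that assumption, `hciStatusFromReason(531)` yields 0x13, which is the remote-initiated disconnect described above rather than 0x08, the status a supervision timeout would produce.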

The primary fix lives in the phone app (it now suspends its telemetry-stall watchdog for the duration of the flash). These firmware changes are defense-in-depth so the link stays healthy regardless of the central's behaviour — protecting older app builds and OS-level teardown, which the app can't.

Changes

  • FastLink keepalive during OTA (fastlink_service.cpp): emit a ~1 Hz keepalive notify while OTA is in progress. The packet was already setValue()'d, so its advancing packet_id/uptime_ms register as telemetry progress on the app side with zero app changes. At 1 Hz against the 15 ms OTA connection interval it does not meaningfully slow the flash.
  • Longer OTA supervision timeout (ble_core.cpp, requestFastConnParams): 2 s → 8 s, so a multi-second flash-erase stall or a sluggish phone can't drop the link at the link-layer level. OTA_TIMEOUT_MS (30 s) remains the dead-link backstop.
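
The keepalive pacing described in the first bullet could look roughly like the gate below. This is a minimal sketch, not the code in fastlink_service.cpp: `KeepaliveGate`, its members, and the way the OTA flag is passed in are all hypothetical, and the actual notify call is left to the caller.

```cpp
#include <cstdint>

// Hypothetical helper: decides when a keepalive notify is due during OTA.
class KeepaliveGate {
public:
    explicit KeepaliveGate(uint32_t periodMs) : periodMs_(periodMs) {}

    // Returns true when a keepalive should be sent at time nowMs.
    // Outside OTA it always returns false (the normal telemetry stream is
    // assumed to run instead) and re-arms, so the next OTA sends immediately.
    bool shouldSend(bool otaInProgress, uint32_t nowMs) {
        if (!otaInProgress) {
            armed_ = false;
            return false;
        }
        // Unsigned subtraction stays correct across uint32_t wraparound.
        if (!armed_ || nowMs - lastSentMs_ >= periodMs_) {
            armed_ = true;
            lastSentMs_ = nowMs;
            return true;
        }
        return false;
    }

private:
    uint32_t periodMs_;
    uint32_t lastSentMs_ = 0;
    bool armed_ = false;
};
```

With a 1000 ms period, a caller polling this from the service loop would notify about once per second while OTA is active. That is roughly one extra packet per 66 connection events at a 15 ms interval, which is consistent with the claim that the flash is not meaningfully slowed.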

Verification

  • Builds clean: pio run -e OpenPPG-CESP32S3-CAN-SP140 (flash 36.9%, 1.23 MB image).
  • The companion app fix was verified on-device: a full 301-sector flash completed end-to-end (OTA Success → reboot → Boot complete) with no mid-flash disconnect.

Reliability over speed — the keepalive and longer timeout trade a negligible amount of flash throughput for a link that stays up across the whole flash.

🤖 Generated with Claude Code

…disconnect

During OTA the firmware suppresses the ~50 Hz FastLink telemetry stream to give
the flash full bandwidth. With no liveness signal a BLE central can tear down
the link mid-flash (HCI 0x13 / disconnect reason 531), aborting the update
partway through (~20-30%).

- fastlink_service.cpp: emit a ~1 Hz FastLink keepalive notify while OTA is in
  progress so the central keeps the link up. The keepalive ships the packet
  already setValue()'d, whose advancing packet_id/uptime_ms the app counts as
  telemetry progress (no app changes needed). At 1 Hz vs the 15 ms OTA interval
  it does not meaningfully slow the flash.
- ble_core.cpp (requestFastConnParams): lengthen the OTA-time supervision
  timeout from 2 s to 8 s so a multi-second flash-erase stall or a sluggish
  phone cannot drop the link at the link-layer level. OTA_TIMEOUT_MS (30 s)
  remains the dead-link backstop.

Reliability over speed: trades negligible flash throughput for a link that
stays up across the whole flash.
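
For reference, the timeout and interval figures above translate into Bluetooth LE connection-parameter units as sketched below. The unit sizes (1.25 ms per interval step, 10 ms per supervision-timeout step), the 100 ms to 32 s timeout range, and the validity rule come from the Bluetooth Core Specification; the helper names are hypothetical and this is not the body of requestFastConnParams.

```cpp
#include <cstdint>

// Bluetooth LE unit sizes: connection interval is expressed in 1.25 ms steps,
// supervision timeout in 10 ms steps.
constexpr uint32_t kIntervalUnitUs = 1250;
constexpr uint32_t kTimeoutUnitUs = 10000;

constexpr uint16_t intervalUnitsFromUs(uint32_t us) {
    return static_cast<uint16_t>(us / kIntervalUnitUs);
}

constexpr uint16_t timeoutUnitsFromMs(uint32_t ms) {
    return static_cast<uint16_t>(ms * 1000u / kTimeoutUnitUs);
}

// Checks the spec constraints: timeout within 100 ms..32 s (10..3200 units)
// and timeout > (1 + latency) * interval_max * 2.
constexpr bool supervisionTimeoutIsValid(uint16_t intervalMaxUnits,
                                         uint16_t peripheralLatency,
                                         uint16_t timeoutUnits) {
    return timeoutUnits >= 10 && timeoutUnits <= 3200 &&
           static_cast<uint64_t>(timeoutUnits) * kTimeoutUnitUs >
               (1ull + peripheralLatency) * intervalMaxUnits * kIntervalUnitUs * 2ull;
}
```

In these units the 15 ms OTA interval is 12, the old 2 s timeout is 200, and the new 8 s timeout is 800, which stays well inside the permitted range.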
@zjwhitehead

Member

Looks like this won't be required, since the app changes alone should work without the longer timeout and keepalive. Will need a little more testing.

Replace hard clamp and magic numbers in the climb-rate vario display with named constants. Introduce kVarioSegment (0.5 m/s per segment) and kVarioDeadzone (0.25 m/s) and compute sectionsToFill from kVarioSegment, capping at 6. Remove the previous ±0.6 m/s clamp and use the deadzone as the neutral threshold so values beyond ±3 m/s pin the gauge to full deflection rather than being clamped.
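
The segment computation described above might be sketched as follows. kVarioSegment and kVarioDeadzone use the values from the commit message; everything else is an assumption, including the `VarioGauge` struct, the `kVarioMaxSections` name, and in particular the use of truncation (floor) when converting climb rate to segments, which the commit message does not specify.

```cpp
#include <algorithm>
#include <cmath>

constexpr float kVarioSegment = 0.5f;    // m/s per gauge segment (from the commit message)
constexpr float kVarioDeadzone = 0.25f;  // m/s neutral threshold (from the commit message)
constexpr int kVarioMaxSections = 6;     // hypothetical name for the cap of 6

// Hypothetical result type: direction is +1 (climb), -1 (sink) or 0 (neutral).
struct VarioGauge {
    int direction;
    int sectionsToFill;
};

inline VarioGauge computeVarioGauge(float climbMps) {
    const float magnitude = std::fabs(climbMps);
    // Inside the deadzone (or NaN, which fails this comparison): neutral.
    if (!(magnitude >= kVarioDeadzone)) {
        return {0, 0};
    }
    // Cap in floating point first so huge or infinite inputs convert safely.
    const float segments = std::min(magnitude / kVarioSegment,
                                    static_cast<float>(kVarioMaxSections));
    // Assumption: partial segments are truncated rather than rounded.
    return {climbMps > 0.0f ? 1 : -1, static_cast<int>(segments)};
}
```

Under these assumptions a climb of 1.2 m/s fills 2 segments, and anything at or beyond ±3 m/s fills all 6.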
Reduce display redraws and SPI contention by introducing change-detecting LVGL setters and safer SPI handling. Key changes:

- Add resetLvglUpdateCache() and change-detecting helpers (setLabelText, setBgColor, etc.) to avoid redundant LVGL invalidations and per-frame full redraws.
- Rework many main-screen update paths (battery, power, altitude, climb rate, temps, icons) to compute desired state then diff-apply only changed widgets/styles.
- Add flushSkipped flag: if a display flush is skipped due to SPI busy, mark it and force a full invalidate on next refresh to recover stale pixels.
- Let LVGL read time directly (lv_tick_set_cb) and remove ad-hoc lv_tick_handler/lvgl_last_update; call lv_timer_handler() from updateLvgl().
- Move BMS SPI CS toggling to occur only after acquiring the shared SPI mutex and release the mutex immediately after the CAN library's update() (getters are read-only), preventing mid-transfer deselects and reducing wait times for display flushes.
- Increase UI task frequency to ~30 Hz (33 ms) to match LVGL refresh period.
- Update headers and tests to reflect removed variables/functions and new reset API.

These changes reduce unnecessary rendering, avoid lost mid-transfer display updates, and improve responsiveness under SPI contention.
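
The change-detecting setter idea can be illustrated without LVGL. In this sketch `FakeLabel` is a stand-in for a widget and `LabelCache` is a hypothetical cache; the real helpers (setLabelText, resetLvglUpdateCache) may be structured differently, and a real version would call LVGL's label setter where the stand-in is updated.

```cpp
#include <string>
#include <unordered_map>

// Stand-in for a widget: counts how often it is actually touched.
struct FakeLabel {
    std::string text;
    int invalidations = 0;
};

// Hypothetical cache that remembers the last text applied to each label.
class LabelCache {
public:
    // Applies `text` only if it differs from the last value written to this
    // label. Returns true when the widget was updated.
    bool setLabelText(FakeLabel& label, const std::string& text) {
        auto it = last_.find(&label);
        if (it != last_.end() && it->second == text) {
            return false;  // unchanged: skip the write and the invalidation
        }
        label.text = text;
        ++label.invalidations;
        last_[&label] = text;
        return true;
    }

    // Forget all cached values, e.g. after the screen is rebuilt, so the
    // next update writes every widget again.
    void reset() { last_.clear(); }

private:
    std::unordered_map<const FakeLabel*, std::string> last_;
};
```

Calling `setLabelText` twice with the same string touches the widget once; after `reset()` the same call writes again, which is the role a reset API would play when widgets are recreated.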