feat(alertd): tier check thresholds and report host capacity by passcod · Pull Request #442 · beyondessential/bestool

passcod · 2026-05-31T06:00:19Z

🤖 Phase 4 of consolidating monitoring (TODO #10, plan in docs/plans/healthchecks-into-alertd.md). Stacked on #441.

Re-tunes check triggering thresholds and adds host capacity to the status payload. Every check that represents a bad condition has a FAIL threshold; some now also have a lower WARN tier. Since Canopy runs its own alerting logic off the per-check results (and ignores the sweep's top-level healthy field), the warn-vs-fail distinction only affects whether bestool tamanu doctor shows DEGRADED or FAILING.
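
For illustration only, here is a minimal sketch of how per-check tiers could roll up into the label that doctor shows. It assumes the overall label is simply the worst tier across all checks; the `Tier` enum, `overall_label`, and the "HEALTHY" label are hypothetical names, not taken from the bestool-alertd code.

```rust
/// Hypothetical per-check result tier. The variant order matters:
/// deriving `Ord` gives Pass < Warn < Fail.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    Pass,
    Warn,
    Fail,
}

/// Assumed roll-up: the sweep is as bad as its worst check.
/// An empty sweep is treated as passing.
fn overall_label(tiers: &[Tier]) -> &'static str {
    match tiers.iter().copied().max().unwrap_or(Tier::Pass) {
        Tier::Pass => "HEALTHY", // hypothetical label for the all-pass case
        Tier::Warn => "DEGRADED",
        Tier::Fail => "FAILING",
    }
}
```

Under this assumption, a single WARN among passing checks yields DEGRADED, and any FAIL yields FAILING regardless of how many checks warn.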

Host capacity (for Canopy + to contextualise the load row):

- cpuCores (logical) and totalMemoryBytes added to the top-level status payload.
- The load row now reads load average: 2.50, 2.10, 1.90 (4 cores) and tiers the 5-minute average against the core count: WARN above 1.5×cores, FAIL above 4×cores.
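
As a rough sketch of the load tiering described above (not the code in this PR): the function and type names are hypothetical, and the guard against a zero core count is an added assumption.

```rust
/// Hypothetical result tier; the real crate's status type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Pass,
    Warn,
    Fail,
}

/// Tier the 5-minute load average against the logical core count:
/// WARN above 1.5x cores, FAIL above 4x cores (both strictly greater).
fn load_tier(load5: f64, cores: usize) -> Tier {
    // Assumption: treat a reported core count of 0 as 1 to keep thresholds sane.
    let cores = cores.max(1) as f64;
    if load5 > 4.0 * cores {
        Tier::Fail
    } else if load5 > 1.5 * cores {
        Tier::Warn
    } else {
        Tier::Pass
    }
}

/// Format the summary row in the shape quoted above
/// (1-, 5- and 15-minute averages, then the core count).
fn load_summary(load: (f64, f64, f64), cores: usize) -> String {
    format!(
        "load average: {:.2}, {:.2}, {:.2} ({} cores)",
        load.0, load.1, load.2, cores
    )
}
```

With 4 cores the WARN boundary sits at a 5-minute load of 6.0 and the FAIL boundary at 16.0, so the example row (2.10 on 4 cores) passes.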

Sync thresholds, tightened to the real cadence (sync ~60s, lookup ~20s — an hour stale is far too late):

- sync_lookup: WARN >2m stale, FAIL >5m (was >1h).
- sync_facility_stale: per-facility minutes-since-last-success, WARN >10m, FAIL >30m, keeping the 48h-active guard; a facility active in the last 48h that has never succeeded counts as a fail.
- sync_sessions (stuck): WARN >15m, FAIL >45m (was 1h/6h).
- sync_restart_loop: WARN ≥5/facility/hr, FAIL ≥10.
- sync_session_errors: WARN ≥1 recent error, FAIL ≥10.
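
To make the staleness tiers concrete, a minimal sketch of the age-based thresholds for sync_lookup and sync_facility_stale follows. The helper names are hypothetical and the real checks query the database; only the thresholds and the two "no data" rules (absent sync_lookup row passes as not-tracked, never-succeeded active facility fails) come from this PR.

```rust
use std::time::Duration;

/// Hypothetical result tier; the real crate's status type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Pass,
    Warn,
    Fail,
}

const fn minutes(m: u64) -> Duration {
    Duration::from_secs(m * 60)
}

/// Age-based tiering: an age strictly greater than a threshold trips it.
fn tier_by_age(age: Duration, warn_after: Duration, fail_after: Duration) -> Tier {
    if age > fail_after {
        Tier::Fail
    } else if age > warn_after {
        Tier::Warn
    } else {
        Tier::Pass
    }
}

/// sync_lookup: WARN >2m, FAIL >5m. `None` models an absent row,
/// which passes as not-tracked.
fn sync_lookup_tier(staleness: Option<Duration>) -> Tier {
    match staleness {
        None => Tier::Pass,
        Some(age) => tier_by_age(age, minutes(2), minutes(5)),
    }
}

/// sync_facility_stale, for a facility already known to be active in the
/// last 48h: WARN >10m, FAIL >30m. `None` models "never succeeded" and fails.
fn sync_facility_stale_tier(since_last_success: Option<Duration>) -> Tier {
    match since_last_success {
        None => Tier::Fail,
        Some(age) => tier_by_age(age, minutes(10), minutes(30)),
    }
}
```

The two checks deliberately treat missing data in opposite ways, which is why the sketch keeps them as separate wrappers around one shared comparison.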

Other tiers:

- error-stream checks (fhir_job_errors, certificate_notification_errors, ips_errors, patient_communication_errors, report_errors): WARN ≥1, FAIL ≥10 in the window, via a generalised tiered_rows_check helper.
- fhir_service_requests_unresolved: WARN >1h, FAIL >6h.
- kopia_backup: add WARN >12h (FAIL stays >24h); uptime: WARN on a reboot under 10m ago; db_connect: WARN on >1s connect latency; tamanu_http: WARN on >2s response.
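
The count-based tiers can be sketched as below. This is a hypothetical stand-in, not the actual tiered_rows_check helper (whose signature is not shown here); it only illustrates that the count thresholds are inclusive (≥), unlike the strictly-greater age and latency thresholds.

```rust
/// Hypothetical result tier; the real crate's status type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Pass,
    Warn,
    Fail,
}

/// Count-based tiering with inclusive thresholds:
/// `rows >= fail_at` fails, `rows >= warn_at` warns, otherwise passes.
fn tier_by_count(rows: usize, warn_at: usize, fail_at: usize) -> Tier {
    if rows >= fail_at {
        Tier::Fail
    } else if rows >= warn_at {
        Tier::Warn
    } else {
        Tier::Pass
    }
}

/// Error-stream checks: WARN at 1 or more rows in the window, FAIL at 10 or more.
fn error_stream_tier(rows: usize) -> Tier {
    tier_by_count(rows, 1, 10)
}
```

The same shape would cover sync_restart_loop with thresholds of 5 and 10 restarts per facility per hour, for example `tier_by_count(restarts, 5, 10)`.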

Left as-is: disk/memory/http_errors/fhir_jobs/caddy/migrations/time_sync and the binary connectivity checks.

Threshold boundaries are covered by pure-function unit tests; the DB-backed tests assert each check returns a valid status against the local central database.

Add cpuCores (logical CPU count via sysinfo) and totalMemoryBytes to the top-level status payload, populated directly in gather(). Tier the load check on the 5-minute average relative to core count (WARN >1.5x cores, FAIL >4x cores) and surface the core count in its summary and details.

Co-authored-by: Claude <noreply@anthropic.com>

Re-tune the migrated checks to the real sync cadence and add WARN tiers:

- sync_lookup: query staleness unfiltered, tier in Rust (WARN >2m, FAIL >5m); absent row passes as not-tracked.
- sync_facility_stale: single query of minutes-since-last-success per facility active in the last 48h (WARN >10m, FAIL >30m).
- sync_restart_loop: SQL HAVING >=5, tier per facility (WARN >=5, FAIL >=10 restarts/hr).
- sync_session_errors: tier on combined row count (WARN >=1, FAIL >=10).
- sync_sessions: tighten stuck thresholds to WARN >15m, FAIL >45m.
- fhir_service_requests_unresolved: tier per request (WARN >1h, FAIL >6h).
- error-stream checks (fhir_job_errors, certificate_notification_errors, ips_errors, patient_communication_errors, report_errors): WARN >=1, FAIL >=10 via a generalised tiered_rows_check helper replacing fail_if_any_rows.
- kopia_backup: add WARN >12h. uptime: WARN <10m. db_connect: WARN latency >1s. tamanu_http: WARN latency >2s.

Co-authored-by: Claude <noreply@anthropic.com>

All four phases shipped: doctor subsystem moved into bestool-alertd, the 16 YAML alerts migrated to checks, the YAML alert engine + standalone CLI retired, and check thresholds tiered with host capacity reported. windows_service was kept (the daemon still runs as a Windows service) and cpuCores/totalMemoryBytes were added to the status payload.

Co-authored-by: Claude <noreply@anthropic.com>

passcod mentioned this pull request May 31, 2026

fix(tamanu): find caddy at C:\Caddy\caddy.exe when not on PATH #443

Merged

passcod and others added 3 commits June 2, 2026 16:44

passcod force-pushed the phase4-threshold-tuning branch from 91de5ab to 3ec22f5 Compare June 2, 2026 04:46

passcod force-pushed the phase3-retire-alert-engine branch from 776bf40 to 1422654 Compare June 2, 2026 04:46

passcod mentioned this pull request Jun 2, 2026

chore: clippy round across the workspace #449

Merged

Base automatically changed from phase3-retire-alert-engine to phase1-doctor-into-alertd June 2, 2026 14:46

passcod merged commit b746b5a into phase1-doctor-into-alertd Jun 2, 2026
7 of 16 checks passed

passcod deleted the phase4-threshold-tuning branch June 2, 2026 14:46
