feat(alertd): tier check thresholds and report host capacity #442

Merged
passcod merged 3 commits into phase1-doctor-into-alertd from phase4-threshold-tuning
Jun 2, 2026

passcod commented May 31, 2026

🤖 Phase 4 of consolidating monitoring (TODO #10, plan in docs/plans/healthchecks-into-alertd.md). Stacked on #441.

Re-tunes the thresholds at which checks trigger and adds host capacity to the status payload. Every check that represents a bad condition has a FAIL threshold; some now also have a lower WARN tier. Because Canopy runs its own alerting logic off the per-check results (and ignores the sweep's top-level healthy field), warn-vs-fail only affects whether bestool tamanu doctor shows DEGRADED or FAILING.
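For illustration only, the warn-vs-fail distinction can be sketched as a worst-status rollup. The `Status` and `Overall` types and the `overall` function are hypothetical names for this sketch, and the assumption that the headline follows the worst per-check result is mine; the actual doctor code may differ.

```rust
/// Result tier of a single check (hypothetical type for this sketch).
/// Variant order matters: the derived `Ord` makes Pass < Warn < Fail.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

/// Headline for a whole sweep (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Overall {
    Healthy,
    Degraded,
    Failing,
}

/// Assumption: the headline is driven by the worst per-check status,
/// so any FAIL shows FAILING and otherwise any WARN shows DEGRADED.
pub fn overall(statuses: &[Status]) -> Overall {
    match statuses.iter().copied().max() {
        Some(Status::Fail) => Overall::Failing,
        Some(Status::Warn) => Overall::Degraded,
        Some(Status::Pass) | None => Overall::Healthy,
    }
}
```

Under this sketch, adding a WARN tier to a check changes only which headline is shown; the per-check results that Canopy consumes are unchanged.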

Host capacity (for Canopy + to contextualise the load row):

  • cpuCores (logical) and totalMemoryBytes added to the top-level status payload.
  • The load row now reads load average: 2.50, 2.10, 1.90 (4 cores) and tiers the 5-minute average against the core count: WARN above 1.5×cores, FAIL above 4×cores.
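A minimal sketch of the load tiering described above, assuming the 5-minute average and the logical core count have already been gathered. The names `load_status` and `load_summary` are hypothetical, and the guard against a zero core count is an assumption of this sketch, not something the PR states.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

/// Tier the 5-minute load average against the logical core count:
/// WARN above 1.5x cores, FAIL above 4x cores (strictly greater than).
pub fn load_status(load5: f64, cores: usize) -> Status {
    // Assumption: treat a reported core count of 0 as 1 so the
    // thresholds never collapse to zero.
    let cores = cores.max(1) as f64;
    if load5 > 4.0 * cores {
        Status::Fail
    } else if load5 > 1.5 * cores {
        Status::Warn
    } else {
        Status::Pass
    }
}

/// Render the load row in the shape quoted in the description.
pub fn load_summary(load: [f64; 3], cores: usize) -> String {
    format!(
        "load average: {:.2}, {:.2}, {:.2} ({} cores)",
        load[0], load[1], load[2], cores
    )
}
```

With 4 cores the WARN boundary sits at 6.0 and the FAIL boundary at 16.0, so the example row (5-minute average 2.10) passes.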

Sync thresholds, tightened to the real cadence (sync ~60s, lookup ~20s — an hour stale is far too late):

  • sync_lookup: WARN >2m stale, FAIL >5m (was >1h).
  • sync_facility_stale: per-facility minutes-since-last-success, WARN >10m, FAIL >30m, keeping the 48h-active guard; a facility active in the last 48h that has never succeeded counts as a fail.
  • sync_sessions (stuck): WARN >15m, FAIL >45m (was 1h/6h).
  • sync_restart_loop: WARN ≥5/facility/hr, FAIL ≥10.
  • sync_session_errors: WARN ≥1 recent error, FAIL ≥10.
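The staleness tiers above could be expressed roughly as follows. This is a sketch under stated assumptions: `Tiers`, `lookup_status` and `facility_status` are hypothetical names, and the 48h-active filter is assumed to have been applied by the query before these functions are called. The "absent row passes" behaviour for sync_lookup follows the commit message; the "never succeeded counts as a fail" rule follows the description.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

/// A pair of strictly-greater-than age thresholds (hypothetical type).
pub struct Tiers {
    warn: Duration,
    fail: Duration,
}

impl Tiers {
    pub fn classify(&self, age: Duration) -> Status {
        if age > self.fail {
            Status::Fail
        } else if age > self.warn {
            Status::Warn
        } else {
            Status::Pass
        }
    }
}

/// sync_lookup: WARN >2m stale, FAIL >5m.
pub const SYNC_LOOKUP: Tiers = Tiers {
    warn: Duration::from_secs(2 * 60),
    fail: Duration::from_secs(5 * 60),
};

/// sync_facility_stale: WARN >10m, FAIL >30m since last success.
pub const SYNC_FACILITY_STALE: Tiers = Tiers {
    warn: Duration::from_secs(10 * 60),
    fail: Duration::from_secs(30 * 60),
};

/// An absent lookup row passes as not-tracked.
pub fn lookup_status(staleness: Option<Duration>) -> Status {
    staleness.map_or(Status::Pass, |age| SYNC_LOOKUP.classify(age))
}

/// An active facility that has never succeeded counts as a fail.
pub fn facility_status(since_last_success: Option<Duration>) -> Status {
    since_last_success.map_or(Status::Fail, |age| SYNC_FACILITY_STALE.classify(age))
}
```

The two `Option` cases deliberately resolve in opposite directions: a missing lookup row means the feature is not in use, while a missing facility success means sync has never worked for a facility known to be active.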

Other tiers:

  • error-stream checks (fhir_job_errors, certificate_notification_errors, ips_errors, patient_communication_errors, report_errors): WARN ≥1, FAIL ≥10 in the window, via a generalised tiered_rows_check helper.
  • fhir_service_requests_unresolved: WARN >1h, FAIL >6h.
  • kopia_backup: add WARN >12h (FAIL stays >24h).
  • uptime: WARN on a reboot under 10m ago.
  • db_connect: WARN on >1s connect latency.
  • tamanu_http: WARN on >2s response.
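The count-based tiers used by the error-stream checks can be sketched as below. The real tiered_rows_check helper's signature is not shown in this PR description, so `tier_row_count` and `error_stream_status` are hypothetical stand-ins that only illustrate the inclusive WARN ≥1 / FAIL ≥10 comparison.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

/// Tier a row count against inclusive thresholds (hypothetical helper,
/// not the actual tiered_rows_check signature).
pub fn tier_row_count(rows: usize, warn_at: usize, fail_at: usize) -> Status {
    if rows >= fail_at {
        Status::Fail
    } else if rows >= warn_at {
        Status::Warn
    } else {
        Status::Pass
    }
}

/// Error-stream checks: WARN on 1 or more rows in the window, FAIL on 10 or more.
pub fn error_stream_status(rows: usize) -> Status {
    tier_row_count(rows, 1, 10)
}
```

Note that these thresholds are inclusive (≥), unlike the age and latency tiers, which are strict (>).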

Left as-is: disk/memory/http_errors/fhir_jobs/caddy/migrations/time_sync and the binary connectivity checks.

Threshold boundaries are covered by pure-function unit tests; the DB-backed tests assert each check returns a valid status against the local central database.
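As an illustration of what a pure-function boundary test might look like, here is a sketch using the kopia_backup tiers (WARN >12h, FAIL >24h). The function name `backup_status` and the test layout are assumptions for this sketch, not the tests in this PR.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

const WARN_AFTER: Duration = Duration::from_secs(12 * 60 * 60);
const FAIL_AFTER: Duration = Duration::from_secs(24 * 60 * 60);

/// Tier the age of the most recent backup (hypothetical pure function).
pub fn backup_status(age: Duration) -> Status {
    if age > FAIL_AFTER {
        Status::Fail
    } else if age > WARN_AFTER {
        Status::Warn
    } else {
        Status::Pass
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn boundaries_are_strict() {
        let second = Duration::from_secs(1);
        // Exactly on a threshold stays in the lower tier.
        assert_eq!(backup_status(WARN_AFTER), Status::Pass);
        assert_eq!(backup_status(WARN_AFTER + second), Status::Warn);
        assert_eq!(backup_status(FAIL_AFTER), Status::Warn);
        assert_eq!(backup_status(FAIL_AFTER + second), Status::Fail);
    }
}
```

Keeping the tiering in a function that takes only the measured value lets the boundaries be tested without a database or a running host.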

passcod and others added 3 commits June 2, 2026 16:44
Add cpuCores (logical CPU count via sysinfo) and totalMemoryBytes to the
top-level status payload, populated directly in gather(). Tier the load
check on the 5-minute average relative to core count (WARN >1.5x cores,
FAIL >4x cores) and surface the core count in its summary and details.

Co-authored-by: Claude <noreply@anthropic.com>

Re-tune the migrated checks to the real sync cadence and add WARN tiers:

- sync_lookup: query staleness unfiltered, tier in Rust (WARN >2m, FAIL
  >5m); absent row passes as not-tracked.
- sync_facility_stale: single query of minutes-since-last-success per
  facility active in the last 48h (WARN >10m, FAIL >30m).
- sync_restart_loop: SQL HAVING >=5, tier per facility (WARN >=5, FAIL
  >=10 restarts/hr).
- sync_session_errors: tier on combined row count (WARN >=1, FAIL >=10).
- sync_sessions: tighten stuck thresholds to WARN >15m, FAIL >45m.
- fhir_service_requests_unresolved: tier per request (WARN >1h, FAIL >6h).
- error-stream checks (fhir_job_errors, certificate_notification_errors,
  ips_errors, patient_communication_errors, report_errors): WARN >=1,
  FAIL >=10 via a generalised tiered_rows_check helper replacing
  fail_if_any_rows.
- kopia_backup: add WARN >12h. uptime: WARN <10m. db_connect: WARN
  latency >1s. tamanu_http: WARN latency >2s.

Co-authored-by: Claude <noreply@anthropic.com>

All four phases shipped: doctor subsystem moved into bestool-alertd, the 16
YAML alerts migrated to checks, the YAML alert engine + standalone CLI retired,
and check thresholds tiered with host capacity reported. windows_service was
kept (the daemon still runs as a Windows service) and cpuCores/totalMemoryBytes
were added to the status payload.

Co-authored-by: Claude <noreply@anthropic.com>
passcod force-pushed the phase4-threshold-tuning branch from 91de5ab to 3ec22f5 June 2, 2026 04:46
passcod force-pushed the phase3-retire-alert-engine branch from 776bf40 to 1422654 June 2, 2026 04:46
Base automatically changed from phase3-retire-alert-engine to phase1-doctor-into-alertd June 2, 2026 14:46
passcod merged commit b746b5a into phase1-doctor-into-alertd Jun 2, 2026
7 of 16 checks passed
passcod deleted the phase4-threshold-tuning branch June 2, 2026 14:46