feat(alertd): tier check thresholds and report host capacity #442
Merged
Add cpuCores (logical CPU count via sysinfo) and totalMemoryBytes to the top-level status payload, populated directly in gather(). Tier the load check on the 5-minute average relative to core count (WARN >1.5x cores, FAIL >4x cores) and surface the core count in its summary and details.

Co-authored-by: Claude <noreply@anthropic.com>
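The load tiering described in this commit can be sketched as a pure function. This is a minimal illustration, not the code from the PR: the `Status` enum, `tier_load`, and `load_summary` names are hypothetical, and clamping a zero core count to one is an assumption made here to avoid a zero threshold.

```rust
/// Outcome tiers, declared in order of severity so a worse status compares greater.
/// (Hypothetical type for illustration; the real check result type may differ.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

/// Tier the 5-minute load average against the logical core count:
/// WARN strictly above 1.5x cores, FAIL strictly above 4x cores.
pub fn tier_load(load5: f64, cpu_cores: usize) -> Status {
    // Assumption: a reported core count of zero is treated as one core.
    let cores = cpu_cores.max(1) as f64;
    if load5 > 4.0 * cores {
        Status::Fail
    } else if load5 > 1.5 * cores {
        Status::Warn
    } else {
        Status::Pass
    }
}

/// Build a summary line in the shape quoted in the PR description.
pub fn load_summary(one: f64, five: f64, fifteen: f64, cpu_cores: usize) -> String {
    format!("load average: {one:.2}, {five:.2}, {fifteen:.2} ({cpu_cores} cores)")
}
```

With 4 cores, a 5-minute average of exactly 6.0 still passes, since the comparison is strict; anything above 6.0 warns and anything above 16.0 fails.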
Re-tune the migrated checks to the real sync cadence and add WARN tiers:

- sync_lookup: query staleness unfiltered, tier in Rust (WARN >2m, FAIL >5m); absent row passes as not-tracked.
- sync_facility_stale: single query of minutes-since-last-success per facility active in the last 48h (WARN >10m, FAIL >30m).
- sync_restart_loop: SQL HAVING >=5, tier per facility (WARN >=5, FAIL >=10 restarts/hr).
- sync_session_errors: tier on combined row count (WARN >=1, FAIL >=10).
- sync_sessions: tighten stuck thresholds to WARN >15m, FAIL >45m.
- fhir_service_requests_unresolved: tier per request (WARN >1h, FAIL >6h).
- error-stream checks (fhir_job_errors, certificate_notification_errors, ips_errors, patient_communication_errors, report_errors): WARN >=1, FAIL >=10 via a generalised tiered_rows_check helper replacing fail_if_any_rows.
- kopia_backup: add WARN >12h. uptime: WARN <10m. db_connect: WARN latency >1s. tamanu_http: WARN latency >2s.

Co-authored-by: Claude <noreply@anthropic.com>
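The count-based tiers in this commit (WARN >=1 / FAIL >=10 for error streams, WARN >=5 / FAIL >=10 per facility for restart loops) might look roughly like the sketch below. The actual `tiered_rows_check` signature is not shown in this PR page, so `RowTiers`, `tier_row_count`, and `worst` are hypothetical names used only for illustration.

```rust
/// Outcome tiers, declared in order of severity so `max()` picks the worst.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

/// Inclusive row-count thresholds (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
pub struct RowTiers {
    pub warn_at: usize,
    pub fail_at: usize,
}

/// Error-stream checks: WARN >=1, FAIL >=10.
pub const ERROR_STREAM: RowTiers = RowTiers { warn_at: 1, fail_at: 10 };
/// Restart loops, per facility per hour: WARN >=5, FAIL >=10.
pub const RESTART_LOOP: RowTiers = RowTiers { warn_at: 5, fail_at: 10 };

/// Map a row (or restart) count onto a tier using inclusive thresholds.
pub fn tier_row_count(rows: usize, tiers: RowTiers) -> Status {
    if rows >= tiers.fail_at {
        Status::Fail
    } else if rows >= tiers.warn_at {
        Status::Warn
    } else {
        Status::Pass
    }
}

/// Combine per-facility (or per-request) tiers into one result: the worst wins.
/// An empty input passes, mirroring a query that returned no offending rows.
pub fn worst(statuses: impl IntoIterator<Item = Status>) -> Status {
    statuses.into_iter().max().unwrap_or(Status::Pass)
}
```

For the restart-loop check, the commit filters in SQL with HAVING >=5, so counts below the WARN threshold would never reach a function like this; the `Pass` branch is kept here only so the helper stays general.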
All four phases shipped: doctor subsystem moved into bestool-alertd, the 16 YAML alerts migrated to checks, the YAML alert engine + standalone CLI retired, and check thresholds tiered with host capacity reported. windows_service was kept (the daemon still runs as a Windows service) and cpuCores/totalMemoryBytes were added to the status payload.

Co-authored-by: Claude <noreply@anthropic.com>
91de5ab to 3ec22f5
776bf40 to 1422654
Base automatically changed from phase3-retire-alert-engine to phase1-doctor-into-alertd June 2, 2026 14:46
🤖 Phase 4 of consolidating monitoring (TODO #10, plan in `docs/plans/healthchecks-into-alertd.md`). Stacked on #441.

Re-tunes check triggering thresholds and adds host capacity to the status payload. Every check that represents a bad condition has a FAIL threshold; some now also have a lower WARN tier. Since Canopy runs its own alerting logic off the per-check results (and ignores the sweep's top-level `healthy`), warn-vs-fail only affects what `bestool tamanu doctor` shows as DEGRADED vs FAILING.

Host capacity (for Canopy + to contextualise the load row):

- `cpuCores` (logical) and `totalMemoryBytes` added to the top-level status payload.
- `load` row now reads `load average: 2.50, 2.10, 1.90 (4 cores)` and tiers the 5-minute average against the core count: WARN above 1.5×cores, FAIL above 4×cores.

Sync thresholds, tightened to the real cadence (sync ~60s, lookup ~20s; an hour stale is far too late):

- `sync_lookup`: WARN >2m stale, FAIL >5m (was >1h).
- `sync_facility_stale`: per-facility minutes-since-last-success, WARN >10m, FAIL >30m, keeping the 48h-active guard; a facility active in the last 48h that has never succeeded counts as a fail.
- `sync_sessions` (stuck): WARN >15m, FAIL >45m (was 1h/6h).
- `sync_restart_loop`: WARN ≥5/facility/hr, FAIL ≥10.
- `sync_session_errors`: WARN ≥1 recent error, FAIL ≥10.

Other tiers:

- Error-stream checks (`fhir_job_errors`, `certificate_notification_errors`, `ips_errors`, `patient_communication_errors`, `report_errors`): WARN ≥1, FAIL ≥10 in the window, via a generalised `tiered_rows_check` helper.
- `fhir_service_requests_unresolved`: WARN >1h, FAIL >6h.
- `kopia_backup`: add WARN >12h (FAIL stays >24h); `uptime`: WARN on a reboot under 10m ago; `db_connect`: WARN on >1s connect latency; `tamanu_http`: WARN on >2s response.

Left as-is: disk/memory/http_errors/fhir_jobs/caddy/migrations/time_sync and the binary connectivity checks.

Threshold boundaries are covered by pure-function unit tests; the DB-backed tests assert each check returns a valid status against the local central database.
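To illustrate what a pure-function boundary test could exercise, here is a minimal sketch of the staleness tiers for `sync_lookup` and `sync_facility_stale`. The names `AgeTiers`, `tier_age`, `tier_sync_lookup`, and `tier_facility_stale` are hypothetical and not taken from the PR; only the thresholds and the handling of missing data come from the description above.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Warn,
    Fail,
}

/// Exclusive age thresholds: a value must be strictly greater to trip a tier.
/// (Hypothetical helper type for illustration.)
pub struct AgeTiers {
    pub warn_over: Duration,
    pub fail_over: Duration,
}

/// sync_lookup: WARN >2m stale, FAIL >5m.
pub const SYNC_LOOKUP: AgeTiers = AgeTiers {
    warn_over: Duration::from_secs(2 * 60),
    fail_over: Duration::from_secs(5 * 60),
};

/// sync_facility_stale: WARN >10m, FAIL >30m since last success.
pub const FACILITY_STALE: AgeTiers = AgeTiers {
    warn_over: Duration::from_secs(10 * 60),
    fail_over: Duration::from_secs(30 * 60),
};

/// Map an age onto a tier using strict "greater than" comparisons.
pub fn tier_age(age: Duration, tiers: &AgeTiers) -> Status {
    if age > tiers.fail_over {
        Status::Fail
    } else if age > tiers.warn_over {
        Status::Warn
    } else {
        Status::Pass
    }
}

/// An absent lookup row passes as "not tracked".
pub fn tier_sync_lookup(staleness: Option<Duration>) -> Status {
    staleness.map_or(Status::Pass, |age| tier_age(age, &SYNC_LOOKUP))
}

/// A facility active in the last 48h with no success on record counts as a fail.
/// (Assumes the caller has already applied the 48h-active guard.)
pub fn tier_facility_stale(since_last_success: Option<Duration>) -> Status {
    since_last_success.map_or(Status::Fail, |age| tier_age(age, &FACILITY_STALE))
}
```

Keeping the tiering separate from the database query is what makes the exact boundaries (120s vs 121s, 300s vs 301s) testable without a database, while the two `Option` wrappers make the opposite treatment of missing data explicit.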