feat(alertd): port Tamanu YAML alerts to healthchecks #440

Open
passcod wants to merge 3 commits into phase1-doctor-into-alertd from phase2-migrate-alerts

passcod commented May 30, 2026

🤖 Phase 2 of consolidating monitoring (TODO #10, plan in docs/plans/healthchecks-into-alertd.md). Stacked on #438.

Ports Tamanu's production YAML alerts into doctor healthchecks. Each migrated check is central-only (it skips on facility servers), skips when the DB is unavailable, and emits FAIL when its condition is met. There is a single severity: canopy owns alerting and decides what to do with the per-check result. The offending rows are attached to the check's details, bounded: every query is wrapped as SELECT to_jsonb(sub) AS row FROM ( <sql> ) sub LIMIT 101, so Postgres never hands back more than 101 rows and at most 100 of them are reported (with a truncated flag).
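
To make the bounding concrete, here is a minimal sketch of the wrap-and-truncate idea in Python. It is illustrative only and not the code in this PR: `wrap_query`, `bound_rows`, and `ROW_LIMIT` are hypothetical names, and treating the 101st row purely as a truncation signal is an assumption drawn from the description above.

```python
ROW_LIMIT = 100  # rows reported per check, per the description above


def wrap_query(sql: str, limit: int = ROW_LIMIT) -> str:
    """Wrap a check's SQL so each row comes back as one jsonb value.

    The query asks for limit + 1 rows so the caller can tell
    "exactly limit rows" apart from "more than limit rows".
    """
    inner = sql.strip().rstrip(";")  # a trailing semicolon would break the subquery
    return f"SELECT to_jsonb(sub) AS row FROM ( {inner} ) sub LIMIT {limit + 1}"


def bound_rows(rows: list, limit: int = ROW_LIMIT) -> dict:
    """Keep at most `limit` rows and flag whether anything was cut off."""
    return {"rows": rows[:limit], "truncated": len(rows) > limit}
```

With this shape, a query that matches exactly 100 rows reports all of them with `truncated` false, while any larger result reports the first 100 with `truncated` true.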

Ten new checks:

  • Recent-error (1h lookback, tunable in the threshold-review phase): certificate_notification_errors, ips_errors, patient_communication_errors, report_errors, fhir_job_errors
  • Sync: sync_session_errors (mobile + server, with the benign-error exclusions), sync_facility_stale (facilities not syncing / no recent success), sync_lookup (lookup table stale — also closes the standalone sync_lookup TODO), sync_restart_loop
  • FHIR: fhir_service_requests_unresolved

The fhir queue-size / stuck-job alerts and sync-long are already covered by the existing fhir_jobs and sync_sessions checks, so they're not duplicated.

Each check has DB-backed tests against the local tamanu-central database (asserting the SQL runs against the real schema) and a facility test asserting it skips. All 12 distinct queries run cleanly against the real central schema.

Note: the four generic error-table checks were server-kind: unset in the original YAML; they're gated central-only here (those tables are central concepts). The lookback/window constants are intentionally conservative and will be tuned in the threshold-review phase.
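
The gating every migrated check shares (central-only, skip when the DB is unavailable, FAIL on any matching row) can be sketched as follows. This is a hedged Python illustration, not the PR's implementation: `Status`, `CheckResult`, `run_check`, the `"central"` kind string, and the PASS outcome for an empty result are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PASS = "pass"  # assumed outcome when no rows match
    FAIL = "fail"
    SKIP = "skip"


@dataclass
class CheckResult:
    status: Status
    reason: str = ""
    details: dict = field(default_factory=dict)


def run_check(
    server_kind: str,
    fetch_rows: Optional[Callable[[], list]],
    limit: int = 100,
) -> CheckResult:
    """Decide one check's outcome.

    `fetch_rows` is a hypothetical callable returning the offending rows;
    None stands in for "the database is unavailable".
    """
    if server_kind != "central":
        return CheckResult(Status.SKIP, "central-only check")
    if fetch_rows is None:
        return CheckResult(Status.SKIP, "database unavailable")
    rows = fetch_rows()
    if not rows:
        return CheckResult(Status.PASS)
    # Single severity: any matching row fails the check, rows attached as details.
    details = {"rows": rows[:limit], "truncated": len(rows) > limit}
    return CheckResult(Status.FAIL, f"{len(details['rows'])} offending row(s)", details)
```

Keeping the decision in one place like this means each check only has to supply its SQL, which matches how the PR describes the ten checks as differing only in their queries and lookback windows.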

passcod and others added 3 commits May 30, 2026 17:57
Add a shared row-fetching helper and the five recent-error checks ported
from the certificate-notification, ips, patient-communications, report,
and fhir YAML alerts. Each is central-only, skips when the DB is
unavailable, and fails when any matching row exists within the lookback
window.

Co-authored-by: Claude <noreply@anthropic.com>
…kup)

Add sync_session_errors (mobile + server, benign-error exclusions baked
in), sync_facility_stale (not-syncing + no-recent-success), sync_lookup
(lookup-table staleness, closes TODO #8), and sync_restart_loop. All
central-only, skip when the DB is unavailable, and fail on any matching
rows.

Co-authored-by: Claude <noreply@anthropic.com>
… a check

Add fhir_service_requests_unresolved, central-only, skipping when the DB
is unavailable and failing when a lab-linked FHIR service request has
stayed unresolved for over an hour.

Co-authored-by: Claude <noreply@anthropic.com>