feat(alertd): port Tamanu YAML alerts to healthchecks #440
Open
passcod wants to merge 3 commits into
Add a shared row-fetching helper and the five recent-error checks ported from the certificate-notification, ips, patient-communications, report, and fhir YAML alerts. Each is central-only, skips when the DB is unavailable, and fails when any matching row exists within the lookback window. Co-authored-by: Claude <noreply@anthropic.com>
…kup) Add sync_session_errors (mobile + server, benign-error exclusions baked in), sync_facility_stale (not-syncing + no-recent-success), sync_lookup (lookup-table staleness, closes TODO #8), and sync_restart_loop. All central-only, skip when the DB is unavailable, and fail on any matching rows. Co-authored-by: Claude <noreply@anthropic.com>
… a check Add fhir_service_requests_unresolved, central-only, skipping when the DB is unavailable and failing when a lab-linked FHIR service request has stayed unresolved for over an hour. Co-authored-by: Claude <noreply@anthropic.com>
🤖 Phase 2 of consolidating monitoring (TODO #10, plan in `docs/plans/healthchecks-into-alertd.md`). Stacked on #438.

Ports Tamanu's production YAML alerts into doctor healthchecks. Each migrated check is central-only (skips on facility), skips when the DB is unavailable, and emits FAIL when its condition is met (a single severity: canopy owns alerting and decides what to do with the per-check result). The offending rows are attached to the check's details, bounded: every query is wrapped as `SELECT to_jsonb(sub) AS row FROM ( <sql> ) sub LIMIT 101` so at most 100 rows are reported (with a `truncated` flag) and Postgres never returns more than 101.

Ten new checks:

- `certificate_notification_errors`, `ips_errors`, `patient_communication_errors`, `report_errors`, `fhir_job_errors`
- `sync_session_errors` (mobile + server, with the benign-error exclusions), `sync_facility_stale` (facilities not syncing / no recent success), `sync_lookup` (lookup table stale; also closes the standalone sync_lookup TODO), `sync_restart_loop`
- `fhir_service_requests_unresolved`

The fhir queue-size / stuck-job alerts and sync-long are already covered by the existing `fhir_jobs` and `sync_sessions` checks, so they're not duplicated.

Each check has DB-backed tests against the local `tamanu-central` database (asserting the SQL runs against the real schema) and a facility test asserting it skips. All 12 distinct queries run cleanly against the real central schema.

Note: the four generic error-table checks were `server-kind: unset` in the original YAML; they're gated central-only here (those tables are central concepts). The lookback/window constants are intentionally conservative and will be tuned in the threshold-review phase.
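
For reviewers, here is a rough sketch of the bounded-query wrapping and the skip/fail gating described above. It is not the code in this PR: `Outcome`, `wrap_bounded`, `bound_rows`, and `evaluate` are hypothetical names, rows are plain strings rather than JSON values, and passing when no rows match is an assumption. It uses only the standard library and assumes the project is written in Rust.

```rust
/// Maximum number of offending rows attached to a check's details.
const MAX_REPORTED_ROWS: usize = 100;

/// Hypothetical outcome type; the real healthcheck types may differ.
#[derive(Debug, PartialEq)]
enum Outcome {
    Skip(&'static str),
    Pass, // assumption: a check with no matching rows passes
    Fail { rows: Vec<String>, truncated: bool },
}

/// Wrap a check's SQL so Postgres returns each row as JSON and
/// never more than MAX_REPORTED_ROWS + 1 rows.
fn wrap_bounded(sql: &str) -> String {
    let limit = MAX_REPORTED_ROWS + 1;
    format!("SELECT to_jsonb(sub) AS row FROM ( {sql} ) sub LIMIT {limit}")
}

/// Keep at most MAX_REPORTED_ROWS rows. The extra row fetched by the
/// LIMIT only serves to detect that the result was truncated.
fn bound_rows(mut rows: Vec<String>) -> (Vec<String>, bool) {
    let truncated = rows.len() > MAX_REPORTED_ROWS;
    rows.truncate(MAX_REPORTED_ROWS);
    (rows, truncated)
}

/// Decide a check's outcome. `rows` is `None` when the DB is unavailable.
fn evaluate(is_central: bool, rows: Option<Vec<String>>) -> Outcome {
    if !is_central {
        return Outcome::Skip("not a central server");
    }
    match rows {
        None => Outcome::Skip("database unavailable"),
        Some(rows) if rows.is_empty() => Outcome::Pass,
        Some(rows) => {
            let (rows, truncated) = bound_rows(rows);
            Outcome::Fail { rows, truncated }
        }
    }
}
```

Fetching 101 rows rather than 100 lets the caller tell "exactly 100 matches" apart from "more than 100" without running a separate count query.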