From e1ecfed623f868613c3e4a5df3e13a4d290aac22 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sat, 30 May 2026 17:24:18 +1200 Subject: [PATCH 01/12] plan: consolidate healthchecks into alertd, retire YAML alert engine Approved plan for TODO #10: invert the crate relationship so bestool-alertd owns the doctor framework + checks (calling bestool-tamanu for common domain code), migrate the 16 YAML alerts to checks (default FAIL, canopy owns alerting), retire the YAML alert engine + standalone CLI, then review thresholds across all checks. Co-authored-by: Claude --- docs/plans/healthchecks-into-alertd.md | 72 ++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) create mode 100644 docs/plans/healthchecks-into-alertd.md diff --git a/docs/plans/healthchecks-into-alertd.md b/docs/plans/healthchecks-into-alertd.md new file mode 100644 index 00000000..edcf65b6 --- /dev/null +++ b/docs/plans/healthchecks-into-alertd.md @@ -0,0 +1,72 @@ +# Plan: consolidate healthchecks into alertd, migrate YAML alerts to checks, retire the alert engine (TODO #10) + +## Context + +Two monitoring systems run in parallel: + +- **Healthchecks** (doctor): code-defined `Check`s run as a concurrent "sweep" → pass/warning/fail + JSON details, POSTed to **canopy** (`POST /status/{server_id}`). Today these live in `crates/tamanu/src/doctor/` (the `Check` type + checks), with the sweep orchestration (`perform_sweep`), canopy posting, and the `DoctorTask` background task in `crates/bestool/`. Viewable via `bestool tamanu doctor`. +- **alertd** (`crates/alertd/`): a daemon loading **YAML alert definitions** (deployed in Tamanu installs at `/etc/tamanu/alerts`, …), scheduling each on its own interval, evaluating SQL/shell/event sources, and dispatching to email/Slack/canopy `/events`. Ships a standalone `bestool-alertd` binary + library; also hosts the `DoctorTask`. + +Decisions: + +1. 
**Invert the crate relationship.** Move the whole doctor subsystem (framework + checks + sweep + canopy posting + `DoctorTask`) **into `bestool-alertd`**, which calls into `bestool-tamanu` for common Tamanu domain utilities. alertd becomes the monitoring engine that owns both the framework and the checks. No dependency cycle: `bestool-tamanu` never depends on alertd. +2. **Migrate** all 16 production YAML alerts (`~/code/work/tamanu/alerts`) into checks. Migrated checks default to **`Check::fail`** when triggered (single severity, no warn tier). +3. **Canopy owns alerting** and has its own logic — it ignores the sweep's top-level `healthy:false`, so the warn-vs-fail-for-top-level distinction is irrelevant at the canopy level. bestool just posts the sweep; drop email/Slack/per-alert targets, dedup, hysteresis, cadence. +4. **Retire the YAML alert engine and the standalone CLI**; alertd keeps only the daemon framework + the doctor subsystem. +5. **Then review thresholds** across all checks (migrated and pre-existing). + +Note: deployed installs still have YAML files under `/etc/tamanu/alerts`; once the loader is removed they're simply ignored (no error). Operators can delete them later. + +## Target architecture + +- **`bestool-alertd`**: owns the monitoring framework (`BackgroundTask` daemon, http server) **and** the doctor subsystem — `Check`/`CheckStatus`/`OverallResult` wire types, `CheckContext`, the registry + `checks/*`, `progress`, the `ServerInfo` facts, `perform_sweep` + `SweepResult` + canopy status posting, and a built-in `DoctorTask` it registers itself. Depends on `bestool-tamanu` (common domain), `bestool-canopy`, `bestool-postgres`, `bestool-kopia`. +- **`bestool-tamanu`**: common Tamanu domain library only — `config`, `roots`, `connection_url`, `services`, `systemd`, `pm2`, `server_info` (DB queries: metaServerId, patient-portal), `versions`, `ApiServerKind`, `find_tamanu`, `detect_kind`. 
The `doctor` module and `doctor` feature are removed; description updated. +- **`bestool`**: thin CLI. `bestool tamanu doctor` keeps arg parsing + human rendering + daemon-fetch (`/tasks/doctor/latest`/`recompute`) and calls `bestool_alertd::doctor` for local sweeps + types. `bestool tamanu alertd` configures and runs the alertd daemon (which self-registers its `DoctorTask`). + +## Phase 1 — Invert: move the doctor subsystem into alertd (behaviour-preserving refactor) + +- Relocate `crates/tamanu/src/doctor/{check,checks,checks/*,progress,server_info}.rs` → `crates/alertd/src/doctor/…`. +- Move `perform_sweep` + `SweepResult` + canopy status posting from `crates/bestool/src/actions/tamanu/doctor.rs` into alertd (e.g. `bestool_alertd::doctor::perform_sweep`). +- Move `DoctorTask` (`crates/bestool/src/actions/tamanu/alertd/doctor_task.rs`) into alertd as the built-in task; alertd registers it (or exposes a constructor) so bestool no longer wires it. +- Add `bestool-tamanu` as an alertd dependency; rewrite check imports from `crate::{ApiServerKind, config::TamanuConfig, services, systemd, pm2, server_info, detect_kind, versions}` → `bestool_tamanu::{…}`. +- Move doctor-only deps (`bestool-kopia`, `hickory-resolver`, and `reqwest`/`owo-colors` as needed) from `crates/tamanu/Cargo.toml` to `crates/alertd/Cargo.toml`; remove tamanu's `doctor` feature and update its package description. +- bestool side: `doctor.rs` keeps CLI args + rendering + daemon-fetch, calling `bestool_alertd::doctor`; delete the moved `doctor_task` module; retarget Cargo features (`bestool-tamanu/doctor` → alertd). +- **Behaviour-preserving** — no check logic changes. This is large but mechanical (mostly imports + module moves). + +## Phase 2 — Migrate the 16 YAML alerts to checks (now in alertd), default FAIL, central-only + +Migrated checks emit `Check::fail` when triggered, skip on Facility (gate on `ctx.kind`, mirroring `fhir_jobs`), and attach offending rows as `details`. 
A shared "recent error rows" helper serves the 7 recent-error alerts (run query → `fail` with rows if any match, else `pass`); the old per-alert `$1 = now - interval` becomes a per-check lookback constant. Verbatim SQL is in `~/code/work/tamanu/alerts/.yml`.
+
+**New checks (~10):**
+| Alert(s) | New check | Style |
+|---|---|---|
+| certificate-notification-error | `certificate_notification_errors` | recent-error |
+| ips-error | `ips_errors` | recent-error |
+| patient-communications-error | `patient_communication_errors` | recent-error |
+| report-error | `report_errors` | recent-error |
+| fhir-error | `fhir_job_errors` | recent-error |
+| sync-errors-mobile + sync-errors-server | `sync_session_errors` (one check; detail splits mobile/server; keep benign-error exclusions) | recent-error |
+| sync-facility-not-syncing + sync-no-sessions | `sync_facility_stale` (one check; facilities with no recent successful sync) | stuck |
+| sync-lookup-stale | `sync_lookup` (**= TODO #8**) | stuck |
+| sync-restart-loop | `sync_restart_loop` | threshold |
+| fhir-unresolvable-service-requests-labs | `fhir_service_requests_unresolved` | stuck |
+
+**Already covered (confirm/extend detail, no new check):** fhir-queue-incredibly-large, fhir-queued-job-long, fhir-running-job-long → `fhir_jobs`; sync-long → `sync_sessions`.
+
+Add via the registry pattern: `pub mod ;` + `entry!("", )` in the registry; `pub async fn run(ctx: CheckContext) -> Check`. Split into ~3 PRs by theme (error-notification / sync / fhir+reconcile).
+
+## Phase 3 — Retire the YAML alert engine + standalone CLI
+
+- Remove from alertd: `alert.rs`, `loader.rs`, `glob_resolver.rs`, `events.rs`, `targets.rs` + `targets/*`, `templates.rs`, per-alert `state_file.rs`, the alert parts of `scheduler.rs`, `commands.rs` + `commands/*`, `main.rs`, the `[[bin]]` + `cli` feature, `windows_service.rs`.
Trim `DaemonConfig` (drop `alert_globs`, `email`, `server_kind`, alert `dry_run`; keep `pg_pool`, `database_url`, `device_key_pem`, `tamanu_version`, `no_server`, `server_addrs`, `watchdog_timeout`, `background_tasks`), `daemon.rs`, `http_server` (drop `/alerts`,`/targets`,`/validate`,`/reload`,`/pause`; keep `/`,`/status`,`/health`,`/metrics`,`/tasks/*`), and `lib.rs` exports. Relocate `InternalContext` out of `alert.rs` into `daemon.rs`/`context.rs`, slimmed to `{ pg_pool, http_client, canopy_client }`. +- bestool: simplify `tamanu alertd` (drop alert-dir discovery/globs, email/Mailgun flags, alert-filtering `server_kind`, and the passthrough subcommands `status`/`reload`/`pause`/`validate`/`loaded-alerts`); keep pg pool, device-key fetch (canopy auth), `tamanu_version`, build `DaemonConfig`, run. Remove the legacy `bestool tamanu alerts` command + module. Delete example alerts (`alerts/`) and alert test fixtures (`crates/bestool/tests/cmd/alerts*`). +- Gated after Phase 2 so coverage isn't lost. Optional follow-up (not in scope): rename `bestool-alertd` / `bestool tamanu alertd` now that it owns healthchecks, not alerts — deferred to avoid crates.io + systemd/install churn. + +## Phase 4 — Threshold review (all checks) + +After migration, review every check (the 10 migrated + the pre-existing ones) for triggering behaviour: warn-vs-fail, threshold values, central/facility gating, and whether any migrated check should be a warning rather than fail. Produce a short follow-up (possibly its own plan) and adjust. Migrated checks land at FAIL in Phase 2; this pass tunes them. 
+ +## Verification + +- **Phase 1 (refactor)**: `cargo build`/`clippy` across the workspace and all feature combos; `cargo check -p bestool --target x86_64-pc-windows-gnu`; confirm identical behaviour — `bestool tamanu doctor` (local `--no-daemon` and daemon-fetch `--fresh`), canopy `/status` posting, and `/tasks/doctor/{latest,recompute}` all work; grep for dangling `bestool_tamanu::doctor` references. +- **Phase 2 (checks)**: against the local `tamanu-central` / `tamanu-facility` databases, `cargo test -p bestool-alertd` (DB-backed tests where feasible) and `bestool tamanu doctor --json --no-daemon`; confirm each new check appears as pass/fail with `details`, and is skipped on a facility install. +- **Phase 3 (teardown)**: full-workspace `cargo build`/`clippy` + Windows cross-check (windows_service removed); `bestool tamanu alertd` starts, ticks the sweep, posts to canopy, and `bestool tamanu doctor` still fetches from it; grep for leftover references (`loader`, `targets`, `templates`, `AlertDefinition`, `tamanu alerts`). From 00e4596f329d0cc03ced5517cadb42f45a0fa009 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sat, 30 May 2026 17:56:23 +1200 Subject: [PATCH 02/12] feat(alertd/checks): port error-notification alerts to checks Add a shared row-fetching helper and the five recent-error checks ported from the certificate-notification, ips, patient-communications, report, and fhir YAML alerts. Each is central-only, skips when the DB is unavailable, and fails when any matching row exists within the lookback window. 
Co-authored-by: Claude --- crates/alertd/src/doctor/checks.rs | 85 +++++++++++++++++ .../checks/certificate_notification_errors.rs | 58 +++++++++++ .../src/doctor/checks/fhir_job_errors.rs | 61 ++++++++++++ crates/alertd/src/doctor/checks/ips_errors.rs | 57 +++++++++++ .../checks/patient_communication_errors.rs | 58 +++++++++++ .../alertd/src/doctor/checks/report_errors.rs | 58 +++++++++++ crates/alertd/src/doctor/checks/util.rs | 95 +++++++++++++++++++ 7 files changed, 472 insertions(+) create mode 100644 crates/alertd/src/doctor/checks/certificate_notification_errors.rs create mode 100644 crates/alertd/src/doctor/checks/fhir_job_errors.rs create mode 100644 crates/alertd/src/doctor/checks/ips_errors.rs create mode 100644 crates/alertd/src/doctor/checks/patient_communication_errors.rs create mode 100644 crates/alertd/src/doctor/checks/report_errors.rs create mode 100644 crates/alertd/src/doctor/checks/util.rs diff --git a/crates/alertd/src/doctor/checks.rs b/crates/alertd/src/doctor/checks.rs index e998bae8..8a83e9a6 100644 --- a/crates/alertd/src/doctor/checks.rs +++ b/crates/alertd/src/doctor/checks.rs @@ -13,17 +13,24 @@ use bestool_tamanu::{ApiServerKind, config::TamanuConfig}; use super::check::Check; +pub mod util; + pub mod caddy_version; +pub mod certificate_notification_errors; pub mod db_connect; pub mod db_version; pub mod disk_free; pub mod external_users; +pub mod fhir_job_errors; pub mod fhir_jobs; pub mod http_errors; +pub mod ips_errors; pub mod kopia_backup; pub mod load; pub mod memory; pub mod migrations; +pub mod patient_communication_errors; +pub mod report_errors; pub mod server_id; pub mod sync_sessions; pub mod tailscale; @@ -143,5 +150,83 @@ pub fn all() -> Vec { entry!("sync_sessions", sync_sessions), entry!("fhir_jobs", fhir_jobs), entry!("kopia_backup", kopia_backup), + entry!( + "certificate_notification_errors", + certificate_notification_errors + ), + entry!("ips_errors", ips_errors), + entry!("patient_communication_errors", 
patient_communication_errors), + entry!("report_errors", report_errors), + entry!("fhir_job_errors", fhir_job_errors), ] } + +#[cfg(test)] +pub mod test_support { + //! Helpers for DB-backed check tests. + //! + //! Each check is central-only and DB-backed, so its tests need a + //! [`CheckContext`] wired to one of the local `tamanu-central` / + //! `tamanu-facility` databases. These connect lazily and return `None` when + //! the DB is unavailable so the suite degrades gracefully off-CI. + + use std::sync::Arc; + + use node_semver::Version; + + use bestool_tamanu::{ApiServerKind, config::TamanuConfig}; + + use super::CheckContext; + + fn central_config() -> TamanuConfig { + serde_json::from_value(serde_json::json!({ + "db": { "name": "tamanu-central", "username": "u", "password": "p" }, + })) + .expect("central test config should parse") + } + + fn facility_config() -> TamanuConfig { + serde_json::from_value(serde_json::json!({ + "db": { "name": "tamanu-facility", "username": "u", "password": "p" }, + "serverFacilityIds": ["facility-1"], + })) + .expect("facility test config should parse") + } + + async fn connect(db_name: &str) -> Option> { + let url = format!("postgresql://localhost/{db_name}"); + match bestool_postgres::pool::connect_one(&url, "bestool-alertd-test").await { + Ok(client) => Some(Arc::new(client)), + Err(_) => None, + } + } + + /// A central [`CheckContext`] backed by `tamanu-central`, or `None` if that + /// DB can't be reached. 
+ pub async fn central_ctx() -> Option { + let db = connect("tamanu-central").await?; + Some(CheckContext { + tamanu_version: Version::parse("0.0.0").unwrap(), + tamanu_root: std::path::PathBuf::from("/nonexistent"), + config: Arc::new(central_config()), + kind: ApiServerKind::Central, + database_url: "postgresql://localhost/tamanu-central".into(), + db: Some(db), + http_client: reqwest::Client::new(), + }) + } + + /// A facility [`CheckContext`] with no DB; central-only checks skip on it + /// before ever touching the database. + pub fn facility_ctx() -> CheckContext { + CheckContext { + tamanu_version: Version::parse("0.0.0").unwrap(), + tamanu_root: std::path::PathBuf::from("/nonexistent"), + config: Arc::new(facility_config()), + kind: ApiServerKind::Facility, + database_url: "postgresql://localhost/tamanu-facility".into(), + db: None, + http_client: reqwest::Client::new(), + } + } +} diff --git a/crates/alertd/src/doctor/checks/certificate_notification_errors.rs b/crates/alertd/src/doctor/checks/certificate_notification_errors.rs new file mode 100644 index 00000000..450c2843 --- /dev/null +++ b/crates/alertd/src/doctor/checks/certificate_notification_errors.rs @@ -0,0 +1,58 @@ +//! Certificate notifications that errored within the lookback window. + +use jiff::{Timestamp, ToSpan}; + +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "certificate_notification_errors"; +const SQL: &str = "SELECT * FROM certificate_notifications \ + WHERE status = 'Error' AND created_at > $1 ORDER BY created_at DESC"; + +// Lookback window for recent-error checks. 
+const LOOKBACK_HOURS: i64 = 1; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + let since = Timestamp::now() - LOOKBACK_HOURS.hours(); + fail_if_any_rows( + client, + "certificate_notification_errors", + "no recent certificate notification errors", + "certificate notification errors: ", + SQL, + &[&since], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "certificate_notification_errors"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/fhir_job_errors.rs b/crates/alertd/src/doctor/checks/fhir_job_errors.rs new file mode 100644 index 00000000..806e6706 --- /dev/null +++ b/crates/alertd/src/doctor/checks/fhir_job_errors.rs @@ -0,0 +1,61 @@ +//! FHIR jobs that recorded an error within the lookback window. +//! +//! Distinct from `fhir_jobs`, which measures live queue depth: this surfaces +//! individual jobs that errored recently. + +use jiff::{Timestamp, ToSpan}; + +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "fhir_job_errors"; +const SQL: &str = + "SELECT * FROM fhir.jobs WHERE error IS NOT NULL AND created_at > $1 ORDER BY created_at DESC"; + +// Lookback window for recent-error checks. 
+const LOOKBACK_HOURS: i64 = 1; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + let since = Timestamp::now() - LOOKBACK_HOURS.hours(); + fail_if_any_rows( + client, + "fhir_job_errors", + "no recent FHIR job errors", + "FHIR job errors: ", + SQL, + &[&since], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "fhir_job_errors"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/ips_errors.rs b/crates/alertd/src/doctor/checks/ips_errors.rs new file mode 100644 index 00000000..1812e8b8 --- /dev/null +++ b/crates/alertd/src/doctor/checks/ips_errors.rs @@ -0,0 +1,57 @@ +//! IPS requests that errored within the lookback window. + +use jiff::{Timestamp, ToSpan}; + +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "ips_errors"; +const SQL: &str = "SELECT * FROM ips_requests WHERE status = 'Error' AND created_at > $1 ORDER BY created_at DESC"; + +// Lookback window for recent-error checks. 
+const LOOKBACK_HOURS: i64 = 1; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + let since = Timestamp::now() - LOOKBACK_HOURS.hours(); + fail_if_any_rows( + client, + "ips_errors", + "no recent IPS request errors", + "IPS request errors: ", + SQL, + &[&since], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "ips_errors"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/patient_communication_errors.rs b/crates/alertd/src/doctor/checks/patient_communication_errors.rs new file mode 100644 index 00000000..74d5c537 --- /dev/null +++ b/crates/alertd/src/doctor/checks/patient_communication_errors.rs @@ -0,0 +1,58 @@ +//! Patient communications that errored within the lookback window. + +use jiff::{Timestamp, ToSpan}; + +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "patient_communication_errors"; +const SQL: &str = "SELECT * FROM patient_communications \ + WHERE status = 'Error' AND created_at > $1 ORDER BY created_at DESC"; + +// Lookback window for recent-error checks. 
+const LOOKBACK_HOURS: i64 = 1; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + let since = Timestamp::now() - LOOKBACK_HOURS.hours(); + fail_if_any_rows( + client, + "patient_communication_errors", + "no recent patient communication errors", + "patient communication errors: ", + SQL, + &[&since], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "patient_communication_errors"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/report_errors.rs b/crates/alertd/src/doctor/checks/report_errors.rs new file mode 100644 index 00000000..4dd0b2ce --- /dev/null +++ b/crates/alertd/src/doctor/checks/report_errors.rs @@ -0,0 +1,58 @@ +//! Report requests that errored within the lookback window. + +use jiff::{Timestamp, ToSpan}; + +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "report_errors"; +const SQL: &str = "SELECT * FROM report_requests \ + WHERE status = 'Error' AND created_at > $1 ORDER BY created_at DESC"; + +// Lookback window for recent-error checks. 
+const LOOKBACK_HOURS: i64 = 1; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + let since = Timestamp::now() - LOOKBACK_HOURS.hours(); + fail_if_any_rows( + client, + "report_errors", + "no recent report errors", + "report errors: ", + SQL, + &[&since], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "report_errors"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/util.rs b/crates/alertd/src/doctor/checks/util.rs new file mode 100644 index 00000000..3cfbcdc2 --- /dev/null +++ b/crates/alertd/src/doctor/checks/util.rs @@ -0,0 +1,95 @@ +//! Shared helpers for SQL-backed checks. +//! +//! Each check fails when its query returns any rows and attaches the +//! offending rows (capped) to `details`. To avoid the generic tokio-postgres +//! row→JSON conversion and to bound memory, every query is wrapped so Postgres +//! returns one JSONB column per row, capped just past the reporting limit. + +use std::sync::Arc; + +use serde_json::Value; +use tokio_postgres::{Client as PgClient, types::ToSql}; + +use super::fmt_db_error; +use crate::doctor::check::Check; + +/// Rows reported in `details` are capped here; one extra row is fetched to +/// detect truncation. 
+const REPORT_CAP: usize = 100;
+const FETCH_CAP: usize = REPORT_CAP + 1;
+
+/// Wrap the check's SQL so Postgres hands back one JSONB column (`row`) per
+/// matching row, capped at [`FETCH_CAP`].
+fn wrap(sql: &str) -> String {
+    format!("SELECT to_jsonb(sub) AS row FROM ( {sql} ) sub LIMIT {FETCH_CAP}")
+}
+
+/// Outcome of running one wrapped query: the rows (capped at
+/// [`REPORT_CAP`]) and whether more existed than were reported.
+pub struct RowSet {
+    pub rows: Vec<Value>,
+    pub truncated: bool,
+}
+
+impl RowSet {
+    pub fn is_empty(&self) -> bool {
+        self.rows.is_empty()
+    }
+
+    /// Number to report: the exact count, or `"100+"` when truncated.
+    pub fn count(&self) -> Value {
+        if self.truncated {
+            Value::from(format!("{REPORT_CAP}+"))
+        } else {
+            Value::from(self.rows.len())
+        }
+    }
+}
+
+/// Run a wrapped query and collect its rows. The `to_jsonb` wrapping is
+/// applied here, so callers pass the check's SQL.
+pub async fn fetch_rows(
+    client: &Arc<PgClient>,
+    sql: &str,
+    params: &[&(dyn ToSql + Sync)],
+) -> Result<RowSet, tokio_postgres::Error> {
+    let wrapped = wrap(sql);
+    let raw = client.query(&wrapped, params).await?;
+    let truncated = raw.len() > REPORT_CAP;
+    let rows = raw
+        .into_iter()
+        .take(REPORT_CAP)
+        .map(|r| r.get::<_, Value>("row"))
+        .collect();
+    Ok(RowSet { rows, truncated })
+}
+
+/// Run a single wrapped query: fail (with capped rows + count) if it
+/// returns any rows, else pass.
+///
+/// `summary_pass` is the headline shown when nothing matched;
+/// `summary_fail_prefix` is prepended to the count when rows are found.
+pub async fn fail_if_any_rows(
+    client: &Arc<PgClient>,
+    name: &'static str,
+    summary_pass: &str,
+    summary_fail_prefix: &str,
+    sql: &str,
+    params: &[&(dyn ToSql + Sync)],
+) -> Check {
+    match fetch_rows(client, sql, params).await {
+        Ok(set) if set.is_empty() => Check::pass(name, summary_pass.to_string()),
+        Ok(set) => {
+            let count = set.count();
+            Check::fail(
+                name,
+                format!("{summary_fail_prefix}{count}"),
+                format!("{} matching row(s)", count),
+            )
+            .with_detail("rows", Value::Array(set.rows))
+            .with_detail("truncated", set.truncated)
+            .with_detail("count", count)
+        }
+        Err(err) => Check::fail(name, "query failed", fmt_db_error(&err)),
+    }
+}
From 10749c5edb89af06d1d21505044c487871578e01 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?F=C3=A9lix=20Saparelli?=
Date: Sun, 31 May 2026 17:57:27 +1200
Subject: [PATCH 03/12] feat(alertd/doctor): report cpu cores and total memory in status payload

Add cpuCores (logical CPU count via sysinfo) and totalMemoryBytes to the
top-level status payload, populated directly in gather(). Tier the load check
on the 5-minute average relative to core count (WARN >1.5x cores, FAIL >4x
cores) and surface the core count in its summary and details.

Co-authored-by: Claude
---
 crates/alertd/src/doctor/checks/load.rs | 73 +++++++++++++++++++++++--
 crates/alertd/src/doctor/server_info.rs | 16 +++++-
 2 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/crates/alertd/src/doctor/checks/load.rs b/crates/alertd/src/doctor/checks/load.rs
index 32ed48ab..6cc9fc8e 100644
--- a/crates/alertd/src/doctor/checks/load.rs
+++ b/crates/alertd/src/doctor/checks/load.rs
@@ -1,7 +1,14 @@
-use sysinfo::System;
+use sysinfo::{CpuRefreshKind, RefreshKind, System};
 
 use super::CheckContext;
-use crate::doctor::check::Check;
+use crate::doctor::check::{Check, CheckStatus};
+
+/// Multiplier on the logical core count above which the 5-minute load average
+/// is treated as a hard failure.
+const FAIL_PER_CORE: f64 = 4.0;
+/// Multiplier on the logical core count above which the 5-minute load average
+/// is treated as a warning.
+const WARN_PER_CORE: f64 = 1.5;
 
 pub async fn run(_ctx: CheckContext) -> Check {
     if cfg!(target_os = "windows") {
@@ -12,13 +19,71 @@ pub async fn run(_ctx: CheckContext) -> Check {
         );
     }
 
+    let sys =
+        System::new_with_specifics(RefreshKind::nothing().with_cpu(CpuRefreshKind::nothing()));
+    let cores = sys.cpus().len().max(1);
+
     let load = System::load_average();
     let summary = format!(
-        "load average: {:.2}, {:.2}, {:.2}",
+        "load average: {:.2}, {:.2}, {:.2} ({cores} cores)",
         load.one, load.five, load.fifteen
     );
-    Check::pass("load", summary)
+
+    let check = match tier(load.five, cores) {
+        CheckStatus::Fail(_) => Check::fail(
+            "load",
+            summary,
+            format!(
+                "5-min load {:.2} over {:.1}x cores ({cores})",
+                load.five, FAIL_PER_CORE
+            ),
+        ),
+        CheckStatus::Warning(_) => Check::warning(
+            "load",
+            summary,
+            format!(
+                "5-min load {:.2} over {:.1}x cores ({cores})",
+                load.five, WARN_PER_CORE
+            ),
+        ),
+        _ => Check::pass("load", summary),
+    };
+
+    check
         .with_detail("one_min", load.one)
         .with_detail("five_min", load.five)
         .with_detail("fifteen_min", load.fifteen)
+        .with_detail("cores", cores)
+}
+
+/// Tier the 5-minute load average against the logical core count.
+fn tier(five: f64, cores: usize) -> CheckStatus { + let cores = cores as f64; + if five > FAIL_PER_CORE * cores { + CheckStatus::Fail(String::new()) + } else if five > WARN_PER_CORE * cores { + CheckStatus::Warning(String::new()) + } else { + CheckStatus::Pass + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn tier_boundaries() { + assert!(matches!(tier(5.9, 4), CheckStatus::Pass)); + assert!(matches!(tier(6.1, 4), CheckStatus::Warning(_))); + assert!(matches!(tier(15.9, 4), CheckStatus::Warning(_))); + assert!(matches!(tier(16.1, 4), CheckStatus::Fail(_))); + } + + #[test] + fn tier_single_core() { + assert!(matches!(tier(1.4, 1), CheckStatus::Pass)); + assert!(matches!(tier(1.6, 1), CheckStatus::Warning(_))); + assert!(matches!(tier(4.1, 1), CheckStatus::Fail(_))); + } } diff --git a/crates/alertd/src/doctor/server_info.rs b/crates/alertd/src/doctor/server_info.rs index ba113a86..887116ce 100644 --- a/crates/alertd/src/doctor/server_info.rs +++ b/crates/alertd/src/doctor/server_info.rs @@ -8,7 +8,7 @@ use std::{ }; use serde::Serialize; -use sysinfo::{Disks, System}; +use sysinfo::{CpuRefreshKind, Disks, MemoryRefreshKind, RefreshKind, System}; use tokio::net::TcpStream; use tracing::debug; @@ -63,6 +63,10 @@ pub struct ServerInfo { pub pg_version: Option, pub uptime_secs: u64, + /// Logical CPU count (what load average is relative to, i.e. `nproc`). + pub cpu_cores: usize, + /// Total physical memory, in bytes. 
+ pub total_memory_bytes: u64, pub os_kind: &'static str, #[serde(skip_serializing_if = "Option::is_none")] pub os_name: Option, @@ -115,6 +119,14 @@ pub async fn gather(bestool_version: &str, tamanu_version: &str, facts: ServerFa .iana_name() .map(|s| s.to_string()); + let sys = System::new_with_specifics( + RefreshKind::nothing() + .with_cpu(CpuRefreshKind::nothing()) + .with_memory(MemoryRefreshKind::nothing().with_ram()), + ); + let cpu_cores = sys.cpus().len(); + let total_memory_bytes = sys.total_memory(); + ServerInfo { bestool_version: bestool_version.to_string(), tamanu_version: tamanu_version.to_string(), @@ -126,6 +138,8 @@ pub async fn gather(bestool_version: &str, tamanu_version: &str, facts: ServerFa os_timezone, pg_version: facts.pg_version, uptime_secs: System::uptime(), + cpu_cores, + total_memory_bytes, os_kind: if cfg!(target_os = "linux") { "linux" } else if cfg!(target_os = "windows") { From ca8612519020999873b2fe55e1a9cebded6a3503 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sat, 30 May 2026 17:57:10 +1200 Subject: [PATCH 04/12] feat(alertd/checks): port sync alerts to checks (closes TODO sync_lookup) Add sync_session_errors (mobile + server, benign-error exclusions baked in), sync_facility_stale (not-syncing + no-recent-success), sync_lookup (lookup-table staleness, closes TODO #8), and sync_restart_loop. All central-only, skip when the DB is unavailable, and fail on any matching rows. 
Co-authored-by: Claude --- crates/alertd/src/doctor/checks.rs | 8 ++ .../src/doctor/checks/sync_facility_stale.rs | 98 +++++++++++++++++++ .../alertd/src/doctor/checks/sync_lookup.rs | 56 +++++++++++ .../src/doctor/checks/sync_restart_loop.rs | 59 +++++++++++ .../src/doctor/checks/sync_session_errors.rs | 97 ++++++++++++++++++ 5 files changed, 318 insertions(+) create mode 100644 crates/alertd/src/doctor/checks/sync_facility_stale.rs create mode 100644 crates/alertd/src/doctor/checks/sync_lookup.rs create mode 100644 crates/alertd/src/doctor/checks/sync_restart_loop.rs create mode 100644 crates/alertd/src/doctor/checks/sync_session_errors.rs diff --git a/crates/alertd/src/doctor/checks.rs b/crates/alertd/src/doctor/checks.rs index 8a83e9a6..ad615d66 100644 --- a/crates/alertd/src/doctor/checks.rs +++ b/crates/alertd/src/doctor/checks.rs @@ -32,6 +32,10 @@ pub mod migrations; pub mod patient_communication_errors; pub mod report_errors; pub mod server_id; +pub mod sync_facility_stale; +pub mod sync_lookup; +pub mod sync_restart_loop; +pub mod sync_session_errors; pub mod sync_sessions; pub mod tailscale; pub mod tamanu_found; @@ -158,6 +162,10 @@ pub fn all() -> Vec { entry!("patient_communication_errors", patient_communication_errors), entry!("report_errors", report_errors), entry!("fhir_job_errors", fhir_job_errors), + entry!("sync_session_errors", sync_session_errors), + entry!("sync_facility_stale", sync_facility_stale), + entry!("sync_lookup", sync_lookup), + entry!("sync_restart_loop", sync_restart_loop), ] } diff --git a/crates/alertd/src/doctor/checks/sync_facility_stale.rs b/crates/alertd/src/doctor/checks/sync_facility_stale.rs new file mode 100644 index 00000000..e25c21bb --- /dev/null +++ b/crates/alertd/src/doctor/checks/sync_facility_stale.rs @@ -0,0 +1,98 @@ +//! Facilities whose sync has gone stale. +//! +//! Flags facilities that synced in the last 48h but have had no completion in +//! 
the last 30m, as well as facilities whose last successful sync was over an +//! hour ago. + +use serde_json::Value; + +use super::{CheckContext, util::fetch_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "sync_facility_stale"; + +const NOT_SYNCING_SQL: &str = "with sync_sessions_with_facility_id as ( \ + select created_at, completed_at, \ + jsonb_array_elements_text(parameters->'facilityIds') as facility_id \ + from sync_sessions where parameters->>'isMobile' <> 'true' \ + ) \ + select distinct facility_id from sync_sessions_with_facility_id \ + where created_at > current_timestamp - '48 hours'::interval \ + except \ + select facility_id from sync_sessions_with_facility_id \ + where completed_at > current_timestamp - '30 minutes'::interval \ + group by facility_id order by facility_id"; + +const NO_RECENT_SUCCESS_SQL: &str = "SELECT facility_id, last_successful_sync FROM ( \ + SELECT facility_id, max(completed_at) as last_successful_sync FROM ( \ + SELECT jsonb_array_elements_text(parameters->'facilityIds') as facility_id, completed_at \ + FROM sync_sessions WHERE errors IS NULL \ + ) AS successful_syncs GROUP BY facility_id \ + ) AS last_successful_facility_syncs \ + WHERE last_successful_sync < now() - interval '1 hour'"; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + let not_syncing = match fetch_rows(client, NOT_SYNCING_SQL, &[]).await { + Ok(set) => set, + Err(err) => return Check::fail(NAME, "query failed", super::fmt_db_error(&err)), + }; + let no_recent_success = match fetch_rows(client, NO_RECENT_SUCCESS_SQL, &[]).await { + Ok(set) => set, + Err(err) => return Check::fail(NAME, "query failed", super::fmt_db_error(&err)), + }; + + if 
not_syncing.is_empty() && no_recent_success.is_empty() { + return Check::pass(NAME, "all facilities syncing"); + } + + let (not_syncing_count, not_syncing_truncated) = (not_syncing.count(), not_syncing.truncated); + let (no_recent_count, no_recent_truncated) = + (no_recent_success.count(), no_recent_success.truncated); + + let check = Check::fail( + NAME, + format!( + "stale sync: {not_syncing_count} not syncing, {no_recent_count} with no recent success" + ), + "facility sync stale", + ); + check + .with_detail("not_syncing", Value::Array(not_syncing.rows)) + .with_detail("not_syncing_count", not_syncing_count) + .with_detail("not_syncing_truncated", not_syncing_truncated) + .with_detail("no_recent_success", Value::Array(no_recent_success.rows)) + .with_detail("no_recent_success_count", no_recent_count) + .with_detail("no_recent_success_truncated", no_recent_truncated) +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "sync_facility_stale"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/sync_lookup.rs b/crates/alertd/src/doctor/checks/sync_lookup.rs new file mode 100644 index 00000000..0a7a72ad --- /dev/null +++ b/crates/alertd/src/doctor/checks/sync_lookup.rs @@ -0,0 +1,56 @@ +//! Lookup table update staleness. +//! +//! Fails when the central server hasn't recorded a successful lookup-table +//! update in over an hour. 
+ +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "sync_lookup"; +const SQL: &str = "SELECT key, value AS last_sync_tick, updated_at::text AS last_updated, \ + (now() - updated_at)::text AS time_since_update FROM local_system_facts \ + WHERE key = 'lastSuccessfulLookupTableUpdate' AND updated_at < now() - interval '1 hour'"; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + fail_if_any_rows( + client, + NAME, + "lookup table up to date", + "lookup table stale: ", + SQL, + &[], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "sync_lookup"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/sync_restart_loop.rs b/crates/alertd/src/doctor/checks/sync_restart_loop.rs new file mode 100644 index 00000000..09d36bc2 --- /dev/null +++ b/crates/alertd/src/doctor/checks/sync_restart_loop.rs @@ -0,0 +1,59 @@ +//! Facilities stuck in a sync restart loop. +//! +//! Fails when a facility has accumulated 10 or more `snapshot-for-pushing` sync +//! errors in the last hour, which indicates the sync is repeatedly restarting +//! rather than progressing. 
+ +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "sync_restart_loop"; +const SQL: &str = "SELECT jsonb_array_elements_text(parameters->'facilityIds') AS facility_id, \ + COUNT(*) AS error_count FROM sync_sessions \ + WHERE created_at > now() - interval '1 hour' AND errors IS NOT NULL \ + AND cardinality(errors) = 1 AND errors[1] LIKE '%snapshot-for-pushing%' \ + GROUP BY facility_id HAVING COUNT(*) >= 10 ORDER BY error_count DESC"; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + fail_if_any_rows( + client, + NAME, + "no sync restart loops", + "facilities in sync restart loop: ", + SQL, + &[], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "sync_restart_loop"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} diff --git a/crates/alertd/src/doctor/checks/sync_session_errors.rs b/crates/alertd/src/doctor/checks/sync_session_errors.rs new file mode 100644 index 00000000..f29d893d --- /dev/null +++ b/crates/alertd/src/doctor/checks/sync_session_errors.rs @@ -0,0 +1,97 @@ +//! Recent mobile and server sync-session errors, with benign-error exclusions +//! baked into the SQL. +//! +//! The window is a tight `updated_at > now() - interval '1 minute'`; the sweep +//! runs every 60s, so this still catches each error once. 
+ +use serde_json::Value; + +use super::{CheckContext, util::fetch_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "sync_session_errors"; + +const MOBILE_SQL: &str = "SELECT id, errors::text, \ + jsonb_array_elements_text(parameters->'facilityIds') AS facility_id, \ + created_at::text AS created, (completed_at - created_at)::text AS duration \ + FROM sync_sessions \ + WHERE updated_at > now() - interval '1 minute' \ + AND parameters->>'isMobile' = 'true' \ + AND errors IS NOT NULL \ + AND errors <> ARRAY['Session marked as completed due to its device reconnecting'] \ + AND errors <> ARRAY['could not serialize access due to concurrent update'] \ + ORDER BY created_at DESC"; + +const SERVER_SQL: &str = "SELECT id, errors::text, \ + jsonb_array_elements_text(parameters->'facilityIds') AS facility_id, \ + created_at::text AS created, (completed_at - created_at)::text AS duration \ + FROM sync_sessions \ + WHERE updated_at > now() - interval '1 minute' \ + AND parameters->>'isMobile' IS DISTINCT FROM 'true' \ + AND errors IS NOT NULL \ + AND errors <> ARRAY['could not serialize access due to concurrent update'] \ + AND NOT (cardinality(errors) = 1 AND errors[1] LIKE '%snapshot-for-pushing%') \ + ORDER BY created_at DESC"; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + let mobile = match fetch_rows(client, MOBILE_SQL, &[]).await { + Ok(set) => set, + Err(err) => return Check::fail(NAME, "query failed", super::fmt_db_error(&err)), + }; + let server = match fetch_rows(client, SERVER_SQL, &[]).await { + Ok(set) => set, + Err(err) => return Check::fail(NAME, "query failed", super::fmt_db_error(&err)), + }; + + if mobile.is_empty() && server.is_empty() { + return 
Check::pass(NAME, "no recent sync session errors"); + } + + let (mobile_count, mobile_truncated) = (mobile.count(), mobile.truncated); + let (server_count, server_truncated) = (server.count(), server.truncated); + + let check = Check::fail( + NAME, + format!("sync session errors: {mobile_count} mobile, {server_count} server"), + "recent sync session error(s)", + ); + check + .with_detail("mobile", Value::Array(mobile.rows)) + .with_detail("mobile_count", mobile_count) + .with_detail("mobile_truncated", mobile_truncated) + .with_detail("server", Value::Array(server.rows)) + .with_detail("server_count", server_count) + .with_detail("server_truncated", server_truncated) +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "sync_session_errors"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} From 0f75e828f6454d87c187362a8f6417242754e475 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sat, 30 May 2026 18:06:49 +1200 Subject: [PATCH 05/12] feat(bestool): remove the legacy tamanu alerts command The YAML alert command, its definition/template/target parsing, the PostgreSQL-to-JSON helper it relied on, and its trycmd fixtures (plus the now-orphaned Postgres test fixture) are removed. The alerts engine is being retired in favour of the doctor healthcheck sweep. 
Co-authored-by: Claude --- crates/bestool/src/actions/tamanu.rs | 2 - crates/bestool/src/actions/tamanu/alerts.rs | 10 - .../src/actions/tamanu/alerts/command.rs | 285 ------------------ .../src/actions/tamanu/alerts/definition.rs | 190 ------------ .../src/actions/tamanu/alerts/pg_interval.rs | 23 -- .../src/actions/tamanu/alerts/targets.rs | 205 ------------- .../actions/tamanu/alerts/targets/canopy.rs | 127 -------- .../actions/tamanu/alerts/targets/email.rs | 67 ---- .../actions/tamanu/alerts/targets/slack.rs | 112 ------- .../actions/tamanu/alerts/targets/zendesk.rs | 89 ------ .../src/actions/tamanu/alerts/templates.rs | 159 ---------- .../src/actions/tamanu/alerts/tests.rs | 226 -------------- crates/bestool/src/lib.rs | 3 - crates/bestool/src/postgres_to_value.rs | 82 ----- crates/bestool/tests/cli_tests.rs | 20 +- .../tests/cmd/alerts.in/alerts/sql.yml | 16 - .../tests/cmd/alerts.in/tamanu/package.json | 3 - .../central-server/config/default.json5 | 1 - .../central-server/config/local.json5 | 13 - crates/bestool/tests/cmd/alerts.stdout | 15 - crates/bestool/tests/cmd/alerts.toml | 2 - crates/bestool/tests/fixture.sql | 17 -- crates/bestool/tests/fixture_pg.rs | 133 -------- 23 files changed, 3 insertions(+), 1797 deletions(-) delete mode 100644 crates/bestool/src/actions/tamanu/alerts.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/command.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/definition.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/pg_interval.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/targets.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/targets/canopy.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/targets/email.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/targets/slack.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/targets/zendesk.rs delete mode 100644 crates/bestool/src/actions/tamanu/alerts/templates.rs delete 
mode 100644 crates/bestool/src/actions/tamanu/alerts/tests.rs delete mode 100644 crates/bestool/src/postgres_to_value.rs delete mode 100644 crates/bestool/tests/cmd/alerts.in/alerts/sql.yml delete mode 100644 crates/bestool/tests/cmd/alerts.in/tamanu/package.json delete mode 100644 crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/default.json5 delete mode 100644 crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/local.json5 delete mode 100644 crates/bestool/tests/cmd/alerts.stdout delete mode 100644 crates/bestool/tests/cmd/alerts.toml delete mode 100644 crates/bestool/tests/fixture.sql delete mode 100644 crates/bestool/tests/fixture_pg.rs diff --git a/crates/bestool/src/actions/tamanu.rs b/crates/bestool/src/actions/tamanu.rs index ea1895af..9b0ab262 100644 --- a/crates/bestool/src/actions/tamanu.rs +++ b/crates/bestool/src/actions/tamanu.rs @@ -42,8 +42,6 @@ super::subcommands! { Ok((action, ctx)) }] - #[cfg(feature = "tamanu-alerts")] - alerts => Alerts(AlertsArgs), #[cfg(feature = "tamanu-alertd")] alertd => Alertd(AlertdArgs), #[cfg(feature = "tamanu-artifacts")] diff --git a/crates/bestool/src/actions/tamanu/alerts.rs b/crates/bestool/src/actions/tamanu/alerts.rs deleted file mode 100644 index f34e8818..00000000 --- a/crates/bestool/src/actions/tamanu/alerts.rs +++ /dev/null @@ -1,10 +0,0 @@ -pub use command::*; - -mod command; -mod definition; -mod pg_interval; -mod targets; -mod templates; - -#[cfg(test)] -mod tests; diff --git a/crates/bestool/src/actions/tamanu/alerts/command.rs b/crates/bestool/src/actions/tamanu/alerts/command.rs deleted file mode 100644 index 242a43d0..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/command.rs +++ /dev/null @@ -1,285 +0,0 @@ -use std::{ - collections::HashMap, - convert::Infallible, - env::current_dir, - path::{Path, PathBuf}, - sync::Arc, - time::Duration, -}; - -use clap::Parser; -use futures::{TryFutureExt, future::join_all}; -use miette::{Context as _, 
IntoDiagnostic, Result}; -use tokio::{task::JoinSet, time::timeout}; -use tracing::{debug, error, info, warn}; -use walkdir::WalkDir; - -use bestool_tamanu::{ - config::load_config, - server_info::{fetch_device_key_with, query_device_key_row}, -}; - -use super::{definition::AlertDefinition, targets::AlertTargets}; -use crate::actions::{ - Context, - tamanu::{TamanuArgs, find_tamanu}, -}; - -fn parse_friendly_duration(s: &str) -> Result { - let signed: jiff::SignedDuration = s.parse().map_err(|e: jiff::Error| e.to_string())?; - signed.try_into().map_err(|e: jiff::Error| e.to_string()) -} - -/// Execute alert definitions against Tamanu -/// -/// DEPRECATED. Use `bestool tamanu alertd` for all new deployments. -/// -/// The alert and target definitions are documented online at: -/// -/// and . -#[derive(Debug, Clone, Parser)] -#[clap(verbatim_doc_comment)] -pub struct AlertsArgs { - /// Folder containing alert definitions. - /// - /// This folder will be read recursively for files with the `.yaml` or `.yml` extension. - /// - /// Files that don't match the expected format will be skipped, as will files with - /// `enabled: false` at the top level. Syntax errors will be reported for YAML files. - /// - /// It's entirely valid to provide a folder that only contains a `_targets.yml` file. - /// - /// Can be provided multiple times. Defaults to (depending on platform): `C:\Tamanu\alerts`, - /// `C:\Tamanu\{current-version}\alerts`, `/opt/tamanu-toolbox/alerts`, `/etc/tamanu/alerts`, - /// `/alerts`, and `./alerts`. - #[arg(long)] - pub dir: Vec, - - /// How far back to look for alerts. - /// - /// This is a duration string, e.g. `1d` for one day, `1h` for one hour, etc. It should match - /// the task scheduling / cron interval for this command. - #[arg(long, default_value = "15m", value_parser = parse_friendly_duration)] - pub interval: Duration, - - /// Timeout for each alert. 
- /// - /// If an alert takes longer than this to query the database or run the shell script, it will be - /// skipped. Defaults to 30 seconds. - /// - /// This is a duration string, e.g. `1d` for one day, `1h` for one hour, etc. - #[arg(long, default_value = "30s", value_parser = parse_friendly_duration)] - pub timeout: Duration, - - /// Don't actually send alerts, just print them to stdout. - #[arg(long)] - pub dry_run: bool, -} - -pub struct InternalContext { - pub pg_client: tokio_postgres::Client, - pub http_client: reqwest::Client, - pub canopy_client: Option>, -} - -async fn default_dirs(root: &Path) -> Vec { - let mut dirs = vec![ - PathBuf::from(r"C:\Tamanu\alerts"), - root.join("alerts"), - PathBuf::from("/opt/tamanu-toolbox/alerts"), - PathBuf::from("/etc/tamanu/alerts"), - PathBuf::from("/alerts"), - ]; - if let Ok(cwd) = current_dir() { - dirs.push(cwd.join("alerts")); - } - - join_all( - dirs.into_iter() - .map(|dir| async { if dir.exists() { Some(dir) } else { None } }), - ) - .await - .into_iter() - .flatten() - .collect() -} - -pub async fn run(args: AlertsArgs, ctx: Context) -> Result<()> { - let (version, root) = find_tamanu(ctx.require::())?; - let config = load_config(&root, None)?; - debug!(?config, "parsed Tamanu config"); - - let dirs = if args.dir.is_empty() { - default_dirs(&root).await - } else { - args.dir - }; - debug!(?dirs, "searching for alerts"); - - let mut alerts = Vec::::new(); - let mut external_targets = HashMap::new(); - for dir in dirs { - let external_targets_path = dir.join("_targets.yml"); - if let Some(AlertTargets { targets }) = std::fs::read_to_string(&external_targets_path) - .ok() - .and_then(|content| { - debug!(path=?external_targets_path, "parsing external targets"); - serde_yaml::from_str::(&content) - .map_err( - |err| warn!(path=?external_targets_path, "_targets.yml has errors! 
{err}"), - ) - .ok() - }) { - for target in targets { - external_targets - .entry(target.id().into()) - .or_insert(Vec::new()) - .push(target); - } - } - - alerts.extend( - WalkDir::new(dir) - .into_iter() - .filter_map(|e| e.ok()) - .filter(|e| e.file_type().is_file()) - .map(|entry| { - let file = entry.path(); - - if !file.extension().is_some_and(|e| e == "yaml" || e == "yml") { - return Ok(None); - } - - if file.file_stem().is_some_and(|n| n == "_targets") { - return Ok(None); - } - - debug!(?file, "parsing YAML file"); - let content = std::fs::read_to_string(file) - .into_diagnostic() - .wrap_err(format!("{file:?}"))?; - let mut alert: AlertDefinition = serde_yaml::from_str(&content) - .into_diagnostic() - .wrap_err(format!("{file:?}"))?; - - alert.file = file.to_path_buf(); - alert.interval = args.interval; - debug!(?alert, "parsed alert file"); - Ok(if alert.enabled { Some(alert) } else { None }) - }) - .filter_map(|def: Result>| match def { - Err(err) => { - error!("{err:?}"); - None - } - Ok(def) => def, - }), - ); - } - - if alerts.is_empty() { - info!("no alerts found, doing nothing"); - return Ok(()); - } - - if !external_targets.is_empty() { - debug!(count=%external_targets.len(), "found some external targets"); - } - - for alert in &mut alerts { - *alert = std::mem::take(alert).normalise(&external_targets); - } - debug!(count=%alerts.len(), "found some alerts"); - - let mut pg_config = tokio_postgres::Config::default(); - pg_config.application_name(format!( - "{}/{} (tamanu alerts)", - env!("CARGO_PKG_NAME"), - env!("CARGO_PKG_VERSION") - )); - if let Some(host) = &config.db.host { - pg_config.host(host); - } else { - pg_config.host("localhost"); - } - pg_config.user(&config.db.username); - pg_config.password(&config.db.password); - pg_config.dbname(&config.db.name); - info!(config=?pg_config, "connecting to Tamanu database"); - let (client, connection) = pg_config - .connect(tokio_postgres::NoTls) - .await - .into_diagnostic()?; - tokio::spawn(async 
move { - if let Err(e) = connection.await { - eprintln!("connection error: {}", e); - } - }); - - let device_key_pem = fetch_device_key_with(|| query_device_key_row(&client)).await; - - let canopy_client = match bestool_canopy::CanopyClient::new( - version.to_string(), - device_key_pem.as_deref(), - crate::http::client_builder, - ) - .await - { - Ok(Some(client)) => { - if client.is_tailscale().await { - info!("canopy client ready via tailscale for legacy alerts"); - } else { - info!("canopy client ready via mTLS for legacy alerts"); - } - Some(Arc::new(client)) - } - Ok(None) => { - info!("no canopy auth path available; canopy targets will be skipped"); - None - } - Err(err) => { - error!("failed to build canopy client: {err:?}"); - None - } - }; - - let config = Arc::new(config); - let internal_ctx = Arc::new(InternalContext { - pg_client: client, - http_client: crate::http::client(), - canopy_client, - }); - - let mut set = JoinSet::new(); - for alert in alerts { - let internal_ctx = internal_ctx.clone(); - let dry_run = args.dry_run; - let timeout_d = args.timeout; - let name = alert.file.clone(); - let config = config.clone(); - set.spawn( - timeout(timeout_d, async move { - let error = format!("while executing alert: {}", alert.file.display()); - if let Err(err) = alert - .execute(internal_ctx, &config, dry_run) - .await - .wrap_err(error) - { - eprintln!("{err:?}"); - } - }) - .or_else(move |elapsed| async move { - error!(alert=?name, "timeout: {elapsed:?}"); - Ok::<_, Infallible>(()) - }), - ); - } - - while let Some(res) = set.join_next().await { - if let Err(err) = res { - error!("task: {err:?}"); - } - } - - Ok(()) -} diff --git a/crates/bestool/src/actions/tamanu/alerts/definition.rs b/crates/bestool/src/actions/tamanu/alerts/definition.rs deleted file mode 100644 index 87d154f7..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/definition.rs +++ /dev/null @@ -1,190 +0,0 @@ -use std::{ - collections::HashMap, io::Write, ops::ControlFlow, 
path::PathBuf, process::Stdio, sync::Arc, - time::Duration, -}; - -use jiff::Timestamp; -use miette::{Context as _, IntoDiagnostic, Result, miette}; -use tera::Context as TeraCtx; -use tokio::io::AsyncReadExt as _; -use tokio_postgres::types::ToSql; -use tracing::{debug, error, info, instrument, warn}; - -use bestool_tamanu::config::TamanuConfig; - -use crate::postgres_to_value::rows_to_value_map; - -use super::{ - InternalContext, - pg_interval::Interval, - targets::{ExternalTarget, SendTarget}, - templates::build_context, -}; - -fn enabled() -> bool { - true -} - -#[derive(serde::Deserialize, Debug, Default)] -pub struct AlertDefinition { - #[serde(default, skip)] - pub file: PathBuf, - - #[serde(default = "enabled")] - pub enabled: bool, - #[serde(skip)] - pub interval: Duration, - #[serde(default)] - pub send: Vec, - - #[serde(flatten)] - pub source: TicketSource, -} - -#[derive(serde::Deserialize, Debug, Default)] -#[serde(untagged, deny_unknown_fields)] -pub enum TicketSource { - Sql { - sql: String, - }, - Shell { - shell: String, - run: String, - }, - - #[default] - None, -} - -impl AlertDefinition { - pub fn normalise(mut self, external_targets: &HashMap>) -> Self { - self.send = self - .send - .iter() - .flat_map(|target| match target { - target @ SendTarget::External { id, .. 
} => target - .resolve_external(external_targets) - .unwrap_or_else(|| { - error!(id, "external target not found"); - Vec::new() - }), - other => vec![other.clone()], - }) - .collect(); - - self - } - - #[instrument(skip(self, client, not_before, context))] - pub async fn read_sources( - &self, - client: &tokio_postgres::Client, - not_before: Timestamp, - context: &mut TeraCtx, - ) -> Result> { - match &self.source { - TicketSource::None => { - debug!(?self.file, "no source, skipping"); - return Ok(ControlFlow::Break(())); - } - TicketSource::Sql { sql } => { - let statement = client.prepare(sql).await.into_diagnostic()?; - - let interval = Interval(self.interval); - let all_params: Vec<&(dyn ToSql + Sync)> = vec![&not_before, &interval]; - - let rows = client - .query(&statement, &all_params[..statement.params().len()]) - .await - .into_diagnostic() - .wrap_err("querying database")?; - - if rows.is_empty() { - debug!(?self.file, "no rows returned, skipping"); - return Ok(ControlFlow::Break(())); - } - info!(?self.file, rows=%rows.len(), "alert triggered"); - - let context_rows = rows_to_value_map(&rows); - - context.insert("rows", &context_rows); - } - TicketSource::Shell { shell, run } => { - let mut script = tempfile::Builder::new().tempfile().into_diagnostic()?; - write!(script.as_file_mut(), "{run}").into_diagnostic()?; - - let mut shell = tokio::process::Command::new(shell) - .arg(script.path()) - .stdin(Stdio::null()) - .stdout(Stdio::piped()) - .spawn() - .into_diagnostic()?; - - let mut output = Vec::new(); - let mut stdout = shell - .stdout - .take() - .ok_or_else(|| miette!("getting the child stdout handle"))?; - let output_future = - futures::future::try_join(shell.wait(), stdout.read_to_end(&mut output)); - - let Ok(res) = tokio::time::timeout(self.interval, output_future).await else { - warn!(?self.file, "the script timed out, skipping"); - shell.kill().await.into_diagnostic()?; - return Ok(ControlFlow::Break(())); - }; - - let (status, output_size) = 
res.into_diagnostic().wrap_err("running the shell")?; - - if status.success() { - debug!(?self.file, "the script succeeded, skipping"); - return Ok(ControlFlow::Break(())); - } - info!(?self.file, ?status, ?output_size, "alert triggered"); - - context.insert("output", &String::from_utf8_lossy(&output)); - } - } - Ok(ControlFlow::Continue(())) - } - - pub async fn execute( - self, - ctx: Arc, - config: &TamanuConfig, - dry_run: bool, - ) -> Result<()> { - info!(?self.file, "executing alert"); - - let now = crate::now_time(); - let not_before = now - self.interval; - info!(?now, ?not_before, interval=?self.interval, "date range for alert"); - - let mut tera_ctx = build_context(&self, now); - if self - .read_sources(&ctx.pg_client, not_before, &mut tera_ctx) - .await? - .is_break() - { - // Alert didn't trigger this run — fire clear events for stateful - // targets (canopy). Non-stateful targets no-op. - for target in &self.send { - if let Err(err) = target.send_clear(&self, &ctx, dry_run).await { - error!("sending clear: {err:?}"); - } - } - return Ok(()); - } - - for target in &self.send { - if let Err(err) = target - .send(&self, ctx.clone(), &mut tera_ctx, config, dry_run) - .await - { - error!("sending: {err:?}"); - } - } - - Ok(()) - } -} diff --git a/crates/bestool/src/actions/tamanu/alerts/pg_interval.rs b/crates/bestool/src/actions/tamanu/alerts/pg_interval.rs deleted file mode 100644 index b54c8cfa..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/pg_interval.rs +++ /dev/null @@ -1,23 +0,0 @@ -use std::{error::Error, time::Duration}; - -use bytes::{BufMut, BytesMut}; -use miette::Result; -use tokio_postgres::types::{IsNull, ToSql, Type}; - -#[derive(Debug)] -pub struct Interval(pub Duration); - -impl ToSql for Interval { - fn to_sql(&self, _: &Type, out: &mut BytesMut) -> Result> { - out.put_i64(self.0.as_micros().try_into().unwrap_or_default()); - out.put_i32(0); - out.put_i32(0); - Ok(IsNull::No) - } - - fn accepts(ty: &Type) -> bool { - 
matches!(*ty, Type::INTERVAL) - } - - tokio_postgres::types::to_sql_checked!(); -} diff --git a/crates/bestool/src/actions/tamanu/alerts/targets.rs b/crates/bestool/src/actions/tamanu/alerts/targets.rs deleted file mode 100644 index 73c4af44..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/targets.rs +++ /dev/null @@ -1,205 +0,0 @@ -use std::{collections::HashMap, sync::Arc}; - -use miette::Result; - -use bestool_tamanu::config::TamanuConfig; - -use super::{ - InternalContext, - definition::AlertDefinition, - templates::{load_templates, render_alert}, -}; - -pub(super) mod canopy; -mod email; -mod slack; -pub(super) mod zendesk; - -#[derive(serde::Deserialize, Debug, Clone)] -#[serde(rename_all = "snake_case", tag = "target")] -pub enum SendTarget { - Email { - subject: Option, - template: String, - #[serde(flatten)] - conn: email::TargetEmail, - }, - Zendesk { - subject: Option, - template: String, - #[serde(flatten)] - conn: zendesk::TargetZendesk, - }, - Slack { - subject: Option, - template: String, - #[serde(flatten)] - conn: slack::TargetSlack, - }, - Canopy { - subject: Option, - template: String, - #[serde(flatten)] - conn: canopy::TargetCanopy, - }, - External { - subject: Option, - template: String, - id: String, - }, -} - -impl SendTarget { - /// Returns the resolved target id used for canopy ref construction. - /// - /// For external-id-referenced targets that have been resolved into a typed - /// variant we lose the original `_targets.yml` id; the legacy command's - /// inline targets just get the literal "send". - fn target_id(&self) -> &str { - "send" - } - - pub fn resolve_external( - &self, - external_targets: &HashMap>, - ) -> Option> { - match self { - Self::External { - id, - subject, - template, - } => external_targets.get(id).map(|exts| { - exts.iter() - .map(|ext| match ext { - ExternalTarget::Email { conn, .. 
} => SendTarget::Email { - subject: subject.clone(), - template: template.clone(), - conn: conn.clone(), - }, - ExternalTarget::Zendesk { conn, .. } => SendTarget::Zendesk { - subject: subject.clone(), - template: template.clone(), - conn: conn.clone(), - }, - ExternalTarget::Slack { conn, .. } => SendTarget::Slack { - subject: subject.clone(), - template: template.clone(), - conn: conn.clone(), - }, - ExternalTarget::Canopy { conn, .. } => SendTarget::Canopy { - subject: subject.clone(), - template: template.clone(), - conn: conn.clone(), - }, - }) - .collect() - }), - _ => None, - } - } - - pub async fn send( - &self, - alert: &AlertDefinition, - ctx: Arc, - tera_ctx: &mut tera::Context, - config: &TamanuConfig, - dry_run: bool, - ) -> Result<()> { - let tera = load_templates(self)?; - let (subject, body, requester) = render_alert(&tera, tera_ctx)?; - - match self { - SendTarget::Email { conn, .. } => { - conn.send(alert, config, &subject, &body, dry_run).await?; - } - - SendTarget::Slack { conn, .. } => { - conn.send(slack::SlackSendParams { - alert, - ctx: &ctx, - subject: &subject, - body: &body, - tera: &tera, - tera_ctx, - dry_run, - }) - .await?; - } - - SendTarget::Zendesk { conn, .. } => { - conn.send(alert, &ctx, &subject, &body, requester.as_deref(), dry_run) - .await?; - } - - SendTarget::Canopy { conn, .. } => { - conn.send(&ctx, alert, self.target_id(), &subject, &body, dry_run) - .await?; - } - - SendTarget::External { .. } => { - unreachable!("external targets should be resolved before here"); - } - } - - Ok(()) - } - - /// Send a "cleared" notification for stateful targets (canopy). - /// - /// Non-stateful targets (email, slack, zendesk) return Ok immediately. - pub async fn send_clear( - &self, - alert: &AlertDefinition, - ctx: &InternalContext, - dry_run: bool, - ) -> Result<()> { - match self { - SendTarget::Canopy { conn, .. 
} => { - conn.send_clear(ctx, alert, self.target_id(), dry_run).await - } - _ => Ok(()), - } - } -} - -#[derive(serde::Deserialize, Debug)] -pub struct AlertTargets { - pub targets: Vec<ExternalTarget>, -} - -#[derive(serde::Deserialize, Clone, Debug)] -#[serde(rename_all = "snake_case", tag = "target")] -pub enum ExternalTarget { - Email { - id: String, - #[serde(flatten)] - conn: email::TargetEmail, - }, - Zendesk { - id: String, - #[serde(flatten)] - conn: zendesk::TargetZendesk, - }, - Slack { - id: String, - #[serde(flatten)] - conn: slack::TargetSlack, - }, - Canopy { - id: String, - #[serde(flatten)] - conn: canopy::TargetCanopy, - }, -} - -impl ExternalTarget { - pub fn id(&self) -> &str { - match self { - Self::Email { id, .. } => id, - Self::Zendesk { id, .. } => id, - Self::Slack { id, .. } => id, - Self::Canopy { id, .. } => id, - } - } -} diff --git a/crates/bestool/src/actions/tamanu/alerts/targets/canopy.rs b/crates/bestool/src/actions/tamanu/alerts/targets/canopy.rs deleted file mode 100644 index 3e8c1cd0..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/targets/canopy.rs +++ /dev/null @@ -1,127 +0,0 @@ -use bestool_canopy::{DEFAULT_CANOPY_URL, NewEvent, Severity}; -use jiff::Timestamp; -use miette::{Result, miette}; -use reqwest::Url; -use sysinfo::System; -use tracing::debug; - -use crate::actions::tamanu::alerts::{InternalContext, definition::AlertDefinition}; - -fn default_canopy_url() -> Url { - DEFAULT_CANOPY_URL.parse().expect("default canopy URL is valid") -} - -#[derive(serde::Deserialize, Clone, Debug)] -#[serde(rename_all = "snake_case")] -pub struct TargetCanopy { - #[serde(default = "default_canopy_url")] - pub url: Url, - pub source: String, - #[serde(default)] - pub severity: Option<Severity>, -} - -/// Build the deduplication ref for a canopy event. -/// -/// Combines hostname, alert file stem, and target id so the same alert firing -/// on different hosts or to different canopy targets produces distinct issues.
-fn build_ref(alert: &AlertDefinition, target_id: &str) -> String { - let hostname = System::host_name().unwrap_or_else(|| "unknown".into()); - let stem = alert - .file - .file_stem() - .map(|s| s.to_string_lossy().into_owned()) - .unwrap_or_else(|| "alert".into()); - format!("{hostname}/{stem}:{target_id}") -} - -impl TargetCanopy { - pub async fn send( - &self, - ctx: &InternalContext, - alert: &AlertDefinition, - target_id: &str, - subject: &str, - body: &str, - dry_run: bool, - ) -> Result<()> { - let r#ref = build_ref(alert, target_id); - - if dry_run { - println!("-------------------------------"); - println!("Alert: {}", alert.file.display()); - println!("Recipients: canopy:{}", self.url); - println!("Source: {}", self.source); - println!("Ref: {ref}", ref = r#ref); - println!("Severity: {:?}", self.severity.unwrap_or(Severity::Error)); - println!("Active: true"); - println!("Subject: {subject}"); - println!("Body: {body}"); - return Ok(()); - } - - let client = ctx - .canopy_client - .as_deref() - .ok_or_else(|| miette!("canopy target {target_id} configured but no device key was loaded"))?; - - debug!(?alert.file, target_id, "sending canopy trigger event"); - - client - .post_event( - &self.url, - NewEvent { - source: &self.source, - r#ref: &r#ref, - message: body, - description: Some(subject), - severity: Some(self.severity.unwrap_or(Severity::Error)), - occurred_at: Some(Timestamp::now()), - active: Some(true), - }, - ) - .await - } - - pub async fn send_clear( - &self, - ctx: &InternalContext, - alert: &AlertDefinition, - target_id: &str, - dry_run: bool, - ) -> Result<()> { - let r#ref = build_ref(alert, target_id); - - if dry_run { - println!("-------------------------------"); - println!("Alert (cleared): {}", alert.file.display()); - println!("Recipients: canopy:{}", self.url); - println!("Source: {}", self.source); - println!("Ref: {ref}", ref = r#ref); - println!("Active: false"); - return Ok(()); - } - - let Some(client) = 
ctx.canopy_client.as_deref() else { - debug!(target_id, "no device key loaded, skipping canopy clear"); - return Ok(()); - }; - - debug!(?alert.file, target_id, "sending canopy clear event"); - - client - .post_event( - &self.url, - NewEvent { - source: &self.source, - r#ref: &r#ref, - message: "alert cleared", - description: None, - severity: Some(self.severity.unwrap_or(Severity::Error)), - occurred_at: Some(Timestamp::now()), - active: Some(false), - }, - ) - .await - } -} diff --git a/crates/bestool/src/actions/tamanu/alerts/targets/email.rs b/crates/bestool/src/actions/tamanu/alerts/targets/email.rs deleted file mode 100644 index a96fd492..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/targets/email.rs +++ /dev/null @@ -1,67 +0,0 @@ -use mailgun_rs::{EmailAddress, Mailgun, Message}; -use miette::{IntoDiagnostic, Result, WrapErr, miette}; -use tracing::debug; - -use bestool_tamanu::config::TamanuConfig; - -use crate::actions::tamanu::alerts::definition::AlertDefinition; - -#[derive(serde::Deserialize, Clone, Debug)] -#[serde(rename_all = "snake_case")] -pub struct TargetEmail { - pub addresses: Vec, -} - -impl TargetEmail { - pub async fn send( - &self, - alert: &AlertDefinition, - config: &TamanuConfig, - subject: &str, - body: &str, - dry_run: bool, - ) -> Result<()> { - let body = { - let parser = pulldown_cmark::Parser::new(body); - let mut html_output = String::new(); - pulldown_cmark::html::push_html(&mut html_output, parser); - html_output - }; - - if dry_run { - println!("-------------------------------"); - println!("Alert: {}", alert.file.display()); - println!("Recipients: {}", self.addresses.join(", ")); - println!("Subject: {subject}"); - println!("Body: {body}"); - return Ok(()); - } - - debug!(?self.addresses, "sending email"); - let mailgun_config = config - .mailgun - .as_ref() - .ok_or_else(|| miette!("missing mailgun config"))?; - let sender = EmailAddress::address(&mailgun_config.sender); - let mailgun = Mailgun { - api_key: 
mailgun_config.api_key.clone(), - domain: mailgun_config.domain.clone(), - }; - let message = Message { - to: self - .addresses - .iter() - .map(|email| EmailAddress::address(email)) - .collect(), - subject: subject.into(), - html: body, - ..Default::default() - }; - mailgun - .async_send(mailgun_rs::MailgunRegion::US, &sender, message, None) - .await - .into_diagnostic() - .wrap_err("sending email") - .map(drop) - } -} diff --git a/crates/bestool/src/actions/tamanu/alerts/targets/slack.rs b/crates/bestool/src/actions/tamanu/alerts/targets/slack.rs deleted file mode 100644 index 72cc05b4..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/targets/slack.rs +++ /dev/null @@ -1,112 +0,0 @@ -use std::collections::HashMap; - -use miette::{IntoDiagnostic, Result, WrapErr}; -use reqwest::Url; -use tera::Tera; -use tracing::debug; - -use crate::actions::tamanu::alerts::{ - InternalContext, definition::AlertDefinition, templates::TemplateField, -}; - -/// Parameters for sending a Slack alert -pub struct SlackSendParams<'a> { - pub alert: &'a AlertDefinition, - pub ctx: &'a InternalContext, - pub subject: &'a str, - pub body: &'a str, - pub tera: &'a Tera, - pub tera_ctx: &'a tera::Context, - pub dry_run: bool, -} - -#[derive(serde::Deserialize, Clone, Debug)] -#[serde(rename_all = "snake_case")] -pub struct TargetSlack { - pub webhook: Url, - - #[serde(default = "SlackField::default_set")] - pub fields: Vec, -} - -#[derive(serde::Deserialize, Clone, Debug)] -#[serde(untagged, rename_all = "snake_case")] -pub enum SlackField { - Fixed { name: String, value: String }, - Field { name: String, field: TemplateField }, -} - -impl SlackField { - pub fn default_set() -> Vec { - vec![ - Self::Field { - name: "hostname".into(), - field: TemplateField::Hostname, - }, - Self::Field { - name: "filename".into(), - field: TemplateField::Filename, - }, - Self::Field { - name: "subject".into(), - field: TemplateField::Subject, - }, - Self::Field { - name: "message".into(), - field: 
TemplateField::Body, - }, - ] - } -} - -impl TargetSlack { - pub async fn send(&self, params: SlackSendParams<'_>) -> Result<()> { - let SlackSendParams { - alert, - ctx, - subject, - body, - tera, - tera_ctx, - dry_run, - } = params; - if dry_run { - println!("-------------------------------"); - println!("Alert: {}", alert.file.display()); - println!("Recipients: slack"); - println!("Subject: {subject}"); - println!("Body: {body}"); - return Ok(()); - } - - let payload: HashMap<&String, String> = self - .fields - .iter() - .map(|field| match field { - SlackField::Fixed { name, value } => (name, value.clone()), - SlackField::Field { name, field } => ( - name, - tera.render(field.as_str(), tera_ctx) - .ok() - .or_else(|| { - tera_ctx.get(field.as_str()).map(|v| match v.as_str() { - Some(t) => t.to_owned(), - None => v.to_string(), - }) - }) - .unwrap_or_default(), - ), - }) - .collect(); - - debug!(?self.webhook, ?payload, "posting to slack webhook"); - ctx.http_client - .post(self.webhook.clone()) - .json(&payload) - .send() - .await - .into_diagnostic() - .wrap_err("posting to slack webhook") - .map(drop) - } -} diff --git a/crates/bestool/src/actions/tamanu/alerts/targets/zendesk.rs b/crates/bestool/src/actions/tamanu/alerts/targets/zendesk.rs deleted file mode 100644 index 00dab629..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/targets/zendesk.rs +++ /dev/null @@ -1,89 +0,0 @@ -use miette::{IntoDiagnostic, Result, WrapErr}; -use reqwest::Url; -use serde_json::json; - -use crate::actions::tamanu::alerts::{InternalContext, definition::AlertDefinition}; - -#[derive(serde::Deserialize, Clone, Debug)] -#[serde(rename_all = "snake_case")] -pub struct TargetZendesk { - pub endpoint: Url, - - #[serde(flatten)] - pub method: ZendeskMethod, - - pub ticket_form_id: Option, - - #[serde(default)] - pub custom_fields: Vec, -} - -#[derive(serde::Deserialize, Clone, Debug)] -#[serde(untagged, deny_unknown_fields)] -pub enum ZendeskMethod { - // Make credentials and 
requester fields exclusive as specifying the requester object in authorized - // request is invalid. We may be able to specify some account as the requester, but it's not - // necessary. That's because the requester defaults to the authenticated account. - Authorized { credentials: ZendeskCredentials }, - Anonymous { requester: String }, -} - -#[derive(serde::Deserialize, Clone, Debug)] -pub struct ZendeskCredentials { - pub email: String, - pub password: String, -} - -#[derive(serde::Deserialize, serde::Serialize, Clone, Debug)] -pub struct ZendeskCustomField { - pub id: u64, - pub value: String, -} - -impl TargetZendesk { - pub async fn send( - &self, - alert: &AlertDefinition, - ctx: &InternalContext, - subject: &str, - body: &str, - requester: Option<&str>, - dry_run: bool, - ) -> Result<()> { - if dry_run { - println!("-------------------------------"); - println!("Alert: {}", alert.file.display()); - println!("Endpoint: {}", self.endpoint); - println!("Subject: {subject}"); - println!("Body: {body}"); - return Ok(()); - } - - let req = json!({ - "request": { - "subject": subject, - "ticket_form_id": self.ticket_form_id, - "custom_fields": self.custom_fields, - "comment": { "html_body": body }, - "requester": requester.map(|r| json!({ "name": r })) - } - }); - - let mut req_builder = ctx.http_client.post(self.endpoint.clone()).json(&req); - - if let ZendeskMethod::Authorized { - credentials: ZendeskCredentials { email, password }, - } = &self.method - { - req_builder = - req_builder.basic_auth(std::format_args!("{email}/token"), Some(password)); - } - - req_builder - .send() - .await - .into_diagnostic() - .wrap_err("creating Zendesk ticket") - .map(drop) - } -} diff --git a/crates/bestool/src/actions/tamanu/alerts/templates.rs b/crates/bestool/src/actions/tamanu/alerts/templates.rs deleted file mode 100644 index 4d259d92..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/templates.rs +++ /dev/null @@ -1,159 +0,0 @@ -use std::{fmt::Display, 
time::Duration}; - -use miette::{Context as _, IntoDiagnostic, Result}; -use sysinfo::System; -use tera::{Context as TeraCtx, Tera}; -use tracing::{instrument, warn}; - -use super::{ - definition::AlertDefinition, - targets::{ - SendTarget, - zendesk::{TargetZendesk, ZendeskMethod}, - }, -}; - -const DEFAULT_SUBJECT_TEMPLATE: &str = "[Tamanu Alert] {{ filename }} ({{ hostname }})"; - -#[derive(serde::Deserialize, Clone, Copy, Debug)] -#[serde(rename_all = "snake_case")] -pub enum TemplateField { - Filename, - Subject, - Body, - Hostname, - Requester, - Interval, -} - -impl TemplateField { - pub fn as_str(self) -> &'static str { - match self { - Self::Filename => "filename", - Self::Subject => "subject", - Self::Body => "body", - Self::Hostname => "hostname", - Self::Requester => "requester", - Self::Interval => "interval", - } - } -} - -impl Display for TemplateField { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - write!(f, "{}", self.as_str()) - } -} - -#[instrument] -pub fn load_templates(target: &SendTarget) -> Result { - let mut tera = tera::Tera::default(); - - match target { - SendTarget::Email { - subject, template, .. - } - | SendTarget::Zendesk { - subject, template, .. - } - | SendTarget::Slack { - subject, template, .. - } - | SendTarget::Canopy { - subject, template, .. - } - | SendTarget::External { - subject, template, .. - } => { - tera.add_raw_template( - TemplateField::Subject.as_str(), - subject.as_deref().unwrap_or(DEFAULT_SUBJECT_TEMPLATE), - ) - .into_diagnostic() - .wrap_err("compiling subject template")?; - tera.add_raw_template(TemplateField::Body.as_str(), template) - .into_diagnostic() - .wrap_err("compiling body template")?; - } - } - - if let SendTarget::Zendesk { - conn: TargetZendesk { - method: ZendeskMethod::Anonymous { requester }, - .. - }, - .. 
- } = target - { - tera.add_raw_template(TemplateField::Requester.as_str(), requester) - .into_diagnostic() - .wrap_err("compiling requester template")?; - } - Ok(tera) -} - -/// Format a duration as a single human-friendly unit, dropping any remainder. -/// -/// E.g. 90 minutes prints as "1h"; 1 day as "1d"; 30 seconds as "30s". -pub(crate) fn humanize_duration(dur: Duration) -> String { - let secs = dur.as_secs(); - if secs >= 86400 { - format!("{}d", secs / 86400) - } else if secs >= 3600 { - format!("{}h", secs / 3600) - } else if secs >= 60 { - format!("{}m", secs / 60) - } else { - format!("{}s", secs) - } -} - -#[instrument(skip(alert, now))] -pub fn build_context(alert: &AlertDefinition, now: jiff::Timestamp) -> TeraCtx { - let mut context = TeraCtx::new(); - context.insert( - TemplateField::Interval.as_str(), - &humanize_duration(alert.interval), - ); - context.insert( - TemplateField::Hostname.as_str(), - System::host_name().as_deref().unwrap_or("unknown"), - ); - context.insert( - TemplateField::Filename.as_str(), - &alert.file.file_name().unwrap().to_string_lossy(), - ); - context.insert("now", &now.to_string()); - - context -} - -#[instrument(skip(tera, context))] -pub fn render_alert( - tera: &Tera, - context: &mut TeraCtx, -) -> Result<(String, String, Option<String>)> { - let subject = tera - .render(TemplateField::Subject.as_str(), context) - .into_diagnostic() - .wrap_err("rendering subject template")?; - - context.insert(TemplateField::Subject.as_str(), &subject.to_string()); - - let body = tera - .render(TemplateField::Body.as_str(), context) - .into_diagnostic() - .wrap_err("rendering email template")?; - - let requester = tera - .render(TemplateField::Requester.as_str(), context) - .map(Some) - .or_else(|err| match err.kind { - tera::ErrorKind::TemplateNotFound(_) => Ok(None), - _ => Err(err), - }) - .into_diagnostic() - .wrap_err("rendering requester template")?; - - Ok((subject, body, requester)) -} diff --git
a/crates/bestool/src/actions/tamanu/alerts/tests.rs b/crates/bestool/src/actions/tamanu/alerts/tests.rs deleted file mode 100644 index d89e915e..00000000 --- a/crates/bestool/src/actions/tamanu/alerts/tests.rs +++ /dev/null @@ -1,226 +0,0 @@ -use std::{path::PathBuf, time::Duration}; - -use jiff::Timestamp; - -use super::{ - definition::{AlertDefinition, TicketSource}, - targets::SendTarget, - templates::build_context, -}; - -fn interval_context(dur: Duration) -> Option { - let alert = AlertDefinition { - file: PathBuf::from("test.yaml"), - enabled: true, - interval: dur, - source: TicketSource::Sql { sql: "".into() }, - send: vec![], - }; - build_context(&alert, Timestamp::now()) - .get("interval") - .and_then(|v| v.as_str()) - .map(|s| s.to_owned()) -} - -#[test] -fn test_interval_format_minutes() { - assert_eq!( - interval_context(Duration::from_secs(15 * 60)).as_deref(), - Some("15m"), - ); -} - -#[test] -fn test_interval_format_hour() { - assert_eq!( - interval_context(Duration::from_secs(60 * 60)).as_deref(), - Some("1h"), - ); -} - -#[test] -fn test_interval_format_day() { - assert_eq!( - interval_context(Duration::from_secs(24 * 60 * 60)).as_deref(), - Some("1d"), - ); -} - -#[test] -fn test_alert_parse_email() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: email - addresses: [test@example.com] - subject: "[Tamanu Alert] Example ({{ hostname }})" - template: | -

Server: {{ hostname }} - There are {{ rows | length }} rows.
-"#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - let alert = alert.normalise(&Default::default()); - assert_eq!(alert.interval, std::time::Duration::default()); - assert!(matches!(alert.source, TicketSource::Sql { sql } if sql == "SELECT $1::timestamptz;")); - assert!(matches!(alert.send[0], SendTarget::Email { .. })); -} - -#[test] -fn test_alert_parse_shell() { - let alert = r#" -shell: bash -run: echo foobar -"#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - let alert = alert.normalise(&Default::default()); - assert_eq!(alert.interval, std::time::Duration::default()); - assert!( - matches!(alert.source, TicketSource::Shell { shell, run } if shell == "bash" && run == "echo foobar") - ); -} - -#[test] -fn test_alert_parse_invalid_source() { - let alert = r#" -shell: bash -"#; - assert!(serde_yaml::from_str::(alert).is_err()); - let alert = r#" -run: echo foo -"#; - assert!(serde_yaml::from_str::(alert).is_err()); - let alert = r#" -sql: SELECT $1::timestamptz; -run: echo foo -"#; - assert!(serde_yaml::from_str::(alert).is_err()); - let alert = r#" -sql: SELECT $1::timestamptz; -shell: bash -"#; - assert!(serde_yaml::from_str::(alert).is_err()); - let alert = r#" -sql: SELECT $1::timestamptz; -shell: bash -run: echo foo -"#; - assert!(serde_yaml::from_str::(alert).is_err()); -} - -#[test] -fn test_alert_parse_zendesk_authorized() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: zendesk - endpoint: https://example.zendesk.com/api/v2/requests - credentials: - email: foo@example.com - password: pass - subject: "[Tamanu Alert] Example ({{ hostname }})" - template: "Output: {{ output }}""#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - assert!(matches!(alert.send[0], SendTarget::Zendesk { .. 
})); -} - -#[test] -fn test_alert_parse_zendesk_anon() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: zendesk - endpoint: https://example.zendesk.com/api/v2/requests - requester: "{{ hostname }}" - subject: "[Tamanu Alert] Example ({{ hostname }})" - template: "Output: {{ output }}""#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - assert!(matches!(alert.send[0], SendTarget::Zendesk { .. })); -} - -#[test] -fn test_alert_parse_zendesk_form_fields() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: zendesk - endpoint: https://example.zendesk.com/api/v2/requests - requester: "{{ hostname }}" - subject: "[Tamanu Alert] Example ({{ hostname }})" - template: "Output: {{ output }}" - ticket_form_id: 500 - custom_fields: - - id: 100 - value: tamanu_ - - id: 200 - value: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - assert!(matches!(alert.send[0], SendTarget::Zendesk { .. })); -} - -#[test] -fn test_alert_parse_slack() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: slack - webhook: https://hooks.slack.com/triggers/ - template: Something happened -"#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - assert!(matches!(alert.send[0], SendTarget::Slack { .. })); -} - -#[test] -fn test_alert_parse_canopy_inline() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: canopy - source: my-tamanu - severity: warning - subject: "{{ hostname }}: low disk" - template: "There are {{ rows | length }} rows." -"#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - assert!(matches!(alert.send[0], SendTarget::Canopy { .. 
})); -} - -#[test] -fn test_alert_parse_canopy_default_url() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: canopy - source: my-tamanu - subject: "{{ hostname }}: alert" - template: "Something happened" -"#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - match &alert.send[0] { - SendTarget::Canopy { conn, .. } => { - assert_eq!(conn.url.as_str(), "https://meta.tamanu.app/"); - assert_eq!(conn.source, "my-tamanu"); - assert_eq!(conn.severity, None); - } - _ => panic!("expected canopy target"), - } -} - -#[test] -fn test_alert_parse_slack_fields() { - let alert = r#" -sql: SELECT $1::timestamptz; -send: -- target: slack - webhook: https://hooks.slack.com/triggers/ - template: Something happened - fields: - - name: alertname - field: filename - - name: deployment - value: production -"#; - let alert: AlertDefinition = serde_yaml::from_str(alert).unwrap(); - assert!(matches!(alert.send[0], SendTarget::Slack { .. })); -} diff --git a/crates/bestool/src/lib.rs b/crates/bestool/src/lib.rs index a2211b42..67d10d0c 100644 --- a/crates/bestool/src/lib.rs +++ b/crates/bestool/src/lib.rs @@ -10,9 +10,6 @@ pub(crate) mod download; pub mod find_postgres; pub(crate) mod http; -#[cfg(feature = "tamanu-alerts")] -pub(crate) mod postgres_to_value; - #[cfg(doc)] pub mod __help { //! Documentation-only module containing the help pages for the CLI tool. diff --git a/crates/bestool/src/postgres_to_value.rs b/crates/bestool/src/postgres_to_value.rs deleted file mode 100644 index 482cd4c3..00000000 --- a/crates/bestool/src/postgres_to_value.rs +++ /dev/null @@ -1,82 +0,0 @@ -// Copied from https://docs.rs/crate/serde_postgres/latest/source -// which seems to be gone from github and used outdated dependencies. -// Copyright to the original authors (1aim), MIT+Apache-2.0 licensed. 
- -use std::{collections::HashMap, error::Error, ops::Deref}; - -use jiff::Timestamp; -use tokio_postgres::{ - Row, - types::{FromSql, Type}, -}; -use uuid::Uuid; - -/// The raw bytes of a value, allowing "conversion" from any postgres type. -/// -/// This type intentionally cannot be converted from `NULL`, and attempting to -/// do so will result in an error. Instead, use `Option`. -pub struct Raw<'a>(pub &'a [u8]); - -impl<'a> FromSql<'a> for Raw<'a> { - fn from_sql(_ty: &Type, raw: &'a [u8]) -> Result> { - Ok(Raw(raw)) - } - - fn accepts(_ty: &Type) -> bool { - true - } -} - -impl<'a> Deref for Raw<'a> { - type Target = [u8]; - - fn deref(&self) -> &Self::Target { - self.0 - } -} - -pub fn col_to_value( - col: &tokio_postgres::Column, - row: &tokio_postgres::Row, - i: usize, -) -> serde_json::Value { - use serde_json::Value; - use tokio_postgres::types::Type; - - if let Ok(None) = row.try_get::<_, Option>>(i) { - return Value::Null; - } - - match col.type_() { - t if *t == Type::BOOL => Value::Bool(row.try_get(i).unwrap()), - t if *t == Type::INT2 => { - Value::Number(serde_json::Number::from(row.try_get::<_, i16>(i).unwrap())) - } - t if *t == Type::INT4 => { - Value::Number(serde_json::Number::from(row.try_get::<_, i32>(i).unwrap())) - } - t if *t == Type::INT8 => { - Value::Number(serde_json::Number::from(row.try_get::<_, i64>(i).unwrap())) - } - // TODO: BYTEA - _ if row.try_get::<_, Timestamp>(i).is_ok() => { - Value::String(row.try_get::<_, Timestamp>(i).unwrap().to_string()) - } - _ if row.try_get::<_, Uuid>(i).is_ok() => { - Value::String(row.try_get::<_, Uuid>(i).unwrap().to_string()) - } - _ => Value::String(row.try_get(i).unwrap_or("(unknown)".into())), - } -} - -pub fn rows_to_value_map(rows: &[Row]) -> Vec> { - rows.iter() - .map(|row| { - let mut map = HashMap::new(); - for (i, col) in row.columns().iter().enumerate() { - map.insert(col.name().to_owned(), col_to_value(col, row, i)); - } - map - }) - .collect() -} diff --git 
a/crates/bestool/tests/cli_tests.rs b/crates/bestool/tests/cli_tests.rs index 37b17314..cde8553d 100644 --- a/crates/bestool/tests/cli_tests.rs +++ b/crates/bestool/tests/cli_tests.rs @@ -1,22 +1,8 @@ -mod fixture_pg; - -use fixture_pg::{init_db, run_db}; - #[test] fn cli_tests() { - let cases = trycmd::TestCases::new(); - cases + trycmd::TestCases::new() .env("BESTOOL_MOCK_TIME", "1") .env("NO_COLOR", "1") - .case("tests/cmd/*.toml"); - - let handle_res = init_db().and_then(run_db); - - // Ignore tests that depend on Postgres if the Postgres test fixture failed. - // Add more `cases.skip()` here if any test use Postgres. - if handle_res.is_err() { - cases.skip("tests/cmd/alerts.toml"); - } - - cases.run(); + .case("tests/cmd/*.toml") + .run(); } diff --git a/crates/bestool/tests/cmd/alerts.in/alerts/sql.yml b/crates/bestool/tests/cmd/alerts.in/alerts/sql.yml deleted file mode 100644 index 605cdd40..00000000 --- a/crates/bestool/tests/cmd/alerts.in/alerts/sql.yml +++ /dev/null @@ -1,16 +0,0 @@ -send: - - target: email - addresses: - - test@example.com - subject: "Tamanu Alert - {{ now }}" - template: | - Automated alert! There have been {{ rows | length }} jobs - with errors in the past {{ interval }}. 
Here are the first 2: - {% for row in rows | slice(end=2) %} - - {{ row.topic }}: {{ row.error }} - {% endfor %} - -sql: | - SELECT * FROM jobs - WHERE error IS NOT NULL - AND created_at > $1 diff --git a/crates/bestool/tests/cmd/alerts.in/tamanu/package.json b/crates/bestool/tests/cmd/alerts.in/tamanu/package.json deleted file mode 100644 index 3e84cf95..00000000 --- a/crates/bestool/tests/cmd/alerts.in/tamanu/package.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "version": "2.0.0" -} diff --git a/crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/default.json5 b/crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/default.json5 deleted file mode 100644 index 0967ef42..00000000 --- a/crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/default.json5 +++ /dev/null @@ -1 +0,0 @@ -{} diff --git a/crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/local.json5 b/crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/local.json5 deleted file mode 100644 index eeeb0ab1..00000000 --- a/crates/bestool/tests/cmd/alerts.in/tamanu/packages/central-server/config/local.json5 +++ /dev/null @@ -1,13 +0,0 @@ -{ - "db": { - "host": "localhost", - "name": "postgres", - "username": "postgres", - "password": "password" - }, - "mailgun": { - "domain": "", - "apiKey": "", - "from": "" - } -} diff --git a/crates/bestool/tests/cmd/alerts.stdout b/crates/bestool/tests/cmd/alerts.stdout deleted file mode 100644 index 1a4d62e5..00000000 --- a/crates/bestool/tests/cmd/alerts.stdout +++ /dev/null @@ -1,15 +0,0 @@ -------------------------------- -Alert: ./alerts/sql.yml -Recipients: test@example.com -Subject: Tamanu Alert - 1970-01-01 00:00:00 UTC -Body:

<p>Automated alert! There have been 5 jobs -with errors in the past 1w. Here are the first 2:</p> -<ul> -<li> -<p>foo: err</p> -</li> -<li> -<p>bar: err</p> -</li> -</ul>
- diff --git a/crates/bestool/tests/cmd/alerts.toml b/crates/bestool/tests/cmd/alerts.toml deleted file mode 100644 index ebef0801..00000000 --- a/crates/bestool/tests/cmd/alerts.toml +++ /dev/null @@ -1,2 +0,0 @@ -bin.name = "bestool" -args = "tamanu --root ./tamanu alerts --interval 1w --dir ./alerts --dry-run" diff --git a/crates/bestool/tests/fixture.sql b/crates/bestool/tests/fixture.sql deleted file mode 100644 index 0214f263..00000000 --- a/crates/bestool/tests/fixture.sql +++ /dev/null @@ -1,17 +0,0 @@ -CREATE TABLE public.jobs ( - id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, - created_at timestamp with time zone DEFAULT now() NOT NULL, - error text, - topic text NOT NULL -); - -COPY public.jobs (id, created_at, error, topic) FROM stdin; -1 1970-01-01 00:00:00 z \N bar -2 1970-01-01 00:00:00 z err foo -3 1970-01-01 00:00:00 z err bar -4 1970-01-01 00:00:00 z err baz -5 1970-01-01 00:00:00 z err qux -6 1970-01-01 00:00:00 z err foo -\. - -SELECT pg_catalog.setval('public.jobs_id_seq', 6, true); diff --git a/crates/bestool/tests/fixture_pg.rs b/crates/bestool/tests/fixture_pg.rs deleted file mode 100644 index 8b631b97..00000000 --- a/crates/bestool/tests/fixture_pg.rs +++ /dev/null @@ -1,133 +0,0 @@ -//! This is a set of utilities to setup a temporary Postgres cluster. The design is inspired by -//! [`pg_test`](https://github.com/rubenv/pgtest) and -//! [`pgtemp`](https://github.com/boustrophedon/pgtemp). This is more lightweight than containers -//! and cleaner than simply creating databases. The code is lightly adapted from `pgtemp` -//! (MIT license) with handlable errors. - -use std::fs; - -use bestool::find_postgres::find_postgres_bin; -use miette::{Context, IntoDiagnostic, Result}; -use tempfile::TempDir; - -/// Execute the `initdb` binary. 
-pub fn init_db() -> Result { - let temp_dir = TempDir::with_prefix("bestool-").into_diagnostic()?; - - let data_dir = temp_dir.path().join("data"); - - // write out password file for initdb - let pwfile = temp_dir.path().join("user_password.txt"); - fs::write(&pwfile, "password") - .into_diagnostic() - .wrap_err("writing password file")?; - - duct::cmd!( - find_postgres_bin("initdb")?, - "--auth", - "scram-sha-256", - "--username", - "postgres", - "--pwfile", - pwfile, - "-D", - data_dir, - ) - .stdout_null() - .run() - .into_diagnostic() - .wrap_err("running initdb")?; - - Ok(temp_dir) -} - -/// Execute the `pg_ctl start`. -/// -/// The Postgres server and resources get cleaned when the returned handle drops. -pub fn run_db(temp_dir: TempDir) -> Result { - let data_dir = temp_dir.path().join("data"); - - duct::cmd!( - find_postgres_bin("pg_ctl")?, - "start", - "-D", - data_dir, - "--wait", - "--silent", - "--log", - "log.txt", - "--options", - // https://www.postgresql.org/docs/current/non-durability.html - // https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server - if cfg!(unix) { - // Setting "unix_socket_directories" is necessary as creating socket files fail in permission error for some systems. - // Instead, this forces the use of TCP/IP over domain sockets. 
- "-c autovacuum=off -c full_page_writes=off -c fsync=off -c unix_socket_directories='' -c synchronous_commit=off" - } else { - "-c autovacuum=off -c full_page_writes=off -c fsync=off -c synchronous_commit=off" - }, - ) - .run() - .into_diagnostic() - .wrap_err("running pg_ctl")?; - - struct Handle(Option); - - impl Drop for Handle { - fn drop(&mut self) { - let Some(temp_dir) = self.0.take() else { - return; - }; - if let Err(err) = stop_db(temp_dir) { - eprintln!("{}", err); - } - } - } - - load_database().wrap_err("loading fixture database")?; - - Ok(Handle(Some(temp_dir))) -} - -fn load_database() -> Result<()> { - duct::cmd!( - find_postgres_bin("psql")?, - "--host", - "localhost", - "--username", - "postgres", - "--file", - "tests/fixture.sql", - ) - .env("PGPASSWORD", "password") - .stdout_null() - .run() - .into_diagnostic() - .wrap_err("running psql")?; - - Ok(()) -} - -fn stop_db(temp_dir: TempDir) -> Result<()> { - let data_dir = temp_dir.path().join("data"); - - duct::cmd!( - find_postgres_bin("pg_ctl")?, - "stop", - "-D", - data_dir, - "--wait", - "--silent" - ) - .run() - .into_diagnostic() - .wrap_err("running pg_ctl")?; - - // if we just used the default drop impl, errors would not be surfaced - temp_dir - .close() - .into_diagnostic() - .wrap_err("cleaning up the temp dir")?; - - Ok(()) -} From ec8ba767bdec16060339cad16074c1f50a733490 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sun, 31 May 2026 17:57:27 +1200 Subject: [PATCH 06/12] feat(alertd/checks): tier check thresholds (warn + fail) Re-tune the migrated checks to the real sync cadence and add WARN tiers: - sync_lookup: query staleness unfiltered, tier in Rust (WARN >2m, FAIL >5m); absent row passes as not-tracked. - sync_facility_stale: single query of minutes-since-last-success per facility active in the last 48h (WARN >10m, FAIL >30m). - sync_restart_loop: SQL HAVING >=5, tier per facility (WARN >=5, FAIL >=10 restarts/hr). 
- sync_session_errors: tier on combined row count (WARN >=1, FAIL >=10). - sync_sessions: tighten stuck thresholds to WARN >15m, FAIL >45m. - fhir_service_requests_unresolved: tier per request (WARN >1h, FAIL >6h). - error-stream checks (fhir_job_errors, certificate_notification_errors, ips_errors, patient_communication_errors, report_errors): WARN >=1, FAIL >=10 via a generalised tiered_rows_check helper replacing fail_if_any_rows. - kopia_backup: add WARN >12h. uptime: WARN <10m. db_connect: WARN latency >1s. tamanu_http: WARN latency >2s. Co-authored-by: Claude --- .../checks/certificate_notification_errors.rs | 6 +- crates/alertd/src/doctor/checks/db_connect.rs | 17 ++- .../src/doctor/checks/fhir_job_errors.rs | 6 +- .../fhir_service_requests_unresolved.rs | 73 ++++++++--- crates/alertd/src/doctor/checks/ips_errors.rs | 6 +- .../alertd/src/doctor/checks/kopia_backup.rs | 19 +++ .../checks/patient_communication_errors.rs | 6 +- .../alertd/src/doctor/checks/report_errors.rs | 6 +- .../src/doctor/checks/sync_facility_stale.rs | 114 ++++++++++-------- .../alertd/src/doctor/checks/sync_lookup.rs | 85 ++++++++++--- .../src/doctor/checks/sync_restart_loop.rs | 65 +++++++--- .../src/doctor/checks/sync_session_errors.rs | 21 +++- .../alertd/src/doctor/checks/sync_sessions.rs | 10 +- .../alertd/src/doctor/checks/tamanu_http.rs | 16 ++- crates/alertd/src/doctor/checks/uptime.rs | 15 ++- crates/alertd/src/doctor/checks/util.rs | 76 +++++++++--- 16 files changed, 406 insertions(+), 135 deletions(-) diff --git a/crates/alertd/src/doctor/checks/certificate_notification_errors.rs b/crates/alertd/src/doctor/checks/certificate_notification_errors.rs index 450c2843..1d38d31e 100644 --- a/crates/alertd/src/doctor/checks/certificate_notification_errors.rs +++ b/crates/alertd/src/doctor/checks/certificate_notification_errors.rs @@ -2,7 +2,7 @@ use jiff::{Timestamp, ToSpan}; -use super::{CheckContext, util::fail_if_any_rows}; +use super::{CheckContext, util::tiered_rows_check}; use 
crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; @@ -26,13 +26,15 @@ pub async fn run(ctx: CheckContext) -> Check { }; let since = Timestamp::now() - LOOKBACK_HOURS.hours(); - fail_if_any_rows( + tiered_rows_check( client, "certificate_notification_errors", "no recent certificate notification errors", "certificate notification errors: ", SQL, &[&since], + 1, + 10, ) .await } diff --git a/crates/alertd/src/doctor/checks/db_connect.rs b/crates/alertd/src/doctor/checks/db_connect.rs index a2b1cc56..d3d42447 100644 --- a/crates/alertd/src/doctor/checks/db_connect.rs +++ b/crates/alertd/src/doctor/checks/db_connect.rs @@ -3,6 +3,9 @@ use std::time::Instant; use super::CheckContext; use crate::doctor::check::Check; +/// Connect latency above which the DB is treated as degraded. +const WARN_LATENCY_MS: u64 = 1000; + pub async fn run(ctx: CheckContext) -> Check { let host = ctx .config @@ -21,10 +24,16 @@ pub async fn run(ctx: CheckContext) -> Check { tokio::spawn(async move { let _ = conn.await; }); - Check::pass( - "db_connect", - format!("postgres at {host}/{name} ({latency_ms}ms)"), - ) + let summary = format!("postgres at {host}/{name} ({latency_ms}ms)"); + if latency_ms > WARN_LATENCY_MS { + Check::warning( + "db_connect", + summary, + format!("connect latency {latency_ms}ms over {WARN_LATENCY_MS}ms"), + ) + } else { + Check::pass("db_connect", summary) + } } Err(err) => Check::fail( "db_connect", diff --git a/crates/alertd/src/doctor/checks/fhir_job_errors.rs b/crates/alertd/src/doctor/checks/fhir_job_errors.rs index 806e6706..5208c8fb 100644 --- a/crates/alertd/src/doctor/checks/fhir_job_errors.rs +++ b/crates/alertd/src/doctor/checks/fhir_job_errors.rs @@ -5,7 +5,7 @@ use jiff::{Timestamp, ToSpan}; -use super::{CheckContext, util::fail_if_any_rows}; +use super::{CheckContext, util::tiered_rows_check}; use crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; @@ -29,13 +29,15 @@ pub async fn run(ctx: CheckContext) -> Check { }; let since = 
Timestamp::now() - LOOKBACK_HOURS.hours(); - fail_if_any_rows( + tiered_rows_check( client, "fhir_job_errors", "no recent FHIR job errors", "FHIR job errors: ", SQL, &[&since], + 1, + 10, ) .await } diff --git a/crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs b/crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs index 4afe2bf3..f18cdf7a 100644 --- a/crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs +++ b/crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs @@ -1,17 +1,24 @@ //! FHIR service requests that have stayed unresolved for too long. //! -//! Fails when a FHIR service request linked to a lab request has been -//! unresolved for over an hour. +//! Lists FHIR service requests linked to a lab request that have been +//! unresolved for over an hour, tiering on the longest outstanding duration: +//! WARN past 1h, FAIL past 6h. -use super::{CheckContext, util::fail_if_any_rows}; +use super::{CheckContext, fmt_db_error}; use crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; +use serde_json::{Value, json}; const NAME: &str = "fhir_service_requests_unresolved"; + +const WARN_MINUTES: f64 = 60.0; +const FAIL_MINUTES: f64 = 6.0 * 60.0; + const SQL: &str = "SELECT lr.display_id AS lab_request_id, \ - ROUND(EXTRACT(EPOCH FROM (NOW() - fsr.last_updated)) / 60)::text AS duration_minutes \ + EXTRACT(EPOCH FROM (NOW() - fsr.last_updated)) / 60 AS duration_minutes \ FROM fhir.service_requests fsr JOIN lab_requests lr ON fsr.upstream_id = lr.id \ - WHERE fsr.resolved = FALSE AND NOW() - fsr.last_updated > INTERVAL '1 hours'"; + WHERE fsr.resolved = FALSE AND NOW() - fsr.last_updated > INTERVAL '1 hours' \ + ORDER BY duration_minutes DESC"; pub async fn run(ctx: CheckContext) -> Check { if ctx.kind != ApiServerKind::Central { @@ -25,19 +32,53 @@ pub async fn run(ctx: CheckContext) -> Check { return Check::skip(NAME, "no DB connection", "db unavailable"); }; - fail_if_any_rows( - client, - NAME, - 
"no unresolved FHIR service requests", - "unresolved FHIR service requests: ", - SQL, - &[], - ) - .await + let rows = match client.query(SQL, &[]).await { + Ok(r) => r, + Err(err) => return Check::fail(NAME, "query failed", fmt_db_error(&err)), + }; + + if rows.is_empty() { + return Check::pass(NAME, "no unresolved FHIR service requests"); + } + + let mut warn = Vec::new(); + let mut fail = Vec::new(); + for row in &rows { + let lab_request_id: Option = row.try_get("lab_request_id").ok(); + let minutes: f64 = row.try_get("duration_minutes").unwrap_or(0.0); + let entry = json!({ + "lab_request_id": lab_request_id, + "duration_minutes": minutes.round() as i64, + }); + if minutes > FAIL_MINUTES { + fail.push(entry); + } else if minutes > WARN_MINUTES { + warn.push(entry); + } + } + + if warn.is_empty() && fail.is_empty() { + return Check::pass(NAME, "no unresolved FHIR service requests"); + } + + let summary = format!( + "unresolved FHIR service requests: {} over 6h, {} over 1h", + fail.len(), + warn.len() + ); + let check = if fail.is_empty() { + Check::warning(NAME, summary, "unresolved FHIR service request(s)") + } else { + Check::fail(NAME, summary, "unresolved FHIR service request(s)") + }; + check + .with_detail("fail", Value::Array(fail)) + .with_detail("warn", Value::Array(warn)) } #[cfg(test)] mod tests { + use crate::doctor::check::CheckStatus; use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; #[tokio::test] @@ -47,6 +88,10 @@ mod tests { }; let check = super::run(ctx).await; assert_eq!(check.name, "fhir_service_requests_unresolved"); + assert!(matches!( + check.status, + CheckStatus::Pass | CheckStatus::Warning(_) | CheckStatus::Fail(_) + )); } #[tokio::test] diff --git a/crates/alertd/src/doctor/checks/ips_errors.rs b/crates/alertd/src/doctor/checks/ips_errors.rs index 1812e8b8..2b45f187 100644 --- a/crates/alertd/src/doctor/checks/ips_errors.rs +++ b/crates/alertd/src/doctor/checks/ips_errors.rs @@ -2,7 +2,7 @@ use jiff::{Timestamp, 
ToSpan}; -use super::{CheckContext, util::fail_if_any_rows}; +use super::{CheckContext, util::tiered_rows_check}; use crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; @@ -25,13 +25,15 @@ pub async fn run(ctx: CheckContext) -> Check { }; let since = Timestamp::now() - LOOKBACK_HOURS.hours(); - fail_if_any_rows( + tiered_rows_check( client, "ips_errors", "no recent IPS request errors", "IPS request errors: ", SQL, &[&since], + 1, + 10, ) .await } diff --git a/crates/alertd/src/doctor/checks/kopia_backup.rs b/crates/alertd/src/doctor/checks/kopia_backup.rs index dee4a8c0..c2f60ad5 100644 --- a/crates/alertd/src/doctor/checks/kopia_backup.rs +++ b/crates/alertd/src/doctor/checks/kopia_backup.rs @@ -29,6 +29,7 @@ use super::CheckContext; use crate::doctor::check::Check; const CHECK_NAME: &str = "kopia_backup"; +const WARN_AGE_SECS: i64 = 12 * 60 * 60; const FAIL_AGE_SECS: i64 = 24 * 60 * 60; pub async fn run(_ctx: CheckContext) -> Check { @@ -206,6 +207,12 @@ fn evaluate(snapshots: &[Snapshot], now: Timestamp) -> Check { summary.clone(), format!("no backup in {}", humanise_age(FAIL_AGE_SECS)), ) + } else if age_secs >= WARN_AGE_SECS { + Check::warning( + CHECK_NAME, + summary.clone(), + format!("no backup in {}", humanise_age(WARN_AGE_SECS)), + ) } else { Check::pass(CHECK_NAME, summary) }; @@ -300,6 +307,18 @@ mod tests { assert!(matches!(check.status, CheckStatus::Pass), "{check:?}"); } + #[test] + fn warn_when_postgres_snapshot_between_12h_and_24h() { + let now = Timestamp::from_second(20_000_000).unwrap(); + let snapshots = vec![snapshot( + "/var/lib/postgresql/16/main", + Some(now - 18.hours()), + None, + )]; + let check = evaluate(&snapshots, now); + assert!(matches!(check.status, CheckStatus::Warning(_)), "{check:?}"); + } + #[test] fn fail_when_postgres_snapshot_older_than_24h() { let now = Timestamp::from_second(20_000_000).unwrap(); diff --git a/crates/alertd/src/doctor/checks/patient_communication_errors.rs 
b/crates/alertd/src/doctor/checks/patient_communication_errors.rs index 74d5c537..0caf97c0 100644 --- a/crates/alertd/src/doctor/checks/patient_communication_errors.rs +++ b/crates/alertd/src/doctor/checks/patient_communication_errors.rs @@ -2,7 +2,7 @@ use jiff::{Timestamp, ToSpan}; -use super::{CheckContext, util::fail_if_any_rows}; +use super::{CheckContext, util::tiered_rows_check}; use crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; @@ -26,13 +26,15 @@ pub async fn run(ctx: CheckContext) -> Check { }; let since = Timestamp::now() - LOOKBACK_HOURS.hours(); - fail_if_any_rows( + tiered_rows_check( client, "patient_communication_errors", "no recent patient communication errors", "patient communication errors: ", SQL, &[&since], + 1, + 10, ) .await } diff --git a/crates/alertd/src/doctor/checks/report_errors.rs b/crates/alertd/src/doctor/checks/report_errors.rs index 4dd0b2ce..dc63d35c 100644 --- a/crates/alertd/src/doctor/checks/report_errors.rs +++ b/crates/alertd/src/doctor/checks/report_errors.rs @@ -2,7 +2,7 @@ use jiff::{Timestamp, ToSpan}; -use super::{CheckContext, util::fail_if_any_rows}; +use super::{CheckContext, util::tiered_rows_check}; use crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; @@ -26,13 +26,15 @@ pub async fn run(ctx: CheckContext) -> Check { }; let since = Timestamp::now() - LOOKBACK_HOURS.hours(); - fail_if_any_rows( + tiered_rows_check( client, "report_errors", "no recent report errors", "report errors: ", SQL, &[&since], + 1, + 10, ) .await } diff --git a/crates/alertd/src/doctor/checks/sync_facility_stale.rs b/crates/alertd/src/doctor/checks/sync_facility_stale.rs index e25c21bb..9daa36a1 100644 --- a/crates/alertd/src/doctor/checks/sync_facility_stale.rs +++ b/crates/alertd/src/doctor/checks/sync_facility_stale.rs @@ -1,36 +1,38 @@ //! Facilities whose sync has gone stale. //! -//! Flags facilities that synced in the last 48h but have had no completion in -//! 
the last 30m, as well as facilities whose last successful sync was over an -//! hour ago. +//! Sync runs about every 60s, so for each facility that has synced in the last +//! 48h we compute the minutes since its last successful (errorless, completed) +//! sync and tier: WARN past 10 minutes, FAIL past 30. The 48h-active guard +//! keeps decommissioned facilities from flagging. -use serde_json::Value; +use serde_json::{Value, json}; -use super::{CheckContext, util::fetch_rows}; +use super::{CheckContext, fmt_db_error}; use crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; const NAME: &str = "sync_facility_stale"; -const NOT_SYNCING_SQL: &str = "with sync_sessions_with_facility_id as ( \ - select created_at, completed_at, \ - jsonb_array_elements_text(parameters->'facilityIds') as facility_id \ - from sync_sessions where parameters->>'isMobile' <> 'true' \ - ) \ - select distinct facility_id from sync_sessions_with_facility_id \ - where created_at > current_timestamp - '48 hours'::interval \ - except \ - select facility_id from sync_sessions_with_facility_id \ - where completed_at > current_timestamp - '30 minutes'::interval \ - group by facility_id order by facility_id"; +const WARN_MINUTES: f64 = 10.0; +const FAIL_MINUTES: f64 = 30.0; -const NO_RECENT_SUCCESS_SQL: &str = "SELECT facility_id, last_successful_sync FROM ( \ - SELECT facility_id, max(completed_at) as last_successful_sync FROM ( \ - SELECT jsonb_array_elements_text(parameters->'facilityIds') as facility_id, completed_at \ - FROM sync_sessions WHERE errors IS NULL \ - ) AS successful_syncs GROUP BY facility_id \ - ) AS last_successful_facility_syncs \ - WHERE last_successful_sync < now() - interval '1 hour'"; +const SQL: &str = "WITH facility_sessions AS ( \ + SELECT jsonb_array_elements_text(parameters->'facilityIds') AS facility_id, \ + created_at, completed_at, errors \ + FROM sync_sessions WHERE parameters->>'isMobile' <> 'true' \ + ), active AS ( \ + SELECT DISTINCT facility_id FROM 
facility_sessions \ + WHERE created_at > now() - interval '48 hours' \ + ), last_success AS ( \ + SELECT facility_id, max(completed_at) AS last_successful_sync \ + FROM facility_sessions WHERE errors IS NULL AND completed_at IS NOT NULL \ + GROUP BY facility_id \ + ) \ + SELECT a.facility_id, \ + ls.last_successful_sync::text AS last_successful_sync, \ + EXTRACT(EPOCH FROM (now() - ls.last_successful_sync)) / 60 AS minutes_since_success \ + FROM active a LEFT JOIN last_success ls USING (facility_id) \ + ORDER BY minutes_since_success DESC NULLS FIRST"; pub async fn run(ctx: CheckContext) -> Check { if ctx.kind != ApiServerKind::Central { @@ -44,41 +46,55 @@ pub async fn run(ctx: CheckContext) -> Check { return Check::skip(NAME, "no DB connection", "db unavailable"); }; - let not_syncing = match fetch_rows(client, NOT_SYNCING_SQL, &[]).await { - Ok(set) => set, - Err(err) => return Check::fail(NAME, "query failed", super::fmt_db_error(&err)), - }; - let no_recent_success = match fetch_rows(client, NO_RECENT_SUCCESS_SQL, &[]).await { - Ok(set) => set, - Err(err) => return Check::fail(NAME, "query failed", super::fmt_db_error(&err)), + let rows = match client.query(SQL, &[]).await { + Ok(r) => r, + Err(err) => return Check::fail(NAME, "query failed", fmt_db_error(&err)), }; - if not_syncing.is_empty() && no_recent_success.is_empty() { - return Check::pass(NAME, "all facilities syncing"); + let mut warn = Vec::new(); + let mut fail = Vec::new(); + for row in &rows { + let facility_id: String = row.try_get("facility_id").unwrap_or_default(); + let last: Option = row.try_get("last_successful_sync").ok(); + // A facility that is active but has never had a successful sync (NULL + // minutes) is as bad as a very stale one: treat it as a failure. 
+ let minutes: Option = row.try_get("minutes_since_success").ok(); + let entry = json!({ + "facility_id": facility_id, + "last_successful_sync": last, + "minutes_since_success": minutes, + }); + match minutes { + Some(m) if m <= WARN_MINUTES => {} + Some(m) if m <= FAIL_MINUTES => warn.push(entry), + _ => fail.push(entry), + } } - let (not_syncing_count, not_syncing_truncated) = (not_syncing.count(), not_syncing.truncated); - let (no_recent_count, no_recent_truncated) = - (no_recent_success.count(), no_recent_success.truncated); + if warn.is_empty() && fail.is_empty() { + return Check::pass(NAME, "all facilities syncing"); + } - let check = Check::fail( - NAME, - format!( - "stale sync: {not_syncing_count} not syncing, {no_recent_count} with no recent success" - ), - "facility sync stale", + let summary = format!( + "stale sync: {} over {}m, {} over {}m", + fail.len(), + FAIL_MINUTES as i64, + warn.len(), + WARN_MINUTES as i64 ); + let check = if fail.is_empty() { + Check::warning(NAME, summary, "facility sync stale") + } else { + Check::fail(NAME, summary, "facility sync stale") + }; check - .with_detail("not_syncing", Value::Array(not_syncing.rows)) - .with_detail("not_syncing_count", not_syncing_count) - .with_detail("not_syncing_truncated", not_syncing_truncated) - .with_detail("no_recent_success", Value::Array(no_recent_success.rows)) - .with_detail("no_recent_success_count", no_recent_count) - .with_detail("no_recent_success_truncated", no_recent_truncated) + .with_detail("fail", Value::Array(fail)) + .with_detail("warn", Value::Array(warn)) } #[cfg(test)] mod tests { + use crate::doctor::check::CheckStatus; use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; #[tokio::test] @@ -88,6 +104,10 @@ mod tests { }; let check = super::run(ctx).await; assert_eq!(check.name, "sync_facility_stale"); + assert!(matches!( + check.status, + CheckStatus::Pass | CheckStatus::Warning(_) | CheckStatus::Fail(_) + )); } #[tokio::test] diff --git 
a/crates/alertd/src/doctor/checks/sync_lookup.rs b/crates/alertd/src/doctor/checks/sync_lookup.rs index 0a7a72ad..4c229c73 100644 --- a/crates/alertd/src/doctor/checks/sync_lookup.rs +++ b/crates/alertd/src/doctor/checks/sync_lookup.rs @@ -1,16 +1,20 @@ //! Lookup table update staleness. //! -//! Fails when the central server hasn't recorded a successful lookup-table -//! update in over an hour. +//! The lookup table refreshes roughly every 20s, so tier on minutes of +//! staleness: WARN past 2 minutes, FAIL past 5. If the tracking row is absent, +//! treat the lookup as not tracked and pass. -use super::{CheckContext, util::fail_if_any_rows}; -use crate::doctor::check::Check; +use super::{CheckContext, fmt_db_error}; +use crate::doctor::check::{Check, CheckStatus}; use bestool_tamanu::ApiServerKind; const NAME: &str = "sync_lookup"; -const SQL: &str = "SELECT key, value AS last_sync_tick, updated_at::text AS last_updated, \ - (now() - updated_at)::text AS time_since_update FROM local_system_facts \ - WHERE key = 'lastSuccessfulLookupTableUpdate' AND updated_at < now() - interval '1 hour'"; +const SQL: &str = "SELECT value AS last_sync_tick, updated_at::text AS last_updated, \ + EXTRACT(EPOCH FROM (now() - updated_at))::bigint AS age_seconds \ + FROM local_system_facts WHERE key = 'lastSuccessfulLookupTableUpdate'"; + +const WARN_SECS: i64 = 2 * 60; +const FAIL_SECS: i64 = 5 * 60; pub async fn run(ctx: CheckContext) -> Check { if ctx.kind != ApiServerKind::Central { @@ -24,21 +28,66 @@ pub async fn run(ctx: CheckContext) -> Check { return Check::skip(NAME, "no DB connection", "db unavailable"); }; - fail_if_any_rows( - client, - NAME, - "lookup table up to date", - "lookup table stale: ", - SQL, - &[], - ) - .await + let row = match client.query_opt(SQL, &[]).await { + Ok(Some(r)) => r, + Ok(None) => return Check::pass(NAME, "lookup table not tracked"), + Err(err) => return Check::fail(NAME, "query failed", fmt_db_error(&err)), + }; + + let last_sync_tick: Option = 
row.try_get("last_sync_tick").ok(); + let last_updated: Option = row.try_get("last_updated").ok(); + let age_seconds: i64 = row.try_get("age_seconds").unwrap_or(0); + + let summary = format!("lookup table updated {}m ago", age_seconds / 60); + let check = match tier(age_seconds) { + CheckStatus::Fail(_) => Check::fail( + NAME, + summary, + format!("lookup table stale: {age_seconds}s since last update"), + ), + CheckStatus::Warning(_) => Check::warning( + NAME, + summary, + format!("lookup table stale: {age_seconds}s since last update"), + ), + _ => Check::pass(NAME, "lookup table up to date"), + }; + + let mut check = check.with_detail("age_seconds", age_seconds); + if let Some(tick) = last_sync_tick { + check = check.with_detail("last_sync_tick", tick); + } + if let Some(updated) = last_updated { + check = check.with_detail("last_updated", updated); + } + check +} + +/// Tier on seconds since the lookup table last updated. +fn tier(seconds: i64) -> CheckStatus { + if seconds > FAIL_SECS { + CheckStatus::Fail(String::new()) + } else if seconds > WARN_SECS { + CheckStatus::Warning(String::new()) + } else { + CheckStatus::Pass + } } #[cfg(test)] mod tests { + use super::*; use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + #[test] + fn tier_boundaries() { + assert!(matches!(tier(0), CheckStatus::Pass)); + assert!(matches!(tier(120), CheckStatus::Pass)); + assert!(matches!(tier(121), CheckStatus::Warning(_))); + assert!(matches!(tier(300), CheckStatus::Warning(_))); + assert!(matches!(tier(301), CheckStatus::Fail(_))); + } + #[tokio::test] async fn runs_against_central() { let Some(ctx) = central_ctx().await else { @@ -46,6 +95,10 @@ mod tests { }; let check = super::run(ctx).await; assert_eq!(check.name, "sync_lookup"); + assert!(matches!( + check.status, + CheckStatus::Pass | CheckStatus::Warning(_) | CheckStatus::Fail(_) + )); } #[tokio::test] diff --git a/crates/alertd/src/doctor/checks/sync_restart_loop.rs 
b/crates/alertd/src/doctor/checks/sync_restart_loop.rs index 09d36bc2..cf5fb7a0 100644 --- a/crates/alertd/src/doctor/checks/sync_restart_loop.rs +++ b/crates/alertd/src/doctor/checks/sync_restart_loop.rs @@ -1,19 +1,25 @@ //! Facilities stuck in a sync restart loop. //! -//! Fails when a facility has accumulated 10 or more `snapshot-for-pushing` sync -//! errors in the last hour, which indicates the sync is repeatedly restarting -//! rather than progressing. +//! Counts `snapshot-for-pushing` sync errors per facility in the last hour, +//! which indicates sync repeatedly restarting rather than progressing. WARN at +//! 5 restarts/hr, FAIL at 10. -use super::{CheckContext, util::fail_if_any_rows}; +use serde_json::{Value, json}; + +use super::{CheckContext, fmt_db_error}; use crate::doctor::check::Check; use bestool_tamanu::ApiServerKind; const NAME: &str = "sync_restart_loop"; + +const WARN_RESTARTS: i64 = 5; +const FAIL_RESTARTS: i64 = 10; + const SQL: &str = "SELECT jsonb_array_elements_text(parameters->'facilityIds') AS facility_id, \ COUNT(*) AS error_count FROM sync_sessions \ WHERE created_at > now() - interval '1 hour' AND errors IS NOT NULL \ AND cardinality(errors) = 1 AND errors[1] LIKE '%snapshot-for-pushing%' \ - GROUP BY facility_id HAVING COUNT(*) >= 10 ORDER BY error_count DESC"; + GROUP BY facility_id HAVING COUNT(*) >= 5 ORDER BY error_count DESC"; pub async fn run(ctx: CheckContext) -> Check { if ctx.kind != ApiServerKind::Central { @@ -27,19 +33,46 @@ pub async fn run(ctx: CheckContext) -> Check { return Check::skip(NAME, "no DB connection", "db unavailable"); }; - fail_if_any_rows( - client, - NAME, - "no sync restart loops", - "facilities in sync restart loop: ", - SQL, - &[], - ) - .await + let rows = match client.query(SQL, &[]).await { + Ok(r) => r, + Err(err) => return Check::fail(NAME, "query failed", fmt_db_error(&err)), + }; + + let mut warn = Vec::new(); + let mut fail = Vec::new(); + for row in &rows { + let facility_id: String = 
row.try_get("facility_id").unwrap_or_default(); + let error_count: i64 = row.try_get("error_count").unwrap_or(0); + let entry = json!({ "facility_id": facility_id, "error_count": error_count }); + if error_count >= FAIL_RESTARTS { + fail.push(entry); + } else if error_count >= WARN_RESTARTS { + warn.push(entry); + } + } + + if warn.is_empty() && fail.is_empty() { + return Check::pass(NAME, "no sync restart loops"); + } + + let summary = format!( + "sync restart loops: {} over {FAIL_RESTARTS}/hr, {} over {WARN_RESTARTS}/hr", + fail.len(), + warn.len() + ); + let check = if fail.is_empty() { + Check::warning(NAME, summary, "facilities in sync restart loop") + } else { + Check::fail(NAME, summary, "facilities in sync restart loop") + }; + check + .with_detail("fail", Value::Array(fail)) + .with_detail("warn", Value::Array(warn)) } #[cfg(test)] mod tests { + use crate::doctor::check::CheckStatus; use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; #[tokio::test] @@ -49,6 +82,10 @@ mod tests { }; let check = super::run(ctx).await; assert_eq!(check.name, "sync_restart_loop"); + assert!(matches!( + check.status, + CheckStatus::Pass | CheckStatus::Warning(_) | CheckStatus::Fail(_) + )); } #[tokio::test] diff --git a/crates/alertd/src/doctor/checks/sync_session_errors.rs b/crates/alertd/src/doctor/checks/sync_session_errors.rs index f29d893d..8b15f0db 100644 --- a/crates/alertd/src/doctor/checks/sync_session_errors.rs +++ b/crates/alertd/src/doctor/checks/sync_session_errors.rs @@ -12,6 +12,8 @@ use bestool_tamanu::ApiServerKind; const NAME: &str = "sync_session_errors"; +const FAIL_ERRORS: usize = 10; + const MOBILE_SQL: &str = "SELECT id, errors::text, \ jsonb_array_elements_text(parameters->'facilityIds') AS facility_id, \ created_at::text AS created, (completed_at - created_at)::text AS duration \ @@ -62,11 +64,20 @@ pub async fn run(ctx: CheckContext) -> Check { let (mobile_count, mobile_truncated) = (mobile.count(), mobile.truncated); let 
(server_count, server_truncated) = (server.count(), server.truncated); - let check = Check::fail( - NAME, - format!("sync session errors: {mobile_count} mobile, {server_count} server"), - "recent sync session error(s)", - ); + // Truncation means well over FAIL_ERRORS rows, so saturate the total there. + let total = if mobile.truncated || server.truncated { + FAIL_ERRORS + } else { + mobile.rows.len() + server.rows.len() + }; + + let summary = format!("sync session errors: {mobile_count} mobile, {server_count} server"); + let reason = "recent sync session error(s)"; + let check = if total >= FAIL_ERRORS { + Check::fail(NAME, summary, reason) + } else { + Check::warning(NAME, summary, reason) + }; check .with_detail("mobile", Value::Array(mobile.rows)) .with_detail("mobile_count", mobile_count) diff --git a/crates/alertd/src/doctor/checks/sync_sessions.rs b/crates/alertd/src/doctor/checks/sync_sessions.rs index 4c79b8af..34252dac 100644 --- a/crates/alertd/src/doctor/checks/sync_sessions.rs +++ b/crates/alertd/src/doctor/checks/sync_sessions.rs @@ -12,10 +12,10 @@ pub async fn run(ctx: CheckContext) -> Check { SELECT count(*) FILTER (WHERE completed_at IS NULL) AS active_count, count(*) FILTER ( - WHERE completed_at IS NULL AND start_time < now() - interval '1 hour' + WHERE completed_at IS NULL AND start_time < now() - interval '15 minutes' ) AS stuck_warn, count(*) FILTER ( - WHERE completed_at IS NULL AND start_time < now() - interval '6 hours' + WHERE completed_at IS NULL AND start_time < now() - interval '45 minutes' ) AS stuck_fail, min(start_time) FILTER (WHERE completed_at IS NULL) AS oldest_started_at FROM sync_sessions @@ -42,18 +42,18 @@ pub async fn run(ctx: CheckContext) -> Check { let stuck_fail: i64 = row.try_get("stuck_fail").unwrap_or(0); let oldest: Option = row.try_get("oldest_started_at").ok(); - let summary = format!("{active} active, {stuck_warn} stuck >1h"); + let summary = format!("{active} active, {stuck_warn} stuck >15m"); let check = if 
stuck_fail > 0 { Check::fail( "sync_sessions", summary.clone(), - format!("{stuck_fail} session(s) stuck >6h"), + format!("{stuck_fail} session(s) stuck >45m"), ) } else if stuck_warn > 0 { Check::warning( "sync_sessions", summary.clone(), - format!("{stuck_warn} session(s) stuck >1h"), + format!("{stuck_warn} session(s) stuck >15m"), ) } else { Check::pass("sync_sessions", summary) diff --git a/crates/alertd/src/doctor/checks/tamanu_http.rs b/crates/alertd/src/doctor/checks/tamanu_http.rs index 3a5133d8..45f19dcc 100644 --- a/crates/alertd/src/doctor/checks/tamanu_http.rs +++ b/crates/alertd/src/doctor/checks/tamanu_http.rs @@ -5,6 +5,8 @@ use crate::doctor::check::Check; const PING_URL: &str = "http://localhost/api/public/ping"; const TIMEOUT: Duration = Duration::from_secs(5); +/// Response latency above which a reachable endpoint is treated as degraded. +const WARN_LATENCY_MS: u64 = 2000; pub async fn run(ctx: CheckContext) -> Check { let start = Instant::now(); @@ -16,10 +18,16 @@ pub async fn run(ctx: CheckContext) -> Check { let status = resp.status(); let detail_status = status.as_u16(); if status.is_success() { - Check::pass( - "tamanu_http", - format!("HTTP {} from {PING_URL} ({latency_ms}ms)", status.as_u16()), - ) + let summary = format!("HTTP {} from {PING_URL} ({latency_ms}ms)", status.as_u16()); + if latency_ms > WARN_LATENCY_MS { + Check::warning( + "tamanu_http", + summary, + format!("response latency {latency_ms}ms over {WARN_LATENCY_MS}ms"), + ) + } else { + Check::pass("tamanu_http", summary) + } .with_detail("status_code", detail_status) } else { Check::fail( diff --git a/crates/alertd/src/doctor/checks/uptime.rs b/crates/alertd/src/doctor/checks/uptime.rs index 1ddd42fc..79243845 100644 --- a/crates/alertd/src/doctor/checks/uptime.rs +++ b/crates/alertd/src/doctor/checks/uptime.rs @@ -3,9 +3,22 @@ use sysinfo::System; use super::CheckContext; use crate::doctor::check::Check; +/// Below this uptime the host has rebooted recently, which may be 
unexpected. +const WARN_UPTIME_SECS: u64 = 10 * 60; + pub async fn run(_ctx: CheckContext) -> Check { let secs = System::uptime(); - Check::pass("uptime", humanise(secs)).with_detail("uptime_secs", secs) + let summary = humanise(secs); + let check = if secs < WARN_UPTIME_SECS { + Check::warning( + "uptime", + summary, + "host rebooted within the last 10 minutes", + ) + } else { + Check::pass("uptime", summary) + }; + check.with_detail("uptime_secs", secs) } fn humanise(secs: u64) -> String { diff --git a/crates/alertd/src/doctor/checks/util.rs b/crates/alertd/src/doctor/checks/util.rs index 3cfbcdc2..6ab1cefe 100644 --- a/crates/alertd/src/doctor/checks/util.rs +++ b/crates/alertd/src/doctor/checks/util.rs @@ -64,32 +64,78 @@ pub async fn fetch_rows( Ok(RowSet { rows, truncated }) } -/// Run a single wrapped query: fail (with capped rows + count) if it -/// returns any rows, else pass. +/// Run a single wrapped query and tier the outcome on the number of +/// matching rows: PASS below `warn_min`, WARN at or above it, FAIL at or above +/// `fail_min`. /// -/// `summary_pass` is the headline shown when nothing matched; -/// `summary_fail_prefix` is prepended to the count when rows are found. -pub async fn fail_if_any_rows( +/// `summary_pass` is the headline shown when nothing crosses `warn_min`; +/// `summary_prefix` is prepended to the count for the WARN/FAIL summary. +/// +/// Rows are capped at [`REPORT_CAP`] (reported as `"100+"`), which is enough to +/// distinguish the small WARN/FAIL boundaries the error-stream checks use. 
+#[expect( + clippy::too_many_arguments, + reason = "shared query helper; each parameter is a distinct knob the call sites set" +)] +pub async fn tiered_rows_check( client: &Arc, name: &'static str, summary_pass: &str, - summary_fail_prefix: &str, + summary_prefix: &str, sql: &str, params: &[&(dyn ToSql + Sync)], + warn_min: usize, + fail_min: usize, ) -> Check { match fetch_rows(client, sql, params).await { - Ok(set) if set.is_empty() => Check::pass(name, summary_pass.to_string()), Ok(set) => { + // `truncated` means there were more than REPORT_CAP rows, which is + // well past any realistic fail_min, so treat it as the cap. + let n = if set.truncated { + REPORT_CAP + 1 + } else { + set.rows.len() + }; let count = set.count(); - Check::fail( - name, - format!("{summary_fail_prefix}{count}"), - format!("{} matching row(s)", count), - ) - .with_detail("rows", Value::Array(set.rows)) - .with_detail("truncated", set.truncated) - .with_detail("count", count) + if n < warn_min { + return Check::pass(name, summary_pass.to_string()); + } + let summary = format!("{summary_prefix}{count}"); + let reason = format!("{count} matching row(s)"); + let check = if n >= fail_min { + Check::fail(name, summary, reason) + } else { + Check::warning(name, summary, reason) + }; + check + .with_detail("rows", Value::Array(set.rows)) + .with_detail("truncated", set.truncated) + .with_detail("count", count) } Err(err) => Check::fail(name, "query failed", fmt_db_error(&err)), } } + +#[cfg(test)] +mod tests { + /// Pure count→tier decision mirroring [`tiered_rows_check`], factored so the + /// WARN/FAIL boundaries can be asserted without a database. 
+ fn tier(n: usize, warn_min: usize, fail_min: usize) -> &'static str { + if n >= fail_min { + "fail" + } else if n >= warn_min { + "warning" + } else { + "pass" + } + } + + #[test] + fn error_stream_boundaries() { + assert_eq!(tier(0, 1, 10), "pass"); + assert_eq!(tier(1, 1, 10), "warning"); + assert_eq!(tier(9, 1, 10), "warning"); + assert_eq!(tier(10, 1, 10), "fail"); + assert_eq!(tier(100, 1, 10), "fail"); + } +} From e0a0537e444038a6f72ea37b8043e1318ff6e4fd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sat, 30 May 2026 17:57:47 +1200 Subject: [PATCH 07/12] feat(alertd/checks): port fhir unresolvable-service-requests alert to a check Add fhir_service_requests_unresolved, central-only, skipping when the DB is unavailable and failing when a lab-linked FHIR service request has stayed unresolved for over an hour. Co-authored-by: Claude --- crates/alertd/src/doctor/checks.rs | 5 ++ .../fhir_service_requests_unresolved.rs | 57 +++++++++++++++++++ 2 files changed, 62 insertions(+) create mode 100644 crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs diff --git a/crates/alertd/src/doctor/checks.rs b/crates/alertd/src/doctor/checks.rs index ad615d66..60d9b0a1 100644 --- a/crates/alertd/src/doctor/checks.rs +++ b/crates/alertd/src/doctor/checks.rs @@ -23,6 +23,7 @@ pub mod disk_free; pub mod external_users; pub mod fhir_job_errors; pub mod fhir_jobs; +pub mod fhir_service_requests_unresolved; pub mod http_errors; pub mod ips_errors; pub mod kopia_backup; @@ -166,6 +167,10 @@ pub fn all() -> Vec { entry!("sync_facility_stale", sync_facility_stale), entry!("sync_lookup", sync_lookup), entry!("sync_restart_loop", sync_restart_loop), + entry!( + "fhir_service_requests_unresolved", + fhir_service_requests_unresolved + ), ] } diff --git a/crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs b/crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs new file mode 100644 index 00000000..4afe2bf3 --- /dev/null 
+++ b/crates/alertd/src/doctor/checks/fhir_service_requests_unresolved.rs @@ -0,0 +1,57 @@ +//! FHIR service requests that have stayed unresolved for too long. +//! +//! Fails when a FHIR service request linked to a lab request has been +//! unresolved for over an hour. + +use super::{CheckContext, util::fail_if_any_rows}; +use crate::doctor::check::Check; +use bestool_tamanu::ApiServerKind; + +const NAME: &str = "fhir_service_requests_unresolved"; +const SQL: &str = "SELECT lr.display_id AS lab_request_id, \ + ROUND(EXTRACT(EPOCH FROM (NOW() - fsr.last_updated)) / 60)::text AS duration_minutes \ + FROM fhir.service_requests fsr JOIN lab_requests lr ON fsr.upstream_id = lr.id \ + WHERE fsr.resolved = FALSE AND NOW() - fsr.last_updated > INTERVAL '1 hours'"; + +pub async fn run(ctx: CheckContext) -> Check { + if ctx.kind != ApiServerKind::Central { + return Check::skip( + NAME, + "not applicable on facility server", + "central-only check", + ); + } + let Some(client) = ctx.db.as_ref() else { + return Check::skip(NAME, "no DB connection", "db unavailable"); + }; + + fail_if_any_rows( + client, + NAME, + "no unresolved FHIR service requests", + "unresolved FHIR service requests: ", + SQL, + &[], + ) + .await +} + +#[cfg(test)] +mod tests { + use crate::doctor::checks::test_support::{central_ctx, facility_ctx}; + + #[tokio::test] + async fn runs_against_central() { + let Some(ctx) = central_ctx().await else { + return; + }; + let check = super::run(ctx).await; + assert_eq!(check.name, "fhir_service_requests_unresolved"); + } + + #[tokio::test] + async fn skips_on_facility() { + let check = super::run(facility_ctx()).await; + assert!(check.status.is_skip()); + } +} From f7d0f1a377fb99ffbad54a95fbedb95d3679d28f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sat, 30 May 2026 17:44:19 +1200 Subject: [PATCH 08/12] refactor(alertd): move the doctor subsystem into the alertd crate Relocate the doctor framework, checks, sweep orchestration, and the 
DoctorTask from bestool-tamanu and bestool into bestool-alertd. alertd now owns the monitoring engine and depends on bestool-tamanu for common Tamanu domain code; bestool's doctor CLI keeps only arg parsing, rendering, and daemon-fetch, calling into bestool_alertd::doctor. This is a behaviour-preserving relocation. The one functional fix: perform_sweep and DoctorTask take the running binary's version as a parameter, so the wire payload reports bestool's version rather than alertd's (env!("CARGO_PKG_VERSION") would otherwise resolve to the library crate). Co-authored-by: Claude --- Cargo.lock | 10 +- crates/alertd/Cargo.toml | 6 + crates/alertd/src/doctor.rs | 9 + crates/{tamanu => alertd}/src/doctor/check.rs | 0 .../{tamanu => alertd}/src/doctor/checks.rs | 2 +- .../src/doctor/checks/caddy_version.rs | 4 +- .../src/doctor/checks/db_connect.rs | 0 .../src/doctor/checks/db_version.rs | 0 .../src/doctor/checks/disk_free.rs | 0 .../src/doctor/checks/external_users.rs | 0 .../src/doctor/checks/fhir_jobs.rs | 0 .../src/doctor/checks/http_errors.rs | 0 .../src/doctor/checks/kopia_backup.rs | 0 .../src/doctor/checks/load.rs | 0 .../src/doctor/checks/memory.rs | 0 .../src/doctor/checks/migrations.rs | 0 .../src/doctor/checks/server_id.rs | 4 +- .../src/doctor/checks/sync_sessions.rs | 0 .../src/doctor/checks/tailscale.rs | 4 +- .../src/doctor/checks/tamanu_found.rs | 4 +- .../src/doctor/checks/tamanu_http.rs | 0 .../src/doctor/checks/tamanu_service.rs | 13 +- .../src/doctor/checks/time_sync.rs | 0 .../src/doctor/checks/uptime.rs | 0 .../src/doctor/checks/version_drift.rs | 9 +- .../{tamanu => alertd}/src/doctor/progress.rs | 0 .../src/doctor/server_info.rs | 22 +- crates/alertd/src/doctor/sweep.rs | 358 ++++++++++++++++++ .../src/doctor/task.rs} | 23 +- crates/alertd/src/lib.rs | 1 + crates/bestool/Cargo.toml | 4 +- crates/bestool/src/actions/tamanu/alertd.rs | 7 +- crates/bestool/src/actions/tamanu/doctor.rs | 344 +---------------- crates/tamanu/Cargo.toml | 15 +- 
crates/tamanu/src/doctor.rs | 4 - crates/tamanu/src/lib.rs | 3 - 36 files changed, 441 insertions(+), 405 deletions(-) create mode 100644 crates/alertd/src/doctor.rs rename crates/{tamanu => alertd}/src/doctor/check.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks.rs (98%) rename crates/{tamanu => alertd}/src/doctor/checks/caddy_version.rs (99%) rename crates/{tamanu => alertd}/src/doctor/checks/db_connect.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/db_version.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/disk_free.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/external_users.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/fhir_jobs.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/http_errors.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/kopia_backup.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/load.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/memory.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/migrations.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/server_id.rs (85%) rename crates/{tamanu => alertd}/src/doctor/checks/sync_sessions.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/tailscale.rs (87%) rename crates/{tamanu => alertd}/src/doctor/checks/tamanu_found.rs (85%) rename crates/{tamanu => alertd}/src/doctor/checks/tamanu_http.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/tamanu_service.rs (99%) rename crates/{tamanu => alertd}/src/doctor/checks/time_sync.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/uptime.rs (100%) rename crates/{tamanu => alertd}/src/doctor/checks/version_drift.rs (96%) rename crates/{tamanu => alertd}/src/doctor/progress.rs (100%) rename crates/{tamanu => alertd}/src/doctor/server_info.rs (89%) create mode 100644 crates/alertd/src/doctor/sweep.rs rename crates/{bestool/src/actions/tamanu/alertd/doctor_task.rs => alertd/src/doctor/task.rs} 
(92%) delete mode 100644 crates/tamanu/src/doctor.rs diff --git a/Cargo.lock b/Cargo.lock index dc4f4963..ac88a1bc 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -620,18 +620,23 @@ version = "6.1.1" dependencies = [ "axum", "bestool-canopy", + "bestool-kopia", "bestool-postgres", + "bestool-tamanu", "blake3", "bytes", "clap", "clap-markdown", "dirs", + "duct", "futures", "glob", + "hickory-resolver", "jiff", "lloggs", "mailgun-rs", "miette", + "node-semver", "notify", "prometheus", "pulldown-cmark", @@ -648,6 +653,7 @@ dependencies = [ "thiserror 2.0.18", "tokio", "tokio-postgres", + "tokio-stream", "tokio-util", "tower", "tower-http", @@ -769,12 +775,10 @@ dependencies = [ name = "bestool-tamanu" version = "0.10.2" dependencies = [ - "bestool-kopia", "dirs", "duct", "futures", "glob", - "hickory-resolver", "itertools 0.14.0", "jiff", "json5", @@ -782,11 +786,9 @@ dependencies = [ "leon-macros", "miette", "node-semver", - "owo-colors 4.3.0", "p256", "percent-encoding", "regex", - "reqwest", "serde", "serde_json", "sysinfo", diff --git a/crates/alertd/Cargo.toml b/crates/alertd/Cargo.toml index 5b2d7029..40561ce1 100644 --- a/crates/alertd/Cargo.toml +++ b/crates/alertd/Cargo.toml @@ -19,15 +19,20 @@ required-features = ["cli"] [dependencies] axum = "0.8.9" bestool-canopy = { version = "0.2.0", path = "../canopy" } +bestool-kopia = { version = "0.1.0", path = "../kopia" } bestool-postgres = { version = "1.0.11", path = "../postgres" } +bestool-tamanu = { version = "0.10.2", path = "../tamanu" } blake3 = "1.8.5" bytes = "1.9.0" clap = { workspace = true, optional = true, features = ["env", "wrap_help"] } clap-markdown = { version = "0.1.5", optional = true } dirs = "6.0.0" +duct = "1.1.0" futures = { workspace = true } glob = "0.3.3" +hickory-resolver = "0.26.1" jiff = { version = "0.2.24", features = ["serde"] } +node-semver = "2.2.0" lloggs = { workspace = true, optional = true } mailgun-rs = "2.0.2" miette = { workspace = true } @@ -46,6 +51,7 @@ tera = "1.20.0" 
thiserror = { workspace = true } tokio = { workspace = true, features = ["full"] } tokio-postgres = { version = "0.7.17", features = ["with-jiff-0_2", "with-serde_json-1"] } +tokio-stream = "0.1.17" tokio-util = { workspace = true } tower = "0.5.2" tower-http = { version = "0.6.6", features = ["trace"] } diff --git a/crates/alertd/src/doctor.rs b/crates/alertd/src/doctor.rs new file mode 100644 index 00000000..93e4676c --- /dev/null +++ b/crates/alertd/src/doctor.rs @@ -0,0 +1,9 @@ +pub mod check; +pub mod checks; +pub mod progress; +pub mod server_info; +pub mod sweep; +pub mod task; + +pub use sweep::{SweepResult, overall_from_payload, perform_sweep}; +pub use task::DoctorTask; diff --git a/crates/tamanu/src/doctor/check.rs b/crates/alertd/src/doctor/check.rs similarity index 100% rename from crates/tamanu/src/doctor/check.rs rename to crates/alertd/src/doctor/check.rs diff --git a/crates/tamanu/src/doctor/checks.rs b/crates/alertd/src/doctor/checks.rs similarity index 98% rename from crates/tamanu/src/doctor/checks.rs rename to crates/alertd/src/doctor/checks.rs index 8a4edbf1..e998bae8 100644 --- a/crates/tamanu/src/doctor/checks.rs +++ b/crates/alertd/src/doctor/checks.rs @@ -9,7 +9,7 @@ use std::{path::PathBuf, sync::Arc}; use node_semver::Version; use tokio_postgres::Client as PgClient; -use crate::{ApiServerKind, config::TamanuConfig}; +use bestool_tamanu::{ApiServerKind, config::TamanuConfig}; use super::check::Check; diff --git a/crates/tamanu/src/doctor/checks/caddy_version.rs b/crates/alertd/src/doctor/checks/caddy_version.rs similarity index 99% rename from crates/tamanu/src/doctor/checks/caddy_version.rs rename to crates/alertd/src/doctor/checks/caddy_version.rs index 3256a70e..a251ff81 100644 --- a/crates/tamanu/src/doctor/checks/caddy_version.rs +++ b/crates/alertd/src/doctor/checks/caddy_version.rs @@ -17,8 +17,10 @@ use node_semver::Version; use tokio::process::Command; +use bestool_tamanu::caddy; + use super::CheckContext; -use crate::{caddy, 
doctor::check::Check}; +use crate::doctor::check::Check; const CHECK_NAME: &str = "caddy_version"; diff --git a/crates/tamanu/src/doctor/checks/db_connect.rs b/crates/alertd/src/doctor/checks/db_connect.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/db_connect.rs rename to crates/alertd/src/doctor/checks/db_connect.rs diff --git a/crates/tamanu/src/doctor/checks/db_version.rs b/crates/alertd/src/doctor/checks/db_version.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/db_version.rs rename to crates/alertd/src/doctor/checks/db_version.rs diff --git a/crates/tamanu/src/doctor/checks/disk_free.rs b/crates/alertd/src/doctor/checks/disk_free.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/disk_free.rs rename to crates/alertd/src/doctor/checks/disk_free.rs diff --git a/crates/tamanu/src/doctor/checks/external_users.rs b/crates/alertd/src/doctor/checks/external_users.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/external_users.rs rename to crates/alertd/src/doctor/checks/external_users.rs diff --git a/crates/tamanu/src/doctor/checks/fhir_jobs.rs b/crates/alertd/src/doctor/checks/fhir_jobs.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/fhir_jobs.rs rename to crates/alertd/src/doctor/checks/fhir_jobs.rs diff --git a/crates/tamanu/src/doctor/checks/http_errors.rs b/crates/alertd/src/doctor/checks/http_errors.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/http_errors.rs rename to crates/alertd/src/doctor/checks/http_errors.rs diff --git a/crates/tamanu/src/doctor/checks/kopia_backup.rs b/crates/alertd/src/doctor/checks/kopia_backup.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/kopia_backup.rs rename to crates/alertd/src/doctor/checks/kopia_backup.rs diff --git a/crates/tamanu/src/doctor/checks/load.rs b/crates/alertd/src/doctor/checks/load.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/load.rs rename to 
crates/alertd/src/doctor/checks/load.rs diff --git a/crates/tamanu/src/doctor/checks/memory.rs b/crates/alertd/src/doctor/checks/memory.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/memory.rs rename to crates/alertd/src/doctor/checks/memory.rs diff --git a/crates/tamanu/src/doctor/checks/migrations.rs b/crates/alertd/src/doctor/checks/migrations.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/migrations.rs rename to crates/alertd/src/doctor/checks/migrations.rs diff --git a/crates/tamanu/src/doctor/checks/server_id.rs b/crates/alertd/src/doctor/checks/server_id.rs similarity index 85% rename from crates/tamanu/src/doctor/checks/server_id.rs rename to crates/alertd/src/doctor/checks/server_id.rs index 40699ec6..fdf0077e 100644 --- a/crates/tamanu/src/doctor/checks/server_id.rs +++ b/crates/alertd/src/doctor/checks/server_id.rs @@ -1,5 +1,7 @@ +use bestool_tamanu::server_info::get_or_create_server_id; + use super::CheckContext; -use crate::{doctor::check::Check, server_info::get_or_create_server_id}; +use crate::doctor::check::Check; pub async fn run(ctx: CheckContext) -> Check { // Pass the DB through optionally: an already-provisioned host has the id diff --git a/crates/tamanu/src/doctor/checks/sync_sessions.rs b/crates/alertd/src/doctor/checks/sync_sessions.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/sync_sessions.rs rename to crates/alertd/src/doctor/checks/sync_sessions.rs diff --git a/crates/tamanu/src/doctor/checks/tailscale.rs b/crates/alertd/src/doctor/checks/tailscale.rs similarity index 87% rename from crates/tamanu/src/doctor/checks/tailscale.rs rename to crates/alertd/src/doctor/checks/tailscale.rs index 00f47de3..323dfa87 100644 --- a/crates/tamanu/src/doctor/checks/tailscale.rs +++ b/crates/alertd/src/doctor/checks/tailscale.rs @@ -1,5 +1,7 @@ +use bestool_tamanu::server_info::get_tailscale_info; + use super::CheckContext; -use crate::{doctor::check::Check, 
server_info::get_tailscale_info}; +use crate::doctor::check::Check; pub async fn run(_ctx: CheckContext) -> Check { let (ip, name) = get_tailscale_info().await; diff --git a/crates/tamanu/src/doctor/checks/tamanu_found.rs b/crates/alertd/src/doctor/checks/tamanu_found.rs similarity index 85% rename from crates/tamanu/src/doctor/checks/tamanu_found.rs rename to crates/alertd/src/doctor/checks/tamanu_found.rs index d22931e4..c415c429 100644 --- a/crates/tamanu/src/doctor/checks/tamanu_found.rs +++ b/crates/alertd/src/doctor/checks/tamanu_found.rs @@ -1,5 +1,7 @@ +use bestool_tamanu::ApiServerKind; + use super::CheckContext; -use crate::{ApiServerKind, doctor::check::Check}; +use crate::doctor::check::Check; pub async fn run(ctx: CheckContext) -> Check { let kind = match ctx.kind { diff --git a/crates/tamanu/src/doctor/checks/tamanu_http.rs b/crates/alertd/src/doctor/checks/tamanu_http.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/tamanu_http.rs rename to crates/alertd/src/doctor/checks/tamanu_http.rs diff --git a/crates/tamanu/src/doctor/checks/tamanu_service.rs b/crates/alertd/src/doctor/checks/tamanu_service.rs similarity index 99% rename from crates/tamanu/src/doctor/checks/tamanu_service.rs rename to crates/alertd/src/doctor/checks/tamanu_service.rs index 31f1f228..21110809 100644 --- a/crates/tamanu/src/doctor/checks/tamanu_service.rs +++ b/crates/alertd/src/doctor/checks/tamanu_service.rs @@ -1,8 +1,6 @@ use serde_json::{Value, json}; -use super::CheckContext; -use crate::{ - doctor::check::Check, +use bestool_tamanu::{ pm2, services::{ Expectation, ExpectedState, Instances, Supervisor, expected, parse_systemd_unit, @@ -11,6 +9,9 @@ use crate::{ systemd, }; +use super::CheckContext; +use crate::doctor::check::Check; + pub async fn run(ctx: CheckContext) -> Check { let supervisor = if cfg!(target_os = "linux") { Supervisor::Systemd @@ -25,7 +26,7 @@ pub async fn run(ctx: CheckContext) -> Check { // DB setting. Without a DB client (e.g. 
unreachable), pass `None` so the // expectation surfaces as Unknown rather than a false-negative Down. let patient_portal_enabled = match ctx.db.as_deref() { - Some(client) => crate::server_info::query_patient_portal_enabled(client).await, + Some(client) => bestool_tamanu::server_info::query_patient_portal_enabled(client).await, None => None, }; let patient_portal_instanced = @@ -531,8 +532,10 @@ fn outcome_to_json(o: &Outcome) -> Value { #[cfg(test)] mod tests { + use bestool_tamanu::{ApiServerKind, config::TamanuConfig}; + use super::*; - use crate::{ApiServerKind, config::TamanuConfig, doctor::check::CheckStatus}; + use crate::doctor::check::CheckStatus; fn cfg(fhir_worker: bool) -> TamanuConfig { let json = serde_json::json!({ diff --git a/crates/tamanu/src/doctor/checks/time_sync.rs b/crates/alertd/src/doctor/checks/time_sync.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/time_sync.rs rename to crates/alertd/src/doctor/checks/time_sync.rs diff --git a/crates/tamanu/src/doctor/checks/uptime.rs b/crates/alertd/src/doctor/checks/uptime.rs similarity index 100% rename from crates/tamanu/src/doctor/checks/uptime.rs rename to crates/alertd/src/doctor/checks/uptime.rs diff --git a/crates/tamanu/src/doctor/checks/version_drift.rs b/crates/alertd/src/doctor/checks/version_drift.rs similarity index 96% rename from crates/tamanu/src/doctor/checks/version_drift.rs rename to crates/alertd/src/doctor/checks/version_drift.rs index e76db833..c4e5d033 100644 --- a/crates/tamanu/src/doctor/checks/version_drift.rs +++ b/crates/alertd/src/doctor/checks/version_drift.rs @@ -7,13 +7,14 @@ use serde_json::{Value, json}; -use super::CheckContext; -use crate::{ - doctor::check::Check, +use bestool_tamanu::{ services::{Supervisor, expected, parse_systemd_unit, systemd_patient_portal_instanced}, versions, }; +use super::CheckContext; +use crate::doctor::check::Check; + pub async fn run(ctx: CheckContext) -> Check { let supervisor = if cfg!(target_os = "linux") { 
Supervisor::Systemd @@ -49,7 +50,7 @@ pub async fn run(ctx: CheckContext) -> Check { // started or orphaned containers aren't drift; they're outside the // expected set. let patient_portal_enabled = match ctx.db.as_deref() { - Some(client) => crate::server_info::query_patient_portal_enabled(client).await, + Some(client) => bestool_tamanu::server_info::query_patient_portal_enabled(client).await, None => None, }; let patient_portal_instanced = diff --git a/crates/tamanu/src/doctor/progress.rs b/crates/alertd/src/doctor/progress.rs similarity index 100% rename from crates/tamanu/src/doctor/progress.rs rename to crates/alertd/src/doctor/progress.rs diff --git a/crates/tamanu/src/doctor/server_info.rs b/crates/alertd/src/doctor/server_info.rs similarity index 89% rename from crates/tamanu/src/doctor/server_info.rs rename to crates/alertd/src/doctor/server_info.rs index 82f3641a..ba113a86 100644 --- a/crates/tamanu/src/doctor/server_info.rs +++ b/crates/alertd/src/doctor/server_info.rs @@ -12,7 +12,7 @@ use sysinfo::{Disks, System}; use tokio::net::TcpStream; use tracing::debug; -use crate::server_info::{detect_node_version, detect_virtualisation}; +use bestool_tamanu::server_info::{detect_node_version, detect_virtualisation}; const PROBE_TIMEOUT: Duration = Duration::from_secs(3); const IPV4_PROBE_ADDR: SocketAddr = SocketAddr::new(IpAddr::V4(Ipv4Addr::new(1, 1, 1, 1)), 443); @@ -38,7 +38,7 @@ pub struct Filesystem { #[derive(Debug, Clone, Serialize)] #[serde(rename_all = "camelCase")] pub struct ServerInfo { - pub bestool_version: &'static str, + pub bestool_version: String, pub tamanu_version: String, /// Host's installed Node.js version (bare, no leading `v`), if node is on /// `PATH`. Omitted when node isn't installed or can't be queried. @@ -92,17 +92,11 @@ pub struct ServerFacts { /// Build the status-payload `ServerInfo` block. 
/// -/// `bestool_version` is the version of the *calling* bestool binary — it must -/// be provided by the caller (typically `env!("CARGO_PKG_VERSION")` resolved -/// in the bestool crate) rather than evaluated here, because this code lives -/// in the `bestool-tamanu` library crate, where `env!("CARGO_PKG_VERSION")` -/// would resolve to that crate's own version (`bestool-tamanu`, currently 0.1.x) -/// — the bug this signature change fixes. -pub async fn gather( - bestool_version: &'static str, - tamanu_version: &str, - facts: ServerFacts, -) -> ServerInfo { +/// `bestool_version` is the version of the *calling* binary — it must be +/// provided by the caller (`env!("CARGO_PKG_VERSION")` resolved in the bestool +/// crate) rather than evaluated here, since in this library crate it would +/// resolve to the library's own version instead of the running binary's. +pub async fn gather(bestool_version: &str, tamanu_version: &str, facts: ServerFacts) -> ServerInfo { let disks = Disks::new_with_refreshed_list(); let filesystems = disks .iter() @@ -122,7 +116,7 @@ pub async fn gather( .map(|s| s.to_string()); ServerInfo { - bestool_version, + bestool_version: bestool_version.to_string(), tamanu_version: tamanu_version.to_string(), node_version: detect_node_version().await, hostname: System::host_name(), diff --git a/crates/alertd/src/doctor/sweep.rs b/crates/alertd/src/doctor/sweep.rs new file mode 100644 index 00000000..6896c5a0 --- /dev/null +++ b/crates/alertd/src/doctor/sweep.rs @@ -0,0 +1,358 @@ +use std::{path::Path, sync::Arc}; + +use futures::stream::{FuturesUnordered, StreamExt}; +use miette::{IntoDiagnostic, Result, miette}; +use node_semver::Version; +use serde_json::{Map, Value}; +use tracing::{debug, warn}; + +use bestool_tamanu::{config::TamanuConfig, server_info::get_or_create_server_id}; + +use crate::doctor::{ + check::{Check, OverallResult}, + checks::{self, CheckContext}, + progress::{DoctorEvent, ProgressSender}, + server_info::{self, ServerFacts}, 
+}; + +pub struct SweepResult { + pub server_id: Option, + pub results: Vec<(Check, bool)>, + pub overall: OverallResult, + pub payload: Value, + /// `SELECT version()` result observed during this sweep, available so + /// callers (e.g. the daemon plugin) can cache it across ticks instead of + /// re-querying every minute. + pub pg_version: Option<String>, +} + +#[expect( + clippy::too_many_arguments, + reason = "each argument is a distinct knob the CLI and daemon callers need to thread through" +)] +pub async fn perform_sweep( + binary_version: &str, + version: &Version, + root: &Path, + config: Arc<TamanuConfig>, + database_url: &str, + http_client: reqwest::Client, + selected_names: &[String], + skip_names: &[String], + cached_pg_version: Option<String>, + progress: Option<ProgressSender>, +) -> Result<SweepResult> { + // Open a single connection up-front. Checks that need the DB share it; the + // `db_connect` check separately measures the open latency for reporting. + // Goes through `bestool_postgres::pool::connect_one` so all DB opens in + // the project share one SSL fallback / auth retry / app-name path.
+ let db = match bestool_postgres::pool::connect_one(database_url, "bestool-tamanu-doctor").await + { + Ok(client) => Some(Arc::new(client)), + Err(err) => { + warn!(%err, "doctor could not open Tamanu DB; DB-dependent checks will skip"); + None + } + }; + + let kind = bestool_tamanu::detect_kind(&config, db.as_deref()).await; + debug!(?kind, "detected Tamanu server kind for doctor sweep"); + + let check_ctx = CheckContext { + tamanu_version: version.clone(), + tamanu_root: root.to_path_buf(), + config: config.clone(), + kind, + database_url: database_url.to_owned(), + db: db.clone(), + http_client, + }; + + let registry = checks::all(); + let known: Vec<&str> = registry.iter().map(|e| e.name).collect(); + if let Some(unknown) = selected_names.iter().find(|n| !known.contains(&n.as_str())) { + return Err(miette!( + "unknown check name `{unknown}`; known checks: {}", + known.join(", ") + )); + } + if let Some(unknown) = skip_names.iter().find(|n| !known.contains(&n.as_str())) { + return Err(miette!( + "unknown check name `{unknown}` in --skip; known checks: {}", + known.join(", ") + )); + } + + let selected: Vec<(usize, &checks::CheckEntry)> = registry + .iter() + .enumerate() + .filter(|(_, e)| selected_names.is_empty() || selected_names.iter().any(|n| n == e.name)) + .filter(|(_, e)| !skip_names.iter().any(|n| n == e.name)) + .collect(); + + // Run all selected checks concurrently. Results are collated by registry + // index before returning, so callers see a stable order regardless of + // completion order. A progress channel can observe results as they land. 
+ let mut pending = FuturesUnordered::new(); + for (idx, entry) in &selected { + let ctx = check_ctx.clone(); + let on_wire = entry.on_wire; + let idx = *idx; + let fut = (entry.run)(ctx); + pending.push(async move { + let result = fut.await; + (idx, on_wire, result) + }); + } + + let mut completed: Vec<(usize, Check, bool)> = Vec::with_capacity(selected.len()); + while let Some((idx, on_wire, check)) = pending.next().await { + if let Some(tx) = progress.as_ref() { + let _ = tx.send(DoctorEvent::Completed(check.clone())); + } + completed.push((idx, check, on_wire)); + } + completed.sort_by_key(|(idx, _, _)| *idx); + let results: Vec<(Check, bool)> = completed.into_iter().map(|(_, c, w)| (c, w)).collect(); + + // Resolve via the file path first so a doctor sweep can still report to + // canopy when the DB is down — that's exactly the moment canopy most + // needs to hear from us. + let server_id = match get_or_create_server_id(db.as_deref()).await { + Ok(id) => Some(id), + Err(err) => { + warn!("could not resolve metaServerId: {err}"); + None + } + }; + + let facts = collect_server_facts(&config, db.as_deref(), cached_pg_version).await; + let pg_version = facts.pg_version.clone(); + // `binary_version` is the running binary's (bestool's) version, threaded in + // by the caller. Evaluating `env!("CARGO_PKG_VERSION")` here would resolve + // to this library's version instead, which is the wrong answer for the wire + // payload. 
+ let info = server_info::gather(binary_version, &version.to_string(), facts).await; + let info_value = serde_json::to_value(&info).into_diagnostic()?; + + let overall = + OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); + let payload = build_payload(&info_value, &results, overall); + + Ok(SweepResult { + server_id, + results, + overall, + payload, + pg_version, + }) +} + +async fn collect_server_facts( + config: &TamanuConfig, + db: Option<&tokio_postgres::Client>, + cached_pg_version: Option<String>, +) -> ServerFacts { + let mut facts = ServerFacts { + canonical_url: config.canonical_url().map(|u| u.to_string()), + timezone: config.primary_time_zone().map(|s| s.to_string()), + pg_version: cached_pg_version, + ..Default::default() + }; + + let Some(client) = db else { + return facts; + }; + + if facts.pg_version.is_none() { + match client.query_one("SELECT version()", &[]).await { + Ok(row) => match row.try_get::<_, String>(0) { + Ok(v) => facts.pg_version = Some(v), + Err(err) => warn!("decoding pg_version: {err}"), + }, + Err(err) => warn!("SELECT version() failed: {err}"), + } + } + + match client + .query_opt( + "SELECT value FROM local_system_facts WHERE key = 'currentSyncTick'", + &[], + ) + .await + { + Ok(Some(row)) => match row.try_get::<_, String>(0) { + Ok(tick) => facts.current_sync_tick = Some(tick), + Err(err) => warn!("decoding currentSyncTick: {err}"), + }, + Ok(None) => {} + Err(err) => warn!("querying currentSyncTick: {err}"), + } + + facts +} + +pub fn overall_from_payload(payload: &Value) -> OverallResult { + let healthy = payload + .get("healthy") + .and_then(Value::as_bool) + .unwrap_or(true); + if !healthy { + return OverallResult::Failing; + } + // `healthy: true` covers both Healthy and Degraded — peek at the + // per-check entries to disambiguate. A `healthy: false` entry in a + // top-level-healthy payload means a warning was logged.
+ let degraded = payload + .get("health") + .and_then(Value::as_array) + .map(|arr| { + arr.iter().any(|c| { + c.get("healthy") == Some(&Value::Bool(false)) + && c.get("skipped") != Some(&Value::Bool(true)) + }) + }) + .unwrap_or(false); + if degraded { + OverallResult::Degraded + } else { + OverallResult::Healthy + } +} + +fn build_payload(info: &Value, results: &[(Check, bool)], overall: OverallResult) -> Value { + let mut payload: Map<String, Value> = match info { + Value::Object(o) => o.clone(), + _ => Map::new(), + }; + + // Lift any `payload_extras` from individual checks into the top-level + // payload (alongside server facts like `osTimezone`). Lets a check carry + // bulky context-data that belongs with server facts rather than crowding + // its diagnostic entry in `health[]`. + for (check, _) in results { + for (k, v) in &check.payload_extras { + payload.insert(k.clone(), v.clone()); + } + } + + let health: Vec<Value> = results + .iter() + .filter(|(_, on_wire)| *on_wire) + .map(|(c, _)| c.to_wire()) + .collect(); + + payload.insert("healthy".into(), overall.is_healthy_top_level().into()); + payload.insert("health".into(), Value::Array(health)); + + Value::Object(payload) +} + +#[cfg(test)] +mod tests { + use super::*; + + fn pass(name: &'static str) -> (Check, bool) { + (Check::pass(name, "ok"), true) + } + fn warn(name: &'static str) -> (Check, bool) { + (Check::warning(name, "deg", "reason"), true) + } + fn fail(name: &'static str) -> (Check, bool) { + (Check::fail(name, "bad", "reason"), true) + } + fn skip(name: &'static str) -> (Check, bool) { + (Check::skip(name, "not run", "reason"), true) + } + + #[test] + fn payload_all_pass_is_healthy() { + let results = vec![pass("a"), pass("b")]; + let overall = + OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); + let payload = build_payload(&Value::Object(Default::default()), &results, overall); + assert_eq!(payload["healthy"], true); + assert_eq!(payload["health"].as_array().unwrap().len(), 2); +
assert!(payload["health"][0]["healthy"].as_bool().unwrap()); + } + + #[test] + fn payload_warning_keeps_top_healthy_but_check_unhealthy() { + let results = vec![pass("a"), warn("b")]; + let overall = + OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); + let payload = build_payload(&Value::Object(Default::default()), &results, overall); + assert_eq!(payload["healthy"], true); + assert_eq!(payload["health"][1]["healthy"], false); + } + + #[test] + fn payload_fail_flips_top_level() { + let results = vec![pass("a"), warn("b"), fail("c")]; + let overall = + OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); + let payload = build_payload(&Value::Object(Default::default()), &results, overall); + assert_eq!(payload["healthy"], false); + } + + #[test] + fn payload_lifts_payload_extras_into_top_level() { + // `payload_extras` is for data a check wants alongside server facts + // (osTimezone etc), not in its per-check entry. The tamanu_service + // check uses it for raw service inventory. + let mut info = serde_json::Map::new(); + info.insert("osTimezone".into(), "Pacific/Auckland".into()); + let info_value = Value::Object(info); + + let check = Check::pass("svc", "ok") + .with_detail("supervisor", "systemd") + .with_payload_extra( + "services", + serde_json::json!({"supervisor": "systemd", "expectations": []}), + ); + let results = vec![(check, true)]; + let overall = + OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); + let payload = build_payload(&info_value, &results, overall); + + assert_eq!(payload["osTimezone"], "Pacific/Auckland"); + // Lifted into the top level, alongside osTimezone. + assert_eq!(payload["services"]["supervisor"], "systemd"); + // And NOT duplicated into the per-check entry. + assert!(payload["health"][0].get("services").is_none()); + // But the lean per-check detail (supervisor label) is still on the + //
+ assert_eq!(payload["health"][0]["supervisor"], "systemd"); + } + + #[test] + fn off_wire_checks_skipped_in_health_array() { + let results = vec![ + (Check::pass("on", "ok"), true), + (Check::pass("off", "ok"), false), + ]; + let overall = + OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); + let payload = build_payload(&Value::Object(Default::default()), &results, overall); + let names: Vec<&str> = payload["health"] + .as_array() + .unwrap() + .iter() + .map(|v| v["check"].as_str().unwrap()) + .collect(); + assert_eq!(names, vec!["on"]); + } + + #[test] + fn payload_skip_is_healthy_on_wire() { + // The whole point of distinguishing Skip from Fail/Warning is that + // "we don't know" shouldn't fire alerts downstream of the wire format. + let results = vec![pass("a"), skip("b")]; + let overall = + OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); + let payload = build_payload(&Value::Object(Default::default()), &results, overall); + assert_eq!(payload["healthy"], true); + assert_eq!(payload["health"][1]["healthy"], true); + assert_eq!(payload["health"][1]["skipped"], true); + } +} diff --git a/crates/bestool/src/actions/tamanu/alertd/doctor_task.rs b/crates/alertd/src/doctor/task.rs similarity index 92% rename from crates/bestool/src/actions/tamanu/alertd/doctor_task.rs rename to crates/alertd/src/doctor/task.rs index 7eb4e037..3dcc3f8b 100644 --- a/crates/bestool/src/actions/tamanu/alertd/doctor_task.rs +++ b/crates/alertd/src/doctor/task.rs @@ -1,11 +1,6 @@ use std::{path::PathBuf, sync::Arc, time::Duration}; -use bestool_alertd::{ - BackgroundTask, TaskContext, TaskEndpoint, TaskEndpointResponse, - canopy::DEFAULT_CANOPY_URL, - tasks::TaskEndpointHandler, -}; -use bestool_tamanu::{config::TamanuConfig, doctor::progress::DoctorEvent}; +use bestool_tamanu::config::TamanuConfig; use futures::{StreamExt, future::BoxFuture, stream::BoxStream}; use jiff::Timestamp; use miette::{Result, miette}; @@ -15,7
+10,10 @@ use serde_json::{Value, json}; use tokio::sync::{Mutex, mpsc}; use tracing::warn; -use crate::actions::tamanu::doctor; +use crate::canopy::DEFAULT_CANOPY_URL; +use crate::doctor::{self, progress::DoctorEvent}; +use crate::tasks::TaskEndpointHandler; +use crate::{BackgroundTask, TaskContext, TaskEndpoint, TaskEndpointResponse}; const DOCTOR_INTERVAL: Duration = Duration::from_secs(60); @@ -29,6 +27,7 @@ pub struct DoctorTask { } struct DoctorTaskInner { + binary_version: String, tamanu_version: Version, tamanu_root: PathBuf, config: Arc, @@ -53,6 +52,7 @@ struct LatestSweep { impl DoctorTask { pub fn new( + binary_version: String, tamanu_version: Version, tamanu_root: PathBuf, config: Arc, @@ -60,6 +60,7 @@ impl DoctorTask { ) -> Self { Self { inner: Arc::new(DoctorTaskInner { + binary_version, tamanu_version, tamanu_root, config, @@ -78,10 +79,11 @@ impl DoctorTaskInner { async fn run_sweep( self: &Arc, ctx: &TaskContext, - progress: Option, + progress: Option, ) -> Result { let cached = self.pg_version_cache.lock().await.clone(); let sweep = doctor::perform_sweep( + &self.binary_version, &self.tamanu_version, &self.tamanu_root, self.config.clone(), @@ -189,9 +191,8 @@ impl DoctorTaskInner { } }); - let stream: BoxStream<'static, Value> = Box::pin( - tokio_stream::wrappers::UnboundedReceiverStream::new(out_rx).map(|v| v), - ); + let stream: BoxStream<'static, Value> = + Box::pin(tokio_stream::wrappers::UnboundedReceiverStream::new(out_rx).map(|v| v)); TaskEndpointResponse::JsonLines(stream) } } diff --git a/crates/alertd/src/lib.rs b/crates/alertd/src/lib.rs index b98e85f5..1db2e10d 100644 --- a/crates/alertd/src/lib.rs +++ b/crates/alertd/src/lib.rs @@ -6,6 +6,7 @@ pub use bestool_canopy::Redacted; mod alert; pub mod commands; mod daemon; +pub mod doctor; mod events; mod glob_resolver; pub mod http_server; diff --git a/crates/bestool/Cargo.toml b/crates/bestool/Cargo.toml index 04ed2e71..3adbca9e 100644 --- a/crates/bestool/Cargo.toml +++ 
b/crates/bestool/Cargo.toml @@ -150,7 +150,7 @@ tamanu-alerts = [ "dep:tokio-postgres", "dep:walkdir", ] -tamanu-alertd = ["__tamanu", "tamanu-config", "bestool-tamanu/doctor", "dep:bestool-alertd", "dep:bestool-postgres", "dep:p256", "dep:serde_path_to_error", "dep:serde_yaml", "dep:tokio-postgres", "dep:tokio-stream", "dep:walkdir"] +tamanu-alertd = ["__tamanu", "tamanu-config", "dep:bestool-alertd", "dep:bestool-postgres", "dep:p256", "dep:serde_path_to_error", "dep:serde_yaml", "dep:tokio-postgres", "dep:tokio-stream", "dep:walkdir"] tamanu-artifacts = ["__tamanu", "dep:comfy-table", "dep:detect-targets", "dep:target-tuples"] tamanu-backup = ["__tamanu", "file", "tamanu-config", "dep:bestool-psql", "dep:algae-cli", "dep:duct"] tamanu-backup-configs = ["__tamanu", "tamanu-backup", "dep:walkdir", "dep:zip"] @@ -158,7 +158,7 @@ tamanu-config = ["__tamanu"] tamanu-doctor = [ "__tamanu", "tamanu-config", - "bestool-tamanu/doctor", + "dep:bestool-alertd", "dep:bestool-canopy", "dep:bestool-postgres", "dep:bestool-psql", diff --git a/crates/bestool/src/actions/tamanu/alertd.rs b/crates/bestool/src/actions/tamanu/alertd.rs index 68afe461..89e9190d 100644 --- a/crates/bestool/src/actions/tamanu/alertd.rs +++ b/crates/bestool/src/actions/tamanu/alertd.rs @@ -13,13 +13,11 @@ use bestool_tamanu::{ server_info::{fetch_device_key_with, query_device_key_row}, }; +use bestool_alertd::doctor::DoctorTask; + use super::{TamanuArgs, find_tamanu}; use crate::actions::Context; -mod doctor_task; - -use doctor_task::DoctorTask; - /// Run the alert daemon /// /// The alert and target definitions are documented online at: @@ -364,6 +362,7 @@ async fn build_config( if !no_healthchecks { daemon_config = daemon_config.with_task(Arc::new(DoctorTask::new( + env!("CARGO_PKG_VERSION").to_string(), tamanu_version.clone(), root.to_path_buf(), config.clone(), diff --git a/crates/bestool/src/actions/tamanu/doctor.rs b/crates/bestool/src/actions/tamanu/doctor.rs index c6cc632f..c01972b3 100644 --- 
a/crates/bestool/src/actions/tamanu/doctor.rs +++ b/crates/bestool/src/actions/tamanu/doctor.rs @@ -5,24 +5,21 @@ use std::{ }; use clap::Parser; -use futures::stream::{FuturesUnordered, StreamExt}; use miette::{IntoDiagnostic, Result, WrapErr, miette}; use node_semver::Version; use owo_colors::OwoColorize; -use serde_json::{Map, Value}; +use serde_json::Value; use tokio::sync::mpsc; use tracing::{debug, warn}; -use bestool_tamanu::{ - config::{TamanuConfig, load_config}, - doctor::{ - check::{Check, CheckStatus, OverallResult}, - checks::{self, CheckContext}, - progress::{DoctorEvent, ProgressSender}, - server_info::{self, ServerFacts}, - }, - server_info::get_or_create_server_id, +use bestool_alertd::doctor::{ + SweepResult, + check::{Check, CheckStatus, OverallResult}, + checks, + overall_from_payload, perform_sweep, + progress::DoctorEvent, }; +use bestool_tamanu::config::{TamanuConfig, load_config}; use super::{TamanuArgs, find_tamanu}; use crate::actions::Context; @@ -177,6 +174,7 @@ async fn run_local_sweep( let progress = renderer.as_ref().map(|(tx, _)| tx.clone()); let sweep = perform_sweep( + env!("CARGO_PKG_VERSION"), version, root, config, @@ -465,239 +463,6 @@ async fn run_daemon_recompute( }) } -fn overall_from_payload(payload: &Value) -> OverallResult { - let healthy = payload - .get("healthy") - .and_then(Value::as_bool) - .unwrap_or(true); - if !healthy { - return OverallResult::Failing; - } - // `healthy: true` covers both Healthy and Degraded — peek at the - // per-check entries to disambiguate. A `healthy: false` entry in a - // top-level-healthy payload means a warning was logged. 
- let degraded = payload - .get("health") - .and_then(Value::as_array) - .map(|arr| { - arr.iter().any(|c| { - c.get("healthy") == Some(&Value::Bool(false)) - && c.get("skipped") != Some(&Value::Bool(true)) - }) - }) - .unwrap_or(false); - if degraded { - OverallResult::Degraded - } else { - OverallResult::Healthy - } -} - -pub(super) struct SweepResult { - pub server_id: Option, - pub results: Vec<(Check, bool)>, - pub overall: OverallResult, - pub payload: Value, - /// `SELECT version()` result observed during this sweep, available so - /// callers (e.g. the daemon plugin) can cache it across ticks instead of - /// re-querying every minute. - pub pg_version: Option, -} - -#[expect( - clippy::too_many_arguments, - reason = "each argument is a distinct knob the CLI and daemon callers need to thread through" -)] -pub(super) async fn perform_sweep( - version: &Version, - root: &Path, - config: Arc, - database_url: &str, - http_client: reqwest::Client, - selected_names: &[String], - skip_names: &[String], - cached_pg_version: Option, - progress: Option, -) -> Result { - // Open a single connection up-front. Checks that need the DB share it; the - // `db_connect` check separately measures the open latency for reporting. - // Goes through `bestool_postgres::pool::connect_one` so all DB opens in - // the project share one SSL fallback / auth retry / app-name path. 
- let db = match bestool_postgres::pool::connect_one(database_url, "bestool-tamanu-doctor").await - { - Ok(client) => Some(Arc::new(client)), - Err(err) => { - warn!(%err, "doctor could not open Tamanu DB; DB-dependent checks will skip"); - None - } - }; - - let kind = bestool_tamanu::detect_kind(&config, db.as_deref()).await; - debug!(?kind, "detected Tamanu server kind for doctor sweep"); - - let check_ctx = CheckContext { - tamanu_version: version.clone(), - tamanu_root: root.to_path_buf(), - config: config.clone(), - kind, - database_url: database_url.to_owned(), - db: db.clone(), - http_client, - }; - - let registry = checks::all(); - let known: Vec<&str> = registry.iter().map(|e| e.name).collect(); - if let Some(unknown) = selected_names.iter().find(|n| !known.contains(&n.as_str())) { - return Err(miette!( - "unknown check name `{unknown}`; known checks: {}", - known.join(", ") - )); - } - if let Some(unknown) = skip_names.iter().find(|n| !known.contains(&n.as_str())) { - return Err(miette!( - "unknown check name `{unknown}` in --skip; known checks: {}", - known.join(", ") - )); - } - - let selected: Vec<(usize, &checks::CheckEntry)> = registry - .iter() - .enumerate() - .filter(|(_, e)| selected_names.is_empty() || selected_names.iter().any(|n| n == e.name)) - .filter(|(_, e)| !skip_names.iter().any(|n| n == e.name)) - .collect(); - - // Run all selected checks concurrently. Results are collated by registry - // index before returning, so callers see a stable order regardless of - // completion order. A progress channel can observe results as they land. 
- let mut pending = FuturesUnordered::new(); - for (idx, entry) in &selected { - let ctx = check_ctx.clone(); - let on_wire = entry.on_wire; - let idx = *idx; - let fut = (entry.run)(ctx); - pending.push(async move { - let result = fut.await; - (idx, on_wire, result) - }); - } - - let mut completed: Vec<(usize, Check, bool)> = Vec::with_capacity(selected.len()); - while let Some((idx, on_wire, check)) = pending.next().await { - if let Some(tx) = progress.as_ref() { - let _ = tx.send(DoctorEvent::Completed(check.clone())); - } - completed.push((idx, check, on_wire)); - } - completed.sort_by_key(|(idx, _, _)| *idx); - let results: Vec<(Check, bool)> = completed.into_iter().map(|(_, c, w)| (c, w)).collect(); - - // Resolve via the file path first so a doctor sweep can still report to - // canopy when the DB is down — that's exactly the moment canopy most - // needs to hear from us. - let server_id = match get_or_create_server_id(db.as_deref()).await { - Ok(id) => Some(id), - Err(err) => { - warn!("could not resolve metaServerId: {err}"); - None - } - }; - - let facts = collect_server_facts(&config, db.as_deref(), cached_pg_version).await; - let pg_version = facts.pg_version.clone(); - // `env!("CARGO_PKG_VERSION")` here resolves at *this* crate's compile time - // — the bestool crate — which is what we want in the wire payload. The - // same expression inside `bestool-tamanu` resolves to that library's - // version (0.1.x) and gave the wrong answer before this argument existed. 
- let info = - server_info::gather(env!("CARGO_PKG_VERSION"), &version.to_string(), facts).await; - let info_value = serde_json::to_value(&info).into_diagnostic()?; - - let overall = - OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::>()); - let payload = build_payload(&info_value, &results, overall); - - Ok(SweepResult { - server_id, - results, - overall, - payload, - pg_version, - }) -} - -async fn collect_server_facts( - config: &TamanuConfig, - db: Option<&tokio_postgres::Client>, - cached_pg_version: Option, -) -> ServerFacts { - let mut facts = ServerFacts { - canonical_url: config.canonical_url().map(|u| u.to_string()), - timezone: config.primary_time_zone().map(|s| s.to_string()), - pg_version: cached_pg_version, - ..Default::default() - }; - - let Some(client) = db else { - return facts; - }; - - if facts.pg_version.is_none() { - match client.query_one("SELECT version()", &[]).await { - Ok(row) => match row.try_get::<_, String>(0) { - Ok(v) => facts.pg_version = Some(v), - Err(err) => warn!("decoding pg_version: {err}"), - }, - Err(err) => warn!("SELECT version() failed: {err}"), - } - } - - match client - .query_opt( - "SELECT value FROM local_system_facts WHERE key = 'currentSyncTick'", - &[], - ) - .await - { - Ok(Some(row)) => match row.try_get::<_, String>(0) { - Ok(tick) => facts.current_sync_tick = Some(tick), - Err(err) => warn!("decoding currentSyncTick: {err}"), - }, - Ok(None) => {} - Err(err) => warn!("querying currentSyncTick: {err}"), - } - - facts -} - -fn build_payload(info: &Value, results: &[(Check, bool)], overall: OverallResult) -> Value { - let mut payload: Map = match info { - Value::Object(o) => o.clone(), - _ => Map::new(), - }; - - // Lift any `payload_extras` from individual checks into the top-level - // payload (alongside server facts like `osTimezone`). Lets a check carry - // bulky context-data that belongs with server facts rather than crowding - // its diagnostic entry in `health[]`. 
- for (check, _) in results { - for (k, v) in &check.payload_extras { - payload.insert(k.clone(), v.clone()); - } - } - - let health: Vec<Value> = results - .iter() - .filter(|(_, on_wire)| *on_wire) - .map(|(c, _)| c.to_wire()) - .collect(); - - payload.insert("healthy".into(), overall.is_healthy_top_level().into()); - payload.insert("health".into(), Value::Array(health)); - - Value::Object(payload) -} - fn selected_names_for_render(only: &[String], skip: &[String]) -> Result> { let registry = checks::all(); let known: Vec<&str> = registry.iter().map(|e| e.name).collect(); @@ -960,84 +725,6 @@ mod tests { (Check::skip(name, "not run", "reason"), true) } - #[test] - fn payload_all_pass_is_healthy() { - let results = vec![pass("a"), pass("b")]; - let overall = - OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); - let payload = build_payload(&Value::Object(Default::default()), &results, overall); - assert_eq!(payload["healthy"], true); - assert_eq!(payload["health"].as_array().unwrap().len(), 2); - assert!(payload["health"][0]["healthy"].as_bool().unwrap()); - } - - #[test] - fn payload_warning_keeps_top_healthy_but_check_unhealthy() { - let results = vec![pass("a"), warn("b")]; - let overall = - OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); - let payload = build_payload(&Value::Object(Default::default()), &results, overall); - assert_eq!(payload["healthy"], true); - assert_eq!(payload["health"][1]["healthy"], false); - } - - #[test] - fn payload_fail_flips_top_level() { - let results = vec![pass("a"), warn("b"), fail("c")]; - let overall = - OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); - let payload = build_payload(&Value::Object(Default::default()), &results, overall); - assert_eq!(payload["healthy"], false); - } - - #[test] - fn payload_lifts_payload_extras_into_top_level() { - // `payload_extras` is for data a check wants alongside server facts - // (osTimezone etc),
not in its per-check entry. The tamanu_service - // check uses it for raw service inventory. - let mut info = serde_json::Map::new(); - info.insert("osTimezone".into(), "Pacific/Auckland".into()); - let info_value = Value::Object(info); - - let check = Check::pass("svc", "ok") - .with_detail("supervisor", "systemd") - .with_payload_extra( - "services", - serde_json::json!({"supervisor": "systemd", "expectations": []}), - ); - let results = vec![(check, true)]; - let overall = - OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); - let payload = build_payload(&info_value, &results, overall); - - assert_eq!(payload["osTimezone"], "Pacific/Auckland"); - // Lifted into the top level, alongside osTimezone. - assert_eq!(payload["services"]["supervisor"], "systemd"); - // And NOT duplicated into the per-check entry. - assert!(payload["health"][0].get("services").is_none()); - // But the lean per-check detail (supervisor label) is still on the - // `health[]` entry. - assert_eq!(payload["health"][0]["supervisor"], "systemd"); - } - - #[test] - fn off_wire_checks_skipped_in_health_array() { - let results = vec![ - (Check::pass("on", "ok"), true), - (Check::pass("off", "ok"), false), - ]; - let overall = - OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); - let payload = build_payload(&Value::Object(Default::default()), &results, overall); - let names: Vec<&str> = payload["health"] - .as_array() - .unwrap() - .iter() - .map(|v| v["check"].as_str().unwrap()) - .collect(); - assert_eq!(names, vec!["on"]); - } - #[test] fn render_plain_contains_summary_line() { let results = vec![pass("a"), warn("b")]; @@ -1072,19 +759,6 @@ mod tests { assert!(!out.contains("1 warning")); } - #[test] - fn skip_is_healthy_on_wire() { - // The whole point of distinguishing Skip from Fail/Warning is that - // "we don't know" shouldn't fire alerts downstream of the wire format.
- let results = vec![pass("a"), skip("b")]; - let overall = - OverallResult::from_checks(&results.iter().map(|(c, _)| c.clone()).collect::<Vec<_>>()); - let payload = build_payload(&Value::Object(Default::default()), &results, overall); - assert_eq!(payload["healthy"], true); - assert_eq!(payload["health"][1]["healthy"], true); - assert_eq!(payload["health"][1]["skipped"], true); - } - #[test] fn render_failing_summary() { let results = vec![fail("a")]; diff --git a/crates/tamanu/Cargo.toml b/crates/tamanu/Cargo.toml index ca9aa0b8..6c7a89d1 100644 --- a/crates/tamanu/Cargo.toml +++ b/crates/tamanu/Cargo.toml @@ -4,7 +4,7 @@ version = "0.10.2" edition = "2024" authors = ["Félix Saparelli ", "BES Developers "] license = "GPL-3.0-or-later" -description = "(Internal) BES tooling: Tamanu library (config, discovery, healthchecks)" +description = "(Internal) BES tooling: Tamanu library (config, discovery)" repository = "https://github.com/beyondessential/bestool/tree/main/crates/tamanu" exclude.workspace = true @@ -12,13 +12,6 @@ exclude.workspace = true workspace = true [features] -default = ["doctor"] -doctor = [ - "dep:bestool-kopia", - "dep:hickory-resolver", - "dep:owo-colors", - "dep:reqwest", -] meta-ticket = ["dep:p256"] [dependencies] @@ -48,12 +41,6 @@ uuid = { version = "1.23.1", features = ["v4"] } # meta-ticket-only p256 = { version = "0.13.2", features = ["pkcs8", "pem"], optional = true } -# doctor-only -bestool-kopia = { version = "0.1.0", path = "../kopia", optional = true } -hickory-resolver = { version = "0.26.1", optional = true } -owo-colors = { version = "4.2.3", optional = true } -reqwest = { workspace = true, optional = true } - [dev-dependencies] tokio = { workspace = true, features = ["full"] } diff --git a/crates/tamanu/src/doctor.rs b/crates/tamanu/src/doctor.rs deleted file mode 100644 index fe8f247c..00000000 --- a/crates/tamanu/src/doctor.rs +++ /dev/null @@ -1,4 +0,0 @@ -pub mod check; -pub mod checks; -pub mod progress; -pub mod server_info;
diff --git a/crates/tamanu/src/lib.rs b/crates/tamanu/src/lib.rs index c093aab1..cd871248 100644 --- a/crates/tamanu/src/lib.rs +++ b/crates/tamanu/src/lib.rs @@ -18,9 +18,6 @@ pub mod versions; pub mod systemd; -#[cfg(feature = "doctor")] -pub mod doctor; - /// What kind of server to interact with. #[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)] pub enum ApiServerKind { From 1422654d9ac1d15af06a832f910ad71e027dde75 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Sat, 30 May 2026 18:06:49 +1200 Subject: [PATCH 09/12] refactor(alertd): retire the YAML alert engine and standalone CLI Removes the YAML alert subsystem from bestool-alertd, leaving a thin daemon that runs background tasks (the doctor sweep) on a schedule, posts to canopy, and serves task/status/health/metrics over HTTP. Canopy now owns alerting; the daemon no longer loads YAML alerts or sends email/Slack. Removed: the standalone binary (main.rs, the [[bin]], the cli feature), alert.rs, loader.rs, glob_resolver.rs, events.rs, targets, templates.rs, state_file.rs, scheduler.rs, commands, the alert HTTP endpoints, and the now-unused deps. InternalContext moves to a slim context.rs; DaemonConfig drops alert_globs, email, server_kind, and dry_run. The Windows service is kept and now runs the daemon via run_with_shutdown. bestool tamanu alertd loses the status/reload/loaded-alerts/pause/validate passthroughs and the alert-dir/email/server-kind plumbing; it always registers the doctor sweep. The obsolete release-alertd workflow is removed. 
Co-authored-by: Claude --- .github/workflows/release-alertd.yml | 234 ---- Cargo.lock | 171 --- crates/alertd/ALERTS.md | 276 ----- crates/alertd/Cargo.toml | 35 +- crates/alertd/README.md | 64 +- crates/alertd/TARGETS.md | 264 ----- crates/alertd/USAGE.md | 218 ---- crates/alertd/src/alert.rs | 888 --------------- crates/alertd/src/commands.rs | 64 -- crates/alertd/src/commands/loaded_alerts.rs | 82 -- crates/alertd/src/commands/pause.rs | 137 --- crates/alertd/src/commands/reload.rs | 94 -- crates/alertd/src/commands/status.rs | 106 -- crates/alertd/src/commands/validate.rs | 146 --- crates/alertd/src/context.rs | 12 + crates/alertd/src/daemon.rs | 398 +------ crates/alertd/src/events.rs | 471 -------- crates/alertd/src/glob_resolver.rs | 85 -- crates/alertd/src/http_server.rs | 29 +- crates/alertd/src/http_server/endpoints.rs | 10 - .../alertd/src/http_server/endpoints/alert.rs | 132 --- .../src/http_server/endpoints/alerts.rs | 95 -- .../alertd/src/http_server/endpoints/index.rs | 30 +- .../src/http_server/endpoints/pause_alert.rs | 41 - .../src/http_server/endpoints/reload.rs | 70 -- .../src/http_server/endpoints/targets.rs | 27 - .../src/http_server/endpoints/validate.rs | 262 ----- crates/alertd/src/http_server/state.rs | 11 +- crates/alertd/src/http_server/test_utils.rs | 17 +- crates/alertd/src/http_server/types.rs | 49 - crates/alertd/src/lib.rs | 86 +- crates/alertd/src/loader.rs | 384 ------- crates/alertd/src/main.rs | 392 ------- crates/alertd/src/metrics.rs | 82 +- crates/alertd/src/scheduler.rs | 1009 ----------------- crates/alertd/src/state_file.rs | 324 ------ crates/alertd/src/targets.rs | 386 ------- crates/alertd/src/targets/canopy.rs | 217 ---- crates/alertd/src/targets/default.rs | 140 --- crates/alertd/src/targets/email.rs | 62 - crates/alertd/src/targets/slack.rs | 112 -- crates/alertd/src/tasks.rs | 2 +- crates/alertd/src/templates.rs | 112 -- crates/alertd/src/windows_service.rs | 15 +- crates/alertd/tests/alert_features.rs | 571 
---------- crates/alertd/tests/database_health.rs | 193 ---- crates/alertd/tests/reload.rs | 77 -- crates/alertd/tests/state_persistence.rs | 162 --- crates/alertd/update-usage.sh | 5 - crates/bestool/Cargo.toml | 21 +- crates/bestool/src/actions/tamanu/alertd.rs | 271 +---- update-usage.sh | 1 - 52 files changed, 97 insertions(+), 9045 deletions(-) delete mode 100644 .github/workflows/release-alertd.yml delete mode 100644 crates/alertd/ALERTS.md delete mode 100644 crates/alertd/TARGETS.md delete mode 100644 crates/alertd/USAGE.md delete mode 100644 crates/alertd/src/alert.rs delete mode 100644 crates/alertd/src/commands.rs delete mode 100644 crates/alertd/src/commands/loaded_alerts.rs delete mode 100644 crates/alertd/src/commands/pause.rs delete mode 100644 crates/alertd/src/commands/reload.rs delete mode 100644 crates/alertd/src/commands/status.rs delete mode 100644 crates/alertd/src/commands/validate.rs create mode 100644 crates/alertd/src/context.rs delete mode 100644 crates/alertd/src/events.rs delete mode 100644 crates/alertd/src/glob_resolver.rs delete mode 100644 crates/alertd/src/http_server/endpoints/alert.rs delete mode 100644 crates/alertd/src/http_server/endpoints/alerts.rs delete mode 100644 crates/alertd/src/http_server/endpoints/pause_alert.rs delete mode 100644 crates/alertd/src/http_server/endpoints/reload.rs delete mode 100644 crates/alertd/src/http_server/endpoints/targets.rs delete mode 100644 crates/alertd/src/http_server/endpoints/validate.rs delete mode 100644 crates/alertd/src/loader.rs delete mode 100644 crates/alertd/src/main.rs delete mode 100644 crates/alertd/src/scheduler.rs delete mode 100644 crates/alertd/src/state_file.rs delete mode 100644 crates/alertd/src/targets.rs delete mode 100644 crates/alertd/src/targets/canopy.rs delete mode 100644 crates/alertd/src/targets/default.rs delete mode 100644 crates/alertd/src/targets/email.rs delete mode 100644 crates/alertd/src/targets/slack.rs delete mode 100644 crates/alertd/src/templates.rs 
delete mode 100644 crates/alertd/tests/alert_features.rs delete mode 100644 crates/alertd/tests/database_health.rs delete mode 100644 crates/alertd/tests/reload.rs delete mode 100644 crates/alertd/tests/state_persistence.rs delete mode 100755 crates/alertd/update-usage.sh diff --git a/.github/workflows/release-alertd.yml b/.github/workflows/release-alertd.yml deleted file mode 100644 index d3382587..00000000 --- a/.github/workflows/release-alertd.yml +++ /dev/null @@ -1,234 +0,0 @@ -name: Release bestool-alertd - -on: - push: - tags: - - "bestool-alertd-v*" - -env: - CARGO_TERM_COLOR: always - CARGO_UNSTABLE_SPARSE_REGISTRY: "true" - -concurrency: - group: ${{ github.workflow }}-${{ github.ref }} - cancel-in-progress: false - -jobs: - build: - permissions: - contents: read - id-token: write - attestations: write - - strategy: - fail-fast: false - matrix: - include: - - target: x86_64-apple-darwin - os: macos-15 - - target: aarch64-apple-darwin - os: macos-15 - - target: x86_64-unknown-linux-gnu - os: ubuntu-24.04 - - target: x86_64-unknown-linux-musl - os: ubuntu-24.04 - - target: aarch64-unknown-linux-gnu - os: ubuntu-24.04-arm - - target: aarch64-unknown-linux-musl - os: ubuntu-24.04-arm - - target: x86_64-pc-windows-msvc - os: windows-2022 - - name: Build / ${{ matrix.target }} - runs-on: ${{ matrix.os }} - - steps: - - uses: actions/checkout@v6 - - - name: Configure toolchain - run: | - rustup toolchain install --profile minimal --no-self-update stable - rustup target add ${{ matrix.target }} - rustup default stable - - - if: runner.os == 'Windows' - name: Use GNU tar - shell: cmd - run: | - echo "Adding GNU tar to PATH" - echo C:\Program Files\Git\usr\bin>>"%GITHUB_PATH%" - - - if: runner.os == 'Windows' - name: Statically link crt - shell: bash - run: | - echo 'RUSTFLAGS=-Ctarget-feature=+crt-static' >> "$GITHUB_ENV" - - - if: runner.os == 'Linux' - run: pip install ziglang - - if: contains(matrix.target, 'musl') - run: sudo apt-get update && sudo apt-get 
install -y musl musl-dev musl-tools - - - uses: Swatinem/rust-cache@v2 - with: - key: alertd-${{ matrix.target }} - - uses: taiki-e/install-action@v2.79.1 - with: - tool: cargo-zigbuild,cargo-auditable - - # zigbuild isn't compatible with auditable - # https://github.com/rust-secure-code/cargo-auditable/issues/179 - - if: matrix.target == 'aarch64-unknown-linux-musl' - run: cargo zigbuild --profile dist --target ${{ matrix.target }} -p bestool-alertd - - if: matrix.target != 'aarch64-unknown-linux-musl' - run: cargo auditable build --profile dist --target ${{ matrix.target }} -p bestool-alertd - - - name: Extract version - id: version - shell: bash - run: | - version="${GITHUB_REF_NAME#bestool-alertd-v}" - echo "version=$version" >> "$GITHUB_OUTPUT" - - - name: Package archive - id: package - shell: bash - run: | - version="${{ steps.version.outputs.version }}" - target="${{ matrix.target }}" - name="bestool-alertd-${version}-${target}" - src_dir="target/${target}/dist" - ext="" - if [[ "${{ runner.os }}" == "Windows" ]]; then - ext=".exe" - fi - - mkdir -p staging - install -m644 crates/alertd/USAGE.md "staging/USAGE.md" - install -m644 crates/alertd/ALERTS.md "staging/ALERTS.md" - install -m644 crates/alertd/TARGETS.md "staging/TARGETS.md" - install -m644 COPYING "staging/COPYING" - install -m755 "$src_dir/bestool-alertd${ext}" "staging/bestool-alertd${ext}" - - tar -C staging --zstd -cf "${name}.tar.zst" . 
- echo "asset=${name}.tar.zst" >> "$GITHUB_OUTPUT" - - - uses: actions/attest-build-provenance@v4 - with: - subject-path: ${{ steps.package.outputs.asset }} - - - name: Package DEB - if: contains(matrix.target, 'linux-gnu') - shell: bash - run: | - version="${{ steps.version.outputs.version }}" - target="${{ matrix.target }}" - arch="" - if [[ "$target" == "x86_64-unknown-linux-gnu" ]]; then - arch="amd64" - elif [[ "$target" == "aarch64-unknown-linux-gnu" ]]; then - arch="arm64" - fi - - deb_dir="target/deb-alertd" - mkdir -p "$deb_dir/DEBIAN" - install -Dm755 "target/$target/dist/bestool-alertd" "$deb_dir/usr/bin/bestool-alertd" - - install -Dm644 COPYING "$deb_dir/usr/share/doc/bestool-alertd/copyright" - install -Dm644 crates/alertd/USAGE.md "$deb_dir/usr/share/doc/bestool-alertd/USAGE.md" - install -Dm644 crates/alertd/ALERTS.md "$deb_dir/usr/share/doc/bestool-alertd/ALERTS.md" - install -Dm644 crates/alertd/TARGETS.md "$deb_dir/usr/share/doc/bestool-alertd/TARGETS.md" - - cat > "$deb_dir/DEBIAN/control" << EOF - Package: bestool-alertd - Version: $version - Section: utils - Priority: optional - Architecture: $arch - Maintainer: BES Developers - Description: BES tooling: Alert daemon (standalone) - Homepage: https://github.com/beyondessential/bestool - License: GPL-3.0-or-later - EOF - - dpkg-deb --build --root-owner-group "$deb_dir" "bestool-alertd-${target}-${version}.deb" - - - uses: actions/attest-build-provenance@v4 - if: contains(matrix.target, 'linux-gnu') - with: - subject-path: "bestool-alertd-${{ matrix.target }}-${{ steps.version.outputs.version }}.deb" - - - uses: actions/upload-artifact@v7 - with: - name: alertd-${{ matrix.target }} - path: | - ${{ steps.package.outputs.asset }} - *.deb - if-no-files-found: error - retention-days: 7 - - - name: Configure AWS Credentials - if: contains(matrix.target, 'linux-gnu') - uses: aws-actions/configure-aws-credentials@v6.1.1 - with: - aws-region: ap-southeast-2 - role-to-assume: 
arn:aws:iam::143295493206:role/gha-tamanu-tools-upload - role-session-name: GHA@BEStool-alertd=Build - - - name: Upload DEB to S3 - if: contains(matrix.target, 'linux-gnu') - shell: bash - run: | - version="${{ steps.version.outputs.version }}" - target="${{ matrix.target }}" - deb="bestool-alertd-${target}-${version}.deb" - aws s3 cp "$deb" "s3://bes-ops-tools/bestool-alertd/${version}/" --no-progress - aws s3 cp "$deb" "s3://bes-ops-tools/bestool-alertd/latest/bestool-alertd-${target}.deb" --no-progress - - - name: Upload version file - if: matrix.target == 'x86_64-unknown-linux-gnu' - shell: bash - run: | - version="${{ steps.version.outputs.version }}" - echo -n "$version" > latest-version.txt - aws s3 cp latest-version.txt s3://bes-ops-tools/bestool-alertd/latest-version.txt --no-progress - - - name: Invalidate CloudFront - if: matrix.target == 'x86_64-unknown-linux-gnu' - shell: bash - run: | - version="${{ steps.version.outputs.version }}" - aws cloudfront create-invalidation \ - --distribution-id=EDAG0UBS1MN74 \ - --paths "/bestool-alertd/${version}/*" "/bestool-alertd/latest/*" "/bestool-alertd/latest-version.txt" - - release: - name: Publish GitHub Release - needs: build - runs-on: ubuntu-latest - permissions: - contents: write - steps: - - uses: actions/download-artifact@v8 - with: - pattern: alertd-* - merge-multiple: true - path: assets - - - name: Create or update release - env: - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - GH_REPO: ${{ github.repository }} - TAG: ${{ github.ref_name }} - shell: bash - run: | - version="${TAG#bestool-alertd-v}" - if gh release view "$TAG" >/dev/null 2>&1; then - gh release upload "$TAG" assets/* --clobber - else - gh release create "$TAG" \ - --title "bestool-alertd $version" \ - --verify-tag \ - --generate-notes \ - assets/* - fi diff --git a/Cargo.lock b/Cargo.lock index ac88a1bc..11ed10ae 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -568,7 +568,6 @@ dependencies = [ "leon", "leon-macros", "lloggs", - "mailgun-rs", 
"merkle_hash", "miette", "mimalloc", @@ -577,7 +576,6 @@ dependencies = [ "p256", "percent-encoding", "privilege", - "pulldown-cmark", "quick-xml 0.40.1", "rand_core 0.6.4", "regex", @@ -587,7 +585,6 @@ dependencies = [ "semver", "serde", "serde_json", - "serde_path_to_error", "serde_yaml", "ssh-key", "sysinfo", @@ -599,7 +596,6 @@ dependencies = [ "thiserror 2.0.18", "tokio", "tokio-postgres", - "tokio-stream", "tokio-util", "tracing", "trycmd", @@ -623,43 +619,25 @@ dependencies = [ "bestool-kopia", "bestool-postgres", "bestool-tamanu", - "blake3", "bytes", - "clap", - "clap-markdown", "dirs", "duct", "futures", - "glob", "hickory-resolver", "jiff", - "lloggs", - "mailgun-rs", "miette", "node-semver", - "notify", "prometheus", - "pulldown-cmark", - "rand 0.10.1", "reqwest", "serde", "serde_json", - "serde_path_to_error", - "serde_yaml", "sysinfo", - "temp-env", - "tempfile", - "tera", - "thiserror 2.0.18", "tokio", "tokio-postgres", "tokio-stream", - "tokio-util", - "tower", "tower-http", "tracing", "url", - "walkdir", "windows-service", ] @@ -2548,15 +2526,6 @@ version = "1.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c" -[[package]] -name = "fsevent-sys" -version = "4.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "76ee7a02da4d231650c7cea31349b889be2f45ddb3ef3032d2ec8185f6313fd2" -dependencies = [ - "libc", -] - [[package]] name = "funty" version = "2.0.0" @@ -2711,15 +2680,6 @@ dependencies = [ "zeroize", ] -[[package]] -name = "getopts" -version = "0.2.24" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cfe4fbac503b8d1f88e6676011885f34b7174f46e59956bba534ba83abded4df" -dependencies = [ - "unicode-width 0.2.2", -] - [[package]] name = "getrandom" version = "0.2.17" @@ -3437,26 +3397,6 @@ dependencies = [ "web-time", ] -[[package]] -name = "inotify" -version = "0.11.1" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "bd5b3eaf1a28b758ac0faa5a4254e8ab2705605496f1b1f3fbbc3988ad73d199" -dependencies = [ - "bitflags 2.11.1", - "inotify-sys", - "libc", -] - -[[package]] -name = "inotify-sys" -version = "0.1.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e05c02b5e89bff3b946cedeca278abc628fe811e604f027c45a8aa3cf793d0eb" -dependencies = [ - "libc", -] - [[package]] name = "inout" version = "0.1.4" @@ -3710,26 +3650,6 @@ dependencies = [ "rayon", ] -[[package]] -name = "kqueue" -version = "1.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "eac30106d7dce88daf4a3fcb4879ea939476d5074a9b7ddd0fb97fa4bed5596a" -dependencies = [ - "kqueue-sys", - "libc", -] - -[[package]] -name = "kqueue-sys" -version = "1.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "07293a4e297ac234359b510362495713f75ea345d5307140414f20c69ffeb087" -dependencies = [ - "bitflags 2.11.1", - "libc", -] - [[package]] name = "lazy_static" version = "1.5.0" @@ -3965,19 +3885,6 @@ version = "0.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d3d25b0e0b648a86960ac23b7ad4abb9717601dec6f66c165f5b037f3f03065f" -[[package]] -name = "mailgun-rs" -version = "2.0.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dff169a36659dc6442193ab9935632345595d6fe6184cd5079bff4f695fecbdd" -dependencies = [ - "reqwest", - "serde", - "serde_json", - "thiserror 2.0.18", - "typed-builder", -] - [[package]] name = "matchers" version = "0.2.0" @@ -4099,16 +4006,6 @@ version = "0.3.17" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a" -[[package]] -name = "mime_guess" -version = "2.0.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f7c44f8e672c00fe5308fa235f821cb4198414e1c77935c1ab6948d3fd78550e" 
-dependencies = [ - "mime", - "unicase", -] - [[package]] name = "minimal-lexical" version = "0.2.1" @@ -4326,33 +4223,6 @@ version = "0.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "61807f77802ff30975e01f4f071c8ba10c022052f98b3294119f3e615d13e5be" -[[package]] -name = "notify" -version = "8.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4d3d07927151ff8575b7087f245456e549fea62edf0ec4e565a5ee50c8402bc3" -dependencies = [ - "bitflags 2.11.1", - "fsevent-sys", - "inotify", - "kqueue", - "libc", - "log", - "mio", - "notify-types", - "walkdir", - "windows-sys 0.60.2", -] - -[[package]] -name = "notify-types" -version = "2.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "42b8cfee0e339a0337359f3c88165702ac6e600dc01c0cc9579a92d62b08477a" -dependencies = [ - "bitflags 2.11.1", -] - [[package]] name = "ntapi" version = "0.4.3" @@ -5383,25 +5253,6 @@ dependencies = [ "thiserror 1.0.69", ] -[[package]] -name = "pulldown-cmark" -version = "0.13.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e9f068eba8e7071c5f9511831b44f32c740d5adf574e990f946ddb53db2f314e" -dependencies = [ - "bitflags 2.11.1", - "getopts", - "memchr", - "pulldown-cmark-escape", - "unicase", -] - -[[package]] -name = "pulldown-cmark-escape" -version = "0.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "007d8adb5ddab6f8e3f491ac63566a7d5002cc7ed73901f72057943fa71ae1ae" - [[package]] name = "quick-xml" version = "0.37.5" @@ -5774,7 +5625,6 @@ dependencies = [ "base64 0.22.1", "bytes", "encoding_rs", - "futures-channel", "futures-core", "futures-util", "h2", @@ -5787,7 +5637,6 @@ dependencies = [ "js-sys", "log", "mime", - "mime_guess", "percent-encoding", "pin-project-lite", "quinn", @@ -7746,26 +7595,6 @@ dependencies = [ "rustc-hash 2.1.2", ] -[[package]] -name = "typed-builder" -version = "0.23.2" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda" -dependencies = [ - "typed-builder-macro", -] - -[[package]] -name = "typed-builder-macro" -version = "0.23.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - [[package]] name = "typed-path" version = "0.12.3" diff --git a/crates/alertd/ALERTS.md b/crates/alertd/ALERTS.md deleted file mode 100644 index 7854ea7b..00000000 --- a/crates/alertd/ALERTS.md +++ /dev/null @@ -1,276 +0,0 @@ -# Alert Definitions - -Alerts are the core component of Alertd, representing conditions that need to be monitored and -notified about. Alerts are a YAML file, with a single document in the file. The absolute path to -the file is considered its identifier: identical contents at a different path will dupe the alert. - -## Structure - -```yaml -# Optional fields (with defaults) -enabled: true | false # Default: true -interval: # Default: "1 minute" -always-send: # Default: false -when-changed: # Default: false - -# Required: At least one source -sql: # SQL query source -# OR -shell: # Shell command -run: # Command to run -# OR -event: # Event-based trigger - -# Optional: SQL-specific -numerical: # List of numerical thresholds - - field: - alert-at: - clear-at: # Optional - -# Required: Targets -send: # List of send targets - - id: - subject: # Optional - template: -``` - -## Duration Format - -Formats accepted by `interval` and `always-send.after`: -- ` second(s)` -- ` minute(s)` -- ` hour(s)` -- ` day(s)` -- `s`, `m`, `h`, `d` - -Examples: `30 seconds`, `5 minutes`, `2 hours`, `1 day`, `30s`, `5m` - - -## Source Types - -### SQL Source - -```yaml -sql: -numerical: # Optional - - field: - alert-at: - clear-at: # Optional -``` - -Context variables: -- `rows`: Array of query result rows (each row is a 
dict) -- `triggered`: Boolean indicating if previously triggered -- Standard variables (see below) - -Numerical thresholds: -- Alert triggers when `field >= alert-at` -- Alert clears when `field <= clear-at` (if specified) -- If `clear-at` omitted, alert never auto-clears - -### Shell Source - -```yaml -shell: # e.g., "bash", "sh", "python" -run: # Command to execute -``` - -Context variables: -- `output`: Command stdout as string -- `exit_code`: Process exit code -- `triggered`: Boolean indicating if previously triggered -- Standard variables (see below) - -Alert triggers if exit code is not 0. - -### Event Source - -```yaml -event: -``` - -Event types: -- `definition-error`: Alert definition file has errors -- `source-error`: Alert source execution failed -- `database-down`: The PostgreSQL database is unreachable - -Event-specific context variables vary by type: -- `definition-error`: `alert_file`, `error_message` -- `source-error`: `alert_file`, `error_message` -- `database-down`: `database_url` (password redacted), `error_message` - -All `event` sources are internally defaulted to send to the `default` target with a very basic -template if none is defined through the alert files. The default target is the one named -`default` in `_targets.yml`, or — if that doesn't exist — the first one alphabetically. If no -`default` is configured but the canopy auth path is available, a canopy target with -`source: bestool-alertd` is registered under `id: default` automatically; alerts that reference -`id: default` will use it too. See [TARGETS.md](TARGETS.md#canopy) for details. - -## Send Targets - -The `id` field references a target defined in `_targets.yml`. The target type (email, Slack, etc.) -is determined by the target definition, not the alert. If multiple targets share the same `id`, -the alert is sent to all of them. See [TARGETS.md](TARGETS.md) for target configuration. 
- -### Simple Format - -```yaml -send: - - id: # References external target in _targets.yml - subject: # Optional, Tera template - template: # Required, Tera template (Markdown) -``` - -## Template Context - -Standard variables available in all templates: -- `alert_file`: Path to alert definition file -- `filename`: Basename of alert file -- `hostname`: System hostname -- `now`: Current timestamp - -Source-specific variables are added based on source type (see above). - -## Template Syntax - -Templates use [Tera](https://keats.github.io/tera/docs/) syntax: -- Variables: `{{ variable }}` -- Conditionals: `{% if condition %}...{% endif %}` -- Loops: `{% for item in items %}...{% endfor %}` -- Filters: `{{ value | filter }}` - -Templates are rendered as Markdown and converted to HTML for email. Markdown will also pass through -HTML if you want to use HTML in your templates directly. - -The Tera syntax is Jinja2-like, but with some differences! Check the docs, and don't forget to -`validate` the alert definition. - -## Always-Send Config - -By default, the "triggered" state of alerts is tracked internally. When an alert is triggered, it -is sent immediately to its targets. Subsequent checks that would trigger the alert are ignored, -until the alert goes back to "okay" state, at which point its triggered state is reset and the next -trigger will be sent to targets. - -When `always-send: true`, the alert will be sent to its targets every time it is triggered, regardless of its previous state. - -When `always-send: false`, the alert will only be sent to its targets if it is triggered for the first time or if it has been cleared. - -There's an advanced configuration with `always-send.after` that allows you to resend alerts after a certain duration. 
- -```yaml -# Simple boolean -always-send: true | false - -# Timed resending -always-send: - after: # Resend after this duration -``` - -## When-Changed Config - -By default, for SQL sources the alert is considered triggered if it returns any rows, and Shell -sources are considered triggered if they return a non-zero exit code. With `when-changed: true`, -the output (in rows or string) of the alert is kept track of, and the alert is considered -triggered if that output changes. - -For SQL sources, the `except` and `only` fields (only one of them, you can't use both at once) can -be used to filter the fields to compare. For example, this can be used to allow datetimes to change -in the query without triggering the alert, but still being able to use them in the email template. - -```yaml -# Simple boolean -when-changed: true | false - -# Detailed configuration -when-changed: - except: [, ...] # Exclude these fields from comparison - only: [, ...] # Only compare these fields -``` - -Notes: -- `except` and `only` are mutually exclusive -- Fields refer to column names in SQL results or keys in context - -## Examples - -### SQL Alert - -```yaml -interval: 5 minutes - -sql: "SELECT count(*) as count FROM users WHERE created_at > NOW() - INTERVAL '1 hour'" - -numerical: - - field: count - alert-at: 100 - clear-at: 50 - -send: - - id: ops-team - subject: "High user registration rate" - template: | - Alert: {{ rows[0].count }} users registered in the last hour. 
-``` - -### Shell Alert - -```yaml -interval: 1 minute - -shell: bash -run: "df -h / | tail -1 | awk '{print $5}' | sed 's/%//'" - -send: - - id: ops-team - subject: "Disk space alert on {{ hostname }}" - template: | - Disk usage: {{ output }}% -``` - -### Event Alert - -```yaml -event: http - -send: - - id: ops-team - subject: "{{ subject | default(value='HTTP Alert') }}" - template: | - {{ message }} - - {% if custom %} - Additional data: {{ custom | json_encode(pretty=true) }} - {% endif %} -``` - -### Timed Resending - -```yaml -sql: "SELECT 1 WHERE (SELECT pg_is_in_recovery()) = true" - -always-send: - after: 8 hours - -send: - - id: dba-team - subject: "Database in recovery mode" - template: "The database is still in recovery mode." -``` - -### Change Detection - -```yaml -sql: "SELECT version, deployed_at FROM app_version ORDER BY deployed_at DESC LIMIT 1" - -when-changed: - only: [version] - -send: - - id: dev-team - subject: "New deployment detected" - template: | - {% for row in rows %} - - Version {{ row.version }} deployed at {{ row.deployed_at }} - {% endfor %} -``` diff --git a/crates/alertd/Cargo.toml b/crates/alertd/Cargo.toml index 40561ce1..4c247c69 100644 --- a/crates/alertd/Cargo.toml +++ b/crates/alertd/Cargo.toml @@ -4,71 +4,38 @@ version = "6.1.1" edition = "2024" authors = ["Félix Saparelli ", "BES Developers "] license = "GPL-3.0-or-later" -description = "(Internal) BES tooling: Alert daemon" +description = "(Internal) BES tooling: healthcheck daemon" repository = "https://github.com/beyondessential/bestool/tree/main/crates/alertd" exclude.workspace = true [lints] workspace = true -[[bin]] -name = "bestool-alertd" -path = "src/main.rs" -required-features = ["cli"] - [dependencies] axum = "0.8.9" bestool-canopy = { version = "0.2.0", path = "../canopy" } bestool-kopia = { version = "0.1.0", path = "../kopia" } bestool-postgres = { version = "1.0.11", path = "../postgres" } bestool-tamanu = { version = "0.10.2", path = "../tamanu" } -blake3 = 
"1.8.5" bytes = "1.9.0" -clap = { workspace = true, optional = true, features = ["env", "wrap_help"] } -clap-markdown = { version = "0.1.5", optional = true } dirs = "6.0.0" duct = "1.1.0" futures = { workspace = true } -glob = "0.3.3" hickory-resolver = "0.26.1" jiff = { version = "0.2.24", features = ["serde"] } node-semver = "2.2.0" -lloggs = { workspace = true, optional = true } -mailgun-rs = "2.0.2" miette = { workspace = true } -notify = "8.2.0" prometheus = "0.14.0" -pulldown-cmark = "0.13.3" -rand.workspace = true reqwest = { workspace = true } serde = { version = "1.0.228", features = ["derive"] } serde_json = "1.0.145" -serde_path_to_error = "0.1.17" -serde_yaml = "0.9.34" sysinfo.workspace = true -tempfile = "3.21.0" -tera = "1.20.0" -thiserror = { workspace = true } tokio = { workspace = true, features = ["full"] } tokio-postgres = { version = "0.7.17", features = ["with-jiff-0_2", "with-serde_json-1"] } tokio-stream = "0.1.17" -tokio-util = { workspace = true } -tower = "0.5.2" tower-http = { version = "0.6.6", features = ["trace"] } tracing = { workspace = true } url = { version = "2.5.8", features = ["serde"] } -walkdir = "2.5.0" [target.'cfg(windows)'.dependencies] windows-service = "0.8.0" - -[dev-dependencies] -temp-env = "0.3" - -[features] -default = ["cli"] -cli = ["dep:clap", "dep:clap-markdown", "dep:lloggs", "miette/fancy"] - -[package.metadata.binstall] -pkg-url = "https://github.com/beyondessential/bestool/releases/download/{ name }-v{ version }/{ name }-{ version }-{ target }{ archive-suffix }" -pkg-fmt = "tzstd" diff --git a/crates/alertd/README.md b/crates/alertd/README.md index 895ed50f..887b7f6d 100644 --- a/crates/alertd/README.md +++ b/crates/alertd/README.md @@ -1,62 +1,36 @@ # bestool-alertd -An alert daemon that watches a set of YAML alert definitions, runs them on a -schedule, and dispatches the results to one or more targets (email, HTTP -endpoints, etc.). 
+A healthcheck daemon: it runs background tasks (the Tamanu doctor sweep) on a +schedule, posts the results to canopy, and serves task/status/health/metrics +over a small HTTP API. -This crate is part of [BES tooling][repo], and is in particular what powers -the `tamanu alerts` workflow. It is published as both a library and a -standalone binary; the `bestool` umbrella tool also embeds it. +This crate is part of [BES tooling][repo]. It is a library embedded by the +`bestool` umbrella tool, which drives it via `bestool tamanu alertd`. [repo]: https://github.com/beyondessential/bestool -## Install +## Use -```console -$ cargo install bestool-alertd -``` +The daemon is configured and run through `bestool tamanu alertd run`, which +reads the database and device-key configuration from Tamanu's config files, +registers the doctor sweep, and starts the daemon. -Pre-built binaries are attached to each `bestool-alertd-v*` GitHub release. +The HTTP control API exposes: -## Use +- `GET /` — list of endpoints. +- `GET /status` — daemon name, version, uptime, pid. +- `GET /health` — watchdog health (200 if healthy, 530 if stalled). +- `GET /metrics` — Prometheus metrics. +- `GET /tasks/{task}/{endpoint}` — endpoints exposed by registered tasks (e.g. + the doctor's `latest` and `recompute`). -```console -$ bestool-alertd run \ - --database-url postgresql://localhost/mydb \ - --glob '/etc/myapp/alerts/**/*.yml' -``` - -Common flags: - -- `--glob PATTERN` (repeatable): where to find alert definition files. Patterns - may match a directory (read recursively) or individual files. Globs are - watched for changes and re-evaluated periodically. -- `--database-url URL` / `DATABASE_URL`: PostgreSQL connection for SQL alerts. -- `--email-from`, `--mailgun-api-key`, `--mailgun-domain`: enable email - targets via Mailgun. -- `--device-key-file PATH` / `DEVICE_KEY_FILE`: PEM identity used when posting - to canopy `/events` targets. 
-- `--dry-run`: execute every alert once and exit; useful in CI. -- `--server-addr`: where the local HTTP control API listens - (default `[::1]:8271` and `127.0.0.1:8271`). - -`bestool-alertd` exposes additional subcommands that talk to a running daemon -via its HTTP API: `status`, `reload`, `loaded-alerts`, `pause-alert`, -`validate`. SIGHUP also triggers a reload on Unix. - -On Windows, `bestool-alertd install` registers a native service named +On Windows, `bestool tamanu alertd install` registers a native service named `bestool-alertd`; `uninstall` and `configure-recovery` are also provided. -## Defining alerts and targets - -- Alert files: see [ALERTS.md](./ALERTS.md). -- Target files: see [TARGETS.md](./TARGETS.md). - ## Library -The crate also exposes a library API (`bestool_alertd::run`, -`DaemonConfig`, …) so other tools can embed the daemon without going through -the binary. +The crate exposes a library API (`bestool_alertd::run`, `DaemonConfig`, +`BackgroundTask`, the `doctor` module, …) so other tools can embed the daemon. ## License diff --git a/crates/alertd/TARGETS.md b/crates/alertd/TARGETS.md deleted file mode 100644 index 4ac6fd2f..00000000 --- a/crates/alertd/TARGETS.md +++ /dev/null @@ -1,264 +0,0 @@ -# Send Targets - -There must be at least one `_targets.yml` or `_targets.yaml` file in the directories that alertd -scans for alert definitions, and one of these files must contain at least one target. It's -recommended to have a target with `id: default`. If an explicit default isn't defined, the first -target in alphabetical ID order will be used as default. - -The one exception: if no `default`-id target is configured *and* the canopy auth path is -available (either tailscale reachability or an mTLS device key), alertd registers a synthesised -canopy target under `id: default` automatically — see [Canopy](#canopy) below. - -## Target Types - -### Email - -Email targets send alerts via Mailgun. They require the `addresses` field. 
- -```yaml -targets: - - id: - addresses: - - - - -``` - -### Slack - -Slack targets post alerts to a Slack incoming webhook. They require the `webhook` field and -optionally accept a `fields` list to customize the JSON payload sent to the webhook. - -```yaml -targets: - - id: - webhook: - fields: # Optional, defaults shown below - - name: hostname - field: hostname - - name: filename - field: filename - - name: subject - field: subject - - name: message - field: body -``` - -Each field entry is either: -- **Template field**: `{ name: , field: }` — rendered from template context. - Valid template fields: `hostname`, `filename`, `subject`, `body`, `interval`. -- **Fixed value**: `{ name: , value: }` — a static string. - -If `fields` is omitted, the default set (`hostname`, `filename`, `subject`, `message`) is used. - -### Canopy - -Canopy targets push events to a [canopy](https://meta.tamanu.app) server's `/events` API. Canopy -aggregates pushed events into deduplicated *issues* keyed by `source` + `ref`; alertd posts an -`active: true` event when an alert triggers and an `active: false` event when it clears, so issues -auto-resolve. - -```yaml -targets: - - id: - canopy: - url: https://meta.tamanu.app # Optional, default shown - source: # Required, identifies this device's event stream - severity: # Optional, default: error -``` - -`severity` is one of the RFC 5424 levels (lowercase): `emergency`, `alert`, `critical`, `error`, -`warning`, `notice`, `info`, `debug`. Canopy treats events at or above `error` as incident-grade. - -#### Authentication - -Canopy requires one of two auth paths; alertd probes them at startup and on every reload: - -1. **Tailscale** — if the host is on the canopy tailnet, events are pushed there without any - client cert. This path is preferred when available. - -2. **mTLS via Tamanu device key** — falls back to the public endpoint (`url:` above) using a - self-signed client cert minted from a Tamanu device key. 
The key is the value stored in the - Tamanu DB at `local_system_facts(key='deviceKey')`. The certificate has a 6-day validity and - is automatically renewed every 5 days while the daemon is running. - - `bestool tamanu alertd` fetches the device key from the Tamanu DB automatically. - Standalone `bestool-alertd run` takes the key path via `--device-key-file ` (env - `DEVICE_KEY_FILE`). - -#### Synthesised `default` target - -If no target named `default` is configured (typically: no `_targets.yml` exists at all) *and* -one of the canopy auth paths is available, alertd registers a synthesised canopy target under -`id: default` with: - -- `source: bestool-alertd` -- `url`: the default canopy URL (ignored when tailscale is the active path) -- `severity: error` - -Because it slots into the regular `default` id, this works two ways at once: alerts that -explicitly reference `id: default` in their `send:` block resolve to it, and event sources like -`database-down` that fall back to the default target also route to canopy. To get any other -behaviour — different source, non-canopy default, etc. — add a `_targets.yml` with an explicit -`default` entry. - -## Structure - -The target type is determined automatically by the fields present: -- If `addresses` is present, it's an email target. -- If `webhook` is present, it's a Slack target. -- If `canopy:` (a nested object) is present, it's a canopy target. - -Targets of different types can be mixed freely in the same `_targets.yml` file, and multiple -targets of different types can share the same `id` to send alerts to both simultaneously. 
- -## Examples - -### Single Email Target - -```yaml -targets: - - id: ops-team - addresses: - - ops@example.com -``` - -### Single Slack Target - -```yaml -targets: - - id: ops-team - webhook: https://hooks.example.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX -``` - -### Mixed Email and Slack - -Alerts sent to `ops-team` will be delivered to both email and Slack: - -```yaml -targets: - - id: ops-team - addresses: - - ops@example.com - - oncall@example.com - - - id: ops-team - webhook: https://hooks.example.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX -``` - -### Slack with Custom Fields - -```yaml -targets: - - id: monitoring - webhook: https://hooks.example.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX - fields: - - name: text - field: body - - name: server - field: hostname - - name: environment - value: production -``` - -### Multiple Targets - -```yaml -targets: - - id: ops-team - addresses: - - ops@example.com - - oncall@example.com - - alerts@example.com - - - id: dev-team - addresses: - - developers@example.com - - - id: slack-alerts - webhook: https://hooks.example.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX -``` - -### Multiple Files - -Targets can be split across multiple `_targets.yml` files: - -``` -/etc/alertd/ -├── alerts/ -│ ├── disk-space.yml -│ └── database.yml -├── teams/ -│ ├── _targets.yml # ops-team, security-team -│ └── alerts/ -│ └── security.yml -└── _targets.yml # dev-team, dba-team -``` - -All `_targets.yml` files are merged together. Targets with the same ID are grouped: an alert -referencing that ID will be sent to all of them. 
- -## Usage in Alerts - -Reference targets in alert definitions: - -```yaml -# In an alert definition file -sql: "SELECT 1" - -send: - - id: ops-team # References the target defined in _targets.yml - subject: "Alert" - template: "Message" -``` - -Multiple alerts can reference the same target: - -```yaml -# alert1.yml -send: - - id: ops-team - subject: "Alert 1" - template: "..." - -# alert2.yml -send: - - id: ops-team - subject: "Alert 2" - template: "..." -``` - -Alerts can send to multiple targets: - -```yaml -send: - - id: ops-team - subject: "Ops Alert" - template: "..." - - - id: dev-team - subject: "Dev Alert" - template: "..." -``` - -## Email Configuration - -Email sending requires Mailgun configuration provided via: -- Tamanu config files (when using `bestool tamanu alertd`) -- Environment variables or command-line options (when using standalone `alertd`) - -The `_targets.yml` file only defines recipients, not email server configuration. - -## Slack Configuration - -Slack targets are self-contained: the webhook URL in `_targets.yml` is all that's needed. No -additional daemon configuration is required. To obtain a webhook URL, create an [Incoming -Webhook](https://api.slack.com/messaging/webhooks) in your Slack workspace. - -## Canopy Configuration - -Canopy targets pick up auth from the environment (tailscale presence or device key) — see the -[Canopy](#canopy) target type section above. No `_targets.yml` field configures the auth path. - -The event `ref` (canopy's dedup key) is auto-derived as `{hostname}/{alert-stem}:{target-id}`, -so the same alert firing on different hosts or to different canopy targets produces distinct -canopy issues. diff --git a/crates/alertd/USAGE.md b/crates/alertd/USAGE.md deleted file mode 100644 index 815493a4..00000000 --- a/crates/alertd/USAGE.md +++ /dev/null @@ -1,218 +0,0 @@ -# Command-Line Help for `bestool-alertd` - -This document contains the help content for the `bestool-alertd` command-line program. 
- -**Command Overview:** - -* [`bestool-alertd`↴](#bestool-alertd) -* [`bestool-alertd run`↴](#bestool-alertd-run) -* [`bestool-alertd status`↴](#bestool-alertd-status) -* [`bestool-alertd reload`↴](#bestool-alertd-reload) -* [`bestool-alertd loaded-alerts`↴](#bestool-alertd-loaded-alerts) -* [`bestool-alertd pause-alert`↴](#bestool-alertd-pause-alert) -* [`bestool-alertd validate`↴](#bestool-alertd-validate) - -## `bestool-alertd` - -BES tooling: Alert daemon - -The daemon watches for changes to alert definition files and automatically reloads when changes are detected. You can also send SIGHUP to manually trigger a reload. - -On Windows, the daemon can be installed as a native Windows service using the 'install' subcommand. See 'bestool-alertd install --help' for details. - -The alert and target definitions are documented online at: and . - -**Usage:** `bestool-alertd [OPTIONS] ` - -###### **Subcommands:** - -* `run` — Run the alert daemon -* `status` — Show status and health of a running daemon -* `reload` — Send reload signal to running daemon -* `loaded-alerts` — List currently loaded alert files -* `pause-alert` — Temporarily pause an alert -* `validate` — Validate an alert definition file - -###### **Options:** - -* `--color ` — When to use terminal colours. - - You can also set the `NO_COLOR` environment variable to disable colours, or the `CLICOLOR_FORCE` environment variable to force colours. Defaults to `auto`, which checks whether the output is a terminal to decide. - - Default value: `auto` - - Possible values: - - `auto`: - Automatically detect whether to use colours - - `always`: - Always use colours, even if the terminal does not support them - - `never`: - Never use colours - -* `-v`, `--verbose` — Set diagnostic log level. - - This enables diagnostic logging, which is useful for investigating bugs. Use multiple times to increase verbosity. - - You may want to use with `--log-file` to avoid polluting your terminal. 
- - Default value: `0` -* `--log-file ` — Write diagnostic logs to a file. - - This writes diagnostic logs to a file, instead of the terminal, in JSON format. - - If the path provided is a directory, a file will be created in that directory with daily rotation. The initial file name will be in the format `programname.YYYY-MM-DDTHH-MM-SSZ.log`, and a new file will be created each day at midnight UTC. - - If the path is a file, logs will be written to that specific file without rotation. -* `--log-file-keep ` — Limit the number of log files to keep. - - When used with a directory in `--log-file`, this controls how many rotated log files are kept. Older files are automatically deleted when this limit is reached. Defaults to 32 days of logs. Pass 0 to disable rotation and keep all files. - - Default value: `32` -* `--log-timeless` — Omit timestamps in logs. - - This can be useful when running under service managers that capture logs, to avoid having two timestamps. When run under systemd, this is automatically enabled. - - This option is ignored if the log file is set, or when using `RUST_LOG` or equivalent (as logging is initialized before arguments are parsed in that case); you may want to use `LOG_TIMELESS` instead in the latter case. - - - -## `bestool-alertd run` - -Run the alert daemon - -Starts the daemon which monitors alert definition files and executes alerts based on their configured schedules. The daemon will watch for file changes and automatically reload when definitions are modified. - -**Usage:** `bestool-alertd run [OPTIONS]` - -###### **Options:** - -* `--database-url ` — Database connection URL - - PostgreSQL connection URL, e.g., postgresql://user:pass@localhost/dbname -* `--glob ` — Glob patterns for alert definitions - - Patterns can match directories (which will be read recursively) or individual files. Can be provided multiple times. 
Examples: /etc/tamanu/alerts, /opt/*/alerts, /etc/tamanu/alerts/**/*.yml -* `--email-from ` — Email sender address -* `--mailgun-api-key ` — Mailgun API key -* `--mailgun-domain ` — Mailgun domain -* `--tamanu-version ` — Tamanu version of the install this daemon alerts for. Sent on every canopy request via the `X-Version` header - - Default value: `0.0.0` -* `--device-key-file ` — Path to a Tamanu device key PEM, used as client identity for canopy targets. - - Required for any alert that targets a canopy `/events` endpoint. The key is the same value Tamanu stores in `local_system_facts(key='deviceKey')`; only the private key is read (a fresh self-signed cert is generated from it at startup). -* `--dry-run` — Execute all alerts once and quit (ignoring intervals) -* `--no-server` — Disable the HTTP server -* `--server-addr ` — HTTP server bind address(es) - - Can be provided multiple times. The server will attempt to bind to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 -* `--watchdog-timeout ` — Watchdog timeout in seconds - - If no alert task reports activity within this many seconds, the daemon will exit so the service manager can restart it. Defaults to 600 (10 minutes). - - Default value: `600` -* `--no-watchdog` — Disable the watchdog - - By default, the daemon will exit if no alert activity is detected within the watchdog timeout. This flag disables that behavior. - - - -## `bestool-alertd status` - -Show status and health of a running daemon - -Connects to the running daemon's HTTP API and displays version, uptime, health, and watchdog information. Exits with code 1 if the daemon is unhealthy. - -**Usage:** `bestool-alertd status [OPTIONS]` - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. 
Defaults to [::1]:8271 and 127.0.0.1:8271 - - - -## `bestool-alertd reload` - -Send reload signal to running daemon - -Connects to the running daemon's HTTP API and triggers a reload. This is an alternative to SIGHUP that works on all platforms including Windows. - -**Usage:** `bestool-alertd reload [OPTIONS]` - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - - - -## `bestool-alertd loaded-alerts` - -List currently loaded alert files - -Connects to the running daemon's HTTP API and retrieves the list of currently loaded alert definition files. - -**Usage:** `bestool-alertd loaded-alerts [OPTIONS]` - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 -* `--detail` — Show detailed state information for each alert - - - -## `bestool-alertd pause-alert` - -Temporarily pause an alert - -Pauses an alert until the specified time. The alert will not execute during this period. The pause is lost when the daemon restarts. - -**Usage:** `bestool-alertd pause-alert [OPTIONS] ` - -###### **Arguments:** - -* `` — Alert file path to pause - -###### **Options:** - -* `--until ` — Time until which to pause the alert (fuzzy time format) - - Examples: "1 hour", "2 days", "next monday", "2024-12-25T10:00:00Z" Defaults to 1 week from now if not specified. -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - - - -## `bestool-alertd validate` - -Validate an alert definition file - -Parses an alert definition file and reports any syntax or validation errors. 
Uses pretty error reporting to pinpoint the exact location of problems. Requires the daemon to be running. - -**Usage:** `bestool-alertd validate [OPTIONS] ` - -###### **Arguments:** - -* `` — Path to the alert definition file to validate - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - - - -
- - - This document was generated automatically by - clap-markdown. - - diff --git a/crates/alertd/src/alert.rs b/crates/alertd/src/alert.rs deleted file mode 100644 index bdc3986b..00000000 --- a/crates/alertd/src/alert.rs +++ /dev/null @@ -1,888 +0,0 @@ -use std::{ - collections::HashMap, io::Write, ops::ControlFlow, path::PathBuf, process::Stdio, sync::Arc, - time::Duration, -}; - -use jiff::Timestamp; -use miette::{Context as _, IntoDiagnostic, Result, miette}; -use tera::Context as TeraCtx; -use tokio::io::AsyncReadExt as _; -use tokio_postgres::types::ToSql; -use tracing::{debug, error, info, instrument, warn}; - -use crate::{ - EmailConfig, LogError, events::EventType, targets::ExternalTarget, templates::build_context, -}; - -fn enabled() -> bool { - true -} - -fn default_interval() -> String { - "1 minute".to_string() -} - -#[derive(serde::Deserialize, Debug, Clone)] -#[serde(rename_all = "kebab-case")] -pub struct NumericalThreshold { - pub field: String, - pub alert_at: f64, - pub clear_at: Option, -} - -#[derive(serde::Deserialize, Debug, Clone)] -#[serde(rename_all = "kebab-case")] -#[serde(untagged)] -pub enum WhenChanged { - Boolean(bool), - Detailed(WhenChangedConfig), -} - -impl Default for WhenChanged { - fn default() -> Self { - WhenChanged::Boolean(false) - } -} - -#[derive(serde::Deserialize, Debug, Clone)] -#[serde(rename_all = "kebab-case")] -pub struct WhenChangedConfig { - #[serde(default)] - pub except: Vec, - #[serde(default)] - pub only: Vec, -} - -#[derive(serde::Deserialize, Debug, Clone)] -#[serde(rename_all = "kebab-case")] -#[serde(untagged)] -pub enum AlwaysSend { - Boolean(bool), - Timed(AlwaysSendConfig), -} - -impl Default for AlwaysSend { - fn default() -> Self { - AlwaysSend::Boolean(false) - } -} - -#[derive(serde::Deserialize, Debug, Clone)] -#[serde(rename_all = "kebab-case")] -pub struct AlwaysSendConfig { - pub after: String, - #[serde(skip)] - pub after_duration: Duration, -} - -#[derive(serde::Deserialize, Debug, 
Default, Clone)] -#[serde(rename_all = "kebab-case")] -pub struct AlertDefinition { - #[serde(default, skip)] - pub file: PathBuf, - - #[serde(default = "enabled")] - pub enabled: bool, - - #[serde(default = "default_interval")] - pub interval: String, - - #[serde(skip)] - pub interval_duration: Duration, - - #[serde(default)] - pub always_send: AlwaysSend, - - #[serde(default)] - pub when_changed: WhenChanged, - - #[serde(default)] - pub send: Vec, - - /// Restrict this alert to a daemon with a particular `server_kind`. The - /// value is an opaque string; alertd treats `central`, `facility`, or - /// anything else identically — equality with the daemon's configured - /// `server_kind` is the only criterion. Absence means "run anywhere", - /// which is the implicit "all" — there's no explicit `both`/`all` value. - /// - /// Named `server_kind` (serialised as `server-kind` in YAML) rather than - /// `target` because alertd already uses `target` for delivery targets - /// inside the `send:` list (e.g. `send: [{ target: external, ... }]`), - /// and reusing the word at the top level would be confusing. - #[serde(default)] - pub server_kind: Option, - - #[serde(flatten)] - pub source: TicketSource, -} - -/// Whether an alert with this `server_kind` filter should run on a daemon -/// configured with the given `server_kind`. -/// -/// - alert `None` ("any") → always runs. -/// - alert `Some(k)`, daemon `None` → permissive: the daemon isn't -/// configured to filter, so we don't drop anything. Better to run too -/// many alerts than to silently swallow targeted ones because someone -/// forgot to set the daemon kind. -/// - alert `Some(k)`, daemon `Some(d)` → runs iff `k == d` (case-sensitive -/// string equality). 
-pub fn server_kind_matches(alert: Option<&str>, daemon: Option<&str>) -> bool { - match (alert, daemon) { - (None, _) | (_, None) => true, - (Some(a), Some(d)) => a == d, - } -} - -#[derive(serde::Deserialize, Debug, Default, Clone)] -#[serde(untagged, deny_unknown_fields)] -pub enum TicketSource { - Sql { - sql: String, - #[serde(default)] - numerical: Vec, - }, - Shell { - shell: String, - run: String, - }, - Event { - event: EventType, - }, - - #[default] - None, -} - -impl AlertDefinition { - pub fn normalise( - mut self, - external_targets: &HashMap>, - ) -> Result<(Self, Vec)> { - // Parse interval string into duration - self.interval_duration = parse_interval(&self.interval) - .wrap_err_with(|| format!("failed to parse interval: {}", self.interval))?; - - // Parse always_send after duration if configured - if let AlwaysSend::Timed(ref mut config) = self.always_send { - config.after_duration = parse_interval(&config.after) - .wrap_err_with(|| format!("failed to parse always-send after: {}", config.after))?; - } - - // Validate templates before resolving targets - // This catches template syntax errors early - for (idx, target) in self.send.iter().enumerate() { - crate::templates::load_templates(target.subject(), target.template()).wrap_err_with( - || { - format!( - "validating templates for send target #{} (id: {})", - idx + 1, - target.id() - ) - }, - )?; - } - - let resolved = self - .send - .iter() - .flat_map(|target| { - let resolved_targets = target.resolve_external(external_targets); - if resolved_targets.is_empty() { - error!( - file=?self.file, - id = %target.id(), - available_targets=?external_targets.keys().collect::>(), - "external target not found" - ); - } - resolved_targets - }) - .collect(); - - self.send.clear(); // Clear send targets after resolution - Ok((self, resolved)) - } - - #[instrument(skip(self, pool, not_before, context))] - pub async fn read_sources( - &self, - pool: &bestool_postgres::pool::PgPool, - not_before: Timestamp, - 
 context: &mut TeraCtx, - was_triggered: bool, - ) -> Result> { - match &self.source { - TicketSource::None => { - debug!(?self.file, "no source, skipping"); - return Ok(ControlFlow::Break(())); - } - TicketSource::Event { .. } => { - // Event sources are triggered externally, not by this method - debug!(?self.file, "event source, skipping normal execution"); - return Ok(ControlFlow::Break(())); - } - TicketSource::Sql { sql, numerical } => { - let client = pool - .get() - .await - .map_err(|e| miette!("getting connection from pool: {e}"))?; - let statement = client.prepare(sql).await.into_diagnostic()?; - - let interval = bestool_postgres::pg_interval::Interval(self.interval_duration); - let all_params: Vec<&(dyn ToSql + Sync)> = vec![&not_before, &interval]; - - let rows = client - .query(&statement, &all_params[..statement.params().len()]) - .await - .into_diagnostic() - .wrap_err("querying database")?; - - if rows.is_empty() { - debug!(?self.file, "no rows returned, skipping"); - return Ok(ControlFlow::Break(())); - } - - let context_rows = rows_to_value_map(&rows); - - // Check numerical thresholds if configured - if !numerical.is_empty() { - let triggered = - check_numerical_thresholds(&context_rows, numerical, was_triggered)?; - if !triggered { - debug!(?self.file, "numerical thresholds not met, skipping"); - return Ok(ControlFlow::Break(())); - } - } - - info!(?self.file, rows=%rows.len(), "alert triggered"); - context.insert("rows", &context_rows); - } - TicketSource::Shell { shell, run } => { - let mut script = tempfile::Builder::new().tempfile().into_diagnostic()?; - write!(script.as_file_mut(), "{run}").into_diagnostic()?; - - let mut shell = tokio::process::Command::new(shell) - .arg(script.path()) - .stdin(Stdio::null()) - .stdout(Stdio::piped()) - .spawn() - .into_diagnostic()?; - - let mut output = Vec::new(); - let mut stdout = shell - .stdout - .take() - .ok_or_else(|| miette!("getting the child stdout handle"))?; - let output_future = - 
futures::future::try_join(shell.wait(), stdout.read_to_end(&mut output)); - - let Ok(res) = tokio::time::timeout(self.interval_duration, output_future).await - else { - warn!(?self.file, "the script timed out, skipping"); - shell.kill().await.into_diagnostic()?; - return Ok(ControlFlow::Break(())); - }; - - let (status, output_size) = res.into_diagnostic().wrap_err("running the shell")?; - - if status.success() { - debug!(?self.file, "the script succeeded, skipping"); - return Ok(ControlFlow::Break(())); - } - info!(?self.file, ?status, ?output_size, "alert triggered"); - - context.insert("output", &String::from_utf8_lossy(&output)); - } - } - Ok(ControlFlow::Continue(())) - } - - pub async fn execute( - &self, - ctx: Arc, - email: Option<&EmailConfig>, - dry_run: bool, - resolved_targets: &[crate::targets::ResolvedTarget], - ) -> Result<()> { - info!(?self.file, "executing alert"); - - let now = Timestamp::now(); - let not_before = now - self.interval_duration; - info!(?now, ?not_before, interval=?self.interval_duration, "date range for alert"); - - let mut tera_ctx = build_context(self, now); - if self - .read_sources(&ctx.pg_pool, not_before, &mut tera_ctx, false) - .await? 
- .is_break() - { - return Ok(()); - } - - for target in resolved_targets { - if let Err(err) = target.send(self, &mut tera_ctx, email, &ctx, dry_run).await { - error!("sending: {}", LogError(&err)); - } - } - - Ok(()) - } -} - -#[derive(Debug, Clone)] -pub struct InternalContext { - pub pg_pool: bestool_postgres::pool::PgPool, - pub http_client: reqwest::Client, - pub canopy_client: Option>, -} - -fn rows_to_value_map( - rows: &[tokio_postgres::Row], -) -> Vec> { - rows.iter() - .map(|row| { - let mut map = serde_json::Map::new(); - for (idx, column) in row.columns().iter().enumerate() { - let value = bestool_postgres::stringify::postgres_to_json_value(row, idx); - map.insert(column.name().to_string(), value); - } - map - }) - .collect() -} - -fn check_numerical_thresholds( - rows: &[serde_json::Map], - thresholds: &[NumericalThreshold], - was_triggered: bool, -) -> Result { - for threshold in thresholds { - for row in rows { - let value = match row.get(&threshold.field) { - Some(serde_json::Value::Number(n)) => n - .as_f64() - .ok_or_else(|| miette!("field '{}' is not a valid number", threshold.field))?, - Some(_) => { - return Err(miette!( - "field '{}' exists but is not a number", - threshold.field - )); - } - None => { - return Err(miette!( - "field '{}' not found in query results", - threshold.field - )); - } - }; - - // Determine if we're checking for "above" or "below" based on clear_at - let is_inverted = threshold - .clear_at - .is_some_and(|clear| clear > threshold.alert_at); - - if was_triggered { - // Already triggered, check if we should clear - if let Some(clear_at) = threshold.clear_at { - let should_clear = if is_inverted { - // Inverted: clear when value >= clear_at - value >= clear_at - } else { - // Normal: clear when value <= clear_at - value <= clear_at - }; - - if should_clear { - // This threshold has cleared, continue checking others - continue; - } else { - // Still above/below clear threshold, remain triggered - return Ok(true); - } - } 
else { - // No clear_at specified, check alert_at threshold - let still_triggered = if is_inverted { - value <= threshold.alert_at - } else { - value >= threshold.alert_at - }; - - if still_triggered { - return Ok(true); - } - } - } else { - // Not yet triggered, check if we should trigger - let should_trigger = if is_inverted { - // Inverted: trigger when value <= alert_at - value <= threshold.alert_at - } else { - // Normal: trigger when value >= alert_at - value >= threshold.alert_at - }; - - if should_trigger { - return Ok(true); - } - } - } - } - - Ok(false) -} - -fn parse_interval(s: &str) -> Result { - let s = s.trim(); - - // Try to parse as a simple number (seconds) - if let Ok(secs) = s.parse::() { - return Ok(Duration::from_secs(secs)); - } - - // Parse with units - let parts: Vec<&str> = s.split_whitespace().collect(); - if parts.len() != 2 { - return Err(miette!( - "interval must be in format ' ' or just ''" - )); - } - - let value: u64 = parts[0] - .parse() - .into_diagnostic() - .wrap_err("interval value must be a number")?; - let unit = parts[1].to_lowercase(); - - let duration = match unit.as_str() { - "second" | "seconds" | "s" | "sec" | "secs" => Duration::from_secs(value), - "minute" | "minutes" | "m" | "min" | "mins" => Duration::from_secs(value * 60), - "hour" | "hours" | "h" | "hr" | "hrs" => Duration::from_secs(value * 3600), - "day" | "days" | "d" => Duration::from_secs(value * 86400), - _ => { - return Err(miette!( - "unknown interval unit: {}, expected: seconds, minutes, hours, or days", - unit - )); - } - }; - - Ok(duration) -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn test_alert_with_event_source() { - let yaml = r#" -event: source-error -send: - - id: test-target - subject: Test - template: Test template -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert!(matches!(alert.source, TicketSource::Event { .. 
})); - if let TicketSource::Event { event } = alert.source { - assert_eq!(event, EventType::SourceError); - } - } - - #[test] - fn test_parse_interval() { - assert_eq!(parse_interval("60").unwrap(), Duration::from_secs(60)); - assert_eq!(parse_interval("1 minute").unwrap(), Duration::from_secs(60)); - assert_eq!( - parse_interval("5 minutes").unwrap(), - Duration::from_secs(300) - ); - assert_eq!( - parse_interval("2 hours").unwrap(), - Duration::from_secs(7200) - ); - assert_eq!(parse_interval("1 day").unwrap(), Duration::from_secs(86400)); - assert_eq!( - parse_interval("30 seconds").unwrap(), - Duration::from_secs(30) - ); - } - - #[test] - fn test_default_interval() { - let yaml = r#" -sql: "SELECT 1" -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(alert.interval, "1 minute"); - } - - #[test] - fn test_default_always_send() { - let yaml = r#" -sql: "SELECT 1" -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert!(matches!(alert.always_send, AlwaysSend::Boolean(false))); - } - - #[test] - fn test_always_send_true() { - let yaml = r#" -always-send: true -sql: "SELECT 1" -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert!(matches!(alert.always_send, AlwaysSend::Boolean(true))); - } - - #[test] - fn test_always_send_timed() { - let yaml = r#" -always-send: - after: 8h -sql: "SELECT 1" -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - match alert.always_send { - AlwaysSend::Timed(config) => { - assert_eq!(config.after, "8h"); - } - _ => panic!("Expected AlwaysSend::Timed"), - } - } - - #[test] - fn test_always_send_timed_normalised() { - let yaml = r#" -always-send: - after: 8 hours -sql: "SELECT 1" -send: - - id: test - subject: Test - 
template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - let external_targets = std::collections::HashMap::new(); - let (normalised, _) = alert.normalise(&external_targets).unwrap(); - - match normalised.always_send { - AlwaysSend::Timed(config) => { - assert_eq!(config.after, "8 hours"); - assert_eq!( - config.after_duration, - std::time::Duration::from_secs(8 * 3600) - ); - } - _ => panic!("Expected AlwaysSend::Timed"), - } - } - - #[test] - fn test_numerical_threshold_normal() { - let yaml = r#" -sql: "SELECT 1" -numerical: - - field: count - alert-at: 100 - clear-at: 50 -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - if let TicketSource::Sql { numerical, .. } = &alert.source { - assert_eq!(numerical.len(), 1); - assert_eq!(numerical[0].field, "count"); - assert_eq!(numerical[0].alert_at, 100.0); - assert_eq!(numerical[0].clear_at, Some(50.0)); - } else { - panic!("Expected Sql source"); - } - } - - #[test] - fn test_numerical_threshold_inverted() { - let yaml = r#" -sql: "SELECT 1" -numerical: - - field: free_space - alert-at: 10 - clear-at: 50 -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - if let TicketSource::Sql { numerical, .. } = &alert.source { - assert_eq!(numerical.len(), 1); - assert_eq!(numerical[0].field, "free_space"); - assert_eq!(numerical[0].alert_at, 10.0); - assert_eq!(numerical[0].clear_at, Some(50.0)); - } else { - panic!("Expected Sql source"); - } - } - - #[test] - fn test_numerical_threshold_no_clear() { - let yaml = r#" -sql: "SELECT 1" -numerical: - - field: errors - alert-at: 5 -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - if let TicketSource::Sql { numerical, .. 
} = &alert.source { - assert_eq!(numerical.len(), 1); - assert_eq!(numerical[0].field, "errors"); - assert_eq!(numerical[0].alert_at, 5.0); - assert_eq!(numerical[0].clear_at, None); - } else { - panic!("Expected Sql source"); - } - } - - #[test] - fn test_check_numerical_thresholds_normal_trigger() { - let mut row = serde_json::Map::new(); - row.insert("count".to_string(), serde_json::Value::Number(150.into())); - let rows = vec![row]; - - let threshold = NumericalThreshold { - field: "count".to_string(), - alert_at: 100.0, - clear_at: Some(50.0), - }; - - // Not triggered yet, value 150 >= alert_at 100, should trigger - let result = - check_numerical_thresholds(&rows, std::slice::from_ref(&threshold), false).unwrap(); - assert!(result); - - // Already triggered, value 150 > clear_at 50, should stay triggered - let result = - check_numerical_thresholds(&rows, std::slice::from_ref(&threshold), true).unwrap(); - assert!(result); - - // Already triggered, value 30 <= clear_at 50, should clear - let mut row = serde_json::Map::new(); - row.insert("count".to_string(), serde_json::Value::Number(30.into())); - let rows = vec![row]; - let result = check_numerical_thresholds(&rows, &[threshold], true).unwrap(); - assert!(!result); - } - - #[test] - fn test_check_numerical_thresholds_inverted_trigger() { - let mut row = serde_json::Map::new(); - row.insert( - "free_space".to_string(), - serde_json::Value::Number(5.into()), - ); - let rows = vec![row]; - - let threshold = NumericalThreshold { - field: "free_space".to_string(), - alert_at: 10.0, - clear_at: Some(50.0), // Inverted because clear_at > alert_at - }; - - // Not triggered yet, value 5 <= alert_at 10, should trigger (inverted) - let result = - check_numerical_thresholds(&rows, std::slice::from_ref(&threshold), false).unwrap(); - assert!(result); - - // Already triggered, value 5 < clear_at 50, should stay triggered - let result = - check_numerical_thresholds(&rows, std::slice::from_ref(&threshold), true).unwrap(); - 
assert!(result); - - // Already triggered, value 60 >= clear_at 50, should clear - let mut row = serde_json::Map::new(); - row.insert( - "free_space".to_string(), - serde_json::Value::Number(60.into()), - ); - let rows = vec![row]; - let result = check_numerical_thresholds(&rows, &[threshold], true).unwrap(); - assert!(!result); - } - - #[test] - fn test_check_numerical_thresholds_no_clear_at() { - let threshold = NumericalThreshold { - field: "errors".to_string(), - alert_at: 5.0, - clear_at: None, - }; - - // Trigger when >= 5 - let mut row = serde_json::Map::new(); - row.insert("errors".to_string(), serde_json::Value::Number(10.into())); - let rows = vec![row]; - let result = - check_numerical_thresholds(&rows, std::slice::from_ref(&threshold), false).unwrap(); - assert!(result); - - // Still triggered when >= 5 - let result = - check_numerical_thresholds(&rows, std::slice::from_ref(&threshold), true).unwrap(); - assert!(result); - - // Clear when < 5 - let mut row = serde_json::Map::new(); - row.insert("errors".to_string(), serde_json::Value::Number(3.into())); - let rows = vec![row]; - let result = check_numerical_thresholds(&rows, &[threshold], true).unwrap(); - assert!(!result); - } - - #[test] - fn test_when_changed_boolean_true() { - let yaml = r#" -sql: "SELECT 1" -when-changed: true -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert!(matches!(alert.when_changed, WhenChanged::Boolean(true))); - } - - #[test] - fn test_when_changed_boolean_false() { - let yaml = r#" -sql: "SELECT 1" -when-changed: false -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert!(matches!(alert.when_changed, WhenChanged::Boolean(false))); - } - - #[test] - fn test_when_changed_default() { - let yaml = r#" -sql: "SELECT 1" -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = 
serde_yaml::from_str(yaml).unwrap(); - assert!(matches!(alert.when_changed, WhenChanged::Boolean(false))); - } - - #[test] - fn test_when_changed_except() { - let yaml = r#" -sql: "SELECT 1" -when-changed: - except: [created_at, updated_at] -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - match &alert.when_changed { - WhenChanged::Detailed(config) => { - assert_eq!(config.except, vec!["created_at", "updated_at"]); - assert!(config.only.is_empty()); - } - _ => panic!("Expected Detailed variant"), - } - } - - #[test] - fn test_when_changed_only() { - let yaml = r#" -sql: "SELECT 1" -when-changed: - only: [error, message] -send: - - id: test - subject: Test - template: Test -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - match &alert.when_changed { - WhenChanged::Detailed(config) => { - assert_eq!(config.only, vec!["error", "message"]); - assert!(config.except.is_empty()); - } - _ => panic!("Expected Detailed variant"), - } - } - - #[test] - fn server_kind_defaults_to_none_when_absent() { - let yaml = "sql: \"SELECT 1\"\nsend:\n - id: x\n subject: s\n template: t\n"; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(alert.server_kind, None); - } - - #[test] - fn server_kind_parses_as_arbitrary_string() { - // alertd doesn't enforce a vocabulary — `central` / `facility` are - // just what bestool-tamanu happens to pass through. Whatever string - // the daemon's `server_kind` is configured with is what alerts match - // against. - for text in ["central", "facility", "kiosk", "edge"] { - let yaml = format!( - "sql: \"SELECT 1\"\nserver-kind: {text}\nsend:\n - id: x\n subject: s\n template: t\n" - ); - let alert: AlertDefinition = serde_yaml::from_str(&yaml).unwrap(); - assert_eq!(alert.server_kind.as_deref(), Some(text)); - } - } - - #[test] - fn server_kind_matches_truth_table() { - // Absent on the alert → always runs. 
- assert!(server_kind_matches(None, Some("central"))); - assert!(server_kind_matches(None, Some("facility"))); - assert!(server_kind_matches(None, None)); - - // Absent on the daemon → permissive: every alert applies. - assert!(server_kind_matches(Some("central"), None)); - assert!(server_kind_matches(Some("facility"), None)); - - // Both present → equality. - assert!(server_kind_matches(Some("central"), Some("central"))); - assert!(!server_kind_matches(Some("central"), Some("facility"))); - assert!(!server_kind_matches(Some("facility"), Some("central"))); - - // Match is case-sensitive. - assert!(!server_kind_matches(Some("Central"), Some("central"))); - } -} diff --git a/crates/alertd/src/commands.rs b/crates/alertd/src/commands.rs deleted file mode 100644 index 9fa1f2fd..00000000 --- a/crates/alertd/src/commands.rs +++ /dev/null @@ -1,64 +0,0 @@ -mod loaded_alerts; -mod pause; -mod reload; -mod status; -mod validate; - -pub use loaded_alerts::get_loaded_alerts; -pub use pause::pause_alert; -pub use reload::send_reload; -pub use status::get_status; -pub use validate::validate_alert; - -use tracing::info; - -/// Default server addresses to try when connecting to the daemon -pub fn default_server_addrs() -> Vec { - vec![ - "[::1]:8271".parse().unwrap(), - "127.0.0.1:8271".parse().unwrap(), - ] -} - -/// Attempt to connect to a running daemon at any of the provided addresses -/// -/// Returns a tuple of (client, base_url) on success, or an error if no daemon could be reached. 
-pub async fn try_connect_daemon( - addrs: &[std::net::SocketAddr], -) -> miette::Result<(reqwest::Client, String)> { - let client = crate::http_client(); - let mut last_error = None; - - for addr in addrs { - let url = format!("http://{}", addr); - info!("trying to connect to daemon at {}", url); - - // Try to connect with a simple status check - let test_response = match client.get(format!("{}/status", url)).send().await { - Ok(resp) => resp, - Err(e) => { - info!("failed to connect to {}: {}", url, e); - last_error = Some(e); - continue; - } - }; - - if test_response.status().is_success() { - info!("connected to daemon at {}", url); - return Ok((client, url)); - } - } - - if let Some(err) = last_error { - Err(miette::miette!( - "failed to connect to daemon at any of {} address(es): {}", - addrs.len(), - err - )) - } else { - Err(miette::miette!( - "no daemon found at any of {} address(es)", - addrs.len() - )) - } -} diff --git a/crates/alertd/src/commands/loaded_alerts.rs b/crates/alertd/src/commands/loaded_alerts.rs deleted file mode 100644 index 4b45a912..00000000 --- a/crates/alertd/src/commands/loaded_alerts.rs +++ /dev/null @@ -1,82 +0,0 @@ -use super::try_connect_daemon; -use serde::Deserialize; - -#[derive(Debug, Deserialize)] -struct AlertStateInfo { - path: String, - enabled: bool, - interval: String, - triggered_at: Option, - last_sent_at: Option, - paused_until: Option, - always_send: String, -} - -/// Get the list of currently loaded alerts from a running daemon -pub async fn get_loaded_alerts(addrs: &[std::net::SocketAddr], detail: bool) -> miette::Result<()> { - let (client, base_url) = try_connect_daemon(addrs).await?; - - let url = if detail { - format!("{}/alerts?detail=true", base_url) - } else { - format!("{}/alerts", base_url) - }; - - let response = client - .get(url) - .send() - .await - .map_err(|e| miette::miette!("failed to get alerts: {}", e))?; - - if !response.status().is_success() { - return Err(miette::miette!( - "failed to get 
alerts (status: {})", - response.status() - )); - } - - if detail { - let alert_states: Vec = response - .json() - .await - .map_err(|e| miette::miette!("failed to parse response: {}", e))?; - - if alert_states.is_empty() { - println!("No alerts currently loaded"); - } else { - println!("Loaded alerts ({}):\n", alert_states.len()); - for state in alert_states { - println!(" {}:", state.path); - println!(" enabled: {}", state.enabled); - println!(" interval: {}", state.interval); - println!(" always_send: {}", state.always_send); - if let Some(triggered_at) = state.triggered_at { - println!(" triggered_at: {}", triggered_at); - } - if let Some(last_sent_at) = state.last_sent_at { - println!(" last_sent_at: {}", last_sent_at); - } - if let Some(paused_until) = state.paused_until { - println!(" paused_until: {}", paused_until); - } - println!(); - } - } - } else { - let alerts: Vec = response - .json() - .await - .map_err(|e| miette::miette!("failed to parse response: {}", e))?; - - if alerts.is_empty() { - println!("No alerts currently loaded"); - } else { - println!("Loaded alerts ({}):", alerts.len()); - for alert in alerts { - println!(" {}", alert); - } - } - } - - Ok(()) -} diff --git a/crates/alertd/src/commands/pause.rs b/crates/alertd/src/commands/pause.rs deleted file mode 100644 index acbdeb3a..00000000 --- a/crates/alertd/src/commands/pause.rs +++ /dev/null @@ -1,137 +0,0 @@ -use std::io::{self, Write}; - -use tracing::info; - -use super::try_connect_daemon; - -/// Pause an alert until a specified time -pub async fn pause_alert( - alert_path: &str, - until: Option<&str>, - addrs: &[std::net::SocketAddr], -) -> miette::Result<()> { - // Parse or default the until time - let until_timestamp = if let Some(until_str) = until { - // Try parsing as timestamp first - if let Ok(ts) = until_str.parse::() { - ts - } else { - // Try parsing as relative time using jiff's Span - let span: jiff::Span = until_str - .parse() - .map_err(|e| miette::miette!("failed to parse 
time '{}': {}", until_str, e))?; - jiff::Timestamp::now() - .checked_add(span) - .map_err(|e| miette::miette!("time calculation overflow: {}", e))? - } - } else { - // Default to 1 week from now - jiff::Timestamp::now() - .checked_add(jiff::Span::new().days(7)) - .map_err(|e| miette::miette!("time calculation overflow: {}", e))? - }; - - let (client, base_url) = try_connect_daemon(addrs).await?; - - // Try to pause the alert - let url = format!("{}/alerts", base_url); - - let body = serde_json::json!({ - "alert": alert_path, - "until": until_timestamp.to_string(), - }); - - let response = client - .delete(&url) - .json(&body) - .send() - .await - .map_err(|e| miette::miette!("failed to send pause request: {}", e))?; - - if response.status() == reqwest::StatusCode::NOT_FOUND { - // Alert not found, try to find a partial match - info!("alert not found, trying to find partial match"); - - let alerts_response = client - .get(format!("{}/alerts", base_url)) - .send() - .await - .map_err(|e| miette::miette!("failed to get alerts list: {}", e))?; - - let alerts: Vec = alerts_response - .json() - .await - .map_err(|e| miette::miette!("failed to parse alerts list: {}", e))?; - - // Find partial matches - let matches: Vec<&String> = alerts.iter().filter(|a| a.contains(alert_path)).collect(); - - if matches.is_empty() { - return Err(miette::miette!( - "alert '{}' not found and no partial matches", - alert_path - )); - } else if matches.len() == 1 { - // Exactly one match, ask for confirmation - println!("Alert '{}' not found.", alert_path); - println!("Did you mean: {}", matches[0]); - print!("Pause this alert? 
[y/N] "); - io::stdout().flush().unwrap(); - - let mut input = String::new(); - io::stdin() - .read_line(&mut input) - .map_err(|e| miette::miette!("failed to read input: {}", e))?; - - if input.trim().eq_ignore_ascii_case("y") || input.trim().eq_ignore_ascii_case("yes") { - // Retry with the matched path - let retry_url = format!("{}/alerts", base_url); - let retry_body = serde_json::json!({ - "alert": matches[0], - "until": until_timestamp.to_string(), - }); - - let retry_response = client - .delete(&retry_url) - .json(&retry_body) - .send() - .await - .map_err(|e| miette::miette!("failed to send pause request: {}", e))?; - - if !retry_response.status().is_success() { - return Err(miette::miette!( - "failed to pause alert (status: {})", - retry_response.status() - )); - } - - println!("Alert paused until {}", until_timestamp); - return Ok(()); - } else { - return Err(miette::miette!("pause cancelled")); - } - } else { - // Multiple matches - println!( - "Alert '{}' not found. Did you mean one of these?", - alert_path - ); - for (i, m) in matches.iter().enumerate() { - println!(" {}. {}", i + 1, m); - } - return Err(miette::miette!( - "multiple matches found, please be more specific" - )); - } - } - - if !response.status().is_success() { - return Err(miette::miette!( - "failed to pause alert (status: {})", - response.status() - )); - } - - println!("Alert paused until {}", until_timestamp); - Ok(()) -} diff --git a/crates/alertd/src/commands/reload.rs b/crates/alertd/src/commands/reload.rs deleted file mode 100644 index dfb44b6f..00000000 --- a/crates/alertd/src/commands/reload.rs +++ /dev/null @@ -1,94 +0,0 @@ -use tracing::info; - -/// Send a reload signal to a running alertd daemon -/// -/// Tries to connect to the daemon's HTTP API at each of the provided addresses in order -/// until one succeeds. This is an alternative to SIGHUP that works on all platforms -/// including Windows. 
-pub async fn send_reload(addrs: &[std::net::SocketAddr]) -> miette::Result<()> { - let client = crate::http_client(); - - let mut last_error = None; - - for addr in addrs { - let url = format!("http://{}", addr); - info!("checking if daemon is running at {}", url); - - // First, check if daemon is running by fetching status - let status_response = match client.get(format!("{}/status", url)).send().await { - Ok(resp) => resp, - Err(e) => { - info!("failed to connect to {}: {}", url, e); - last_error = Some(e); - continue; - } - }; - - if !status_response.status().is_success() { - info!( - "daemon at {} returned status: {}", - url, - status_response.status() - ); - continue; - } - - let status: serde_json::Value = match status_response.json().await { - Ok(s) => s, - Err(e) => { - info!("failed to parse status response from {}: {}", url, e); - continue; - } - }; - - // Verify it's the correct daemon - if status.get("name").and_then(|n| n.as_str()) != Some("bestool-alertd") { - info!( - "unexpected daemon running at {}: {:?}", - url, - status.get("name") - ); - continue; - } - - info!( - "found bestool-alertd daemon at {} (pid: {})", - url, - status.get("pid").unwrap_or(&serde_json::Value::Null) - ); - - // Send reload request - info!("sending reload request to {}", url); - let reload_response = match client.post(format!("{}/reload", url)).send().await { - Ok(resp) => resp, - Err(e) => { - return Err(miette::miette!("reload request to {} failed: {}", url, e)); - } - }; - - if !reload_response.status().is_success() { - return Err(miette::miette!( - "reload request to {} failed (status: {})", - url, - reload_response.status() - )); - } - - info!("reload request sent successfully to {}", url); - return Ok(()); - } - - // If we get here, we couldn't connect to any address - if let Some(err) = last_error { - Err(miette::miette!( - "failed to connect to daemon at any of {} address(es): {}", - addrs.len(), - err - )) - } else { - Err(miette::miette!( - "no daemon found at 
any of {} address(es)", - addrs.len() - )) - } -} diff --git a/crates/alertd/src/commands/status.rs b/crates/alertd/src/commands/status.rs deleted file mode 100644 index 265243d7..00000000 --- a/crates/alertd/src/commands/status.rs +++ /dev/null @@ -1,106 +0,0 @@ -use miette::miette; -use tracing::info; - -use super::try_connect_daemon; -use crate::http_server::StatusResponse; - -pub async fn get_status(addrs: &[std::net::SocketAddr]) -> miette::Result<()> { - let (client, url) = try_connect_daemon(addrs).await?; - - // Fetch /status - let status_response = client - .get(format!("{url}/status")) - .send() - .await - .map_err(|e| miette!("failed to fetch status: {e}"))?; - - if !status_response.status().is_success() { - return Err(miette!( - "status endpoint returned {}", - status_response.status() - )); - } - - let status: StatusResponse = status_response - .json() - .await - .map_err(|e| miette!("failed to parse status response: {e}"))?; - - // Fetch /health - let health_response = client - .get(format!("{url}/health")) - .send() - .await - .map_err(|e| miette!("failed to fetch health: {e}"))?; - - let health_status_code = health_response.status().as_u16(); - let health: serde_json::Value = health_response - .json() - .await - .map_err(|e| miette!("failed to parse health response: {e}"))?; - - info!("connected to daemon at {url}"); - - let local_version = crate::VERSION; - - println!("Name: {}", status.name); - println!("Version: {}", status.version); - if status.version != local_version { - println!( - " WARNING: running daemon is {}, but this CLI is {local_version}", - status.version - ); - println!(" Consider restarting the service to pick up the new version."); - } - println!("PID: {}", status.pid); - println!("Started: {}", status.started_at); - - let healthy = health - .get("healthy") - .and_then(|v| v.as_bool()) - .unwrap_or(false); - - if healthy { - println!("Health: ok"); - } else { - println!("Health: UNHEALTHY (HTTP {health_status_code})"); - } - - if 
let Some(uptime) = health.get("uptime_secs").and_then(|v| v.as_i64()) { - println!("Uptime: {}", format_duration(uptime)); - } - - if let Some(last) = health - .get("last_activity_secs_ago") - .and_then(|v| v.as_i64()) - { - println!("Last tick: {last}s ago"); - } else { - println!("Last tick: (none yet)"); - } - - if let Some(timeout) = health.get("watchdog_timeout_secs").and_then(|v| v.as_u64()) { - println!("Watchdog: {}", format_duration(timeout as i64)); - } else { - println!("Watchdog: disabled"); - } - - if !healthy { - std::process::exit(1); - } - - Ok(()) -} - -fn format_duration(secs: i64) -> String { - let hours = secs / 3600; - let mins = (secs % 3600) / 60; - let secs = secs % 60; - if hours > 0 { - format!("{hours}h {mins}m {secs}s") - } else if mins > 0 { - format!("{mins}m {secs}s") - } else { - format!("{secs}s") - } -} diff --git a/crates/alertd/src/commands/validate.rs b/crates/alertd/src/commands/validate.rs deleted file mode 100644 index 3bf372bb..00000000 --- a/crates/alertd/src/commands/validate.rs +++ /dev/null @@ -1,146 +0,0 @@ -use miette::{Context as _, IntoDiagnostic, NamedSource, SourceSpan}; -use tracing::warn; - -use super::try_connect_daemon; - -/// Validate an alert definition file -pub async fn validate_alert( - file: &std::path::Path, - addrs: &[std::net::SocketAddr], -) -> miette::Result<()> { - // Read the file - let content = std::fs::read_to_string(file) - .into_diagnostic() - .wrap_err_with(|| format!("failed to read file: {}", file.display()))?; - - // Connect to daemon - let (client, base_url) = try_connect_daemon(addrs).await?; - - // Check daemon version - let status_response = client - .get(format!("{}/status", base_url)) - .send() - .await - .into_diagnostic() - .wrap_err("failed to get daemon status")?; - - #[derive(serde::Deserialize)] - struct StatusResponse { - version: String, - } - - if let Ok(status) = status_response.json::().await { - let daemon_version = &status.version; - let cli_version = crate::VERSION; - 
if daemon_version != cli_version { - warn!( - "version mismatch: daemon is running {} but CLI is {}", - daemon_version, cli_version - ); - eprintln!( - "⚠ Warning: Version mismatch detected!\n Daemon version: {}\n CLI version: {}\n", - daemon_version, cli_version - ); - } - } - - // Send validation request - let response = client - .post(format!("{}/validate", base_url)) - .body(content.clone()) - .send() - .await - .into_diagnostic() - .wrap_err("failed to send validation request")?; - - if !response.status().is_success() { - return Err(miette::miette!( - "validation request failed with status: {}", - response.status() - )); - } - - // Parse response - #[derive(serde::Deserialize)] - struct ValidationResponse { - valid: bool, - error: Option<String>, - error_location: Option<ErrorLocation>, - info: Option<ValidationInfo>, - } - - #[derive(serde::Deserialize)] - struct ErrorLocation { - line: usize, - column: usize, - } - - #[derive(serde::Deserialize)] - struct ValidationInfo { - enabled: bool, - interval: String, - source_type: String, - targets: usize, - } - - let validation: ValidationResponse = response - .json() - .await - .into_diagnostic() - .wrap_err("failed to parse validation response")?; - - if validation.valid { - println!("✓ Alert definition is valid"); - println!(" File: {}", file.display()); - if let Some(info) = validation.info { - println!(" Enabled: {}", info.enabled); - println!(" Interval: {}", info.interval); - println!(" Source: {}", info.source_type); - println!(" Targets: {}", info.targets); - - if info.targets == 0 { - println!("\n⚠ Warning: Alert has no resolved targets."); - println!(" This alert may not send notifications. 
Check your _targets.yml file."); - } - } - Ok(()) - } else { - // Display error with source location if available - if let Some(error_msg) = validation.error { - if let Some(loc) = validation.error_location { - // Calculate byte offset for miette - let mut byte_offset = 0; - for (idx, line_content) in content.lines().enumerate() { - if idx + 1 < loc.line { - byte_offset += line_content.len() + 1; // +1 for newline - } else if idx + 1 == loc.line { - byte_offset += loc.column.saturating_sub(1); - break; - } - } - - let span_start = byte_offset; - let span_len = content[span_start..] - .lines() - .next() - .map(|l| l.len().min(80)) - .unwrap_or(1); - - Err(miette::miette!( - labels = vec![miette::LabeledSpan::at( - SourceSpan::new(span_start.into(), span_len), - "here" - )], - "{}", - error_msg - ) - .with_source_code(NamedSource::new(file.display().to_string(), content))) - } else { - Err(miette::miette!("{}", error_msg) - .with_source_code(NamedSource::new(file.display().to_string(), content))) - } - } else { - Err(miette::miette!("validation failed with no error message")) - } - } -} diff --git a/crates/alertd/src/context.rs b/crates/alertd/src/context.rs new file mode 100644 index 00000000..49ec270f --- /dev/null +++ b/crates/alertd/src/context.rs @@ -0,0 +1,12 @@ +use std::sync::Arc; + +use crate::canopy::CanopyClient; + +/// Shared resources the daemon holds for the lifetime of the process and hands +/// to background tasks and HTTP endpoints. 
+#[derive(Debug, Clone)] +pub struct InternalContext { + pub pg_pool: bestool_postgres::pool::PgPool, + pub http_client: reqwest::Client, + pub canopy_client: Option<Arc<CanopyClient>>, +} diff --git a/crates/alertd/src/daemon.rs b/crates/alertd/src/daemon.rs index aec6f775..ad60807d 100644 --- a/crates/alertd/src/daemon.rs +++ b/crates/alertd/src/daemon.rs @@ -1,80 +1,17 @@ -use std::{collections::HashSet, path::PathBuf, sync::Arc, time::Duration}; +use std::{sync::Arc, time::Duration}; -use miette::{IntoDiagnostic, Result, miette}; -use notify::{Event, EventKind, RecursiveMode, Watcher}; -use tokio::sync::{RwLock, mpsc, oneshot}; -use tracing::{debug, error, info, warn}; +use miette::{Result, miette}; +use tokio::sync::{mpsc, oneshot}; +use tracing::{error, info}; use crate::{ - DaemonConfig, LogError, - alert::InternalContext, - canopy::CanopyClient, - events::{EventContext, EventType}, - http_server, metrics, - scheduler::Scheduler, - state_file, + DaemonConfig, LogError, canopy::CanopyClient, context::InternalContext, http_server, metrics, tasks::TaskContext, }; enum DaemonEvent { - FileChanged, Shutdown, WatchdogTimeout, - ResolveGlobs, -} - -struct WatchManager { - watcher: notify::RecommendedWatcher, - watched_paths: HashSet<PathBuf>, -} - -impl WatchManager { - fn new(event_tx: mpsc::Sender<DaemonEvent>) -> Result<Self> { - let watcher = - notify::recommended_watcher(move |res: std::result::Result<Event, notify::Error>| match res { - Ok(event) => match event.kind { - EventKind::Create(_) | EventKind::Modify(_) | EventKind::Remove(_) => { - debug!(?event, "file system event detected"); - let _ = event_tx.blocking_send(DaemonEvent::FileChanged); - } - _ => {} - }, - Err(e) => error!("watch error: {:?}", e), - }) - .into_diagnostic()?; - - Ok(Self { - watcher, - watched_paths: HashSet::new(), - }) - } - - fn update_watches(&mut self, paths: &[PathBuf]) -> Result<()> { - let new_paths: HashSet<_> = paths.iter().cloned().collect(); - - // Remove watches for paths that no longer exist - for old_path in &self.watched_paths { - if
!new_paths.contains(old_path) { - debug!(?old_path, "removing watch for path"); - if let Err(e) = self.watcher.unwatch(old_path) { - warn!(?old_path, "failed to remove watch: {e}"); - } - } - } - - // Add watches for new paths - for new_path in &new_paths { - if !self.watched_paths.contains(new_path) && new_path.exists() { - debug!(?new_path, "adding watch for path"); - if let Err(e) = self.watcher.watch(new_path, RecursiveMode::Recursive) { - warn!(?new_path, "failed to watch path: {e}"); - } - } - } - - self.watched_paths = new_paths; - Ok(()) - } } pub async fn run(daemon_config: DaemonConfig) -> Result<()> { @@ -85,18 +22,9 @@ pub async fn run(daemon_config: DaemonConfig) -> Result<()> { pub async fn run_with_shutdown( daemon_config: DaemonConfig, external_shutdown: oneshot::Receiver<()>, -) -> Result<()> { - run_with_shutdown_and_reload(daemon_config, external_shutdown, None).await -} - -pub async fn run_with_shutdown_and_reload( - daemon_config: DaemonConfig, - external_shutdown: oneshot::Receiver<()>, - external_reload: Option>, ) -> Result<()> { info!("starting alertd daemon"); - // Initialize metrics metrics::init_metrics(); metrics::record_activity(); @@ -134,7 +62,7 @@ pub async fn run_with_shutdown_and_reload( } Ok(None) => { info!( - "no canopy auth path available (no tailscale, no device key); canopy targets will be skipped" + "no canopy auth path available (no tailscale, no device key); canopy posting will be skipped" ); None } @@ -150,84 +78,26 @@ pub async fn run_with_shutdown_and_reload( canopy_client, }); - let scheduler = Arc::new(Scheduler::new( - daemon_config.alert_globs.clone(), - ctx.clone(), - daemon_config.email.clone(), - daemon_config.dry_run, - daemon_config.server_kind, - )); - - // Resolve the persistence file path and seed cold-start state from it. - // On dry-run we skip persistence entirely — the daemon doesn't tick. 
- let state_file_path = (!daemon_config.dry_run) - .then(state_file::default_state_file_path) - .flatten(); - if let Some(path) = state_file_path.as_ref() { - info!(?path, "alertd state file"); - let persisted = state_file::read(path); - scheduler.set_pending_hydration(persisted).await; - } else if !daemon_config.dry_run { - warn!("could not resolve a state directory; running without persistence"); - } - - scheduler.load_and_schedule_alerts().await?; - - // If dry run, execute all alerts once and quit - if daemon_config.dry_run { - info!("dry run mode: executing all alerts once"); - scheduler.execute_all_alerts_once().await?; - info!("dry run complete"); - return Ok(()); - } - let (event_tx, mut event_rx) = mpsc::channel(100); - let (reload_tx, mut reload_rx) = mpsc::channel::<()>(10); // Start HTTP server if !daemon_config.no_server { - let event_manager_for_server = scheduler.get_event_manager(); let ctx_for_server = ctx.clone(); - let email_for_server = daemon_config.email.clone(); - let dry_run_for_server = daemon_config.dry_run; - let scheduler_for_server = scheduler.clone(); - let reload_tx_for_server = reload_tx.clone(); let background_tasks_for_server = daemon_config.background_tasks.clone(); + let server_addrs = daemon_config.server_addrs.clone(); + let watchdog_timeout = daemon_config.watchdog_timeout; tokio::spawn(async move { - // Wait for event manager to be initialised - let event_mgr = loop { - let guard = event_manager_for_server.read().await; - if let Some(ref mgr) = *guard { - break Some(Arc::new(mgr.clone())); - } - drop(guard); - tokio::time::sleep(std::time::Duration::from_millis(100)).await; - }; - let watchdog_timeout_for_server = daemon_config.watchdog_timeout; http_server::start_server( - reload_tx_for_server, - event_mgr, ctx_for_server, - email_for_server, - dry_run_for_server, - daemon_config.server_addrs.clone(), - scheduler_for_server, - watchdog_timeout_for_server, + server_addrs, + watchdog_timeout, &background_tasks_for_server, ) 
.await; }); } - // Set up file watcher - let watch_manager = Arc::new(RwLock::new(WatchManager::new(event_tx.clone())?)); - - // Get initial paths to watch - let initial_paths = scheduler.get_resolved_paths().await; - watch_manager.write().await.update_watches(&initial_paths)?; - info!(count = initial_paths.len(), "watching paths for changes"); - - // Setup signal handler + // SIGINT handler let signal_tx = event_tx.clone(); tokio::spawn(async move { match tokio::signal::ctrl_c().await { @@ -260,148 +130,6 @@ pub async fn run_with_shutdown_and_reload( info!("received SIGTERM, shutting down"); let _ = signal_tx_term.send(DaemonEvent::Shutdown).await; }); - - let scheduler_hup = scheduler.clone(); - let watch_manager_hup = watch_manager.clone(); - let ctx_hup = ctx.clone(); - tokio::spawn(async move { - let mut sighup = signal(SignalKind::hangup()).expect("failed to setup SIGHUP handler"); - loop { - sighup.recv().await; - info!("received SIGHUP, reloading configuration"); - metrics::inc_reloads(); - refresh_canopy_client(&ctx_hup).await; - if let Err(err) = scheduler_hup.reload_alerts().await { - error!("failed to reload alerts: {}", LogError(&err)); - } else { - // Update watches after reload - let new_paths = scheduler_hup.get_resolved_paths().await; - if let Err(err) = watch_manager_hup.write().await.update_watches(&new_paths) { - error!("failed to update watches: {}", LogError(&err)); - } - } - } - }); - } - - // Periodically re-resolve globs (every 5 minutes) - let glob_resolve_tx = event_tx.clone(); - tokio::spawn(async move { - let mut interval = tokio::time::interval(Duration::from_secs(5 * 60)); - interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip); - loop { - interval.tick().await; - debug!("triggering periodic glob resolution"); - let _ = glob_resolve_tx.send(DaemonEvent::ResolveGlobs).await; - } - }); - - // Persistence task: wake on state_dirty, debounce, write the file. 
- if let Some(path) = state_file_path.clone() { - let dirty = scheduler.state_dirty(); - let snap_scheduler = scheduler.clone(); - tokio::spawn(async move { - loop { - dirty.notified().await; - // Coalesce bursts: a single tick that touches several fields, - // or several alerts changing within a window, all collapse - // into one write. - tokio::time::sleep(Duration::from_millis(500)).await; - let snapshot = snap_scheduler.snapshot_for_persistence().await; - match state_file::write(&path, &snapshot) { - Ok(()) => debug!(?path, "wrote alertd state file"), - Err(err) => error!(?path, "failed to write state file: {}", LogError(&err)), - } - } - }); - } - - // Periodic database health check (every 30 seconds) - { - let health_ctx = ctx.clone(); - let health_scheduler = scheduler.clone(); - let health_email = daemon_config.email.clone(); - let health_dry_run = daemon_config.dry_run; - let health_db_url = daemon_config.database_url.clone(); - tokio::spawn(async move { - // Seed from persisted state so a recovery that happens while the - // daemon was down still produces a canopy clear on the next tick. 
- let mut was_down = health_scheduler.database_was_down().await; - let mut check_interval = tokio::time::interval(Duration::from_secs(30)); - check_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip); - loop { - check_interval.tick().await; - - let healthy = match health_ctx.pg_pool.get_timeout(Duration::from_secs(5)).await { - Ok(conn) => conn.simple_query("SELECT 1").await.is_ok(), - Err(_) => false, - }; - - if healthy { - if was_down { - info!("database connection restored, clearing database-down event"); - was_down = false; - health_scheduler.set_database_was_down(false).await; - if let Some(ref event_mgr) = - *health_scheduler.get_event_manager().read().await - && let Err(err) = event_mgr - .trigger_clear( - EventType::DatabaseDown, - &health_ctx, - health_dry_run, - None, - ) - .await - { - error!("failed to clear database-down event: {}", LogError(&err)); - } - } - } else if !was_down { - was_down = true; - health_scheduler.set_database_was_down(true).await; - error!("database health check failed, triggering database-down event"); - - // Redact password from URL for the alert context - let redacted_url = match url::Url::parse(&health_db_url) { - Ok(mut parsed) => { - if parsed.password().is_some() { - let _ = parsed.set_password(Some("***")); - } - parsed.to_string() - } - Err(_) => "(unparsable)".to_string(), - }; - - let event_context = EventContext::DatabaseDown { - database_url: redacted_url, - error_message: "health check SELECT 1 failed or timed out".to_string(), - }; - - if let Some(ref event_mgr) = *health_scheduler.get_event_manager().read().await - { - if let Err(err) = event_mgr - .trigger_event( - EventType::DatabaseDown, - &health_ctx, - health_email.as_ref(), - health_dry_run, - event_context, - None, - ) - .await - { - error!("failed to trigger database-down event: {}", LogError(&err)); - } - } else { - warn!( - "event manager not yet initialized, cannot trigger database-down event" - ); - } - } else { - debug!("database 
still unreachable"); - } - } - }); } // Registered background tasks (e.g. the doctor sweep). Each ticks at its @@ -427,7 +155,7 @@ pub async fn run_with_shutdown_and_reload( }); } - // Watchdog: if no alert task has ticked within the timeout, shut down so the + // Watchdog: if no task has ticked within the timeout, shut down so the // service manager (Windows SCM / systemd / etc.) can restart us. if let Some(watchdog_timeout) = daemon_config.watchdog_timeout { let watchdog_tx = event_tx.clone(); @@ -447,7 +175,7 @@ pub async fn run_with_shutdown_and_reload( error!( ?elapsed, ?watchdog_timeout, - "watchdog: no alert activity detected within timeout, shutting down" + "watchdog: no task activity detected within timeout, shutting down" ); let _ = watchdog_tx.send(DaemonEvent::WatchdogTimeout).await; break; @@ -456,97 +184,19 @@ pub async fn run_with_shutdown_and_reload( }); } - // Listen for external reload signals (e.g., from Windows TIME_CHANGE event) - if let Some(mut external_reload_rx) = external_reload { - let reload_tx = reload_tx.clone(); - tokio::spawn(async move { - while (external_reload_rx.recv().await).is_some() { - info!("received external reload signal"); - let _ = reload_tx.send(()).await; - } - }); - } - - let mut reload_debounce = tokio::time::interval(Duration::from_secs(2)); - reload_debounce.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip); - let mut needs_reload = false; - info!("daemon started successfully"); - loop { - tokio::select! 
{ - Some(event) = event_rx.recv() => { - match event { - DaemonEvent::FileChanged => { - needs_reload = true; - } - DaemonEvent::ResolveGlobs => { - debug!("re-resolving glob patterns"); - if let Err(err) = scheduler.check_and_reload_if_paths_changed().await { - error!("failed to check and reload: {}", LogError(&err)); - } else { - // Update watches with new paths - let new_paths = scheduler.get_resolved_paths().await; - if let Err(err) = watch_manager.write().await.update_watches(&new_paths) { - error!("failed to update watches: {}", LogError(&err)); - } - } - } - DaemonEvent::Shutdown => { - scheduler.shutdown().await; - info!("daemon stopped"); - break; - } - DaemonEvent::WatchdogTimeout => { - scheduler.shutdown().await; - error!("daemon exiting due to watchdog timeout"); - return Err(miette!("watchdog timeout: no alert activity detected")); - } - } - } - Some(()) = reload_rx.recv() => { - info!("reloading alerts via HTTP"); - metrics::inc_reloads(); - refresh_canopy_client(&ctx).await; - if let Err(err) = scheduler.reload_alerts().await { - error!("failed to reload alerts: {}", LogError(&err)); - } else { - // Update watches after reload - let new_paths = scheduler.get_resolved_paths().await; - if let Err(err) = watch_manager.write().await.update_watches(&new_paths) { - error!("failed to update watches: {}", LogError(&err)); - } - } - } - _ = reload_debounce.tick() => { - if needs_reload { - needs_reload = false; - info!("reloading alerts due to file system changes"); - metrics::inc_reloads(); - refresh_canopy_client(&ctx).await; - if let Err(err) = scheduler.reload_alerts().await { - error!("failed to reload alerts: {}", LogError(&err)); - } else { - // Update watches after reload - let new_paths = scheduler.get_resolved_paths().await; - if let Err(err) = watch_manager.write().await.update_watches(&new_paths) { - error!("failed to update watches: {}", LogError(&err)); - } - } - } - } + // Block until the first lifecycle event arrives: a shutdown signal, or 
the + // watchdog firing. `None` means every sender was dropped, which we treat as + // a shutdown too. + match event_rx.recv().await { + Some(DaemonEvent::Shutdown) | None => { + info!("daemon stopped"); + Ok(()) + } + Some(DaemonEvent::WatchdogTimeout) => { + error!("daemon exiting due to watchdog timeout"); + Err(miette!("watchdog timeout: no task activity detected")) } - } - - Ok(()) -} - -/// Re-probe canopy auth on reload; logs failures but never blocks the reload. -async fn refresh_canopy_client(ctx: &InternalContext) { - let Some(client) = ctx.canopy_client.as_ref() else { - return; - }; - if let Err(err) = client.refresh().await { - error!("canopy client refresh failed: {}", LogError(&err)); } } diff --git a/crates/alertd/src/events.rs b/crates/alertd/src/events.rs deleted file mode 100644 index 26e79f9d..00000000 --- a/crates/alertd/src/events.rs +++ /dev/null @@ -1,471 +0,0 @@ -use std::collections::HashMap; - -use miette::Result; -use tera::Context as TeraCtx; -use tracing::{debug, error, info, warn}; - -use crate::{ - LogError, - alert::{AlertDefinition, InternalContext}, - targets::{ExternalTarget, ResolvedTarget, determine_default_target}, -}; - -/// Build the synthetic alert file used by event triggers and clears. -/// -/// Embedding the entity key in the file stem keeps the canopy ref stable -/// across trigger and clear for the same logical entity (e.g. the same -/// failing alert), and distinct across different entities. 
-fn synthetic_alert_file(event_type: &EventType, entity_key: Option<&str>) -> std::path::PathBuf { - match entity_key { - Some(key) => format!("[internal:{}:{}]", event_type.as_str(), key).into(), - None => format!("[internal:{}]", event_type.as_str()).into(), - } -} - -fn synthetic_alert(event_type: &EventType, entity_key: Option<&str>) -> AlertDefinition { - AlertDefinition { - file: synthetic_alert_file(event_type, entity_key), - enabled: true, - interval: "0 seconds".to_string(), - interval_duration: std::time::Duration::from_secs(0), - always_send: crate::alert::AlwaysSend::Boolean(false), - when_changed: crate::alert::WhenChanged::default(), - send: Vec::new(), - // Synthesised internal-event alerts (e.g. source-error, definition-error) - // aren't tied to a specific server kind — they run wherever the daemon - // does. `None` means "any" (no filter). - server_kind: None, - source: crate::alert::TicketSource::Event { - event: event_type.clone(), - }, - } -} - -/// Internal event types that can trigger alerts -#[derive(Debug, Clone, PartialEq, Eq, Hash, serde::Deserialize, serde::Serialize)] -#[serde(rename_all = "kebab-case")] -pub enum EventType { - SourceError, - DefinitionError, - DatabaseDown, -} - -impl EventType { - pub fn as_str(&self) -> &'static str { - match self { - Self::SourceError => "source-error", - Self::DefinitionError => "definition-error", - Self::DatabaseDown => "database-down", - } - } -} - -/// Context data for an event -#[derive(Debug, Clone)] -pub enum EventContext { - SourceError { - alert_file: String, - error_message: String, - }, - DefinitionError { - alert_file: String, - error_message: String, - }, - DatabaseDown { - database_url: String, - error_message: String, - }, -} - -impl EventContext { - pub fn to_tera_context(&self) -> TeraCtx { - let mut ctx = TeraCtx::new(); - match self { - Self::SourceError { - alert_file, - error_message, - } => { - ctx.insert("alert_file", alert_file); - ctx.insert("error_message", error_message); 
- } - Self::DefinitionError { - alert_file, - error_message, - } => { - ctx.insert("alert_file", alert_file); - ctx.insert("error_message", error_message); - } - Self::DatabaseDown { - database_url, - error_message, - } => { - ctx.insert("database_url", database_url); - ctx.insert("error_message", error_message); - } - } - ctx - } -} - -/// Manages event-triggered alerts -#[derive(Clone)] -pub struct EventManager { - /// Alerts that listen for specific events - event_alerts: HashMap)>>, - /// Default target for fallback alerts - default_target: Option, -} - -impl EventManager { - pub fn new( - alerts: Vec<(AlertDefinition, Vec)>, - external_targets: &HashMap>, - ) -> Self { - let mut event_alerts: HashMap)>> = - HashMap::new(); - - debug!(total_alerts = alerts.len(), "initializing event manager"); - - // Separate event-based alerts from regular alerts - for (alert, targets) in alerts { - if let crate::alert::TicketSource::Event { event } = &alert.source { - debug!( - file = ?alert.file, - event = event.as_str(), - targets = targets.len(), - "registered event alert" - ); - event_alerts - .entry(event.clone()) - .or_default() - .push((alert, targets)); - } - } - - info!( - event_types = ?event_alerts.keys().collect::>(), - total_event_alerts = event_alerts.values().map(|v| v.len()).sum::(), - "event manager initialized" - ); - - let default_target = determine_default_target(external_targets).map(|t| ResolvedTarget { - target_id: t.id.clone(), - subject: None, - template: String::new(), - conn: t.conn.clone(), - }); - if let Some(ref target) = default_target { - let target_desc = match &target.conn { - crate::targets::TargetConnection::Email(email) => email - .addresses - .first() - .cloned() - .unwrap_or_else(|| "unknown".into()), - crate::targets::TargetConnection::Slack(slack) => { - format!("slack:{}", slack.webhook.host_str().unwrap_or("unknown")) - } - crate::targets::TargetConnection::Canopy(_) => "canopy".to_string(), - }; - info!( - from = %target_desc, - 
"determined default target for fallback alerts" - ); - } - - Self { - event_alerts, - default_target, - } - } - - /// Trigger an event with the given context. - /// - /// `entity_key` identifies the specific subject of the event (e.g. the - /// erroring alert's file path). It's embedded in the synthetic alert file - /// so canopy refs are stable across trigger and clear for the same - /// subject, and distinct across different subjects. Pass `None` for - /// events that have only one possible subject per host (e.g. the database). - pub async fn trigger_event( - &self, - event_type: EventType, - ctx: &InternalContext, - email: Option<&crate::EmailConfig>, - dry_run: bool, - event_context: EventContext, - entity_key: Option<&str>, - ) -> Result<()> { - info!( - event = event_type.as_str(), - entity_key, - has_alerts = self.event_alerts.contains_key(&event_type), - has_default_target = self.default_target.is_some(), - "triggering event" - ); - - // Check if there are explicit alerts for this event - if let Some(alerts) = self.event_alerts.get(&event_type) { - info!(count = alerts.len(), "executing event alerts"); - for (alert, targets) in alerts { - let mut tera_ctx = crate::templates::build_context(alert, jiff::Timestamp::now()); - // Merge event context - tera_ctx.extend(event_context.to_tera_context()); - - for target in targets { - if let Err(err) = target.send(alert, &mut tera_ctx, email, ctx, dry_run).await { - error!(file = ?alert.file, "failed to send event alert: {}", LogError(&err)); - } - } - } - } else if let Some(ref default_target) = self.default_target { - // No explicit alert, use default target with event-specific template - info!( - event = event_type.as_str(), - "using default target for event (no explicit alert configured)" - ); - - let (subject_template, body_template) = default_event_template(&event_type); - - let default_target_for_event = ResolvedTarget { - target_id: default_target.target_id.clone(), - subject: Some(subject_template), - 
template: body_template, - conn: default_target.conn.clone(), - }; - - let alert = synthetic_alert(&event_type, entity_key); - - let mut tera_ctx = crate::templates::build_context(&alert, jiff::Timestamp::now()); - tera_ctx.extend(event_context.to_tera_context()); - - if let Err(err) = default_target_for_event - .send(&alert, &mut tera_ctx, email, ctx, dry_run) - .await - { - error!("failed to send default event alert: {}", LogError(&err)); - } - } else { - warn!( - event = event_type.as_str(), - "no alerts or default target for event, skipping notification" - ); - } - - Ok(()) - } - - /// Send a clearing notification for an event. - /// - /// Mirrors `trigger_event`, but calls `send_clear` on each target so - /// stateful sinks (canopy) flip the corresponding issue to `active=false`. - /// Non-stateful sinks (email, slack) return immediately — that's the same - /// behaviour as SQL-alert clears. - /// - /// `entity_key` must match the value passed to the original - /// `trigger_event` so the canopy ref lines up. 
- pub async fn trigger_clear( - &self, - event_type: EventType, - ctx: &InternalContext, - dry_run: bool, - entity_key: Option<&str>, - ) -> Result<()> { - info!( - event = event_type.as_str(), - entity_key, - has_alerts = self.event_alerts.contains_key(&event_type), - has_default_target = self.default_target.is_some(), - "clearing event" - ); - - if let Some(alerts) = self.event_alerts.get(&event_type) { - info!(count = alerts.len(), "clearing event alerts"); - for (alert, targets) in alerts { - for target in targets { - if let Err(err) = target.send_clear(alert, ctx, dry_run).await { - error!(file = ?alert.file, "failed to clear event alert: {}", LogError(&err)); - } - } - } - } else if let Some(ref default_target) = self.default_target { - info!( - event = event_type.as_str(), - "clearing default target for event" - ); - - let default_target_for_event = ResolvedTarget { - target_id: default_target.target_id.clone(), - subject: None, - template: String::new(), - conn: default_target.conn.clone(), - }; - - let alert = synthetic_alert(&event_type, entity_key); - - if let Err(err) = default_target_for_event - .send_clear(&alert, ctx, dry_run) - .await - { - error!("failed to clear default event alert: {}", LogError(&err)); - } - } else { - debug!( - event = event_type.as_str(), - "no alerts or default target for event, skipping clear" - ); - } - - Ok(()) - } -} - -fn default_event_template(event_type: &EventType) -> (String, String) { - match event_type { - EventType::SourceError => ( - "[bestool-alertd] {{ hostname }}: Failed alert: {{ alert_file }}".to_string(), - "
{{ error_message }}
".to_string(), - ), - EventType::DefinitionError => ( - "[bestool-alertd] {{ hostname }}: Invalid alert definition: {{ alert_file }}" - .to_string(), - "
{{ error_message }}
".to_string(), - ), - EventType::DatabaseDown => ( - "[bestool-alertd] {{ hostname }}: Database unreachable".to_string(), - "The PostgreSQL database is unreachable.\n\n\ - Database URL: {{ database_url }}\n\ - Error:
{{ error_message }}
\n\n\ - All SQL-based alerts are non-functional until the database is restored." - .to_string(), - ), - } -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn test_event_type_parsing() { - let yaml = "source-error"; - let event: EventType = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(event, EventType::SourceError); - } - - #[test] - fn test_event_type_as_str() { - assert_eq!(EventType::SourceError.as_str(), "source-error"); - assert_eq!(EventType::DefinitionError.as_str(), "definition-error"); - assert_eq!(EventType::DatabaseDown.as_str(), "database-down"); - } - - #[test] - fn test_event_type_serialization() { - let event = EventType::SourceError; - let yaml = serde_yaml::to_string(&event).unwrap(); - assert!(yaml.contains("source-error")); - } - - #[test] - fn test_event_context_to_tera_source_error() { - let ctx = EventContext::SourceError { - alert_file: "/etc/alerts/test.yml".to_string(), - error_message: "Something went wrong".to_string(), - }; - - let tera_ctx = ctx.to_tera_context(); - assert_eq!( - tera_ctx.get("alert_file").unwrap().as_str().unwrap(), - "/etc/alerts/test.yml" - ); - assert_eq!( - tera_ctx.get("error_message").unwrap().as_str().unwrap(), - "Something went wrong" - ); - } - - #[test] - fn test_event_type_database_down() { - let yaml = "database-down"; - let event: EventType = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(event, EventType::DatabaseDown); - } - - #[test] - fn test_event_context_to_tera_database_down() { - let ctx = EventContext::DatabaseDown { - database_url: "postgresql://localhost/mydb".to_string(), - error_message: "connection refused".to_string(), - }; - - let tera_ctx = ctx.to_tera_context(); - assert_eq!( - tera_ctx.get("database_url").unwrap().as_str().unwrap(), - "postgresql://localhost/mydb" - ); - assert_eq!( - tera_ctx.get("error_message").unwrap().as_str().unwrap(), - "connection refused" - ); - } - - #[test] - fn test_event_type_definition_error() { - let yaml = "definition-error"; - let 
event: EventType = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(event, EventType::DefinitionError); - } - - #[test] - fn test_event_context_to_tera_definition_error() { - let ctx = EventContext::DefinitionError { - alert_file: "/etc/alerts/broken.yml".to_string(), - error_message: "Invalid YAML syntax".to_string(), - }; - - let tera_ctx = ctx.to_tera_context(); - assert_eq!( - tera_ctx.get("alert_file").unwrap().as_str().unwrap(), - "/etc/alerts/broken.yml" - ); - assert_eq!( - tera_ctx.get("error_message").unwrap().as_str().unwrap(), - "Invalid YAML syntax" - ); - } - - #[test] - fn synthetic_alert_file_without_entity_key() { - let path = synthetic_alert_file(&EventType::DatabaseDown, None); - assert_eq!(path.to_string_lossy(), "[internal:database-down]"); - } - - #[test] - fn synthetic_alert_file_with_entity_key() { - let path = synthetic_alert_file(&EventType::SourceError, Some("disk-full.yml")); - assert_eq!( - path.to_string_lossy(), - "[internal:source-error:disk-full.yml]" - ); - } - - #[test] - fn synthetic_alert_file_distinguishes_entities() { - let a = synthetic_alert_file(&EventType::DefinitionError, Some("a.yml")); - let b = synthetic_alert_file(&EventType::DefinitionError, Some("b.yml")); - assert_ne!(a, b); - } - - #[test] - fn synthetic_alert_file_stem_includes_entity_for_canopy_ref() { - // build_ref uses file_stem() to form the canopy ref. Verify that - // trigger and clear with the same entity key produce identical stems. - let trigger = synthetic_alert_file(&EventType::SourceError, Some("my-alert")); - let clear = synthetic_alert_file(&EventType::SourceError, Some("my-alert")); - assert_eq!(trigger.file_stem(), clear.file_stem()); - - // And distinct entities produce distinct stems so canopy treats them - // as separate issues. 
- let other = synthetic_alert_file(&EventType::SourceError, Some("other-alert")); - assert_ne!(trigger.file_stem(), other.file_stem()); - } -} diff --git a/crates/alertd/src/glob_resolver.rs b/crates/alertd/src/glob_resolver.rs deleted file mode 100644 index 56908093..00000000 --- a/crates/alertd/src/glob_resolver.rs +++ /dev/null @@ -1,85 +0,0 @@ -use std::{ - collections::HashSet, - path::{Path, PathBuf}, -}; - -use glob::glob; -use miette::{IntoDiagnostic, Result}; -use tracing::{debug, warn}; - -/// Resolves glob patterns to concrete paths (directories and files) -#[derive(Debug, Clone)] -pub struct GlobResolver { - patterns: Vec, -} - -impl GlobResolver { - pub fn new(patterns: Vec) -> Self { - Self { patterns } - } - - /// Resolve all glob patterns to concrete paths that exist - /// - /// Returns directories and files separately for different handling - pub fn resolve(&self) -> Result { - let mut dirs = HashSet::new(); - let mut files = HashSet::new(); - - for pattern in &self.patterns { - debug!(?pattern, "resolving glob pattern"); - - let entries = glob(pattern).into_diagnostic()?; - - for entry in entries { - match entry { - Ok(path) => { - if path.is_dir() { - debug!(?path, "resolved to directory"); - dirs.insert(path); - } else if path.is_file() { - debug!(?path, "resolved to file"); - files.insert(path); - } else { - debug!(?path, "skipping non-file, non-directory"); - } - } - Err(e) => { - warn!("glob error for pattern {}: {}", pattern, e); - } - } - } - } - - Ok(ResolvedPaths { - dirs: dirs.into_iter().collect(), - files: files.into_iter().collect(), - }) - } -} - -/// Paths resolved from glob patterns -#[derive(Debug, Clone)] -pub struct ResolvedPaths { - /// Directories that match the patterns - pub dirs: Vec, - /// Individual files that match the patterns - pub files: Vec, -} - -impl ResolvedPaths { - /// Get all unique paths (both dirs and files) - pub fn all_paths(&self) -> Vec<&Path> { - self.dirs - .iter() - .map(|p| p.as_path()) - 
.chain(self.files.iter().map(|p| p.as_path())) - .collect() - } - - /// Check if this set of paths differs from another - pub fn differs_from(&self, other: &ResolvedPaths) -> bool { - let self_set: HashSet<_> = self.all_paths().into_iter().collect(); - let other_set: HashSet<_> = other.all_paths().into_iter().collect(); - self_set != other_set - } -} diff --git a/crates/alertd/src/http_server.rs b/crates/alertd/src/http_server.rs index 4170c87a..41099ec8 100644 --- a/crates/alertd/src/http_server.rs +++ b/crates/alertd/src/http_server.rs @@ -2,20 +2,13 @@ use std::{collections::HashMap, sync::Arc, time::Duration}; -use axum::{ - Router, - routing::{get, post}, -}; +use axum::{Router, routing::get}; use jiff::Timestamp; -use tokio::sync::mpsc; use tower_http::trace::{DefaultMakeSpan, DefaultOnResponse, TraceLayer}; use tracing::{Level, error, info, warn}; use crate::{ - EmailConfig, - alert::InternalContext, - events::EventManager, - scheduler::Scheduler, + context::InternalContext, tasks::{BackgroundTask, TaskEndpointHandler}, }; @@ -29,18 +22,9 @@ pub use endpoints::*; pub use state::ServerState; pub use types::*; -#[expect( - clippy::too_many_arguments, - reason = "server startup needs all these pieces" -)] pub async fn start_server( - reload_tx: mpsc::Sender<()>, - event_manager: Option>, internal_context: Arc, - email_config: Option, - dry_run: bool, addrs: Vec, - scheduler: Arc, watchdog_timeout: Option, background_tasks: &[Arc], ) { @@ -50,24 +34,15 @@ pub async fn start_server( let task_endpoints = collect_task_endpoints(background_tasks); let state = ServerState { - reload_tx, started_at, pid, - event_manager, internal_context, - email_config, - dry_run, - scheduler, watchdog_timeout, task_endpoints: Arc::new(task_endpoints), }; let app = Router::new() .route("/", get(handle_index)) - .route("/reload", post(handle_reload)) - .route("/alerts", get(handle_alerts).delete(handle_pause_alert)) - .route("/targets", get(handle_targets)) - .route("/validate", 
post(handle_validate)) .route("/metrics", get(handle_metrics)) .route("/status", get(handle_status)) .route("/health", get(handle_health)) diff --git a/crates/alertd/src/http_server/endpoints.rs b/crates/alertd/src/http_server/endpoints.rs index cc7e7315..fe87c73e 100644 --- a/crates/alertd/src/http_server/endpoints.rs +++ b/crates/alertd/src/http_server/endpoints.rs @@ -1,21 +1,11 @@ -mod alerts; mod health; mod index; mod metrics; -mod pause_alert; -mod reload; mod status; -mod targets; mod tasks; -mod validate; -pub use alerts::handle_alerts; pub use health::handle_health; pub use index::handle_index; pub use metrics::handle_metrics; -pub use pause_alert::handle_pause_alert; -pub use reload::handle_reload; pub use status::handle_status; -pub use targets::handle_targets; pub use tasks::handle_task_endpoint; -pub use validate::handle_validate; diff --git a/crates/alertd/src/http_server/endpoints/alert.rs b/crates/alertd/src/http_server/endpoints/alert.rs deleted file mode 100644 index 9fa33701..00000000 --- a/crates/alertd/src/http_server/endpoints/alert.rs +++ /dev/null @@ -1,132 +0,0 @@ -use std::sync::Arc; - -use axum::{Json, extract::State, http::StatusCode, response::IntoResponse}; -use tracing::{error, info}; - -use crate::{ - events::{EventContext, EventType}, - http_server::{state::ServerState, types::AlertRequest}, -}; - -pub async fn handle_alert( - State(state): State>, - Json(payload): Json, -) -> impl IntoResponse { - info!(message = %payload.message, "received HTTP alert"); - - let event_context = EventContext::Http { - message: payload.message, - subject: payload.subject, - custom: payload.custom, - }; - - if let Some(ref event_mgr) = state.event_manager { - match event_mgr - .trigger_event( - EventType::Http, - &state.internal_context, - state.email_config.as_ref(), - state.dry_run, - event_context, - ) - .await - { - Ok(()) => { - info!("HTTP alert triggered successfully"); - (StatusCode::OK, "Alert triggered\n") - } - Err(e) => { - error!("failed 
to trigger HTTP alert: {e:?}"); - ( - StatusCode::INTERNAL_SERVER_ERROR, - "Failed to trigger alert\n", - ) - } - } - } else { - error!("no event manager available"); - ( - StatusCode::SERVICE_UNAVAILABLE, - "Event manager not available\n", - ) - } -} - -#[cfg(test)] -mod tests { - use std::sync::Arc; - - use axum::{extract::State, http::StatusCode, response::IntoResponse}; - use jiff::Timestamp; - use tokio::sync::mpsc; - - use super::*; - use crate::{ - alert::InternalContext, events::EventManager, http_server::test_utils::create_test_state, - scheduler::Scheduler, - }; - - #[tokio::test] - async fn test_alert_endpoint_no_event_manager() { - let state = create_test_state().await; - - let payload = AlertRequest { - message: "Test alert".to_string(), - subject: Some("Test subject".to_string()), - custom: serde_json::json!({"key": "value"}), - }; - - let response = handle_alert(State(state), axum::Json(payload)) - .await - .into_response(); - - assert_eq!(response.status(), StatusCode::SERVICE_UNAVAILABLE); - } - - #[tokio::test] - async fn test_alert_endpoint_with_event_manager() { - let db_url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set for tests"); - let pool = bestool_postgres::pool::create_pool(&db_url, "bestool-alertd-test") - .await - .unwrap(); - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let scheduler = Arc::new(Scheduler::new(vec![], ctx.clone(), None, true, None)); - - let (reload_tx, _reload_rx) = mpsc::channel::<()>(10); - - let event_manager = EventManager::new(vec![], &std::collections::HashMap::new()); - - let state = Arc::new(crate::http_server::state::ServerState { - reload_tx, - started_at: Timestamp::now(), - pid: std::process::id(), - event_manager: Some(Arc::new(event_manager)), - internal_context: ctx, - email_config: None, - dry_run: true, - scheduler, - watchdog_timeout: None, - }); - - let payload = AlertRequest { - message: "Test 
alert".to_string(), - subject: Some("Test subject".to_string()), - custom: serde_json::json!({"key": "value"}), - }; - - let response = handle_alert(State(state), axum::Json(payload)) - .await - .into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let body_str = String::from_utf8(body.to_vec()).unwrap(); - assert_eq!(body_str, "Alert triggered\n"); - } -} diff --git a/crates/alertd/src/http_server/endpoints/alerts.rs b/crates/alertd/src/http_server/endpoints/alerts.rs deleted file mode 100644 index 4d3d9287..00000000 --- a/crates/alertd/src/http_server/endpoints/alerts.rs +++ /dev/null @@ -1,95 +0,0 @@ -use std::sync::Arc; - -use axum::{ - Json, - extract::{Query, State}, - response::IntoResponse, -}; - -use crate::http_server::{ - state::ServerState, - types::{AlertStateInfo, AlertsQuery}, -}; - -pub async fn handle_alerts( - State(state): State>, - Query(query): Query, -) -> impl IntoResponse { - if query.detail { - let states = state.scheduler.get_alert_states().await; - let mut alert_states: Vec = states - .iter() - .map(|(path, state)| { - let always_send = match &state.definition.always_send { - crate::alert::AlwaysSend::Boolean(true) => "true".to_string(), - crate::alert::AlwaysSend::Boolean(false) => "false".to_string(), - crate::alert::AlwaysSend::Timed(config) => { - format!("after: {}", config.after) - } - }; - - AlertStateInfo { - path: path.display().to_string(), - enabled: state.definition.enabled, - interval: state.definition.interval.clone(), - triggered_at: state.triggered_at.map(|t| t.to_string()), - last_sent_at: state.last_sent_at.map(|t| t.to_string()), - paused_until: state.paused_until.map(|t| t.to_string()), - always_send, - } - }) - .collect(); - alert_states.sort_by(|a, b| a.path.cmp(&b.path)); - Json(alert_states).into_response() - } else { - let files = state.scheduler.get_loaded_alerts().await; - let alerts: Vec = 
files.iter().map(|p| p.display().to_string()).collect(); - Json(alerts).into_response() - } -} - -#[cfg(test)] -mod tests { - use axum::{ - extract::{Query, State}, - http::StatusCode, - response::IntoResponse, - }; - - use super::*; - use crate::http_server::test_utils::create_test_state; - - #[tokio::test] - async fn test_alerts_endpoint() { - let state = create_test_state().await; - - let query = Query(AlertsQuery { detail: false }); - let response = handle_alerts(State(state), query).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let alerts: Vec = serde_json::from_slice(&body).unwrap(); - - // Should be empty for test state - assert!(alerts.is_empty()); - } - - #[tokio::test] - async fn test_alerts_endpoint_with_detail() { - let state = create_test_state().await; - - let query = Query(AlertsQuery { detail: true }); - let response = handle_alerts(State(state), query).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let alert_states: Vec = serde_json::from_slice(&body).unwrap(); - - // Should be empty for test state - assert!(alert_states.is_empty()); - } -} diff --git a/crates/alertd/src/http_server/endpoints/index.rs b/crates/alertd/src/http_server/endpoints/index.rs index 8af04762..7e09cc87 100644 --- a/crates/alertd/src/http_server/endpoints/index.rs +++ b/crates/alertd/src/http_server/endpoints/index.rs @@ -7,31 +7,6 @@ pub async fn handle_index() -> impl IntoResponse { "path": "/", "description": "List of available endpoints" }, - { - "method": "POST", - "path": "/reload", - "description": "Trigger a configuration reload (equivalent to SIGHUP)" - }, - { - "method": "GET", - "path": "/alerts", - "description": "List currently loaded alert files" - }, - { - "method": "DELETE", - "path": "/alerts", - "description": "Temporarily 
pause an alert until the specified timestamp (JSON body: {\"alert\": \"PATH\", \"until\": \"TIMESTAMP\"})" - }, - { - "method": "GET", - "path": "/targets", - "description": "List all currently loaded external targets" - }, - { - "method": "POST", - "path": "/validate", - "description": "Validate an alert definition (send YAML as request body, returns validation result as JSON)" - }, { "method": "GET", "path": "/metrics", @@ -46,6 +21,11 @@ pub async fn handle_index() -> impl IntoResponse { "method": "GET", "path": "/health", "description": "Health check endpoint (returns 200 if healthy, 530 if stalled)" + }, + { + "method": "GET", + "path": "/tasks/{task}/{endpoint}", + "description": "Invoke an endpoint exposed by a registered background task" } ]); diff --git a/crates/alertd/src/http_server/endpoints/pause_alert.rs b/crates/alertd/src/http_server/endpoints/pause_alert.rs deleted file mode 100644 index 83366121..00000000 --- a/crates/alertd/src/http_server/endpoints/pause_alert.rs +++ /dev/null @@ -1,41 +0,0 @@ -use std::{path::PathBuf, sync::Arc}; - -use axum::{Json, extract::State, http::StatusCode, response::IntoResponse}; -use tracing::{error, info}; - -use crate::http_server::{state::ServerState, types::PauseAlertRequest}; - -pub async fn handle_pause_alert( - State(state): State>, - Json(payload): Json, -) -> impl IntoResponse { - info!(alert = %payload.alert, until = %payload.until, "pausing alert"); - - let until = match payload.until.parse::() { - Ok(ts) => ts, - Err(e) => { - error!("failed to parse timestamp: {e:?}"); - return ( - StatusCode::BAD_REQUEST, - format!("Invalid timestamp: {}\n", e), - ) - .into_response(); - } - }; - - let path = PathBuf::from(&payload.alert); - match state.scheduler.pause_alert(&path, until).await { - Ok(true) => { - info!("alert paused successfully"); - (StatusCode::OK, "Alert paused\n").into_response() - } - Ok(false) => { - info!("alert not found"); - (StatusCode::NOT_FOUND, "Alert not found\n").into_response() - } - 
Err(e) => { - error!("failed to pause alert: {e:?}"); - (StatusCode::INTERNAL_SERVER_ERROR, "Failed to pause alert\n").into_response() - } - } -} diff --git a/crates/alertd/src/http_server/endpoints/reload.rs b/crates/alertd/src/http_server/endpoints/reload.rs deleted file mode 100644 index d111a674..00000000 --- a/crates/alertd/src/http_server/endpoints/reload.rs +++ /dev/null @@ -1,70 +0,0 @@ -use std::sync::Arc; - -use axum::{extract::State, http::StatusCode, response::IntoResponse}; -use tracing::{error, info}; - -use crate::http_server::state::ServerState; - -pub async fn handle_reload(State(state): State>) -> impl IntoResponse { - match state.reload_tx.send(()).await { - Ok(()) => { - info!("reload triggered via HTTP"); - (StatusCode::OK, "Reload triggered\n") - } - Err(_) => { - error!("failed to send reload signal"); - ( - StatusCode::INTERNAL_SERVER_ERROR, - "Failed to trigger reload\n", - ) - } - } -} - -#[cfg(test)] -mod tests { - use std::sync::Arc; - - use axum::{extract::State, http::StatusCode, response::IntoResponse}; - use jiff::Timestamp; - use tokio::sync::mpsc; - - use super::*; - use crate::{alert::InternalContext, http_server::state::ServerState, scheduler::Scheduler}; - - #[tokio::test] - async fn test_reload_endpoint() { - let db_url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set for tests"); - let pool = bestool_postgres::pool::create_pool(&db_url, "bestool-alertd-test") - .await - .unwrap(); - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let scheduler = Arc::new(Scheduler::new(vec![], ctx.clone(), None, true, None)); - - let (reload_tx, mut reload_rx) = mpsc::channel::<()>(10); - - let state = Arc::new(ServerState { - reload_tx, - started_at: Timestamp::now(), - pid: std::process::id(), - event_manager: None, - internal_context: ctx, - email_config: None, - dry_run: true, - scheduler, - watchdog_timeout: None, - task_endpoints: 
Arc::new(std::collections::HashMap::new()), - }); - - let response = handle_reload(State(state)).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - - // Verify the reload signal was sent - assert!(reload_rx.try_recv().is_ok()); - } -} diff --git a/crates/alertd/src/http_server/endpoints/targets.rs b/crates/alertd/src/http_server/endpoints/targets.rs deleted file mode 100644 index 9a13a3e3..00000000 --- a/crates/alertd/src/http_server/endpoints/targets.rs +++ /dev/null @@ -1,27 +0,0 @@ -use std::sync::Arc; - -use axum::{Json, extract::State, response::IntoResponse}; - -use crate::http_server::state::ServerState; - -pub async fn handle_targets(State(state): State>) -> impl IntoResponse { - let targets = state.scheduler.get_external_targets().await; - Json(targets) -} - -#[cfg(test)] -mod tests { - use axum::{extract::State, http::StatusCode, response::IntoResponse}; - - use super::*; - use crate::http_server::test_utils::create_test_state; - - #[tokio::test] - async fn test_targets_endpoint() { - let state = create_test_state().await; - - let response = handle_targets(State(state)).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - } -} diff --git a/crates/alertd/src/http_server/endpoints/validate.rs b/crates/alertd/src/http_server/endpoints/validate.rs deleted file mode 100644 index 9bc0369f..00000000 --- a/crates/alertd/src/http_server/endpoints/validate.rs +++ /dev/null @@ -1,262 +0,0 @@ -use axum::{Json, http::StatusCode, response::IntoResponse}; - -use crate::{ - alert::{AlertDefinition, TicketSource}, - http_server::types::{ValidationInfo, ValidationResponse}, -}; - -pub async fn handle_validate(body: String) -> impl IntoResponse { - // Try to parse as YAML with serde_path_to_error for better error messages - let deserializer = serde_yaml::Deserializer::from_str(&body); - let alert: AlertDefinition = match serde_path_to_error::deserialize(deserializer) { - Ok(alert) => alert, - Err(err) => { - // Parse error - 
return detailed error information - let path = err.path().to_string(); - let inner = err.into_inner(); - let error_msg = format!("{}", inner); - - // The inner error is already a serde_yaml::Error, extract location if available - // Note: serde_yaml::Error doesn't expose location() in all cases - let response = ValidationResponse { - valid: false, - error: Some(format!("Parse error at '{}': {}", path, error_msg)), - error_location: None, // Location info is included in the error message - info: None, - }; - - return (StatusCode::OK, Json(response)).into_response(); - } - }; - - // Validate templates BEFORE normalizing (normalization clears send targets) - if let Err(err) = validate_templates(&alert) { - let response = ValidationResponse { - valid: false, - error: Some(format!("Template validation error: {:#}", err)), - error_location: None, - info: None, - }; - - return (StatusCode::OK, Json(response)).into_response(); - } - - // Try to normalize the alert (this validates send targets and other fields) - let external_targets = std::collections::HashMap::new(); - match alert.normalise(&external_targets) { - Ok((alert, resolved_targets)) => { - let source_type = match &alert.source { - TicketSource::Sql { .. } => "sql", - TicketSource::Shell { .. } => "shell", - TicketSource::Event { .. 
} => "event", - TicketSource::None => "none", - } - .to_string(); - - let response = ValidationResponse { - valid: true, - error: None, - error_location: None, - info: Some(ValidationInfo { - enabled: alert.enabled, - interval: alert.interval.clone(), - source_type, - targets: resolved_targets.len(), - }), - }; - - (StatusCode::OK, Json(response)).into_response() - } - Err(err) => { - // Normalization error (e.g., invalid interval, missing targets) - let response = ValidationResponse { - valid: false, - error: Some(format!("Validation error: {:#}", err)), - error_location: None, - info: None, - }; - - (StatusCode::OK, Json(response)).into_response() - } - } -} - -fn validate_templates(alert: &AlertDefinition) -> miette::Result<()> { - use crate::templates; - use miette::Context as _; - - // Validate each send target's templates by compiling them - // We only compile, not render, because we don't know the actual data structure - // that will be available at runtime (e.g., SQL column names, shell output format) - // Compilation catches syntax errors, which is the main goal - for (idx, target) in alert.send.iter().enumerate() { - // Load and compile templates for this target - // This will catch syntax errors like mismatched tags, invalid filters, etc. 
- templates::load_templates(target.subject(), target.template()) - .wrap_err_with(|| format!("validating templates for send target #{}", idx + 1))?; - } - - Ok(()) -} - -#[cfg(test)] -mod tests { - use axum::{http::StatusCode, response::IntoResponse}; - - use super::*; - - #[tokio::test] - async fn test_validate_valid_sql_alert() { - let yaml = r#" -sql: "SELECT 1" -send: - - id: test - subject: Test - template: Test -"#; - - let response = handle_validate(yaml.to_string()).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let validation: ValidationResponse = serde_json::from_slice(&body).unwrap(); - - assert!(validation.valid); - assert!(validation.info.is_some()); - } - - #[tokio::test] - async fn test_validate_valid_shell_alert() { - let yaml = r#" -shell: uptime -run: uptime -send: - - id: test - subject: Test - template: Test -"#; - - let response = handle_validate(yaml.to_string()).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let validation: ValidationResponse = serde_json::from_slice(&body).unwrap(); - - assert!(validation.valid); - assert!(validation.info.is_some()); - } - - #[tokio::test] - async fn test_validate_event_alert() { - let yaml = r#" -event: source-error -send: - - id: test - subject: Test - template: Test -"#; - - let response = handle_validate(yaml.to_string()).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let validation: ValidationResponse = serde_json::from_slice(&body).unwrap(); - - assert!(validation.valid); - assert!(validation.info.is_some()); - } - - #[tokio::test] - async fn test_validate_invalid_yaml() { - let yaml = "this is: not: valid: yaml:"; - - let response = 
handle_validate(yaml.to_string()).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let validation: ValidationResponse = serde_json::from_slice(&body).unwrap(); - - assert!(!validation.valid); - assert!(validation.error.is_some()); - } - - #[tokio::test] - async fn test_validate_template_syntax_error() { - let yaml = r#" -sql: "SELECT 1" -send: - - id: test - subject: Test - template: "{{ unclosed tag" -"#; - - let response = handle_validate(yaml.to_string()).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let validation: ValidationResponse = serde_json::from_slice(&body).unwrap(); - - assert!(!validation.valid); - assert!(validation.error.is_some()); - assert!(validation.error.unwrap().contains("Template")); - } - - #[tokio::test] - async fn test_validate_template_mismatched_tags() { - let yaml = r#" -sql: "SELECT 1" -send: - - id: test - subject: Test - template: "{% if foo %}bar" -"#; - - let response = handle_validate(yaml.to_string()).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let validation: ValidationResponse = serde_json::from_slice(&body).unwrap(); - - assert!(!validation.valid); - assert!(validation.error.is_some()); - } - - #[tokio::test] - async fn test_validate_multiple_targets() { - let yaml = r#" -sql: "SELECT 1" -send: - - id: test1 - subject: Test 1 - template: Test 1 - - id: test2 - subject: Test 2 - template: Test 2 -"#; - - let response = handle_validate(yaml.to_string()).await.into_response(); - - assert_eq!(response.status(), StatusCode::OK); - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let validation: ValidationResponse = 
serde_json::from_slice(&body).unwrap(); - - assert!(validation.valid); - let info = validation.info.as_ref().unwrap(); - // Should have 0 targets because we don't provide external targets - assert_eq!(info.targets, 0); - } -} diff --git a/crates/alertd/src/http_server/state.rs b/crates/alertd/src/http_server/state.rs index 8252cf1c..b81a9e67 100644 --- a/crates/alertd/src/http_server/state.rs +++ b/crates/alertd/src/http_server/state.rs @@ -1,23 +1,14 @@ use std::{collections::HashMap, sync::Arc, time::Duration}; use jiff::Timestamp; -use tokio::sync::mpsc; -use crate::{ - EmailConfig, alert::InternalContext, events::EventManager, scheduler::Scheduler, - tasks::TaskEndpointHandler, -}; +use crate::{context::InternalContext, tasks::TaskEndpointHandler}; #[derive(Clone)] pub struct ServerState { - pub reload_tx: mpsc::Sender<()>, pub started_at: Timestamp, pub pid: u32, - pub event_manager: Option>, pub internal_context: Arc, - pub email_config: Option, - pub dry_run: bool, - pub scheduler: Arc, pub watchdog_timeout: Option, /// Endpoint handlers exposed by registered background tasks. 
Keyed by
 	/// `(task_name, endpoint_name)` so the `/tasks/:task/:endpoint` route can
diff --git a/crates/alertd/src/http_server/test_utils.rs b/crates/alertd/src/http_server/test_utils.rs
index a2272855..2bd4ef9f 100644
--- a/crates/alertd/src/http_server/test_utils.rs
+++ b/crates/alertd/src/http_server/test_utils.rs
@@ -1,9 +1,8 @@
 use std::{collections::HashMap, sync::Arc};
 
 use jiff::Timestamp;
-use tokio::sync::mpsc;
 
-use crate::{alert::InternalContext, scheduler::Scheduler};
+use crate::context::InternalContext;
 
 use super::ServerState;
 
@@ -17,25 +16,11 @@ pub async fn create_test_state() -> Arc<ServerState> {
 		http_client: reqwest::Client::new(),
 		canopy_client: None,
 	});
-	let scheduler = Arc::new(Scheduler::new(
-		vec![],
-		ctx.clone(),
-		None,
-		true, // dry_run
-		None, // server_kind
-	));
-
-	let (reload_tx, _reload_rx) = mpsc::channel::<()>(10);
 
 	Arc::new(ServerState {
-		reload_tx,
 		started_at: Timestamp::now(),
 		pid: std::process::id(),
-		event_manager: None,
 		internal_context: ctx,
-		email_config: None,
-		dry_run: true,
-		scheduler,
 		watchdog_timeout: Some(std::time::Duration::from_secs(600)),
 		task_endpoints: Arc::new(HashMap::new()),
 	})
diff --git a/crates/alertd/src/http_server/types.rs b/crates/alertd/src/http_server/types.rs
index cce69a34..cd05c48a 100644
--- a/crates/alertd/src/http_server/types.rs
+++ b/crates/alertd/src/http_server/types.rs
@@ -7,52 +7,3 @@ pub struct StatusResponse {
 	pub started_at: String,
 	pub pid: u32,
 }
-
-#[derive(Deserialize)]
-pub struct PauseAlertRequest {
-	pub alert: String,
-	pub until: String,
-}
-
-#[derive(Serialize, Deserialize)]
-pub struct ValidationResponse {
-	pub valid: bool,
-	#[serde(skip_serializing_if = "Option::is_none")]
-	pub error: Option<String>,
-	#[serde(skip_serializing_if = "Option::is_none")]
-	pub error_location: Option<ErrorLocation>,
-	#[serde(skip_serializing_if = "Option::is_none")]
-	pub info: Option<ValidationInfo>,
-}
-
-#[derive(Serialize, Deserialize)]
-pub struct ErrorLocation {
-	pub line: usize,
-	pub column: usize,
-	pub path: String,
-}
-
-#[derive(Serialize, Deserialize)] -pub struct ValidationInfo { - pub enabled: bool, - pub interval: String, - pub source_type: String, - pub targets: usize, -} - -#[derive(Debug, Deserialize)] -pub struct AlertsQuery { - #[serde(default)] - pub detail: bool, -} - -#[derive(Debug, Serialize, Deserialize)] -pub struct AlertStateInfo { - pub path: String, - pub enabled: bool, - pub interval: String, - pub triggered_at: Option, - pub last_sent_at: Option, - pub paused_until: Option, - pub always_send: String, -} diff --git a/crates/alertd/src/lib.rs b/crates/alertd/src/lib.rs index 1db2e10d..9633ec25 100644 --- a/crates/alertd/src/lib.rs +++ b/crates/alertd/src/lib.rs @@ -3,33 +3,18 @@ use std::{fmt, sync::Arc, time::Duration}; pub use bestool_canopy as canopy; pub use bestool_canopy::Redacted; -mod alert; -pub mod commands; +mod context; mod daemon; pub mod doctor; -mod events; -mod glob_resolver; pub mod http_server; -mod loader; mod metrics; - -pub mod scheduler; -pub mod state_file; -mod targets; pub mod tasks; -pub mod templates; #[cfg(windows)] pub mod windows_service; -pub use alert::{ - AlertDefinition, AlwaysSend, InternalContext, TicketSource, WhenChanged, server_kind_matches, -}; -pub use daemon::{run, run_with_shutdown, run_with_shutdown_and_reload}; -pub use events::EventType; -pub use targets::{ - AlertTargets, ExternalTarget, ResolvedTarget, SendTarget, TargetConnection, TargetEmail, -}; +pub use context::InternalContext; +pub use daemon::{run, run_with_shutdown}; pub use tasks::{BackgroundTask, TaskContext, TaskEndpoint, TaskEndpointResponse}; /// The version of the alertd library @@ -49,24 +34,9 @@ pub fn http_client() -> reqwest::Client { .build() .expect("failed to build alertd HTTP client") } - -/// Email server configuration -#[derive(Debug, Clone)] -pub struct EmailConfig { - pub from: String, - pub mailgun_api_key: String, - pub mailgun_domain: String, -} - /// Configuration for the alertd daemon #[derive(Clone)] pub struct DaemonConfig { - /// 
Glob patterns for directories/files containing alert definitions - /// - /// Patterns are resolved to directories and files, and watched for changes. - /// On occasion, patterns are re-evaluated to pick up newly created paths. - pub alert_globs: Vec, - /// Database connection pool, opened by the caller. /// /// Centralising pool creation at the caller lets `bestool tamanu alertd` @@ -74,29 +44,21 @@ pub struct DaemonConfig { /// lookup) instead of opening additional short-lived connections. pub pg_pool: bestool_postgres::pool::PgPool, - /// Database connection URL, retained for redacted display and as a - /// substitution variable in alert templates (e.g. the `DatabaseDown` - /// event context). + /// Database connection URL, retained for redacted display. pub database_url: String, - /// Email server configuration - pub email: Option, - - /// Tamanu device key PEM, used as the client identity for canopy targets. + /// Tamanu device key PEM, used as the client identity for canopy. /// /// Held only long enough to build the canopy `reqwest::Client` at startup, /// then dropped. Wrapped in `Redacted` so debug-logging the config can't /// leak the key. pub device_key_pem: Option>, - /// Tamanu version of the install this daemon is alerting for. Sent in the + /// Tamanu version of the install this daemon is monitoring. Sent in the /// `X-Version` header on every canopy request — canopy rejects requests /// without one. pub tamanu_version: String, - /// Whether to perform a dry run (execute all alerts once and quit) - pub dry_run: bool, - /// Whether to disable the HTTP server pub no_server: bool, @@ -105,35 +67,24 @@ pub struct DaemonConfig { /// Watchdog timeout duration /// - /// If no alert task reports activity within this duration, the daemon + /// If no background task reports activity within this duration, the daemon /// will exit with an error so it can be restarted by the service manager. /// Set to `None` to disable the watchdog. 
pub watchdog_timeout: Option, - /// Background tasks to run on a schedule alongside the alert scheduler. + /// Background tasks to run on a schedule. /// /// Each task ticks at its own `interval()`. Errors are logged but do not /// kill the daemon. Activity from each tick counts towards the watchdog. pub background_tasks: Vec>, - - /// Opaque label identifying this daemon's deployment role, used to - /// filter alert definitions by their `server-kind:` field. Alertd is - /// agnostic about what the string means — it's whatever the configurer - /// (e.g. `bestool tamanu alertd`) decides to pass through. `None` means - /// "no filtering": every alert applies regardless of its declared - /// `server-kind`. - pub server_kind: Option, } impl fmt::Debug for DaemonConfig { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { f.debug_struct("DaemonConfig") - .field("alert_globs", &self.alert_globs) .field("database_url", &self.database_url) - .field("email", &self.email) .field("device_key_pem", &self.device_key_pem) .field("tamanu_version", &self.tamanu_version) - .field("dry_run", &self.dry_run) .field("no_server", &self.no_server) .field("server_addrs", &self.server_addrs) .field("watchdog_timeout", &self.watchdog_timeout) @@ -145,59 +96,38 @@ impl fmt::Debug for DaemonConfig { .map(|t| t.name()) .collect::>(), ) - .field("server_kind", &self.server_kind) .finish() } } impl DaemonConfig { pub fn new( - alert_globs: Vec, pg_pool: bestool_postgres::pool::PgPool, database_url: String, tamanu_version: String, ) -> Self { Self { - alert_globs, pg_pool, database_url, - email: None, device_key_pem: None, tamanu_version, - dry_run: false, no_server: false, server_addrs: Vec::new(), watchdog_timeout: Some(Duration::from_secs(10 * 60)), background_tasks: Vec::new(), - server_kind: None, } } - pub fn with_server_kind(mut self, kind: impl Into) -> Self { - self.server_kind = Some(kind.into()); - self - } - pub fn with_task(mut self, task: Arc) -> Self { 
self.background_tasks.push(task); self } - pub fn with_email(mut self, email: EmailConfig) -> Self { - self.email = Some(email); - self - } - pub fn with_device_key_pem(mut self, pem: String) -> Self { self.device_key_pem = Some(Redacted(pem)); self } - pub fn with_dry_run(mut self, dry_run: bool) -> Self { - self.dry_run = dry_run; - self - } - pub fn with_no_server(mut self, no_server: bool) -> Self { self.no_server = no_server; self diff --git a/crates/alertd/src/loader.rs b/crates/alertd/src/loader.rs deleted file mode 100644 index 3a071a92..00000000 --- a/crates/alertd/src/loader.rs +++ /dev/null @@ -1,384 +0,0 @@ -use std::{collections::HashMap, path::Path}; - -use miette::Result; -use tracing::{debug, error, warn}; -use walkdir::WalkDir; - -use crate::{ - LogError, - alert::{AlertDefinition, server_kind_matches}, - canopy::{DEFAULT_CANOPY_URL, Severity}, - glob_resolver::ResolvedPaths, - targets::{AlertTargets, CanopyConfig, ExternalTarget, TargetCanopy, TargetConnection}, -}; - -pub struct LoadedAlerts { - pub alerts: Vec<(AlertDefinition, Vec)>, - pub external_targets: HashMap>, - pub definition_errors: Vec, -} - -#[derive(Debug, Clone)] -pub struct DefinitionError { - pub file: std::path::PathBuf, - pub error: String, -} - -pub fn load_alerts_from_paths( - resolved: &ResolvedPaths, - canopy_available: bool, - server_kind: Option<&str>, -) -> Result { - let mut alerts = Vec::::new(); - let mut external_targets = HashMap::new(); - let mut definition_errors = Vec::new(); - - // Load external targets from files - for external_targets_path in &resolved.files { - if let Some(name) = external_targets_path.file_name() - && (name.eq_ignore_ascii_case("_targets.yml") - || name.eq_ignore_ascii_case("_targets.yaml")) - && let Some(AlertTargets { targets }) = std::fs::read_to_string(external_targets_path) - .ok() - .and_then(|content| { - debug!(path=?external_targets_path, "parsing external targets"); - serde_yaml::from_str::(&content) - .map_err( - |err| 
warn!(path=?external_targets_path, "_targets.yml has errors! {err}"), - ) - .ok() - }) { - debug!(path=?external_targets_path, count=targets.len(), "loaded external targets from file"); - for target in targets { - debug!(id=%target.id, path=?external_targets_path, "adding external target"); - external_targets - .entry(target.id.clone()) - .or_insert(Vec::new()) - .push(target); - } - } - } - - // Load external targets from directories - for dir in &resolved.dirs { - for external_targets_path in [dir.join("_targets.yml"), dir.join("_targets.yaml")] { - if let Some(AlertTargets { targets }) = std::fs::read_to_string(&external_targets_path) - .ok() - .and_then(|content| { - debug!(path=?external_targets_path, "parsing external targets"); - serde_yaml::from_str::(&content) - .map_err( - |err| warn!(path=?external_targets_path, "_targets.yml has errors! {err}"), - ) - .ok() - }) { - debug!(path=?external_targets_path, count=targets.len(), "loaded external targets from directory"); - for target in targets { - debug!(id=%target.id, path=?external_targets_path, "adding external target"); - external_targets - .entry(target.id.clone()) - .or_insert(Vec::new()) - .push(target); - } - } - } - } - - // Load alerts from directories (recursively) - for dir in &resolved.dirs { - for entry in WalkDir::new(dir) - .into_iter() - .filter_map(|e| e.ok()) - .filter(|e| e.file_type().is_file()) - { - match load_alert_from_file(entry.path()) { - LoadAlertResult::Success(alert) => { - push_if_targeted(&mut alerts, alert, server_kind) - } - LoadAlertResult::Error(err) => definition_errors.push(err), - LoadAlertResult::Disabled | LoadAlertResult::Skip => {} - } - } - } - - // Load alerts from individual files - for file in &resolved.files { - match load_alert_from_file(file) { - LoadAlertResult::Success(alert) => push_if_targeted(&mut alerts, alert, server_kind), - LoadAlertResult::Error(err) => definition_errors.push(err), - LoadAlertResult::Disabled | LoadAlertResult::Skip => {} - } - } - 
- if !external_targets.is_empty() { - debug!( - count=%external_targets.len(), - ids=?external_targets.keys().collect::>(), - "found external targets" - ); - } else { - warn!("no external targets found"); - } - - // If no `default` target was explicitly configured and canopy auth is - // available, register a synthesised canopy target under "default" so - // alerts that reference `id: default` (and the event-manager fallback) - // route to canopy automatically. - if canopy_available && !external_targets.contains_key("default") { - debug!("no 'default' target configured, synthesising canopy default"); - external_targets.insert( - "default".to_string(), - vec![ExternalTarget { - id: "default".to_string(), - conn: TargetConnection::Canopy(TargetCanopy { - canopy: CanopyConfig { - url: DEFAULT_CANOPY_URL - .parse() - .expect("default canopy URL is valid"), - source: "bestool-alertd".to_string(), - severity: Some(Severity::Error), - }, - }), - }], - ); - } - - let alerts_with_targets: Vec<_> = alerts - .into_iter() - .filter_map(|alert| { - let file = alert.file.clone(); - let send_target_ids: Vec<_> = alert.send.iter().map(|t| t.id()).collect(); - debug!( - file=?file, - send_targets=?send_target_ids, - available_targets=?external_targets.keys().collect::>(), - "normalising alert" - ); - match alert.normalise(&external_targets) { - Ok(normalized) => Some(normalized), - Err(err) => { - error!(file=?file, "failed to normalise alert: {}", LogError(&err)); - definition_errors.push(DefinitionError { - file: file.clone(), - error: format!("{:#}", err), - }); - None - } - } - }) - .collect(); - - debug!(count=%alerts_with_targets.len(), "found some alerts"); - - if !definition_errors.is_empty() { - warn!(count=%definition_errors.len(), "found alert definition errors"); - } - - Ok(LoadedAlerts { - alerts: alerts_with_targets, - external_targets, - definition_errors, - }) -} - -/// Push an alert onto the accumulator iff its `server-kind:` matches the -/// daemon's configured 
server kind. Logs the drop at debug so an operator -/// wondering where a facility-only alert went can spot it in trace output. -fn push_if_targeted( - alerts: &mut Vec, - alert: AlertDefinition, - server_kind: Option<&str>, -) { - if server_kind_matches(alert.server_kind.as_deref(), server_kind) { - alerts.push(alert); - } else { - debug!( - file = %alert.file.display(), - alert_kind = ?alert.server_kind, - daemon_kind = ?server_kind, - "skipping alert: server-kind does not match daemon" - ); - } -} - -enum LoadAlertResult { - Success(AlertDefinition), - Disabled, - Skip, - Error(DefinitionError), -} - -fn load_alert_from_file(file: &Path) -> LoadAlertResult { - if !file.extension().is_some_and(|e| e == "yaml" || e == "yml") { - return LoadAlertResult::Skip; - } - - if file.file_stem().is_some_and(|n| n == "_targets") { - return LoadAlertResult::Skip; - } - - debug!(?file, "parsing YAML file"); - let content = match std::fs::read_to_string(file) { - Ok(content) => content, - Err(err) => { - error!(?file, "failed to read file: {err}"); - return LoadAlertResult::Error(DefinitionError { - file: file.to_path_buf(), - error: format!("Failed to read file: {}", err), - }); - } - }; - - let mut alert: AlertDefinition = match serde_yaml::from_str(&content) { - Ok(alert) => alert, - Err(err) => { - error!(?file, "failed to parse YAML: {err}"); - return LoadAlertResult::Error(DefinitionError { - file: file.to_path_buf(), - error: format!("Failed to parse YAML: {}", err), - }); - } - }; - - alert.file = file.to_path_buf(); - debug!(?alert, "parsed alert file"); - - if alert.enabled { - LoadAlertResult::Success(alert) - } else { - LoadAlertResult::Disabled - } -} - -#[cfg(test)] -mod tests { - use super::*; - use tempfile::TempDir; - - fn empty_resolved(dir: &Path) -> ResolvedPaths { - ResolvedPaths { - dirs: vec![dir.to_path_buf()], - files: vec![], - } - } - - #[test] - fn canopy_default_injected_when_no_targets_and_canopy_available() { - let tmp = TempDir::new().unwrap(); - 
let resolved = empty_resolved(tmp.path()); - - let loaded = load_alerts_from_paths(&resolved, true, None).unwrap(); - assert!(loaded.external_targets.contains_key("default")); - let default = &loaded.external_targets["default"][0]; - assert_eq!(default.id, "default"); - assert!(matches!(default.conn, TargetConnection::Canopy(_))); - } - - #[test] - fn no_canopy_default_when_canopy_unavailable() { - let tmp = TempDir::new().unwrap(); - let resolved = empty_resolved(tmp.path()); - - let loaded = load_alerts_from_paths(&resolved, false, None).unwrap(); - assert!(loaded.external_targets.is_empty()); - } - - #[test] - fn explicit_default_takes_precedence_over_canopy_synth() { - let tmp = TempDir::new().unwrap(); - std::fs::write( - tmp.path().join("_targets.yml"), - r#" -targets: - - id: default - addresses: [team@example.com] -"#, - ) - .unwrap(); - let resolved = empty_resolved(tmp.path()); - - let loaded = load_alerts_from_paths(&resolved, true, None).unwrap(); - let default = &loaded.external_targets["default"][0]; - // User's explicit email default wins; no canopy injection. 
- assert!(matches!(default.conn, TargetConnection::Email(_))); - assert_eq!(loaded.external_targets["default"].len(), 1); - } - - #[test] - fn alert_referencing_default_resolves_to_synth_canopy() { - let tmp = TempDir::new().unwrap(); - std::fs::write( - tmp.path().join("disk.yml"), - r#" -sql: "SELECT 1" -send: - - id: default - subject: "Test" - template: "Body" -"#, - ) - .unwrap(); - let resolved = empty_resolved(tmp.path()); - - let loaded = load_alerts_from_paths(&resolved, true, None).unwrap(); - assert_eq!(loaded.alerts.len(), 1); - let (_, resolved_targets) = &loaded.alerts[0]; - assert_eq!(resolved_targets.len(), 1); - assert!(matches!( - resolved_targets[0].conn, - TargetConnection::Canopy(_) - )); - } - - fn write_alert(dir: &Path, name: &str, server_kind: Option<&str>) { - let server_kind_line = server_kind - .map(|t| format!("server-kind: {t}\n")) - .unwrap_or_default(); - std::fs::write( - dir.join(name), - format!( - "sql: \"SELECT 1\"\n\ - send:\n - id: default\n subject: \"x\"\n template: \"y\"\n\ - {server_kind_line}" - ), - ) - .unwrap(); - } - - #[test] - fn target_filter_keeps_matching_alerts_only() { - let tmp = TempDir::new().unwrap(); - write_alert(tmp.path(), "central-only.yml", Some("central")); - write_alert(tmp.path(), "facility-only.yml", Some("facility")); - write_alert(tmp.path(), "kiosk-only.yml", Some("kiosk")); - write_alert(tmp.path(), "no-target.yml", None); - let resolved = empty_resolved(tmp.path()); - - let loaded = load_alerts_from_paths(&resolved, true, Some("central")).unwrap(); - let kept: Vec = loaded - .alerts - .iter() - .map(|(a, _)| a.file.file_name().unwrap().to_string_lossy().into_owned()) - .collect(); - assert!(kept.contains(&"central-only.yml".into())); - assert!(kept.contains(&"no-target.yml".into())); - assert!(!kept.contains(&"facility-only.yml".into())); - assert!( - !kept.contains(&"kiosk-only.yml".into()), - "alertd matches the daemon's server_kind by string equality; unrelated kinds are dropped" - ); - } 
- - #[test] - fn target_filter_absent_kind_admits_everything() { - // alertd running with no `server_kind` configured (e.g. outside a - // Tamanu install) shouldn't silently swallow targeted alerts. - let tmp = TempDir::new().unwrap(); - write_alert(tmp.path(), "central-only.yml", Some("central")); - write_alert(tmp.path(), "facility-only.yml", Some("facility")); - let resolved = empty_resolved(tmp.path()); - - let loaded = load_alerts_from_paths(&resolved, true, None).unwrap(); - assert_eq!(loaded.alerts.len(), 2); - } -} diff --git a/crates/alertd/src/main.rs b/crates/alertd/src/main.rs deleted file mode 100644 index e430252e..00000000 --- a/crates/alertd/src/main.rs +++ /dev/null @@ -1,392 +0,0 @@ -use clap::{Parser, Subcommand}; -use lloggs::{LoggingArgs, PreArgs, WorkerGuard}; -use miette::{Result, miette}; -use tracing::debug; - -/// BES tooling: Alert daemon -/// -/// The daemon watches for changes to alert definition files and automatically reloads -/// when changes are detected. You can also send SIGHUP to manually trigger a reload. -/// -/// On Windows, the daemon can be installed as a native Windows service using the -/// 'install' subcommand. See 'bestool-alertd install --help' for details. -/// -/// The alert and target definitions are documented online at: -/// -/// and . -#[derive(Debug, Clone, Parser)] -pub struct Args { - #[command(flatten)] - logging: LoggingArgs, - - #[command(subcommand)] - command: Command, -} - -/// Common arguments for running the daemon -#[derive(Debug, Clone, Parser)] -struct DaemonArgs { - /// Database connection URL - /// - /// PostgreSQL connection URL, e.g., postgresql://user:pass@localhost/dbname - #[arg(long, env = "DATABASE_URL")] - database_url: Option, - - /// Glob patterns for alert definitions - /// - /// Patterns can match directories (which will be read recursively) or individual files. - /// Can be provided multiple times. 
- /// Examples: /etc/tamanu/alerts, /opt/*/alerts, /etc/tamanu/alerts/**/*.yml - #[arg(long)] - glob: Vec, - - /// Email sender address - #[arg(long, env = "EMAIL_FROM")] - email_from: Option, - - /// Mailgun API key - #[arg(long, env = "MAILGUN_API_KEY")] - mailgun_api_key: Option, - - /// Mailgun domain - #[arg(long, env = "MAILGUN_DOMAIN")] - mailgun_domain: Option, - - /// Tamanu version of the install this daemon alerts for. Sent on every - /// canopy request via the `X-Version` header. - #[arg(long, env = "TAMANU_VERSION", default_value = "0.0.0")] - tamanu_version: String, - - /// Path to a Tamanu device key PEM, used as client identity for canopy targets. - /// - /// Required for any alert that targets a canopy `/events` endpoint. The key - /// is the same value Tamanu stores in `local_system_facts(key='deviceKey')`; - /// only the private key is read (a fresh self-signed cert is generated from - /// it at startup). - #[arg(long, env = "DEVICE_KEY_FILE")] - device_key_file: Option, - - /// Execute all alerts once and quit (ignoring intervals) - #[arg(long)] - dry_run: bool, - - /// Disable the HTTP server - #[arg(long)] - no_server: bool, - - /// HTTP server bind address(es) - /// - /// Can be provided multiple times. The server will attempt to bind to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - - /// Watchdog timeout in seconds - /// - /// If no alert task reports activity within this many seconds, the daemon - /// will exit so the service manager can restart it. Defaults to 600 (10 minutes). - #[arg(long, default_value = "600")] - watchdog_timeout: u64, - - /// Disable the watchdog - /// - /// By default, the daemon will exit if no alert activity is detected within - /// the watchdog timeout. This flag disables that behavior. 
- #[arg(long)] - no_watchdog: bool, -} - -#[derive(Debug, Clone, Subcommand)] -enum Command { - /// Run the alert daemon - /// - /// Starts the daemon which monitors alert definition files and executes alerts - /// based on their configured schedules. The daemon will watch for file changes - /// and automatically reload when definitions are modified. - Run { - #[command(flatten)] - daemon: DaemonArgs, - }, - - /// Show status and health of a running daemon - /// - /// Connects to the running daemon's HTTP API and displays version, uptime, - /// health, and watchdog information. Exits with code 1 if the daemon is unhealthy. - Status { - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - - /// Send reload signal to running daemon - /// - /// Connects to the running daemon's HTTP API and triggers a reload. - /// This is an alternative to SIGHUP that works on all platforms including Windows. - Reload { - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - - /// List currently loaded alert files - /// - /// Connects to the running daemon's HTTP API and retrieves the list of - /// currently loaded alert definition files. - LoadedAlerts { - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - - /// Show detailed state information for each alert - #[arg(long)] - detail: bool, - }, - - /// Temporarily pause an alert - /// - /// Pauses an alert until the specified time. The alert will not execute during - /// this period. 
The pause is lost when the daemon restarts. - PauseAlert { - /// Alert file path to pause - alert: String, - - /// Time until which to pause the alert (fuzzy time format) - /// - /// Examples: "1 hour", "2 days", "next monday", "2024-12-25T10:00:00Z" - /// Defaults to 1 week from now if not specified. - #[arg(long)] - until: Option, - - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - - /// Validate an alert definition file - /// - /// Parses an alert definition file and reports any syntax or validation errors. - /// Uses pretty error reporting to pinpoint the exact location of problems. - /// Requires the daemon to be running. - Validate { - /// Path to the alert definition file to validate - file: std::path::PathBuf, - - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - - /// Install the daemon as a Windows service - /// - /// Creates a Windows service named 'bestool-alertd' that will start automatically - /// and starts it immediately. - #[cfg(windows)] - Install, - - /// Uninstall the Windows service - /// - /// Stops the 'bestool-alertd' Windows service if running and then removes it. - #[cfg(windows)] - Uninstall, - - /// Configure failure recovery on an existing Windows service - /// - /// Updates the 'bestool-alertd' service to automatically restart on failure. - /// This is done automatically on new installs, but can be run separately to - /// update an already-installed service. 
- #[cfg(windows)] - ConfigureRecovery, - - #[cfg(windows)] - #[command(hide = true)] - Service { - #[command(flatten)] - daemon: DaemonArgs, - }, - - /// Generate markdown documentation - #[command(hide = true, name = "_docs")] - Docs, -} - -fn get_args() -> Result<(Args, WorkerGuard)> { - let log_guard = PreArgs::parse().setup().map_err(|err| miette!("{err}"))?; - - debug!("parsing arguments"); - let args = Args::parse(); - - let log_guard = match log_guard { - Some(g) => g, - None => args - .logging - .setup(|v| match v { - 0 => "bestool_alertd=info", - 1 => "info,bestool_alertd=debug", - 2 => "debug", - 3 => "debug,bestool_alertd=trace", - _ => "trace", - }) - .map_err(|err| miette!("{err}"))?, - }; - - debug!(?args, "got arguments"); - Ok((args, log_guard)) -} - -async fn build_daemon_config(daemon: DaemonArgs) -> Result { - let database_url = daemon - .database_url - .ok_or_else(|| miette!("--database-url is required"))?; - - if daemon.glob.is_empty() { - return Err(miette!("at least one --glob must be specified")); - } - - let email = match ( - daemon.email_from, - daemon.mailgun_api_key, - daemon.mailgun_domain, - ) { - (Some(from), Some(api_key), Some(domain)) => Some(bestool_alertd::EmailConfig { - from, - mailgun_api_key: api_key, - mailgun_domain: domain, - }), - (None, None, None) => None, - _ => { - return Err(miette!( - "either provide all email options (--email-from, --mailgun-api-key, --mailgun-domain) or none" - )); - } - }; - - let watchdog_timeout = if daemon.no_watchdog { - None - } else { - Some(std::time::Duration::from_secs(daemon.watchdog_timeout)) - }; - - let device_key_pem = if let Some(path) = daemon.device_key_file { - Some( - std::fs::read_to_string(&path) - .map_err(|err| miette!("reading device key file {}: {err}", path.display()))?, - ) - } else { - None - }; - - let pg_pool = bestool_postgres::pool::create_pool(&database_url, "bestool-alertd").await?; - - let mut daemon_config = bestool_alertd::DaemonConfig::new( - daemon.glob, - 
pg_pool, - database_url, - daemon.tamanu_version, - ) - .with_dry_run(daemon.dry_run) - .with_no_server(daemon.no_server) - .with_server_addrs(daemon.server_addr) - .with_watchdog_timeout(watchdog_timeout); - - if let Some(email) = email { - daemon_config = daemon_config.with_email(email); - } - - if let Some(pem) = device_key_pem { - daemon_config = daemon_config.with_device_key_pem(pem); - } - - Ok(daemon_config) -} - -async fn run_daemon(daemon: DaemonArgs) -> Result<()> { - let daemon_config = build_daemon_config(daemon).await?; - bestool_alertd::run(daemon_config).await -} - -#[tokio::main] -async fn main() -> Result<()> { - let (args, _guard) = get_args()?; - - match args.command { - Command::Run { daemon } => run_daemon(daemon).await, - Command::Status { server_addr } => { - let addrs = if server_addr.is_empty() { - bestool_alertd::commands::default_server_addrs() - } else { - server_addr - }; - bestool_alertd::commands::get_status(&addrs).await - } - Command::Reload { server_addr } => { - let addrs = if server_addr.is_empty() { - bestool_alertd::commands::default_server_addrs() - } else { - server_addr - }; - bestool_alertd::commands::send_reload(&addrs).await - } - Command::LoadedAlerts { - server_addr, - detail, - } => { - let addrs = if server_addr.is_empty() { - bestool_alertd::commands::default_server_addrs() - } else { - server_addr - }; - bestool_alertd::commands::get_loaded_alerts(&addrs, detail).await - } - Command::PauseAlert { - alert, - until, - server_addr, - } => { - let addrs = if server_addr.is_empty() { - bestool_alertd::commands::default_server_addrs() - } else { - server_addr - }; - bestool_alertd::commands::pause_alert(&alert, until.as_deref(), &addrs).await - } - Command::Validate { file, server_addr } => { - let addrs = if server_addr.is_empty() { - bestool_alertd::commands::default_server_addrs() - } else { - server_addr - }; - bestool_alertd::commands::validate_alert(&file, &addrs).await - } - #[cfg(windows)] - Command::Install => 
bestool_alertd::windows_service::install_service(), - #[cfg(windows)] - Command::Uninstall => bestool_alertd::windows_service::uninstall_service(), - #[cfg(windows)] - Command::ConfigureRecovery => bestool_alertd::windows_service::configure_recovery(), - #[cfg(windows)] - Command::Service { daemon } => { - let daemon_config = build_daemon_config(daemon).await?; - bestool_alertd::windows_service::run_service(daemon_config) - } - Command::Docs => { - let markdown = clap_markdown::help_markdown::(); - println!("{}", markdown); - Ok(()) - } - } -} diff --git a/crates/alertd/src/metrics.rs b/crates/alertd/src/metrics.rs index 1b27a637..04c006cc 100644 --- a/crates/alertd/src/metrics.rs +++ b/crates/alertd/src/metrics.rs @@ -1,10 +1,6 @@ //! Prometheus metrics for the alertd daemon. //! //! Tracks the following metrics: -//! - `bes_alertd_alerts_loaded`: Number of alerts currently loaded (gauge) -//! - `bes_alertd_alerts_sent_total`: Total number of alerts sent successfully (counter) -//! - `bes_alertd_alerts_failed_total`: Total number of alerts that failed to send (counter) -//! - `bes_alertd_reloads_total`: Total number of configuration reloads (counter) //! 
- `bes_alertd_last_activity_unix`: Unix timestamp of the last activity (gauge) use std::sync::OnceLock; @@ -12,107 +8,31 @@ use std::sync::atomic::{AtomicI64, Ordering}; use jiff::Timestamp; use miette::{IntoDiagnostic, Result}; -use prometheus::{Encoder, IntCounter, IntGauge, Registry, TextEncoder}; +use prometheus::{Encoder, IntGauge, Registry, TextEncoder}; static REGISTRY: OnceLock<Registry> = OnceLock::new(); -static ALERTS_LOADED: OnceLock<IntGauge> = OnceLock::new(); -static ALERTS_SENT_TOTAL: OnceLock<IntCounter> = OnceLock::new(); -static ALERTS_FAILED_TOTAL: OnceLock<IntCounter> = OnceLock::new(); -static RELOADS_TOTAL: OnceLock<IntCounter> = OnceLock::new(); static LAST_ACTIVITY_GAUGE: OnceLock<IntGauge> = OnceLock::new(); static LAST_ACTIVITY: AtomicI64 = AtomicI64::new(0); pub fn init_metrics() { let registry = Registry::new(); - let alerts_loaded = IntGauge::new( - "bes_alertd_alerts_loaded", - "Number of alerts currently loaded", - ) - .expect("failed to create alerts_loaded metric"); - - let alerts_sent_total = IntCounter::new( - "bes_alertd_alerts_sent_total", - "Total number of alerts sent", - ) - .expect("failed to create alerts_sent_total metric"); - - let alerts_failed_total = IntCounter::new( - "bes_alertd_alerts_failed_total", - "Total number of alerts that failed to send", - ) - .expect("failed to create alerts_failed_total metric"); - - let reloads_total = IntCounter::new( - "bes_alertd_reloads_total", - "Total number of configuration reloads", - ) - .expect("failed to create reloads_total metric"); - let last_activity_gauge = IntGauge::new( "bes_alertd_last_activity_unix", "Unix timestamp of the last activity", ) .expect("failed to create last_activity_gauge metric"); - registry - .register(Box::new(alerts_loaded.clone())) - .expect("failed to register alerts_loaded metric"); - registry - .register(Box::new(alerts_sent_total.clone())) - .expect("failed to register alerts_sent_total metric"); - registry - .register(Box::new(alerts_failed_total.clone())) - .expect("failed to register alerts_failed_total 
metric"); - registry - .register(Box::new(reloads_total.clone())) - .expect("failed to register reloads_total metric"); registry .register(Box::new(last_activity_gauge.clone())) .expect("failed to register last_activity_gauge metric"); REGISTRY.set(registry).expect("metrics already initialized"); - ALERTS_LOADED - .set(alerts_loaded) - .expect("metrics already initialized"); - ALERTS_SENT_TOTAL - .set(alerts_sent_total) - .expect("metrics already initialized"); - ALERTS_FAILED_TOTAL - .set(alerts_failed_total) - .expect("metrics already initialized"); - RELOADS_TOTAL - .set(reloads_total) - .expect("metrics already initialized"); LAST_ACTIVITY_GAUGE .set(last_activity_gauge) .expect("metrics already initialized"); } -pub fn set_alerts_loaded(count: usize) { - if let Some(metric) = ALERTS_LOADED.get() { - metric.set(count as i64); - } -} - -pub fn inc_alerts_sent() { - if let Some(metric) = ALERTS_SENT_TOTAL.get() { - metric.inc(); - } -} - -pub fn inc_alerts_failed() { - if let Some(metric) = ALERTS_FAILED_TOTAL.get() { - metric.inc(); - } -} - -pub fn inc_reloads() { - if let Some(metric) = RELOADS_TOTAL.get() { - metric.inc(); - } -} - pub fn record_activity() { let now = Timestamp::now().as_second(); LAST_ACTIVITY.store(now, Ordering::Relaxed); diff --git a/crates/alertd/src/scheduler.rs b/crates/alertd/src/scheduler.rs deleted file mode 100644 index 74f75461..00000000 --- a/crates/alertd/src/scheduler.rs +++ /dev/null @@ -1,1009 +0,0 @@ -use std::{ - collections::{HashMap, HashSet}, - path::PathBuf, - sync::Arc, - time::Duration, -}; - -use jiff::Timestamp; -use miette::Result; -use tokio::{ - sync::{Mutex, Notify, RwLock}, - task::JoinHandle, - time::{interval, sleep}, -}; -use tracing::{debug, error, info, warn}; - -use crate::{ - EmailConfig, LogError, - alert::{AlertDefinition, InternalContext, TicketSource}, - events::{EventContext, EventManager, EventType}, - glob_resolver::{GlobResolver, ResolvedPaths}, - loader::{LoadedAlerts, load_alerts_from_paths}, - 
metrics, - state_file::{PersistedAlertState, PersistedState}, - targets::ResolvedTarget, -}; - -#[derive(Debug, Clone)] -pub struct AlertState { - pub definition: AlertDefinition, - pub resolved_targets: Vec, - pub triggered_at: Option, - pub last_sent_at: Option, - pub last_output: Option, - pub paused_until: Option, - /// Was the last source read for this alert an error? - /// - /// Used to detect the transition error → OK so a clearing canopy event - /// can be sent for the source-error issue. - pub source_was_erroring: bool, -} - -impl AlertState { - pub fn new(definition: AlertDefinition, resolved_targets: Vec) -> Self { - Self { - definition, - resolved_targets, - triggered_at: None, - last_sent_at: None, - last_output: None, - paused_until: None, - source_was_erroring: false, - } - } - - pub fn preserve_state_from(&mut self, old_state: &AlertState) { - self.triggered_at = old_state.triggered_at; - self.last_sent_at = old_state.last_sent_at; - self.last_output = old_state.last_output.clone(); - self.paused_until = old_state.paused_until; - self.source_was_erroring = old_state.source_was_erroring; - } - - pub fn hydrate_from_persisted(&mut self, entry: &PersistedAlertState) { - self.triggered_at = entry.triggered_at; - self.last_sent_at = entry.last_sent_at; - self.last_output = entry.last_output.clone(); - self.paused_until = entry.paused_until; - self.source_was_erroring = entry.source_was_erroring; - } - - pub fn to_persisted(&self) -> PersistedAlertState { - PersistedAlertState { - triggered_at: self.triggered_at, - last_sent_at: self.last_sent_at, - last_output: self.last_output.clone(), - paused_until: self.paused_until, - source_was_erroring: self.source_was_erroring, - } - } -} - -pub struct Scheduler { - glob_resolver: GlobResolver, - resolved_paths: Arc>, - ctx: Arc, - email: Option, - dry_run: bool, - alerts: Arc>>>>, - tasks: Arc>>>, - event_manager: Arc>>, - external_targets: - Arc>>>, - state_dirty: Arc, - pending_hydration: Arc>>, - /// Files 
that errored during definition loading on the previous - /// scheduling pass. Used to detect recovery so we can clear the - /// corresponding canopy issue. - last_definition_error_files: Arc>>, - /// Mirrors the daemon's database-down tracking. - /// - /// Kept on the scheduler so it can be persisted in the state snapshot - /// and restored on the next start — that way a recovery that happens - /// while the daemon was down still produces a canopy clear. - database_was_down: Arc>, - /// Configured Tamanu server kind, threaded into the loader so alerts - /// whose `target:` doesn't match get dropped at load time. - server_kind: Option, -} - -impl Scheduler { - pub fn new( - alert_globs: Vec, - ctx: Arc, - email: Option, - dry_run: bool, - server_kind: Option, - ) -> Self { - let glob_resolver = GlobResolver::new(alert_globs); - Self { - glob_resolver, - resolved_paths: Arc::new(RwLock::new(ResolvedPaths { - dirs: Vec::new(), - files: Vec::new(), - })), - ctx, - email, - dry_run, - alerts: Arc::new(RwLock::new(HashMap::new())), - tasks: Arc::new(RwLock::new(HashMap::new())), - event_manager: Arc::new(RwLock::new(None)), - external_targets: Arc::new(RwLock::new(HashMap::new())), - state_dirty: Arc::new(Notify::new()), - pending_hydration: Arc::new(Mutex::new(None)), - last_definition_error_files: Arc::new(RwLock::new(HashSet::new())), - database_was_down: Arc::new(RwLock::new(false)), - server_kind, - } - } - - /// Read the persisted database-down flag. - pub async fn database_was_down(&self) -> bool { - *self.database_was_down.read().await - } - - /// Update the persisted database-down flag. - pub async fn set_database_was_down(&self, value: bool) { - *self.database_was_down.write().await = value; - self.state_dirty.notify_one(); - } - - /// Handle used by the persistence task to wake when alert state changes. - pub fn state_dirty(&self) -> Arc { - self.state_dirty.clone() - } - - /// Seed the next `load_and_schedule_alerts` call with persisted state. 
- /// - /// Consumed on the next load (cold start). Reload calls leave previously - /// in-memory state in place via `preserve_state_from`, so hydration is a - /// cold-start-only concern. - pub async fn set_pending_hydration(&self, state: PersistedState) { - *self.pending_hydration.lock().await = Some(state); - } - - pub fn get_event_manager(&self) -> Arc>> { - self.event_manager.clone() - } - - pub async fn get_loaded_alerts(&self) -> Vec { - let alerts = self.alerts.read().await; - let mut files: Vec = alerts.keys().cloned().collect(); - files.sort(); - files - } - - pub async fn get_alert_states(&self) -> HashMap { - let alerts = self.alerts.read().await; - let mut states = HashMap::new(); - for (path, state_lock) in alerts.iter() { - let state = state_lock.read().await; - states.insert(path.clone(), state.clone()); - } - states - } - - pub async fn pause_alert(&self, path: &PathBuf, until: Timestamp) -> Result { - let alerts = self.alerts.read().await; - if let Some(alert_state) = alerts.get(path) { - let mut state = alert_state.write().await; - state.paused_until = Some(until); - info!(?path, until = %until, "paused alert"); - drop(state); - self.state_dirty.notify_one(); - Ok(true) - } else { - Ok(false) - } - } - - /// Snapshot the in-memory state for serialisation by the persistence task. 
- pub async fn snapshot_for_persistence(&self) -> PersistedState { - let alerts = self.alerts.read().await; - let mut out = HashMap::with_capacity(alerts.len()); - for (path, state_lock) in alerts.iter() { - let state = state_lock.read().await; - out.insert(path.clone(), state.to_persisted()); - } - PersistedState { - saved_at: Some(Timestamp::now()), - alerts: out, - database_was_down: *self.database_was_down.read().await, - definition_error_files: self.last_definition_error_files.read().await.clone(), - } - } - - pub async fn get_external_targets( - &self, - ) -> std::collections::HashMap> { - self.external_targets.read().await.clone() - } - - pub async fn load_and_schedule_alerts(&self) -> Result<()> { - info!("resolving glob patterns and loading alerts"); - - // Consume any pending hydration first so subsequent code can rely on - // hydrated daemon-level state (database_was_down, last definition - // errors). Per-alert hydration happens later from the same value. - let hydration = self.pending_hydration.lock().await.take(); - if let Some(ref h) = hydration { - *self.database_was_down.write().await = h.database_was_down; - *self.last_definition_error_files.write().await = h.definition_error_files.clone(); - } - - let resolved = self.glob_resolver.resolve()?; - debug!( - dirs = resolved.dirs.len(), - files = resolved.files.len(), - "resolved paths from globs" - ); - - let canopy_available = self.ctx.canopy_client.is_some(); - let LoadedAlerts { - alerts, - external_targets, - definition_errors, - } = load_alerts_from_paths(&resolved, canopy_available, self.server_kind.as_deref())?; - - // Update resolved paths - *self.resolved_paths.write().await = resolved; - - // Separate event alerts from regular alerts - let (event_alerts, regular_alerts): (Vec<_>, Vec<_>) = alerts - .into_iter() - .partition(|(alert, _)| matches!(alert.source, TicketSource::Event { .. 
})); - - // Store external targets - *self.external_targets.write().await = external_targets.clone(); - - // Create event manager with event alerts and external targets - let event_manager = EventManager::new(event_alerts, &external_targets); - *self.event_manager.write().await = Some(event_manager.clone()); - - // Send definition error events for any failed alert loads - if !definition_errors.is_empty() { - info!( - count = definition_errors.len(), - "triggering definition-error events for failed alerts" - ); - } - let new_error_files: HashSet = - definition_errors.iter().map(|e| e.file.clone()).collect(); - for def_err in definition_errors { - info!( - file = %def_err.file.display(), - "triggering definition-error event" - ); - let entity_key = def_err.file.display().to_string(); - let event_context = EventContext::DefinitionError { - alert_file: entity_key.clone(), - error_message: def_err.error.clone(), - }; - if let Err(err) = event_manager - .trigger_event( - EventType::DefinitionError, - &self.ctx, - self.email.as_ref(), - self.dry_run, - event_context, - Some(&entity_key), - ) - .await - { - error!( - "failed to trigger definition-error event: {}", - LogError(&err) - ); - } - } - - // Clear definition-error events for files that errored last time but - // loaded cleanly this time. 
- let mut last_def_errors = self.last_definition_error_files.write().await; - for recovered in last_def_errors.difference(&new_error_files) { - info!( - file = %recovered.display(), - "clearing definition-error event (file now loads cleanly)" - ); - let entity_key = recovered.display().to_string(); - if let Err(err) = event_manager - .trigger_clear( - EventType::DefinitionError, - &self.ctx, - self.dry_run, - Some(&entity_key), - ) - .await - { - error!("failed to clear definition-error event: {}", LogError(&err)); - } - } - *last_def_errors = new_error_files; - drop(last_def_errors); - - if regular_alerts.is_empty() { - warn!("no regular alerts found"); - return Ok(()); - } - - info!(count = regular_alerts.len(), "scheduling regular alerts"); - - // Get old alerts to preserve state across hot reload. - let old_alerts = self.alerts.read().await.clone(); - - // Hydration was taken at the top of this method; reuse it here for - // per-alert state. On subsequent reloads it'll be None and - // preserve_state_from carries in-memory state forward instead. 
- let mut hydrated_count = 0usize; - - let mut new_alerts = HashMap::new(); - let mut tasks = HashMap::new(); - - for (definition, resolved_targets) in regular_alerts { - let file = definition.file.clone(); - - // Create new alert state - let mut new_state = AlertState::new(definition.clone(), resolved_targets.clone()); - - if let Some(old_alert_lock) = old_alerts.get(&file) { - let old_state = old_alert_lock.read().await; - new_state.preserve_state_from(&old_state); - } else if let Some(entry) = hydration.as_ref().and_then(|h| h.alerts.get(&file)) { - new_state.hydrate_from_persisted(entry); - hydrated_count += 1; - } - - let state_lock = Arc::new(RwLock::new(new_state)); - let task = self.spawn_alert_task(state_lock.clone()); - - new_alerts.insert(file.clone(), state_lock); - tasks.insert(file, task); - } - - if hydrated_count > 0 { - info!(count = hydrated_count, "hydrated alert state from disk"); - } - - // Update alerts and tasks atomically - *self.alerts.write().await = new_alerts; - *self.tasks.write().await = tasks; - - // Update metrics with count of loaded alerts - metrics::set_alerts_loaded(self.alerts.read().await.len()); - - Ok(()) - } - - pub async fn execute_all_alerts_once(&self) -> Result<()> { - info!("executing all alerts once"); - - let resolved = self.glob_resolver.resolve()?; - let canopy_available = self.ctx.canopy_client.is_some(); - let LoadedAlerts { - alerts, - external_targets, - definition_errors, - } = load_alerts_from_paths(&resolved, canopy_available, self.server_kind.as_deref())?; - - // Separate event alerts from regular alerts - let (event_alerts, regular_alerts): (Vec<_>, Vec<_>) = alerts - .into_iter() - .partition(|(alert, _)| matches!(alert.source, TicketSource::Event { .. 
})); - - // Store external targets - *self.external_targets.write().await = external_targets.clone(); - - // Update event manager - let event_manager = EventManager::new(event_alerts, &external_targets); - *self.event_manager.write().await = Some(event_manager.clone()); - - // Send definition error events for any failed alert loads - if !definition_errors.is_empty() { - info!( - count = definition_errors.len(), - "triggering definition-error events for failed alerts" - ); - } - for def_err in definition_errors { - info!( - file = %def_err.file.display(), - "triggering definition-error event" - ); - let entity_key = def_err.file.display().to_string(); - let event_context = EventContext::DefinitionError { - alert_file: entity_key.clone(), - error_message: def_err.error.clone(), - }; - if let Err(err) = event_manager - .trigger_event( - EventType::DefinitionError, - &self.ctx, - self.email.as_ref(), - self.dry_run, - event_context, - Some(&entity_key), - ) - .await - { - error!( - "failed to trigger definition-error event: {}", - LogError(&err) - ); - } - } - - if regular_alerts.is_empty() { - warn!("no regular alerts found"); - return Ok(()); - } - - info!(count = regular_alerts.len(), "executing alerts"); - - for (alert, resolved_targets) in regular_alerts { - let ctx = self.ctx.clone(); - let email = self.email.clone(); - let dry_run = self.dry_run; - let file = alert.file.clone(); - - debug!(?file, "executing alert"); - if let Err(err) = alert - .execute(ctx, email.as_ref(), dry_run, &resolved_targets) - .await - { - error!(?file, "error executing alert: {}", LogError(&err)); - } - } - - Ok(()) - } - - pub async fn check_and_reload_if_paths_changed(&self) -> Result<()> { - debug!("checking if resolved paths have changed"); - - let new_resolved = self.glob_resolver.resolve()?; - let old_resolved = self.resolved_paths.read().await; - - if new_resolved.differs_from(&old_resolved) { - drop(old_resolved); // Release read lock before reloading - info!("resolved paths 
have changed, reloading alerts"); - self.reload_alerts().await?; - } - - Ok(()) - } - - pub async fn get_resolved_paths(&self) -> Vec { - let resolved = self.resolved_paths.read().await; - resolved - .all_paths() - .iter() - .map(|p| p.to_path_buf()) - .collect() - } - - pub async fn reload_alerts(&self) -> Result<()> { - info!("reloading alerts"); - - // Cancel all existing tasks - { - let mut tasks = self.tasks.write().await; - for (path, handle) in tasks.drain() { - debug!(?path, "cancelling alert task"); - handle.abort(); - } - } - - // Load and schedule new alerts - self.load_and_schedule_alerts().await - } - - fn spawn_alert_task(&self, alert_state: Arc>) -> JoinHandle<()> { - let ctx = self.ctx.clone(); - let email = self.email.clone(); - let dry_run = self.dry_run; - let event_manager = self.event_manager.clone(); - let state_dirty = self.state_dirty.clone(); - - tokio::spawn(async move { - // Read initial values from state - let (file, interval_duration) = { - let state = alert_state.read().await; - ( - state.definition.file.clone(), - state.definition.interval_duration, - ) - }; - debug!(?file, ?interval_duration, "starting alert task"); - - // Add a small random delay to prevent all alerts from firing at exactly the same time - let jitter = Duration::from_millis(rand::random::() % 5000); - sleep(jitter).await; - - let mut ticker = interval(interval_duration); - ticker.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip); - - loop { - ticker.tick().await; - metrics::record_activity(); - - // Check if alert is paused - let is_paused = { - let state = alert_state.read().await; - if let Some(until) = state.paused_until { - let now = Timestamp::now(); - now < until - } else { - false - } - }; - - if is_paused { - debug!(?file, "alert is paused, skipping execution"); - continue; - } - - debug!(?file, "executing alert"); - - // Get alert definition and state - let ( - alert, - resolved_targets, - was_triggered, - was_source_erroring, - always_send, - 
when_changed, - ) = { - let state = alert_state.read().await; - ( - state.definition.clone(), - state.resolved_targets.clone(), - state.triggered_at.is_some(), - state.source_was_erroring, - state.definition.always_send.clone(), - state.definition.when_changed.clone(), - ) - }; - - // Check the triggering state - let now = jiff::Timestamp::now(); - let mut tera_ctx = crate::templates::build_context(&alert, now); - let not_before = now - alert.interval_duration; - - let mut state_changed = false; - - let is_triggering = match alert - .read_sources(&ctx.pg_pool, not_before, &mut tera_ctx, was_triggered) - .await - { - Ok(flow) => { - // Source recovered: clear the source-error canopy issue. - if was_source_erroring { - info!(?file, "source recovered, clearing source-error event"); - if let Some(ref event_mgr) = *event_manager.read().await { - let entity_key = file.display().to_string(); - if let Err(event_err) = event_mgr - .trigger_clear( - EventType::SourceError, - &ctx, - dry_run, - Some(&entity_key), - ) - .await - { - error!( - "failed to clear source_error event: {}", - LogError(&event_err) - ); - } - } - let mut state = alert_state.write().await; - state.source_was_erroring = false; - state_changed = true; - } - flow.is_continue() - } - Err(err) => { - error!(?file, "error reading sources: {}", LogError(&err)); - metrics::inc_alerts_failed(); - - // Trigger source_error event - if let Some(ref event_mgr) = *event_manager.read().await { - let entity_key = file.display().to_string(); - let event_context = EventContext::SourceError { - alert_file: entity_key.clone(), - error_message: format!("{err:?}"), - }; - if let Err(event_err) = event_mgr - .trigger_event( - EventType::SourceError, - &ctx, - email.as_ref(), - dry_run, - event_context, - Some(&entity_key), - ) - .await - { - error!( - "failed to trigger source_error event: {}", - LogError(&event_err) - ); - } - } - if !was_source_erroring { - let mut state = alert_state.write().await; - 
state.source_was_erroring = true; - state_dirty.notify_one(); - } - continue; - } - }; - - if is_triggering { - // Alert is in triggering state - let mut state = alert_state.write().await; - - let mut should_send = match &always_send { - crate::alert::AlwaysSend::Boolean(true) => true, - crate::alert::AlwaysSend::Boolean(false) => !was_triggered, - crate::alert::AlwaysSend::Timed(config) => { - // Check if enough time has passed since last send - match state.last_sent_at { - Some(last_sent_time) => { - let now = Timestamp::now(); - let elapsed = now.duration_since(last_sent_time); - if let Ok(after_duration) = - jiff::SignedDuration::try_from(config.after_duration) - { - elapsed >= after_duration - } else { - false - } - } - None => true, // Never sent before, should send - } - } - }; - - // Check when-changed logic if configured - if should_send - && !matches!(when_changed, crate::alert::WhenChanged::Boolean(false)) - { - let current_digest = - digest_context_for_comparison(&tera_ctx, &when_changed); - - let output_changed = match &state.last_output { - Some(prev_digest) => prev_digest != &current_digest, - None => true, // First run, consider it changed - }; - - if output_changed { - debug!(?file, "output changed, will send"); - state.last_output = Some(current_digest); - state_changed = true; - } else { - debug!(?file, "output unchanged, skipping"); - should_send = false; - } - } - - if should_send { - debug!(?file, "alert triggered, sending notifications"); - - // Send to targets - for target in &resolved_targets { - if let Err(err) = target - .send(&alert, &mut tera_ctx, email.as_ref(), &ctx, dry_run) - .await - { - error!("sending: {}", LogError(&err)); - } - } - - metrics::inc_alerts_sent(); - - // Update last sent timestamp - state.last_sent_at = Some(Timestamp::now()); - state_changed = true; - } else { - debug!(?file, "alert still triggered, not sending (already sent)"); - } - - // Update the triggered timestamp even if we didn't send - if !was_triggered { - 
state.triggered_at = Some(Timestamp::now()); - state_changed = true; - } - } else { - // Alert is not in triggering state - if was_triggered { - info!(?file, "alert is no longer triggering, sending clear"); - let all_cleared = - send_clear_to_targets(&resolved_targets, &alert, &ctx, dry_run).await; - if all_cleared { - let mut state = alert_state.write().await; - state.triggered_at = None; - state.last_sent_at = None; - - // Clear last output when alert clears - if !matches!(when_changed, crate::alert::WhenChanged::Boolean(false)) { - state.last_output = None; - } - state_changed = true; - } else { - warn!( - ?file, - "send_clear failed for one or more targets; will retry on next tick" - ); - } - } - } - - if state_changed { - state_dirty.notify_one(); - } - } - }) - } - - pub async fn shutdown(&self) { - info!("shutting down scheduler"); - let mut tasks = self.tasks.write().await; - for (path, handle) in tasks.drain() { - debug!(?path, "cancelling alert task"); - handle.abort(); - } - } -} - -/// Compute a stable digest of the alert's row output for `when-changed` -/// comparison. Hashing rather than storing the serialised rows keeps state.json -/// from growing without bound when a high-cardinality SQL alert holds many rows. 
-fn digest_context_for_comparison( - context: &tera::Context, - when_changed: &crate::alert::WhenChanged, -) -> String { - use crate::alert::WhenChanged; - - let rows = match context.get("rows") { - Some(value) => value, - None => return blake3_hex(b""), - }; - - let rows_array = match rows.as_array() { - Some(arr) => arr, - None => return blake3_hex(serde_json::to_string(rows).unwrap_or_default().as_bytes()), - }; - - match when_changed { - WhenChanged::Boolean(true) => { - blake3_hex(serde_json::to_string(rows).unwrap_or_default().as_bytes()) - } - WhenChanged::Boolean(false) => blake3_hex(b""), - WhenChanged::Detailed(config) => { - let filtered_rows: Vec> = rows_array - .iter() - .filter_map(|row| { - let obj = row.as_object()?; - let mut filtered = serde_json::Map::new(); - for (key, value) in obj { - let include = if !config.only.is_empty() { - config.only.contains(key) - } else if !config.except.is_empty() { - !config.except.contains(key) - } else { - true - }; - if include { - filtered.insert(key.clone(), value.clone()); - } - } - Some(filtered) - }) - .collect(); - blake3_hex( - serde_json::to_string(&filtered_rows) - .unwrap_or_default() - .as_bytes(), - ) - } - } -} - -fn blake3_hex(bytes: &[u8]) -> String { - blake3::hash(bytes).to_hex().to_string() -} - -/// Send a clear notification to every resolved target. -/// -/// Returns `true` if every target's `send_clear` succeeded; `false` if any -/// failed. Caller should leave `triggered_at` set when this returns `false` -/// so the next scheduler tick retries — otherwise a transient failure -/// (network blip, canopy 5xx, TLS handshake during cert rollover) silently -/// leaves stateful targets like canopy stuck on `active=true`. 
-async fn send_clear_to_targets( - targets: &[ResolvedTarget], - alert: &AlertDefinition, - ctx: &InternalContext, - dry_run: bool, -) -> bool { - let mut all_ok = true; - for target in targets { - if let Err(err) = target.send_clear(alert, ctx, dry_run).await { - error!("sending clear: {}", LogError(&err)); - all_ok = false; - } - } - all_ok -} - -#[cfg(test)] -mod tests { - use super::*; - use crate::{ - canopy::{DEFAULT_CANOPY_URL, Severity}, - targets::{CanopyConfig, TargetCanopy, TargetConnection, TargetEmail}, - }; - - async fn test_internal_context() -> InternalContext { - let db_url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set for tests"); - let pg_pool = bestool_postgres::pool::create_pool(&db_url, "bestool-alertd-test") - .await - .unwrap(); - InternalContext { - pg_pool, - http_client: reqwest::Client::new(), - canopy_client: None, - } - } - - fn email_target() -> ResolvedTarget { - ResolvedTarget { - target_id: "ops".into(), - subject: None, - template: "body".into(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["ops@example.com".into()], - }), - } - } - - fn canopy_target() -> ResolvedTarget { - ResolvedTarget { - target_id: "default".into(), - subject: None, - template: "body".into(), - conn: TargetConnection::Canopy(TargetCanopy { - canopy: CanopyConfig { - url: DEFAULT_CANOPY_URL.parse().unwrap(), - source: "test".into(), - severity: Some(Severity::Error), - }, - }), - } - } - - fn test_alert() -> AlertDefinition { - AlertDefinition { - file: "test.yml".into(), - ..Default::default() - } - } - - #[tokio::test] - async fn send_clear_to_targets_returns_true_for_non_stateful_only() { - let ctx = test_internal_context().await; - let targets = vec![email_target()]; - assert!(send_clear_to_targets(&targets, &test_alert(), &ctx, false).await); - } - - #[tokio::test] - async fn send_clear_to_targets_returns_false_when_canopy_lacks_client() { - // Canopy target configured but ctx.canopy_client is None — send_clear - // 
returns Err and the helper should report failure so the caller - // leaves triggered_at set and retries on the next tick. - let ctx = test_internal_context().await; - let targets = vec![canopy_target()]; - assert!(!send_clear_to_targets(&targets, &test_alert(), &ctx, false).await); - } - - #[tokio::test] - async fn send_clear_to_targets_reports_failure_even_when_only_one_target_fails() { - // Mixed bag: email (succeeds) + canopy with no client (fails). - // Helper must return false so the scheduler keeps the alert in the - // triggered state and retries. - let ctx = test_internal_context().await; - let targets = vec![email_target(), canopy_target()]; - assert!(!send_clear_to_targets(&targets, &test_alert(), &ctx, false).await); - } - - #[tokio::test] - async fn send_clear_to_targets_in_dry_run_succeeds_regardless_of_canopy_client() { - // In dry-run mode, canopy never tries to use the missing client. - let ctx = test_internal_context().await; - let targets = vec![canopy_target()]; - assert!(send_clear_to_targets(&targets, &test_alert(), &ctx, true).await); - } - - #[test] - fn digest_is_stable_hex_length() { - let mut ctx = tera::Context::new(); - ctx.insert("rows", &serde_json::json!([{"a": 1}])); - let d = digest_context_for_comparison(&ctx, &crate::alert::WhenChanged::Boolean(true)); - assert_eq!(d.len(), 64, "blake3 hex digest should be 64 chars"); - assert!(d.chars().all(|c| c.is_ascii_hexdigit())); - } - - #[test] - fn digest_does_not_grow_with_row_count() { - // A digest of one row and a digest of ten thousand rows should both - // be the same fixed size — this is the whole point of hashing. 
- let mut small = tera::Context::new(); - small.insert("rows", &serde_json::json!([{"a": 1}])); - - let big_rows: Vec = (0..10_000) - .map(|i| serde_json::json!({"a": i, "b": "padding-".repeat(8)})) - .collect(); - let mut big = tera::Context::new(); - big.insert("rows", &serde_json::Value::Array(big_rows)); - - let d_small = - digest_context_for_comparison(&small, &crate::alert::WhenChanged::Boolean(true)); - let d_big = digest_context_for_comparison(&big, &crate::alert::WhenChanged::Boolean(true)); - assert_eq!(d_small.len(), d_big.len()); - assert_ne!(d_small, d_big); - } - - #[test] - fn digest_changes_when_rows_change() { - let mut a = tera::Context::new(); - a.insert("rows", &serde_json::json!([{"x": 1}])); - let mut b = tera::Context::new(); - b.insert("rows", &serde_json::json!([{"x": 2}])); - assert_ne!( - digest_context_for_comparison(&a, &crate::alert::WhenChanged::Boolean(true)), - digest_context_for_comparison(&b, &crate::alert::WhenChanged::Boolean(true)), - ); - } - - #[test] - fn digest_only_filter_ignores_other_columns() { - use crate::alert::{WhenChanged, WhenChangedConfig}; - let cfg = WhenChanged::Detailed(WhenChangedConfig { - only: vec!["id".into()], - except: Vec::new(), - }); - - let mut a = tera::Context::new(); - a.insert("rows", &serde_json::json!([{"id": 1, "noise": "x"}])); - let mut b = tera::Context::new(); - b.insert("rows", &serde_json::json!([{"id": 1, "noise": "y"}])); - - assert_eq!( - digest_context_for_comparison(&a, &cfg), - digest_context_for_comparison(&b, &cfg), - "changes to non-`only` columns should not change the digest" - ); - } - - #[test] - fn digest_except_filter_ignores_named_columns() { - use crate::alert::{WhenChanged, WhenChangedConfig}; - let cfg = WhenChanged::Detailed(WhenChangedConfig { - except: vec!["ts".into()], - only: Vec::new(), - }); - - let mut a = tera::Context::new(); - a.insert("rows", &serde_json::json!([{"id": 1, "ts": "2026-01-01"}])); - let mut b = tera::Context::new(); - b.insert("rows", 
&serde_json::json!([{"id": 1, "ts": "2026-02-01"}])); - - assert_eq!( - digest_context_for_comparison(&a, &cfg), - digest_context_for_comparison(&b, &cfg), - "changes to excluded columns should not change the digest" - ); - } -} diff --git a/crates/alertd/src/state_file.rs b/crates/alertd/src/state_file.rs deleted file mode 100644 index c69cf932..00000000 --- a/crates/alertd/src/state_file.rs +++ /dev/null @@ -1,324 +0,0 @@ -use std::{ - collections::{HashMap, HashSet}, - io::Write, - path::{Path, PathBuf}, -}; - -use jiff::Timestamp; -use miette::{IntoDiagnostic, Result, WrapErr}; -use serde::{Deserialize, Serialize}; -use tempfile::NamedTempFile; -use tracing::{debug, warn}; - -const STATE_FILE_NAME: &str = "state.json"; -const APP_DIR: &str = "bestool-alertd"; - -/// Persistent per-alert state kept across daemon restarts. -#[derive(Debug, Clone, Default, Serialize, Deserialize)] -pub struct PersistedAlertState { - #[serde(skip_serializing_if = "Option::is_none", default)] - pub triggered_at: Option, - #[serde(skip_serializing_if = "Option::is_none", default)] - pub last_sent_at: Option, - /// BLAKE3 hex digest of the previous output used for `when-changed` - /// comparison. Storing the digest rather than the full serialised rows - /// keeps state.json small for high-cardinality alerts. - #[serde(skip_serializing_if = "Option::is_none", default)] - pub last_output: Option, - #[serde(skip_serializing_if = "Option::is_none", default)] - pub paused_until: Option, - /// Mirrors `AlertState::source_was_erroring`. - #[serde(skip_serializing_if = "is_false", default)] - pub source_was_erroring: bool, -} - -fn is_false(b: &bool) -> bool { - !*b -} - -/// On-disk shape of the state file. -#[derive(Debug, Clone, Default, Serialize, Deserialize)] -pub struct PersistedState { - pub saved_at: Option, - pub alerts: HashMap, - /// Mirrors the daemon's database-down tracking. 
- /// - /// Carried across restarts so that if the daemon goes down with the - /// database unreachable and the database recovers before the daemon is - /// back, the next healthy tick still fires a canopy clear. - #[serde(skip_serializing_if = "is_false", default)] - pub database_was_down: bool, - /// Files that errored during definition loading on the previous run. - /// - /// Carried across restarts so a file that errored on shutdown but loads - /// cleanly on startup still produces a canopy clear. - #[serde(skip_serializing_if = "HashSet::is_empty", default)] - pub definition_error_files: HashSet, -} - -/// Resolve the default state-file path for this platform. -/// -/// Mirrors the path-resolution pattern used by `bestool-psql`'s audit DB. -/// Returns `None` only if every fallback fails (e.g. no `HOME`, no -/// `LOCALAPPDATA`); in that case the caller should run without persistence. -pub fn default_state_file_path() -> Option { - let base = state_base_dir()?; - Some(base.join(APP_DIR).join(STATE_FILE_NAME)) -} - -#[cfg(not(any(target_os = "macos", target_os = "windows")))] -fn state_base_dir() -> Option { - if let Some(dir) = dirs::state_dir() { - return Some(dir); - } - if let Some(dir) = std::env::var_os("XDG_STATE_HOME") { - return Some(PathBuf::from(dir)); - } - if let Some(home) = std::env::var_os("HOME") { - return Some(PathBuf::from(home).join(".local").join("state")); - } - None -} - -#[cfg(any(target_os = "macos", target_os = "windows"))] -fn state_base_dir() -> Option { - if let Some(dir) = dirs::data_local_dir() { - return Some(dir); - } - #[cfg(target_os = "macos")] - { - if let Some(home) = std::env::var_os("HOME") { - return Some( - PathBuf::from(home) - .join("Library") - .join("Application Support"), - ); - } - } - #[cfg(target_os = "windows")] - { - if let Some(localappdata) = std::env::var_os("LOCALAPPDATA") { - return Some(PathBuf::from(localappdata)); - } - } - None -} - -/// Read and parse the state file. 
-/// -/// If the file is missing, returns an empty state — that's the first-run path. -/// If the file is unreadable or unparsable, logs a warning, deletes the -/// file, and returns an empty state. Persistence is best-effort; a corrupted -/// file should not block the daemon. -pub fn read(path: &Path) -> PersistedState { - let content = match std::fs::read_to_string(path) { - Ok(c) => c, - Err(err) if err.kind() == std::io::ErrorKind::NotFound => { - debug!(?path, "state file missing, starting fresh"); - return PersistedState::default(); - } - Err(err) => { - warn!(?path, "failed to read state file ({err}); discarding"); - let _ = std::fs::remove_file(path); - return PersistedState::default(); - } - }; - - match serde_json::from_str::(&content) { - Ok(state) => { - debug!(?path, alerts = state.alerts.len(), "loaded state file"); - state - } - Err(err) => { - warn!(?path, "failed to parse state file ({err}); discarding"); - let _ = std::fs::remove_file(path); - PersistedState::default() - } - } -} - -/// Atomically write the state to disk. -/// -/// Writes to a tempfile in the same directory, then renames into place. -/// Creates the parent directory if missing. 
-pub fn write(path: &Path, state: &PersistedState) -> Result<()> { - let parent = path - .parent() - .ok_or_else(|| miette::miette!("state file path has no parent directory: {path:?}"))?; - - std::fs::create_dir_all(parent) - .into_diagnostic() - .wrap_err_with(|| format!("creating state directory {parent:?}"))?; - - let mut tmp = NamedTempFile::new_in(parent) - .into_diagnostic() - .wrap_err_with(|| format!("creating tempfile in {parent:?}"))?; - - let json = serde_json::to_vec_pretty(state) - .into_diagnostic() - .wrap_err("serialising state")?; - - tmp.write_all(&json) - .into_diagnostic() - .wrap_err("writing state tempfile")?; - - tmp.as_file() - .sync_all() - .into_diagnostic() - .wrap_err("fsyncing state tempfile")?; - - tmp.persist(path) - .map_err(|err| miette::miette!("renaming state tempfile into place: {err}"))?; - - Ok(()) -} - -#[cfg(test)] -mod tests { - use super::*; - use tempfile::TempDir; - - #[test] - fn default_path_resolves_on_test_host() { - assert!( - default_state_file_path().is_some(), - "every test host should have a resolvable state dir" - ); - } - - #[test] - fn read_missing_file_returns_empty() { - let tmp = TempDir::new().unwrap(); - let path = tmp.path().join("missing.json"); - let state = read(&path); - assert!(state.alerts.is_empty()); - } - - #[test] - fn read_corrupt_file_returns_empty_and_deletes() { - let tmp = TempDir::new().unwrap(); - let path = tmp.path().join("state.json"); - std::fs::write(&path, "{this is not json").unwrap(); - let state = read(&path); - assert!(state.alerts.is_empty()); - assert!(!path.exists(), "corrupt state file should be deleted"); - } - - #[test] - fn write_then_read_round_trips() { - let tmp = TempDir::new().unwrap(); - let path = tmp.path().join("subdir").join("state.json"); - - let mut alerts = HashMap::new(); - alerts.insert( - PathBuf::from("/etc/alerts/disk-full.yml"), - PersistedAlertState { - triggered_at: Some("2026-05-13T15:00:00Z".parse().unwrap()), - last_sent_at: 
Some("2026-05-13T15:00:00Z".parse().unwrap()), - last_output: Some("rows=[{...}]".into()), - paused_until: None, - source_was_erroring: false, - }, - ); - let state = PersistedState { - saved_at: Some("2026-05-13T15:00:01Z".parse().unwrap()), - alerts, - ..Default::default() - }; - - write(&path, &state).expect("write should succeed"); - assert!(path.exists(), "parent dir should be auto-created"); - - let loaded = read(&path); - assert_eq!(loaded.alerts.len(), 1); - let entry = &loaded.alerts[&PathBuf::from("/etc/alerts/disk-full.yml")]; - assert!(entry.triggered_at.is_some()); - assert_eq!(entry.last_output.as_deref(), Some("rows=[{...}]")); - assert!(entry.paused_until.is_none()); - } - - #[test] - fn daemon_level_fields_round_trip() { - let tmp = TempDir::new().unwrap(); - let path = tmp.path().join("state.json"); - - let mut definition_error_files = HashSet::new(); - definition_error_files.insert(PathBuf::from("/etc/alerts/broken.yml")); - definition_error_files.insert(PathBuf::from("/etc/alerts/also-broken.yml")); - - let state = PersistedState { - saved_at: Some("2026-05-13T15:00:01Z".parse().unwrap()), - alerts: HashMap::new(), - database_was_down: true, - definition_error_files: definition_error_files.clone(), - }; - write(&path, &state).unwrap(); - - let loaded = read(&path); - assert!(loaded.database_was_down); - assert_eq!(loaded.definition_error_files, definition_error_files); - } - - #[test] - fn daemon_level_fields_default_when_missing() { - // Older state files won't have the new fields; defaults must apply - // so the daemon doesn't panic on hydration. 
- let tmp = TempDir::new().unwrap(); - let path = tmp.path().join("state.json"); - std::fs::write(&path, r#"{"saved_at":null,"alerts":{}}"#).unwrap(); - let loaded = read(&path); - assert!(!loaded.database_was_down); - assert!(loaded.definition_error_files.is_empty()); - } - - #[test] - fn source_was_erroring_round_trips() { - let tmp = TempDir::new().unwrap(); - let path = tmp.path().join("state.json"); - - let mut alerts = HashMap::new(); - alerts.insert( - PathBuf::from("/etc/alerts/flaky.yml"), - PersistedAlertState { - source_was_erroring: true, - ..Default::default() - }, - ); - let state = PersistedState { - alerts, - ..Default::default() - }; - write(&path, &state).unwrap(); - - let loaded = read(&path); - let entry = &loaded.alerts[&PathBuf::from("/etc/alerts/flaky.yml")]; - assert!(entry.source_was_erroring); - } - - #[test] - fn write_overwrites_existing_atomically() { - let tmp = TempDir::new().unwrap(); - let path = tmp.path().join("state.json"); - - let first = PersistedState::default(); - write(&path, &first).unwrap(); - - let mut alerts = HashMap::new(); - alerts.insert( - PathBuf::from("a.yml"), - PersistedAlertState { - triggered_at: Some("2026-05-13T15:00:00Z".parse().unwrap()), - ..Default::default() - }, - ); - let second = PersistedState { - saved_at: None, - alerts, - ..Default::default() - }; - write(&path, &second).unwrap(); - - let loaded = read(&path); - assert_eq!(loaded.alerts.len(), 1); - } -} diff --git a/crates/alertd/src/targets.rs b/crates/alertd/src/targets.rs deleted file mode 100644 index ea33822c..00000000 --- a/crates/alertd/src/targets.rs +++ /dev/null @@ -1,386 +0,0 @@ -use std::collections::HashMap; - -use miette::Result; - -use crate::{ - EmailConfig, - alert::{AlertDefinition, InternalContext}, - templates::{load_templates, render_alert}, -}; - -pub mod canopy; -mod default; -mod email; -mod slack; - -pub use canopy::{CanopyConfig, TargetCanopy}; -pub use default::determine_default_target; -pub use email::TargetEmail; 
-pub use slack::TargetSlack; - -#[derive(serde::Deserialize, Debug, Clone)] -#[serde(rename_all = "snake_case")] -#[serde(untagged)] -pub enum SendTarget { - // New format: just id, subject, template - Simple { - id: String, - subject: Option, - template: String, - }, - // Old format: target: external, id, subject, template - External { - target: String, // Should be "external" but we ignore the value - id: String, - subject: Option, - template: String, - }, -} - -impl SendTarget { - pub fn id(&self) -> &str { - match self { - Self::Simple { id, .. } => id, - Self::External { id, .. } => id, - } - } - - pub fn subject(&self) -> &Option { - match self { - Self::Simple { subject, .. } => subject, - Self::External { subject, .. } => subject, - } - } - - pub fn template(&self) -> &str { - match self { - Self::Simple { template, .. } => template, - Self::External { template, .. } => template, - } - } - - pub fn resolve_external( - &self, - external_targets: &HashMap>, - ) -> Vec { - external_targets - .get(self.id()) - .map(|exts| { - exts.iter() - .map(|ext| ResolvedTarget { - target_id: ext.id.clone(), - subject: self.subject().clone(), - template: self.template().to_string(), - conn: ext.conn.clone(), - }) - .collect() - }) - .unwrap_or_default() - } -} - -#[derive(serde::Deserialize, serde::Serialize, Clone, Debug)] -#[serde(untagged)] -pub enum TargetConnection { - Slack(TargetSlack), - Email(TargetEmail), - Canopy(TargetCanopy), -} - -#[derive(Debug, Clone)] -pub struct ResolvedTarget { - pub target_id: String, - pub subject: Option, - pub template: String, - pub conn: TargetConnection, -} - -impl ResolvedTarget { - pub async fn send( - &self, - alert: &AlertDefinition, - tera_ctx: &mut tera::Context, - email: Option<&EmailConfig>, - ctx: &InternalContext, - dry_run: bool, - ) -> Result<()> { - let tera = load_templates(&self.subject, &self.template)?; - let (subject, body) = render_alert(&tera, tera_ctx)?; - - match &self.conn { - TargetConnection::Email(target) 
=> { - target.send(alert, email, &subject, &body, dry_run).await - } - TargetConnection::Slack(target) => { - target - .send( - &ctx.http_client, - slack::SlackSendParams { - alert, - subject: &subject, - body: &body, - tera: &tera, - tera_ctx, - dry_run, - }, - ) - .await - } - TargetConnection::Canopy(target) => { - target - .send( - ctx.canopy_client.as_deref(), - alert, - &self.target_id, - &subject, - &body, - dry_run, - ) - .await - } - } - } - - /// Send a "cleared" notification for stateful targets (canopy). - /// - /// Non-stateful targets (email, slack) return Ok immediately. - pub async fn send_clear( - &self, - alert: &AlertDefinition, - ctx: &InternalContext, - dry_run: bool, - ) -> Result<()> { - match &self.conn { - TargetConnection::Canopy(target) => { - target - .send_clear( - ctx.canopy_client.as_deref(), - alert, - &self.target_id, - dry_run, - ) - .await - } - _ => Ok(()), - } - } -} - -#[derive(serde::Deserialize, Debug)] -pub struct AlertTargets { - pub targets: Vec, -} - -#[derive(serde::Deserialize, serde::Serialize, Clone, Debug)] -#[serde(rename_all = "snake_case")] -pub struct ExternalTarget { - pub id: String, - #[serde(flatten)] - pub conn: TargetConnection, -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn test_send_target_simple_format() { - let yaml = r#" -id: test-target -subject: Test Subject -template: Test template -"#; - let target: SendTarget = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(target.id(), "test-target"); - assert_eq!(target.subject(), &Some("Test Subject".to_string())); - assert_eq!(target.template(), "Test template"); - } - - #[test] - fn test_send_target_external_format() { - let yaml = r#" -target: external -id: test-target -subject: Test Subject -template: Test template -"#; - let target: SendTarget = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(target.id(), "test-target"); - assert_eq!(target.subject(), &Some("Test Subject".to_string())); - assert_eq!(target.template(), "Test 
template"); - } - - #[test] - fn test_send_target_without_subject() { - let yaml = r#" -id: test-target -template: Test template -"#; - let target: SendTarget = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(target.id(), "test-target"); - assert_eq!(target.subject(), &None); - assert_eq!(target.template(), "Test template"); - } - - #[test] - fn test_external_target_email() { - let yaml = r#" -id: ops-team -addresses: - - ops@example.com - - oncall@example.com -"#; - let target: ExternalTarget = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(target.id, "ops-team"); - assert!(matches!(target.conn, TargetConnection::Email(_))); - if let TargetConnection::Email(email) = &target.conn { - assert_eq!(email.addresses.len(), 2); - assert_eq!(email.addresses[0], "ops@example.com"); - } - } - - #[test] - fn test_external_target_slack() { - let yaml = r#" -id: slack-alerts -webhook: https://hooks.example.com/services/T00/B00/xxx -"#; - let target: ExternalTarget = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(target.id, "slack-alerts"); - assert!(matches!(target.conn, TargetConnection::Slack(_))); - if let TargetConnection::Slack(slack) = &target.conn { - assert_eq!( - slack.webhook.as_str(), - "https://hooks.example.com/services/T00/B00/xxx" - ); - } - } - - #[test] - fn test_external_target_slack_with_fields() { - let yaml = r#" -id: slack-custom -webhook: https://hooks.example.com/services/T00/B00/xxx -fields: - - name: text - field: body - - name: server - value: production -"#; - let target: ExternalTarget = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(target.id, "slack-custom"); - assert!(matches!(target.conn, TargetConnection::Slack(_))); - if let TargetConnection::Slack(slack) = &target.conn { - assert_eq!(slack.fields.len(), 2); - } - } - - #[test] - fn test_alert_targets_mixed() { - let yaml = r#" -targets: - - id: email-team - addresses: - - team@example.com - - id: slack-channel - webhook: https://hooks.example.com/services/T00/B00/xxx -"#; - let 
targets: AlertTargets = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(targets.targets.len(), 2); - assert!(matches!( - targets.targets[0].conn, - TargetConnection::Email(_) - )); - assert!(matches!( - targets.targets[1].conn, - TargetConnection::Slack(_) - )); - } - - #[test] - fn test_resolve_email_target() { - let mut external_targets = HashMap::new(); - external_targets.insert( - "ops".to_string(), - vec![ExternalTarget { - id: "ops".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["ops@example.com".to_string()], - }), - }], - ); - - let send = SendTarget::Simple { - id: "ops".to_string(), - subject: Some("Test".to_string()), - template: "Body".to_string(), - }; - - let resolved = send.resolve_external(&external_targets); - assert_eq!(resolved.len(), 1); - assert!(matches!(resolved[0].conn, TargetConnection::Email(_))); - } - - #[test] - fn test_resolve_slack_target() { - let mut external_targets = HashMap::new(); - external_targets.insert( - "slack".to_string(), - vec![ExternalTarget { - id: "slack".to_string(), - conn: TargetConnection::Slack(TargetSlack { - webhook: "https://hooks.example.com/services/T00/B00/xxx" - .parse() - .unwrap(), - fields: slack::SlackField::default_set(), - }), - }], - ); - - let send = SendTarget::Simple { - id: "slack".to_string(), - subject: Some("Test".to_string()), - template: "Body".to_string(), - }; - - let resolved = send.resolve_external(&external_targets); - assert_eq!(resolved.len(), 1); - assert!(matches!(resolved[0].conn, TargetConnection::Slack(_))); - } - - #[test] - fn test_resolve_mixed_targets_same_id() { - let mut external_targets = HashMap::new(); - external_targets.insert( - "all".to_string(), - vec![ - ExternalTarget { - id: "all".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["team@example.com".to_string()], - }), - }, - ExternalTarget { - id: "all".to_string(), - conn: TargetConnection::Slack(TargetSlack { - webhook: 
"https://hooks.example.com/services/T00/B00/xxx" - .parse() - .unwrap(), - fields: slack::SlackField::default_set(), - }), - }, - ], - ); - - let send = SendTarget::Simple { - id: "all".to_string(), - subject: Some("Test".to_string()), - template: "Body".to_string(), - }; - - let resolved = send.resolve_external(&external_targets); - assert_eq!(resolved.len(), 2); - assert!(matches!(resolved[0].conn, TargetConnection::Email(_))); - assert!(matches!(resolved[1].conn, TargetConnection::Slack(_))); - } -} diff --git a/crates/alertd/src/targets/canopy.rs b/crates/alertd/src/targets/canopy.rs deleted file mode 100644 index 7682d500..00000000 --- a/crates/alertd/src/targets/canopy.rs +++ /dev/null @@ -1,217 +0,0 @@ -use jiff::Timestamp; -use miette::{Result, miette}; -use sysinfo::System; -use tracing::debug; -use url::Url; - -use crate::{ - alert::AlertDefinition, - canopy::{CanopyClient, DEFAULT_CANOPY_URL, NewEvent, Severity}, -}; - -fn default_canopy_url() -> Url { - DEFAULT_CANOPY_URL - .parse() - .expect("default canopy URL is valid") -} - -/// External-target connection for a canopy events endpoint. -#[derive(serde::Deserialize, serde::Serialize, Debug, Clone)] -pub struct TargetCanopy { - pub canopy: CanopyConfig, -} - -#[derive(serde::Deserialize, serde::Serialize, Debug, Clone)] -pub struct CanopyConfig { - #[serde(default = "default_canopy_url")] - pub url: Url, - pub source: String, - #[serde(default)] - pub severity: Option, -} - -/// Build the deduplication ref for a canopy event. -/// -/// Combines the hostname, alert file stem, and target id so the same alert -/// firing on different hosts or to different canopy targets produces -/// distinct canopy issues. 
-pub fn build_ref(alert: &AlertDefinition, target_id: &str) -> String { - let hostname = System::host_name().unwrap_or_else(|| "unknown".into()); - let stem = alert - .file - .file_stem() - .map(|s| s.to_string_lossy().into_owned()) - .unwrap_or_else(|| "alert".into()); - format!("{hostname}/{stem}:{target_id}") -} - -impl TargetCanopy { - /// Post a triggering event to canopy. - pub async fn send( - &self, - client: Option<&CanopyClient>, - alert: &AlertDefinition, - target_id: &str, - subject: &str, - body: &str, - dry_run: bool, - ) -> Result<()> { - let r#ref = build_ref(alert, target_id); - - if dry_run { - println!("-------------------------------"); - println!("Alert: {}", alert.file.display()); - println!("Recipients: canopy:{}", self.canopy.url); - println!("Source: {}", self.canopy.source); - println!("Ref: {ref}", ref = r#ref); - println!( - "Severity: {:?}", - self.canopy.severity.unwrap_or(Severity::Error) - ); - println!("Active: true"); - println!("Subject: {subject}"); - println!("Body: {body}"); - return Ok(()); - } - - let client = client.ok_or_else(|| { - miette!( - "canopy target {target_id} configured but no device key was provided to the daemon" - ) - })?; - - debug!(?alert.file, target_id, "sending canopy trigger event"); - - client - .post_event( - &self.canopy.url, - NewEvent { - source: &self.canopy.source, - r#ref: &r#ref, - message: body, - description: Some(subject), - severity: Some(self.canopy.severity.unwrap_or(Severity::Error)), - occurred_at: Some(Timestamp::now()), - active: Some(true), - }, - ) - .await - } - - /// Post a clearing event to canopy. 
- pub async fn send_clear( - &self, - client: Option<&CanopyClient>, - alert: &AlertDefinition, - target_id: &str, - dry_run: bool, - ) -> Result<()> { - let r#ref = build_ref(alert, target_id); - - if dry_run { - println!("-------------------------------"); - println!("Alert (cleared): {}", alert.file.display()); - println!("Recipients: canopy:{}", self.canopy.url); - println!("Source: {}", self.canopy.source); - println!("Ref: {ref}", ref = r#ref); - println!("Active: false"); - return Ok(()); - } - - let client = client.ok_or_else(|| { - miette!( - "canopy target {target_id} configured but no device key was provided to the daemon" - ) - })?; - - debug!(?alert.file, target_id, "sending canopy clear event"); - - client - .post_event( - &self.canopy.url, - NewEvent { - source: &self.canopy.source, - r#ref: &r#ref, - message: "alert cleared", - description: None, - severity: Some(self.canopy.severity.unwrap_or(Severity::Error)), - occurred_at: Some(Timestamp::now()), - active: Some(false), - }, - ) - .await - } -} - -#[cfg(test)] -mod tests { - use super::*; - use crate::targets::{ExternalTarget, TargetConnection}; - - #[test] - fn parse_canopy_external_target() { - let yaml = r#" -id: meta -canopy: - url: https://meta.tamanu.app - source: my-server - severity: warning -"#; - let target: ExternalTarget = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(target.id, "meta"); - match target.conn { - TargetConnection::Canopy(canopy) => { - assert_eq!(canopy.canopy.url.as_str(), "https://meta.tamanu.app/"); - assert_eq!(canopy.canopy.source, "my-server"); - assert_eq!(canopy.canopy.severity, Some(Severity::Warning)); - } - _ => panic!("expected canopy target"), - } - } - - #[test] - fn parse_canopy_with_default_url() { - let yaml = r#" -id: meta -canopy: - source: my-server -"#; - let target: ExternalTarget = serde_yaml::from_str(yaml).unwrap(); - match target.conn { - TargetConnection::Canopy(canopy) => { - assert_eq!(canopy.canopy.url.as_str(), 
"https://meta.tamanu.app/"); - assert_eq!(canopy.canopy.severity, None); - } - _ => panic!("expected canopy target"), - } - } - - #[test] - fn build_ref_is_stable_across_trigger_and_clear_for_same_entity() { - // trigger and clear paths build refs from the same synthetic alert - // file; the ref must match so canopy clears the issue that was opened. - let alert = AlertDefinition { - file: "[internal:source-error:my-alert]".into(), - ..Default::default() - }; - let trigger_ref = build_ref(&alert, "default"); - let clear_ref = build_ref(&alert, "default"); - assert_eq!(trigger_ref, clear_ref); - } - - #[test] - fn build_ref_distinguishes_entities() { - let alert_a = AlertDefinition { - file: "[internal:source-error:alert-a]".into(), - ..Default::default() - }; - let alert_b = AlertDefinition { - file: "[internal:source-error:alert-b]".into(), - ..Default::default() - }; - assert_ne!( - build_ref(&alert_a, "default"), - build_ref(&alert_b, "default") - ); - } -} diff --git a/crates/alertd/src/targets/default.rs b/crates/alertd/src/targets/default.rs deleted file mode 100644 index 83751783..00000000 --- a/crates/alertd/src/targets/default.rs +++ /dev/null @@ -1,140 +0,0 @@ -use std::collections::HashMap; - -use tracing::debug; - -use crate::ExternalTarget; - -/// Pick the fallback target for system-level (synthetic) events. -/// -/// Rules: -/// - Empty map → `None`. -/// - Single target → use it. -/// - Otherwise, a target named `default` → use it. -/// - Otherwise, first alphabetical → use it. -/// -/// The loader injects a synthesised canopy target under `"default"` when no -/// explicit `_targets.yml` configures one and canopy auth is available; this -/// function makes no canopy-specific decisions itself. 
-pub fn determine_default_target( - external_targets: &HashMap>, -) -> Option<&ExternalTarget> { - if external_targets.is_empty() { - return None; - } - - // If there's only one target, use it - if external_targets.len() == 1 { - let (id, targets) = external_targets.iter().next().unwrap(); - debug!(id, "using only available target as default"); - return targets.first(); - } - - // If there's a target named "default", use that - if let Some(targets) = external_targets.get("default") { - debug!("using 'default' target"); - return targets.first(); - } - - // Otherwise, use the first target alphabetically - let mut sorted_ids: Vec<_> = external_targets.keys().collect(); - sorted_ids.sort(); - if let Some(id) = sorted_ids.first() { - debug!(id, "using first alphabetical target as default"); - if let Some(targets) = external_targets.get(*id) { - return targets.first(); - } - } - - None -} - -#[cfg(test)] -mod tests { - use super::*; - use crate::targets::{TargetConnection, TargetEmail}; - - #[test] - fn test_determine_default_target_single() { - let mut targets = HashMap::new(); - targets.insert( - "only-target".to_string(), - vec![ExternalTarget { - id: "only-target".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["test@example.com".to_string()], - }), - }], - ); - - let default = determine_default_target(&targets); - assert!(default.is_some()); - } - - #[test] - fn test_determine_default_target_named_default() { - let mut targets = HashMap::new(); - targets.insert( - "default".to_string(), - vec![ExternalTarget { - id: "default".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["default@example.com".to_string()], - }), - }], - ); - targets.insert( - "other".to_string(), - vec![ExternalTarget { - id: "other".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["other@example.com".to_string()], - }), - }], - ); - - let default = determine_default_target(&targets); - 
assert!(default.is_some()); - let default = default.unwrap(); - match &default.conn { - TargetConnection::Email(email) => assert_eq!(email.addresses[0], "default@example.com"), - _ => panic!("expected email target"), - } - } - - #[test] - fn test_determine_default_target_alphabetical() { - let mut targets = HashMap::new(); - targets.insert( - "zebra".to_string(), - vec![ExternalTarget { - id: "zebra".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["zebra@example.com".to_string()], - }), - }], - ); - targets.insert( - "alpha".to_string(), - vec![ExternalTarget { - id: "alpha".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: vec!["alpha@example.com".to_string()], - }), - }], - ); - - let default = determine_default_target(&targets); - assert!(default.is_some()); - let default = default.unwrap(); - match &default.conn { - TargetConnection::Email(email) => assert_eq!(email.addresses[0], "alpha@example.com"), - _ => panic!("expected email target"), - } - } - - #[test] - fn test_no_default_when_empty() { - let targets = HashMap::new(); - assert!(determine_default_target(&targets).is_none()); - } -} diff --git a/crates/alertd/src/targets/email.rs b/crates/alertd/src/targets/email.rs deleted file mode 100644 index 84ebe888..00000000 --- a/crates/alertd/src/targets/email.rs +++ /dev/null @@ -1,62 +0,0 @@ -use mailgun_rs::{EmailAddress, Mailgun, Message}; -use miette::{IntoDiagnostic, Result, WrapErr, miette}; -use tracing::debug; - -use crate::{EmailConfig, alert::AlertDefinition}; - -#[derive(serde::Deserialize, serde::Serialize, Clone, Debug)] -#[serde(rename_all = "snake_case")] -pub struct TargetEmail { - pub addresses: Vec, -} - -impl TargetEmail { - pub async fn send( - &self, - alert: &AlertDefinition, - email: Option<&EmailConfig>, - subject: &str, - body: &str, - dry_run: bool, - ) -> Result<()> { - let body = { - let parser = pulldown_cmark::Parser::new(body); - let mut html_output = String::new(); - 
pulldown_cmark::html::push_html(&mut html_output, parser); - html_output - }; - - if dry_run { - println!("-------------------------------"); - println!("Alert: {}", alert.file.display()); - println!("Recipients: {}", self.addresses.join(", ")); - println!("Subject: {subject}"); - println!("Body: {body}"); - return Ok(()); - } - - debug!(?self.addresses, "sending email"); - let email_config = email.ok_or_else(|| miette!("missing email config"))?; - let sender = EmailAddress::address(&email_config.from); - let message = Message { - to: self - .addresses - .iter() - .map(|email| EmailAddress::address(email)) - .collect(), - subject: subject.into(), - html: body, - ..Default::default() - }; - let mailgun = Mailgun { - api_key: email_config.mailgun_api_key.clone(), - domain: email_config.mailgun_domain.clone(), - }; - mailgun - .async_send(mailgun_rs::MailgunRegion::US, &sender, message, None) - .await - .into_diagnostic() - .wrap_err("sending email") - .map(drop) - } -} diff --git a/crates/alertd/src/targets/slack.rs b/crates/alertd/src/targets/slack.rs deleted file mode 100644 index 76bde571..00000000 --- a/crates/alertd/src/targets/slack.rs +++ /dev/null @@ -1,112 +0,0 @@ -use std::collections::HashMap; - -use miette::{IntoDiagnostic, Result, WrapErr}; -use tera::Tera; -use tracing::debug; -use url::Url; - -use crate::{alert::AlertDefinition, templates::TemplateField}; - -pub struct SlackSendParams<'a> { - pub alert: &'a AlertDefinition, - pub subject: &'a str, - pub body: &'a str, - pub tera: &'a Tera, - pub tera_ctx: &'a tera::Context, - pub dry_run: bool, -} - -#[derive(serde::Deserialize, serde::Serialize, Clone, Debug)] -#[serde(rename_all = "snake_case")] -pub struct TargetSlack { - pub webhook: Url, - - #[serde(default = "SlackField::default_set")] - pub fields: Vec, -} - -#[derive(serde::Deserialize, serde::Serialize, Clone, Debug)] -#[serde(untagged, rename_all = "snake_case")] -pub enum SlackField { - Fixed { name: String, value: String }, - Field { name: 
String, field: TemplateField }, -} - -impl SlackField { - pub fn default_set() -> Vec { - vec![ - Self::Field { - name: "hostname".into(), - field: TemplateField::Hostname, - }, - Self::Field { - name: "filename".into(), - field: TemplateField::Filename, - }, - Self::Field { - name: "subject".into(), - field: TemplateField::Subject, - }, - Self::Field { - name: "message".into(), - field: TemplateField::Body, - }, - ] - } -} - -impl TargetSlack { - pub async fn send( - &self, - http_client: &reqwest::Client, - params: SlackSendParams<'_>, - ) -> Result<()> { - let SlackSendParams { - alert, - subject, - body, - tera, - tera_ctx, - dry_run, - } = params; - - if dry_run { - println!("-------------------------------"); - println!("Alert: {}", alert.file.display()); - println!("Recipients: slack"); - println!("Subject: {subject}"); - println!("Body: {body}"); - return Ok(()); - } - - let payload: HashMap<&String, String> = self - .fields - .iter() - .map(|field| match field { - SlackField::Fixed { name, value } => (name, value.clone()), - SlackField::Field { name, field } => ( - name, - tera.render(field.as_str(), tera_ctx) - .ok() - .or_else(|| { - tera_ctx.get(field.as_str()).map(|v| match v.as_str() { - Some(t) => t.to_owned(), - None => v.to_string(), - }) - }) - .unwrap_or_default(), - ), - }) - .collect(); - - debug!(?self.webhook, ?payload, "posting to slack webhook"); - http_client - .post(self.webhook.clone()) - .json(&payload) - .send() - .await - .into_diagnostic() - .wrap_err("posting to slack webhook") - .map(drop) - } -} diff --git a/crates/alertd/src/tasks.rs b/crates/alertd/src/tasks.rs index 3989d8af..f468d6fb 100644 --- a/crates/alertd/src/tasks.rs +++ b/crates/alertd/src/tasks.rs @@ -4,7 +4,7 @@ use futures::{future::BoxFuture, stream::BoxStream}; use miette::Result; use serde_json::Value; -use crate::{alert::InternalContext, canopy::CanopyClient}; +use crate::{canopy::CanopyClient, context::InternalContext}; /// Shared resources passed to background 
tasks on every tick. /// diff --git a/crates/alertd/src/templates.rs b/crates/alertd/src/templates.rs deleted file mode 100644 index 26d54a4e..00000000 --- a/crates/alertd/src/templates.rs +++ /dev/null @@ -1,112 +0,0 @@ -use std::{fmt::Display, time::Duration}; - -use miette::{Context as _, IntoDiagnostic, Result}; -use sysinfo::System; -use tera::{Context as TeraCtx, Tera}; -use tracing::instrument; - -use crate::alert::AlertDefinition; - -/// Format a duration as a single human-friendly unit, dropping any remainder. -/// -/// E.g. 90 minutes prints as "1h"; 1 day as "1d"; 30 seconds as "30s". -pub(crate) fn humanize_duration(dur: Duration) -> String { - let secs = dur.as_secs(); - if secs >= 86400 { - format!("{}d", secs / 86400) - } else if secs >= 3600 { - format!("{}h", secs / 3600) - } else if secs >= 60 { - format!("{}m", secs / 60) - } else { - format!("{}s", secs) - } -} - -const DEFAULT_SUBJECT_TEMPLATE: &str = "[Tamanu Alert] {{ filename }} ({{ hostname }})"; - -#[derive(serde::Deserialize, serde::Serialize, Clone, Copy, Debug)] -#[serde(rename_all = "snake_case")] -pub enum TemplateField { - Filename, - Subject, - Body, - Hostname, - Interval, -} - -impl TemplateField { - pub fn as_str(self) -> &'static str { - match self { - Self::Filename => "filename", - Self::Subject => "subject", - Self::Body => "body", - Self::Hostname => "hostname", - Self::Interval => "interval", - } - } -} - -impl Display for TemplateField { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - write!(f, "{}", self.as_str()) - } -} - -#[instrument] -pub fn load_templates(subject: &Option, template: &str) -> Result { - let mut tera = tera::Tera::default(); - - tera.add_raw_template( - TemplateField::Subject.as_str(), - subject.as_deref().unwrap_or(DEFAULT_SUBJECT_TEMPLATE), - ) - .into_diagnostic() - .wrap_err("compiling subject template")?; - tera.add_raw_template(TemplateField::Body.as_str(), template) - .into_diagnostic() - .wrap_err("compiling body 
template")?; - - Ok(tera) -} - -#[instrument(skip(alert, now))] -pub fn build_context(alert: &AlertDefinition, now: jiff::Timestamp) -> TeraCtx { - let mut context = TeraCtx::new(); - context.insert( - TemplateField::Interval.as_str(), - &humanize_duration(alert.interval_duration), - ); - context.insert( - TemplateField::Hostname.as_str(), - System::host_name().as_deref().unwrap_or("unknown"), - ); - context.insert( - TemplateField::Filename.as_str(), - &alert - .file - .file_name() - .map(|f| f.to_string_lossy().to_string()) - .unwrap_or_else(|| "alert.yml".to_string()), - ); - context.insert("now", &now.to_string()); - - context -} - -#[instrument(skip(tera, context))] -pub fn render_alert(tera: &Tera, context: &mut TeraCtx) -> Result<(String, String)> { - let subject = tera - .render(TemplateField::Subject.as_str(), context) - .into_diagnostic() - .wrap_err("rendering subject template")?; - - context.insert(TemplateField::Subject.as_str(), &subject.to_string()); - - let body = tera - .render(TemplateField::Body.as_str(), context) - .into_diagnostic() - .wrap_err("rendering email template")?; - - Ok((subject, body)) -} diff --git a/crates/alertd/src/windows_service.rs b/crates/alertd/src/windows_service.rs index d0c0d64a..6181c8ef 100644 --- a/crates/alertd/src/windows_service.rs +++ b/crates/alertd/src/windows_service.rs @@ -106,11 +106,6 @@ fn run_service_main() -> Result<()> { let shutdown_tx = Arc::new(Mutex::new(Some(shutdown_tx))); let shutdown_tx_clone = shutdown_tx.clone(); - // Create reload channel for TIME_CHANGE events - let (reload_tx, reload_rx) = tokio::sync::mpsc::channel(10); - let reload_tx = Arc::new(Mutex::new(reload_tx)); - let reload_tx_clone = reload_tx.clone(); - // Event handler receives control events from Windows SCM let event_handler = move |control_event| -> ServiceControlHandlerResult { match control_event { @@ -124,13 +119,6 @@ fn run_service_main() -> Result<()> { } ServiceControlHandlerResult::NoError } - 
ServiceControl::TimeChange => { - info!("received system time change event, triggering reload"); - // Signal daemon to reload alerts - let tx_guard = reload_tx_clone.lock().unwrap(); - let _ = tx_guard.try_send(()); - ServiceControlHandlerResult::NoError - } _ => ServiceControlHandlerResult::NotImplemented, } }; @@ -199,8 +187,7 @@ fn run_service_main() -> Result<()> { } }); - let daemon_result = - crate::daemon::run_with_shutdown_and_reload(config, shutdown_rx, Some(reload_rx)).await; + let daemon_result = crate::daemon::run_with_shutdown(config, shutdown_rx).await; // Cancel the status update task status_task.abort(); diff --git a/crates/alertd/tests/alert_features.rs b/crates/alertd/tests/alert_features.rs deleted file mode 100644 index abdefa74..00000000 --- a/crates/alertd/tests/alert_features.rs +++ /dev/null @@ -1,571 +0,0 @@ -use std::{sync::Arc, time::Duration}; - -use bestool_alertd::{AlertDefinition, InternalContext}; -use bestool_postgres::pool::{PgPool, create_pool}; - -async fn setup_test_db(table_name: &str) -> (PgPool, String) { - let db_url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set for tests"); - let pool = create_pool(&db_url, "bestool-alertd-test").await.unwrap(); - - let client = pool.get().await.unwrap(); - - // Create a unique test table for this test - let create_sql = format!( - "CREATE TABLE IF NOT EXISTS {} ( - id SERIAL PRIMARY KEY, - name TEXT NOT NULL, - value REAL NOT NULL, - error_count INTEGER NOT NULL, - created_at TIMESTAMP DEFAULT NOW(), - updated_at TIMESTAMP DEFAULT NOW() - )", - table_name - ); - client.execute(&create_sql, &[]).await.unwrap(); - - // Clean up any existing test data in this table - let delete_sql = format!("DELETE FROM {}", table_name); - client.execute(&delete_sql, &[]).await.unwrap(); - - (pool, table_name.to_string()) -} - -#[tokio::test] -async fn test_numerical_threshold_normal_trigger() { - let (pool, table_name) = setup_test_db("test_metrics_normal").await; - - // Insert test data 
- let client = pool.get().await.unwrap(); - let insert_sql = format!( - "INSERT INTO {} (name, value, error_count) VALUES ('cpu_usage', 95.5, 10)", - table_name - ); - client.execute(&insert_sql, &[]).await.unwrap(); - - let yaml = format!( - r#" -sql: "SELECT value FROM {} WHERE name = 'cpu_usage'" -numerical: - - field: value - alert-at: 90 - clear-at: 50 -send: - - id: test - subject: Test - template: Test -"#, - table_name - ); - - let mut alert: AlertDefinition = serde_yaml::from_str(&yaml).unwrap(); - alert.file = "test.yml".into(); - let (alert, _) = alert.normalise(&Default::default()).unwrap(); - - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let mut tera_ctx = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - - // First run - not yet triggered, should trigger because value >= 90 - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - false, - ) - .await - .unwrap(); - assert!( - result.is_continue(), - "Should trigger when value >= alert-at" - ); - - // Second run - already triggered, should stay triggered because value > clear-at - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - true, - ) - .await - .unwrap(); - assert!( - result.is_continue(), - "Should stay triggered when value > clear-at" - ); - - // Update to clear the alert - let client = ctx.pg_pool.get().await.unwrap(); - let update_sql = format!( - "UPDATE {} SET value = 40 WHERE name = 'cpu_usage'", - table_name - ); - client.execute(&update_sql, &[]).await.unwrap(); - - // Third run - should clear because value <= clear-at - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - true, - ) - .await - .unwrap(); - assert!( - result.is_break(), - "Should clear when value <= clear-at (40 
<= 50)" - ); -} - -#[tokio::test] -async fn test_numerical_threshold_inverted_trigger() { - let (pool, table_name) = setup_test_db("test_metrics_inverted").await; - - // Insert test data with low free space - let client = pool.get().await.unwrap(); - let insert_sql = format!( - "INSERT INTO {} (name, value, error_count) VALUES ('free_space_gb', 5.0, 0)", - table_name - ); - client.execute(&insert_sql, &[]).await.unwrap(); - - let yaml = format!( - r#" -sql: "SELECT value FROM {} WHERE name = 'free_space_gb'" -numerical: - - field: value - alert-at: 10 - clear-at: 50 -send: - - id: test - subject: Test - template: Test -"#, - table_name - ); - - let mut alert: AlertDefinition = serde_yaml::from_str(&yaml).unwrap(); - alert.file = "test.yml".into(); - let (alert, _) = alert.normalise(&Default::default()).unwrap(); - - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let mut tera_ctx = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - - // First run - not yet triggered, should trigger because value <= 10 (inverted) - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - false, - ) - .await - .unwrap(); - assert!( - result.is_continue(), - "Should trigger when value <= alert-at (inverted)" - ); - - // Second run - already triggered, should stay triggered because value < clear-at - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - true, - ) - .await - .unwrap(); - assert!( - result.is_continue(), - "Should stay triggered when value < clear-at (inverted)" - ); - - // Update to clear the alert - let client = ctx.pg_pool.get().await.unwrap(); - let update_sql = format!( - "UPDATE {} SET value = 60 WHERE name = 'free_space_gb'", - table_name - ); - client.execute(&update_sql, &[]).await.unwrap(); - - // Third run - should clear because 
value >= clear-at (inverted) - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - true, - ) - .await - .unwrap(); - assert!( - result.is_break(), - "Should clear when value >= clear-at (60 >= 50, inverted)" - ); -} - -#[tokio::test] -async fn test_when_changed_simple() { - let (pool, table_name) = setup_test_db("test_metrics_changed_simple").await; - - // Insert initial data - let client = pool.get().await.unwrap(); - let insert_sql = format!( - "INSERT INTO {} (name, value, error_count) VALUES ('errors', 100.0, 5)", - table_name - ); - client.execute(&insert_sql, &[]).await.unwrap(); - - let yaml = format!( - r#" -sql: "SELECT error_count FROM {} WHERE name = 'errors'" -when-changed: true -send: - - id: test - subject: Test - template: Test -"#, - table_name - ); - - let mut alert: AlertDefinition = serde_yaml::from_str(&yaml).unwrap(); - alert.file = "test.yml".into(); - let (alert, _) = alert.normalise(&Default::default()).unwrap(); - - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - - // First execution - should trigger (first run always triggers) - alert.execute(ctx.clone(), None, true, &[]).await.unwrap(); - // No error means it executed - - // Second execution with same data - would trigger but when-changed should prevent it - // We can't easily test this without the full scheduler state, but we can verify the serialization - - let mut tera_ctx = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - let _ = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - false, - ) - .await - .unwrap(); - - // Verify context has rows - assert!(tera_ctx.get("rows").is_some()); -} - -#[tokio::test] -async fn test_when_changed_with_except() { - let (pool, table_name) = setup_test_db("test_metrics_changed_except").await; - - // Insert initial data - 
let client = pool.get().await.unwrap(); - let insert_sql = format!( - "INSERT INTO {} (name, value, error_count, created_at, updated_at) - VALUES ('test', 100.0, 5, NOW(), NOW())", - table_name - ); - client.execute(&insert_sql, &[]).await.unwrap(); - - let yaml = format!( - r#" -sql: "SELECT error_count, created_at, updated_at FROM {} WHERE name = 'test'" -when-changed: - except: [created_at, updated_at] -send: - - id: test - subject: Test - template: Test -"#, - table_name - ); - - let mut alert: AlertDefinition = serde_yaml::from_str(&yaml).unwrap(); - alert.file = "test.yml".into(); - let (alert, _) = alert.normalise(&Default::default()).unwrap(); - - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let mut tera_ctx = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - - // Read initial data - let _ = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - false, - ) - .await - .unwrap(); - - let rows = tera_ctx.get("rows").unwrap(); - assert!(!rows.as_array().unwrap().is_empty()); - - // Update only timestamps - when-changed should consider this unchanged - tokio::time::sleep(Duration::from_millis(10)).await; - let client = ctx.pg_pool.get().await.unwrap(); - let update_sql = format!( - "UPDATE {} SET updated_at = NOW() WHERE name = 'test'", - table_name - ); - client.execute(&update_sql, &[]).await.unwrap(); - - // The serialization should be the same because we excluded timestamp columns - // This would be verified in the scheduler's change detection logic -} - -#[tokio::test] -async fn test_when_changed_with_only() { - let (pool, table_name) = setup_test_db("test_metrics_changed_only").await; - - // Insert initial data - let client = pool.get().await.unwrap(); - let insert_sql = format!( - "INSERT INTO {} (name, value, error_count) VALUES ('test', 100.0, 5)", - table_name - ); - client.execute(&insert_sql, 
&[]).await.unwrap(); - - let yaml = format!( - r#" -sql: "SELECT error_count, value FROM {} WHERE name = 'test'" -when-changed: - only: [error_count] -send: - - id: test - subject: Test - template: Test -"#, - table_name - ); - - let mut alert: AlertDefinition = serde_yaml::from_str(&yaml).unwrap(); - alert.file = "test.yml".into(); - let (alert, _) = alert.normalise(&Default::default()).unwrap(); - - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let mut tera_ctx = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - - // Read initial data - let _ = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - false, - ) - .await - .unwrap(); - - // Update value (not in 'only' list) - should be considered unchanged - let client = ctx.pg_pool.get().await.unwrap(); - let update_sql1 = format!("UPDATE {} SET value = 200 WHERE name = 'test'", table_name); - client.execute(&update_sql1, &[]).await.unwrap(); - - // Update error_count (in 'only' list) - should be considered changed - let update_sql2 = format!( - "UPDATE {} SET error_count = 10 WHERE name = 'test'", - table_name - ); - client.execute(&update_sql2, &[]).await.unwrap(); - - let mut tera_ctx2 = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - let _ = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx2, - false, - ) - .await - .unwrap(); - - // Both contexts should have rows - assert!(tera_ctx.get("rows").is_some()); - assert!(tera_ctx2.get("rows").is_some()); -} - -#[tokio::test] -async fn test_numerical_and_when_changed_together() { - let (pool, table_name) = setup_test_db("test_metrics_combo").await; - - // Insert initial data - let client = pool.get().await.unwrap(); - let insert_sql = format!( - "INSERT INTO {} (name, value, error_count, created_at) - VALUES ('combo', 95.0, 
100, NOW())", - table_name - ); - client.execute(&insert_sql, &[]).await.unwrap(); - - let yaml = format!( - r#" -sql: "SELECT value, error_count, created_at FROM {} WHERE name = 'combo'" -numerical: - - field: value - alert-at: 90 - clear-at: 50 -when-changed: - except: [created_at] -send: - - id: test - subject: Test - template: Test -"#, - table_name - ); - - let mut alert: AlertDefinition = serde_yaml::from_str(&yaml).unwrap(); - alert.file = "test.yml".into(); - let (alert, _) = alert.normalise(&Default::default()).unwrap(); - - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let mut tera_ctx = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - - // First run - should trigger due to numerical threshold - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - false, - ) - .await - .unwrap(); - assert!( - result.is_continue(), - "Should trigger when numerical threshold exceeded" - ); - - // Verify rows are in context - assert!(tera_ctx.get("rows").is_some()); - let rows = tera_ctx.get("rows").unwrap().as_array().unwrap(); - assert_eq!(rows.len(), 1); -} - -#[tokio::test] -async fn test_multiple_numerical_thresholds() { - let (pool, table_name) = setup_test_db("test_metrics_multi").await; - - // Insert test data with multiple fields - let client = pool.get().await.unwrap(); - let insert_sql = format!( - "INSERT INTO {} (name, value, error_count) VALUES ('multi', 95.0, 150)", - table_name - ); - client.execute(&insert_sql, &[]).await.unwrap(); - - let yaml = format!( - r#" -sql: "SELECT value as cpu, error_count as errors FROM {} WHERE name = 'multi'" -numerical: - - field: cpu - alert-at: 90 - clear-at: 50 - - field: errors - alert-at: 100 - clear-at: 50 -send: - - id: test - subject: Test - template: Test -"#, - table_name - ); - - let mut alert: AlertDefinition = 
serde_yaml::from_str(&yaml).unwrap(); - alert.file = "test.yml".into(); - let (alert, _) = alert.normalise(&Default::default()).unwrap(); - - let ctx = Arc::new(InternalContext { - pg_pool: pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }); - let mut tera_ctx = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - - // Should trigger because both thresholds are exceeded - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx, - false, - ) - .await - .unwrap(); - assert!( - result.is_continue(), - "Should trigger when any threshold is exceeded" - ); - - // Lower cpu but keep errors high - let client = ctx.pg_pool.get().await.unwrap(); - let update_sql1 = format!("UPDATE {} SET value = 40 WHERE name = 'multi'", table_name); - client.execute(&update_sql1, &[]).await.unwrap(); - - let mut tera_ctx2 = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx2, - true, - ) - .await - .unwrap(); - assert!( - result.is_continue(), - "Should stay triggered because errors threshold still exceeded" - ); - - // Lower both to clear - let update_sql2 = format!( - "UPDATE {} SET error_count = 30 WHERE name = 'multi'", - table_name - ); - client.execute(&update_sql2, &[]).await.unwrap(); - - let mut tera_ctx3 = bestool_alertd::templates::build_context(&alert, jiff::Timestamp::now()); - let result = alert - .read_sources( - &ctx.pg_pool, - jiff::Timestamp::now() - alert.interval_duration, - &mut tera_ctx3, - true, - ) - .await - .unwrap(); - assert!( - result.is_break(), - "Should clear when all thresholds are below clear-at" - ); -} diff --git a/crates/alertd/tests/database_health.rs b/crates/alertd/tests/database_health.rs deleted file mode 100644 index 4faf8596..00000000 --- a/crates/alertd/tests/database_health.rs +++ /dev/null @@ 
-1,193 +0,0 @@ -use std::collections::HashMap; - -use bestool_alertd::{ - AlertDefinition, AlwaysSend, EventType, ExternalTarget, TargetConnection, TargetEmail, - TicketSource, WhenChanged, -}; - -#[test] -fn test_database_down_event_type_parsing() { - let yaml = "database-down"; - let event: EventType = serde_yaml::from_str(yaml).unwrap(); - assert_eq!(event, EventType::DatabaseDown); -} - -#[test] -fn test_database_down_event_alert_definition() { - let yaml = r#" -event: database-down -send: - - id: test-target - subject: "DB Down: {{ hostname }}" - template: "Database {{ database_url }} is unreachable: {{ error_message }}" -"#; - let alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - assert!(matches!( - alert.source, - TicketSource::Event { - event: EventType::DatabaseDown - } - )); -} - -#[test] -fn test_database_down_default_template_renders() { - let subject_template = "[bestool-alertd] {{ hostname }}: Database unreachable"; - let body_template = "The PostgreSQL database is unreachable.\n\n\ - Database URL: {{ database_url }}\n\ - Error:
{{ error_message }}
\n\n\ - All SQL-based alerts are non-functional until the database is restored."; - - let tera = bestool_alertd::templates::load_templates( - &Some(subject_template.to_string()), - body_template, - ) - .unwrap(); - - let synthetic_alert = AlertDefinition { - file: "[internal:database-down]".into(), - enabled: true, - interval: "0 seconds".to_string(), - interval_duration: std::time::Duration::from_secs(0), - always_send: AlwaysSend::Boolean(false), - when_changed: WhenChanged::default(), - send: Vec::new(), - server_kind: None, - source: TicketSource::Event { - event: EventType::DatabaseDown, - }, - }; - - let mut ctx = - bestool_alertd::templates::build_context(&synthetic_alert, jiff::Timestamp::now()); - ctx.insert("database_url", "postgresql://user:***@localhost/mydb"); - ctx.insert("error_message", "connection refused"); - - let (subject, body) = bestool_alertd::templates::render_alert(&tera, &mut ctx).unwrap(); - - assert!( - subject.contains("Database unreachable"), - "Subject should mention database unreachable, got: {subject}" - ); - assert!( - body.contains("postgresql://user:***@localhost/mydb"), - "Body should contain the (redacted) database URL, got: {body}" - ); - assert!( - body.contains("connection refused"), - "Body should contain the error message, got: {body}" - ); - assert!( - body.contains("The PostgreSQL database is unreachable"), - "Body should mention database is unreachable, got: {body}" - ); -} - -#[test] -fn test_database_down_event_alert_normalises_with_targets() { - let yaml = r#" -event: database-down -send: - - id: ops - subject: "DB DOWN" - template: "The database is down: {{ error_message }}" -"#; - let mut alert: AlertDefinition = serde_yaml::from_str(yaml).unwrap(); - alert.file = "db-down-alert.yml".into(); - - let mut external_targets = HashMap::new(); - external_targets.insert( - "ops".to_string(), - vec![ExternalTarget { - id: "ops".to_string(), - conn: TargetConnection::Email(TargetEmail { - addresses: 
vec!["ops@example.com".to_string()], - }), - }], - ); - - let (_alert, resolved) = alert.normalise(&external_targets).unwrap(); - assert!( - !resolved.is_empty(), - "Should resolve at least one target for the database-down event alert" - ); -} - -#[tokio::test] -async fn test_health_check_detects_unreachable_database() { - let bad_url = "postgresql://localhost:59999/nonexistent?connect_timeout=1"; - let pool_result = - bestool_postgres::pool::create_pool(bad_url, "bestool-alertd-health-test").await; - - assert!( - pool_result.is_err(), - "Connecting to a non-existent database should fail" - ); -} - -#[tokio::test] -async fn test_health_check_succeeds_on_valid_database() { - let db_url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set for tests"); - let pool = bestool_postgres::pool::create_pool(&db_url, "bestool-alertd-health-test") - .await - .unwrap(); - - let conn = pool - .get_timeout(std::time::Duration::from_secs(5)) - .await - .expect("should get a connection from pool"); - let result = conn.simple_query("SELECT 1").await; - assert!(result.is_ok(), "SELECT 1 health check should succeed"); -} - -#[test] -fn test_database_url_password_redaction() { - let url_with_password = "postgresql://user:secretpass@localhost:5432/mydb"; - let mut parsed = url::Url::parse(url_with_password).unwrap(); - let _ = parsed.set_password(Some("***")); - let redacted = parsed.to_string(); - - assert!( - !redacted.contains("secretpass"), - "Password should be redacted, got: {redacted}" - ); - assert!( - redacted.contains("***"), - "Redacted password should show ***, got: {redacted}" - ); - assert!( - redacted.contains("user"), - "Username should be preserved, got: {redacted}" - ); - assert!( - redacted.contains("localhost"), - "Host should be preserved, got: {redacted}" - ); -} - -#[test] -fn test_database_url_redaction_without_password() { - let url_without_password = "postgresql://localhost/mydb"; - let parsed = url::Url::parse(url_without_password).unwrap(); - 
assert!(parsed.password().is_none()); - let result = parsed.to_string(); - assert!( - !result.contains("***"), - "Should not add *** when no password present, got: {result}" - ); -} - -#[test] -fn test_database_url_redaction_unparsable() { - let bad_url = "not a url at all"; - let result = match url::Url::parse(bad_url) { - Ok(mut parsed) => { - if parsed.password().is_some() { - let _ = parsed.set_password(Some("***")); - } - parsed.to_string() - } - Err(_) => "(unparsable)".to_string(), - }; - assert_eq!(result, "(unparsable)"); -} diff --git a/crates/alertd/tests/reload.rs b/crates/alertd/tests/reload.rs deleted file mode 100644 index 2c302ddb..00000000 --- a/crates/alertd/tests/reload.rs +++ /dev/null @@ -1,77 +0,0 @@ -use std::{sync::Arc, time::Duration}; - -use axum::response::IntoResponse; -use bestool_alertd::InternalContext; -use tokio::sync::mpsc; - -#[tokio::test] -async fn test_reload_command_when_no_daemon_running() { - // This test verifies that the reload command fails gracefully when no daemon is running - let client = reqwest::Client::new(); - - let result = client - .get("http://127.0.0.1:8271/status") - .timeout(Duration::from_secs(1)) - .send() - .await; - - // If there's no daemon running, the request should fail - // (either connection refused or timeout) - assert!(result.is_err()); -} - -#[tokio::test] -async fn test_status_endpoint_response_format() { - // Start a mock HTTP server - let (reload_tx, _reload_rx) = mpsc::channel::<()>(10); - let started_at = jiff::Timestamp::now(); - let pid = std::process::id(); - - let db_url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set for tests"); - let pool = bestool_postgres::pool::create_pool(&db_url, "bestool-alertd-test") - .await - .unwrap(); - let ctx = Arc::new(InternalContext { - pg_pool: pool.clone(), - http_client: reqwest::Client::new(), - canopy_client: None, - }); - - let scheduler = Arc::new(bestool_alertd::scheduler::Scheduler::new( - vec![], - ctx.clone(), - None, - 
true, - None, - )); - - let state = Arc::new(bestool_alertd::http_server::ServerState { - reload_tx, - started_at, - pid, - event_manager: None, - internal_context: ctx, - email_config: None, - dry_run: true, - scheduler, - watchdog_timeout: None, - task_endpoints: Arc::new(std::collections::HashMap::new()), - }); - - // This verifies the response structure without needing a full daemon - let response = bestool_alertd::http_server::handle_status(axum::extract::State(state)) - .await - .into_response(); - - assert_eq!(response.status(), axum::http::StatusCode::OK); - - let body = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let status: serde_json::Value = serde_json::from_slice(&body).unwrap(); - - assert_eq!(status["name"], "bestool-alertd"); - assert!(status["version"].is_string()); - assert!(status["started_at"].is_string()); - assert_eq!(status["pid"], pid); -} diff --git a/crates/alertd/tests/state_persistence.rs b/crates/alertd/tests/state_persistence.rs deleted file mode 100644 index ba365a77..00000000 --- a/crates/alertd/tests/state_persistence.rs +++ /dev/null @@ -1,162 +0,0 @@ -use std::{collections::HashMap, path::PathBuf, sync::Arc}; - -use bestool_alertd::{ - InternalContext, - scheduler::Scheduler, - state_file::{PersistedAlertState, PersistedState}, -}; -use tempfile::TempDir; - -async fn make_ctx() -> Arc<InternalContext> { - let db_url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set for tests"); - let pg_pool = bestool_postgres::pool::create_pool(&db_url, "bestool-alertd-test") - .await - .unwrap(); - Arc::new(InternalContext { - pg_pool, - http_client: reqwest::Client::new(), - canopy_client: None, - }) -} - -fn write_alert(dir: &std::path::Path, name: &str, body: &str) -> PathBuf { - let path = dir.join(name); - std::fs::write(&path, body).unwrap(); - path -} - -#[tokio::test] -async fn hydration_seeds_triggered_at_for_matched_alert() { - let tmp = TempDir::new().unwrap(); - let alert_path = write_alert( -
tmp.path(), - "disk.yml", - "sql: \"SELECT 1\"\nsend:\n - id: ops\n subject: x\n template: y\n", - ); - - let triggered_at: jiff::Timestamp = "2026-05-13T14:55:00Z".parse().unwrap(); - let mut alerts = HashMap::new(); - alerts.insert( - alert_path.clone(), - PersistedAlertState { - triggered_at: Some(triggered_at), - ..Default::default() - }, - ); - let persisted = PersistedState { - saved_at: None, - alerts, - ..Default::default() - }; - - let ctx = make_ctx().await; - let scheduler = Scheduler::new( - vec![tmp.path().to_string_lossy().into_owned()], - ctx, - None, - true, // dry_run keeps task wakeups inert for the test - None, - ); - - scheduler.set_pending_hydration(persisted).await; - scheduler.load_and_schedule_alerts().await.unwrap(); - - let states = scheduler.get_alert_states().await; - let state = states - .get(&alert_path) - .expect("alert should be loaded under its canonical path"); - assert_eq!( - state.triggered_at, - Some(triggered_at), - "triggered_at should be hydrated from the persisted state" - ); -} - -#[tokio::test] -async fn hydration_ignores_entries_for_unknown_alerts() { - let tmp = TempDir::new().unwrap(); - let alert_path = write_alert( - tmp.path(), - "present.yml", - "sql: \"SELECT 1\"\nsend:\n - id: ops\n subject: x\n template: y\n", - ); - - let mut alerts = HashMap::new(); - // Entry for an alert file that doesn't exist on disk. 
- alerts.insert( - PathBuf::from("/no/such/path.yml"), - PersistedAlertState { - triggered_at: Some("2026-05-13T14:55:00Z".parse().unwrap()), - ..Default::default() - }, - ); - let persisted = PersistedState { - saved_at: None, - alerts, - ..Default::default() - }; - - let ctx = make_ctx().await; - let scheduler = Scheduler::new( - vec![tmp.path().to_string_lossy().into_owned()], - ctx, - None, - true, - None, - ); - - scheduler.set_pending_hydration(persisted).await; - scheduler.load_and_schedule_alerts().await.unwrap(); - - let states = scheduler.get_alert_states().await; - let state = states.get(&alert_path).unwrap(); - assert!( - state.triggered_at.is_none(), - "orphan hydration entries must not seed unrelated alerts" - ); -} - -#[tokio::test] -async fn snapshot_round_trips_through_persistence() { - let tmp = TempDir::new().unwrap(); - let alert_path = write_alert( - tmp.path(), - "disk.yml", - "sql: \"SELECT 1\"\nsend:\n - id: ops\n subject: x\n template: y\n", - ); - - let triggered_at: jiff::Timestamp = "2026-05-13T14:55:00Z".parse().unwrap(); - let mut alerts = HashMap::new(); - alerts.insert( - alert_path.clone(), - PersistedAlertState { - triggered_at: Some(triggered_at), - last_output: Some("rows=...".into()), - ..Default::default() - }, - ); - let persisted = PersistedState { - saved_at: None, - alerts, - ..Default::default() - }; - - let ctx = make_ctx().await; - let scheduler = Scheduler::new( - vec![tmp.path().to_string_lossy().into_owned()], - ctx, - None, - true, - None, - ); - scheduler.set_pending_hydration(persisted).await; - scheduler.load_and_schedule_alerts().await.unwrap(); - - let snapshot = scheduler.snapshot_for_persistence().await; - let entry = snapshot - .alerts - .get(&alert_path) - .expect("snapshot should include the loaded alert"); - assert_eq!(entry.triggered_at, Some(triggered_at)); - assert_eq!(entry.last_output.as_deref(), Some("rows=...")); -} diff --git a/crates/alertd/update-usage.sh b/crates/alertd/update-usage.sh deleted 
file mode 100755 index 78ea70b4..00000000 --- a/crates/alertd/update-usage.sh +++ /dev/null @@ -1,5 +0,0 @@ -#!/bin/sh -set -eu - -cd "$(dirname "$0")/../.." -exec ./update-usage.sh diff --git a/crates/bestool/Cargo.toml b/crates/bestool/Cargo.toml index 3adbca9e..42870458 100644 --- a/crates/bestool/Cargo.toml +++ b/crates/bestool/Cargo.toml @@ -52,7 +52,6 @@ json5 = { version = "1.3.0", optional = true } leon = { version = "3.0.1", optional = true } leon-macros = { version = "1.0.2", optional = true } lloggs = "1.1.0" -mailgun-rs = { version = "2.0.2", optional = true } merkle_hash = { version = "3.8.0", optional = true } miette = { workspace = true, features = ["fancy"] } mimalloc = "0.1.50" @@ -61,7 +60,6 @@ owo-colors = { version = "4.2.3", optional = true } p256 = { version = "0.13.2", features = ["pkcs8", "pem"], optional = true } percent-encoding = { version = "2.3.1", optional = true } privilege = { version = "0.3.0", optional = true } -pulldown-cmark = { version = "0.13.3", optional = true } quick-xml = { version = "0.40.1", features = ["serialize"], optional = true } rand_core = { version = "0.6", features = ["getrandom"], optional = true } regex = { version = "1.11.2", optional = true } @@ -71,7 +69,6 @@ rppal = { version = "0.22.1", optional = true } semver = { version = "1.0.28", optional = true } serde = { version = "1.0.219", features = ["derive"] } serde_json = "1.0.143" -serde_path_to_error = { version = "0.1.17", optional = true } serde_yaml = { version = "0.9.33", optional = true } ssh-key = { version = "0.6.6", optional = true } sysinfo = { workspace = true } @@ -82,7 +79,6 @@ terminal_size = { version = "0.4.4", optional = true } thiserror = { workspace = true } tokio = { workspace = true, features = ["full"] } tokio-postgres = { version = "0.7.17", features = ["with-jiff-0_2", "with-uuid-1"], optional = true } -tokio-stream = { version = "0.1.17", optional = true } tokio-util = { workspace = true, optional = true } tracing = { workspace = 
true } upgrade = { version = "2.0.1", optional = true } @@ -118,7 +114,6 @@ ssh = ["dep:dirs", "dep:duct", "dep:fs4", "dep:ssh-key", "dep:privilege", "dep:w ## Tamanu subcommands tamanu = [ # enable all tamanu subcommands - "tamanu-alerts", "tamanu-alertd", "tamanu-artifacts", "tamanu-backup", @@ -136,21 +131,7 @@ tamanu = [ # enable all tamanu subcommands "tamanu-sync", "tamanu-tags", ] -tamanu-alerts = [ - "__tamanu", - "tamanu-config", - "dep:bestool-alertd", - "dep:bestool-canopy", - "dep:mailgun-rs", - "dep:p256", - "dep:pulldown-cmark", - "dep:serde_path_to_error", - "dep:serde_yaml", - "dep:tera", - "dep:tokio-postgres", - "dep:walkdir", -] -tamanu-alertd = ["__tamanu", "tamanu-config", "dep:bestool-alertd", "dep:bestool-postgres", "dep:p256", "dep:serde_path_to_error", "dep:serde_yaml", "dep:tokio-postgres", "dep:tokio-stream", "dep:walkdir"] +tamanu-alertd = ["__tamanu", "tamanu-config", "dep:bestool-alertd", "dep:bestool-postgres", "dep:p256"] tamanu-artifacts = ["__tamanu", "dep:comfy-table", "dep:detect-targets", "dep:target-tuples"] tamanu-backup = ["__tamanu", "file", "tamanu-config", "dep:bestool-psql", "dep:algae-cli", "dep:duct"] tamanu-backup-configs = ["__tamanu", "tamanu-backup", "dep:walkdir", "dep:zip"] diff --git a/crates/bestool/src/actions/tamanu/alertd.rs b/crates/bestool/src/actions/tamanu/alertd.rs index 89e9190d..a6150a05 100644 --- a/crates/bestool/src/actions/tamanu/alertd.rs +++ b/crates/bestool/src/actions/tamanu/alertd.rs @@ -1,12 +1,8 @@ -use std::{ - net::SocketAddr, - path::{Path, PathBuf}, - sync::Arc, -}; +use std::{net::SocketAddr, path::Path, sync::Arc}; use clap::{Parser, Subcommand}; use miette::Result; -use tracing::{debug, info}; +use tracing::debug; use bestool_tamanu::{ config::{TamanuConfig, load_config}, @@ -18,13 +14,11 @@ use bestool_alertd::doctor::DoctorTask; use super::{TamanuArgs, find_tamanu}; use crate::actions::Context; -/// Run the alert daemon -/// -/// The alert and target definitions are documented 
online at: -/// -/// and . +/// Run the healthcheck daemon /// -/// Configuration for database and email is read from Tamanu's config files. +/// Periodically runs the doctor healthcheck sweep and posts the result to +/// canopy. Database and device-key configuration is read from Tamanu's config +/// files. #[derive(Debug, Clone, Parser)] #[clap(verbatim_doc_comment)] pub struct AlertdArgs { @@ -35,18 +29,6 @@ pub struct AlertdArgs { /// Common arguments for running the daemon #[derive(Debug, Clone, Parser)] struct DaemonArgs { - /// Glob patterns for alert definitions - /// - /// Patterns can match directories (which will be read recursively) or individual files. - /// Can be provided multiple times. - /// Examples: /etc/tamanu/alerts, /opt/*/alerts, /etc/tamanu/alerts/**/*.yml - #[arg(long)] - glob: Vec, - - /// Execute all alerts once and quit (ignoring intervals) - #[arg(long)] - dry_run: bool, - /// Disable the HTTP server #[arg(long)] no_server: bool, @@ -60,121 +42,30 @@ struct DaemonArgs { /// Watchdog timeout in seconds /// - /// If no alert task reports activity within this many seconds, the daemon + /// If no task reports activity within this many seconds, the daemon /// will exit so the service manager can restart it. Defaults to 600 (10 minutes). #[arg(long, default_value = "600")] watchdog_timeout: u64, /// Disable the watchdog /// - /// By default, the daemon will exit if no alert activity is detected within - /// the watchdog timeout. This flag disables that behavior. + /// By default, the daemon will exit if no task activity is detected within + /// the watchdog timeout. This flag disables that behaviour. #[arg(long)] no_watchdog: bool, - - /// Disable the periodic doctor healthcheck sweep - /// - /// By default, the daemon runs the full doctor check registry every minute - /// and posts the result to canopy. This flag turns that off. 
- #[arg(long)] - no_healthchecks: bool, } #[derive(Debug, Clone, Subcommand)] enum Command { - /// Run the alert daemon + /// Run the healthcheck daemon /// - /// Starts the daemon which monitors alert definition files and executes alerts - /// based on their configured schedules. The daemon will watch for file changes - /// and automatically reload when definitions are modified. + /// Starts the daemon which runs the doctor healthcheck sweep on a schedule + /// and posts the result to canopy. Run { #[command(flatten)] daemon: DaemonArgs, }, - /// Show status and health of a running daemon - /// - /// Connects to the running daemon's HTTP API and displays version, uptime, - /// health, and watchdog information. Exits with code 1 if the daemon is unhealthy. - Status { - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - - /// Send reload signal to running daemon - /// - /// Connects to the running daemon's HTTP API and triggers a reload. - /// This is an alternative to SIGHUP that works on all platforms including Windows. - Reload { - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - - /// List currently loaded alert files - /// - /// Connects to the running daemon's HTTP API and retrieves the list of - /// currently loaded alert definition files. - LoadedAlerts { - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. 
Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - - /// Show detailed state information for each alert - #[arg(long)] - detail: bool, - }, - - /// Temporarily pause an alert - /// - /// Pauses an alert until the specified time. The alert will not execute during - /// this period. The pause is lost when the daemon restarts. - PauseAlert { - /// Alert file path to pause - alert: String, - - /// Time until which to pause the alert (fuzzy time format) - /// - /// Examples: "1 hour", "2 days", "next monday", "2024-12-25T10:00:00Z" - /// Defaults to 1 week from now if not specified. - #[arg(long)] - until: Option, - - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - - /// Validate an alert definition file - /// - /// Parses an alert definition file and reports any syntax or validation errors. - /// Uses pretty error reporting to pinpoint the exact location of problems. - /// Requires the daemon to be running. - Validate { - /// Path to the alert definition file to validate - file: PathBuf, - - /// HTTP server address(es) to try - /// - /// Can be provided multiple times. Will attempt to connect to each address - /// in order until one succeeds. 
Defaults to [::1]:8271 and 127.0.0.1:8271 - #[arg(long)] - server_addr: Vec, - }, - /// Install the daemon as a Windows service /// /// Creates a Windows service named 'bestool-alertd' that will start automatically @@ -206,33 +97,6 @@ enum Command { pub async fn run(args: AlertdArgs, ctx: Context) -> Result<()> { match args.command { - Command::Status { server_addr } => { - let addrs = resolve_addrs(server_addr); - bestool_alertd::commands::get_status(&addrs).await - } - Command::Validate { file, server_addr } => { - let addrs = resolve_addrs(server_addr); - bestool_alertd::commands::validate_alert(&file, &addrs).await - } - Command::Reload { server_addr } => { - let addrs = resolve_addrs(server_addr); - bestool_alertd::commands::send_reload(&addrs).await - } - Command::LoadedAlerts { - server_addr, - detail, - } => { - let addrs = resolve_addrs(server_addr); - bestool_alertd::commands::get_loaded_alerts(&addrs, detail).await - } - Command::PauseAlert { - alert, - until, - server_addr, - } => { - let addrs = resolve_addrs(server_addr); - bestool_alertd::commands::pause_alert(&alert, until.as_deref(), &addrs).await - } Command::Run { daemon } => { let (version, root) = find_tamanu(ctx.require::())?; let config = load_config(&root, None)?; @@ -263,7 +127,7 @@ pub async fn run(args: AlertdArgs, ctx: Context) -> Result<()> { // Check and auto-apply recovery configuration if needed match bestool_alertd::windows_service::is_recovery_configured() { Ok(false) => { - info!("failure recovery not configured, applying automatically"); + tracing::info!("failure recovery not configured, applying automatically"); if let Err(e) = bestool_alertd::windows_service::configure_recovery() { tracing::warn!("failed to auto-configure recovery: {e}"); } @@ -280,53 +144,20 @@ pub async fn run(args: AlertdArgs, ctx: Context) -> Result<()> { } } -fn resolve_addrs(server_addr: Vec) -> Vec { - if server_addr.is_empty() { - bestool_alertd::commands::default_server_addrs() - } else { - server_addr 
- } -} - async fn build_config( root: &Path, tamanu_version: &node_semver::Version, config: TamanuConfig, DaemonArgs { - glob, - dry_run, no_server, server_addr, watchdog_timeout, no_watchdog, - no_healthchecks, }: DaemonArgs, ) -> Result { - let dirs = if glob.is_empty() { - default_dirs(root).await - } else { - glob - }; - debug!(?dirs, "alert directories"); - - if dirs.is_empty() { - return Err(miette::miette!("no alert directories found or specified")); - } - - info!("starting alertd daemon"); - let database_url = config.database_url(); let pg_pool = bestool_postgres::pool::create_pool(&database_url, "bestool-alertd").await?; - let email = config - .mailgun - .as_ref() - .map(|mg| bestool_alertd::EmailConfig { - from: mg.sender.clone(), - mailgun_api_key: mg.api_key.clone(), - mailgun_domain: mg.domain.clone(), - }); - let watchdog = if no_watchdog { None } else { @@ -346,33 +177,21 @@ async fn build_config( let config = Arc::new(config); - let server_kind = detect_server_kind(&config, &pg_pool).await; - let mut daemon_config = bestool_alertd::DaemonConfig::new( - dirs, pg_pool.clone(), database_url.clone(), tamanu_version.to_string(), ) - .with_dry_run(dry_run) .with_no_server(no_server) .with_server_addrs(server_addr) .with_watchdog_timeout(watchdog) - .with_server_kind(server_kind); - - if !no_healthchecks { - daemon_config = daemon_config.with_task(Arc::new(DoctorTask::new( - env!("CARGO_PKG_VERSION").to_string(), - tamanu_version.clone(), - root.to_path_buf(), - config.clone(), - database_url, - ))); - } - - if let Some(email) = email { - daemon_config = daemon_config.with_email(email); - } + .with_task(Arc::new(DoctorTask::new( + env!("CARGO_PKG_VERSION").to_string(), + tamanu_version.clone(), + root.to_path_buf(), + config.clone(), + database_url, + ))); if let Some(pem) = device_key_pem { daemon_config = daemon_config.with_device_key_pem(pem); @@ -380,53 +199,3 @@ async fn build_config( Ok(daemon_config) } - -/// Resolve the Tamanu server kind for 
alertd's alert-definition filter. -/// -/// Borrows a connection from the daemon pool so -/// [`bestool_tamanu::detect_kind`] can check `local_system_facts`. Alertd is -/// plugin-agnostic — it takes the kind as an opaque string and compares -/// against alert YAML `server-kind:` values by equality. The Tamanu -/// vocabulary (`central` / `facility`) is decided here, at the integration -/// boundary, not by alertd. -async fn detect_server_kind( - config: &bestool_tamanu::config::TamanuConfig, - pg_pool: &bestool_postgres::pool::PgPool, -) -> &'static str { - let conn = match pg_pool.get().await { - Ok(c) => Some(c), - Err(err) => { - tracing::debug!(%err, "no DB for kind detection; falling back to config-only"); - None - } - }; - match bestool_tamanu::detect_kind(config, conn.as_deref()).await { - bestool_tamanu::ApiServerKind::Central => "central", - bestool_tamanu::ApiServerKind::Facility => "facility", - } -} - -async fn default_dirs(root: &std::path::Path) -> Vec { - use futures::future::join_all; - - let mut dirs = vec![ - PathBuf::from(r"C:\Tamanu\alerts"), - root.join("alerts"), - PathBuf::from("/opt/tamanu-toolbox/alerts"), - PathBuf::from("/etc/tamanu/alerts"), - PathBuf::from("/alerts"), - ]; - if let Ok(cwd) = std::env::current_dir() { - dirs.push(cwd.join("alerts")); - } - - join_all( - dirs.into_iter() - .map(|dir| async { if dir.exists() { Some(dir) } else { None } }), - ) - .await - .into_iter() - .flatten() - .map(|p| p.display().to_string()) - .collect() -} diff --git a/update-usage.sh b/update-usage.sh index f1bdd146..894cf0b9 100755 --- a/update-usage.sh +++ b/update-usage.sh @@ -17,7 +17,6 @@ echo "Updating USAGE.md files..." 
 genusage algae algae-cli
 genusage "bestool --features iti,iti-improv-wifi" bestool
-genusage bestool-alertd alertd
 genusage bestool-psql psql
 
 echo "All USAGE.md files updated successfully"

From 3ec22f554e11f4fa61f8baa2058cf2122bc89b02 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?F=C3=A9lix=20Saparelli?=
Date: Sun, 31 May 2026 18:25:03 +1200
Subject: [PATCH 10/12] unplan: healthchecks consolidated into alertd

All four phases shipped: doctor subsystem moved into bestool-alertd, the
16 YAML alerts migrated to checks, the YAML alert engine + standalone
CLI retired, and check thresholds tiered with host capacity reported.
windows_service was kept (the daemon still runs as a Windows service)
and cpuCores/totalMemoryBytes were added to the status payload.

Co-authored-by: Claude
---
 docs/plans/healthchecks-into-alertd.md | 72 --------------------------
 1 file changed, 72 deletions(-)
 delete mode 100644 docs/plans/healthchecks-into-alertd.md

diff --git a/docs/plans/healthchecks-into-alertd.md b/docs/plans/healthchecks-into-alertd.md
deleted file mode 100644
index edcf65b6..00000000
--- a/docs/plans/healthchecks-into-alertd.md
+++ /dev/null
@@ -1,72 +0,0 @@
-# Plan: consolidate healthchecks into alertd, migrate YAML alerts to checks, retire the alert engine (TODO #10)
-
-## Context
-
-Two monitoring systems run in parallel:
-
-- **Healthchecks** (doctor): code-defined `Check`s run as a concurrent "sweep" → pass/warning/fail + JSON details, POSTed to **canopy** (`POST /status/{server_id}`). Today these live in `crates/tamanu/src/doctor/` (the `Check` type + checks), with the sweep orchestration (`perform_sweep`), canopy posting, and the `DoctorTask` background task in `crates/bestool/`. Viewable via `bestool tamanu doctor`.
-- **alertd** (`crates/alertd/`): a daemon loading **YAML alert definitions** (deployed in Tamanu installs at `/etc/tamanu/alerts`, …), scheduling each on its own interval, evaluating SQL/shell/event sources, and dispatching to email/Slack/canopy `/events`. Ships a standalone `bestool-alertd` binary + library; also hosts the `DoctorTask`.
-
-Decisions:
-
-1. **Invert the crate relationship.** Move the whole doctor subsystem (framework + checks + sweep + canopy posting + `DoctorTask`) **into `bestool-alertd`**, which calls into `bestool-tamanu` for common Tamanu domain utilities. alertd becomes the monitoring engine that owns both the framework and the checks. No dependency cycle: `bestool-tamanu` never depends on alertd.
-2. **Migrate** all 16 production YAML alerts (`~/code/work/tamanu/alerts`) into checks. Migrated checks default to **`Check::fail`** when triggered (single severity, no warn tier).
-3. **Canopy owns alerting** and has its own logic — it ignores the sweep's top-level `healthy:false`, so the warn-vs-fail-for-top-level distinction is irrelevant at the canopy level. bestool just posts the sweep; drop email/Slack/per-alert targets, dedup, hysteresis, cadence.
-4. **Retire the YAML alert engine and the standalone CLI**; alertd keeps only the daemon framework + the doctor subsystem.
-5. **Then review thresholds** across all checks (migrated and pre-existing).
-
-Note: deployed installs still have YAML files under `/etc/tamanu/alerts`; once the loader is removed they're simply ignored (no error). Operators can delete them later.
-
-## Target architecture
-
-- **`bestool-alertd`**: owns the monitoring framework (`BackgroundTask` daemon, http server) **and** the doctor subsystem — `Check`/`CheckStatus`/`OverallResult` wire types, `CheckContext`, the registry + `checks/*`, `progress`, the `ServerInfo` facts, `perform_sweep` + `SweepResult` + canopy status posting, and a built-in `DoctorTask` it registers itself. Depends on `bestool-tamanu` (common domain), `bestool-canopy`, `bestool-postgres`, `bestool-kopia`.
-- **`bestool-tamanu`**: common Tamanu domain library only — `config`, `roots`, `connection_url`, `services`, `systemd`, `pm2`, `server_info` (DB queries: metaServerId, patient-portal), `versions`, `ApiServerKind`, `find_tamanu`, `detect_kind`. The `doctor` module and `doctor` feature are removed; description updated.
-- **`bestool`**: thin CLI. `bestool tamanu doctor` keeps arg parsing + human rendering + daemon-fetch (`/tasks/doctor/latest`/`recompute`) and calls `bestool_alertd::doctor` for local sweeps + types. `bestool tamanu alertd` configures and runs the alertd daemon (which self-registers its `DoctorTask`).
-
-## Phase 1 — Invert: move the doctor subsystem into alertd (behaviour-preserving refactor)
-
-- Relocate `crates/tamanu/src/doctor/{check,checks,checks/*,progress,server_info}.rs` → `crates/alertd/src/doctor/…`.
-- Move `perform_sweep` + `SweepResult` + canopy status posting from `crates/bestool/src/actions/tamanu/doctor.rs` into alertd (e.g. `bestool_alertd::doctor::perform_sweep`).
-- Move `DoctorTask` (`crates/bestool/src/actions/tamanu/alertd/doctor_task.rs`) into alertd as the built-in task; alertd registers it (or exposes a constructor) so bestool no longer wires it.
-- Add `bestool-tamanu` as an alertd dependency; rewrite check imports from `crate::{ApiServerKind, config::TamanuConfig, services, systemd, pm2, server_info, detect_kind, versions}` → `bestool_tamanu::{…}`.
-- Move doctor-only deps (`bestool-kopia`, `hickory-resolver`, and `reqwest`/`owo-colors` as needed) from `crates/tamanu/Cargo.toml` to `crates/alertd/Cargo.toml`; remove tamanu's `doctor` feature and update its package description.
-- bestool side: `doctor.rs` keeps CLI args + rendering + daemon-fetch, calling `bestool_alertd::doctor`; delete the moved `doctor_task` module; retarget Cargo features (`bestool-tamanu/doctor` → alertd).
-- **Behaviour-preserving** — no check logic changes. This is large but mechanical (mostly imports + module moves).
-
-## Phase 2 — Migrate the 16 YAML alerts to checks (now in alertd), default FAIL, central-only
-
-Migrated checks emit `Check::fail` when triggered, skip on Facility (gate on `ctx.kind`, mirroring `fhir_jobs`), and attach offending rows as `details`. A shared "recent error rows" helper serves the 7 recent-error alerts (run query → `fail` with rows if any match, else `pass`); the old per-alert `$1 = now - interval` becomes a per-check lookback constant. Verbatim SQL is in `~/code/work/tamanu/alerts/.yml`.
-
-**New checks (~10):**
-| Alert(s) | New check | Style |
-|---|---|---|
-| certificate-notification-error | `certificate_notification_errors` | recent-error |
-| ips-error | `ips_errors` | recent-error |
-| patient-communications-error | `patient_communication_errors` | recent-error |
-| report-error | `report_errors` | recent-error |
-| fhir-error | `fhir_job_errors` | recent-error |
-| sync-errors-mobile + sync-errors-server | `sync_session_errors` (one check; detail splits mobile/server; keep benign-error exclusions) | recent-error |
-| sync-facility-not-syncing + sync-no-sessions | `sync_facility_stale` (one check; facilities with no recent successful sync) | stuck |
-| sync-lookup-stale | `sync_lookup` (**= TODO #8**) | stuck |
-| sync-restart-loop | `sync_restart_loop` | threshold |
-| fhir-unresolvable-service-requests-labs | `fhir_service_requests_unresolved` | stuck |
-
-**Already covered (confirm/extend detail, no new check):** fhir-queue-incredibly-large, fhir-queued-job-long, fhir-running-job-long → `fhir_jobs`; sync-long → `sync_sessions`.
-
-Add via the registry pattern: `pub mod ;` + `entry!("", )` in the registry; `pub async fn run(ctx: CheckContext) -> Check`. Split into ~3 PRs by theme (error-notification / sync / fhir+reconcile).
-
-## Phase 3 — Retire the YAML alert engine + standalone CLI
-
-- Remove from alertd: `alert.rs`, `loader.rs`, `glob_resolver.rs`, `events.rs`, `targets.rs` + `targets/*`, `templates.rs`, per-alert `state_file.rs`, the alert parts of `scheduler.rs`, `commands.rs` + `commands/*`, `main.rs`, the `[[bin]]` + `cli` feature, `windows_service.rs`. Trim `DaemonConfig` (drop `alert_globs`, `email`, `server_kind`, alert `dry_run`; keep `pg_pool`, `database_url`, `device_key_pem`, `tamanu_version`, `no_server`, `server_addrs`, `watchdog_timeout`, `background_tasks`), `daemon.rs`, `http_server` (drop `/alerts`,`/targets`,`/validate`,`/reload`,`/pause`; keep `/`,`/status`,`/health`,`/metrics`,`/tasks/*`), and `lib.rs` exports. Relocate `InternalContext` out of `alert.rs` into `daemon.rs`/`context.rs`, slimmed to `{ pg_pool, http_client, canopy_client }`.
-- bestool: simplify `tamanu alertd` (drop alert-dir discovery/globs, email/Mailgun flags, alert-filtering `server_kind`, and the passthrough subcommands `status`/`reload`/`pause`/`validate`/`loaded-alerts`); keep pg pool, device-key fetch (canopy auth), `tamanu_version`, build `DaemonConfig`, run. Remove the legacy `bestool tamanu alerts` command + module. Delete example alerts (`alerts/`) and alert test fixtures (`crates/bestool/tests/cmd/alerts*`).
-- Gated after Phase 2 so coverage isn't lost. Optional follow-up (not in scope): rename `bestool-alertd` / `bestool tamanu alertd` now that it owns healthchecks, not alerts — deferred to avoid crates.io + systemd/install churn.
-
-## Phase 4 — Threshold review (all checks)
-
-After migration, review every check (the 10 migrated + the pre-existing ones) for triggering behaviour: warn-vs-fail, threshold values, central/facility gating, and whether any migrated check should be a warning rather than fail. Produce a short follow-up (possibly its own plan) and adjust. Migrated checks land at FAIL in Phase 2; this pass tunes them.
-
-## Verification
-
-- **Phase 1 (refactor)**: `cargo build`/`clippy` across the workspace and all feature combos; `cargo check -p bestool --target x86_64-pc-windows-gnu`; confirm identical behaviour — `bestool tamanu doctor` (local `--no-daemon` and daemon-fetch `--fresh`), canopy `/status` posting, and `/tasks/doctor/{latest,recompute}` all work; grep for dangling `bestool_tamanu::doctor` references.
-- **Phase 2 (checks)**: against the local `tamanu-central` / `tamanu-facility` databases, `cargo test -p bestool-alertd` (DB-backed tests where feasible) and `bestool tamanu doctor --json --no-daemon`; confirm each new check appears as pass/fail with `details`, and is skipped on a facility install.
-- **Phase 3 (teardown)**: full-workspace `cargo build`/`clippy` + Windows cross-check (windows_service removed); `bestool tamanu alertd` starts, ticks the sweep, posts to canopy, and `bestool tamanu doctor` still fetches from it; grep for leftover references (`loader`, `targets`, `templates`, `AlertDefinition`, `tamanu alerts`).
From 3edb697cc752e1088bca2f85405ab17cc310167d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Tue, 2 Jun 2026 16:41:09 +1200 Subject: [PATCH 11/12] chore: clippy round across the workspace Co-authored-by: Claude --- crates/bestool/src/actions/iti/battery.rs | 12 ++++++++---- crates/bestool/src/actions/iti/lcd.rs | 2 +- crates/bestool/src/actions/iti/sparks.rs | 7 +++---- 3 files changed, 12 insertions(+), 9 deletions(-) diff --git a/crates/bestool/src/actions/iti/battery.rs b/crates/bestool/src/actions/iti/battery.rs index 165a170d..84ee3c15 100644 --- a/crates/bestool/src/actions/iti/battery.rs +++ b/crates/bestool/src/actions/iti/battery.rs @@ -126,8 +126,7 @@ pub async fn once(args: &BatteryArgs, rolling: Option<&mut VecDeque>) -> Re Some(curr - pre) }) .enumerate() - .filter(|(n, diff)| *n >= 4.min(rolling.len() - 1) && *diff != 0.0) - .next() + .find(|(n, diff)| *n >= 4.min(rolling.len() - 1) && *diff != 0.0) .map(|(n, _)| n) .unwrap_or(rolling.len() - 1); @@ -239,7 +238,7 @@ pub async fn once(args: &BatteryArgs, rolling: Option<&mut VecDeque>) -> Re const BLACK: [u8; 3] = [0, 0, 0]; const WHITE: [u8; 3] = [255, 255, 255]; - let (fill, stroke) = if estimates.as_ref().map_or(false, |(rate, _)| *rate > 0.0) { + let (fill, stroke) = if estimates.as_ref().is_some_and(|(rate, _)| *rate > 0.0) { (GREEN, BLACK) } else if capacity <= 3.0 { (RED, WHITE) @@ -273,7 +272,12 @@ pub async fn once(args: &BatteryArgs, rolling: Option<&mut VecDeque>) -> Re ..Default::default() }); (18, 254) - } else if estimates.is_some_and(|(rate, _)| !(rate > 0.0) && !(rate < -0.0)) { + } else if estimates.is_some_and(|(rate, _)| { + !matches!( + rate.partial_cmp(&0.0), + Some(std::cmp::Ordering::Less | std::cmp::Ordering::Greater) + ) + }) { if capacity == 100.0 { items.push(Item { x: 20, diff --git a/crates/bestool/src/actions/iti/lcd.rs b/crates/bestool/src/actions/iti/lcd.rs index 27c3492d..622f042d 100644 --- a/crates/bestool/src/actions/iti/lcd.rs +++ 
b/crates/bestool/src/actions/iti/lcd.rs @@ -219,7 +219,7 @@ fn loop_inner( let polled = zmq::poll(&mut polls, 1000) .into_diagnostic() .wrap_err("zmq: poll")?; - if running.load(Ordering::SeqCst) == false { + if !running.load(Ordering::SeqCst) { info!("ctrl-c received, exiting"); return Ok(ControlFlow::Break(())); } diff --git a/crates/bestool/src/actions/iti/sparks.rs b/crates/bestool/src/actions/iti/sparks.rs index 38d2ecb6..15570aef 100644 --- a/crates/bestool/src/actions/iti/sparks.rs +++ b/crates/bestool/src/actions/iti/sparks.rs @@ -1,4 +1,4 @@ -use std::{collections::VecDeque, iter::repeat, time::Duration}; +use std::{collections::VecDeque, iter::repeat_n, time::Duration}; use clap::Parser; use miette::Result; @@ -144,7 +144,7 @@ pub fn render( FG_MEM, )); - send(&zmq_socket, Screen::Layout(items))?; + send(zmq_socket, Screen::Layout(items))?; Ok(()) } @@ -156,8 +156,7 @@ fn spark_line<'a>( height: i32, colour: [u8; 3], ) -> impl Iterator + 'a { - repeat(0.0) - .take((INNER_WIDTH as usize).saturating_sub(data.len())) + repeat_n(0.0, (INNER_WIDTH as usize).saturating_sub(data.len())) .chain(data) .map(move |v| { let v = v.clamp(0.0, 1.0); From 45c9cf79a515ff6f80604d5cee6e84091531468d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= Date: Wed, 3 Jun 2026 02:59:33 +1200 Subject: [PATCH 12/12] update usage --- crates/bestool/USAGE.md | 172 +++------------------------------------- 1 file changed, 10 insertions(+), 162 deletions(-) diff --git a/crates/bestool/USAGE.md b/crates/bestool/USAGE.md index 08f6879b..69a7a33e 100644 --- a/crates/bestool/USAGE.md +++ b/crates/bestool/USAGE.md @@ -48,14 +48,8 @@ This document contains the help content for the `bestool` command-line program. 
* [`bestool ssh`↴](#bestool-ssh) * [`bestool ssh add-key`↴](#bestool-ssh-add-key) * [`bestool tamanu`↴](#bestool-tamanu) -* [`bestool tamanu alerts`↴](#bestool-tamanu-alerts) * [`bestool tamanu alertd`↴](#bestool-tamanu-alertd) * [`bestool tamanu alertd run`↴](#bestool-tamanu-alertd-run) -* [`bestool tamanu alertd status`↴](#bestool-tamanu-alertd-status) -* [`bestool tamanu alertd reload`↴](#bestool-tamanu-alertd-reload) -* [`bestool tamanu alertd loaded-alerts`↴](#bestool-tamanu-alertd-loaded-alerts) -* [`bestool tamanu alertd pause-alert`↴](#bestool-tamanu-alertd-pause-alert) -* [`bestool tamanu alertd validate`↴](#bestool-tamanu-alertd-validate) * [`bestool tamanu artifacts`↴](#bestool-tamanu-artifacts) * [`bestool tamanu backup`↴](#bestool-tamanu-backup) * [`bestool tamanu backup-configs`↴](#bestool-tamanu-backup-configs) @@ -1197,8 +1191,7 @@ Alias: t ###### **Subcommands:** -* `alerts` — Execute alert definitions against Tamanu -* `alertd` — Run the alert daemon +* `alertd` — Run the healthcheck daemon * `artifacts` — List available artifacts for a Tamanu version * `backup` — Backup a local Tamanu database to a single file * `backup-configs` — Backup local Tamanu-related config files to a zip archive @@ -1225,189 +1218,44 @@ pseudo-services. -## `bestool tamanu alerts` - -Execute alert definitions against Tamanu - -DEPRECATED. Use `bestool tamanu alertd` for all new deployments. - -The alert and target definitions are documented online at: - -and . - -**Usage:** `bestool tamanu alerts [OPTIONS]` - -###### **Options:** - -* `--dir ` — Folder containing alert definitions. - - This folder will be read recursively for files with the `.yaml` or `.yml` extension. - - Files that don't match the expected format will be skipped, as will files with `enabled: false` at the top level. Syntax errors will be reported for YAML files. - - It's entirely valid to provide a folder that only contains a `_targets.yml` file. - - Can be provided multiple times. 
Defaults to (depending on platform): `C:\Tamanu\alerts`, `C:\Tamanu\{current-version}\alerts`, `/opt/tamanu-toolbox/alerts`, `/etc/tamanu/alerts`, `/alerts`, and `./alerts`. -* `--interval ` — How far back to look for alerts. - - This is a duration string, e.g. `1d` for one day, `1h` for one hour, etc. It should match the task scheduling / cron interval for this command. - - Default value: `15m` -* `--timeout ` — Timeout for each alert. - - If an alert takes longer than this to query the database or run the shell script, it will be skipped. Defaults to 30 seconds. - - This is a duration string, e.g. `1d` for one day, `1h` for one hour, etc. - - Default value: `30s` -* `--dry-run` — Don't actually send alerts, just print them to stdout - - - ## `bestool tamanu alertd` -Run the alert daemon +Run the healthcheck daemon -The alert and target definitions are documented online at: - -and . - -Configuration for database and email is read from Tamanu's config files. +Periodically runs the doctor healthcheck sweep and posts the result to +canopy. Database and device-key configuration is read from Tamanu's config +files. **Usage:** `bestool tamanu alertd ` ###### **Subcommands:** -* `run` — Run the alert daemon -* `status` — Show status and health of a running daemon -* `reload` — Send reload signal to running daemon -* `loaded-alerts` — List currently loaded alert files -* `pause-alert` — Temporarily pause an alert -* `validate` — Validate an alert definition file +* `run` — Run the healthcheck daemon ## `bestool tamanu alertd run` -Run the alert daemon +Run the healthcheck daemon -Starts the daemon which monitors alert definition files and executes alerts based on their configured schedules. The daemon will watch for file changes and automatically reload when definitions are modified. +Starts the daemon which runs the doctor healthcheck sweep on a schedule and posts the result to canopy. 
**Usage:** `bestool tamanu alertd run [OPTIONS]` ###### **Options:** -* `--glob ` — Glob patterns for alert definitions - - Patterns can match directories (which will be read recursively) or individual files. Can be provided multiple times. Examples: /etc/tamanu/alerts, /opt/*/alerts, /etc/tamanu/alerts/**/*.yml -* `--dry-run` — Execute all alerts once and quit (ignoring intervals) * `--no-server` — Disable the HTTP server * `--server-addr ` — HTTP server bind address(es) Can be provided multiple times. The server will attempt to bind to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 * `--watchdog-timeout ` — Watchdog timeout in seconds - If no alert task reports activity within this many seconds, the daemon will exit so the service manager can restart it. Defaults to 600 (10 minutes). + If no task reports activity within this many seconds, the daemon will exit so the service manager can restart it. Defaults to 600 (10 minutes). Default value: `600` * `--no-watchdog` — Disable the watchdog - By default, the daemon will exit if no alert activity is detected within the watchdog timeout. This flag disables that behavior. -* `--no-healthchecks` — Disable the periodic doctor healthcheck sweep - - By default, the daemon runs the full doctor check registry every minute and posts the result to canopy. This flag turns that off. - - - -## `bestool tamanu alertd status` - -Show status and health of a running daemon - -Connects to the running daemon's HTTP API and displays version, uptime, health, and watchdog information. Exits with code 1 if the daemon is unhealthy. - -**Usage:** `bestool tamanu alertd status [OPTIONS]` - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. 
Defaults to [::1]:8271 and 127.0.0.1:8271 - - - -## `bestool tamanu alertd reload` - -Send reload signal to running daemon - -Connects to the running daemon's HTTP API and triggers a reload. This is an alternative to SIGHUP that works on all platforms including Windows. - -**Usage:** `bestool tamanu alertd reload [OPTIONS]` - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 - - - -## `bestool tamanu alertd loaded-alerts` - -List currently loaded alert files - -Connects to the running daemon's HTTP API and retrieves the list of currently loaded alert definition files. - -**Usage:** `bestool tamanu alertd loaded-alerts [OPTIONS]` - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 -* `--detail` — Show detailed state information for each alert - - - -## `bestool tamanu alertd pause-alert` - -Temporarily pause an alert - -Pauses an alert until the specified time. The alert will not execute during this period. The pause is lost when the daemon restarts. - -**Usage:** `bestool tamanu alertd pause-alert [OPTIONS] ` - -###### **Arguments:** - -* `` — Alert file path to pause - -###### **Options:** - -* `--until ` — Time until which to pause the alert (fuzzy time format) - - Examples: "1 hour", "2 days", "next monday", "2024-12-25T10:00:00Z" Defaults to 1 week from now if not specified. -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. 
Defaults to [::1]:8271 and 127.0.0.1:8271 - - - -## `bestool tamanu alertd validate` - -Validate an alert definition file - -Parses an alert definition file and reports any syntax or validation errors. Uses pretty error reporting to pinpoint the exact location of problems. Requires the daemon to be running. - -**Usage:** `bestool tamanu alertd validate [OPTIONS] ` - -###### **Arguments:** - -* `` — Path to the alert definition file to validate - -###### **Options:** - -* `--server-addr ` — HTTP server address(es) to try - - Can be provided multiple times. Will attempt to connect to each address in order until one succeeds. Defaults to [::1]:8271 and 127.0.0.1:8271 + By default, the daemon will exit if no task activity is detected within the watchdog timeout. This flag disables that behaviour.