Add fmd inventory to the database#10345
Conversation
Exposes illumos Fault Management Daemon (FMD) data through the sled-agent inventory endpoint. This lets the control plane see diagnosed hardware/software faults on each sled. New API version 35 adds an `fmd: Option<FmdInventory>` field to `Inventory`. When present, it contains: - Cases: diagnosed faults with UUID, diagnostic code, URL, and the full event nvlist serialized as JSON - Resources: affected components with FMRI, fault status flags On illumos, sled-agent queries FMD on each inventory request. On non-illumos (sim, tests), the field is None. Database storage is not included — that's a follow-up.
The inventory endpoint is async, but FMD queries go through door calls to fmd(1M) that can stall the calling thread. Move the work onto spawn_blocking so it doesn't occupy a Tokio worker; surface any JoinError as FmdInventory::Error.
The FmdCase.url docstring now uses backticks instead of quotes around the example URL, which changes the schema description and thus the spec hash.
The verify-libraries xtask checks that binaries don't link against unexpected libraries. Add libfmd_adm.so.1 to the allowlist for the binaries that legitimately need it (sled-agent, sled-agent-sim, and omicron-dev which spins up sled-agent for tests).
oxidecomputer/fmd-adm#2 replaced the bool argument on `FmdAdm::resources()` with an `InvisibleResources` enum to make callsites self-documenting (per hawk's review feedback). Bump to the post-merge rev and update the call in sled-agent.
The Err arm was dead code: omicron compiles with panic="abort", so a panic inside the blocking task aborts the process before the JoinHandle can return Err. Switch to .expect with a descriptive message — if the invariant ever changes, the panic will say what happened. Addresses the take-it-or-leave-it nit at #10283 (comment r3112399887).
The illumos module is cfg-gated, so cargo check on Linux skips it entirely — the missing trait import wasn't visible locally. CI on helios caught it: three E0599 errors for from_untyped_uuid on FmdHostCaseUuid and FmdResourceUuid. Verified fix on atrium.
Adds three tables for persisting per-sled FMD data: inv_fmd_status — per-sled outcome of FMD collection inv_fmd_host_case — diagnosed cases (event payload as JSONB) inv_fmd_resource — resources affected by cases Bumps SCHEMA_VERSION to 254 with directory schema/crdb/inv-fmd. Adds diesel table! entries, db-model structs, and From impls for the read path. No callers yet — write/read/display follow in subsequent commits.
Wires the fmd field added to sled-agent's Inventory by the parent PR through into Nexus's in-memory inventory representation. The collector builder copies inventory.fmd verbatim. The DB read path will populate it from the inv_fmd_* tables in a follow-on commit; for now, the read path substitutes Available(empty) so existing tests round-trip cleanly.
Insert one InvFmdStatus row per sled in each inventory collection, plus a row per case and resource when collection succeeded. Wire the three tables into the existing prune transaction so old collections clean up after themselves.
Loads inv_fmd_status, inv_fmd_host_case, and inv_fmd_resource for the
collection and reconstructs SledAgent.fmd. Status row's NULL
error_message indicates Available; non-NULL becomes Error{error}. A
missing status row falls back to Available with whatever cases/resources
were found (defensive, in case of historical data predating this PR).
Adds Display wrappers for FmdInventoryResult/FmdInventory/FmdHostCase/ FmdResource on the sled-agent types. Wires them into nexus/types/src/inventory/display.rs::display_sleds so that `omdb db inventory collections show` (and reconfigurator-cli scripts that print sled inventories) include the FMD section. The FmdHostCase event payload is the FMD nvlist serialized to JSON; we intentionally don't interpret the schema, so it's pretty-printed verbatim under the case heading. Also seeds the representative test inventory (nexus/inventory examples) with a single fault case + resource so the inv_fmd_* tables get rows under test_representative_collection_populates_database. The reconfigurator-cli golden outputs grow a 'fmd:' section accordingly.
| // Load all FMD inventory rows. We expect at most ~tens of cases or | ||
| // resources per sled, so we don't bother paginating. |
There was a problem hiding this comment.
Could we get into some sort of trouble if this assumption is wrong (due to a bug, for example)?
There was a problem hiding this comment.
So the silver lining is that inventories are periodically trimmed, so this wouldn't last forever.
Dealing with this in a "pathological case" is complicated.
Suppose we have a bug which emits hundreds of thousands of faults on a host.
If we use an artificial cap of "N", it's possible that "the first N faults are spew, the N+1'th one is actually useful". But it's also possible that "N" doesn't fit in an inventory collection? So the caller may want to know "I saw N ereports, of M total, (for N < M), and I need some way of accessing the rest of them". But that is more complicated too, because then we're sorta eyeing inventory and sorta side-stepping it?
It probably does make sense to handle this in the limit, but are you okay with side-stepping this in the short-term, so we can get the flow working end-to-end? I think we'll probably "really want to see all faults" for the short-term future, but I can file an issue
There was a problem hiding this comment.
Definitely, just asking questions. If there isn't an easy solution to this one, it seems fair to punt.
There was a problem hiding this comment.
... I had a vision of a broken system with a million faults, and I re-visited this.
The majority of the change exists in the parent PR, in e43ae1f , but I also merged that into this PR as d9d3da3
TL;DR: I'm setting an upper bound of "1000" for cases/resources, and if we cross that threshold, we'll return an explicit error. Since this upper bound is considered pathological, it should be fine to do this, and if that bound is too small, it should be fine to make it larger.
| ); | ||
|
|
||
| CREATE TABLE IF NOT EXISTS omicron.public.inv_fmd_status ( | ||
| inv_collection_id UUID NOT NULL, |
There was a problem hiding this comment.
Nit (here and below): need a comment like:
-- (foreign key into `inv_collection` table)If a build.rs calls configure_default_omicron_rpaths() but contributes no RPATH entries on illumos, the caller almost certainly forgot a direct `*-sys` dep in Cargo.toml. Catching this at build time avoids silently producing a binary with NEEDED libfmd_adm.so.1 (or libpq.so.5) but no RPATH, which would only surface as an ld.so.1 failure at process startup. The check is illumos-only because fmd-adm-sys is illumos-gated; on Linux a caller can legitimately contribute nothing.
hawkw asked why we were reinventing Result. The answer for the wire format is openapi-lint requires snake_case property names, and Result's default serde representation produces 'Ok'/'Err'. Use omicron_common::snake_case_result (already used for other Result fields in the inventory) to keep the Rust API as a normal Result while presenting snake_case on the wire.
Match the convention used by other inv_* tables in dbinit.sql. Addresses review feedback on #10345.
Captures the full `source()` chain on the three error paths in `fmd::illumos::collect` (FMD open / list cases / list resources). The underlying `fmd_adm::Error` variants `Nul(NulError)` and `Uuid(uuid::Error)` carry inner errors via `#[from]` that the bare `Display` impl can drop, so `InlineErrorChain` actually surfaces the extra context. Matches the existing pattern used elsewhere in `omicron-sled-agent` (e.g. `instance_manager.rs:989`).
Replaces `Result<FmdInventory, String>` with a typed `FmdInventoryError` (addresses hawkw's r3173718186 on PR #10283) and introduces cap constants `FMD_MAX_CASES` / `FMD_MAX_RESOURCES` (1000 each). The error type is flat — a `kind: FmdInventoryErrorKind` discriminator plus a free-form `message`. Three variants: - `FmdError`: catch-all for FMD daemon failures (open / list cases / list resources, or platform doesn't support FMD). - `TooManyCases` / `TooManyResources`: bound exceeded; producer refuses to report a partial set because a count this high is itself a signal operators should investigate directly. Counts/limits are included in the `message` string, so downstream display shows them without needing structured fields.
Merges the typed FmdInventoryError from add-fmd-inventory and updates the database side to match: - Schema: add CRDB ENUM `fmd_inventory_error_kind`. `inv_fmd_status` gains an `error_kind` column (ENUM type) alongside the existing `error_message` (now documented as informational only). A CHECK constraint ensures the two columns agree on NULL. - Diesel binding: `FmdInventoryErrorKindEnum` registered in db-schema; `FmdInventoryErrorKind` Diesel enum in db-model with From conversions to/from the API type in sled-agent-types. - `InvFmdStatus::new` now takes `&Result<FmdInventory, FmdInventoryError>` and projects to (kind, message) for storage. - Write path stores both columns; read path reconstructs the API `FmdInventoryError` from them. - Display wrapper (`FmdInventoryResultDisplay`) updated to wrap the typed result. `Display` formats the error via its `Display` impl, which is the `message` field (set by the producer to include counts when relevant). - `nexus_types::SledAgent.fmd` field type updated to match.
The schema test `test_migration_verification_files` enforces at most one DDL statement per up*.sql file for schema versions after 220. The previous up01.sql had two (CREATE TYPE + CREATE TABLE). Split into: - up01.sql: CREATE TYPE fmd_inventory_error_kind - up02.sql: CREATE TABLE inv_fmd_status - up03.sql: CREATE TABLE inv_fmd_host_case (was up02) - up04.sql: CREATE TABLE inv_fmd_resource (was up03)
Adds local-to-sled fault management info, from FMD, to the database.
Builds on #10283 , which propagated this info out of the Sled Agent API.
In the database, this propagates:
inv_fmd_status: The per-sled identification of whether or not fmd data was collected successfullyinv_fmd_host_case: information about individual cases on a sledinv_fmd_resource: information about individual resources that exist on a sled