Add fmd inventory to the database by smklein · Pull Request #10345 · oxidecomputer/omicron

smklein · 2026-04-29T21:25:48Z

Adds local-to-sled fault management info, from FMD, to the database.

Builds on #10283 , which propagated this info out of the Sled Agent API.

In the database, this propagates:

inv_fmd_status: The per-sled identification of whether or not fmd data was collected successfully
inv_fmd_host_case: information about individual cases on a sled
inv_fmd_resource: information about individual resources that exist on a sled

Exposes illumos Fault Management Daemon (FMD) data through the sled-agent inventory endpoint. This lets the control plane see diagnosed hardware/software faults on each sled. New API version 35 adds an `fmd: Option<FmdInventory>` field to `Inventory`. When present, it contains: - Cases: diagnosed faults with UUID, diagnostic code, URL, and the full event nvlist serialized as JSON - Resources: affected components with FMRI, fault status flags On illumos, sled-agent queries FMD on each inventory request. On non-illumos (sim, tests), the field is None. Database storage is not included — that's a follow-up.

The inventory endpoint is async, but FMD queries go through door calls to fmd(1M) that can stall the calling thread. Move the work onto spawn_blocking so it doesn't occupy a Tokio worker; surface any JoinError as FmdInventory::Error.

The FmdCase.url docstring now uses backticks instead of quotes around the example URL, which changes the schema description and thus the spec hash.

The verify-libraries xtask checks that binaries don't link against unexpected libraries. Add libfmd_adm.so.1 to the allowlist for the binaries that legitimately need it (sled-agent, sled-agent-sim, and omicron-dev which spins up sled-agent for tests).

oxidecomputer/fmd-adm#2 replaced the bool argument on `FmdAdm::resources()` with an `InvisibleResources` enum to make callsites self-documenting (per hawk's review feedback). Bump to the post-merge rev and update the call in sled-agent.

The Err arm was dead code: omicron compiles with panic="abort", so a panic inside the blocking task aborts the process before the JoinHandle can return Err. Switch to .expect with a descriptive message — if the invariant ever changes, the panic will say what happened. Addresses the take-it-or-leave-it nit at #10283 (comment r3112399887).

The illumos module is cfg-gated, so cargo check on Linux skips it entirely — the missing trait import wasn't visible locally. CI on helios caught it: three E0599 errors for from_untyped_uuid on FmdHostCaseUuid and FmdResourceUuid. Verified fix on atrium.

Adds three tables for persisting per-sled FMD data: inv_fmd_status — per-sled outcome of FMD collection inv_fmd_host_case — diagnosed cases (event payload as JSONB) inv_fmd_resource — resources affected by cases Bumps SCHEMA_VERSION to 254 with directory schema/crdb/inv-fmd. Adds diesel table! entries, db-model structs, and From impls for the read path. No callers yet — write/read/display follow in subsequent commits.

Wires the fmd field added to sled-agent's Inventory by the parent PR through into Nexus's in-memory inventory representation. The collector builder copies inventory.fmd verbatim. The DB read path will populate it from the inv_fmd_* tables in a follow-on commit; for now, the read path substitutes Available(empty) so existing tests round-trip cleanly.

Insert one InvFmdStatus row per sled in each inventory collection, plus a row per case and resource when collection succeeded. Wire the three tables into the existing prune transaction so old collections clean up after themselves.

Loads inv_fmd_status, inv_fmd_host_case, and inv_fmd_resource for the collection and reconstructs SledAgent.fmd. Status row's NULL error_message indicates Available; non-NULL becomes Error{error}. A missing status row falls back to Available with whatever cases/resources were found (defensive, in case of historical data predating this PR).

Adds Display wrappers for FmdInventoryResult/FmdInventory/FmdHostCase/ FmdResource on the sled-agent types. Wires them into nexus/types/src/inventory/display.rs::display_sleds so that `omdb db inventory collections show` (and reconfigurator-cli scripts that print sled inventories) include the FMD section. The FmdHostCase event payload is the FMD nvlist serialized to JSON; we intentionally don't interpret the schema, so it's pretty-printed verbatim under the case heading. Also seeds the representative test inventory (nexus/inventory examples) with a single fault case + resource so the inv_fmd_* tables get rows under test_representative_collection_populates_database. The reconfigurator-cli golden outputs grow a 'fmd:' section accordingly.

mergeconflict · 2026-05-01T02:33:39Z

+        // Load all FMD inventory rows. We expect at most ~tens of cases or
+        // resources per sled, so we don't bother paginating.


Could we get into some sort of trouble if this assumption is wrong (due to a bug, for example)?

So the silver lining is that inventories are periodically trimmed, so this wouldn't last forever.

Dealing with this in a "pathological case" is complicated.
Suppose we have a bug which emits hundreds of thousands of faults on a host.

If we use an artificial cap of "N", it's possible that "the first N faults are spew, the N+1'th one is actually useful". But it's also possible that "N" doesn't fit in an inventory collection? So the caller may want to know "I saw N ereports, of M total, (for N < M), and I need some way of accessing the rest of them". But that is more complicated too, because then we're sorta eyeing inventory and sorta side-stepping it?

It probably does make sense to handle this in the limit, but are you okay with side-stepping this in the short-term, so we can get the flow working end-to-end? I think we'll probably "really want to see all faults" for the short-term future, but I can file an issue

Definitely, just asking questions. If there isn't an easy solution to this one, it seems fair to punt.

... I had a vision of a broken system with a million faults, and I re-visited this.

The majority of the change exists in the parent PR, in e43ae1f , but I also merged that into this PR as d9d3da3

TL;DR: I'm setting an upper bound of "1000" for cases/resources, and if we cross that threshold, we'll return an explicit error. Since this upper bound is considered pathological, it should be fine to do this, and if that bound is too small, it should be fine to make it larger.

mergeconflict · 2026-05-01T02:37:20Z

 );

+CREATE TABLE IF NOT EXISTS omicron.public.inv_fmd_status (
+    inv_collection_id UUID NOT NULL,


Nit (here and below): need a comment like:

-- (foreign key into `inv_collection` table)

Done in 040b815

If a build.rs calls configure_default_omicron_rpaths() but contributes no RPATH entries on illumos, the caller almost certainly forgot a direct `*-sys` dep in Cargo.toml. Catching this at build time avoids silently producing a binary with NEEDED libfmd_adm.so.1 (or libpq.so.5) but no RPATH, which would only surface as an ld.so.1 failure at process startup. The check is illumos-only because fmd-adm-sys is illumos-gated; on Linux a caller can legitimately contribute nothing.

hawkw asked why we were reinventing Result. The answer for the wire format is openapi-lint requires snake_case property names, and Result's default serde representation produces 'Ok'/'Err'. Use omicron_common::snake_case_result (already used for other Result fields in the inventory) to keep the Rust API as a normal Result while presenting snake_case on the wire.

Match the convention used by other inv_* tables in dbinit.sql. Addresses review feedback on #10345.

Captures the full `source()` chain on the three error paths in `fmd::illumos::collect` (FMD open / list cases / list resources). The underlying `fmd_adm::Error` variants `Nul(NulError)` and `Uuid(uuid::Error)` carry inner errors via `#[from]` that the bare `Display` impl can drop, so `InlineErrorChain` actually surfaces the extra context. Matches the existing pattern used elsewhere in `omicron-sled-agent` (e.g. `instance_manager.rs:989`).

Replaces `Result<FmdInventory, String>` with a typed `FmdInventoryError` (addresses hawkw's r3173718186 on PR #10283) and introduces cap constants `FMD_MAX_CASES` / `FMD_MAX_RESOURCES` (1000 each). The error type is flat — a `kind: FmdInventoryErrorKind` discriminator plus a free-form `message`. Three variants: - `FmdError`: catch-all for FMD daemon failures (open / list cases / list resources, or platform doesn't support FMD). - `TooManyCases` / `TooManyResources`: bound exceeded; producer refuses to report a partial set because a count this high is itself a signal operators should investigate directly. Counts/limits are included in the `message` string, so downstream display shows them without needing structured fields.

Merges the typed FmdInventoryError from add-fmd-inventory and updates the database side to match: - Schema: add CRDB ENUM `fmd_inventory_error_kind`. `inv_fmd_status` gains an `error_kind` column (ENUM type) alongside the existing `error_message` (now documented as informational only). A CHECK constraint ensures the two columns agree on NULL. - Diesel binding: `FmdInventoryErrorKindEnum` registered in db-schema; `FmdInventoryErrorKind` Diesel enum in db-model with From conversions to/from the API type in sled-agent-types. - `InvFmdStatus::new` now takes `&Result<FmdInventory, FmdInventoryError>` and projects to (kind, message) for storage. - Write path stores both columns; read path reconstructs the API `FmdInventoryError` from them. - Display wrapper (`FmdInventoryResultDisplay`) updated to wrap the typed result. `Display` formats the error via its `Display` impl, which is the `message` field (set by the producer to include counts when relevant). - `nexus_types::SledAgent.fmd` field type updated to match.

The schema test `test_migration_verification_files` enforces at most one DDL statement per up*.sql file for schema versions after 220. The previous up01.sql had two (CREATE TYPE + CREATE TABLE). Split into: - up01.sql: CREATE TYPE fmd_inventory_error_kind - up02.sql: CREATE TABLE inv_fmd_status - up03.sql: CREATE TABLE inv_fmd_host_case (was up02) - up04.sql: CREATE TABLE inv_fmd_resource (was up03)

smklein added 19 commits April 17, 2026 14:13

Run FMD collection on a blocking thread

ab9399b

The inventory endpoint is async, but FMD queries go through door calls to fmd(1M) that can stall the calling thread. Move the work onto spawn_blocking so it doesn't occupy a Tokio worker; surface any JoinError as FmdInventory::Error.

Regenerate sled-agent OpenAPI spec

32c5872

The FmdCase.url docstring now uses backticks instead of quotes around the example URL, which changes the schema description and thus the spec hash.

merge

2c83840

InventoryResult, less optional

026bd3f

Convert 'FMD unavailable' to a default impl that's empty

bcb288d

Destructuring

778afd5

review feedback

34884ba

merge

6d0ec5f

merge

058314f

datastore: write + prune for inv_fmd_* tables

cfb3eea

Insert one InvFmdStatus row per sled in each inventory collection, plus a row per case and resource when collection succeeded. Wire the three tables into the existing prune transaction so old collections clean up after themselves.

smklein changed the title ~~Add fmd inventory db~~ Add fmd inventory to the database Apr 29, 2026

smklein marked this pull request as ready for review April 30, 2026 00:02

smklein requested review from hawkw and mergeconflict April 30, 2026 00:02

mergeconflict approved these changes May 1, 2026

View reviewed changes

smklein added 6 commits May 14, 2026 12:38

merge

3291762

merge

115d388

Add foreign-key comments to inv_fmd_* tables

040b815

Match the convention used by other inv_* tables in dbinit.sql. Addresses review feedback on #10345.

smklein added 3 commits May 14, 2026 15:55

merge

2e3d1e9

smklein mentioned this pull request May 15, 2026

Add FMD fault inventory to sled-agent API #10283

Merged

Base automatically changed from add-fmd-inventory to main May 15, 2026 18:58

merge

2d63473

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fmd inventory to the database#10345

Add fmd inventory to the database#10345
smklein wants to merge 30 commits into
mainfrom
add-fmd-inventory-db

smklein commented Apr 29, 2026 •

edited

Loading

Uh oh!

mergeconflict May 1, 2026

Uh oh!

smklein May 14, 2026

Uh oh!

mergeconflict May 15, 2026

Uh oh!

smklein May 15, 2026

Uh oh!

mergeconflict May 1, 2026

Uh oh!

smklein May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// Load all FMD inventory rows. We expect at most ~tens of cases or
		// resources per sled, so we don't bother paginating.

Conversation

smklein commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mergeconflict May 1, 2026

Choose a reason for hiding this comment

Uh oh!

smklein May 14, 2026

Choose a reason for hiding this comment

Uh oh!

mergeconflict May 15, 2026

Choose a reason for hiding this comment

Uh oh!

smklein May 15, 2026

Choose a reason for hiding this comment

Uh oh!

mergeconflict May 1, 2026

Choose a reason for hiding this comment

Uh oh!

smklein May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

smklein commented Apr 29, 2026 •

edited

Loading