Skip to content

Add fmd inventory to the database#10345

Open
smklein wants to merge 30 commits into
mainfrom
add-fmd-inventory-db
Open

Add fmd inventory to the database#10345
smklein wants to merge 30 commits into
mainfrom
add-fmd-inventory-db

Conversation

@smklein
Copy link
Copy Markdown
Collaborator

@smklein smklein commented Apr 29, 2026

Adds local-to-sled fault management info, from FMD, to the database.

Builds on #10283 , which propagated this info out of the Sled Agent API.

In the database, this propagates:

  • inv_fmd_status: The per-sled identification of whether or not fmd data was collected successfully
  • inv_fmd_host_case: information about individual cases on a sled
  • inv_fmd_resource: information about individual resources that exist on a sled

smklein added 19 commits April 17, 2026 14:13
Exposes illumos Fault Management Daemon (FMD) data through the
sled-agent inventory endpoint. This lets the control plane see
diagnosed hardware/software faults on each sled.

New API version 35 adds an `fmd: Option<FmdInventory>` field to
`Inventory`. When present, it contains:
- Cases: diagnosed faults with UUID, diagnostic code, URL, and the
  full event nvlist serialized as JSON
- Resources: affected components with FMRI, fault status flags

On illumos, sled-agent queries FMD on each inventory request. On
non-illumos (sim, tests), the field is None. Database storage is
not included — that's a follow-up.
The inventory endpoint is async, but FMD queries go through door calls
to fmd(1M) that can stall the calling thread. Move the work onto
spawn_blocking so it doesn't occupy a Tokio worker; surface any
JoinError as FmdInventory::Error.
The FmdCase.url docstring now uses backticks instead of quotes around the
example URL, which changes the schema description and thus the spec hash.
The verify-libraries xtask checks that binaries don't link against
unexpected libraries. Add libfmd_adm.so.1 to the allowlist for the
binaries that legitimately need it (sled-agent, sled-agent-sim, and
omicron-dev which spins up sled-agent for tests).
oxidecomputer/fmd-adm#2 replaced the bool argument on
`FmdAdm::resources()` with an `InvisibleResources` enum to make
callsites self-documenting (per hawk's review feedback). Bump to the
post-merge rev and update the call in sled-agent.
The Err arm was dead code: omicron compiles with panic="abort", so a
panic inside the blocking task aborts the process before the JoinHandle
can return Err. Switch to .expect with a descriptive message — if the
invariant ever changes, the panic will say what happened.

Addresses the take-it-or-leave-it nit at #10283
(comment r3112399887).
The illumos module is cfg-gated, so cargo check on Linux skips it
entirely — the missing trait import wasn't visible locally. CI on helios
caught it: three E0599 errors for from_untyped_uuid on FmdHostCaseUuid
and FmdResourceUuid. Verified fix on atrium.
Adds three tables for persisting per-sled FMD data:
  inv_fmd_status      — per-sled outcome of FMD collection
  inv_fmd_host_case   — diagnosed cases (event payload as JSONB)
  inv_fmd_resource    — resources affected by cases

Bumps SCHEMA_VERSION to 254 with directory schema/crdb/inv-fmd. Adds
diesel table! entries, db-model structs, and From impls for the read
path. No callers yet — write/read/display follow in subsequent commits.
Wires the fmd field added to sled-agent's Inventory by the parent PR
through into Nexus's in-memory inventory representation. The collector
builder copies inventory.fmd verbatim. The DB read path will populate it
from the inv_fmd_* tables in a follow-on commit; for now, the read path
substitutes Available(empty) so existing tests round-trip cleanly.
Insert one InvFmdStatus row per sled in each inventory collection,
plus a row per case and resource when collection succeeded. Wire the
three tables into the existing prune transaction so old collections
clean up after themselves.
Loads inv_fmd_status, inv_fmd_host_case, and inv_fmd_resource for the
collection and reconstructs SledAgent.fmd. Status row's NULL
error_message indicates Available; non-NULL becomes Error{error}. A
missing status row falls back to Available with whatever cases/resources
were found (defensive, in case of historical data predating this PR).
Adds Display wrappers for FmdInventoryResult/FmdInventory/FmdHostCase/
FmdResource on the sled-agent types. Wires them into
nexus/types/src/inventory/display.rs::display_sleds so that
`omdb db inventory collections show` (and reconfigurator-cli scripts
that print sled inventories) include the FMD section.

The FmdHostCase event payload is the FMD nvlist serialized to JSON; we
intentionally don't interpret the schema, so it's pretty-printed
verbatim under the case heading.

Also seeds the representative test inventory (nexus/inventory examples)
with a single fault case + resource so the inv_fmd_* tables get rows
under test_representative_collection_populates_database. The
reconfigurator-cli golden outputs grow a 'fmd:' section accordingly.
@smklein smklein changed the title Add fmd inventory db Add fmd inventory to the database Apr 29, 2026
@smklein smklein marked this pull request as ready for review April 30, 2026 00:02
@smklein smklein requested review from hawkw and mergeconflict April 30, 2026 00:02
Comment on lines +4202 to +4203
// Load all FMD inventory rows. We expect at most ~tens of cases or
// resources per sled, so we don't bother paginating.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get into some sort of trouble if this assumption is wrong (due to a bug, for example)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the silver lining is that inventories are periodically trimmed, so this wouldn't last forever.

Dealing with this in a "pathological case" is complicated.
Suppose we have a bug which emits hundreds of thousands of faults on a host.

If we use an artificial cap of "N", it's possible that "the first N faults are spew, the N+1'th one is actually useful". But it's also possible that "N" doesn't fit in an inventory collection? So the caller may want to know "I saw N ereports, of M total, (for N < M), and I need some way of accessing the rest of them". But that is more complicated too, because then we're sorta eyeing inventory and sorta side-stepping it?

It probably does make sense to handle this in the limit, but are you okay with side-stepping this in the short-term, so we can get the flow working end-to-end? I think we'll probably "really want to see all faults" for the short-term future, but I can file an issue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, just asking questions. If there isn't an easy solution to this one, it seems fair to punt.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I had a vision of a broken system with a million faults, and I re-visited this.

The majority of the change exists in the parent PR, in e43ae1f , but I also merged that into this PR as d9d3da3

TL;DR: I'm setting an upper bound of "1000" for cases/resources, and if we cross that threshold, we'll return an explicit error. Since this upper bound is considered pathological, it should be fine to do this, and if that bound is too small, it should be fine to make it larger.

Comment thread schema/crdb/dbinit.sql
);

CREATE TABLE IF NOT EXISTS omicron.public.inv_fmd_status (
inv_collection_id UUID NOT NULL,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (here and below): need a comment like:

-- (foreign key into `inv_collection` table)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 040b815

smklein added 6 commits May 14, 2026 12:38
If a build.rs calls configure_default_omicron_rpaths() but contributes no
RPATH entries on illumos, the caller almost certainly forgot a direct
`*-sys` dep in Cargo.toml. Catching this at build time avoids silently
producing a binary with NEEDED libfmd_adm.so.1 (or libpq.so.5) but no
RPATH, which would only surface as an ld.so.1 failure at process startup.

The check is illumos-only because fmd-adm-sys is illumos-gated; on Linux
a caller can legitimately contribute nothing.
hawkw asked why we were reinventing Result. The answer for the wire format
is openapi-lint requires snake_case property names, and Result's default
serde representation produces 'Ok'/'Err'. Use omicron_common::snake_case_result
(already used for other Result fields in the inventory) to keep the Rust
API as a normal Result while presenting snake_case on the wire.
Match the convention used by other inv_* tables in dbinit.sql.
Addresses review feedback on #10345.
Captures the full `source()` chain on the three error paths in
`fmd::illumos::collect` (FMD open / list cases / list resources). The
underlying `fmd_adm::Error` variants `Nul(NulError)` and
`Uuid(uuid::Error)` carry inner errors via `#[from]` that the bare
`Display` impl can drop, so `InlineErrorChain` actually surfaces the
extra context. Matches the existing pattern used elsewhere in
`omicron-sled-agent` (e.g. `instance_manager.rs:989`).
smklein added 3 commits May 14, 2026 15:55
Replaces `Result<FmdInventory, String>` with a typed `FmdInventoryError`
(addresses hawkw's r3173718186 on PR #10283) and introduces cap constants
`FMD_MAX_CASES` / `FMD_MAX_RESOURCES` (1000 each).

The error type is flat — a `kind: FmdInventoryErrorKind` discriminator plus
a free-form `message`. Three variants:

- `FmdError`: catch-all for FMD daemon failures (open / list cases /
  list resources, or platform doesn't support FMD).
- `TooManyCases` / `TooManyResources`: bound exceeded; producer refuses
  to report a partial set because a count this high is itself a signal
  operators should investigate directly.

Counts/limits are included in the `message` string, so downstream display
shows them without needing structured fields.
Merges the typed FmdInventoryError from add-fmd-inventory and updates
the database side to match:

- Schema: add CRDB ENUM `fmd_inventory_error_kind`. `inv_fmd_status`
  gains an `error_kind` column (ENUM type) alongside the existing
  `error_message` (now documented as informational only). A CHECK
  constraint ensures the two columns agree on NULL.

- Diesel binding: `FmdInventoryErrorKindEnum` registered in db-schema;
  `FmdInventoryErrorKind` Diesel enum in db-model with From conversions
  to/from the API type in sled-agent-types.

- `InvFmdStatus::new` now takes `&Result<FmdInventory, FmdInventoryError>`
  and projects to (kind, message) for storage.

- Write path stores both columns; read path reconstructs the API
  `FmdInventoryError` from them.

- Display wrapper (`FmdInventoryResultDisplay`) updated to wrap the typed
  result. `Display` formats the error via its `Display` impl, which is
  the `message` field (set by the producer to include counts when
  relevant).

- `nexus_types::SledAgent.fmd` field type updated to match.
The schema test `test_migration_verification_files` enforces at most one
DDL statement per up*.sql file for schema versions after 220. The previous
up01.sql had two (CREATE TYPE + CREATE TABLE). Split into:

- up01.sql: CREATE TYPE fmd_inventory_error_kind
- up02.sql: CREATE TABLE inv_fmd_status
- up03.sql: CREATE TABLE inv_fmd_host_case  (was up02)
- up04.sql: CREATE TABLE inv_fmd_resource   (was up03)
Base automatically changed from add-fmd-inventory to main May 15, 2026 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants