Add FMD fault inventory to sled-agent API#10283
Conversation
df22bbb to
aaa6509
Compare
| /// Currently, we only do this for libpq ("pq-sys" package), but this pattern | ||
| /// could be generalized for other native libraries. | ||
| pub static RPATH_ENV_VARS: &'static [&'static str] = &["DEP_PQ_LIBDIRS"]; | ||
| /// We scan all of these on every build.rs call. Only env vars that are |
There was a problem hiding this comment.
This file is worth a careful review.
We're adding a dependency on fmd-adm - which uses fmd-adm-sys and links against a shared library which we expect to exist on our helios deployments. However, it's not in a stable location, so we need to provide some path info about how to find it.
Within fmd-adm-sys, we expose LIBDIRS metadata that gets picked up here. This is similar to the mechanism used by pq-sys, to configure path information for anyone linking against it.
However, this pre-existing version of configuring rpaths really only expected pq-sys to be the only native library we'd link against. With the addition of fmd-adm-sys, it's more complicated. We might want one. We might want the other. We might want both.
BEFORE this PR, this behavior was:
"Iterate over RPATH_ENV_VARS. If anything is missing an environment variable, panic. Otherwise, configure rpath based on the value of that environment variable"
IN THIS PR, I went with the option:
"Iterate over RPATH_ENV_VARS. If anything is missing an environment variable, return. Otherwise, configure rpath based on the value of that environment variable"
If we want to, I could change this to:
"explicitly take a list of dependencies, panic if any of them are missing"
That would basically be like "bring-your-own RPATH_ENV_VARS".
I didn't do this, because it seemed a little redundant to "specify the direct dependency, depend on omicron_rpaths, then also specify the transitive dependencies you collected", when those transitive dependencies can be inferred. However, if this seems too "fast and loose", we can push this work back onto the caller. But I'll probably need to change all the callsites to configure_default_omicron_rpaths to make this option work.
There was a problem hiding this comment.
What does the error in the event of a missing env var look like, after this change? Does something else fail in a way that makes it clear what happened?
There was a problem hiding this comment.
So if someone forgets to set a "*-sys" dependency, but actually needs dynamic linking:
- Cargo wouldn't set
DEP_*_LIBDIRSfor that crate's build.rs (this is "missing the env var") configure_default_omicron_rpathswould skip it- The binary still links (The transitive "-sys" dep emits
cargo:rustc-link-lib) but getDT_NEEDEDfor the library, but without anRPATH - At runtime (when we boot), the dynamic linker falls back to the default search paths, and wouldn't be able to resolve the
NEEDEDlibrary, so it would fail with anld.so.1: <bin>: fatal: ... open failed error
So: that's a bummer, but it's loud (and would be caught by any CI test, service start, etc) so it's unlikely to make it into prod.
Note: for the old code, this is exactly the same failure mode as "forgetting to add a pq-sys dependency, and also forgetting to call configure_default_omicron_rpaths". This invariant basically checked for the case of: You added this rpath-finding logic, but we didn't find any -sys dependencies.
I didn't preserve the old "panic if env var is unset" behavior because with one env var, the panic was basically a global invariant ("If you call this function, you better have a dependency on pq-sys wired up). With multiple possible rpath-based dependencies, each caller would need to enumerate all *-sys dependencies explicitly, which would be effectively re-stating its own Cargo.toml.
ALL THAT SAID - I can include a check that's similar - basically: If you call configure_default_omicron_rpaths, but we don't find any DEP_*_LIBDIRS, then something has definitely gone wrong.
(Done in d44c588)
dad43a7 to
84a6975
Compare
Exposes illumos Fault Management Daemon (FMD) data through the sled-agent inventory endpoint. This lets the control plane see diagnosed hardware/software faults on each sled. New API version 35 adds an `fmd: Option<FmdInventory>` field to `Inventory`. When present, it contains: - Cases: diagnosed faults with UUID, diagnostic code, URL, and the full event nvlist serialized as JSON - Resources: affected components with FMRI, fault status flags On illumos, sled-agent queries FMD on each inventory request. On non-illumos (sim, tests), the field is None. Database storage is not included — that's a follow-up.
84a6975 to
322d5b3
Compare
The inventory endpoint is async, but FMD queries go through door calls to fmd(1M) that can stall the calling thread. Move the work onto spawn_blocking so it doesn't occupy a Tokio worker; surface any JoinError as FmdInventory::Error.
3715de3 to
1fba06e
Compare
The FmdCase.url docstring now uses backticks instead of quotes around the example URL, which changes the schema description and thus the spec hash.
The verify-libraries xtask checks that binaries don't link against unexpected libraries. Add libfmd_adm.so.1 to the allowlist for the binaries that legitimately need it (sled-agent, sled-agent-sim, and omicron-dev which spins up sled-agent for tests).
1fba06e to
887f61a
Compare
oxidecomputer/fmd-adm#2 replaced the bool argument on `FmdAdm::resources()` with an `InvisibleResources` enum to make callsites self-documenting (per hawk's review feedback). Bump to the post-merge rev and update the call in sled-agent.
The Err arm was dead code: omicron compiles with panic="abort", so a panic inside the blocking task aborts the process before the JoinHandle can return Err. Switch to .expect with a descriptive message — if the invariant ever changes, the panic will say what happened. Addresses the take-it-or-leave-it nit at #10283 (comment r3112399887).
The illumos module is cfg-gated, so cargo check on Linux skips it entirely — the missing trait import wasn't visible locally. CI on helios caught it: three E0599 errors for from_untyped_uuid on FmdHostCaseUuid and FmdResourceUuid. Verified fix on atrium.
|
This change should be ready for re-review now! |
| /// Currently, we only do this for libpq ("pq-sys" package), but this pattern | ||
| /// could be generalized for other native libraries. | ||
| pub static RPATH_ENV_VARS: &'static [&'static str] = &["DEP_PQ_LIBDIRS"]; | ||
| /// We scan all of these on every build.rs call. Only env vars that are |
There was a problem hiding this comment.
What does the error in the event of a missing env var look like, after this change? Does something else fail in a way that makes it clear what happened?
| /// Result of querying FMD for fault information. | ||
| #[derive(Clone, Debug, PartialEq, Eq, Deserialize, Serialize, JsonSchema)] | ||
| #[serde(tag = "type", content = "value", rename_all = "snake_case")] | ||
| pub enum FmdInventoryResult { | ||
| /// FMD data was successfully collected. | ||
| Available(FmdInventory), | ||
| /// FMD data collection failed or is not available on this platform. | ||
| Error { error: String }, | ||
| } |
There was a problem hiding this comment.
Given that this is now just an enum of Available and Error...why not just represent it as a normal Rust Result<FmdInventory, String> instead of reinventing Result?
There was a problem hiding this comment.
(we could also make it a Result<FmdInventory, omicron_common::api::external::Error>, probably?)
There was a problem hiding this comment.
Done in 604a831 - Still being a bit stringy with the error for now, because this is getting kinda "shoved into the inventory object", and isn't itself directly plumbed to an HTTP request.
There was a problem hiding this comment.
I am not in love with the string error, but I also don't feel super particular about it in this context.
If a build.rs calls configure_default_omicron_rpaths() but contributes no RPATH entries on illumos, the caller almost certainly forgot a direct `*-sys` dep in Cargo.toml. Catching this at build time avoids silently producing a binary with NEEDED libfmd_adm.so.1 (or libpq.so.5) but no RPATH, which would only surface as an ld.so.1 failure at process startup. The check is illumos-only because fmd-adm-sys is illumos-gated; on Linux a caller can legitimately contribute nothing.
hawkw asked why we were reinventing Result. The answer for the wire format is openapi-lint requires snake_case property names, and Result's default serde representation produces 'Ok'/'Err'. Use omicron_common::snake_case_result (already used for other Result fields in the inventory) to keep the Rust API as a normal Result while presenting snake_case on the wire.
| warn!(log, "failed to open fmd"; "error" => %e); | ||
| return FmdInventoryResult::Error { | ||
| error: format!("failed to open fmd: {e}"), | ||
| }; | ||
| return Err(format!("failed to open fmd: {e}")); |
There was a problem hiding this comment.
might we want to format any of these using InlineErrorChain, or do they not do that?
| /// Result of querying FMD for fault information. | ||
| #[derive(Clone, Debug, PartialEq, Eq, Deserialize, Serialize, JsonSchema)] | ||
| #[serde(tag = "type", content = "value", rename_all = "snake_case")] | ||
| pub enum FmdInventoryResult { | ||
| /// FMD data was successfully collected. | ||
| Available(FmdInventory), | ||
| /// FMD data collection failed or is not available on this platform. | ||
| Error { error: String }, | ||
| } |
There was a problem hiding this comment.
I am not in love with the string error, but I also don't feel super particular about it in this context.
Captures the full `source()` chain on the three error paths in `fmd::illumos::collect` (FMD open / list cases / list resources). The underlying `fmd_adm::Error` variants `Nul(NulError)` and `Uuid(uuid::Error)` carry inner errors via `#[from]` that the bare `Display` impl can drop, so `InlineErrorChain` actually surfaces the extra context. Matches the existing pattern used elsewhere in `omicron-sled-agent` (e.g. `instance_manager.rs:989`).
Exposes data from fmd-adm through the sled-agent inventory endpoint.
We're extracting:
I'm only exposing this through the API right now - Nexus isn't yet shoving it into the DB. Soon! But wanted feedback
on this data first.
To give you a sense of "what does case/resource data look like", here's what I pulled out of Atrium, using
fmd-adm: