Skip to content

Add CRD-runtime drift detection test framework#5209

Open
ChrisJBurns wants to merge 13 commits intomainfrom
drift-detection-walker-and-telemetry-test
Open

Add CRD-runtime drift detection test framework#5209
ChrisJBurns wants to merge 13 commits intomainfrom
drift-detection-walker-and-telemetry-test

Conversation

@ChrisJBurns
Copy link
Copy Markdown
Collaborator

Summary

  • Silent drift between CRD types and their runtime config counterparts has caused user-facing bugs (e.g. Composite Tools: Fix plumbing for on-error-continue config #3118, Simplify VMCP Configuration #3125) where new fields appeared to work in unit tests but were not wired through the CRD conversion path. The only existing safety nets — code review and end-to-end tests — do not scale.
  • Adds a reusable reflection-based field walker and a bidirectional drift detection pattern. The pattern requires every leaf JSON field on either side of a CRD↔runtime boundary to be either explicitly mapped to the other side, or explicitly declared as intentionally unmapped with a justification string. Drift is allowed; unannounced drift is not.
  • Applies the pattern to MCPTelemetryConfigtelemetry.Config as the first real-world exercise. The drift test passes against the current codebase.

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Tests added (all passing): walker unit tests in cmd/thv-operator/internal/testutil/reflect_test.go; drift tests in cmd/thv-operator/pkg/spectoconfig/telemetry_drift_test.go (TestTelemetryConfigDrift_CRDFieldsCovered, TestTelemetryConfigDrift_RuntimeFieldsCovered, TestTelemetryConfigDrift_MappingTableSanity).

task lint-fix could not be run locally — the installed golangci-lint v2.12.1 is built with go1.25 but the project targets go1.26. go vet and gofmt are clean on the new files. CI will run the project's pinned linter.

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Test-only change. No production code, CRD schemas, or APIs are modified.

Changes

File Change
cmd/thv-operator/internal/testutil/reflect.go New FlattenJSONLeafFields helper that returns sorted dot-delimited JSON leaf paths for a struct type. Recurses into nested structs, deref'd pointers, and slice/map element types. Stops at primitives and a small allowlist (time.Duration, metav1.Duration, json.RawMessage, thvjson.Map/Any, vmcpconfig.Duration). Skips metav1.TypeMeta/ObjectMeta/ListMeta and unexported / json:\"-\" fields. Honors ,inline and Go's anonymous-field promotion. Cycle-protected.
cmd/thv-operator/internal/testutil/reflect_test.go Table-driven walker tests: pointers, slice/array/map elements, embedded structs (with and without ,inline), json:\"-\", missing tags, omitempty, unexported fields, leaf allowlist, sorted output, nil/non-struct input.
cmd/thv-operator/pkg/spectoconfig/telemetry_drift_test.go Bidirectional drift test for MCPTelemetryConfigSpectelemetry.Config. Two coverage tests (one per direction) sharing a single telemetryFieldMappings source-of-truth table, plus per-side IgnoredOnCRDOnly / IgnoredOnRuntimeOnly maps with justification strings. A third test asserts mapping-table sanity (no duplicates, no overlap with ignore lists, non-empty justifications).

Does this introduce a user-facing change?

No.

Special notes for reviewers

The pattern is bidirectional by design — it does not mandate parity between CRD and runtime types. Either side may have fields the other lacks; the test simply forces every divergence to be explicitly classified. New entries in IgnoredOnCRDOnly / IgnoredOnRuntimeOnly require a justification string, which makes intentional decoupling visible in code review.

The walker uncovered one CRD leaf that prior manual analysis missed: openTelemetry.caBundleRef.configMapRef.optional, a *bool promoted from corev1.ConfigMapKeySelector via LocalObjectReference. It is now explicitly declared in telemetryIgnoredOnCRDOnly with a justification — exactly the pattern working as intended.

This is the first PR in a series. Follow-ups will (1) introduce a CRD-owned v1beta1.VirtualMCPConfig mirror to decouple VirtualMCPServerSpec.Config from pkg/vmcp/config.Config without changing the user-facing CRD schema, and (2) extend the drift pattern to other converter boundaries (OIDC, external auth strategies, embedded auth server config).

Implementation plan

Approved drift detection plan

PR 1 (this PR): build the reusable reflection walker + bidirectional drift test machinery, exercise it against an already-decoupled boundary (telemetry) to validate the pattern.

PR 2 (next): mirror vmcpconfig.Config top-level fields into v1beta1.VirtualMCPConfig, with nested fields still pointing at runtime types. CRD JSON schema unchanged. Replace vmcp.Spec.Config.DeepCopy() in the converter with explicit field-by-field copy at the top level.

PR 3+: add bidirectional drift tests for VirtualMCPConfig and incrementally mirror nested types (AggregationConfig, OperationalConfig, …), one per PR, with the drift test guiding each mirror.

🤖 Generated with Claude Code

Silent drift between CRD types and their runtime config counterparts has
caused user-facing bugs (e.g. PR #3118, issue #3125) where new fields
appeared to work in tests but were not wired through the CRD conversion
path. The only existing safety nets — code review and end-to-end tests —
do not scale.

Add a reusable reflection-based field walker and a bidirectional drift
test pattern. The pattern requires every leaf JSON field on either side
of a CRD/runtime boundary to be either explicitly mapped to the other
side or explicitly declared as intentionally unmapped, with a
justification string. Adding a field to either type without classifying
it fails the test with a clear action-required message.

Apply the pattern to MCPTelemetryConfig <-> telemetry.Config as the
first real-world exercise. The drift test passes against the current
codebase and surfaced one previously-undocumented leaf
(openTelemetry.caBundleRef.configMapRef.optional) which is now
explicitly declared.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label May 6, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 88.05970% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.92%. Comparing base (e943660) to head (7218959).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
cmd/thv-operator/internal/testutil/drift.go 82.60% 15 Missing and 5 partials ⚠️
cmd/thv-operator/internal/testutil/reflect.go 95.34% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5209      +/-   ##
==========================================
+ Coverage   67.87%   67.92%   +0.05%     
==========================================
  Files         610      612       +2     
  Lines       62405    62682     +277     
==========================================
+ Hits        42356    42576     +220     
- Misses      16869    16917      +48     
- Partials     3180     3189       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ChrisJBurns and others added 2 commits May 7, 2026 01:13
Fix lint failures from CI: rename reflect.Ptr to reflect.Pointer,
suppress exhaustive on the kind switch with rationale, and tag the
,inline json fixtures as a known revive false positive.

Replace the explicit leafTypes allowlist with a json.Marshaler interface
check. Every entry in the previous map shared one property — a custom
MarshalJSON whose output bears no relation to the Go field layout — so
asking the type how it serializes is more general than maintaining a
hand-rolled list of K8s and project types. The walker is now
self-maintaining for any future custom-marshaled type.

Address reviewer findings on telemetry_drift_test.go:
  - delete dead sortedKeys helper and its misleading comment
  - upgrade per-entry empty-field checks to require to avoid empty-string
    pollution of the duplicate-detection maps
  - assert no path appears in both ignore maps (cross-pollination)
  - assert every mapping/ignore entry is still a live leaf on its type
    so renames and deletions surface instead of leaving stale entries
  - rewrite caCertPath, sensitiveHeaders, and serviceName justifications
    to name the actual wiring symbols a reviewer can grep for

Tighten the FlattenJSONLeafFields godoc to describe the one-level
self-reference expansion that maxStructRevisits actually permits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify cycle detection: stop on first revisit instead of allowing one
self-referential expansion. The "expand once" gold-plating produced a
"next.<field>" entry on linked-list-shaped types, but no real CRD or
runtime config has cyclic shape, and stop-on-revisit is the simpler
contract for drift detection. The visited tracker becomes a
map[reflect.Type]struct{} and maxStructRevisits goes away.

Compress the FlattenJSONLeafFields godoc from a 9-bullet semantics list
into one paragraph. Most of the bullets are now subsumed by the single
"json.Marshaler => leaf, otherwise recurse on Struct/Slice/Array/Map"
rule. The detailed encoding/json reference belongs in the standard
library, not here.

Drop redundant subtests in TestFlattenJSONLeafFields. After the
json.Marshaler refactor, the json.RawMessage / thvjson.Map+Any /
vmcpconfig.Duration cases all exercise the same short-circuit branch as
metav1.Duration, so one Marshaler example is enough. Drop slice-of-
pointer-to-struct (covered by combining the pointer-deref and slice-of-
struct cases). Fold the dedicated PointerInputDereferenced test into the
table. Remove TestFlattenJSONLeafFields_OutputIsSorted: the table-
driven test already pins exact sorted-string equality on every case.

Update the recursive-type expectation to ["name"] to lock in the new
stop-on-revisit semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026
golangci-lint v2 doesn't honor the //exhaustive:ignore source directive
in this configuration; the project-wide nolint convention works. The
switch genuinely falls through to a default leaf branch for every Kind
not listed (primitives, interfaces, channels, etc.) — the suppression
is a confirmed false positive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026
jhrozek
jhrozek previously approved these changes May 7, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Framework is well-conceived, walker semantics are sound, and the drift pattern is a good safety net for future CRD↔runtime work. Two non-blocking notes inline — approving.

Comment thread cmd/thv-operator/pkg/spectoconfig/telemetry_drift_test.go Outdated
Comment thread cmd/thv-operator/internal/testutil/reflect.go Outdated
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this direction. A few suggestions to make this more general purpose and require less boilerplate.

Comment on lines +63 to +83
var telemetryIgnoredOnCRDOnly = map[string]string{
"openTelemetry.enabled": "CRD-only gate; controls whether the converter populates runtime fields at all",
"openTelemetry.sensitiveHeaders.name": "K8s-secret-backed header value; injected as TOOLHIVE_OTEL_HEADER_* env vars on the proxyrunner pod " +
"(controllerutil.GenerateOpenTelemetryEnvVarsFromRef) and merged into OTLP headers at runtime; not written into telemetry.Config.Headers by the converter",
"openTelemetry.sensitiveHeaders.secretKeyRef.name": "K8s-secret-backed header value; injected as TOOLHIVE_OTEL_HEADER_* env vars on the proxyrunner pod " +
"(controllerutil.GenerateOpenTelemetryEnvVarsFromRef) and merged into OTLP headers at runtime; not written into telemetry.Config.Headers by the converter",
"openTelemetry.sensitiveHeaders.secretKeyRef.key": "K8s-secret-backed header value; injected as TOOLHIVE_OTEL_HEADER_* env vars on the proxyrunner pod " +
"(controllerutil.GenerateOpenTelemetryEnvVarsFromRef) and merged into OTLP headers at runtime; not written into telemetry.Config.Headers by the converter",
"openTelemetry.caBundleRef.configMapRef.name": "K8s ConfigMap reference; resolved by operator into runtime CACertPath",
"openTelemetry.caBundleRef.configMapRef.key": "K8s ConfigMap reference; resolved by operator into runtime CACertPath",
"openTelemetry.caBundleRef.configMapRef.optional": "K8s ConfigMap reference flag promoted from corev1.ConfigMapKeySelector; not part of runtime config",
}

// telemetryIgnoredOnRuntimeOnly lists runtime leaf fields that intentionally
// have no CRD counterpart. Each entry MUST include a justification.
var telemetryIgnoredOnRuntimeOnly = map[string]string{
"serviceName": "per-server, set from MCPTelemetryConfigReference.ServiceName and defaulted at runtime by " +
"telemetry.ResolveServiceName; intentionally absent from the shared MCPTelemetryConfig",
"serviceVersion": "resolved at runtime from binary version (issue #2296)",
"environmentVariables": "CLI-only, not applicable to CRD-managed telemetry",
"caCertPath": "filesystem path injected by runconfig.AppendTelemetryRunnerOption after the operator computes the volume mount " +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: can telemetryIgnoredOnRuntimeOnly and telemetryIgnoredOnCRDOnly be specified as annotations/tags on the actual struct itself? That keeps the type's documentation in one spot. Annotations are accessible via the reflect package.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went back and forth with CC and it suggests keeping the maps. An ignore entry is really a cross-type fact ("no peer on the other side"), not an intrinsic property of one field. Putting it on the type also pushes edits into stable v1beta1 API code or into pkg/telemetry, which is shared with the standalone thv vmcp serve CLI and shouldn't carry operator-test metadata. Justification strings with parens, commas, and concatenation flatten awkwardly into tag values too. Since the Mappings table has to stay relational regardless, the savings would only be ~30 lines of strings. Happy to revisit if I'm missing something.

Comment thread cmd/thv-operator/pkg/spectoconfig/telemetry_drift_test.go Outdated
Three changes from PR review:

1. Fix bogus symbol name in caCertPath justification
   (telemetry_drift_test.go). The string referenced
   runconfig.AppendTelemetryRunnerOption, which does not exist anywhere
   in the tree. The actual wiring lives at
   cmd/thv-operator/pkg/runconfig/telemetry.go:33 inside
   AddMCPTelemetryConfigRefOptions, where config.CACertPath is assigned
   from the operator-computed volume-mount path.

2. Drop the standalone "if inline { return \"\" }" branch in the path-
   segment builder (reflect.go). encoding/json does not recognize the
   ",inline" tag option — only ",omitempty" and ",string" are parsed;
   any other token is silently discarded. Flattening is driven entirely
   by StructField.Anonymous combined with the absence of an explicit
   JSON name. Today every ",inline" usage in the codebase happens to
   also be on an anonymous field, so the bug was masked, but a future
   named field tagged json:"foo,inline" would have been incorrectly
   flattened by the walker while encoding/json would emit "foo":{...}.
   Add a regression test (named_field_with_,inline_tag_stays_nested) to
   pin the corrected behavior, and simplify parseJSONTag to match.

3. Extract a generic AssertNoDrift[Runtime, CRD] helper
   (internal/testutil/drift.go). The previous three top-level test
   functions (CRDFieldsCovered, RuntimeFieldsCovered,
   MappingTableSanity) are now subtests run inside the helper, so a new
   domain (OIDC, vMCP, etc.) drops to a single TestXxxDrift function
   that supplies its three tables. The helper carries Domain through
   into failure messages so output stays as actionable as before. The
   FieldMapping type and liveLeafSet helper move to the testutil
   package to live alongside the harness.

Net: ~130 LoC removed from telemetry_drift_test.go, ~220 LoC of
reusable harness added in drift.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 8, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 8, 2026
CI flagged two issues on the previous push:

1. cyclomatic complexity 19 of `assertTableSanity` (limit 15). Extract
   the five independent invariants — empty/duplicate mappings, mapped-
   and-ignored overlap, justification non-emptiness, cross-pollination,
   and stale-entry detection — into named helpers. The orchestrator
   becomes a six-line dispatch and each helper is small enough to read
   in one screen. The stale-entry helper is parameterized by side so
   the CRD and runtime passes share one body.

2. paralleltest on TestTelemetryConfigDrift. Add t.Parallel() to the
   outer test; subtests inside AssertNoDrift already call it.

Also unrelated to this PR: TestGetOverlayMounts in pkg/ignore is a
known intermittent flake (race), not introduced by these changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 8, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 8, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 8, 2026
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving since this is a net-positive but I have one suggestion on how to make this more useful beyond testing.

// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
// SPDX-License-Identifier: Apache-2.0

package testutil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I don't think this should be entirely a testutil.

Ideally, the mapping defined in telemetry_drift_test.go is what is used in the production code conversion. Then, in a test, we assert on the mappings and types that there has been no drift. In other words, the only test part of this code should be:

func TestTelemetryConfigDrift(t *testing.T) {
	t.Parallel()
	testutil.AssertNoDrift[telemetry.Config, v1beta1.MCPTelemetryConfigSpec](t, testutil.DriftSpec{
		Domain:               "telemetry",
        // The following values are defined in the production code,
        // since they are used to drive the conversion.
		Mappings:             telemetryFieldMappings,
		IgnoredOnCRDOnly:     telemetryIgnoredOnCRDOnly,
		IgnoredOnRuntimeOnly: telemetryIgnoredOnRuntimeOnly,
	})
}

WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants