NO-JIRA: Skip CO condition tests on SNO#30775

Open
hongkailiu wants to merge 3 commits into
openshift:mainfrom
hongkailiu:skip-co-tests-on-sno

Conversation

@hongkailiu
Member

@hongkailiu hongkailiu commented Feb 12, 2026

This pull skips all CO tests on SNO. Single-node clusters may briefly go Available=False for many operators during updates or Node reboots. Several operator teams also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And we don't have the capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite at the moment.

  • Available=False and Degraded=True are not checked at all, regardless of whether the test case is executed in an upgrade test suite. Previously they were handled as exceptions, so the job would only be flaky instead of failing. Thus, the relevant exceptions are removed.

  • All checks on the Progressing condition are skipped as well on a SNO cluster.

If determining the control plane topology fails, the inherited behavior of only logging the error is kept, because I am not sure on which types of clusters such an error will show up.

Note that the exceptions about SNO coming from #31172 have been removed.
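
The skip behavior described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `TopologyMode` is redeclared locally (assumed to mirror the string-typed values in openshift/api's config/v1 package), and `testCase` and `conditionTestCases` are hypothetical stand-ins for the JUnit types and helpers in `legacycvomonitortests`.

```go
package main

import "fmt"

// TopologyMode is redeclared here so the sketch compiles on its own; the
// values are assumed to match configv1.TopologyMode in openshift/api.
type TopologyMode string

const (
	SingleReplicaTopologyMode   TopologyMode = "SingleReplica"
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
)

// testCase is a hypothetical stand-in for the JUnit test case type.
type testCase struct {
	Name        string
	SkipMessage string
}

// conditionTestCases emits one test case per operator/condition pair and
// marks every case as skipped when the control plane is single-node.
func conditionTestCases(topology TopologyMode, operators, conditions []string) []testCase {
	var ret []testCase
	for _, op := range operators {
		for _, cond := range conditions {
			tc := testCase{
				Name: fmt.Sprintf("clusteroperator/%s should not change condition/%s", op, cond),
			}
			if topology == SingleReplicaTopologyMode {
				tc.SkipMessage = "Test skipped on a single-node cluster"
			}
			ret = append(ret, tc)
		}
	}
	return ret
}

func main() {
	cases := conditionTestCases(SingleReplicaTopologyMode,
		[]string{"kube-apiserver"}, []string{"Available", "Degraded"})
	for _, tc := range cases {
		fmt.Printf("%s skipped=%t\n", tc.Name, tc.SkipMessage != "")
	}
}
```

Emitting a skipped case, rather than omitting the case entirely, keeps the test name visible in JUnit output on every topology.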

Summary by CodeRabbit

  • Tests
    • Monitoring and scaling tests are topology-aware and respect single-node, two-node, and multi-node control-plane layouts.
    • Operator state transition tests accept explicit topology, adjust validations, and emit clear single-node skip messages for skipped cases.
    • Topology retrieval failures are logged; tests fall back to sensible behavior when topology cannot be determined.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci Bot requested review from deads2k and sjenning February 12, 2026 03:30
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@hongkailiu
Member Author

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-single-node-ipv6-conformance

@openshift-ci
Contributor

openshift-ci Bot commented Feb 12, 2026

@hongkailiu: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@hongkailiu
Member Author

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-single-node

@openshift-ci
Contributor

openshift-ci Bot commented Feb 12, 2026

@hongkailiu: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@hongkailiu
Member Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-single-node

@openshift-ci
Contributor

openshift-ci Bot commented Feb 12, 2026

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/96b95810-0824-11f1-9668-f7261cec9d7e-0

@hongkailiu
Member Author

hongkailiu commented Feb 12, 2026

From this job: skipping as expected.

$ curl -s 
<testcase name="[Monitor:legacy-cvo-invariants][bz-kube-apiserver] clusteroperator/kube-apiserver should not change condition/Available" time="9450.182493088">
<skipped message="Test skipped on a single-node cluster"/>
</testcase>

The cases still run (and pass) on a non-SNO cluster:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2021794122666348544/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20260212-050021.xml | grep 'should not change condition/Available' -A1 | tail -n2
    <testcase name="[Monitor:legacy-cvo-invariants][bz-Storage] clusteroperator/storage should not change condition/Available" time="0"></testcase>
    <testcase name="[Monitor:legacy-cvo-invariants][bz-apiserver-auth] clusteroperator/authentication should not change condition/Degraded" time="0"></testcase>

The scaling test is the part not covered by these runs, but its code to determine SNO is quite similar.
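
The same check done with `curl` and `grep` above could also be scripted. The following is a hedged sketch using only the Go standard library; the struct shapes cover just the attributes visible in the output above, the embedded XML is an invented sample modeled on that output, and `countConditionCases` is a hypothetical helper name.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// skipped models the <skipped message="..."/> child element.
type skipped struct {
	Message string `xml:"message,attr"`
}

// testcase models only the parts of <testcase> used in this sketch.
type testcase struct {
	Name    string   `xml:"name,attr"`
	Skipped *skipped `xml:"skipped"`
}

// testsuite has no XMLName field, so any root element name is accepted.
type testsuite struct {
	TestCases []testcase `xml:"testcase"`
}

// countConditionCases counts skipped and non-skipped
// "should not change condition/" cases in a JUnit document.
func countConditionCases(junit []byte) (skippedCount, ranCount int, err error) {
	var suite testsuite
	if err = xml.Unmarshal(junit, &suite); err != nil {
		return 0, 0, err
	}
	for _, tc := range suite.TestCases {
		if !strings.Contains(tc.Name, "should not change condition/") {
			continue
		}
		if tc.Skipped != nil {
			skippedCount++
		} else {
			ranCount++
		}
	}
	return skippedCount, ranCount, nil
}

// sample is made-up input for illustration, not real job output.
const sample = `<testsuite>
<testcase name="clusteroperator/kube-apiserver should not change condition/Available" time="1">
<skipped message="Test skipped on a single-node cluster"/>
</testcase>
<testcase name="clusteroperator/storage should not change condition/Available" time="0"></testcase>
<testcase name="some unrelated test" time="0"></testcase>
</testsuite>`

func main() {
	s, r, err := countConditionCases([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("skipped=%d ran=%d\n", s, r)
}
```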

/verified by @hongkailiu

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 12, 2026
@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by @hongkailiu.

Details

In response to this:

From this job: skipping as expected.

$ curl -s 
<testcase name="[Monitor:legacy-cvo-invariants][bz-kube-apiserver] clusteroperator/kube-apiserver should not change condition/Available" time="9450.182493088">
<skipped message="Test skipped on a single-node cluster"/>
</testcase>

The cases are still running (passing) on non-SNO cluster:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2021794122666348544/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20260212-050021.xml | grep 'should not change condition/Available' -A1 | tail -n2
   <testcase name="[Monitor:legacy-cvo-invariants][bz-Storage] clusteroperator/storage should not change condition/Available" time="0"></testcase>
   <testcase name="[Monitor:legacy-cvo-invariants][bz-apiserver-auth] clusteroperator/authentication should not change condition/Degraded" time="0"></testcase>

/verified by @hongkailiu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

/verified by @hongkailiu

@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by @hongkailiu.

Details

In response to this:

/verified by @hongkailiu

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@hongkailiu
Member Author

A follow up should be done after this gets in.

@tmshort
Contributor

tmshort commented May 20, 2026

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5

@openshift-ci
Contributor

openshift-ci Bot commented May 20, 2026

@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/36514650-547c-11f1-851d-1c15132aeed9-0

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2026
@hongkailiu hongkailiu force-pushed the skip-co-tests-on-sno branch from 96f3cd1 to 8464d63 Compare May 20, 2026 19:06
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2026
@coderabbitai

coderabbitai Bot commented May 20, 2026

Walkthrough

This PR centralizes control-plane topology lookup in monitor tests, threads the resulting topology into operator state-transition helpers (removing clientConfig from exception callbacks), adds single-node-aware skip logic in those helpers, and updates a machine-scale test to skip operator-violation assertions on single-node clusters.

Changes

Topology-aware operator state transition testing

| Layer / File(s) | Summary |
| --- | --- |
| **Central topology fetch**<br>`pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go` | Adds e2e logging; EvaluateTestsFromConstructedIntervals fetches control-plane topology and passes it into monitor test helpers instead of w.adminRESTConfig. |
| **exceptionCallback signature change**<br>`pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | exceptionCallback no longer accepts clientConfig; all callers and closures updated to the new signature. |
| **Stable-system operator transition wiring**<br>`pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testStableSystemOperatorStateTransitions accepts topology and forwards topology/single-node info into testOperatorStateTransitions; internal exception closure updated. |
| **Upgrade operator state transitions**<br>`pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testUpgradeOperatorStateTransitions accepts topology, derives isTwoNode, removes in-function topology lookup, updates exception closure signature, and passes topology/singleNode into testOperatorStateTransitions. |
| **Core operator transition logic & single-node skips**<br>`pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testOperatorStateTransitions now takes topology and emits skipped JUnit cases for each operator/condition when topology == SingleReplicaTopologyMode; exception invocation updated to match removed param. |
| **Upgrade progressing checks & single-node skips**<br>`pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testUpgradeOperatorProgressingStateTransitions accepts topology, derives isTwoNode, and adds explicit single-node skip messages for machine-config and Progressing=False per-operator checks. |
| **Scale test topology-aware validation**<br>`test/extended/machines/scale.go` | AfterEach reloads kubeconfig, creates typed configv1 client, retrieves control-plane topology via exutil, logs on error, and only asserts no operator Progressing=False violations when topology != SingleReplicaTopologyMode. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • openshift/origin#31172: Changes in the same legacycvomonitortests/operators.go area around topology-aware Progressing/exception handling for single-node clusters.
  • openshift/origin#31112: Refactors passing topology through test*StateTransitions and adjusts exception/skip handling for operator tests.

Suggested labels

jira/valid-reference, ready-for-human-review

Suggested reviewers

  • sjenning
  • deads2k
  • p0lyn0mial
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | PR introduces assertions without meaningful failure messages in scale.go AfterEach (lines 291-292 for topology retrieval failures) violating requirement #4 for helpful diagnostic messages. | Add failure messages to assertions: `o.Expect(err).NotTo(o.HaveOccurred(), "failed to load config")` and `o.Expect(err).NotTo(o.HaveOccurred(), "failed to create configV1 client")` |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Skip CO condition tests on SNO' accurately and concisely describes the main objective of the changeset: skipping cluster operator condition tests on single-node OpenShift clusters, which is reflected consistently across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in modified files use stable, deterministic values with no dynamic information like timestamps or UUIDs found in test titles. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests added; PR modifies existing tests/helpers. Existing test has [apigroup:machine.openshift.io] tag providing MicroShift protection. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo tests (It, Describe, Context, When) were added in this PR. The changes only modify existing test infrastructure and helper functions to skip/handle SNO topology appropriately. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only test/monitoring code to skip tests on SNO and handle two-node topologies. No deployment manifests or problematic scheduling constraints are introduced. |
| Ote Binary Stdout Contract | ✅ Passed | No process-level stdout writes violating OTE contract found. e2e.Logf writes to GinkgoWriter (exempted), logrus calls are in functions only, no direct klog or fmt.Print calls. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added. The PR modifies existing test helpers to add SNO topology awareness with no IPv4 assumptions or external connectivity requirements. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@openshift-ci
Contributor

openshift-ci Bot commented May 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu
Once this PR has been reviewed and has the lgtm label, please assign sosiouxme for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`:
- Around line 670-673: When topology == configv1.SingleReplicaTopologyMode,
short-circuit the test-case generation so both Progressing-condition suites are
marked skipped: detect the SingleReplicaTopologyMode early (using the existing
topology variable) and set mcTestCase.SkipMessage for the initial "must go
Progressing=True" cases and also create/mark the corresponding JUnit cases from
the second suite (the "should stay Progressing=False while MCO is
Progressing=True" cases) with the same SkipMessage; in practice, return or emit
skipped test entries for both suites instead of only the first branch so all
Progressing checks are consistently skipped on single-node clusters.
- Around line 490-499: The single-node (SNO) skip branch is executed after the
empty-event fast path, causing operator/condition pairs with no events to be
reported as passing instead of skipped; move the singleNode check so it runs
before the empty-event/early-return logic. Edit the function that builds test
results (the block that appends to ret with junitapi.JUnitTestCase using
testName and SkipMessage) to check singleNode first and return/append a skipped
test for SNOs before evaluating the empty-event case (the early-return that
emits pass results when there are no events). Ensure the singleNode branch still
creates the SkipMessage and appends to ret exactly as it does now, but placed
ahead of the empty-event fast path.

In `@test/extended/machines/scale.go`:
- Around line 288-294: GetControlPlaneTopologyFromConfigClient can return an
error and nil topo; after calling
GetControlPlaneTopologyFromConfigClient(configV1Client) check for err or nil
topo and bail out (or skip the SingleReplicaTopologyMode assertion) before
dereferencing *topo; specifically, if err != nil or topo == nil, log the error
and return/continue so the subsequent check that compares *topo to
configv1.SingleReplicaTopologyMode and the o.Expect(violations)... call are not
executed on a nil pointer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3ac05f96-6400-4348-86f4-e2d09933b1a4

📥 Commits

Reviewing files that changed from the base of the PR and between 428b9a0 and 8464d63.

📒 Files selected for processing (3)
  • pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go
  • pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
  • test/extended/machines/scale.go

Comment thread pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go Outdated
Comment thread test/extended/machines/scale.go
@hongkailiu hongkailiu force-pushed the skip-co-tests-on-sno branch from 47f7a5e to 185b11c Compare May 20, 2026 19:14

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go (1)

94-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently treat unknown topology as multi-node.

If getControlPlaneTopology fails here, topology stays empty, so singleNode and all downstream isTwoNode checks become false. That means one transient lookup failure disables every SNO skip and dual-replica exception in this evaluator, which can reintroduce the flakes this refactor is trying to avoid. Please fail closed here or thread an explicit “topology unknown” path into the helpers instead of falling back to HA behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`
around lines 94 - 108, getControlPlaneTopology errors are currently logged and
topology left empty which causes singleNode to be false and incorrectly treat
unknown topology as multi-node; change the behavior in the block around
getControlPlaneTopology(w.adminRESTConfig) so that on err you return the error
(or set and propagate an explicit unknown topology sentinel into the downstream
calls) instead of continuing. Specifically, ensure the caller of
getControlPlaneTopology either returns fmt.Errorf(...) when err != nil, or sets
a distinct topology value and updates calls to
testStableSystemOperatorStateTransitions, testUpgradeOperatorStateTransitions,
and testUpgradeOperatorProgressingStateTransitions to accept and handle an
"unknown" topology, so SNO-specific skips are preserved when topology cannot be
determined.
♻️ Duplicate comments (2)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go (2)

478-499: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Check singleNode before the empty-event fast path.

When operatorEvents is empty, this still emits a passing test case on SNO instead of a skipped one. If the intent is to skip all Available/Degraded checks on single-node clusters, the singleNode branch needs to run first.
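
The reordering can be illustrated with a small standalone sketch. The types and the `buildCase` helper are hypothetical simplifications of the loop in `operators.go`; events are reduced to plain strings so the branch order is the only thing on display.

```go
package main

import "fmt"

// testCase is a hypothetical stand-in for the JUnit test case type.
type testCase struct {
	Name          string
	SkipMessage   string
	FailureOutput string
}

// buildCase checks singleNode first, so a single-node cluster always yields a
// skipped case, even when there are no events for the operator/condition pair.
func buildCase(name string, singleNode bool, events []string) testCase {
	if singleNode {
		return testCase{Name: name, SkipMessage: "Test skipped on a single-node cluster"}
	}
	if len(events) == 0 {
		// Empty-event fast path: nothing changed, so the case passes.
		return testCase{Name: name}
	}
	// In this simplified sketch every remaining event counts as a violation.
	return testCase{
		Name:          name,
		FailureOutput: fmt.Sprintf("%d unexpected condition transitions", len(events)),
	}
}

func main() {
	name := "clusteroperator/storage should not change condition/Available"
	fmt.Printf("%+v\n", buildCase(name, true, nil))
	fmt.Printf("%+v\n", buildCase(name, false, nil))
	fmt.Printf("%+v\n", buildCase(name, false, []string{"Available=False"}))
}
```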

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`
around lines 478 - 499, The loop that builds test cases for each operator checks
for empty operatorEvents before checking singleNode, causing empty-event tests
to be reported as passing on single-node clusters; reorder the branches in the
loop (the block using operatorEvents := eventsByOperator[operatorName] and the
subsequent if statements) so the singleNode check runs before the empty-event
fast path—i.e., for each operator (platformidentification.KnownOperators.List())
and conditionType, evaluate singleNode first and emit the SkipMessage
JUnitTestCase when singleNode is true, then handle the len(operatorEvents) == 0
case to append a plain test case.

653-674: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip the second Progressing suite on SNO as well.

The first Progressing loop skips SingleReplicaTopologyMode, but the later should stay Progressing=False while MCO is Progressing=True loop still evaluates on SNO. That means SNO only skips half of the Progressing assertions, which conflicts with the PR goal.

Also applies to: 755-784

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`
around lines 653 - 674, The second "Progressing" test loop (the one asserting
"should stay Progressing=False while MCO is Progressing=True") is missing the
SingleReplicaTopologyMode skip, so SNO only skips the first Progressing suite;
update that second loop to mirror the first by adding the same topology check
(topology == configv1.SingleReplicaTopologyMode) and set mcTestCase.SkipMessage
to "Test skipped on a single-node cluster" when true; locate the loop iterating
platformidentification.KnownOperators.List() and the mcTestCase/SkipMessage
logic in that suite (also referenced around the other block at the 755-784
region) and apply the identical skip branch used earlier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 29fe20c3-e6d9-4fed-bbf8-26aae24dbefd

📥 Commits

Reviewing files that changed from the base of the PR and between 8464d63 and 185b11c.

📒 Files selected for processing (2)
  • pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go
  • pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@hongkailiu
Member Author

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5

@openshift-ci
Contributor

openshift-ci Bot commented May 20, 2026

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/93d4e570-5486-11f1-8bc2-54578cc4942a-0

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 20, 2026
@hongkailiu hongkailiu changed the title Skip CO condition tests on SNO NO-JIRA: Skip CO condition tests on SNO May 20, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 20, 2026
@openshift-ci-robot

@hongkailiu: This pull request explicitly references no jira issue.

Details

In response to this:

This pull skips all CO tests on SNO.

  • Available=False and Degrade=True are not checked at all no matter if the test case is executed in an upgrade test suite, or not. Before it was handled as an exception and thus the job would be just flaky instead of failing. Thus, the relevant exceptions can be removed.

  • All checks on the Progressing condition are skipped as well on a SNO cluster.

The logging logic was inherited if it fails to determine the control plane topology because I am not sure on which type of clusters an error will show up.

Summary by CodeRabbit

  • Tests
  • Monitoring and scaling tests are now topology-aware and respect single-node, two-node, and multi-node control-plane layouts.
  • Operator state transition tests now accept explicit topology/single-node flags, adjust validations, and emit clear single-node skip messages for skipped JUnit cases.
  • Topology retrieval failures are logged and tests fall back to sensible single-node handling.

@openshift-ci-robot

@tmshort: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

Details

In response to this:

/verified cancel
This should've been removed after the rebase

@tmshort
Contributor

tmshort commented May 20, 2026

/verified remove
This should've been removed after the rebase

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 20, 2026
@openshift-ci-robot

@tmshort: The verified label has been removed.

Details

In response to this:

/verified remove
This should've been removed after the rebase

Member

@wking wking left a comment

I like the topology centralization in EvaluateTestsFromConstructedIntervals; that's better than a separate getControlPlaneTopology in each test.* function.

I'm fine with us making the call that we don't have capacity to monitor ClusterOperator behavior for single-node clusters. I just think we want to make it clear that the motivation is lack-of-capacity, and not that we're saying that we think happy-case single-node updates having ClusterOperators go Degraded=True is a good thing. For example, the Degraded Godocs say:

Degraded indicates that the component (operator and all configured operands) does not match its desired state over a period of time resulting in a lower quality of service.

And in a world where we had infinite capacity, we would be able to teach all the cluster operators that a minute or whatever of downtime around a Node reboot or component update was expected single-node behavior, and not a quality-of-service miss.

Anyhow, no worries from me merging this, as long as the commit message and comments we're using to motivate the change are clearly saying "would be nice, but we don't have time" and aren't sounding like "even if we had infinite time, we wouldn't work on improving single-node ClusterOperator status".

Comment thread pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go Outdated
Comment thread pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go Outdated
This pull skips all CO tests on SNO. On single-node clusters, many operators may briefly go Available=False during updates or Node reboots. Several operator teams also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And we don't have capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite at the moment.

- `Available=False` and `Degraded=True` are not checked at all, regardless of whether the test case is executed in an upgrade test suite. Before, these were handled as exceptions, so the job would just be flaky instead of failing. Thus, the relevant exceptions are removed.

- All checks on the `Progressing` condition are skipped as well on a SNO cluster.

If determining the control plane topology fails, the existing logging logic is kept, because I am not sure on which types of clusters such an error would show up.
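As a rough illustration of the skip behavior described in this commit message, the following Go sketch reports a condition check as skipped on a single-node cluster, and otherwise as failed or passed. The skip message is the one that appears in the JUnit output of this PR; everything else (`testCase`, `checkCondition`, the test name, and the violation text) is hypothetical and simplified.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// singleNodeSkipMessage matches the message seen in the PR's JUnit output.
const singleNodeSkipMessage = "Test skipped on a single-node cluster"

// skipped and testCase are simplified, hypothetical JUnit shapes.
type skipped struct {
	Message string `xml:"message,attr"`
}

type testCase struct {
	XMLName xml.Name `xml:"testcase"`
	Name    string   `xml:"name,attr"`
	Skipped *skipped `xml:"skipped,omitempty"`
	Failure string   `xml:"failure,omitempty"`
}

// checkCondition reports one ClusterOperator condition check.
// singleNode and violations are inputs the caller is assumed to have computed.
func checkCondition(name string, singleNode bool, violations []string) testCase {
	tc := testCase{Name: name}
	switch {
	case singleNode:
		// On single-node clusters the check is not evaluated at all.
		tc.Skipped = &skipped{Message: singleNodeSkipMessage}
	case len(violations) > 0:
		tc.Failure = fmt.Sprintf("%d unexpected condition change(s): %v", len(violations), violations)
	}
	return tc
}

func main() {
	// Hypothetical test name and violation, for illustration only.
	name := "clusteroperator/example should not change condition/Available"
	violations := []string{"Available=False for 40s"}
	for _, singleNode := range []bool{true, false} {
		out, err := xml.MarshalIndent(checkCondition(name, singleNode, violations), "", "    ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

Putting the single-node case first in the `switch` means recorded violations are ignored on SNO rather than turned into flakes through exceptions, which is the behavior change the commit message describes.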
@hongkailiu hongkailiu force-pushed the skip-co-tests-on-sno branch from dff8711 to 36e2d79 Compare May 21, 2026 04:52
@openshift-ci-robot

@hongkailiu: This pull request explicitly references no jira issue.


	if reason == "_ManagedDeploymentsAvailable" {
		return "https://issues.redhat.com/browse/OCPBUGS-62633"
	}
case "olm":
Member Author

Note that the exceptions about SNO coming from #31172 have been removed.

/cc @tmshort

Member Author

@hongkailiu hongkailiu May 21, 2026

I have examples of co/operator-lifecycle-manager-packageserver failing on SNO. That is consistent with https://redhat.atlassian.net/browse/OCPBUGS-63672 being fixed everywhere except on SNO, which is tracked in https://redhat.atlassian.net/browse/OCPBUGS-67210.

Is there any example of co/olm failing on SNO? If there is, it is hard to reconcile with the high pass rate in Sippy.

I was trying to understand this in Slack but got distracted.


The following is for my own memory, not relevant to the above question.

The move from OCPBUGS-63672 to OCPBUGS-67210 (by this comment) is confusing because the condition being checked changed from Progressing to Available.
The reason is that I copied the output of the wrong test case in this comment: the linked job, such as this one in my previous comment, contains two flaky cases for clusteroperator/operator-lifecycle-manager-packageserver, one for Available and the other for Progressing.

OCPBUGS-67210 should be about Progressing as well.

@openshift-ci openshift-ci Bot requested a review from tmshort May 21, 2026 05:05
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@hongkailiu
Member Author

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c3bff00-54d7-11f1-9d6e-c7cb0ec27de4-0

@hongkailiu
Member Author

Take a job from https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregator-periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node/2057335663132413952

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-30775-ci-5.0-e2e-aws-upgrade-ovn-single-node/2057335662612320256/artifacts/e2e-aws-upgrade-ovn-single-node-4/single-node-e2e-test/artifacts/junit/e2e-monitor-tests__20260521-070135.xml | grep 'clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False' -A3
    <testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False while MCO is Progressing=True" time="0">
        <properties></properties>
        <skipped message="Test skipped on a single-node cluster"></skipped>
    </testcase>

So the case was skipped on the single-node cluster. The cases for the other conditions were skipped too.

Normal cluster: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2057330851787575296

The same case is not skipped there:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2057330851787575296/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20260521-063507.xml | grep 'clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False' -A3
    <testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False while MCO is Progressing=True" time="0">
        <properties></properties>
    </testcase>
    <testcase name="[Monitor:legacy-cvo-invariants][bz-service-ca] clusteroperator/service-ca should stay Progressing=False while MCO is Progressing=True" time="0">

/verified by @hongkailiu
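The `curl | grep` check above can also be done structurally. The following Go sketch parses a JUnit document and reports whether matching test cases carry a `<skipped>` element. It assumes the `<testcase>` elements are direct children of the root element; the `sample` document is hand-assembled from the two outputs quoted above rather than downloaded, and `junitSuite`, `junitCase`, `junitSkipped`, and `matchingCases` are hypothetical names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Minimal JUnit shapes; only the fields needed for the check are mapped.
type junitSkipped struct {
	Message string `xml:"message,attr"`
}

type junitCase struct {
	Name    string        `xml:"name,attr"`
	Skipped *junitSkipped `xml:"skipped"`
}

// junitSuite assumes testcase elements sit directly under the root element.
type junitSuite struct {
	Cases []junitCase `xml:"testcase"`
}

// matchingCases returns the test cases whose name contains substr.
func matchingCases(data []byte, substr string) ([]junitCase, error) {
	var suite junitSuite
	if err := xml.Unmarshal(data, &suite); err != nil {
		return nil, err
	}
	var matches []junitCase
	for _, c := range suite.Cases {
		if strings.Contains(c.Name, substr) {
			matches = append(matches, c)
		}
	}
	return matches, nil
}

// sample is assembled by hand from the outputs quoted in this comment.
const sample = `<testsuite>
    <testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False while MCO is Progressing=True" time="0">
        <properties></properties>
        <skipped message="Test skipped on a single-node cluster"></skipped>
    </testcase>
    <testcase name="[Monitor:legacy-cvo-invariants][bz-service-ca] clusteroperator/service-ca should stay Progressing=False while MCO is Progressing=True" time="0">
        <properties></properties>
    </testcase>
</testsuite>`

func main() {
	cases, err := matchingCases([]byte(sample), "should stay Progressing=False")
	if err != nil {
		panic(err)
	}
	for _, c := range cases {
		if c.Skipped != nil {
			fmt.Printf("SKIPPED (%s): %s\n", c.Skipped.Message, c.Name)
		} else {
			fmt.Printf("RAN: %s\n", c.Name)
		}
	}
}
```

Compared with `grep -A3`, this does not depend on how many lines follow the `<testcase>` opening tag, so a skip is detected even if the element layout changes.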

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 21, 2026
@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by @hongkailiu.

@hongkailiu
Member Author

/test e2e-aws-ovn-serial-1of2

@hongkailiu
Member Author

/test e2e-aws-ovn-serial-2of2

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@hongkailiu: all tests passed!

Labels

  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • ready-for-human-review: Indicates a PR has been reviewed by automated tools and is ready for human review.
  • verified: Signifies that the PR passed pre-merge verification criteria.
