NO-JIRA: Skip CO condition tests on SNO by hongkailiu · Pull Request #30775 · openshift/origin

hongkailiu · 2026-02-12T03:30:26Z

This pull skips all CO tests on SNO. On single-node clusters, many operators may briefly go Available=False during updates or Node reboots. Several operators also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And at the moment we do not have the capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite.

Available=False and Degraded=True are not checked at all, regardless of whether the test case runs in an upgrade test suite. Previously these conditions were handled as exceptions, so the job would be flaky rather than failing. The relevant exceptions are therefore removed.
All checks on the Progressing condition are skipped as well on an SNO cluster.

If the control plane topology cannot be determined, the existing behavior of only logging the error is kept, because I am not sure on which types of clusters such an error would show up.

Note that the exceptions about SNO coming from #31172 have been removed.
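
A minimal sketch of the skip behavior described above, assuming simplified stand-in types rather than the suite's real JUnit structs. `testCase`, `skipped`, and `buildSkippableCases` are hypothetical names; the `"SingleReplica"` string mirrors the value of `configv1.SingleReplicaTopologyMode`, and the skip message is the one that shows up in the job artifacts.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// skipped and testCase are simplified stand-ins for the JUnit types used by
// the real suite; they are assumptions made for this sketch.
type skipped struct {
	Message string `xml:"message,attr"`
}

type testCase struct {
	XMLName xml.Name `xml:"testcase"`
	Name    string   `xml:"name,attr"`
	Skipped *skipped `xml:"skipped,omitempty"`
}

// singleReplica mirrors the value of configv1.SingleReplicaTopologyMode.
const singleReplica = "SingleReplica"

// buildSkippableCases returns one case per operator/condition pair and marks
// every case as skipped when the control plane is single-node.
func buildSkippableCases(topology string, operators, conditions []string) []testCase {
	var cases []testCase
	for _, op := range operators {
		for _, cond := range conditions {
			tc := testCase{Name: fmt.Sprintf("clusteroperator/%s should not change condition/%s", op, cond)}
			if topology == singleReplica {
				tc.Skipped = &skipped{Message: "Test skipped on a single-node cluster"}
			}
			cases = append(cases, tc)
		}
	}
	return cases
}

func main() {
	cases := buildSkippableCases(singleReplica, []string{"kube-apiserver"}, []string{"Available", "Degraded"})
	for _, tc := range cases {
		out, err := xml.MarshalIndent(tc, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

On any other topology value the `Skipped` pointer stays nil, so the `<skipped>` element is omitted and the case is evaluated as before.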

Summary by CodeRabbit

Tests
- Monitoring and scaling tests are topology-aware and respect single-node, two-node, and multi-node control-plane layouts.
- Operator state transition tests accept explicit topology, adjust validations, and emit clear single-node skip messages for skipped cases.
- Topology retrieval failures are logged; tests fall back to sensible behavior when topology cannot be determined.

openshift-ci-robot · 2026-02-12T03:30:29Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-02-12T03:50:53Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

hongkailiu · 2026-02-12T14:55:33Z

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-single-node-ipv6-conformance

openshift-ci · 2026-02-12T14:55:44Z

@hongkailiu: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

hongkailiu · 2026-02-12T14:57:20Z

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-single-node

openshift-ci · 2026-02-12T14:57:28Z

@hongkailiu: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

hongkailiu · 2026-02-12T15:07:45Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-single-node

openshift-ci · 2026-02-12T15:08:14Z

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/96b95810-0824-11f1-9668-f7261cec9d7e-0

hongkailiu · 2026-02-12T19:30:58Z

From this job: skipping as expected.

$ curl -s 
<testcase name="[Monitor:legacy-cvo-invariants][bz-kube-apiserver] clusteroperator/kube-apiserver should not change condition/Available" time="9450.182493088">
<skipped message="Test skipped on a single-node cluster"/>
</testcase>

The cases are still running (and passing) on a non-SNO cluster:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2021794122666348544/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20260212-050021.xml | grep 'should not change condition/Available' -A1 | tail -n2
    <testcase name="[Monitor:legacy-cvo-invariants][bz-Storage] clusteroperator/storage should not change condition/Available" time="0"></testcase>
    <testcase name="[Monitor:legacy-cvo-invariants][bz-apiserver-auth] clusteroperator/authentication should not change condition/Degraded" time="0"></testcase>

The only part not covered by this verification is the scaling test, but its code for detecting SNO is very similar.
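
The same check can be scripted instead of grepping by hand. The following is a rough sketch that parses a JUnit document and lists the skipped cases whose name contains a substring. The struct layout (`testsuite` root with `testcase` children) and the embedded sample are assumptions for illustration, not a copy of the real artifact.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Only the fields needed for this check are modeled.
type junitSkipped struct {
	Message string `xml:"message,attr"`
}

type junitCase struct {
	Name    string        `xml:"name,attr"`
	Skipped *junitSkipped `xml:"skipped"`
}

type junitSuite struct {
	Cases []junitCase `xml:"testcase"`
}

// skippedCases maps the name of each skipped case containing substr to its
// skip message.
func skippedCases(doc []byte, substr string) (map[string]string, error) {
	var suite junitSuite
	if err := xml.Unmarshal(doc, &suite); err != nil {
		return nil, err
	}
	found := map[string]string{}
	for _, tc := range suite.Cases {
		if tc.Skipped != nil && strings.Contains(tc.Name, substr) {
			found[tc.Name] = tc.Skipped.Message
		}
	}
	return found, nil
}

// sample is a hand-written document shaped like the snippets in this comment.
const sample = `<testsuite>
  <testcase name="clusteroperator/kube-apiserver should not change condition/Available">
    <skipped message="Test skipped on a single-node cluster"/>
  </testcase>
  <testcase name="clusteroperator/storage should not change condition/Available"></testcase>
</testsuite>`

func main() {
	found, err := skippedCases([]byte(sample), "should not change condition/Available")
	if err != nil {
		panic(err)
	}
	for name, msg := range found {
		fmt.Printf("%s: %s\n", name, msg)
	}
}
```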

/verified by @hongkailiu

openshift-ci-robot · 2026-02-12T19:31:11Z

@hongkailiu: This PR has been marked as verified by @hongkailiu.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

hongkailiu · 2026-02-12T19:32:43Z

/verified by @hongkailiu

openshift-ci-robot · 2026-02-12T19:32:55Z

@hongkailiu: This PR has been marked as verified by @hongkailiu.

openshift-ci-robot · 2026-02-12T19:54:54Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

hongkailiu · 2026-02-17T03:49:41Z

A follow up should be done after this gets in.

tmshort · 2026-05-20T18:46:27Z

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5

openshift-ci · 2026-05-20T18:46:32Z

@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/36514650-547c-11f1-851d-1c15132aeed9-0

coderabbitai · 2026-05-20T19:06:46Z

Walkthrough

This PR centralizes control-plane topology lookup in monitor tests, threads the resulting topology into operator state-transition helpers (removing clientConfig from exception callbacks), adds single-node-aware skip logic in those helpers, and updates a machine-scale test to skip operator-violation assertions on single-node clusters.

Changes

Topology-aware operator state transition testing

| Layer / File(s) | Summary |
| --- | --- |
| Central topology fetch `pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go` | Adds `e2e` logging; `EvaluateTestsFromConstructedIntervals` fetches control-plane `topology` and passes it into monitor test helpers instead of `w.adminRESTConfig`. |
| exceptionCallback signature change `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | `exceptionCallback` no longer accepts `clientConfig`; all callers and closures updated to the new signature. |
| Stable-system operator transition wiring `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | `testStableSystemOperatorStateTransitions` accepts `topology` and forwards topology/single-node info into `testOperatorStateTransitions`; internal exception closure updated. |
| Upgrade operator state transitions `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | `testUpgradeOperatorStateTransitions` accepts `topology`, derives `isTwoNode`, removes in-function topology lookup, updates exception closure signature, and passes topology/singleNode into `testOperatorStateTransitions`. |
| Core operator transition logic & single-node skips `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | `testOperatorStateTransitions` now takes `topology` and emits skipped JUnit cases for each operator/condition when topology == `SingleReplicaTopologyMode`; exception invocation updated to match removed param. |
| Upgrade progressing checks & single-node skips `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | `testUpgradeOperatorProgressingStateTransitions` accepts `topology`, derives `isTwoNode`, and adds explicit single-node skip messages for machine-config and Progressing=False per-operator checks. |
| Scale test topology-aware validation `test/extended/machines/scale.go` | AfterEach reloads kubeconfig, creates typed `configv1` client, retrieves control-plane topology via `exutil`, logs on error, and only asserts no operator `Progressing=False` violations when topology != `SingleReplicaTopologyMode`. |
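
As a rough illustration of the signature change summarized above, the sketch below builds an exception callback as a closure over a topology value that was looked up once, instead of handing the callback a client config to do its own lookup. All names (`topologyMode`, `newExceptions`, the `etcd` rule) are hypothetical stand-ins and do not reproduce the real helpers or the real exception list.

```go
package main

import "fmt"

// topologyMode is a local stand-in for the configv1 topology type.
type topologyMode string

// dualReplica is assumed here to stand for the two-node topology value.
const dualReplica topologyMode = "DualReplica"

// exceptionCallback returns a non-empty explanation when a condition change
// is tolerated, and an empty string otherwise.
type exceptionCallback func(operator, reason string) string

// newExceptions builds the callback once per evaluation. The topology is
// captured by the closure, so the callback needs no client configuration.
func newExceptions(topology topologyMode) exceptionCallback {
	isTwoNode := topology == dualReplica
	return func(operator, reason string) string {
		// Hypothetical rule, for illustration only.
		if isTwoNode && operator == "etcd" {
			return "tolerated on two-node clusters (hypothetical example)"
		}
		return ""
	}
}

func main() {
	// The caller looks up the topology once and threads it through.
	except := newExceptions(dualReplica)
	fmt.Printf("%q\n", except("etcd", "SomeReason"))
	fmt.Printf("%q\n", except("storage", "SomeReason"))
}
```

Capturing the topology in the closure keeps the lookup in one place, which is the centralization the walkthrough describes.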

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

openshift/origin#31172: Changes in the same legacycvomonitortests/operators.go area around topology-aware Progressing/exception handling for single-node clusters.
openshift/origin#31112: Refactors passing topology through test*StateTransitions and adjusts exception/skip handling for operator tests.

Suggested labels

jira/valid-reference, ready-for-human-review

Suggested reviewers

sjenning
deads2k
p0lyn0mial

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | PR introduces assertions without meaningful failure messages in scale.go AfterEach (lines 291-292 for topology retrieval failures) violating requirement `#4` for helpful diagnostic messages. | Add failure messages to assertions: `o.Expect(err).NotTo(o.HaveOccurred(), "failed to load config")` and `o.Expect(err).NotTo(o.HaveOccurred(), "failed to create configV1 client")` |

✅ Passed checks (10 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Skip CO condition tests on SNO' accurately and concisely describes the main objective of the changeset: skipping cluster operator condition tests on single-node OpenShift clusters, which is reflected consistently across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in modified files use stable, deterministic values with no dynamic information like timestamps or UUIDs found in test titles. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests added; PR modifies existing tests/helpers. Existing test has [apigroup:machine.openshift.io] tag providing MicroShift protection. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo tests (It, Describe, Context, When) were added in this PR. The changes only modify existing test infrastructure and helper functions to skip/handle SNO topology appropriately. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only test/monitoring code to skip tests on SNO and handle two-node topologies. No deployment manifests or problematic scheduling constraints are introduced. |
| Ote Binary Stdout Contract | ✅ Passed | No process-level stdout writes violating OTE contract found. e2e.Logf writes to GinkgoWriter (exempted), logrus calls are in functions only, no direct klog or fmt.Print calls. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added. The PR modifies existing test helpers to add SNO topology awareness with no IPv4 assumptions or external connectivity requirements. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

openshift-ci · 2026-05-20T19:06:48Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu
Once this PR has been reviewed and has the lgtm label, please assign sosiouxme for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`:
- Around line 670-673: When topology == configv1.SingleReplicaTopologyMode,
short-circuit the test-case generation so both Progressing-condition suites are
marked skipped: detect the SingleReplicaTopologyMode early (using the existing
topology variable) and set mcTestCase.SkipMessage for the initial "must go
Progressing=True" cases and also create/mark the corresponding JUnit cases from
the second suite (the "should stay Progressing=False while MCO is
Progressing=True" cases) with the same SkipMessage; in practice, return or emit
skipped test entries for both suites instead of only the first branch so all
Progressing checks are consistently skipped on single-node clusters.
- Around line 490-499: The single-node (SNO) skip branch is executed after the
empty-event fast path, causing operator/condition pairs with no events to be
reported as passing instead of skipped; move the singleNode check so it runs
before the empty-event/early-return logic. Edit the function that builds test
results (the block that appends to ret with junitapi.JUnitTestCase using
testName and SkipMessage) to check singleNode first and return/append a skipped
test for SNOs before evaluating the empty-event case (the early-return that
emits pass results when there are no events). Ensure the singleNode branch still
creates the SkipMessage and appends to ret exactly as it does now, but placed
ahead of the empty-event fast path.

In `@test/extended/machines/scale.go`:
- Around line 288-294: GetControlPlaneTopologyFromConfigClient can return an
error and nil topo; after calling
GetControlPlaneTopologyFromConfigClient(configV1Client) check for err or nil
topo and bail out (or skip the SingleReplicaTopologyMode assertion) before
dereferencing *topo; specifically, if err != nil or topo == nil, log the error
and return/continue so the subsequent check that compares *topo to
configv1.SingleReplicaTopologyMode and the o.Expect(violations)... call are not
executed on a nil pointer.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3ac05f96-6400-4348-86f4-e2d09933b1a4

📥 Commits

Reviewing files that changed from the base of the PR and between 428b9a0 and 8464d63.

📒 Files selected for processing (3)

pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
test/extended/machines/scale.go

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go (1)
94-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently treat unknown topology as multi-node.

If getControlPlaneTopology fails here, topology stays empty, so singleNode and all downstream isTwoNode checks become false. That means one transient lookup failure disables every SNO skip and dual-replica exception in this evaluator, which can reintroduce the flakes this refactor is trying to avoid. Please fail closed here or thread an explicit “topology unknown” path into the helpers instead of falling back to HA behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`
around lines 94 - 108, getControlPlaneTopology errors are currently logged and
topology left empty which causes singleNode to be false and incorrectly treat
unknown topology as multi-node; change the behavior in the block around
getControlPlaneTopology(w.adminRESTConfig) so that on err you return the error
(or set and propagate an explicit unknown topology sentinel into the downstream
calls) instead of continuing. Specifically, ensure the caller of
getControlPlaneTopology either returns fmt.Errorf(...) when err != nil, or sets
a distinct topology value and updates calls to
testStableSystemOperatorStateTransitions, testUpgradeOperatorStateTransitions,
and testUpgradeOperatorProgressingStateTransitions to accept and handle an
"unknown" topology, so SNO-specific skips are preserved when topology cannot be
determined.

♻️ Duplicate comments (2)

pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go (2)
478-499: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Check singleNode before the empty-event fast path.

When operatorEvents is empty, this still emits a passing test case on SNO instead of a skipped one. If the intent is to skip all Available/Degraded checks on single-node clusters, the singleNode branch needs to run first.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`
around lines 478 - 499, The loop that builds test cases for each operator checks
for empty operatorEvents before checking singleNode, causing empty-event tests
to be reported as passing on single-node clusters; reorder the branches in the
loop (the block using operatorEvents := eventsByOperator[operatorName] and the
subsequent if statements) so the singleNode check runs before the empty-event
fast path—i.e., for each operator (platformidentification.KnownOperators.List())
and conditionType, evaluate singleNode first and emit the SkipMessage
JUnitTestCase when singleNode is true, then handle the len(operatorEvents) == 0
case to append a plain test case.
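
The reordering can be sketched as follows, with the single-node branch ahead of the empty-event fast path. `resultCase` and `buildOrderedCases` are hypothetical stand-ins for the real JUnit type and helper, and the actual evaluation of events is elided.

```go
package main

import "fmt"

// resultCase is a simplified stand-in for junitapi.JUnitTestCase.
type resultCase struct {
	Name        string
	SkipMessage string
}

// buildOrderedCases checks singleNode first, so operators without any events
// are reported as skipped on SNO instead of as passing.
func buildOrderedCases(singleNode bool, operators []string, eventsByOperator map[string][]string) []resultCase {
	var ret []resultCase
	for _, operator := range operators {
		name := fmt.Sprintf("clusteroperator/%s should not change condition/Available", operator)
		if singleNode {
			ret = append(ret, resultCase{Name: name, SkipMessage: "Test skipped on a single-node cluster"})
			continue
		}
		if len(eventsByOperator[operator]) == 0 {
			// Fast path: no events means a plain passing case.
			ret = append(ret, resultCase{Name: name})
			continue
		}
		// Evaluation of the recorded events is elided in this sketch.
		ret = append(ret, resultCase{Name: name})
	}
	return ret
}

func main() {
	events := map[string][]string{"storage": {"Available=False"}}
	for _, tc := range buildOrderedCases(true, []string{"storage", "etcd"}, events) {
		fmt.Printf("%s skipped=%t\n", tc.Name, tc.SkipMessage != "")
	}
}
```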
653-674: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip the second Progressing suite on SNO as well.

The first Progressing loop skips SingleReplicaTopologyMode, but the later should stay Progressing=False while MCO is Progressing=True loop still evaluates on SNO. That means SNO only skips half of the Progressing assertions, which conflicts with the PR goal.

Also applies to: 755-784
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`
around lines 653 - 674, The second "Progressing" test loop (the one asserting
"should stay Progressing=False while MCO is Progressing=True") is missing the
SingleReplicaTopologyMode skip, so SNO only skips the first Progressing suite;
update that second loop to mirror the first by adding the same topology check
(topology == configv1.SingleReplicaTopologyMode) and set mcTestCase.SkipMessage
to "Test skipped on a single-node cluster" when true; locate the loop iterating
platformidentification.KnownOperators.List() and the mcTestCase/SkipMessage
logic in that suite (also referenced around the other block at the 755-784
region) and apply the identical skip branch used earlier.
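
A small sketch of applying the same skip branch to both Progressing suites through one shared helper, so the two loops cannot drift apart. The helper name and the exact wording of the first suite's test name are assumptions; the second name follows the JUnit output quoted elsewhere in this thread.

```go
package main

import "fmt"

// progressingCase is a simplified stand-in for the real JUnit test case type.
type progressingCase struct {
	Name        string
	SkipMessage string
}

// singleReplica mirrors the value of configv1.SingleReplicaTopologyMode.
const singleReplica = "SingleReplica"

// skipOnSingleNode sets the skip message and reports whether the case was
// skipped. Both suites call it, so the skip logic lives in one place.
func skipOnSingleNode(tc *progressingCase, topology string) bool {
	if topology != singleReplica {
		return false
	}
	tc.SkipMessage = "Test skipped on a single-node cluster"
	return true
}

func progressingCases(topology string, operators []string) []progressingCase {
	suites := []string{
		"clusteroperator/%s must go Progressing=True", // assumed wording
		"clusteroperator/%s should stay Progressing=False while MCO is Progressing=True",
	}
	var ret []progressingCase
	for _, format := range suites {
		for _, op := range operators {
			tc := progressingCase{Name: fmt.Sprintf(format, op)}
			if skipOnSingleNode(&tc, topology) {
				ret = append(ret, tc)
				continue
			}
			// Evaluation of Progressing intervals is elided in this sketch.
			ret = append(ret, tc)
		}
	}
	return ret
}

func main() {
	for _, tc := range progressingCases(singleReplica, []string{"operator-lifecycle-manager-packageserver"}) {
		fmt.Printf("%s -> %q\n", tc.Name, tc.SkipMessage)
	}
}
```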

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 29fe20c3-e6d9-4fed-bbf8-26aae24dbefd

📥 Commits

Reviewing files that changed from the base of the PR and between 8464d63 and 185b11c.

📒 Files selected for processing (2)

pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go

openshift-merge-bot · 2026-05-20T19:43:35Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

hongkailiu · 2026-05-20T20:00:39Z

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5

openshift-ci · 2026-05-20T20:00:44Z

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/93d4e570-5486-11f1-8bc2-54578cc4942a-0

openshift-ci-robot · 2026-05-20T20:11:28Z

@hongkailiu: This pull request explicitly references no jira issue.

openshift-ci-robot · 2026-05-20T20:14:26Z

@tmshort: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

Details

In response to this:

/verified cancel
This should've been removed after the rebase

tmshort · 2026-05-20T20:14:44Z

/verified remove
This should've been removed after the rebase

openshift-ci-robot · 2026-05-20T20:14:55Z

@tmshort: The verified label has been removed.

wking

I like the topology centralization in EvaluateTestsFromConstructedIntervals; that's better than a separate getControlPlaneTopology in each test.* function.

I'm fine with us making the call that we don't have capacity to monitor ClusterOperator behavior for single-node clusters. I just think we want to make it clear that the motivation is lack-of-capacity, and not that we're saying that we think happy-case single-node updates having ClusterOperators go Degraded=True is a good thing. For example, the Degraded Godocs say:

Degraded indicates that the component (operator and all configured operands) does not match its desired state over a period of time resulting in a lower quality of service.

And in a world where we had infinite capacity, we would be able to teach all the cluster operators that a minute or whatever of downtime around a Node reboot or component update was expected single-node behavior, and not a quality-of-service miss.

Anyhow, no worries from me merging this, as long as the commit message and comments we're using to motivate the change are clearly saying "would be nice, but we don't have time" and aren't sounding like "even if we had infinite time, we wouldn't work on improving single-node ClusterOperator status".

This pull skips all CO tests on SNO. Single-node clusters may briefly go Available=False for many operators during updates or Node reboots. Several operators also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And we don't have capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite at the moment.

- `Available=False` and `Degraded=True` are not checked at all, regardless of whether the test case is executed in an upgrade test suite. Before, it was handled as an exception and thus the job would be just flaky instead of failing. Thus, the relevant exceptions are removed.
- All checks on the `Progressing` condition are skipped as well on a SNO cluster.

The logging logic was inherited if it fails to determine the control plane topology because I am not sure on which type of clusters an error will show up.

openshift-ci-robot · 2026-05-21T04:54:15Z

@hongkailiu: This pull request explicitly references no jira issue.

hongkailiu · 2026-05-21T05:05:31Z

 			if reason == "_ManagedDeploymentsAvailable" {
 				return "https://issues.redhat.com/browse/OCPBUGS-62633"
 			}
-		case "olm":


Note that the exceptions about SNO coming from #31172 have been removed.

/cc @tmshort

I have examples of co/operator-lifecycle-manager-packageserver failing on SNO. That is consistent with https://redhat.atlassian.net/browse/OCPBUGS-63672 being fixed everywhere except on SNO, which is tracked in https://redhat.atlassian.net/browse/OCPBUGS-67210.

Is there any example of co/olm failing on SNO? If so, that would be hard to reconcile with the high pass rate in Sippy.

I was trying to understand this in Slack but got distracted.

The following is for my own memory, not relevant to the above question.

The step from OCPBUGS-63672 to OCPBUGS-67210 (by this comment) is confusing because the condition being checked changed from Progressing to Available.
The reason is that I copied the output of the wrong test case in this comment: the linked job, such as this one in my previous comment, contains two flaky cases for clusteroperator/operator-lifecycle-manager-packageserver, one for Available and the other for Progressing.

OCPBUGS-67210 should be about Progressing as well.

openshift-merge-bot · 2026-05-21T05:20:54Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

hongkailiu · 2026-05-21T05:40:15Z

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5

openshift-ci · 2026-05-21T05:40:20Z

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c3bff00-54d7-11f1-9d6e-c7cb0ec27de4-0

hongkailiu · 2026-05-21T11:23:09Z

Take a job from https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregator-periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node/2057335663132413952

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-30775-ci-5.0-e2e-aws-upgrade-ovn-single-node/2057335662612320256/artifacts/e2e-aws-upgrade-ovn-single-node-4/single-node-e2e-test/artifacts/junit/e2e-monitor-tests__20260521-070135.xml | grep 'clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False' -A3
    <testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False while MCO is Progressing=True" time="0">
        <properties></properties>
        <skipped message="Test skipped on a single-node cluster"></skipped>
    </testcase>

So the case was skipped on the single-node cluster. The cases for the other conditions are skipped too.

Normal cluster: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2057330851787575296

The same case is not skipped there.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2057330851787575296/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20260521-063507.xml | grep 'clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False' -A3
    <testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False while MCO is Progressing=True" time="0">
        <properties></properties>
    </testcase>
    <testcase name="[Monitor:legacy-cvo-invariants][bz-service-ca] clusteroperator/service-ca should stay Progressing=False while MCO is Progressing=True" time="0">

/verified by @hongkailiu
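The curl and grep check above could also be done programmatically. The following is a rough sketch that assumes the JUnit file has a single root element with `testcase` elements directly under it; the type and function names (`testSuite`, `matchingCases`) and the shortened test case names in the sample are made up for illustration, and the sample XML only mimics the shape of the real artifact.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// skipped mirrors the <skipped message="..."> element.
type skipped struct {
	Message string `xml:"message,attr"`
}

// testCase mirrors a <testcase> element; Skipped is nil when the case ran.
type testCase struct {
	Name    string   `xml:"name,attr"`
	Skipped *skipped `xml:"skipped"`
}

// testSuite assumes <testcase> elements sit directly under the root element.
type testSuite struct {
	TestCases []testCase `xml:"testcase"`
}

// matchingCases returns the test cases whose name contains substr.
func matchingCases(junit []byte, substr string) ([]testCase, error) {
	var suite testSuite
	if err := xml.Unmarshal(junit, &suite); err != nil {
		return nil, err
	}
	var out []testCase
	for _, tc := range suite.TestCases {
		if strings.Contains(tc.Name, substr) {
			out = append(out, tc)
		}
	}
	return out, nil
}

// sample is an illustrative fragment, not a copy of the real artifact.
const sample = `<testsuite name="sample">
  <testcase name="clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False" time="0">
    <properties></properties>
    <skipped message="Test skipped on a single-node cluster"></skipped>
  </testcase>
  <testcase name="clusteroperator/service-ca should stay Progressing=False" time="0">
    <properties></properties>
  </testcase>
</testsuite>`

func main() {
	cases, err := matchingCases([]byte(sample), "should stay Progressing=False")
	if err != nil {
		panic(err)
	}
	for _, tc := range cases {
		if tc.Skipped != nil {
			fmt.Printf("SKIPPED %s (%s)\n", tc.Name, tc.Skipped.Message)
		} else {
			fmt.Printf("RAN     %s\n", tc.Name)
		}
	}
}
```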

openshift-ci-robot · 2026-05-21T11:23:21Z

@hongkailiu: This PR has been marked as verified by @hongkailiu.

hongkailiu · 2026-05-21T11:25:00Z

The failure of https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2/2057330851733049344 seems unrelated to this change.

/test e2e-aws-ovn-serial-2of2

hongkailiu · 2026-05-21T11:54:33Z

/test e2e-aws-ovn-serial-1of2

hongkailiu · 2026-05-21T16:45:35Z

/test e2e-aws-ovn-serial-2of2

openshift-ci · 2026-05-21T20:26:00Z

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from deads2k and sjenning February 12, 2026 03:30

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 12, 2026

hongkailiu force-pushed the skip-co-tests-on-sno branch from 2738994 to 96f3cd1 Compare February 12, 2026 19:32

hongkailiu mentioned this pull request Feb 12, 2026

CO tests becomes flaky on a cluster upgrade failure #30780

Open

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2026

hongkailiu force-pushed the skip-co-tests-on-sno branch from 96f3cd1 to 8464d63 Compare May 20, 2026 19:06

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2026

coderabbitai Bot requested changes May 20, 2026

Comment thread pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go Outdated

Comment thread pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go

Comment thread test/extended/machines/scale.go

hongkailiu force-pushed the skip-co-tests-on-sno branch from 47f7a5e to 185b11c Compare May 20, 2026 19:14

coderabbitai Bot reviewed May 20, 2026

coderabbitai Bot approved these changes May 20, 2026

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 20, 2026

hongkailiu changed the title ~~Skip CO condition tests on SNO~~ NO-JIRA: Skip CO condition tests on SNO May 20, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 20, 2026

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 20, 2026

wking reviewed May 20, 2026

Comment thread pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go Outdated

Comment thread pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go Outdated

hongkailiu added 3 commits May 21, 2026 00:51

Simplify exceptionCallback function

e9c77aa

Address review comments

36e2d79

hongkailiu force-pushed the skip-co-tests-on-sno branch from dff8711 to 36e2d79 Compare May 21, 2026 04:52

hongkailiu commented May 21, 2026

openshift-ci Bot requested a review from tmshort May 21, 2026 05:05

hongkailiu mentioned this pull request May 21, 2026

NO-JIRA: Remove fixed bugs on CO conditions (2) - 2nd try #31207

Open

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 21, 2026
