NO-JIRA: Skip CO condition tests on SNO#30775
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
Scheduling required tests: |
|
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-single-node-ipv6-conformance |
|
@hongkailiu: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-single-node |
|
@hongkailiu: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-single-node |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/96b95810-0824-11f1-9668-f7261cec9d7e-0 |
|
From this job: skipping as expected. The cases are still running (passing) on non-SNO cluster: The uncovered part is the scaling test but the code to determine SNO is quite similar. /verified by @hongkailiu |
|
@hongkailiu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
2738994 to
96f3cd1
Compare
|
/verified by @hongkailiu |
|
@hongkailiu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
|
A follow up should be done after this gets in. |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5 |
|
@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/36514650-547c-11f1-851d-1c15132aeed9-0 |
96f3cd1 to
8464d63
Compare
WalkthroughThis PR centralizes control-plane topology lookup in monitor tests, threads the resulting topology into operator state-transition helpers (removing clientConfig from exception callbacks), adds single-node-aware skip logic in those helpers, and updates a machine-scale test to skip operator-violation assertions on single-node clusters. ChangesTopology-aware operator state transition testing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hongkailiu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`:
- Around line 670-673: When topology == configv1.SingleReplicaTopologyMode,
short-circuit the test-case generation so both Progressing-condition suites are
marked skipped: detect the SingleReplicaTopologyMode early (using the existing
topology variable) and set mcTestCase.SkipMessage for the initial "must go
Progressing=True" cases and also create/mark the corresponding JUnit cases from
the second suite (the "should stay Progressing=False while MCO is
Progressing=True" cases) with the same SkipMessage; in practice, return or emit
skipped test entries for both suites instead of only the first branch so all
Progressing checks are consistently skipped on single-node clusters.
- Around line 490-499: The single-node (SNO) skip branch is executed after the
empty-event fast path, causing operator/condition pairs with no events to be
reported as passing instead of skipped; move the singleNode check so it runs
before the empty-event/early-return logic. Edit the function that builds test
results (the block that appends to ret with junitapi.JUnitTestCase using
testName and SkipMessage) to check singleNode first and return/append a skipped
test for SNOs before evaluating the empty-event case (the early-return that
emits pass results when there are no events). Ensure the singleNode branch still
creates the SkipMessage and appends to ret exactly as it does now, but placed
ahead of the empty-event fast path.
In `@test/extended/machines/scale.go`:
- Around line 288-294: GetControlPlaneTopologyFromConfigClient can return an
error and nil topo; after calling
GetControlPlaneTopologyFromConfigClient(configV1Client) check for err or nil
topo and bail out (or skip the SingleReplicaTopologyMode assertion) before
dereferencing *topo; specifically, if err != nil or topo == nil, log the error
and return/continue so the subsequent check that compares *topo to
configv1.SingleReplicaTopologyMode and the o.Expect(violations)... call are not
executed on a nil pointer.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3ac05f96-6400-4348-86f4-e2d09933b1a4
📒 Files selected for processing (3)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.gopkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.gotest/extended/machines/scale.go
47f7a5e to
185b11c
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go (1)
94-108:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t silently treat unknown topology as multi-node.
If
getControlPlaneTopologyfails here,topologystays empty, sosingleNodeand all downstreamisTwoNodechecks become false. That means one transient lookup failure disables every SNO skip and dual-replica exception in this evaluator, which can reintroduce the flakes this refactor is trying to avoid. Please fail closed here or thread an explicit “topology unknown” path into the helpers instead of falling back to HA behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go` around lines 94 - 108, getControlPlaneTopology errors are currently logged and topology left empty which causes singleNode to be false and incorrectly treat unknown topology as multi-node; change the behavior in the block around getControlPlaneTopology(w.adminRESTConfig) so that on err you return the error (or set and propagate an explicit unknown topology sentinel into the downstream calls) instead of continuing. Specifically, ensure the caller of getControlPlaneTopology either returns fmt.Errorf(...) when err != nil, or sets a distinct topology value and updates calls to testStableSystemOperatorStateTransitions, testUpgradeOperatorStateTransitions, and testUpgradeOperatorProgressingStateTransitions to accept and handle an "unknown" topology, so SNO-specific skips are preserved when topology cannot be determined.
♻️ Duplicate comments (2)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go (2)
478-499:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCheck
singleNodebefore the empty-event fast path.When
operatorEventsis empty, this still emits a passing test case on SNO instead of a skipped one. If the intent is to skip all Available/Degraded checks on single-node clusters, thesingleNodebranch needs to run first.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` around lines 478 - 499, The loop that builds test cases for each operator checks for empty operatorEvents before checking singleNode, causing empty-event tests to be reported as passing on single-node clusters; reorder the branches in the loop (the block using operatorEvents := eventsByOperator[operatorName] and the subsequent if statements) so the singleNode check runs before the empty-event fast path—i.e., for each operator (platformidentification.KnownOperators.List()) and conditionType, evaluate singleNode first and emit the SkipMessage JUnitTestCase when singleNode is true, then handle the len(operatorEvents) == 0 case to append a plain test case.
653-674:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSkip the second Progressing suite on SNO as well.
The first Progressing loop skips
SingleReplicaTopologyMode, but the latershould stay Progressing=False while MCO is Progressing=Trueloop still evaluates on SNO. That means SNO only skips half of the Progressing assertions, which conflicts with the PR goal.Also applies to: 755-784
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` around lines 653 - 674, The second "Progressing" test loop (the one asserting "should stay Progressing=False while MCO is Progressing=True") is missing the SingleReplicaTopologyMode skip, so SNO only skips the first Progressing suite; update that second loop to mirror the first by adding the same topology check (topology == configv1.SingleReplicaTopologyMode) and set mcTestCase.SkipMessage to "Test skipped on a single-node cluster" when true; locate the loop iterating platformidentification.KnownOperators.List() and the mcTestCase/SkipMessage logic in that suite (also referenced around the other block at the 755-784 region) and apply the identical skip branch used earlier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`:
- Around line 94-108: getControlPlaneTopology errors are currently logged and
topology left empty which causes singleNode to be false and incorrectly treat
unknown topology as multi-node; change the behavior in the block around
getControlPlaneTopology(w.adminRESTConfig) so that on err you return the error
(or set and propagate an explicit unknown topology sentinel into the downstream
calls) instead of continuing. Specifically, ensure the caller of
getControlPlaneTopology either returns fmt.Errorf(...) when err != nil, or sets
a distinct topology value and updates calls to
testStableSystemOperatorStateTransitions, testUpgradeOperatorStateTransitions,
and testUpgradeOperatorProgressingStateTransitions to accept and handle an
"unknown" topology, so SNO-specific skips are preserved when topology cannot be
determined.
---
Duplicate comments:
In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go`:
- Around line 478-499: The loop that builds test cases for each operator checks
for empty operatorEvents before checking singleNode, causing empty-event tests
to be reported as passing on single-node clusters; reorder the branches in the
loop (the block using operatorEvents := eventsByOperator[operatorName] and the
subsequent if statements) so the singleNode check runs before the empty-event
fast path—i.e., for each operator (platformidentification.KnownOperators.List())
and conditionType, evaluate singleNode first and emit the SkipMessage
JUnitTestCase when singleNode is true, then handle the len(operatorEvents) == 0
case to append a plain test case.
- Around line 653-674: The second "Progressing" test loop (the one asserting
"should stay Progressing=False while MCO is Progressing=True") is missing the
SingleReplicaTopologyMode skip, so SNO only skips the first Progressing suite;
update that second loop to mirror the first by adding the same topology check
(topology == configv1.SingleReplicaTopologyMode) and set mcTestCase.SkipMessage
to "Test skipped on a single-node cluster" when true; locate the loop iterating
platformidentification.KnownOperators.List() and the mcTestCase/SkipMessage
logic in that suite (also referenced around the other block at the 755-784
region) and apply the identical skip branch used earlier.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 29fe20c3-e6d9-4fed-bbf8-26aae24dbefd
📒 Files selected for processing (2)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.gopkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
|
Scheduling required tests: |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5 |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/93d4e570-5486-11f1-8bc2-54578cc4942a-0 |
|
@hongkailiu: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@tmshort: The DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified remove |
|
@tmshort: The DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
wking
left a comment
There was a problem hiding this comment.
I like the topology centralization in EvaluateTestsFromConstructedIntervals; that's better than a separate getControlPlaneTopology in each test.* function.
I'm fine with us making the call that we don't have capacity to monitor ClusterOperator behavior for single-node clusters. I just think we want to make it clear that the motivation is lack-of-capacity, and not that we're saying that we think happy-case single-node updates having ClusterOperators go Degraded=True is a good thing. For example, the Degraded Godocs say:
Degraded indicates that the component (operator and all configured operands) does not match its desired state over a period of time resulting in a lower quality of service.
And in a world where we had infinite capacity, we would be able to teach all the cluster operators that a minute or whatever of downtime around a Node reboot or component update was expected single-node behavior, and not a quality-of-service miss.
Anyhow, no worries from me merging this, as long as the commit message and comments we're using to motivate the change are clearly saying "would be nice, but we don't have time" and aren't sounding like "even if we had infinite time, we wouldn't work on improving single-node ClusterOperator status".
This pull skips all CO tests on SNO. SingleNode is may briefly go Available=False for many operators during updates or Node reboots. Several operators also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And we don't have capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite at the moment. - `Available=False` and `Degrade=True` are not checked at all no matter if the test case is executed in an upgrade test suite, or not. Before it was handled as an exception and thus the job would be just flaky instead of failing. Thus, the relevant exceptions are removed. - All checks on the `Progressing` condition are skipped as well on a SNO cluster. The logging logic was inherited if it fails to determine the control plane topology because I am not sure on which type of clusters an error will show up.
dff8711 to
36e2d79
Compare
|
@hongkailiu: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| if reason == "_ManagedDeploymentsAvailable" { | ||
| return "https://issues.redhat.com/browse/OCPBUGS-62633" | ||
| } | ||
| case "olm": |
There was a problem hiding this comment.
I have examples of co/operator-lifecycle-manager-packageserver failing on SNO. It is aligned with https://redhat.atlassian.net/browse/OCPBUGS-63672 is fixed except on SNO https://redhat.atlassian.net/browse/OCPBUGS-67210.
Is there any example for co/olm failing on SNO? If yes, that does not explain the high pass rate in Sippy.
I was trying to understand this in Slack but got distracted.
The following is for my own memory, not relevant to the above question.
OCPBUGS-63672 -> (by this comment) OCPBUGS-67210 is confusing because the condition to check went from Progressing to Available.
The reason is that I copied from the output from wrong test case in this comment because the linked job such as this one in my previous comment contains two flaky cases about clusteroperator/operator-lifecycle-manager-packageserver. One for Available and the other for Progressing.
OCPBUGS-67210 should be about Progressing as well.
|
Scheduling required tests: |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 5 |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c3bff00-54d7-11f1-9d6e-c7cb0ec27de4-0 |
|
Take a job from https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregator-periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node/2057335663132413952 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-30775-ci-5.0-e2e-aws-upgrade-ovn-single-node/2057335662612320256/artifacts/e2e-aws-upgrade-ovn-single-node-4/single-node-e2e-test/artifacts/junit/e2e-monitor-tests__20260521-070135.xml | grep 'clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False' -A3
<testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False while MCO is Progressing=True" time="0">
<properties></properties>
<skipped message="Test skipped on a single-node cluster"></skipped>
</testcase>So the case was skipped on a single node. The cases on other conditions are skipped too. The same case not skipped. $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2057330851787575296/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20260521-063507.xml | grep 'clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False' -A3
<testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should stay Progressing=False while MCO is Progressing=True" time="0">
<properties></properties>
</testcase>
<testcase name="[Monitor:legacy-cvo-invariants][bz-service-ca] clusteroperator/service-ca should stay Progressing=False while MCO is Progressing=True" time="0">/verified by @hongkailiu |
|
@hongkailiu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
The failure of https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30775/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2/2057330851733049344 seems not relevant to this change. /test e2e-aws-ovn-serial-2of2 |
|
/test e2e-aws-ovn-serial-1of2 |
|
/test e2e-aws-ovn-serial-2of2 |
|
@hongkailiu: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This pull skips all CO tests on SNO. SingleNode is may briefly go Available=False for many operators during updates or Node reboots. Several operators also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And we don't have capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite at the moment.
Available=FalseandDegrade=Trueare not checked at all no matter if the test case is executed in an upgrade test suite, or not. Before it was handled as an exception and thus the job would be just flaky instead of failing. Thus, the relevant exceptions are removed.All checks on the
Progressingcondition are skipped as well on a SNO cluster.The logging logic was inherited if it fails to determine the control plane topology because I am not sure on which type of clusters an error will show up.
Note that the exceptions about SNO coming from #31172 have been removed.
Summary by CodeRabbit