
Fix Azure deprovisioner deleting DNS records for v2 self-managed clusters #79984

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from bryan-cox:fix-deprovisioner-v2-rg-pattern
Jun 2, 2026

Conversation

@bryan-cox

@bryan-cox bryan-cox commented Jun 2, 2026

Member

Summary

  • The deprovisioner's is_ci_rg() function did not recognize v2 self-managed guest cluster resource groups (e.g., public-ea334332a2-public-ea334332a2-dx4l4), causing Phase 3 (DNS sweep) to delete their *.apps wildcard DNS records while clusters were still running
  • The deprovisioner logs confirm this: they show explicit deletion of the *.apps.{public,private,upgrade,autoscaling,oauth-lb}-ea334332a2 records during an in-flight v2 self-managed rehearsal
  • This PR adds the v2 guest cluster RG pattern to is_ci_rg() and fixes prefix extraction to use the 10-hex job hash, which protects all DNS records for active jobs
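
To illustrate why using the 10-hex job hash as the active prefix protects every record type listed above, here is a minimal sketch of a substring-based protection check. The helper name is_protected_record and the sample record names are hypothetical and are not taken from the real script. The only behavior assumed from this PR is that a DNS record containing an active job hash is kept.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: is_protected_record and the sample data below are
# illustrative assumptions, not code from the deprovision script.

# Returns 0 if the record name contains any of the given active prefixes.
is_protected_record() {
  local record="$1"
  shift
  local prefix
  for prefix in "$@"; do
    # Substring match: the job hash may appear anywhere in the record name.
    if [[ "$record" == *"$prefix"* ]]; then
      return 0
    fi
  done
  return 1
}

# One active job hash covers all cluster types created by that job.
active_prefixes=("ea334332a2")

for record in \
  "*.apps.public-ea334332a2" \
  "*.apps.oauth-lb-ea334332a2" \
  "*.apps.public-0000000000"; do
  if is_protected_record "$record" "${active_prefixes[@]}"; then
    echo "keep   ${record}"
  else
    echo "delete ${record}"
  fi
done
```

With an empty prefix list the helper protects nothing, which matches the failure described above: when the v2 RGs were not recognized, no prefix was recorded for them and their records looked orphaned.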

Test plan

🤖 Generated with Claude Code

Summary by CodeRabbit

This PR updates the Azure HyperShift deprovisioning script to fix a critical bug in which the deprovisioner deleted DNS records that were still in use by running v2 self-managed clusters.

Background

The deprovisioner is a periodic job that runs in the OpenShift CI infrastructure to clean up stale resources. It operates in three phases:

  1. Phase 1: Destroys stale HostedClusters (HyperShift-managed guest clusters older than a TTL threshold)
  2. Phase 2: Deletes orphaned Azure resource groups that belong to CI jobs
  3. Phase 3: Sweeps DNS zones to remove records for clusters no longer running

The Problem

v2 self-managed guest cluster resource groups use a different naming convention ({type}-{10hex}-{type}-{hex}-{suffix}, e.g., public-ea334332a2-public-ea334332a2-dx4l4) from legacy guest clusters and management clusters. The previous is_ci_rg() function did not recognize this pattern, so Phase 3 treated v2 cluster DNS records as orphaned and deleted them, including the critical wildcard *.apps records, while the clusters were still running. This broke DNS resolution for the hosted clusters.
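
As a rough illustration of the three naming conventions, the following sketch shows how an is_ci_rg()-style check could recognize them. The regular expressions are inferred from the patterns described in this PR and are assumptions, not the expressions used in the actual script. The management and legacy RG names in the loop are made-up examples.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the regexes are inferred from the naming conventions
# described in this PR, not copied from the real deprovision script.

is_ci_rg() {
  local rg="$1"
  # Management cluster RG: leading 10-hex job hash followed by -mgmt-
  local mgmt_re='^[0-9a-f]{10}-mgmt-'
  # Legacy guest RG: {20hex}-{hex}
  local legacy_re='^[0-9a-f]{20}-[0-9a-f]+$'
  # V2 self-managed guest RG: {type}-{10hex}-{type}-{hex}-{suffix}
  local v2_re='^[a-z][a-z-]*-[0-9a-f]{10}-[a-z][a-z-]*-[0-9a-f]+-[a-z0-9]+$'

  [[ $rg =~ $mgmt_re ]] && return 0
  [[ $rg =~ $legacy_re ]] && return 0
  [[ $rg =~ $v2_re ]] && return 0
  return 1
}

# Only the v2 name below comes from this PR; the others are invented samples.
for rg in \
  "public-ea334332a2-public-ea334332a2-dx4l4" \
  "ea334332a2-mgmt-rg" \
  "0123456789abcdef0123-4f2a" \
  "my-team-shared-rg"; do
  if is_ci_rg "$rg"; then
    echo "CI resource group:     ${rg}"
  else
    echo "not a CI resource group: ${rg}"
  fi
done
```

Keeping each pattern in its own variable avoids quoting pitfalls with `[[ =~ ]]` and makes it easy to add another naming scheme later.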

The Fix

  1. Updated is_ci_rg() function: Added regex pattern to recognize v2 self-managed guest cluster RGs alongside existing management and legacy guest patterns.

  2. Improved Phase 3 prefix extraction logic: Instead of deriving prefixes by generic string replacement, the script now uses pattern-specific extraction:

    • Management RGs: Extract the leading 10-hex before -mgmt-
    • Legacy guest RGs: Use the first 20 hex characters
    • V2 self-managed guest RGs: Extract the 10-hex job hash from the -([0-9a-f]{10})- segment using regex capture groups
  3. Enhanced logging: The script now prints the list of active cluster prefixes found during Phase 3, making it easier to debug and verify correct identification of active clusters.
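
The pattern-specific extraction and the prefix logging described above could look roughly like the sketch below. The function name extract_active_prefix, the regexes, and the sample RG names (other than the v2 example from this PR) are assumptions made for illustration. The real phase3_dns_sweep() may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: extract_active_prefix and its regexes are assumptions
# based on the description in this PR, not the script's actual code.

# Prints the active prefix for a recognized RG name; returns 1 otherwise.
extract_active_prefix() {
  local rg="$1"
  local mgmt_re='^([0-9a-f]{10})-mgmt-'
  local legacy_re='^([0-9a-f]{20})-[0-9a-f]+$'
  local v2_re='-([0-9a-f]{10})-'

  if [[ $rg =~ $mgmt_re ]]; then
    # Management RG: the leading 10-hex before -mgmt-
    printf '%s\n' "${BASH_REMATCH[1]}"
  elif [[ $rg =~ $legacy_re ]]; then
    # Legacy guest RG: the first 20 hex characters
    printf '%s\n' "${BASH_REMATCH[1]}"
  elif [[ $rg =~ $v2_re ]]; then
    # V2 self-managed guest RG: the 10-hex job hash between dashes
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Record a prefix only when one was successfully derived.
active_prefixes=()
for rg in \
  "ea334332a2-mgmt-rg" \
  "0123456789abcdef0123-4f2a" \
  "public-ea334332a2-public-ea334332a2-dx4l4" \
  "unrelated-rg"; do
  if prefix="$(extract_active_prefix "$rg")"; then
    active_prefixes+=("$prefix")
  fi
done

# Print the prefix list alongside the count for easier debugging.
echo "Active cluster prefixes (${#active_prefixes[@]}): ${active_prefixes[*]}"
```

This sketch does not deduplicate prefixes, so a job with both a management RG and a v2 guest RG contributes the same hash twice. That is harmless for a membership check but could be tidied up in a real implementation.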

Affected Infrastructure

This change directly impacts the Azure HyperShift test infrastructure deprovisioning, specifically the periodic cleanup job for v2 self-managed hosted clusters. It ensures DNS records for active jobs are protected during the cleanup process.

…ters

The deprovisioner's is_ci_rg() function did not recognize v2 self-managed
guest cluster resource groups (e.g., public-ea334332a2-public-ea334332a2-dx4l4).
These RGs use a {type}-{10hex}-{type}-{hex}-{suffix} naming convention instead
of the {20hex}-{hex} pattern used by legacy guest clusters.

Because these RGs were not recognized, Phase 3 (DNS sweep) treated their DNS
records as orphaned and deleted them — including *.apps wildcard A records —
while the clusters were still running. This caused all hosted clusters to fail
with console operator DNS resolution errors.

Changes:
- Add v2 guest cluster RG pattern to is_ci_rg()
- Extract the 10-hex job hash as the active prefix for both management and
  v2 guest RGs, so DNS records containing that hash are protected
- Log active prefixes for easier debugging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fa0facc8-5b73-4528-b22d-d8b966cc54b9

📥 Commits

Reviewing files that changed from the base of the PR and between 8aa5b23 and 41fec05.

📒 Files selected for processing (1)
  • ci-operator/step-registry/hypershift/azure/deprovision/hypershift-azure-deprovision-commands.sh

Walkthrough

This PR enhances the Azure deprovision script to recognize and process a new V2 self-managed guest resource group naming scheme. The RG detection function is updated, the Phase 3 DNS sweep prefix extraction is reworked to handle multiple RG patterns explicitly, and the output reporting is adjusted to display prefix lists.

Changes

Azure Deprovision Logic Update

Layer / File(s) | Summary
RG Detection and Phase 3 Prefix Extraction (ci-operator/step-registry/hypershift/azure/deprovision/hypershift-azure-deprovision-commands.sh) | is_ci_rg() recognizes V2 self-managed guest RG naming. phase3_dns_sweep() replaces string-replacement prefix derivation with pattern-based extraction for mgmt, legacy guest, and V2 self-managed guest RGs, recording prefixes only when successfully derived. Phase 3 output now prints the active prefix list inline alongside the count.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • openshift/release#79720: Updates Azure deprovisioning logic for V2 self-managed guests in is_ci_rg() and phase3_dns_sweep() to compute active prefix patterns correctly.

Suggested labels

lgtm, rehearsals-ack

Suggested reviewers

  • psalajova
🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The pull request title directly summarizes the main fix: preventing the Azure deprovisioner from deleting DNS records for v2 self-managed clusters by fixing the resource group recognition logic.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | PR contains only bash script and configuration changes, no Ginkgo tests present. Custom check for Ginkgo test stability is not applicable.
Test Structure And Quality | ✅ Passed | PR modifies bash shell scripts only (Azure deprovisioner); no Ginkgo test code present, so check is not applicable.
Microshift Test Compatibility | ✅ Passed | PR modifies only a Bash shell script for Azure deprovisioning; no Ginkgo e2e tests are added or modified.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. Changes are limited to Azure deprovisioner shell scripts and configuration files. The custom check is not applicable.
Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies only a bash script for Azure infrastructure deprovisioning, not deployment manifests, operators, or controllers. No Kubernetes scheduling constraints introduced.
Ote Binary Stdout Contract | ✅ Passed | This PR modifies a bash deprovisioning script and CI configuration, not OTE binary code. The check applies to Go test binaries with JSON stdout contracts, which are not present here.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added; PR only modifies a bash shell script for Azure resource group cleanup logic, making the check not applicable.
No-Weak-Crypto | ✅ Passed | No weak crypto, custom crypto implementations, or non-constant-time secret comparisons detected in the deprovision script changes.
Container-Privileges | ✅ Passed | PR modifies bash script and step YAML without any Kubernetes container specs, privileged containers, or related security settings.
No-Sensitive-Data-In-Logs | ✅ Passed | New logging statement outputs hex cluster prefixes from RG names, which are not sensitive data like passwords, tokens, API keys, or PII. Credentials are protected with set +x.


@openshift-ci openshift-ci Bot requested review from enxebre and sjenning June 2, 2026 13:57
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2026
@openshift-merge-bot

Contributor

[REHEARSALNOTIFIER]
@bryan-cox: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name | Repo | Type | Reason
periodic-ci-openshift-hypershift-main-azure-deprovision-azure-deprovision | N/A | periodic | Registry content changed

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@csrwng

csrwng commented Jun 2, 2026

Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2026
@openshift-ci

openshift-ci Bot commented Jun 2, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bryan-cox

Member Author

/pj-rehearse skip

@openshift-merge-bot

Contributor

@bryan-cox: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 2, 2026
@openshift-ci

openshift-ci Bot commented Jun 2, 2026

Contributor

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 558ee46 into openshift:main Jun 2, 2026
10 checks passed
TimurMP pushed a commit to TimurMP/release that referenced this pull request Jun 4, 2026
…ters (openshift#79984)


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. rehearsals-ack Signifies that rehearsal jobs have been acknowledged


2 participants