Draft
Commits
180 commits
b5fe281
phase 3: bounded-parallel CL2 fan-out across clusters
May 6, 2026
506d195
phase 3: add 5-cluster tier (azure-5.tfvars + n5 stage on dev/prod pi…
May 6, 2026
56942b1
aks-cli: wait for stable Succeeded before extra node pool create
May 7, 2026
5801228
aks-cli: run wait-for-succeeded with bash interpreter (dash rejects p…
May 7, 2026
1b02f57
fix per-type events rate: scope ip/v1 doesn't exist in kvstoremesh; a…
May 7, 2026
7ec0c43
probe: dump actual scope/action labels on kvstoremesh events metric
May 7, 2026
4714d26
aks-cli: retry nodepool add on OperationNotAllowed (race vs lazy AKS …
May 8, 2026
dbaf930
fix per-type events rate: range vector for increase, finer subquery s…
May 8, 2026
a92b84e
diag: add CurrentValue/SeriesCount per scope; add operations-count fa…
May 8, 2026
81ea7c3
per-scope events: report TotalCount (instant sum), drop broken increa…
May 8, 2026
3a9af93
phase 3: add 10-cluster tier (azure-10.tfvars + n10 stage on dev/prod…
May 8, 2026
380d34c
per-scope events: restore rate queries; add 90s pre-workload settle f…
May 8, 2026
90ef4e7
n10: lower terraform apply parallelism to 4 (AKS RP throttles at 10 c…
May 8, 2026
cac3392
dev pipeline: disable n2 + n5 stages temporarily (RG quota pressure)
May 9, 2026
4ca27f0
cleanup phase 3: drop dead per-scope rate queries; drop 90s settle (d…
May 9, 2026
55c8a40
phase 3: add 20-cluster tier (final scale-test point); disable n2/n5/…
May 9, 2026
5714f9c
n20: parallelism=8 + 480min timeout; validate retry budget 30min for …
May 9, 2026
2d717a7
20-node baseline (spec line 24): default pool 2->20 nodes, D4s_v5->D4…
May 9, 2026
e24962f
aks-cli: add pod_subnet_name to variable schema (latent bug — main.tf…
May 10, 2026
529aa91
aks-cli: pass --pod-subnet-id to nodepool add too (AKS requires all-o…
May 10, 2026
fd67123
pylint: clear R1732 (Popen disable), R1731 (max builtin), W0212 (rena…
May 10, 2026
1bd56a6
pre-merge cleanup: strip DEBUG-DUMP/SMOKE-FAILURE-DEBUG-DUMP blocks; …
May 10, 2026
5c45946
dev pipeline: flip skip_publish to false (need Kusto data for dashboa…
May 11, 2026
f44129b
collect: stash subdirs around process_cl2_reports; per-cluster errors…
May 11, 2026
ca6895b
validate: pre-gate on clustermesh-apiserver Deployment+LB readiness a…
May 11, 2026
e961e15
cl2 measurements: add per-pod apiserver CPU + per-peer mesh failure b…
May 11, 2026
d80105a
phase 4a: pod-churn-scale + pod-churn-kill CL2 configs, slope measure…
May 11, 2026
c144982
phase 4a: wire pod-churn matrix entries + churn knobs in execute.yml/…
May 11, 2026
a021e02
phase 4a: pod-churn-combined config + Method:Exec killer; n20 matrix …
May 12, 2026
8433840
phase 4a: enable n=2 stage with pod_churn_combined entry; disable n=2…
May 12, 2026
8c447ae
phase 4a: smoke-only — comment out non-combined n=2 matrix entries
May 12, 2026
3672613
phase 4a: pre-stage kubectl in cl2_config_dir for Method:Exec killer …
May 12, 2026
8fd94c3
phase 4a: annotate workload namespaces for ACNS CFP-39876 cross-clust…
May 12, 2026
71056be
phase 4a: flip dev pipeline to n=20 (event_throughput + pod_churn_com…
May 12, 2026
ec9946d
phase 4b: share-infra refactor in execute.yml/collect.yml; dev pipeli…
May 13, 2026
026d4fe
phase 4b: fix IFS-tab parsing bug in collect.yml (consecutive tabs co…
May 13, 2026
7e94f35
phase 4b: scenario #4 ClusterMesh APIServer Failure — killer + measur…
May 13, 2026
b68c256
phase 4b: flip dev pipeline to n=20 share-infra (3 scenarios, max_con…
May 13, 2026
9f962ab
phase 4b: share-infra exit-0 + SucceededWithIssues + apiserver-failur…
May 13, 2026
ab7eb0e
phase 4b: diagnostic dump on killer timeout (periodic samples + descr…
May 13, 2026
7784422
phase 4b: validate — retry-until-ready loop for node readiness (15min…
May 13, 2026
fd8f2f3
phase 4b: tee killer diag to stdout + iter-only n=2 share-infra to ap…
May 13, 2026
234fb87
phase 4b: fix apiserver-failure killer false-negative timeout — kubec…
May 13, 2026
ca0d4ec
phase 4b: scenario #7 (HA configuration validation) — replicas scaler…
May 13, 2026
b1838c4
phase 4b: scenario #5 (multi-cluster failure isolation) — target-only…
May 14, 2026
c15e16c
iter: narrow n2_shared to isolation-only for scenario #5 smoke
May 14, 2026
08c9800
phase 4b: per-scenario max_concurrent override — isolation forces con…
May 14, 2026
cb966c4
phase 4b: scenario #3 (node churn / IP churn) — host-side az nodepool…
May 14, 2026
21849b7
fix scenario #3 build 67114 failures: sentinel ctx via direct kubecon…
May 14, 2026
b993b45
fix scenario #3 build 67126: filter nodes by VMSS providerID instead …
May 14, 2026
d8aa039
fix scenario #3 build 67133: add explicit replace_refill op (az aks n…
May 14, 2026
d7e7a5d
scenario #3 build 67155 was green end-to-end; add new_node_count to o…
May 14, 2026
e35bc27
scenario #3 n=2 smoke: bump node_replace_batch_size 1→10 (50% pool re…
May 14, 2026
f004c2b
fix scenario #3 build 67170 (K=10): wait_vmss_succeeded before every …
May 14, 2026
a8df66a
phase 4b: scenario #6 (upper bound / saturation) — in-run QPS x resta…
May 14, 2026
adc11f6
iter: comment out n2_shared (node-churn-combined) for scenario #6 fir…
May 14, 2026
3702832
iter: swap n=2 tfvars D4s_v3/D8s_v3 → D4ds_v4/D8ds_v4 — DSv3 family h…
May 15, 2026
a8ee088
fix saturation classifier filename pattern (build 67211 root cause): …
May 15, 2026
c7b1b5a
debug: classifier rung-files-found count was 0 in build 67221 despite…
May 15, 2026
8c1f6df
fix saturation _read_metric content shape (build 67224 root cause): C…
May 15, 2026
c5c9b0f
bump saturation defaults — qps 20/40/80/160 → 100/500/1500/4000/10000…
May 15, 2026
484a3c2
phase A fixes for scenario #6 — bump Prom mem 4Gi→12Gi (build 67279 s…
May 15, 2026
bf5e7a4
iter: n=2 tfvars D4ds_v4/D8ds_v4 → D4s_v5/D8s_v5 (different family fo…
May 15, 2026
6072658
scenario #6 phase B: label-flip workload + ops_per_sec knob (rubber-d…
May 16, 2026
a77fed3
scenario #6 phase C: revert to CL2-native restart workload (Phase B k…
May 16, 2026
a1b3355
fix scenario #6 Prom OOM (build 67335): bump --prometheus-memory-requ…
May 16, 2026
5e53f77
fix scenario #6 Prom admission (build 67347): CL2's --prometheus-memo…
May 16, 2026
a733d99
n=20 saturation: swap tfvars D4s_v3/D8s_v3 → D4_v3/D8_v3 (Dv3 family …
May 16, 2026
d6a7f48
bump dev pipeline timeout 360→720min (n=20 apply/destroy can balloon)
May 16, 2026
6d44d72
dev pipeline: comment out n2_upper_bound, only n20 runs by default
May 16, 2026
6ee3f3a
dev pipeline: disable n=2 stage, enable n=20 stage with only n20_uppe…
May 16, 2026
bcaf46c
n=20 debug enhancements: periodic snapshot daemon (60s sampling of ap…
May 16, 2026
3c39e03
wait-for-apiserver: scale budget 30min→90min at N>=15 (build 67384 ev…
May 16, 2026
7310d85
wait-for-apiserver: add background fleet clustermeshprofile re-applie…
May 17, 2026
4339893
opt clustermesh-scale into PRESERVE_STATE_ON_APPLY_FAILURE: at N=20 a…
May 17, 2026
24e886e
scope preserve_state_on_apply_failure to a template parameter (was pi…
May 17, 2026
998d699
fix: opt n=20 stage into preserve_state_on_apply_failure (template-pa…
May 17, 2026
936ba57
n=20 tfvars: bump deletion_delay 4h→24h (build 67477: Azure resource …
May 17, 2026
4832d4b
n=2 all-scenarios run: enable n=2 stage with n2_shared (5-scenario ro…
May 18, 2026
83fe5bf
soft-fail upper-bound on junit failures + flip stages: disable n=2, r…
May 18, 2026
75e2718
extend soft-fail to ALL clustermesh-scale scenarios: build evidence a…
May 18, 2026
d1fee14
enable both n=2 + n=20 stages; operator chooses one in AzDO UI per run
May 18, 2026
db702df
fix scenario_failure_diag node-churn block crashing in solo-scenario …
May 18, 2026
692e0ac
validate-cilium: add Fleet-side + on-cluster peer-list debug dumps (b…
May 18, 2026
9e7ef71
validate-cilium: upfront cluster-id+cluster-name table at N>=10 to su…
May 18, 2026
0538dc0
n=5: enable stage with all 3 matrix entries + bump azure-5.tfvars to …
May 18, 2026
25a86e0
n=10: enable stage with 3 matrix entries (shared/node-churn/upper-bou…
May 18, 2026
f7b16bc
n=10: drop max_parallel 3→2 so it fits Dv3 quota alongside n=5+n=20 (…
May 18, 2026
cf510ab
n=20: add n20_shared + n20_node_churn_combined matrix entries to mirr…
May 18, 2026
1829ce1
n=20: comment out n20_upper_bound (already validated in 67579)
May 18, 2026
80899ae
validate-cilium: Fleet detector demote all-concat to informational; w…
May 19, 2026
fc98daa
n=5: keep only n5_upper_bound (others already validated in 67593)
May 19, 2026
672dcf1
validate-cilium: fail-fast on Fleet skip-bug (cluster-id=0 after wait…
May 19, 2026
df54d53
shared-vnet support: derive clustermesh VNet from AKS subnet (not rol…
May 19, 2026
c8a7895
n=2 shared-VNet smoke: azure-2-shared.tfvars (1 VNet 10.0.0.0/8, 0 pe…
May 19, 2026
0329e65
n=2 shared-VNet smoke: test_type=pod-churn-combined-shared-vnet to is…
May 19, 2026
0c0677e
tfvars: add Microsoft.ContainerService/managedClusters delegation to …
May 19, 2026
08d1e5e
aks-cli: bump aks_wait_succeeded 20min->30min and nodepool retry 15mi…
May 20, 2026
76228cf
N=100 shared-VNet pod-churn-combined: azure-100.tfvars (1 VNet 10.0.0…
May 20, 2026
343028d
azure-100.tfvars: add Microsoft.ContainerService/managedClusters dele…
May 20, 2026
cf9290e
fleet: wrap clustermeshprofile apply in 5-attempt retry for N=100 LRO…
May 20, 2026
1ba615e
N=100: enable stage by default (quota verified live: 4992 free Dv3 vs…
May 20, 2026
fad744d
fleet: clustermeshprofile create is idempotent under preserve_state_o…
May 20, 2026
4beaafb
validate-cilium: skip VNet peering inventory in shared-VNet mode; fai…
May 20, 2026
ed9c1bd
execute-parallel: add per-worker watchdog timeout (CL2_WORKER_TIMEOUT…
May 20, 2026
d7daf3c
N=100 matrix: worker_timeout_seconds=14400 (4h ceiling, ~8x normal CL…
May 20, 2026
719659b
aks-cli: retry az aks create on ReferencedResourceNotProvisioned (sha…
May 20, 2026
5eb6e9d
N=100: drop parallelism 8->4; expand retry to VirtualNetworkNotInSucc…
May 20, 2026
ba35105
aks-cli: delete-before-retry on transient Azure RP errors (build 6778…
May 20, 2026
e09cace
aks-cli: idempotency precheck before az aks create + case-insensitive…
May 20, 2026
2b76994
diag: agent_specs_diag stage to dump VM specs (memory/vCPU/SKU via Az…
May 21, 2026
8cf8c6d
%global variation Phase 1 (smoke): annotate-namespaces.sh 3rd arg + C…
May 21, 2026
da6f368
enable n2_global_smoke stage for first trigger
May 21, 2026
d8469c5
%global matrix Phase 2: azure-20/50-shared.tfvars + N=20/50/100 sweep…
May 21, 2026
f333465
Option E (multi-scenario): event-throughput + isolation %global plumb…
May 21, 2026
045689b
enable N=20 %global sweep + disable smoke (validated by builds 67954+…
May 21, 2026
4970967
fix(tfvars): add required network_config_list attrs (network_security…
May 21, 2026
aef41c5
enable N=50 sweep + kubectl-top diagnostic per cluster + disable N=20…
May 21, 2026
66f62c7
n20 g20 retry: dedicated stage + enhanced Fleet diag (clustermeshprof…
May 22, 2026
e299264
kubectl-top to file (readability) + prep N=50 trigger
May 22, 2026
ce05238
N=50: set TF parallelism=4 (matches proven N=100 setting; parallelism…
May 22, 2026
d8b8bba
N=50 retry: 3 failed cells (g20/g60/g100) at max_parallel=2; g0 data …
May 22, 2026
85af318
N=50 g100 retry: solo cell attempt 3 (g20/g60 landed in 68079)
May 23, 2026
9996177
enable N=100 sweep + add cilium-status diag for CL2 failures
May 23, 2026
11fa4b3
add g60 hot-spot replicate stages (n=20/n=50/n=100, condition:false)
May 27, 2026
2dedd3b
g60 rerun stages: remove condition:false so they're selectable in UI
May 27, 2026
13d5e64
Add anomaly reruns: n20 g020, n20 g100, n50 g100
May 27, 2026
bf99b8c
aks-cli: aks_nodepool_cli precheck existing state + retry on "already…
skosuri1 May 28, 2026
716bf18
aks-cli fail-fast bricked nodepool + stuck cluster; Phase 1 metrics; …
skosuri1 Jun 2, 2026
6914bb6
n2 smoke: condition false -> always() so manual stage select works
skosuri1 Jun 2, 2026
4c1c54f
propagation probe + global services + outbound retry cap + cmp auto-r…
skosuri1 Jun 2, 2026
6c96dab
propagation probe: distroless-safe cilium exec + MCR-approved images …
skosuri1 Jun 2, 2026
ab1da2c
probe backend: explicit nginx command + tcpSocket readiness (cbl-mari…
skosuri1 Jun 3, 2026
2130ed4
probe defaults sized for N=100 safety (PEER_TIMEOUT 60->120, WINDOW 3…
skosuri1 Jun 3, 2026
fddd10b
next batch: L1 policy canary + mesh-recovery probe + canadacentral pr…
skosuri1 Jun 3, 2026
9b400a1
policy metric increase() over %v window + cc preflight PIP prefer-Sta…
skosuri1 Jun 3, 2026
b67a1b4
policy canary: L4 toPorts rule (force policy regen, was optimized awa…
skosuri1 Jun 3, 2026
da7511c
policy regen metric: keep query, document AKS-managed Cilium does not…
skosuri1 Jun 3, 2026
eb2711f
endpoint regen metric: increase(%v) + Mean/TotalSamples + comment cla…
skosuri1 Jun 3, 2026
020ffd0
cc migration: n=2 shared-vnet smoke (DSv4 SKU swap + canadacentral st…
skosuri1 Jun 3, 2026
606fdec
cc migration N=20+N=100: tfvars DSv4 swap + pipeline cells (cross-reg…
skosuri1 Jun 3, 2026
6b4c588
parallel next-batch: pod-density 500+800 n2 stages (euap, orthogonal …
Jun 3, 2026
aa90d3e
cluster-loss-recovery probe: mesh-detach-rejoin orchestrator + n=3 sm…
skosuri1 Jun 3, 2026
1838e67
detach probe: bump prewait 120->300s + pre-state deadline 60s->300s (…
skosuri1 Jun 4, 2026
de451ac
detach probe: cilium-dbg status (not clustermesh status) via ds/ciliu…
skosuri1 Jun 4, 2026
215c00e
final batch: long-soak 6h canary + repeatability-variance N=20 g100 x…
skosuri1 Jun 4, 2026
f1a77fd
cc N=100 fallback: azure_eastus2_n100_pod_churn (eastus2 has 143 clus…
skosuri1 Jun 4, 2026
8c226b2
centraluseuap N=100 stage (highest-capacity region; 187 AKS free + 74…
skosuri1 Jun 4, 2026
43cd066
revert centraluseuap N=100 (DSv4 only 7424 free = ~1.5x N=100 need, n…
Jun 4, 2026
493240e
pivot N=100 -> cc N=92 (eastus2 Dv3 SKU-policy-blocked; cc has 92 clu…
skosuri1 Jun 4, 2026
87f0041
drop N=100-in-alternate-region attempts; rely on euap N=100 baseline …
skosuri1 Jun 4, 2026
351e4f5
soak canary: worker_timeout 7h to 8h plus stage timeout 10h to 11h (b…
skosuri1 Jun 4, 2026
35ced14
metrics Phase 3 + NetworkPolicy at scale scenario (10 new Hubble/CRI/…
skosuri1 Jun 4, 2026
32367f8
cross-cluster CNP propagation cost probe: host-side parallel-apply or…
skosuri1 Jun 4, 2026
dcb5afb
propagation probe: add REMOVE + FIRST_PACKET extensions (stale-state …
skosuri1 Jun 4, 2026
90124bd
FIRST_PACKET probe fix: switch probe pod to nginx (HTTP server) when …
skosuri1 Jun 5, 2026
ff23224
soak canary: worker_timeout 8h to 9h plus stage timeout 11h to 12h (b…
skosuri1 Jun 5, 2026
780a944
replace dead Hubble queries with cilium_forward/drop datapath flow me…
skosuri1 Jun 6, 2026
def3fd8
mesh-behavior gap probes: identity GC in REMOVE + single-cluster fail…
Jun 9, 2026
be8994a
validation gate fixes: strip trailing whitespace in pipeline yaml + u…
Jun 12, 2026
05e32b3
fix pre-existing pylint regressions (too-many-lines disable on scale.…
Jun 12, 2026
e347feb
prometheus TSDB snapshot to blob (opt-in via cl2_prom_snapshot_enable…
Jun 12, 2026
34f41e3
switch prom snapshot delivery from Telescope blob to AzDO pipeline ar…
Jun 12, 2026
fa197c0
prom snapshot blob path: upload to cmshscaleprom in sub 37deca37 (our…
Jun 12, 2026
20bd804
fix prom snapshot blob upload: replace login.yml template + bash comb…
Jun 12, 2026
3ddd7a3
fill remaining metric + probe gaps: Phase 4 PromQL queries (Hubble fl…
Jun 16, 2026
491f6a4
fix gap #3 service-backend probe: create transient global Service on …
Jun 17, 2026
4cee8fd
refine Hubble flow query: slice forward/drop by standard verdict labe…
Jun 17, 2026
1433f34
enable prom snapshot (blob) on n2 global smoke: richest probe stage (…
Jun 18, 2026
a3f1116
enable prom snapshot on soak canary: pod-churn-combined gathers ciliu…
Jun 24, 2026
0ec395d
Add clustermesh-scale mock mode (KWOK + mock-cilium-agent topology)
Jun 25, 2026
9e28ea4
Add azure_mock_n2 stage for clustermesh-scale mock mode
Jun 25, 2026
c665bd6
Add n=2 MOCK smoke stage to new-pipeline-test (KWOK + mock-cilium-agent)
Jun 25, 2026
b40feab
Fix mock-mode CI checks (configure test arg, tfvars test-inputs, yaml…
Jun 25, 2026
b8d9f3f
mock-mode: disable CL2 kubelet scraping (kwok nodes have no kubelet)
Jun 25, 2026
962824d
test: use dict literal to satisfy pylint (use-dict-literal)
Jun 25, 2026
afd9d22
mock-mode: run pod-churn workload on kwok nodes, apply mock PodMonito…
Jun 25, 2026
8367099
Add n=20 mock provisioning spike stage (scales n=2 mock to 20 clusters)
Jun 25, 2026
2a4ba53
Add n=100 mock shared-VNet stage with parallel deploy + CL2 prometheu…
Jun 29, 2026
23627ab
Enable prometheus TSDB snapshots (blob) for all mock stages
Jun 29, 2026
239 changes: 239 additions & 0 deletions docs/clustermesh-scale-failure-modes.md
@@ -0,0 +1,239 @@
# ClusterMesh-on-AKS+Fleet failure-mode catalog

This catalog documents every observed failure mode in the
`clustermesh-scale` test pipeline, with machine-readable signatures so
retry logic and dashboards can detect, classify, and (where safe)
auto-recover.

**How to use:**
- Look up by **symptom signature** (log regex / metric pattern) to identify
a failure from a new run.
- Read **root cause** to understand whether it's an Azure RP issue, a
Cilium issue, a test harness issue, or a fundamental scale finding.
- Apply **mitigation** when running new builds (existing retry/fail-fast
rules in `aks-cli/main.tf` consume some of these signatures).
- Use **linked builds** as historical evidence for the failure.

**Coverage matrix:** see the "Covered / NOT-covered matrix" section
at the bottom for the explicit scope statement.

---

## Machine-readable signatures (consumed by retry logic)

| `id` | `error_regex` | `retryable` | `max_retry_budget_s` | `fail_fast_action` | `linked_builds` |
|---|---|---|---|---|---|
| `outbound_conn_fail_on_create` | `VMExtensionError_OutboundConnFail\|VMExtensionProvisioningError.*OutboundConnFail` | true (1 retry only) | 600 | abort + dump VMSS extension logs | 68700 |
| `prompool_already_exists` | `already exists` (in `az aks nodepool add` output) | true | 1800 | precheck state + recreate if Failed | 68577, 68700 |
| `subnet_referenced_resource_not_provisioned` | `ReferencedResourceNotProvisioned` | true | 1800 | retry after VNet PUT queue drains | 67775, 67788, 68700 |
| `aks_create_already_exists` | `already exists` (in `az aks create` output) | true | 600 | precheck state + delete if half-created | 67798 |
| `cluster_stuck_updating` | provisioningState=`Updating` for 30+ poll iterations w/ no state change | false (BRICKED) | n/a | abort immediately; cluster needs manual triage | 68577 (mesh-44) |
| `nodepool_stuck_failed_delete` | nodepool provisioningState=`Failed` AND `delete` call did not move state out of `Failed` within 120s | false (BRICKED) | n/a | abort immediately; nodepool needs manual delete | 69021 (mesh-50) |
| `fleet_cluster_id_zero_skip` | `cilium-config` ConfigMap `cluster-id=0` on a Fleet member | true | 1800 | delete + recreate `clustermeshprofile` (re-randomizes IDs) | 68035 |
| `acns_stuck_applying_non_euap` | `az fleet clustermeshprofile apply` hangs in `Applying` for >5min | false | n/a | abort; region does not have ACNS rolled out | (all westus2/canadacentral builds pre-2026-05-24) |
| `vmextension_error_k_*` | `VMExtensionError_K[A-Za-z]+` (kubelet/CRI failures) | false | n/a | abort + dump CSE logs; non-retryable | 68700 |
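
The log-regex rows of this table can be applied mechanically. The following is a minimal sketch, not the logic in `aks-cli/main.tf`: the function names `classify_error` and `is_retryable` and the `nodepool_add` / `aks_create` context labels are hypothetical, and the regexes are copied from the table with the Markdown pipe escape removed. Rows detected from state polling rather than log text (`cluster_stuck_updating`, `nodepool_stuck_failed_delete`, `fleet_cluster_id_zero_skip`, `acns_stuck_applying_non_euap`) are out of scope for the classifier.

```bash
# classify_error LINE [CONTEXT]
# Prints the signature id for one log line, or "unknown".
# CONTEXT (hypothetical label) says which command produced the line, because
# "already exists" maps to a different id for nodepool add vs cluster create.
classify_error() {
    line=$1
    context=${2:-}
    if printf '%s\n' "$line" | grep -Eq 'VMExtensionError_OutboundConnFail|VMExtensionProvisioningError.*OutboundConnFail'; then
        echo 'outbound_conn_fail_on_create'
    elif printf '%s\n' "$line" | grep -Eq 'VMExtensionError_K[A-Za-z]+'; then
        echo 'vmextension_error_k_*'
    elif printf '%s\n' "$line" | grep -q 'ReferencedResourceNotProvisioned'; then
        echo 'subnet_referenced_resource_not_provisioned'
    elif printf '%s\n' "$line" | grep -q 'already exists'; then
        case "$context" in
            nodepool_add) echo 'prompool_already_exists' ;;
            aks_create)   echo 'aks_create_already_exists' ;;
            *)            echo 'unknown' ;;
        esac
    else
        echo 'unknown'
    fi
}

# is_retryable ID
# Returns 0 for ids whose `retryable` column is true, 1 otherwise.
is_retryable() {
    case "$1" in
        outbound_conn_fail_on_create|prompool_already_exists|subnet_referenced_resource_not_provisioned|aks_create_already_exists|fleet_cluster_id_zero_skip)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

A caller would typically run `id=$(classify_error "$stderr_line" aks_create)` and then branch on `is_retryable "$id"`. The per-id retry caps in the table, such as the single retry for `outbound_conn_fail_on_create`, still have to be enforced separately.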

---

## Detailed entries

### `outbound_conn_fail_on_create`

**Symptom signature**
- Log regex: `VMExtensionError_OutboundConnFail` OR `VMExtensionProvisioningError.*OutboundConnFail`
- Metric pattern: AKS provisioningState transitions to `Failed` shortly after `az aks create` returns; agent log shows CSE script exit 50
- Wall-clock signature: failure within 5-10 min of `az aks create`

**Root cause**
- AKS VMSS provisioning runs a Custom Script Extension (CSE) at first boot to install kubelet/runtime packages
- Packages are downloaded from Microsoft package repos via outbound connectivity
- At N=100 shared-VNet, concurrent subnet PUT operations on the shared VNet keep some subnets in `Updating` state when their VMSS comes online
- Outbound NAT path uses a route that depends on the subnet being `Succeeded` → CSE script can't reach upstream → exit 50

**Mitigation (in code)**
- `aks-cli/main.tf` `aks_cli` retry block: when this error fires on retry iteration ≤2 AND the cluster was freshly recreated by the delete+recreate logic, allow ONE more retry with explicit partial-cluster cleanup. Past iteration 2, fail fast.
- Not added to the general retryable regex — would mask real outbound config bugs at smaller N
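
The gating rule above reduces to a two-input decision. This is a sketch under assumptions: the function name `outbound_retry_allowed` and the `true`/`false` flag for "freshly recreated" are hypothetical, and the real block in `aks-cli/main.tf` may track this differently.

```bash
# outbound_retry_allowed ITERATION FRESH_RECREATE
# Returns 0 (allow one more retry) only when the failure happened on retry
# iteration 1 or 2 AND the cluster had just been deleted and recreated.
# FRESH_RECREATE is a hypothetical "true"/"false" flag set by the caller.
outbound_retry_allowed() {
    iteration=$1
    fresh_recreate=$2
    [ "$iteration" -le 2 ] && [ "$fresh_recreate" = "true" ]
}
```

Keeping this separate from the general retryable regex matches the intent stated above: an outbound failure on a cluster that was not just recreated still fails fast.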

**Manual recovery**
- Rerun the entire stage; new VNet provisioning order may avoid the race
- If it recurs at N=100, consider lowering parallelism or splitting into multiple shared VNets

**Linked builds**
- 68700: 32 occurrences across the run; mesh-23 specifically died at attempt 3 of cluster recreate

---

### `prompool_already_exists`

**Symptom signature**
- Log regex: `The (agent pool|nodepool) .* already exists` (in `az aks nodepool add` stderr/stdout)
- Wall-clock signature: appears at apply retry boundary (i.e., terraform task attempt > 1)

**Root cause**
- Under `preserve_state_on_apply_failure=true` + AzDO `retryCountOnTaskFailure`, terraform may re-run the `local-exec` provisioner after a prior apply attempt already created the nodepool
- Without state precheck, `az aks nodepool add` returns "already exists" → script exits 1 → cycle repeats

**Mitigation (in code)**
- `aks-cli/main.tf` `aks_nodepool_cli` block (commit `bf99b8c`): state-aware precheck — Succeeded → exit 0; Creating/Updating/Deleting → wait; Failed → delete+recreate; absent → add. Plus "already exists" added to retryable regex.
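
The state-aware precheck is essentially a lookup from provisioning state to action. The sketch below illustrates that mapping only. The function name and the action labels are hypothetical, and an empty state is assumed to mean the nodepool is absent (for example, because an `az aks nodepool show ... --query provisioningState -o tsv` call printed nothing).

```bash
# nodepool_precheck_action STATE
# Maps a nodepool provisioningState to the action described above.
# An empty STATE is assumed to mean "nodepool does not exist".
nodepool_precheck_action() {
    case "$1" in
        Succeeded)                  echo 'skip' ;;             # already created: exit 0
        Creating|Updating|Deleting) echo 'wait' ;;             # operation in flight
        Failed)                     echo 'delete_recreate' ;;  # clean up, then add again
        '')                         echo 'add' ;;              # absent: plain add
        *)                          echo 'unknown' ;;          # unexpected state: let caller decide
    esac
}
```

Returning a label instead of acting directly keeps the decision testable without calling Azure.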

**Linked builds**
- 68577 attempts 2 + 5 (deterministic bug)
- 68700 (absorbed cleanly by the fix — 707 already-exists hits, no failures)

---

### `subnet_referenced_resource_not_provisioned`

**Symptom signature**
- Log regex: `ReferencedResourceNotProvisioned`
- Often accompanied by: `Cannot proceed with operation because resource .* is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation`

**Root cause**
- Azure VNet serializes all subnet PUT operations per-VNet (only one PutSubnetOperation in flight at a time)
- At N=100 shared-VNet with 200 subnets, concurrent AKS creates fan out subnet attach requests faster than Azure can serialize them
- AKS sees a peer cluster's subnet PUT mid-flight, rejects with this error

**Mitigation (in code)**
- `aks-cli/main.tf` `aks_cli` block: included in retryable regex. 15 retries × 60s = 15min budget; drains the queue.
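
A fixed-interval retry loop with the shape described above (up to 15 attempts spaced 60s apart) might look like the following. It is a simplified sketch: `retry_with_budget` is a hypothetical name, and unlike the real block it retries on any non-zero exit instead of matching stderr against the retryable regex.

```bash
# retry_with_budget MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is exhausted, sleeping
# SLEEP_SECONDS between attempts. Returns 0 on success, 1 on exhaustion.
retry_with_budget() {
    max_attempts=$1
    sleep_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # No point sleeping after the final failed attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$sleep_seconds"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With the numbers from the bullet above, the call would be `retry_with_budget 15 60 <create command>`.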

**Linked builds**
- 67775, 67788: first observed at N=100
- 68700: 100+ retries absorbed cleanly

---

### `cluster_stuck_updating`

**Symptom signature**
- Metric pattern: AKS provisioningState=`Updating` for ≥30 consecutive 20s polls (10min) with no state change
- Log: `aks_wait_succeeded` emits same `provisioningState=Updating` line repeatedly with no transition

**Root cause**
- AKS Resource Provider regional queue stalls a cluster's reconciliation
- Without ground truth from the RP team, no external indicator distinguishes a stuck cluster from a slowly progressing one
- Build 68577 mesh-44 spent 30+ min stuck before being killed by AzDO retry; cluster was never recoverable

**Mitigation (in code)**
- `aks-cli/main.tf` `aks_wait_succeeded` (commit `716bf18`): track same-state count; if 30 consecutive polls observe the same state with no change, fail-fast immediately. Saves ~20min per occurrence.
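
The same-state counter can be sketched as a filter over a stream of polled states. This is an illustration, not the code from commit `716bf18`: `detect_stuck` is a hypothetical name, and it reads one `provisioningState` per line from stdin so it can be exercised without Azure.

```bash
# detect_stuck THRESHOLD
# Reads one provisioningState per line (one line per poll) from stdin.
# Returns 0 as soon as "Succeeded" is seen or input ends.
# Prints "stuck:<state>" and returns 1 once the same state has been
# observed THRESHOLD times in a row.
detect_stuck() {
    threshold=$1
    last_state=''
    same_count=0
    while IFS= read -r state; do
        if [ "$state" = "Succeeded" ]; then
            return 0
        fi
        if [ "$state" = "$last_state" ]; then
            same_count=$((same_count + 1))
        else
            last_state=$state
            same_count=1
        fi
        if [ "$same_count" -ge "$threshold" ]; then
            echo "stuck:$state"
            return 1
        fi
    done
    return 0
}
```

With 20s polls, a threshold of 30 corresponds to the 10-minute window in the symptom signature.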

**Linked builds**
- 68577 mesh-44 (4× internal retries × 30min each = 2+ hours wasted)

---

### `nodepool_stuck_failed_delete`

**Symptom signature**
- Metric pattern: nodepool provisioningState=`Failed` AND `az aks nodepool delete` API call returned but state remained `Failed` 120+ seconds later
- Log: `az aks nodepool delete reported error (will poll absence anyway)` followed by indefinite `still present (state=Failed)` polling

**Root cause**
- Azure RP rejected the delete (no transition to `Deleting`); the nodepool is bricked
- No amount of additional retries will release it without manual intervention

**Mitigation (in code)**
- `aks-cli/main.tf` `aks_nodepool_cli` block (commit `716bf18`): after issuing delete, if state still `Failed` 120s later (no Failed→Deleting transition), abort with clear `BRICKED` message. Saves ~88 of 90 minutes wasted on bricked nodepools.
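
The bricked check differs from the stuck-cluster check in that it looks for the absence of one specific transition (Failed→Deleting) within a deadline. A minimal sketch, assuming the caller feeds it `ELAPSED_SECONDS STATE` samples taken after the delete call (the function name and input format are hypothetical):

```bash
# bricked_after_delete DEADLINE_SECONDS
# Reads "ELAPSED_SECONDS STATE" samples from stdin, taken after the delete
# call was issued. Returns 0 and prints a BRICKED message if the state is
# still Failed at or past the deadline; returns 1 if the state left Failed
# first or the samples ran out before the deadline.
bricked_after_delete() {
    deadline=$1
    while read -r elapsed state; do
        # Any state other than Failed means the delete made progress.
        if [ "$state" != "Failed" ]; then
            return 1
        fi
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "BRICKED: state still Failed ${elapsed}s after delete"
            return 0
        fi
    done
    return 1
}
```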

**Linked builds**
- 69021 mesh-50 (13.6 HOURS burned on this exact pattern; the trigger for the fast-fail fix)

---

### `fleet_cluster_id_zero_skip`

**Symptom signature**
- After `az fleet clustermeshprofile apply` reports success, query
`cilium-config` ConfigMap on a member → `cluster-id` value is `0`
- Cilium agent logs: errors about "invalid cluster ID 0"
- Cross-cluster traffic fails on the affected cluster

**Root cause**
- Fleet hash-allocation algorithm can collide on cluster IDs across mesh members
- When a collision is detected, one cluster gets ID 0 (skip-allocated) instead of a unique non-zero ID
- Mesh peering effectively skips this cluster

**Mitigation (in code)**
- `validate-resources.yml` detects ID=0 case → currently fails the stage
- Future: `cmp-auto-recovery` todo — delete + recreate `clustermeshprofile` (re-randomizes ID assignment, ~99% chance of resolving in one retry). Cost: ~15-30min vs ~3h for full pipeline rerun.
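
Detection of the ID=0 case can be sketched as a scan over `member cluster-id` pairs. This is not the check in `validate-resources.yml`: the function name and input format are hypothetical, and the comment assumes the `cilium-config` ConfigMap lives in `kube-system`.

```bash
# find_zero_id_members
# Reads "MEMBER CLUSTER_ID" pairs from stdin and prints every member whose
# cluster-id is 0 (skip-allocated). Prints nothing if all ids are non-zero.
#
# One way to collect a pair per member (namespace is an assumption):
#   id=$(kubectl --context "$ctx" -n kube-system get configmap cilium-config \
#        -o jsonpath='{.data.cluster-id}')
#   echo "$ctx $id"
find_zero_id_members() {
    while read -r member cluster_id; do
        if [ "$cluster_id" = "0" ]; then
            echo "$member"
        fi
    done
}
```

A non-empty result is the trigger for the delete + recreate recovery described above.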

**Linked builds**
- 68035

---

### `acns_stuck_applying_non_euap`

**Symptom signature**
- `az fleet clustermeshprofile apply` returns success but state stays `Applying` indefinitely (>5min)
- No ACNS reconciler logs visible
- Region != `eastus2euap`

**Root cause**
- AKS-managed ClusterMesh / ACNS rollout was region-gated to eastus2euap pre-2026-05-24
- canadacentral verified working as of 2026-05-24
- Other regions (westus2, etc.) still gated as of that date

**Mitigation**
- Manual: only use regions verified to have ACNS rollout complete
- Code: no automated mitigation; fail-fast is correct behavior

**Linked builds**
- All westus2 builds pre-2026-05-24 (checkpoint 002 evidence)

---

### `vmextension_error_k_*`

**Symptom signature**
- Log regex: `VMExtensionError_K[A-Za-z]+` (e.g. `VMExtensionError_KubeletStart`)
- AKS provisioningState=`Failed` after CSE script reports kubelet/CRI startup failure

**Root cause**
- Kubelet or container runtime failed to start on the node
- Usually downstream of an earlier failure (disk full, OOM, image pull failure)
- Build 68700 saw 12 of these; root cause was the same shared-VNet outbound flux as `outbound_conn_fail_on_create`

**Mitigation**
- No automated retry — these usually indicate a real underlying problem
- Manual: check CSE logs (`/var/log/azure/cluster-provision-cse-output.log` on node) for the upstream cause

**Linked builds**
- 68700

---

## Covered / NOT-covered matrix (release scope statement)

### ✅ TESTED in current pipeline
- N=2/5/10/20/50/100 cluster meshes
- 4 `%global` cells: 0% / 20% / 60% / 100% of namespaces marked global
- 7 base scenarios: event-throughput, pod-churn-combined, isolation, node-churn-scale/replace/combined, upper-bound
- AKS-managed Cilium (current AKS version) + Fleet `clustermeshprofile`
- Single region: eastus2euap (canadacentral verified Fleet-capable but not yet sweeping)
- Shared-VNet topology (single VNet, 100 clusters share via subnet partition)
- pause-pod workloads (no real HTTP traffic in pre-2026-06-02 scenarios; propagation-probe.yaml adds real http-echo)
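
The `%global` cells translate into the `GLOBAL_NAMESPACE_COUNT` argument of the namespace annotation script in this PR. As a rough sketch of that arithmetic (the helper name `global_namespace_count` is hypothetical; integer division rounds down):

```bash
# global_namespace_count TOTAL_NAMESPACES PERCENT_GLOBAL
# Prints how many of the first TOTAL_NAMESPACES namespaces to annotate as
# global for a given %global cell. Integer division truncates toward zero.
global_namespace_count() {
    total=$1
    percent=$2
    echo $(( total * percent / 100 ))
}
```

With 5 namespaces, the 20% and 60% cells give 1 and 3, which matches the examples in the script's comments.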

### ⚠️ PARTIALLY TESTED
- Global services (`service.cilium.io/global=true`): the Service objects ARE created in our scenarios but no client cross-cluster traffic exists. propagation-probe.yaml adds real cross-cluster curl.
- Synthetic propagation latency: `kvstore_op_duration` was used as a proxy before 2026-06-02; direct measurement was added in propagation-probe.yaml.

### ❌ NOT TESTED (explicit gaps)
- **NetworkPolicy / CiliumNetworkPolicy at scale** — zero policies in any current scenario. See `policy-scale-matrix` todo.
- **L7 policies** (HTTP/Kafka/gRPC)
- **IPsec / WireGuard transparent encryption** between mesh peers
- **Mixed-version Cilium across mesh members** (version skew tolerance)
- **Cilium upgrade mid-mesh** (under load)
- **MCS-API (ServiceExport / ServiceImport)** as alternative to global services
- **Private clusters** (no public API endpoint)
- **Multi-region mesh** (cross-region latency, cross-region cost)
- **Mixed cluster sizes in same mesh** (small + large clusters together, fairness/QoS)
- **Pod density > 200 pods/cluster** — see `pod-density-scaling` todo
- **24h+ soak runs** — all current tests run for a few hours at most. See `long-soak-test` todo.
- **Cluster loss / disaster recovery** — Fleet member permanently removed, mesh GC behavior. See `cluster-loss-recovery` todo.
- **CIDR overlap between clusters** (Cilium cluster_id disambiguation)
- **Bursty workload patterns** (10× spike then drop, vs sustained)
- **Hubble flow telemetry** (per-flow visibility into actual cross-cluster traffic)

This list is intentionally explicit so PMs/customers/operators know the
boundary of "tested at scale" claims. Items in NOT TESTED are not bugs —
they're scope choices for the current iteration.
4 changes: 4 additions & 0 deletions jobs/competitive-test.yml
@@ -39,6 +39,9 @@ parameters:
- name: retry_attempt_count
type: number
default: 3
- name: preserve_state_on_apply_failure
type: string
default: "false"
- name: credential_type
type: string
default: service_connection
@@ -79,6 +82,7 @@ jobs:
terraform_arguments: ${{ parameters.terraform_arguments }}
terraform_input_varibles: ${{ parameters.terraform_input_varibles }}
retry_attempt_count: ${{ parameters.retry_attempt_count }}
preserve_state_on_apply_failure: ${{ parameters.preserve_state_on_apply_failure }}
- template: /steps/validate-resources.yml
parameters:
cloud: ${{ parameters.cloud }}
@@ -0,0 +1,109 @@
#!/bin/bash
# Annotate workload namespaces for ACNS (managed Cilium) opt-in cross-cluster sync.
#
# AKS-managed Cilium ships with `clustermesh-default-global-namespace=false`
# (opt-in mode, per ACNS team confirmation 2026-05-11 from David Vadas /
# Isaiah Raya), unlike upstream Cilium which defaults to opt-out. Without
# the `clustermesh.cilium.io/global: "true"` annotation on the workload
# namespace, NONE of the namespace's resources (CiliumIdentity,
# CiliumEndpoint, CiliumEndpointSlice, Services, ServiceExports) sync
# across the mesh — even if the Service object itself carries
# `service.cilium.io/global: "true"`. The namespace annotation is
# load-bearing; once present, Cilium auto-applies the service-level
# semantics to all services in that namespace.
#
# This script is invoked via `Method: Exec` from each scale-test scenario's
# top-level CL2 config (event-throughput.yaml, pod-churn-*.yaml). It runs
# AFTER CL2 has created the test namespaces (`<prefix>-1..N`) and BEFORE the
# workload deploy phase, so cross-cluster sync is enabled from the first
# resource creation.
#
# The pre-staged kubectl binary at /root/perf-tests/clusterloader2/config/kubectl
# (set up by steps/engine/clusterloader2/clustermesh-scale/execute.yml) is
# used because the CL2 image does not bundle kubectl.
#
# Positional args:
# $1 NAMESPACE_COUNT How many namespaces total (matches CL2's `namespace.number`).
# $2 NAMESPACE_PREFIX Namespace prefix (matches CL2's `namespace.prefix`).
# $3 GLOBAL_NAMESPACE_COUNT (OPTIONAL, default=$1) How many of the N
# namespaces to annotate as global. Lets
# experiments vary %global without touching
# CL2 namespace.number. When 0, NO namespace
# is annotated (pure ClusterMesh overhead
# baseline). When equal to $1, behaves as
# before (all annotated; backward-compatible).

set -u
set -o pipefail

NAMESPACE_COUNT="${1:-0}"
NAMESPACE_PREFIX="${2:-}"
# Default: annotate all namespaces (backward-compatible behavior).
# Always-annotate-first-N pattern: callers wanting %global=20% with 5 NS
# pass GLOBAL_NAMESPACE_COUNT=1; %global=60% with 5 NS pass 3; etc.
GLOBAL_NAMESPACE_COUNT="${3:-$NAMESPACE_COUNT}"

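# Negating -ge also rejects a non-numeric count: `[` exits 2 on a bad
# integer, which a plain `-lt 1` test would silently treat as "valid".
if [ -z "${NAMESPACE_PREFIX}" ] || ! [ "${NAMESPACE_COUNT}" -ge 1 ] 2>/dev/null; then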
echo "annotate-namespaces ERROR: need positional args (count, prefix); got count='${NAMESPACE_COUNT}' prefix='${NAMESPACE_PREFIX}'"
exit 2
fi

# GLOBAL_NAMESPACE_COUNT validation: must be 0..NAMESPACE_COUNT.
if ! [ "${GLOBAL_NAMESPACE_COUNT}" -ge 0 ] 2>/dev/null || [ "${GLOBAL_NAMESPACE_COUNT}" -gt "${NAMESPACE_COUNT}" ]; then
echo "annotate-namespaces ERROR: GLOBAL_NAMESPACE_COUNT='${GLOBAL_NAMESPACE_COUNT}' must be 0..${NAMESPACE_COUNT}"
exit 2
fi

# Prefer PATH kubectl, fall back to the pre-staged binary the pipeline
# downloads into the bind-mounted config dir. Mirrors pod-churn-killer.sh's
# fallback path so both scripts behave consistently if the CL2 image
# eventually starts bundling kubectl.
if command -v kubectl >/dev/null 2>&1; then
KUBECTL=kubectl
elif [ -x /root/perf-tests/clusterloader2/config/kubectl ]; then
KUBECTL=/root/perf-tests/clusterloader2/config/kubectl
echo "annotate-namespaces: using pre-staged kubectl at ${KUBECTL}"
else
echo "annotate-namespaces ERROR: kubectl not in PATH and pre-staged binary missing"
exit 127
fi

ANNOTATION="clustermesh.cilium.io/global=true"

# 0% global baseline: no namespace is annotated. Log explicitly and exit
# clean — this is the "pure ClusterMesh overhead" experimental control.
if [ "${GLOBAL_NAMESPACE_COUNT}" -eq 0 ]; then
echo "annotate-namespaces: GLOBAL_NAMESPACE_COUNT=0 — no namespaces annotated (0% global baseline)"
echo "annotate-namespaces: done, applied=0 of total=${NAMESPACE_COUNT}"
exit 0
fi

echo "annotate-namespaces: applying ${ANNOTATION} to first ${GLOBAL_NAMESPACE_COUNT} of ${NAMESPACE_COUNT} namespaces (prefix '${NAMESPACE_PREFIX}')"

FAIL_COUNT=0
APPLIED_COUNT=0
for i in $(seq 1 "${GLOBAL_NAMESPACE_COUNT}"); do
NS="${NAMESPACE_PREFIX}-${i}"
# --overwrite tolerates re-runs (CL2 retries, multi-step configs). The
# namespace MUST already exist — CL2 creates managed namespaces before
# the first test step runs. If it's missing here, that's a real bug
# worth surfacing as an error (don't --ignore-not-found).
if "${KUBECTL}" annotate namespace "${NS}" "${ANNOTATION}" --overwrite >/dev/null 2>&1; then
echo "annotate-namespaces: ${NS} annotated"
APPLIED_COUNT=$((APPLIED_COUNT + 1))
else
echo "annotate-namespaces ERROR: failed to annotate ${NS}"
FAIL_COUNT=$((FAIL_COUNT + 1))
fi
done

# Verification log — caller can grep this to confirm expected vs actual.
echo "annotate-namespaces: requested=${GLOBAL_NAMESPACE_COUNT}, applied=${APPLIED_COUNT}, failed=${FAIL_COUNT}, total_namespaces=${NAMESPACE_COUNT}"

if [ "${FAIL_COUNT}" -gt 0 ]; then
echo "annotate-namespaces: ${FAIL_COUNT}/${GLOBAL_NAMESPACE_COUNT} namespaces failed annotation"
exit 1
fi

echo "annotate-namespaces: done, applied=${APPLIED_COUNT} of total=${NAMESPACE_COUNT}"
exit 0