ClusterMesh scale: Phase 3 +4 — scale tiers + parallel CL2 fan-out + add all scenarios by skosuri1 · Pull Request #1168 · Azure/telescope

skosuri1 · 2026-05-06T21:00:08Z

Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work — moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.

Phase 3 Deliverables

20-node baseline cluster size (spec line 24). Current clusters are 3 nodes (2 default + 1 prompool) — sized for harness validation, not real scale measurement.
Cluster-count tiers: add azure-5.tfvars, azure-10.tfvars, and azure-20.tfvars with corresponding pipeline matrix entries. For each tier: validate quota, validate peering count (N·(N-1) in separate-VNet mode, which is 380 at N=20), tune CL2 timeouts, and document breaking points.
Parallel CL2 fan-out: replace sequential per-cluster CL2 with bounded concurrency (default 4). Requires async wrapping of utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming the AzDO agent has CPU/RAM headroom for N concurrent CL2 + Prometheus.
etcd PodMonitor capacity check at 20 clusters: 28 watchers per cluster × 20 = 560 watchers; verify that the Prometheus scrape budget holds.
Scaling-curve dashboards from cluster-attributed results (Kusto).
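
The tier arithmetic in the deliverables (peering count and etcd watcher totals) can be sanity-checked with a few lines. This is a minimal sketch: the helper names are hypothetical and not part of the repo, the peering formula assumes each direction of a VNet pair counts as one peering (which is what N·(N-1) implies), and 28 watchers per cluster is the figure quoted above rather than a measured value.

```python
# Hypothetical helpers for sanity-checking tier sizes; not part of the repo.
WATCHERS_PER_CLUSTER = 28  # figure quoted in the PR description


def directed_peerings(n_clusters: int) -> int:
    """Peerings for a full mesh of separate VNets, one per direction per pair."""
    return n_clusters * (n_clusters - 1)


def total_watchers(n_clusters: int, per_cluster: int = WATCHERS_PER_CLUSTER) -> int:
    """Total etcd watchers across the mesh, assuming a constant per-cluster count."""
    return n_clusters * per_cluster


for n in (5, 10, 20):
    print(f"N={n}: peerings={directed_peerings(n)}, watchers={total_watchers(n)}")
```

For the three tiers this gives 20, 90, and 380 peerings, and 140, 280, and 560 watchers.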
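
One possible shape for the bounded-concurrency fan-out described above is sketched below, assuming Python 3.9+ for `asyncio.to_thread`. The `run_cl2_command` here is a stand-in with a made-up signature, not the real `utils.run_cl2_command`, and the PR's actual implementation may take a different approach (for example, subprocess-based).

```python
import asyncio
import time


def run_cl2_command(cluster: str) -> str:
    """Stand-in for the synchronous utils.run_cl2_command; the real signature differs."""
    time.sleep(0.05)  # simulate a blocking CL2 run
    return f"{cluster}: ok"


async def fan_out(clusters, max_concurrent: int = 4):
    """Run the blocking command once per cluster, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(cluster: str):
        async with sem:
            # to_thread keeps the event loop free while the blocking call runs
            return await asyncio.to_thread(run_cl2_command, cluster)

    # return_exceptions=True so one failed cluster does not cancel the others
    return await asyncio.gather(
        *(run_one(c) for c in clusters), return_exceptions=True
    )


if __name__ == "__main__":
    for result in asyncio.run(fan_out([f"mesh-{i}" for i in range(1, 6)])):
        print(result)
```

Results come back in the same order as the input cluster list, and a failure surfaces as an exception object in that cluster's slot, so per-cluster errors can be reported without aborting the whole tier.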

Out of Scope (deferred to later phases / pre-merge of #1157)

Pre-merge housekeeping for #1157, "Add Cilium ClusterMesh scale-test scenario" (DEBUG-DUMP block, dev-pipeline placeholder revert, comment-trim pass): stays on the base branch.
Remaining scale scenarios (#2 pod churn, #3 node churn, #4 API server failure, #5 isolation, #6 upper-bound, #7 HA replicas): deferred to Phase 4.

…trics (build 69395 evidence: Hubble queries emit "No data items" because CL2 prometheus does not scrape Hubble metrics port 9965; ACNS exposes them but our scrape config covers only standard cilium-agent port 9962)

…d): port-forward + curl + tar to capture in-cluster prometheus state for offline PromQL; adds PodMonitors for hubble:9965 + coredns:9153 + kvstoremesh-standalone:9964 (cilium-agent + cilium-operator already scraped by CL2 built-in flags); enabled on n=3 failover + n=3 restart-survival smoke stages

… own storage account, OAuth via SP, satisfies sub no-shared-key policy); knobs cl2_prom_snapshot_target=artifact|blob + storage_account + container; scales to N=100; n3 smoke stages use blob to validate end-to-end

…ows, CoreDNS latency+cache, kvstoremesh sync duration+readiness, operator identity GC+IPAM) + gap #3 service-backend membership probe (transient global Service per probe pod with propagation-probe-id selector, wait_peer_service_backend polls BPF lb map on peers, creates+deletes Service per iteration)

…ALL clusters not just source (build 70704 Kusto evidence: service_backend_ok=0/20 because Cilium global-service backend merge requires the same-named service to exist on each peer for the source backend to appear in the peer BPF lb map; source-only service meant peers never created lb entries)

…m/clustermesh/etcd metrics only at start+end so Kusto shows 6h aggregates but no drift curve; snapshot captures full 6h at 15s resolution to compute memory/BPF/etcd growth slope offline (the slow-degradation signal that is the entire point of a soak)

skosuri and others added 30 commits May 6, 2026 13:59

phase 3: bounded-parallel CL2 fan-out across clusters

b5fe281

phase 3: add 5-cluster tier (azure-5.tfvars + n5 stage on dev/prod pipelines)

506d195

aks-cli: wait for stable Succeeded before extra node pool create

56942b1

aks-cli: run wait-for-succeeded with bash interpreter (dash rejects pipefail)

5801228

fix per-type events rate: scope ip/v1 doesn't exist in kvstoremesh; add Total counts

1b02f57

probe: dump actual scope/action labels on kvstoremesh events metric

7ec0c43

aks-cli: retry nodepool add on OperationNotAllowed (race vs lazy AKS extension PUT)

4714d26

fix per-type events rate: range vector for increase, finer subquery step, restore Endpoints (ip/v1)

dbaf930

diag: add CurrentValue/SeriesCount per scope; add operations-count fallback metric per scope

a92b84e

per-scope events: report TotalCount (instant sum), drop broken increase/rate queries

81ea7c3

phase 3: add 10-cluster tier (azure-10.tfvars + n10 stage on dev/prod pipelines)

3a9af93

per-scope events: restore rate queries; add 90s pre-workload settle for prom baseline

380d34c

n10: lower terraform apply parallelism to 4 (AKS RP throttles at 10 concurrent creates)

90ef4e7

dev pipeline: disable n2 + n5 stages temporarily (RG quota pressure)

cac3392

cleanup phase 3: drop dead per-scope rate queries; drop 90s settle (didn't fix root cause); fix n5 condition syntax

4ca27f0

phase 3: add 20-cluster tier (final scale-test point); disable n2/n5/n10 in dev for n20 iteration

55c8a40

n20: parallelism=8 + 480min timeout; validate retry budget 30min for N=20 mesh convergence

5714f9c

20-node baseline (spec line 24): default pool 2->20 nodes, D4s_v5->D4s_v3 (DSv3 quota fits 1600 vCPU at n20)

2d717a7

aks-cli: add pod_subnet_name to variable schema (latent bug: main.tf referenced it but variables.tf didn't declare)

e24962f

aks-cli: pass --pod-subnet-id to nodepool add too (AKS requires all-or-none)

529aa91

pylint: clear R1732 (Popen disable), R1731 (max builtin), W0212 (rename _FakePopen attrs)

fd67123

pre-merge cleanup: strip DEBUG-DUMP/SMOKE-FAILURE-DEBUG-DUMP blocks; flip prod skip_publish to false

1bd56a6

dev pipeline: flip skip_publish to false (need Kusto data for dashboards)

5c45946

collect: stash subdirs around process_cl2_reports; per-cluster errors warn not abort

f44129b

validate: pre-gate on clustermesh-apiserver Deployment+LB readiness across all clusters

ca6895b

cl2 measurements: add per-pod apiserver CPU + per-peer mesh failure breakdown

e961e15

phase 4a: pod-churn-scale + pod-churn-kill CL2 configs, slope measurements, harness knobs

d80105a

phase 4a: wire pod-churn matrix entries + churn knobs in execute.yml/collect.yml

c144982

phase 4a: pod-churn-combined config + Method:Exec killer; n20 matrix entry

a021e02

phase 4a: enable n=2 stage with pod_churn_combined entry; disable n=20 for smoke first

8433840

skosuri1 and others added 30 commits June 4, 2026 06:59

drop N=100-in-alternate-region attempts; rely on euap N=100 baseline (build 67579) for headline; remove cc N=92 stage + azure-92-shared-cc tfvars/json

87f0041

soak canary: worker_timeout 7h to 8h plus stage timeout 10h to 11h (build 69332 evidence: SIGTERM at ~6h50m actual budget needed for 6h churn plus setup plus 10min terminate phase)

351e4f5

metrics Phase 3 + NetworkPolicy at scale scenario (10 new Hubble/CRI/throttle/OOM/endpoint-state metric IDs in cilium.yaml + new policy-scale.yaml scenario creates N CNPs per ns + scale.py CLI knobs + execute.yml env vars + n=2 pipeline stage)

35ced14

cross-cluster CNP propagation cost probe: host-side parallel-apply orchestrator + execute.yml launcher/wait + scale.py collector + n=3 pipeline stage (complements per-cluster policy-scale by measuring fleet-wide rollout latency)

32367f8

propagation probe: add REMOVE + FIRST_PACKET extensions (stale-state risk + user-perceived service-works latency); opt-in via env, default off; enabled on euap n=2 + cc n=2 smoke stages

dcb5afb

FIRST_PACKET probe fix: switch probe pod to nginx (HTTP server) when extension enabled, curl pod IP directly instead of global Service DNS (build 69395 evidence: all 10 samples emitted nulls because pause pod cannot respond to curl)

90124bd

soak canary: worker_timeout 8h to 9h plus stage timeout 11h to 12h (build 69392 evidence: SIGTERM at 7h1min CL2 wall instead of expected 6h50min; +1h in-CL2 overhead from inter-phase measurement gather)

ff23224

mesh-behavior gap probes: identity GC in REMOVE + single-cluster failover (scale-down/up; gap #4) + clustermesh-apiserver restart survival (in-pod curl loop; gap #8) + n=3 smoke stages for both

def3fd8

validation gate fixes: strip trailing whitespace in pipeline yaml + update test_configure_command_parsing kwargs to match scale.py CLI

be8994a

fix pre-existing pylint regressions (too-many-lines disable on scale.py + tests + hoist subprocess import + suppress not-callable false positive on tuple-unpacked transform); pylint now 10/10 exit 0

05e32b3

switch prom snapshot delivery from Telescope blob to AzDO pipeline artifact (artifact owned by our pipeline run, downloadable from Build page; eliminates Telescope-team storage dependency that defeated the purpose of having an independent backup)

34f41e3

fix prom snapshot blob upload: replace login.yml template + bash combo with single AzureCLI@2 task (AzDO does not allow runtime condition on step-template references; AzureCLI@2 supports condition directly and handles SP auth via azureSubscription input)

20bd804

refine Hubble flow query: slice forward/drop by standard verdict label (build 70704 evidence: type=/subtype= filters returned empty, verdict-based + unfiltered work)

4cee8fd

enable prom snapshot (blob) on n2 global smoke: richest probe stage (propagation+remove+first-packet+service-backend+recovery+policy-canary+Phase 4 scrape targets); snapshot gives offline PromQL over hubble/coredns/kvstoremesh raw metrics

1433f34

Add clustermesh-scale mock mode (KWOK + mock-cilium-agent topology)

0ec395d

Add azure_mock_n2 stage for clustermesh-scale mock mode

9e28ea4

Add n=2 MOCK smoke stage to new-pipeline-test (KWOK + mock-cilium-agent)

c665bd6

Fix mock-mode CI checks (configure test arg, tfvars test-inputs, yaml comment)

b40feab

mock-mode: disable CL2 kubelet scraping (kwok nodes have no kubelet)

b8d9f3f

test: use dict literal to satisfy pylint (use-dict-literal)

962824d

mock-mode: run pod-churn workload on kwok nodes, apply mock PodMonitor, normalize CL2_MOCK_MODE

afd9d22

Add n=20 mock provisioning spike stage (scales n=2 mock to 20 clusters)

8367099

Add n=100 mock shared-VNet stage with parallel deploy + CL2 prometheus-setup retry

2a4ba53

Enable prometheus TSDB snapshots (blob) for all mock stages

23627ab

skosuri1 wants to merge 180 commits into skosuri/clustermesh-scale from skosuri/clustermesh-scale-2
