ClusterMesh scale: Phase 3+4, scale tiers + parallel CL2 fan-out + add all scenarios #1168
Draft
skosuri1 wants to merge 180 commits into
…tep, restore Endpoints (ip/v1)
…llback metric per scope
…oncurrent creates)
…idn't fix root cause); fix n5 condition syntax
…n10 in dev for n20 iteration
…N=20 mesh convergence
…s_v3 (DSv3 quota fits 1600 vCPU at n20)
… referenced it but variables.tf didn't declare)
…me _FakePopen attrs)
…flip prod skip_publish to false
…cross all clusters
…ments, harness knobs
…0 for smoke first
…(build 67579) for headline; remove cc N=92 stage + azure-92-shared-cc tfvars/json
…uild 69332 evidence: SIGTERM at ~6h50m actual budget needed for 6h churn plus setup plus 10min terminate phase)
…throttle/OOM/endpoint-state metric IDs in cilium.yaml + new policy-scale.yaml scenario creates N CNPs per ns + scale.py CLI knobs + execute.yml env vars + n=2 pipeline stage)
…chestrator + execute.yml launcher/wait + scale.py collector + n=3 pipeline stage (complements per-cluster policy-scale by measuring fleet-wide rollout latency)
…risk + user-perceived service-works latency); opt-in via env, default off; enabled on euap n=2 + cc n=2 smoke stages
…extension enabled, curl pod IP directly instead of global Service DNS (build 69395 evidence: all 10 samples emitted nulls because pause pod cannot respond to curl)
…uild 69392 evidence: SIGTERM at 7h1min CL2 wall instead of expected 6h50min; +1h in-CL2 overhead from inter-phase measurement gather)
…trics (build 69395 evidence: Hubble queries emit "No data items" because CL2 prometheus does not scrape Hubble metrics port 9965; ACNS exposes them but our scrape config covers only standard cilium-agent port 9962)
…pdate test_configure_command_parsing kwargs to match scale.py CLI
…py + tests + hoist subprocess import + suppress not-callable false positive on tuple-unpacked transform); pylint now 10/10 exit 0
…d): port-forward + curl + tar to capture in-cluster prometheus state for offline PromQL; adds PodMonitors for hubble:9965 + coredns:9153 + kvstoremesh-standalone:9964 (cilium-agent + cilium-operator already scraped by CL2 built-in flags); enabled on n=3 failover + n=3 restart-survival smoke stages
…tifact (artifact owned by our pipeline run, downloadable from Build page; eliminates Telescope-team storage dependency that defeated the purpose of having an independent backup)
… own storage account, OAuth via SP, satisfies sub no-shared-key policy); knobs cl2_prom_snapshot_target=artifact|blob + storage_account + container; scales to N=100; n3 smoke stages use blob to validate end-to-end
…o with single AzureCLI@2 task (AzDO does not allow runtime condition on step-template references; AzureCLI@2 supports condition directly and handles SP auth via azureSubscription input)
…ows, CoreDNS latency+cache, kvstoremesh sync duration+readiness, operator identity GC+IPAM) + gap #3 service-backend membership probe (transient global Service per probe pod with propagation-probe-id selector, wait_peer_service_backend polls BPF lb map on peers, creates+deletes Service per iteration)
…ALL clusters not just source (build 70704 Kusto evidence: service_backend_ok=0/20 because Cilium global-service backend merge requires the same-named service to exist on each peer for the source backend to appear in the peer BPF lb map; source-only service meant peers never created lb entries)
…l (build 70704 evidence: type=/subtype= filters returned empty, verdict-based + unfiltered work)
…propagation+remove+first-packet+service-backend+recovery+policy-canary+Phase 4 scrape targets) — snapshot gives offline PromQL over hubble/coredns/kvstoremesh raw metrics
…m/clustermesh/etcd metrics only at start+end so Kusto shows 6h aggregates but no drift curve; snapshot captures full 6h at 15s resolution to compute memory/BPF/etcd growth slope offline (the slow-degradation signal that is the entire point of a soak)
…r, normalize CL2_MOCK_MODE
Stacked on top of #1157 (`skosuri/clustermesh-scale`). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work: moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.
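Two mechanics behind the Phase 3 deliverables are easy to sketch: the peering-count arithmetic per tier, and fanning a blocking per-cluster call out across threads. The following is a minimal illustration only, not the code in this PR. `peering_count` assumes one directional peering per ordered pair of clusters, matching the N·(N-1) figure quoted under Phase 3 Deliverables. `fan_out`, `fake_run_cl2`, the cluster names, and the `-1` failure sentinel are all hypothetical; the real signature of `utils.run_cl2_command` is not shown here, so a stand-in is used instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def peering_count(n_clusters: int) -> int:
    """Directional peerings for a full mesh of N clusters: N * (N - 1)."""
    if n_clusters < 0:
        raise ValueError("n_clusters must be non-negative")
    return n_clusters * (n_clusters - 1)


def fan_out(
    run_one: Callable[[str], int],
    clusters: Iterable[str],
    max_workers: int,
) -> Dict[str, int]:
    """Run a blocking per-cluster call concurrently.

    Returns a mapping of cluster name to exit code. A call that raises is
    recorded as -1 (a hypothetical sentinel) so one failure does not hide
    the results of the other clusters.
    """
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, name): name for name in clusters}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                results[name] = -1
    return results


def fake_run_cl2(cluster: str) -> int:
    """Hypothetical stand-in for the synchronous CL2 call; always succeeds."""
    return 0


if __name__ == "__main__":
    # Peering counts for the three tiers named in this PR.
    for n in (5, 10, 20):
        print(f"N={n}: {peering_count(n)} peerings")
    # Fan the stand-in call out over three made-up cluster names.
    print(fan_out(fake_run_cl2, ["cluster-1", "cluster-2", "cluster-3"], max_workers=3))
```

Threads are a reasonable fit only if the real call spends most of its time waiting on an external process, which is an assumption here. `max_workers` would need to be capped by whatever CPU/RAM headroom the AzDO agent turns out to have for N concurrent CL2 + Prometheus instances.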
Phase 3 Deliverables
- Scale tiers: `azure-5.tfvars`, `azure-10.tfvars`, `azure-20.tfvars` and corresponding pipeline matrix entries. Each tier: validate quota, validate peering count (N·(N-1) in separate-VNet mode, which is 380 at N=20), tune CL2 timeouts, document breaking points.
- Parallel CL2 fan-out: … `utils.run_cl2_command` (currently synchronous, `modules/python/clusterloader2/utils.py:66-72`) and confirming the AzDO agent has CPU/RAM headroom for N concurrent CL2 + Prometheus.

Out of Scope (deferred to later phases / pre-merge of #1157)