Monitor MachinePool Health #1473

Closed
friegger wants to merge 8 commits into main from enh/1472-pool-health

@friegger friegger commented May 14, 2026

This change consists of two main parts:

  1. A MachinePool lifecycle controller:

Implements the machinepool-lifecycle-controller, which monitors MachinePool health, along with the corresponding API additions for MachinePool/VolumePool/BucketPool conditions and codegen updates.

  2. A pool health heartbeat runnable in the machinepoollet:

The poollet now actively reports pool liveness so the lifecycle controller can set pools whose poollets have gone away to Unknown. The change also includes kustomizations that add the Lease namespace and adjust RBAC.
MachinePoolHeartbeat is a ticker-driven Runnable that probes the IRI runtime via Status, renews the pool's Lease in ironcore-machinepool-lease, and patches Ready only when its value or observedGeneration actually changes. Errors on either sub-step are logged and retried on the next tick; the lifecycle controller's grace period absorbs short blips. Lease takeover from a previous holder is logged at Info. New app arguments make the heartbeat intervals configurable, defaulting to the IEP-15 values.
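
For readers who want a feel for the heartbeat step described above, here is a minimal sketch in plain Go. It is not the code from this PR: the `Heartbeat` type, its function fields, `errProbe`, and `simulateTicks` are hypothetical stand-ins for the real IRI and Kubernetes clients. Treating a failed probe as "not ready" and renewing the Lease regardless of the probe result are assumptions as well.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// Heartbeat is a hypothetical stand-in for MachinePoolHeartbeat. The function
// fields replace the real IRI runtime client and the Kubernetes client.
type Heartbeat struct {
	Interval   time.Duration                   // must be positive when Start is used
	Probe      func(ctx context.Context) error // IRI Status probe
	RenewLease func(ctx context.Context) error // Lease renewal
	PatchReady func(ctx context.Context, ready bool) error

	lastReady *bool // last value patched successfully; nil before the first patch
}

// Tick runs one heartbeat step. Failures are logged and not returned, so the
// next tick simply retries.
func (h *Heartbeat) Tick(ctx context.Context) {
	ready := true
	if err := h.Probe(ctx); err != nil {
		log.Printf("runtime probe failed: %v", err)
		ready = false // assumption: a failed probe maps to Ready=false
	}
	if err := h.RenewLease(ctx); err != nil {
		log.Printf("lease renewal failed, retrying on next tick: %v", err)
	}
	if h.lastReady != nil && *h.lastReady == ready {
		return // value unchanged, skip the patch
	}
	if err := h.PatchReady(ctx, ready); err != nil {
		log.Printf("patching Ready failed, retrying on next tick: %v", err)
		return
	}
	h.lastReady = &ready
}

// Start ticks until ctx is cancelled, which is the general shape of a Runnable.
func (h *Heartbeat) Start(ctx context.Context) error {
	ticker := time.NewTicker(h.Interval)
	defer ticker.Stop()
	for {
		h.Tick(ctx)
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
		}
	}
}

// errProbe is an illustrative probe failure.
var errProbe = errors.New("runtime unavailable")

// simulateTicks drives Tick once per entry in probeResults using in-memory
// fakes and reports which Ready values were patched and how often the Lease
// was renewed.
func simulateTicks(probeResults []error) (patches []bool, renewals int) {
	i := 0
	h := &Heartbeat{
		Probe:      func(context.Context) error { err := probeResults[i]; i++; return err },
		RenewLease: func(context.Context) error { renewals++; return nil },
		PatchReady: func(_ context.Context, ready bool) error {
			patches = append(patches, ready)
			return nil
		},
	}
	for range probeResults {
		h.Tick(context.Background())
	}
	return patches, renewals
}

func main() {
	patches, renewals := simulateTicks([]error{nil, nil, errProbe})
	fmt.Println(patches, renewals)
}
```

In this sketch the Lease is renewed on every tick, while Ready is only patched on the first tick and when the probe result flips.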

Contributes to #1472

machinepool-lifecycle-controller monitors MachinePools
based on the proposal.

Signed-off-by: Felix Riegger <felix.riegger@sap.com>
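
The liveness decision made by the lifecycle controller can be illustrated with a small sketch. It assumes the controller compares the time since the last Lease renewal against a grace period and falls back to Unknown once that period is exceeded. `EffectiveReady`, the `ConditionStatus` constants, and `exampleGracePeriod` are hypothetical names; the grace value is made up and is not the IEP-15 default.

```go
package main

import (
	"fmt"
	"time"
)

// ConditionStatus mirrors the usual True/False/Unknown condition values.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// exampleGracePeriod is an illustrative value, not the IEP-15 default.
const exampleGracePeriod = 40 * time.Second

// EffectiveReady returns the status to record for a pool. While the Lease is
// fresh, the status reported by the poollet is kept. Once the time since the
// last renewal exceeds the grace period, the pool is set to Unknown.
func EffectiveReady(reported ConditionStatus, sinceRenewal, grace time.Duration) ConditionStatus {
	if sinceRenewal > grace {
		return ConditionUnknown
	}
	return reported
}

func main() {
	// Pretend the poollet last renewed its Lease two minutes ago.
	renewTime := time.Now().Add(-2 * time.Minute)
	status := EffectiveReady(ConditionTrue, time.Since(renewTime), exampleGracePeriod)
	fmt.Println(status)
}
```

A renewal gap shorter than the grace period leaves the reported status untouched, which is how short blips are absorbed.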
@friegger friegger requested a review from adracus May 14, 2026 19:55
@github-actions github-actions Bot added enhancement New feature or request size/L labels May 14, 2026
@hardikdr hardikdr added the area/iaas Issues related to IronCore IaaS development. label May 16, 2026
@hardikdr hardikdr added this to Roadmap May 16, 2026
Signed-off-by: Felix Riegger <felix.riegger@sap.com>
@github-actions github-actions Bot added size/XXL and removed size/L labels May 18, 2026
Signed-off-by: Felix Riegger <felix.riegger@sap.com>
@friegger friegger force-pushed the enh/1472-pool-health branch from ed66ffe to 6833b9c Compare May 18, 2026 19:09
Signed-off-by: Felix Riegger <felix.riegger@sap.com>
@friegger friegger force-pushed the enh/1472-pool-health branch from cdf84c1 to 19f5b40 Compare May 19, 2026 07:13
friegger added 2 commits May 19, 2026 09:23
Signed-off-by: Felix Riegger <felix.riegger@sap.com>
Implement the MachinePool health heartbeat described in IEP-15.
The poollet now actively reports pool liveness so the lifecycle controller
can fail pools whose poollets have gone away.

It includes:
- config
    - Provision the ironcore-machinepool-lease namespace via the install
      kustomization so operators don't have to know the magic name.
    - Grant the poollet coordination.k8s.io/leases RBAC scoped to that
      namespace.
- heartbeat runnable
    - Add pure helpers ComputeReadyCondition (maps an IRI Status probe
      result to a MachinePool Ready condition) and ReadyConditionsDiffer
      (decides whether a patch is warranted, ignoring timestamps so we
      don't flap downstream watchers).
    - Add MachinePoolHeartbeat, a ticker-driven Runnable that probes the
      IRI runtime via Status, renews the pool's Lease in
      ironcore-machinepool-lease, and patches Ready only when its value or
      observedGeneration actually changes. Errors on either sub-step are
      logged and retried on the next tick; the lifecycle controller's
      grace period absorbs short blips. Lease takeover from a previous
      holder is logged at Info as required by IEP-15.
- app arguments that make the heartbeat intervals configurable,
  defaulting to the IEP-15 values.

Signed-off-by: Felix Riegger <felix.riegger@sap.com>
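
The two pure helpers named in this commit message could look roughly like the sketch below. Only the helper names and their described purpose come from the commit message. The signatures, the simplified `Condition` type, the reason strings, and `errRuntimeDown` are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Condition is a simplified stand-in for the MachinePool condition type.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	ObservedGeneration int64
	LastTransitionTime time.Time
}

// errRuntimeDown is an illustrative probe failure.
var errRuntimeDown = errors.New("status probe failed")

// ComputeReadyCondition maps the result of an IRI Status probe to a Ready
// condition. The signature and reason strings are hypothetical.
func ComputeReadyCondition(probeErr error, generation int64, now time.Time) Condition {
	c := Condition{Type: "Ready", ObservedGeneration: generation, LastTransitionTime: now}
	if probeErr != nil {
		c.Status = "False"
		c.Reason = "RuntimeUnavailable"
		return c
	}
	c.Status = "True"
	c.Reason = "RuntimeAvailable"
	return c
}

// ReadyConditionsDiffer reports whether a patch is warranted. Only the status
// and observedGeneration are compared; timestamps are ignored on purpose so
// that an unchanged condition never triggers a write.
func ReadyConditionsDiffer(a, b Condition) bool {
	return a.Status != b.Status || a.ObservedGeneration != b.ObservedGeneration
}

func main() {
	now := time.Now()
	current := ComputeReadyCondition(nil, 1, now)
	next := ComputeReadyCondition(nil, 1, now.Add(10*time.Second))
	fmt.Println("patch after healthy probe:", ReadyConditionsDiffer(current, next))

	failed := ComputeReadyCondition(errRuntimeDown, 1, now.Add(20*time.Second))
	fmt.Println("patch after failed probe:", ReadyConditionsDiffer(current, failed))
}
```

Keeping both helpers free of client calls means they can be unit-tested without a cluster or a runtime.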
@github-actions github-actions Bot added size/XL and removed size/XXL labels May 20, 2026
friegger added 2 commits May 22, 2026 14:56
…manager

Signed-off-by: Felix Riegger <felix.riegger@sap.com>
Splits each component install into a namespace-free `default/` base and
a `standalone/` wrapper that ships Namespaces. Resolves the kustomize
ID conflict introduced in 242aaf9 (where the parent namespace
transformer renamed every Namespace in the bundle, including the new
lease Namespace, to ironcore-system) and eliminates the
remove-namespace.yaml patch dance the combined wrappers used to work
around it.

Layer model:
- config/<component>/default/   - sets namespace+namePrefix, emits no
                                  Namespace (used as a base)
- config/<component>/standalone/ - wraps default/ and adds Namespaces
                                  (use this for single-component deploys)
- config/namespaces/ironcore-system/    - shared Namespace kustomization
- config/namespaces/machinepool-lease/  - shared Namespace + lease RBAC
- config/default/, config/etcdless/  - reference the bases and namespace
                                       kustomizations directly; no more
                                       remove-namespace.yaml patches

Closes two IEP-15 RBAC gaps:
- Adds the missing RoleBinding for poollet lease renewal; without it
  every poollet got 403 on its first renewal.
- Adds Role + cross-namespace RoleBinding granting the controller
  manager get/list/watch on coordination.k8s.io/leases in
  ironcore-machinepool-lease (the lifecycle controller was reading
  leases with no Role granting access).

Also moves the apiserver-side lease Role out of config/apiserver/rbac/,
where the parent transformer was silently rewriting its
metadata.namespace from ironcore-machinepool-lease to ironcore-system.
It now lives alongside its Namespace in config/namespaces/machinepool-lease/.

Behavioral change for downstream consumers: users who previously ran
`kustomize build config/controller/default` for a complete deploy must
migrate to `config/controller/standalone`; same for
`config/apiserver/default` -> `config/apiserver/standalone`. The combined
config/default and config/etcdless paths produce output that is
byte-identical to main for every non-lease document, plus the new
ironcore-machinepool-lease Namespace and its RBAC.

The Makefile install/uninstall/deploy/undeploy targets are retargeted
at the standalone variants accordingly. hack/validate-kustomize.sh is
also made portable (GNU realpath --relative-to is unavailable on macOS).

Signed-off-by: Felix Riegger <felix.riegger@sap.com>
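
The namespace problem described in this commit message can be reproduced with a toy model. The sketch below is not kustomize code. It only mimics the behaviour the message describes: a parent namespace setting renames every Namespace object and rewrites metadata.namespace on everything else. `Resource`, `ApplyNamespace`, `DuplicateIDs`, and the Role name `lease-role` are hypothetical.

```go
package main

import "fmt"

// Resource is a minimal stand-in for a rendered manifest.
type Resource struct {
	Kind      string
	Name      string
	Namespace string // empty for Namespace objects
}

// ApplyNamespace mimics the transformer behaviour described in the commit
// message: Namespace objects are renamed to ns and every other resource has
// its metadata.namespace set to ns.
func ApplyNamespace(in []Resource, ns string) []Resource {
	out := make([]Resource, 0, len(in))
	for _, r := range in {
		if r.Kind == "Namespace" {
			r.Name = ns
		} else {
			r.Namespace = ns
		}
		out = append(out, r)
	}
	return out
}

// DuplicateIDs returns every kind/namespace/name ID that occurs more than
// once, in the order the second occurrence is seen.
func DuplicateIDs(rs []Resource) []string {
	seen := map[string]int{}
	var dups []string
	for _, r := range rs {
		id := r.Kind + "/" + r.Namespace + "/" + r.Name
		seen[id]++
		if seen[id] == 2 {
			dups = append(dups, id)
		}
	}
	return dups
}

func main() {
	bundle := []Resource{
		{Kind: "Namespace", Name: "ironcore-system"},
		{Kind: "Namespace", Name: "ironcore-machinepool-lease"},
		// "lease-role" is a made-up name for the lease Role.
		{Kind: "Role", Name: "lease-role", Namespace: "ironcore-machinepool-lease"},
	}
	rendered := ApplyNamespace(bundle, "ironcore-system")
	fmt.Println("conflicts:", DuplicateIDs(rendered))
	fmt.Println("role namespace:", rendered[2].Namespace)
}
```

In this model both Namespace objects end up with the same ID and the Role silently moves to ironcore-system, which is why the commit ships Namespaces from separate kustomizations that sit outside the transformed bases.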
@friegger friegger force-pushed the enh/1472-pool-health branch from 85560b8 to fbec8f1 Compare May 22, 2026 12:56
@friegger
Contributor Author

Superseded by #1476.

@friegger friegger closed this May 22, 2026
@github-project-automation github-project-automation Bot moved this to Done in Roadmap May 22, 2026
Labels

- area/iaas: Issues related to IronCore IaaS development.
- enhancement: New feature or request
- size/XL

Projects

Status: Done

2 participants