This repository was archived by the owner on Jun 24, 2026. It is now read-only.
Open
Changes from 10 commits
20 commits
1cf96f0 feat(crd): add AzureFlexNodeClass v1alpha1 CRD (chokevin, Apr 22, 2026)
6c4786d chore(plugin): bump flexNodeVersion to v0.0.18 (chokevin, Apr 22, 2026)
6bdb3a2 feat(plugin): add azure/flexvm agent pool service (chokevin, Apr 22, 2026)
170a427 feat(karpenter): add azure cross-region cloudprovider (chokevin, Apr 22, 2026)
43f1b52 feat(karpenter): add azure nodeclass status+termination controllers (chokevin, Apr 22, 2026)
100ca7f feat(karpenter): wire azure cloudprovider into controller main (chokevin, Apr 22, 2026)
e026782 docs(karpenter): add azure example NodeClass+NodePool (chokevin, Apr 22, 2026)
c4b32b6 fix(azure): address P0/P1 review findings (Apr 22, 2026)
6e7b5b9 fix(charts): grant rbac for azureflexnodeclasses (Apr 22, 2026)
d17174b chore(karpenter): go mod tidy (Apr 22, 2026)
a4d892b feat(catalog): add Standard_ND96isr_H100_v5 (8x H100 SKU) (Apr 22, 2026)
0f6173f userdata: regenerate containerd v2 config before aks-flex-node apply (Apr 22, 2026)
244839e userdata: write v3-schema containerd config AFTER aks-flex-node apply (Apr 22, 2026)
16e0483 userdata: fix conf.d/99-nvidia.toml bin_dir override (Apr 22, 2026)
9a8fdf5 flexvm: garbage-collect orphan NICs after failed VM creation (Apr 23, 2026)
0a0bfb2 address copilot review comments (chokevin, Apr 23, 2026)
e68df85 fix(azure-flex): handle mismatched agentpool types in GC paths (chokevin, Apr 24, 2026)
460b21b chore(karpenter): tidy protobuf module classification (chokevin, Apr 24, 2026)
3ec4f79 fix(karpenter): harden azure h200 provisioning (chokevin, May 16, 2026)
85ca16f chore(karpenter): drop obsolete provider patches (chokevin, May 16, 2026)
@@ -0,0 +1,214 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.20.1
  name: azureflexnodeclasses.flex.aks.azure.com
spec:
  group: flex.aks.azure.com
  names:
    categories:
    - karpenter
    - nap
    kind: AzureFlexNodeClass
    listKind: AzureFlexNodeClassList
    plural: azureflexnodeclasses
    shortNames:
    - afnc
    - afncs
    singular: azureflexnodeclass
  scope: Cluster
  versions:
  - additionalPrinterColumns:
    - jsonPath: .status.conditions[?(@.type=='Ready')].status
      name: Ready
      type: string
    - jsonPath: .metadata.creationTimestamp
      name: Age
      type: date
    name: v1alpha1
    schema:
      openAPIV3Schema:
        description: |-
          AzureFlexNodeClass is the Schema for the AzureFlexNodeClass API.

          It enables a NodePool in an AKS cluster to auto-provision external Azure VMs in a
          (potentially different) Azure region than the AKS cluster's own region. Each node
          is a single VM (not VMSS) so that cross-region placement is straightforward.
        properties:
          apiVersion:
            description: |-
              APIVersion defines the versioned schema of this representation of an object.
              Servers should convert recognized schemas to the latest internal value, and
              may reject unrecognized values.
              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
            type: string
          kind:
            description: |-
              Kind is a string value representing the REST resource this object represents.
              Servers may infer this from the endpoint the client submits requests to.
              Cannot be updated.
              In CamelCase.
              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
            type: string
          metadata:
            type: object
          spec:
            description: |-
              AzureFlexNodeClassSpec is the spec for AzureFlexNodeClass.

              Phase 1 scope (issue #63): single region per NodeClass, no spot, no zones,
              no identity/UAMI per-NodeClass (the controller MI is assumed to have
              Contributor on the target subscription/RG/subnet), no quota preflight,
              no PPG/capacity reservation, no spot, no WireGuard.
            properties:
              allocateNodePublicIP:
                default: false
                description: AllocateNodePublicIP controls whether each node receives
                  a public IP.
                type: boolean
              imageID:
                description: ImageID is a SIG / community gallery image resource ID.
                  Mutually exclusive with ImageReference.
                type: string
              imageReference:
                description: |-
                  ImageReference selects an Azure Marketplace image. Mutually exclusive with ImageID.
                  If neither is set, defaults to microsoft-dsvm/ubuntu-hpc/2204/latest.
                properties:
                  offer:
                    type: string
                  publisher:
                    type: string
                  sku:
                    type: string
                  version:
                    default: latest
                    type: string
                required:
                - offer
                - publisher
                - sku
                type: object
              location:
                description: Location is the Azure region (e.g. "eastus2"). May differ
                  from the AKS cluster region.
                type: string
              maxPodsPerNode:
                default: 110
                description: MaxPodsPerNode is advertised in the node's capacity and
                  affects Karpenter scheduling.
                format: int32
                type: integer
              osDiskSizeGB:
                default: 128
                description: OSDiskSizeGB is the size of the OS disk in GB.
                format: int32
                type: integer
              resourceGroup:
                description: |-
                  ResourceGroup is the resource group where VMs, NICs, and OS disks land.
                  Must already exist.
                type: string
              securityType:
                default: Standard
                description: |-
                  SecurityType selects the VM security profile. Currently only "Standard" is supported.
                  TrustedLaunch is deferred — it has been observed to break the DSVM image.
                enum:
                - Standard
                type: string
              sshPublicKeys:
                description: SSHPublicKeys is the list of SSH public keys to install
                  on each node.
                items:
                  type: string
                type: array
              subnetID:
                description: |-
                  SubnetID is the full ARM resource ID of the subnet (must already exist
                  and be reachable from the AKS cluster).
                type: string
              subscriptionID:
                description: SubscriptionID is the Azure subscription where VMs will
                  be created.
                type: string
              tags:
                additionalProperties:
                  type: string
                description: Tags are applied to every Azure resource (VM, NIC, OS
                  disk) created from this NodeClass.
                type: object
            required:
            - location
            - resourceGroup
            - subnetID
            - subscriptionID
            type: object
          status:
            description: status contains the resolved state of the AzureFlexNodeClass.
            properties:
              conditions:
                description: conditions contains signals for health and readiness
                items:
                  description: Condition aliases the upstream type and adds additional
                    helper methods
                  properties:
                    lastTransitionTime:
                      description: |-
                        lastTransitionTime is the last time the condition transitioned from one status to another.
                        This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
                      format: date-time
                      type: string
                    message:
                      description: |-
                        message is a human readable message indicating details about the transition.
                        This may be an empty string.
                      maxLength: 32768
                      type: string
                    observedGeneration:
                      description: |-
                        observedGeneration represents the .metadata.generation that the condition was set based upon.
                        For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
                        with respect to the current state of the instance.
                      format: int64
                      minimum: 0
                      type: integer
                    reason:
                      description: |-
                        reason contains a programmatic identifier indicating the reason for the condition's last transition.
                        Producers of specific condition types may define expected values and meanings for this field,
                        and whether the values are considered a guaranteed API.
                        The value should be a CamelCase string.
                        This field may not be empty.
                      maxLength: 1024
                      minLength: 1
                      pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
                      type: string
                    status:
                      description: status of the condition, one of True, False, Unknown.
                      enum:
                      - "True"
                      - "False"
                      - Unknown
                      type: string
                    type:
                      description: type of condition in CamelCase or in foo.example.com/CamelCase.
                      maxLength: 316
                      pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
                      type: string
                  required:
                  - lastTransitionTime
                  - message
                  - reason
                  - status
                  - type
                  type: object
                type: array
            type: object
        type: object
    served: true
    storage: true
    subresources:
      status: {}
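Note on the image fields: the descriptions say imageID and imageReference are mutually exclusive and that a default image applies when neither is set, but the schema shown here carries no validation rule for either behavior, so presumably the controller has to enforce it. A minimal Go sketch of that resolution logic follows. `ResolveImage`, `ResolvedImage`, and the local `ImageReference` struct are hypothetical names made up for illustration; they are not taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ImageReference mirrors the imageReference fields in the CRD schema.
// Hypothetical local type; the PR's Go types are not shown in this diff.
type ImageReference struct {
	Publisher, Offer, SKU, Version string
}

// ResolvedImage holds exactly one of ID or Ref (hypothetical helper type).
type ResolvedImage struct {
	ID  string
	Ref *ImageReference
}

// ResolveImage applies the rules stated in the CRD descriptions:
// imageID and imageReference are mutually exclusive, and when neither
// is set the image falls back to microsoft-dsvm/ubuntu-hpc/2204/latest.
func ResolveImage(imageID string, ref *ImageReference) (ResolvedImage, error) {
	switch {
	case imageID != "" && ref != nil:
		return ResolvedImage{}, errors.New("imageID and imageReference are mutually exclusive")
	case imageID != "":
		return ResolvedImage{ID: imageID}, nil
	case ref != nil:
		out := *ref // copy so the caller's struct is not mutated
		if out.Version == "" {
			out.Version = "latest" // schema default for imageReference.version
		}
		return ResolvedImage{Ref: &out}, nil
	default:
		return ResolvedImage{Ref: &ImageReference{
			Publisher: "microsoft-dsvm",
			Offer:     "ubuntu-hpc",
			SKU:       "2204",
			Version:   "latest",
		}}, nil
	}
}

func main() {
	img, err := ResolveImage("", nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(img.Ref.Publisher, img.Ref.Offer, img.Ref.SKU, img.Ref.Version)
}
```

If the exclusivity should be rejected at admission time rather than at reconcile time, a CEL validation rule on the spec would be the usual alternative; whether this PR intends that is not visible from the diff.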
4 changes: 2 additions & 2 deletions karpenter/charts/karpenter/templates/clusterrole.yaml
@@ -33,7 +33,7 @@ rules:
     resources: ["aksnodeclasses"]
     verbs: ["get", "list", "watch"]
   - apiGroups: ["flex.aks.azure.com"]
-    resources: ["nebiusnodeclasses"]
+    resources: ["nebiusnodeclasses", "azureflexnodeclasses"]
     verbs: ["get", "list", "watch"]
   - apiGroups: ["kaito.sh"]
     resources: ["kaitonodeclasses"]
@@ -43,7 +43,7 @@ rules:
     resources: ["aksnodeclasses", "aksnodeclasses/status"]
     verbs: ["patch", "update"]
   - apiGroups: ["flex.aks.azure.com"]
-    resources: ["nebiusnodeclasses", "nebiusnodeclasses/status"]
+    resources: ["nebiusnodeclasses", "nebiusnodeclasses/status", "azureflexnodeclasses", "azureflexnodeclasses/status"]
     verbs: ["patch", "update"]
   - apiGroups: ["kaito.sh"]
     resources: ["kaitonodeclasses", "kaitonodeclasses/status"]
13 changes: 13 additions & 0 deletions karpenter/cmd/controller/main.go
@@ -26,6 +26,7 @@ import (
 	kaitov1alpha1 "github.com/Azure/aks-flex/karpenter/pkg/apis/kaito/v1alpha1"
 	"github.com/Azure/aks-flex/karpenter/pkg/apis/v1alpha1"
 	flexcloudproviders "github.com/Azure/aks-flex/karpenter/pkg/cloudproviders"
+	azureflex "github.com/Azure/aks-flex/karpenter/pkg/cloudproviders/azure"
 	"github.com/Azure/aks-flex/karpenter/pkg/cloudproviders/kaito"
 	"github.com/Azure/aks-flex/karpenter/pkg/cloudproviders/nebius"
 	flexcontrollers "github.com/Azure/aks-flex/karpenter/pkg/controllers"
@@ -47,6 +48,7 @@ func main() {
 		operator.WaitForCRDs(
 			ctx, 2*time.Minute, ctrl.GetConfigOrDie(), logger,
 			&v1alpha1.NebiusNodeClass{},
+			&v1alpha1.AzureFlexNodeClass{},
 			&kaitov1alpha1.KaitoNodeClass{},
 		),
 		"failed waiting for CRDs",
@@ -119,6 +121,17 @@ func main() {
 		lo.Must0(err, "registering kaito cloud provider")
 	}

+	// azure-flex (cross-region single-VM Azure cloud provider)
+	{
+		err := azureflex.Register(
+			ctx,
+			hubCloudProvider,
+			op.GetClient(),
+			clusterCA,
+		)
+		lo.Must0(err, "registering azure-flex cloud provider")
+	}
+
 	overlayUndecoratedCloudProvider := metrics.Decorate(hubCloudProvider)
 	cloudProvider := overlay.Decorate(overlayUndecoratedCloudProvider, op.GetClient(), op.InstanceTypeStore)
 	clusterState := state.NewCluster(op.Clock, op.GetClient(), cloudProvider)
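Note on the registration pattern: this hunk registers azure-flex against the same hubCloudProvider that the kaito provider is registered with, but the hub's own implementation is not part of the diff. One plausible shape, sketched below purely as an assumption, is a registry keyed by NodeClass kind that forwards each call to the matching provider. `Provider`, `Hub`, `NewHub`, and `fakeProvider` are hypothetical names and do not reflect the real types in `pkg/cloudproviders`.

```go
package main

import "fmt"

// Provider is a hypothetical, heavily simplified stand-in for a
// Karpenter cloud provider; the real interface has many more methods.
type Provider interface {
	Create(nodeClaim string) (string, error)
}

// Hub routes each request to the provider registered for a NodeClass kind.
type Hub struct {
	providers map[string]Provider
}

// NewHub returns an empty hub.
func NewHub() *Hub {
	return &Hub{providers: map[string]Provider{}}
}

// Register adds a provider for one kind and rejects duplicates, so that a
// wiring mistake in main surfaces at startup instead of at provisioning time.
func (h *Hub) Register(kind string, p Provider) error {
	if _, dup := h.providers[kind]; dup {
		return fmt.Errorf("provider for kind %q already registered", kind)
	}
	h.providers[kind] = p
	return nil
}

// Create dispatches to the provider that owns the given NodeClass kind.
func (h *Hub) Create(kind, nodeClaim string) (string, error) {
	p, ok := h.providers[kind]
	if !ok {
		return "", fmt.Errorf("no provider registered for kind %q", kind)
	}
	return p.Create(nodeClaim)
}

// fakeProvider is a test double that returns a synthetic provider ID.
type fakeProvider struct{ name string }

func (f fakeProvider) Create(nodeClaim string) (string, error) {
	return f.name + "/" + nodeClaim, nil
}

func main() {
	hub := NewHub()
	if err := hub.Register("AzureFlexNodeClass", fakeProvider{name: "azure-flex"}); err != nil {
		fmt.Println("register:", err)
		return
	}
	id, err := hub.Create("AzureFlexNodeClass", "claim-1")
	fmt.Println(id, err)
}
```

Under this assumption, wrapping the hub with the metrics and overlay decorators after all registrations, as the surrounding context lines do, would cover every registered provider at once.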
21 changes: 21 additions & 0 deletions karpenter/examples/azure/azureflexnodeclass-h200-eastus2.yaml
@@ -0,0 +1,21 @@
apiVersion: flex.aks.azure.com/v1alpha1
kind: AzureFlexNodeClass
metadata:
  name: h200-eastus2
spec:
  subscriptionID: 00000000-0000-0000-0000-000000000000
  location: eastus2
  resourceGroup: my-flex-rg
  subnetID: /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-flex-rg/providers/Microsoft.Network/virtualNetworks/flex-vnet/subnets/nodes
  imageReference:
    publisher: microsoft-dsvm
    offer: ubuntu-hpc
    sku: "2204"
    version: latest
  securityType: Standard
  osDiskSizeGB: 256
  allocateNodePublicIP: false
  maxPodsPerNode: 110
  tags:
    purpose: karpenter-flex-h200
    managed-by: aks-flex-karpenter
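Note on subnetID: the CRD types it as a plain string with no pattern, so a malformed ARM ID would only fail when Azure rejects the NIC. The sketch below shows one way to split and sanity-check the ID up front using only the standard library. `SubnetRef` and `ParseSubnetID` are hypothetical helpers, not code from this PR; the Azure SDK also ships a general resource ID parser that may be a better fit in the real controller.

```go
package main

import (
	"fmt"
	"strings"
)

// SubnetRef holds the components of a subnet ARM resource ID.
// Hypothetical helper type; not part of the PR.
type SubnetRef struct {
	SubscriptionID string
	ResourceGroup  string
	VNet           string
	Subnet         string
}

// ParseSubnetID splits an ID of the form
// /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>.
func ParseSubnetID(id string) (SubnetRef, error) {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	if len(parts) != 10 {
		return SubnetRef{}, fmt.Errorf("subnet ID %q: want 10 segments, got %d", id, len(parts))
	}
	// Fixed keyword positions; compared case-insensitively, as ARM IDs are.
	keywords := map[int]string{
		0: "subscriptions",
		2: "resourceGroups",
		4: "providers",
		5: "Microsoft.Network",
		6: "virtualNetworks",
		8: "subnets",
	}
	for i, kw := range keywords {
		if !strings.EqualFold(parts[i], kw) {
			return SubnetRef{}, fmt.Errorf("subnet ID %q: segment %d is %q, want %q", id, i, parts[i], kw)
		}
	}
	// The remaining positions carry user-supplied names and must be non-empty.
	for _, i := range []int{1, 3, 7, 9} {
		if parts[i] == "" {
			return SubnetRef{}, fmt.Errorf("subnet ID %q: segment %d is empty", id, i)
		}
	}
	return SubnetRef{
		SubscriptionID: parts[1],
		ResourceGroup:  parts[3],
		VNet:           parts[7],
		Subnet:         parts[9],
	}, nil
}

func main() {
	ref, err := ParseSubnetID("/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-flex-rg/providers/Microsoft.Network/virtualNetworks/flex-vnet/subnets/nodes")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", ref)
}
```

In this example the subnet's subscription and resource group match spec.subscriptionID and spec.resourceGroup. The CRD does not say whether they must match, so a check that compares them would be an additional policy decision.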
29 changes: 29 additions & 0 deletions karpenter/examples/azure/nodepool-h200.yaml
@@ -0,0 +1,29 @@
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: h200
spec:
  template:
    spec:
      nodeClassRef:
        group: flex.aks.azure.com
        kind: AzureFlexNodeClass
        name: h200-eastus2
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - Standard_ND96isr_H200_v5
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 64
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
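Note on the GPU limit: `limits.nvidia.com/gpu: 64` caps the pool by GPU count, not node count. A quick way to read it as a node ceiling is integer division by the GPUs per node. The sketch below assumes 8 GPUs per Standard_ND96isr_H200_v5 VM (by analogy with the "8x H100 SKU" named in commit a4d892b; this PR's diff does not state the H200 figure) and assumes the limit acts as a hard ceiling on whole nodes. `MaxNodes` is a hypothetical helper.

```go
package main

import "fmt"

// MaxNodes returns how many whole nodes fit under a NodePool resource limit
// when every node contributes perNode units of that resource.
// A non-positive perNode yields 0 rather than dividing by zero.
func MaxNodes(limit, perNode int) int {
	if perNode <= 0 || limit <= 0 {
		return 0
	}
	return limit / perNode // integer division: a partial node does not count
}

func main() {
	const gpuLimit = 64   // spec.limits["nvidia.com/gpu"] from the example NodePool
	const gpusPerNode = 8 // assumption for Standard_ND96isr_H200_v5
	fmt.Println("max nodes:", MaxNodes(gpuLimit, gpusPerNode))
}
```

With these assumed numbers the pool tops out at eight VMs. Raising the limit by less than one node's worth of GPUs would not admit another node.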
2 changes: 1 addition & 1 deletion karpenter/go.mod
@@ -4,6 +4,7 @@ go 1.26.0

 require (
 	github.com/Azure/aks-flex/plugin v0.0.0-00010101000000-000000000000
+	github.com/Azure/azure-sdk-for-go/sdk/azcore v1.21.0
 	github.com/Azure/karpenter-provider-azure v1.7.1
 	github.com/awslabs/operatorpkg v0.0.0-20250909182303-e8e550b6f339
 	github.com/go-logr/logr v1.4.3
@@ -27,7 +28,6 @@ require (
 	github.com/Azure/azure-kusto-go v0.16.1 // indirect
 	github.com/Azure/azure-sdk-for-go v68.0.0+incompatible // indirect
 	github.com/Azure/azure-sdk-for-go-extensions v0.5.1 // indirect
-	github.com/Azure/azure-sdk-for-go/sdk/azcore v1.21.0 // indirect
 	github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.13.1 // indirect
 	github.com/Azure/azure-sdk-for-go/sdk/internal v1.11.2 // indirect
 	github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/authorization/armauthorization/v2 v2.2.0 // indirect