Azure Cluster Autoscaling for AI Runway

This guide explains how to enable cluster autoscaling for GPU workloads in Azure Kubernetes Service (AKS), so that your cluster automatically provisions GPU nodes when AI Runway deployments require more resources than are currently available.

Overview

AI Runway integrates with Kubernetes cluster autoscaling to provide visibility and guidance when deploying models that exceed available GPU capacity.

Prerequisites

  • Azure CLI (az) installed and authenticated
  • kubectl configured for your cluster
  • Appropriate Azure RBAC permissions (Contributor or higher on cluster/resource group)

Enable Autoscaling on AKS

AKS provides a managed cluster autoscaler that integrates directly with Azure infrastructure.

Enable Autoscaling on Existing Node Pool

If you already have a GPU node pool, enable autoscaling with:

# Replace with your actual values
RESOURCE_GROUP="my-resource-group"
CLUSTER_NAME="my-aks-cluster"
NODE_POOL_NAME="gpu"

az aks nodepool update \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
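Because the command above fails late if the bounds are inconsistent, it can help to check them locally first. The following is a minimal sketch, assuming a hypothetical wrapper named enable_pool_autoscaler (not part of the Azure CLI or AI Runway) that validates the bounds and then runs the same az aks nodepool update call:

```shell
# Hypothetical helper (not part of the Azure CLI): validates the bounds,
# then runs the same `az aks nodepool update` command shown above.
enable_pool_autoscaler() {
  local rg="$1" cluster="$2" pool="$3" min="$4" max="$5"
  local value

  # Both bounds must be non-negative integers.
  for value in "$min" "$max"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "error: min and max must be non-negative integers" >&2
        return 2
        ;;
    esac
  done

  # The minimum cannot exceed the maximum.
  if [ "$min" -gt "$max" ]; then
    echo "error: min-count ($min) exceeds max-count ($max)" >&2
    return 2
  fi

  az aks nodepool update \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --name "$pool" \
    --enable-cluster-autoscaler \
    --min-count "$min" \
    --max-count "$max"
}
```

Usage, with the variables defined earlier: enable_pool_autoscaler "$RESOURCE_GROUP" "$CLUSTER_NAME" "$NODE_POOL_NAME" 1 10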

Create New GPU Node Pool with Autoscaling

To create a new GPU node pool with autoscaling enabled:

az aks nodepool add \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name gpunodepool \
--node-count 1 \
--min-count 1 \
--max-count 10 \
--node-vm-size Standard_NC24ads_A100_v4 \
--enable-cluster-autoscaler

Common GPU VM Sizes:

VM Size                      GPUs     GPU Memory (total)  vCPUs  RAM
Standard_NC24ads_A100_v4     1x A100  80 GB               24     220 GB
Standard_NC48ads_A100_v4     2x A100  160 GB              48     440 GB
Standard_NC96ads_A100_v4     4x A100  320 GB              96     880 GB
Standard_NC40ads_H100_v5     1x H100  80 GB               40     320 GB
Standard_NC80adis_H100_v5    2x H100  160 GB              80     640 GB
Standard_ND96isr_H100_v5     8x H100  640 GB              96     1900 GB
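The GPUs-per-node figure in the table determines how large --max-count needs to be for a given workload. As a rough sizing sketch, assuming a hypothetical helper named nodes_for_gpus (illustrative only), the node count is the required GPU count divided by GPUs per node, rounded up:

```shell
# Hypothetical helper: number of nodes needed to provide a given number
# of GPUs, using integer ceiling division.
nodes_for_gpus() {
  local required_gpus="$1" gpus_per_node="$2"

  if [ "$gpus_per_node" -le 0 ]; then
    echo "error: gpus_per_node must be positive" >&2
    return 2
  fi

  echo $(( (required_gpus + gpus_per_node - 1) / gpus_per_node ))
}

# 12 GPUs on Standard_NC48ads_A100_v4 (2 GPUs per node)
nodes_for_gpus 12 2   # prints 6

# 10 GPUs on Standard_NC96ads_A100_v4 (4 GPUs per node)
nodes_for_gpus 10 4   # prints 3
```

The result is a lower bound for --max-count; it does not account for Azure quota or for GPUs already in use by other workloads.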

Update Autoscaler Settings

Adjust min/max node counts:

az aks nodepool update \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--update-cluster-autoscaler \
--min-count 0 \
--max-count 20

Note: Setting --min-count 0 allows scaling down to zero nodes when idle, reducing costs. However, scale-up from zero takes longer (typically 5-10 minutes).

Disable Autoscaling

To disable autoscaling and maintain a fixed node count:

az aks nodepool update \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--disable-cluster-autoscaler

Verification

Check Autoscaler Detection in AI Runway

  1. Navigate to the Settings page in AI Runway.
  2. Look for the Cluster Autoscaling section.
  3. Expected status: Cluster Autoscaler running on X node group(s)

Verify via CLI

# Check if autoscaler is enabled on node pool
az aks nodepool show \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--query '{autoscaling: enableAutoScaling, min: minCount, max: maxCount}'
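When a cluster has several node pools, checking them one at a time is tedious. A minimal sketch that lists the autoscaling settings of every pool at once, assuming a hypothetical wrapper named list_pool_autoscaling around az aks nodepool list:

```shell
# Hypothetical wrapper: shows autoscaling settings for every node pool
# in the cluster as a table.
list_pool_autoscaling() {
  local rg="$1" cluster="$2"

  az aks nodepool list \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --query '[].{name: name, vmSize: vmSize, autoscaling: enableAutoScaling, min: minCount, max: maxCount, current: count}' \
    --output table
}
```

Usage: list_pool_autoscaling "$RESOURCE_GROUP" "$CLUSTER_NAME"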

Check Autoscaler Status ConfigMap

AI Runway detects the autoscaler using AKS-specific node labels (cluster-autoscaler.kubernetes.io/enabled) first, then falls back to checking the cluster-autoscaler-status ConfigMap:

kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
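The detection order described above can be approximated from the command line. This is a sketch only, not AI Runway's actual implementation: the function name detect_autoscaler and its output strings are hypothetical, and the label check matches any node that carries the label, whatever its value.

```shell
# Hypothetical helper: mirrors the described detection order using kubectl.
detect_autoscaler() {
  local labeled

  # Step 1: look for nodes carrying the autoscaler label (existence
  # selector, so the label's value is not assumed).
  labeled=$(kubectl get nodes \
    -l cluster-autoscaler.kubernetes.io/enabled \
    -o name 2>/dev/null) || labeled=""
  if [ -n "$labeled" ]; then
    echo "detected via node labels"
    return 0
  fi

  # Step 2: fall back to the cluster-autoscaler-status ConfigMap.
  if kubectl get configmap cluster-autoscaler-status \
      -n kube-system >/dev/null 2>&1; then
    echo "detected via ConfigMap"
    return 0
  fi

  echo "not detected"
  return 1
}
```

If this reports "not detected" while autoscaling is enabled on the node pool, the Troubleshooting section applies.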

Troubleshooting

Issue: AI Runway Shows "Not Detected"

Check if autoscaling is enabled:

az aks nodepool show \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--query enableAutoScaling

If false, enable it:

az aks nodepool update \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10

Issue: Pods Stay Pending, No Scale-Up

  1. Check pod scheduling failure reason:

    kubectl describe pod <pod-name> -n <namespace>

    Look for events like:

    • Insufficient nvidia.com/gpu → Autoscaler should help
    • node(s) didn't match Pod's node affinity → Configuration issue
  2. Check node pool max capacity:

    az aks nodepool show \
    --resource-group $RESOURCE_GROUP \
    --cluster-name $CLUSTER_NAME \
    --name $NODE_POOL_NAME \
    --query '{current: count, max: maxCount}'

    If at max, increase:

    az aks nodepool update ... --max-count 20
  3. Check Azure quota:

    az vm list-usage --location eastus --query "[?contains(name.value, 'NC')]" -o table

    Request increase if needed: https://aka.ms/azure-quota
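In addition to the steps above, the autoscaler's own decisions can be inspected through pod events. The sketch below assumes the event reasons TriggeredScaleUp and NotTriggerScaleUp, which the upstream Kubernetes Cluster Autoscaler records on pending pods; the function name show_scale_up_events is hypothetical.

```shell
# Hypothetical helper: lists Pending pods and the scale-up events the
# cluster autoscaler has recorded in a namespace.
show_scale_up_events() {
  local namespace="$1"

  # Pods the scheduler has not been able to place yet.
  kubectl get pods -n "$namespace" --field-selector=status.phase=Pending

  # Events emitted when the autoscaler decided to add a node.
  kubectl get events -n "$namespace" --field-selector reason=TriggeredScaleUp

  # Events emitted when the autoscaler decided it could not help.
  kubectl get events -n "$namespace" --field-selector reason=NotTriggerScaleUp
}
```

A NotTriggerScaleUp event usually means that no node pool could fit the pod even after scaling, which points back to the node affinity, max-count, or quota checks above.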

Issue: Slow Scale-Up (>10 minutes)

GPU node scale-up typically takes 5-10 minutes. If longer:


Cost Optimization

Scale to Zero

Allow GPU nodes to scale to zero when idle:

az aks nodepool update \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--update-cluster-autoscaler \
--min-count 0 \
--max-count 10

Trade-offs:

  • ✅ Maximum cost savings
  • ❌ First deployment takes 5-10 minutes to provision
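To see how long a scale-up from zero actually takes in your cluster, you can poll for a Ready node in the pool. This is a minimal sketch: the function name wait_for_pool_node and its default timeout and interval are hypothetical, and it assumes AKS labels each node with agentpool=<node pool name>.

```shell
# Hypothetical helper: polls until at least one node in the pool is Ready,
# or until the timeout (in seconds) expires.
wait_for_pool_node() {
  local pool="$1" timeout="${2:-900}" interval="${3:-15}"
  local elapsed=0 ready

  while [ "$elapsed" -lt "$timeout" ]; do
    # Count nodes whose STATUS column is exactly "Ready".
    ready=$(kubectl get nodes -l "agentpool=$pool" --no-headers 2>/dev/null |
      awk '$2 == "Ready" { n++ } END { print n + 0 }')
    if [ "$ready" -gt 0 ]; then
      echo "$ready Ready node(s) in pool $pool after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "no Ready node in pool $pool after ${timeout}s" >&2
  return 1
}
```

Usage: wait_for_pool_node "$NODE_POOL_NAME" 900 15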

Reference