# KEDA Auto-Scaler for inference workloads
- Feature status: Alpha
## Overview
This document outlines the steps to enable intelligent autoscaling based on service monitoring metrics for KAITO inference workloads, using the following components and features:

- KEDA - the Kubernetes-based Event Driven Autoscaling component
- keda-kaito-scaler - a dedicated KEDA external scaler, eliminating the need for external dependencies such as Prometheus
- KAITO InferenceSet CRD and controller - a new CRD and controller built on top of the KAITO workspace for intelligent autoscaling, introduced as an alpha feature in KAITO v0.8.0
## Architecture

## Prerequisites
### Install KEDA

The following example demonstrates how to install KEDA using its Helm chart. For other installation methods, refer to the KEDA installation guide.
```bash
export KEDA_NAMESPACE=kaito-workspace
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace $KEDA_NAMESPACE --create-namespace
```
### Install keda-kaito-scaler

> **Tip:** Ensure that keda-kaito-scaler is installed in the same namespace as KEDA.

```bash
helm repo add keda-kaito-scaler https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project
helm upgrade --install keda-kaito-scaler -n $KEDA_NAMESPACE keda-kaito-scaler/keda-kaito-scaler --create-namespace
```
## Enable this feature

This feature is available starting from KAITO v0.8.0, and the InferenceSet controller must be enabled during the KAITO installation:

```bash
export CLUSTER_NAME=kaito
helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
helm repo update
helm upgrade --install kaito-workspace kaito/workspace \
  --namespace kaito-workspace \
  --create-namespace \
  --set clusterName="$CLUSTER_NAME" \
  --set featureGates.enableInferenceSetController=true \
  --wait
```
## Quickstart

### Create a KAITO InferenceSet for running inference workloads
The following example creates an InferenceSet for the phi-4-mini model, using annotations with the `scaledobject.kaito.sh/` prefix to supply parameter inputs for the KEDA Kaito Scaler:

- `scaledobject.kaito.sh/auto-provision` - required, specifies whether the KEDA Kaito Scaler will automatically provision a ScaledObject based on the InferenceSet object
- `scaledobject.kaito.sh/metricName` - optional, specifies the metric name collected from the vLLM pod, which is used for monitoring and triggering the scaling operation; default is `vllm:num_requests_waiting`
- `scaledobject.kaito.sh/threshold` - required, specifies the threshold for the monitored metric that triggers the scaling operation
```bash
cat <<EOF | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: InferenceSet
metadata:
  annotations:
    scaledobject.kaito.sh/auto-provision: "true"
    scaledobject.kaito.sh/metricName: "vllm:num_requests_waiting"
    scaledobject.kaito.sh/threshold: "10"
  name: phi-4-mini
  namespace: default
spec:
  labelSelector:
    matchLabels:
      apps: phi-4-mini
  replicas: 1
  nodeCountLimit: 5
  template:
    inference:
      preset:
        accessMode: public
        name: phi-4-mini-instruct
    resource:
      instanceType: Standard_NC24ads_A100_v4
EOF
```
In just a few seconds, the KEDA Kaito Scaler will automatically create the `scaledobject` and `hpa` objects. After a few minutes, once the inference pod is running, the KEDA Kaito Scaler will begin scraping metric values from the inference pod, and the status of the `scaledobject` and `hpa` objects will be marked as ready.
```bash
# kubectl get scaledobject
NAME         SCALETARGETKIND                  SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED   TRIGGERS   AUTHENTICATIONS           AGE
phi-4-mini   kaito.sh/v1alpha1.InferenceSet   phi-4-mini        1     5     True    True     False      False    external   keda-kaito-scaler-creds   10m

# kubectl get hpa
NAME                  REFERENCE                 TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-phi-4-mini   InferenceSet/phi-4-mini   0/10 (avg)   1         5         1          11m
```
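The auto-provisioned object can be inspected with `kubectl get scaledobject phi-4-mini -o yaml`. As a rough sketch only — the exact trigger metadata keys and the scaler address are generated by keda-kaito-scaler and are assumptions here — it follows the standard KEDA external-scaler shape:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: phi-4-mini
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: kaito.sh/v1alpha1
    kind: InferenceSet
    name: phi-4-mini
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: external
      metadata:
        # Address of the keda-kaito-scaler gRPC service (assumed value)
        scalerAddress: keda-kaito-scaler.kaito-workspace.svc.cluster.local:9090
        metricName: "vllm:num_requests_waiting"
        threshold: "10"
      authenticationRef:
        name: keda-kaito-scaler-creds
```

The `min`/`max` bounds, trigger type, and authentication reference above match the `kubectl get scaledobject` output; everything else should be verified against the real generated object.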
That's it! Your KAITO workloads will now automatically scale based on the number of waiting inference requests (`vllm:num_requests_waiting`).

In the example below, when `vllm:num_requests_waiting` exceeds the threshold (10) for more than 60 seconds, KEDA scales up a new `InferenceSet/phi-4-mini` replica.
```bash
Every 2.0s: kubectl describe hpa

Name:                keda-hpa-phi-4-mini
Namespace:           default
Labels:              app.kubernetes.io/managed-by=keda-operator
                     app.kubernetes.io/name=keda-hpa-phi-4-mini
                     app.kubernetes.io/part-of=phi-4-mini
                     app.kubernetes.io/version=2.18.1
                     scaledobject.keda.sh/name=phi-4-mini
Annotations:         scaledobject.kaito.sh/managed-by: keda-kaito-scaler
CreationTimestamp:   Tue, 09 Dec 2025 03:35:09 +0000
Reference:           InferenceSet/phi-4-mini
Metrics:             ( current / target )
  "s0-vllm:num_requests_waiting" (target average value):  58 / 10
Min replicas:        1
Max replicas:        5
Behavior:
  Scale Up:
    Stabilization Window: 60 seconds
    Select Policy: Max
    Policies:
      - Type: Pods  Value: 1  Period: 300 seconds
  Scale Down:
    Stabilization Window: 300 seconds
    Select Policy: Max
    Policies:
      - Type: Pods  Value: 1  Period: 600 seconds
InferenceSet pods:   2 current / 2 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s0-vllm:num_requests_waiting(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: phi-4-mini,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    ScaleUpLimit      the desired replica count is increasing faster than the maximum scale rate
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  33s   horizontal-pod-autoscaler  New size: 2; reason: external metric s0-vllm:num_requests_waiting(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: phi-4-mini,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
```
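The jump from 1 to 2 replicas above follows the standard HPA calculation combined with the `Scale Up` policy: the raw desired count is `ceil(currentReplicas * currentAverage / targetAverage)`, clamped to the min/max bounds, and then rate-limited by the `Type: Pods, Value: 1` policy. A minimal sketch in Python (the function name is ours, not part of KEDA or KAITO):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         metric_avg: float,
                         target_avg: float,
                         min_replicas: int,
                         max_replicas: int,
                         pods_policy_limit: int) -> int:
    """Approximate the HPA AverageValue calculation plus a Pods scale-up policy."""
    # Core HPA rule: desired = ceil(current * currentAverage / targetAverage)
    desired = math.ceil(current_replicas * metric_avg / target_avg)
    # Clamp to the configured replica bounds
    desired = max(min_replicas, min(max_replicas, desired))
    # Scale-up policy "Type: Pods, Value: 1" allows at most N new pods per period,
    # which is why the HPA reports ScalingLimited / ScaleUpLimit above
    if desired > current_replicas:
        desired = min(desired, current_replicas + pods_policy_limit)
    return desired

# Values from the describe output: average of 58 waiting requests, target 10,
# bounds 1..5, at most 1 new pod per period:
print(hpa_desired_replicas(1, 58, 10, 1, 5, 1))  # -> 2
```

The raw desired count here is `ceil(58 / 10) = 6`, clamped to the max of 5, then limited by the policy to one new pod per period — hence "New size: 2" in the events.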