# Quick Start
Install KAITO first by following the installation guide. After installation, you can quickly deploy a phi-4-mini-instruct inference service to get started.
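For reference, a Helm-based install typically looks like the sketch below. This is a minimal sketch, assuming a local checkout of the KAITO repository; the chart path and namespace are assumptions here, so treat the installation guide as authoritative.

```bash
# Minimal sketch, assuming a local checkout of the KAITO repo.
# The chart path and namespace are assumptions; follow the
# installation guide for the authoritative commands.
git clone https://github.com/kaito-project/kaito.git
cd kaito
helm install kaito-workspace ./charts/kaito/workspace \
  --namespace kaito-workspace \
  --create-namespace
```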
## Deploy Your First Model
### Option 1: Auto-provision GPU nodes

The following cloud providers support auto-provisioning GPU nodes:

- Azure
- AWS
#### Azure

If you have not already, follow the steps in the gpu-provisioner installation guide to install the gpu-provisioner Helm chart.
Create a YAML file named `phi-4-workspace.yaml` with the following content. The `instanceType` field specifies which node type will be auto-provisioned, rather than only matching existing nodes as in the BYO case.
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini
resource:
  instanceType: "Standard_NC6s_v3" # Specifies the node type that will be auto-provisioned.
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct
```
#### AWS

If you have not already, follow the steps in the Karpenter installation guide to install the Karpenter Helm chart.
Create a YAML file named `phi-4-workspace.yaml` with the following content. The `instanceType` field specifies which node type will be auto-provisioned, rather than only matching existing nodes as in the BYO case.
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini
resource:
  instanceType: "g5.4xlarge" # Specifies the node type that will be auto-provisioned.
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct
```
Apply your configuration to your cluster:
```bash
kubectl apply -f phi-4-workspace.yaml
```
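Auto-provisioning a GPU node can take several minutes. Optionally, you can watch for the new node to register and become Ready while the workspace is being reconciled:

```bash
# Watch nodes join the cluster; a new GPU node should appear and become Ready.
kubectl get nodes --watch
```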
### Option 2: Bring your own GPU nodes
Before using this option, ensure that:
- KAITO was installed with Node Auto Provisioning disabled: `--set featureGates.disableNodeAutoProvisioning=true`.
- You have existing GPU nodes in your cluster with the device plugin and GPU drivers installed.
- Your GPU nodes are properly labeled for workload selection, as shown in the example after this list.
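For example, a node can be labeled for selection like this (`<your-gpu-node>` is a placeholder for an actual node name; the label key and value just need to match the Workspace's `labelSelector` below):

```bash
# Label an existing GPU node so the Workspace's labelSelector can match it.
kubectl label node <your-gpu-node> apps=llm-inference
```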
Assuming the GPU node has been labeled `apps=llm-inference` for a KAITO Workspace to select, create a YAML file named `phi-4-workspace.yaml` with the following content. Make sure that `resource.instanceType` is left empty when using BYO nodes.
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini
resource:
  labelSelector:
    matchLabels:
      apps: llm-inference
inference:
  preset:
    name: phi-4-mini-instruct
```
Apply your configuration to your cluster:
```bash
kubectl apply -f phi-4-workspace.yaml
```
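To double-check that the inference pod landed on one of your labeled GPU nodes, you can list pods with their node assignments (assuming, as is the default, that the pod name is prefixed with the workspace name):

```bash
# Show which node each pod was scheduled on.
kubectl get pods -o wide | grep workspace-phi-4-mini
```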
## Monitor Deployment
Track the workspace status to see when the model has been deployed successfully:
```bash
kubectl get workspace workspace-phi-4-mini
```
When the `WORKSPACESUCCEEDED` column becomes `True`, the model has been deployed successfully:
```
NAME                   INSTANCE           RESOURCEREADY   INFERENCEREADY   JOBSTARTED   WORKSPACESUCCEEDED   AGE
workspace-phi-4-mini   Standard_NC6s_v3   True            True                          True                 4h15m
```
The `INSTANCE` column will be empty if BYO nodes are used. Otherwise, it shows the specific instance type used.
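If a condition stays `False` for a long time, describing the workspace surfaces its conditions and events, which usually point at the underlying issue (for example, quota or scheduling failures):

```bash
# Inspect workspace conditions and events for troubleshooting.
kubectl describe workspace workspace-phi-4-mini
```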
## Test the Model
Find the inference service's cluster IP and test it using a temporary curl pod:
```bash
# Get the service endpoint
kubectl get svc workspace-phi-4-mini
export CLUSTERIP=$(kubectl get svc workspace-phi-4-mini -o jsonpath="{.spec.clusterIPs[0]}")

# List available models
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -s http://$CLUSTERIP/v1/models | jq
```
You should see output similar to:
```json
{
  "object": "list",
  "data": [
    {
      "id": "phi-4-mini-instruct",
      "object": "model",
      "created": 1733370094,
      "owned_by": "vllm",
      "root": "/workspace/vllm/weights",
      "parent": null,
      "max_model_len": 16384
    }
  ]
}
```
## Make an Inference Call
Now make an inference call using the model:
```bash
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini-instruct",
    "messages": [{"role": "user", "content": "What is kubernetes?"}],
    "max_tokens": 50,
    "temperature": 0
  }'
```
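If you prefer testing from your local machine rather than a temporary pod, port-forwarding the service also works. This assumes the service listens on port 80, consistent with the URLs used above:

```bash
# Forward local port 8080 to the inference service, then call it locally.
kubectl port-forward svc/workspace-phi-4-mini 8080:80
# In another terminal:
curl http://localhost:8080/v1/models
```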
🎉 Congratulations! You've successfully deployed and tested your first model with KAITO.
## What's Next
- Supported Models: Check out how KAITO supports models from Hugging Face in the presets documentation.