Quick Start
After installing KAITO, you can quickly deploy a phi-4-mini-instruct inference service to get started.
Prerequisites
- A Kubernetes cluster with KAITO installed (see Installation); a quick verification check is shown after this list
- kubectl configured to access your cluster
- GPU nodes available in your cluster, through one of two mutually exclusive options:
  - Bring your own GPU (BYO) nodes: If you already have GPU nodes in your cluster, you can use the preferred nodes approach (shown below). Requires Node Auto Provisioning to be disabled.
  - Auto-provisioning: Set up automatic GPU node provisioning for your cloud provider (see these steps). Cannot be used with BYO nodes.
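Before moving on, you can optionally sanity-check the KAITO installation. This is a minimal sketch: the workspaces.kaito.sh CRD name follows from the kaito.sh API group used in the manifests below, and the kaito-workspace namespace is an assumption based on a default Helm install, so adjust it if you installed KAITO elsewhere.
# Confirm the Workspace CRD is registered
kubectl get crd workspaces.kaito.sh
# Confirm the KAITO controller pods are running (namespace may differ in your install)
kubectl get pods -n kaito-workspace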
Deploy Your First Model
Your model can be deployed either on your own GPU nodes, or KAITO can auto-provision nodes for it to run on. These options are mutually exclusive; you must choose one approach.
BYO nodes and Auto-provisioning are mutually exclusive. If you're using your own GPU nodes, ensure KAITO was installed with --set featureGates.disableNodeAutoProvisioning=true. If you're using auto-provisioning, do not use this flag.
Option 1: Bring your own GPU nodes
Before using this option, ensure that:
- KAITO was installed with Node Auto Provisioning disabled: --set featureGates.disableNodeAutoProvisioning=true (see the check after this list)
- You have existing GPU nodes in your cluster
- Your GPU nodes are properly labeled for workload selection
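One way to double-check the feature gate on an existing installation is to inspect the Helm release values. This sketch assumes the release is named kaito-workspace and was installed into the kaito-workspace namespace; adjust both to match your setup.
# Print the values for the KAITO release and look for
# featureGates.disableNodeAutoProvisioning: true
helm get values kaito-workspace -n kaito-workspace --all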
Let's start by deploying a phi-4-mini-instruct model on your existing GPU nodes.
First, get the nodes:
kubectl get nodes -l accelerator=nvidia
The output should look similar to this, showing all your GPU nodes.
NAME                        STATUS   ROLES    AGE     VERSION
gpunp-26695285-vmss000000   Ready    <none>   2d21h   v1.31.9
gpunp-26695285-vmss000001   Ready    <none>   2d21h   v1.31.9
The GPU nodes need a label so that a KAITO Workspace can select them. We'll use the label apps=llm-inference for this example. Label the nodes you want to use, then verify the labels as shown below.
kubectl label node gpunp-26695285-vmss000001 apps=llm-inference
kubectl label node gpunp-26695285-vmss000000 apps=llm-inference
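You can confirm the labels were applied by listing nodes with the same selector; both nodes from the earlier output should appear.
kubectl get nodes -l apps=llm-inference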
Create a YAML file named phi-4-workspace.yaml with the following content. The label selector here must match the labels you set on the nodes.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini
resource:
  labelSelector:
    matchLabels:
      apps: llm-inference
inference:
  preset:
    name: phi-4-mini-instruct
Make sure that resource.instanceType is left empty when using BYO nodes, as it is only used in the auto-provisioning scenario.
Apply your configuration to your cluster:
kubectl apply -f phi-4-workspace.yaml
The nodes will be populated with their specs using NVIDIA GPU Feature Discovery, and KAITO will verify that your nodes have sufficient memory, GPU count, and so on to run the model.
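If you want to confirm that GPU capacity was discovered, you can inspect one of the labeled nodes; the Capacity and Allocatable sections should list the nvidia.com/gpu resource once discovery has run.
# Check that the node advertises GPUs (replace the node name with one of yours)
kubectl describe node gpunp-26695285-vmss000000 | grep -i "nvidia.com/gpu"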
Option 2: Auto-provision GPU nodes
The following cloud providers support auto-provisioning GPU nodes in addition to BYO nodes.
- Azure
- AWS
Azure: If you have not already, follow the steps here to install the gpu-provisioner Helm chart.
Ensure that the chart is ready and the pods are running (using these steps); otherwise, the GPU nodes will not be provisioned.
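As a quick readiness check before creating the workspace, you can verify the gpu-provisioner release and its pods. This sketch assumes the chart was installed as a release named gpu-provisioner in the gpu-provisioner namespace; adjust to match your installation.
# The release should be deployed and the pods Running
helm status gpu-provisioner -n gpu-provisioner
kubectl get pods -n gpu-provisioner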
Create a YAML file named phi-4-workspace.yaml with the following content. The instanceType field specifies which node type will be auto-provisioned, rather than only matching existing nodes as in the BYO case.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini
resource:
  instanceType: "Standard_NC6s_v3" # Specifies the node type that will be auto-provisioned.
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct
AWS: If you have not already, follow the steps here to install the Karpenter Helm chart.
Ensure that the chart is ready and pods are running, otherwise the GPU nodes will not be provisioned.
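Similarly, a quick way to confirm Karpenter is healthy is to check its controller pods. The karpenter namespace here is an assumption that depends on how you installed the chart.
# The Karpenter controller pods should be Running
kubectl get pods -n karpenter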
Create a YAML file named phi-4-workspace.yaml with the following content. The instanceType field specifies which node type will be auto-provisioned, rather than only matching existing nodes as in the BYO case.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini
resource:
  instanceType: "g5.4xlarge" # Specifies the node type that will be auto-provisioned.
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct
Apply your configuration to your cluster:
kubectl apply -f phi-4-workspace.yaml
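Auto-provisioning a new GPU node can take several minutes. While the workspace reconciles, you can watch for the node to register and become Ready.
# Watch for the auto-provisioned GPU node to join the cluster
kubectl get nodes -w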
Monitor Deployment
Track the workspace status to see when the model has been deployed successfully:
kubectl get workspace workspace-phi-4-mini
When the WORKSPACESUCCEEDED column becomes True, the model has been deployed successfully:
NAME                   INSTANCE                   RESOURCEREADY   INFERENCEREADY   JOBSTARTED   WORKSPACESUCCEEDED   AGE
workspace-phi-4-mini   Standard_NC24ads_A100_v4   True            True                          True                 4h15m
The INSTANCE column will default to Standard_NC24ads_A100_v4 if you have not set up auto-provisioning. If you have auto-provisioning configured, it will show the specific instance type used.
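If any of the readiness columns stay False, the workspace conditions and events usually explain why (for example, insufficient GPU capacity or an image still being pulled). The pod filter below assumes the inference pods are named after the workspace, which is the usual pattern.
# Show detailed conditions and events for the workspace
kubectl describe workspace workspace-phi-4-mini
# Check the inference pods created for the workspace
kubectl get pods -o wide | grep workspace-phi-4-mini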
Test the Model
Find the inference service's cluster IP and test it using a temporary curl pod:
# Get the service endpoint
kubectl get svc workspace-phi-4-mini
export CLUSTERIP=$(kubectl get svc workspace-phi-4-mini -o jsonpath="{.spec.clusterIPs[0]}")
# List available models
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -s http://$CLUSTERIP/v1/models | jq
You should see output similar to:
{
  "object": "list",
  "data": [
    {
      "id": "phi-4-mini-instruct",
      "object": "model",
      "created": 1733370094,
      "owned_by": "vllm",
      "root": "/workspace/vllm/weights",
      "parent": null,
      "max_model_len": 16384
    }
  ]
}
Make an Inference Call
Now make an inference call using the model:
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini-instruct",
    "messages": [{"role": "user", "content": "What is kubernetes?"}],
    "max_tokens": 50,
    "temperature": 0
  }'
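The endpoint serves the OpenAI-compatible chat completions format, so you can extract just the generated text from the response. This sketch assumes jq is installed on the machine running kubectl, as in the earlier model-listing example.
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -s -X POST http://$CLUSTERIP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini-instruct",
    "messages": [{"role": "user", "content": "What is kubernetes?"}],
    "max_tokens": 50,
    "temperature": 0
  }' | jq -r '.choices[0].message.content'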
Next Steps
🎉 Congratulations! You've successfully deployed and tested your first model with KAITO.
What's Next:
- Explore More Models: Check out the full range of supported models in the presets documentation
- Advanced Configurations: Learn about workspace configurations
- Custom Models: See how to contribute new models
Additional Resources:
- Fine-tuning Guide - Customize models for your specific use cases
- RAG Examples - Retrieval-Augmented Generation patterns
- Troubleshooting Guide - Common issues and solutions