Introduction
Coming soon: KAITO v0.5.0 adds retrieval-augmented generation (RAG) support through the RagEngine, with LlamaIndex orchestration and Faiss as the default vector database. Learn about recent updates here!
Latest Release: May 14th, 2025. KAITO v0.4.6.
First Release: Nov 15th, 2023. KAITO v0.1.0.
KAITO is an operator that automates AI/ML model inference and tuning workloads in a Kubernetes cluster. The target models are popular open-source large models such as Falcon and Phi-3.
Key Features
Compared to most mainstream model deployment methodologies built on virtual machine infrastructure, KAITO offers the following key differentiators:
- Container-based Model Management: Manage large model files using container images with an OpenAI-compatible server for inference calls
- Preset Configurations: Avoid adjusting workload parameters based on GPU hardware with built-in configurations
- Multiple Runtime Support: Support for popular inference runtimes including vLLM and transformers
- Auto-provisioning: Automatically provision GPU nodes based on model requirements
- Public Registry: Host large model images in the public Microsoft Container Registry (MCR) when licenses allow
Using KAITO, the workflow of onboarding large AI inference models in Kubernetes is greatly simplified.
Architecture
KAITO follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern. Users manage a `workspace` custom resource that describes the GPU requirements and the inference or tuning specification. KAITO controllers automate the deployment by reconciling the `workspace` custom resource.
The figure above presents an overview of the KAITO architecture. Its major components are:
- Workspace controller: Reconciles the `workspace` custom resource, creates `machine` custom resources to trigger node auto-provisioning, and creates the inference or tuning workload (`deployment`, `statefulset`, or `job`) based on the model preset configurations.
- Node provisioner controller: Named gpu-provisioner in the gpu-provisioner helm chart. It uses the `machine` CRD, which originated from Karpenter, to interact with the workspace controller, and it integrates with Azure Resource Manager REST APIs to add new GPU nodes to the AKS or AKS Arc cluster. The gpu-provisioner is an open-source component and can be replaced by other controllers that support the Karpenter-core APIs.
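To make the pattern concrete, a `workspace` custom resource might look like the following sketch. The API version, instance type, and preset name here are illustrative assumptions; consult the KAITO preset documentation for the exact fields and supported models in your version.

```yaml
# Hypothetical workspace manifest: requests a GPU node and deploys
# an inference workload from a built-in model preset.
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"   # GPU SKU to auto-provision (assumed value)
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"                 # preset carries tuned runtime parameters
```

Applying a manifest like this with `kubectl apply` is the only user-facing step; the controllers handle node provisioning and workload creation from there.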
Getting Started
👉 To get started, please see the Installation Guide!
👉 For a quick start tutorial, check out Quick Start!
Community
- GitHub: kaito-project/kaito
- Slack: Join our community
- Email: kaito-dev@microsoft.com