
Presets

What's NEW!

Best-effort Hugging Face vLLM model support

Starting from KAITO v0.9.0, generic Hugging Face models are supported on a best-effort basis. By specifying a Hugging Face model card ID as `inference.preset.name` in the KAITO Workspace or InferenceSet configuration, you can run any Hugging Face model whose architecture is supported by vLLM. KAITO retrieves the model metadata from the Hugging Face website and generates the preset configuration by analyzing that data. When the vLLM inference workload is created, KAITO downloads the model weights directly from Hugging Face. Below is an example that creates a Hugging Face inference workload using the model card ID Qwen/Qwen3-0.6B from https://huggingface.co/Qwen/Qwen3-0.6B:

:::tip

For certain Hugging Face models that require authentication, configure `inference.preset.presetOptions.modelAccessSecret` to reference a Secret containing a Hugging Face access token under the `HF_TOKEN` key.

:::

```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: qwen3-06b
resource:
  instanceType: Standard_NC24ads_A100_v4
  labelSelector:
    matchLabels:
      apps: qwen3-06b
inference:
  preset:
    name: Qwen/Qwen3-0.6B
```

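If the model you reference is gated and requires authentication, the tip above applies. A minimal sketch of such a Secret is shown below; the Secret name `hf-token` is an illustrative choice, and only the `HF_TOKEN` key matters to KAITO:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token            # illustrative name
type: Opaque
stringData:
  HF_TOKEN: <your Hugging Face access token>
```

The workspace then points `inference.preset.presetOptions.modelAccessSecret` at that Secret by name:

```yaml
inference:
  preset:
    name: <gated Hugging Face model card ID>
    presetOptions:
      modelAccessSecret: hf-token
```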

The currently supported built-in model families with preset configurations are listed below.

| Model Family | Compatible KAITO Versions |
|--------------|---------------------------|
| deepseek     | v0.6.0+                   |
| falcon       | v0.0.1+                   |
| gemma-3      | v0.8.0+                   |
| gpt-oss      | v0.7.0+                   |
| llama        | v0.4.6+                   |
| mistral      | v0.2.0+                   |
| phi-3        | v0.3.0+                   |
| phi-4        | v0.4.1+                   |
| qwen         | v0.4.1+                   |
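Built-in presets are referenced by their preset name rather than a Hugging Face model card ID. The sketch below assumes the `falcon-7b-instruct` preset and reuses the instance type from the earlier example; check each model's documentation for its exact preset names and hardware requirements:

```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: Standard_NC24ads_A100_v4
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: falcon-7b-instruct   # assumed preset name for the falcon family
```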

Validation

Each built-in preset model has its own hardware requirements, in terms of GPU count and GPU memory, defined in its respective model.go file. The KAITO controller performs a validation check to determine whether the specified SKU and node count are sufficient to run the model. If the provided SKU is not in the known list, the controller bypasses this validation check, which means users must ensure the model can run on the provided SKU.
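Concretely, the fields the controller inspects are the instance type and node count in the `resource` block. A minimal illustration, assuming the `resource.count` field for node count (the values are placeholders, not a recommendation for any particular preset):

```yaml
resource:
  instanceType: Standard_NC24ads_A100_v4  # checked against the preset's GPU requirements when the SKU is known
  count: 1                                # node count included in the sufficiency check
```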

Distributed inference

For models that support distributed inference, when the node count is larger than one, Torch Distributed Elastic is configured with master and worker pods running across multiple nodes, and the service endpoint is the master pod.

The following preset models support multi-node distributed inference:

| Model Family | Models                   | Multi-Node Support |
|--------------|--------------------------|--------------------|
| deepseek     | deepseek-r1, deepseek-v3 | Yes                |
| llama        | llama-3.3-70b-instruct   | Yes                |
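As a sketch only, a multi-node inference workspace sets a node count greater than one. The manifest below assumes the `resource.count` field and the `llama-3.3-70b-instruct` preset name from the table above; the instance type is a placeholder, and the actual GPU requirements for these large models are documented per preset:

```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-llama-3-3-70b
resource:
  instanceType: <GPU instance type meeting the preset's requirements>
  count: 2                      # more than one node enables Torch Distributed Elastic
  labelSelector:
    matchLabels:
      apps: llama-3-3-70b
inference:
  preset:
    name: llama-3.3-70b-instruct
```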

For detailed information on configuring and using multi-node inference, see the Multi-Node Inference documentation.