Providers

Engine & Provider Selection

When spec.engine.type is omitted, the controller auto-selects the engine from provider capabilities. When spec.provider.name is omitted, the controller auto-selects a provider using CEL-based selection rules from InferenceProviderConfig resources. Each provider declares rules with priorities; the highest-priority match wins.

Engine Auto-Selection

The controller selects the engine by scanning per-engine capabilities from all ready InferenceProviderConfig resources:

  1. Filter engines by per-engine compatibility with the deployment:
    • GPU/CPU: each engine declares gpuSupport and cpuSupport independently
    • Serving mode: each engine declares its supported servingModes
  2. Rank available engines by preference: vllm > sglang > trtllm > llamacpp
  3. Pick the first available engine by preference

The selected engine is stored in status.engine.type with a reason in status.engine.selectedReason.
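
The filter-then-rank steps above can be sketched in a few lines of Python. This is a minimal illustration, not the controller's implementation: the EngineCapability class, its field names (mirroring gpuSupport, cpuSupport, and servingModes), the "aggregated" mode name, the sample capability data, and the wording of the reason strings are all assumptions made for the example.

```python
from dataclasses import dataclass

# Preference order stated in the text: vllm > sglang > trtllm > llamacpp.
ENGINE_PREFERENCE = ["vllm", "sglang", "trtllm", "llamacpp"]


@dataclass(frozen=True)
class EngineCapability:
    """Hypothetical in-memory view of one engine entry in a provider config."""
    engine: str
    gpu_support: bool
    cpu_support: bool
    serving_modes: frozenset


def select_engine(capabilities, gpu_count, serving_mode):
    """Return (engine, reason), or (None, reason) when nothing is compatible."""
    needs_gpu = gpu_count > 0
    compatible = set()
    # Step 1: keep engines whose compute and serving-mode support match.
    for cap in capabilities:
        compute_ok = cap.gpu_support if needs_gpu else cap.cpu_support
        if compute_ok and serving_mode in cap.serving_modes:
            compatible.add(cap.engine)
    # Steps 2 and 3: walk the preference list and take the first survivor.
    compute = "GPU" if needs_gpu else "CPU"
    for engine in ENGINE_PREFERENCE:
        if engine in compatible:
            reason = f"{engine} is the most preferred engine supporting {compute} and {serving_mode} mode"
            return engine, reason
    return None, "no ready provider declares a compatible engine"


# Illustrative capability data only; real values come from the installed configs.
capabilities = [
    EngineCapability("vllm", True, False, frozenset({"aggregated", "disaggregated"})),
    EngineCapability("sglang", True, False, frozenset({"aggregated", "disaggregated"})),
    EngineCapability("llamacpp", True, True, frozenset({"aggregated"})),
]

gpu_engine, gpu_reason = select_engine(capabilities, gpu_count=1, serving_mode="aggregated")
cpu_engine, cpu_reason = select_engine(capabilities, gpu_count=0, serving_mode="aggregated")
print(gpu_engine, "|", gpu_reason)
print(cpu_engine, "|", cpu_reason)
```

With this sample data, a GPU deployment resolves to vllm, while a CPU-only deployment falls through the preference list to llamacpp, the only engine here that declares CPU support.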

Provider Auto-Selection

With the engine resolved, provider selection evaluates the CEL selection rules from each InferenceProviderConfig and picks the highest-priority match.

Default selection behavior depends on the InferenceProviderConfig resources installed in the cluster. With the provider configs bundled in this repository, the shipped rules are:

IF gpu.count == 0 OR resources.gpu is omitted:
→ KAITO (CPU-capable provider), engine auto-selected to llamacpp when needed

IF engine == "trtllm" OR engine == "sglang":
→ Dynamo

IF engine == "llamacpp":
→ KAITO

IF mode == "disaggregated":
→ Dynamo

IF gpu.count > 1 AND engine == "vllm":
→ KubeRay

IF gpu.count > 0 AND engine == "vllm":
→ Dynamo

Note: Provider auto-selection is driven by registered InferenceProviderConfig.selectionRules; the core selector does not hard-code KubeRay, llm-d, or Direct vLLM. Providers with empty or no matching rules are explicit-only unless their installed config makes them selectable.

The selection reason is recorded in status.provider.selectedReason for observability.
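
As a rough model of how priority-ordered rules resolve, the sketch below encodes the shipped rules listed above as Python predicates. Several things here are assumptions for illustration: the lambdas stand in for CEL expressions, the numeric priorities are invented so that the rules apply in the order listed, the lowercase provider names and the "aggregated" mode name are placeholders, and the reason strings are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Deployment:
    engine: str
    gpu_count: int = 0        # 0 also models an omitted resources.gpu
    mode: str = "aggregated"  # assumed name for the non-disaggregated mode


@dataclass(frozen=True)
class SelectionRule:
    provider: str
    priority: int  # hypothetical values; the highest matching priority wins
    condition: Callable[[Deployment], bool]  # stand-in for a CEL expression
    description: str


RULES = [
    SelectionRule("kaito", 100, lambda d: d.gpu_count == 0, "gpu.count == 0"),
    SelectionRule("dynamo", 90, lambda d: d.engine in ("trtllm", "sglang"),
                  'engine == "trtllm" OR engine == "sglang"'),
    SelectionRule("kaito", 80, lambda d: d.engine == "llamacpp", 'engine == "llamacpp"'),
    SelectionRule("dynamo", 70, lambda d: d.mode == "disaggregated", 'mode == "disaggregated"'),
    SelectionRule("kuberay", 60, lambda d: d.gpu_count > 1 and d.engine == "vllm",
                  'gpu.count > 1 AND engine == "vllm"'),
    SelectionRule("dynamo", 50, lambda d: d.gpu_count > 0 and d.engine == "vllm",
                  'gpu.count > 0 AND engine == "vllm"'),
]


def select_provider(deployment, rules=RULES):
    """Return (provider, reason) for the highest-priority matching rule."""
    matches = [rule for rule in rules if rule.condition(deployment)]
    if not matches:
        return None, "no selection rule matched"
    best = max(matches, key=lambda rule: rule.priority)
    return best.provider, f"matched rule: {best.description} (priority {best.priority})"


for d in [
    Deployment("llamacpp"),
    Deployment("vllm", gpu_count=1),
    Deployment("vllm", gpu_count=4),
    Deployment("vllm", gpu_count=4, mode="disaggregated"),
]:
    print(d, "->", select_provider(d))
```

Under these assumed priorities, a multi-GPU vLLM deployment goes to KubeRay unless it is disaggregated, in which case the earlier Dynamo rule takes precedence.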

Provider Capability Matrix

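| Criteria | KAITO | Dynamo | KubeRay | llm-d | Direct vLLM |
|---|---|---|---|---|---|
| CPU inference | Yes | No | No | No | No |
| GPU inference | Yes | Yes | Yes | Yes | Yes |
| vLLM engine | Yes | Yes | Yes | Yes | Yes |
| sglang engine | No | Yes | No | No | No |
| trtllm engine | No | Yes | No | No | No |
| llamacpp engine | Yes | No | No | No | No |
| Disaggregated P/D | No | Yes | Yes | Yes | No |
| Self-managed InferencePool | No | Yes | No | No | No |
| Self-managed EPP | No | Yes | No | No | No |
| Customizable EPP image/config | No | No | No | Yes | No |
| Auto-selection | Yes | Yes | Via selection rules | Explicit/config rules only | Explicit only |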
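
To check which providers satisfy a set of requirements, the matrix can be treated as a lookup table. The sketch below is a hypothetical helper, not part of AI Runway: the capability keys are invented labels for the matrix rows, and only the "Yes" cells are encoded.

```python
# Each provider maps to the set of matrix rows marked "Yes" for it.
# The key names are illustrative labels, not identifiers from the project.
CAPABILITIES = {
    "KAITO": {"cpu", "gpu", "vllm", "llamacpp"},
    "Dynamo": {
        "gpu", "vllm", "sglang", "trtllm", "disaggregated",
        "self-managed-inferencepool", "self-managed-epp",
    },
    "KubeRay": {"gpu", "vllm", "disaggregated"},
    "llm-d": {"gpu", "vllm", "disaggregated", "customizable-epp"},
    "Direct vLLM": {"gpu", "vllm"},
}


def providers_supporting(*required):
    """Return the providers whose capabilities include every requirement."""
    needed = set(required)
    return [name for name, caps in CAPABILITIES.items() if needed <= caps]


print(providers_supporting("cpu"))
print(providers_supporting("vllm", "disaggregated"))
print(providers_supporting("sglang"))
```

A lookup like this only answers whether a provider can serve a deployment; whether it is auto-selected still depends on the selection rules in the installed configs.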

Provider Abstraction

AI Runway supports two deployment methods, both using the provider abstraction pattern.

In the first method, users create ModelDeployment CRs directly, and the controller and provider controllers handle the rest:

  • Automatic provider selection based on capabilities
  • Unified status reporting
  • Provider-agnostic lifecycle management

Web UI Deployment

The Web UI backend reads provider information (capabilities, installation steps, Helm charts) from InferenceProviderConfig resources in the cluster. These resources are created by provider shims, so each provider shim must be installed (e.g., kubectl apply -f providers/kaito/deploy/kaito.yaml) before its provider appears in the UI. Once a provider is visible, the UI can trigger Helm-based installation of the upstream provider and create ModelDeployment CRs for model deployment, which are then handled by the controller and provider controllers.

Supported Providers

| Provider | Upstream CRD | Status | Shim YAML | Description |
|---|---|---|---|---|
| NVIDIA Dynamo | DynamoGraphDeployment | ✅ Available | dynamo.yaml | High-performance GPU inference with KV-cache routing and disaggregated serving |
| KubeRay | RayService | ✅ Available | kuberay.yaml | Ray-based distributed inference with autoscaling |
| KAITO | Workspace | ✅ Available | kaito.yaml | Flexible inference with vLLM (GPU) or llama.cpp (CPU/GPU) |
| llm-d | none | ✅ Available | llmd.yaml | Flexible inference with vLLM (GPU) with KV-cache routing and disaggregated serving |
| Direct vLLM | Deployment | ✅ Available | vllm.yaml | Direct vLLM OpenAI-compatible server deployments using spec.engine.image; see Direct vLLM guide |

KAITO Provider

The KAITO provider enables flexible inference with multiple backends:

  • vLLM Mode: GPU inference using vLLM engine with full HuggingFace model support
  • Pre-made GGUF: Ready-to-deploy quantized models from ghcr.io/kaito-project/aikit/*
  • HuggingFace GGUF: Run any GGUF model from HuggingFace directly (no build required)
  • CPU/GPU Flexibility: llama.cpp models can run on CPU nodes (no GPU required) or GPU nodes

| Mode | Engine | Compute | Use Case |
|---|---|---|---|
| vLLM | vLLM | GPU | High-performance GPU inference |
| Pre-made GGUF | llama.cpp | CPU/GPU | Ready-to-deploy quantized models |
| HuggingFace GGUF | llama.cpp | CPU/GPU | Run any HuggingFace GGUF model |

Build Infrastructure

For HuggingFace GGUF models, KAITO uses in-cluster image building:

┌────────────────┐     ┌──────────────┐     ┌─────────────────┐
│  HuggingFace   │────▶│   BuildKit   │────▶│   In-Cluster    │
│   GGUF Model   │     │ (K8s Driver) │     │    Registry     │
└────────────────┘     └──────────────┘     └────────┬────────┘
                                                     │
                                                     ▼
                                            ┌─────────────────┐
                                            │    KAITO Pod    │
                                            │   (llama.cpp)   │
                                            └─────────────────┘

  • RegistryService (backend/src/services/registry.ts): Manages in-cluster registry
  • BuildKitService (backend/src/services/buildkit.ts): Manages BuildKit builder
  • AikitService (backend/src/services/aikit.ts): Handles GGUF image building

See also