Providers

Engine & Provider Selection

When spec.engine.type is omitted, the controller auto-selects the engine from provider capabilities. When spec.provider.name is omitted, the controller auto-selects a provider using CEL-based selection rules from InferenceProviderConfig resources. Each provider declares rules with priorities; the highest-priority match wins.

Engine Auto-Selection

The controller selects the engine by scanning per-engine capabilities from all ready InferenceProviderConfig resources:

Filter engines by per-engine compatibility with the deployment:
- GPU/CPU: each engine declares gpuSupport and cpuSupport independently
- Serving mode: each engine declares its supported servingModes
Rank available engines by preference: vllm > sglang > trtllm > llamacpp
Pick the first available engine by preference

The selected engine is stored in status.engine.type with a reason in status.engine.selectedReason.

Provider Auto-Selection

With the engine resolved, provider selection evaluates CEL rules from each InferenceProviderConfig:

Default selection behavior depends on the InferenceProviderConfig resources installed in the cluster. With the provider configs bundled in this repository, the shipped rules are:

IF gpu.count == 0 OR resources.gpu is omitted:
    → KAITO (CPU-capable provider), engine auto-selected to llamacpp when needed

IF engine == "trtllm" OR engine == "sglang":
    → Dynamo

IF engine == "llamacpp":
    → KAITO

IF mode == "disaggregated":
    → Dynamo

IF gpu.count > 1 AND engine == "vllm":
    → KubeRay

IF gpu.count > 0 AND engine == "vllm":
    → Dynamo

Note: Provider auto-selection is driven by registered InferenceProviderConfig.selectionRules; the core selector does not hard-code KubeRay, llm-d, or Direct vLLM. Providers with empty or no matching rules are explicit-only unless their installed config makes them selectable.

The selection reason is recorded in status.provider.selectedReason for observability.

Provider Capability Matrix

Criteria	KAITO	Dynamo	KubeRay	llm-d	Direct vLLM
CPU inference	Yes	No	No	No	No
GPU inference	Yes	Yes	Yes	Yes	Yes
vLLM engine	Yes	Yes	Yes	Yes	Yes
sglang engine	No	Yes	No	No	No
trtllm engine	No	Yes	No	No	No
llamacpp engine	Yes	No	No	No	No
Disaggregated P/D	No	Yes	Yes	Yes	No
Self-managed InferencePool	No	Yes	No	No	No
Self-managed EPP	No	Yes	No	No	No
Customizable EPP image/config	No	No	No	Yes	No
Auto-selection	Yes	Yes	Via selection rules	Explicit/config rules only	Explicit only

Provider Abstraction

AI Runway supports two deployment methods, both using the provider abstraction pattern:

CRD-Based Deployment (Recommended)

Users create ModelDeployment CRs, and the controller + provider controllers handle the rest:

Automatic provider selection based on capabilities
Unified status reporting
Provider-agnostic lifecycle management

Web UI Deployment

The Web UI backend reads provider information (capabilities, installation steps, Helm charts) from InferenceProviderConfig CRDs in the cluster. These CRDs are created by provider shims — each provider shim must be installed (e.g., kubectl apply -f providers/kaito/deploy/kaito.yaml) before its provider appears in the UI. Once visible, the UI can trigger Helm-based upstream provider installation and creates ModelDeployment CRs for model deployment, which are then handled by the controller and provider controllers.

Supported Providers

Provider	Upstream CRD	Status	Shim YAML	Description
NVIDIA Dynamo	DynamoGraphDeployment	✅ Available	dynamo.yaml	High-performance GPU inference with KV-cache routing and disaggregated serving
KubeRay	RayService	✅ Available	kuberay.yaml	Ray-based distributed inference with autoscaling
KAITO	Workspace	✅ Available	kaito.yaml	Flexible inference with vLLM (GPU) or llama.cpp (CPU/GPU)
llm-d	none	✅ Available	llmd.yaml	Flexible inference with vLLM (GPU) with KV-cache routing and disaggregated serving
Direct vLLM	Deployment	✅ Available	vllm.yaml	Direct vLLM OpenAI-compatible server deployments using `spec.engine.image`; see Direct vLLM guide

KAITO Provider

The KAITO provider enables flexible inference with multiple backends:

vLLM Mode: GPU inference using vLLM engine with full HuggingFace model support
Pre-made GGUF: Ready-to-deploy quantized models from ghcr.io/kaito-project/aikit/*
HuggingFace GGUF: Run any GGUF model from HuggingFace directly (no build required)
CPU/GPU Flexibility: llama.cpp models can run on CPU nodes (no GPU required) or GPU nodes

Mode	Engine	Compute	Use Case
vLLM	vLLM	GPU	High-performance GPU inference
Pre-made GGUF	llama.cpp	CPU/GPU	Ready-to-deploy quantized models
HuggingFace GGUF	llama.cpp	CPU/GPU	Run any HuggingFace GGUF model

Build Infrastructure

For HuggingFace GGUF models, KAITO uses in-cluster image building:

┌────────────────┐     ┌──────────────┐     ┌─────────────────┐
│  HuggingFace   │────▶│  BuildKit    │────▶│  In-Cluster     │
│  GGUF Model    │     │  (K8s Driver)│     │  Registry       │
└────────────────┘     └──────────────┘     └─────────────────┘
                                                    │
                                                    ▼
                                            ┌─────────────────┐
                                            │  KAITO Pod      │
                                            │  (llama.cpp)    │
                                            └─────────────────┘

RegistryService (backend/src/services/registry.ts): Manages in-cluster registry
BuildKitService (backend/src/services/buildkit.ts): Manages BuildKit builder
AikitService (backend/src/services/aikit.ts): Handles GGUF image building

Engine & Provider Selection​

Engine Auto-Selection​

Provider Auto-Selection​

Provider Capability Matrix​

Provider Abstraction​

CRD-Based Deployment (Recommended)​

Web UI Deployment​

Supported Providers​

KAITO Provider​

Build Infrastructure​

Related Services​

See also​