Providers
Engine & Provider Selection
When spec.engine.type is omitted, the controller auto-selects the engine from provider capabilities. When spec.provider.name is omitted, the controller auto-selects a provider using CEL-based selection rules from InferenceProviderConfig resources. Each provider declares rules with priorities; the highest-priority match wins.
Engine Auto-Selection
The controller selects the engine by scanning per-engine capabilities from all ready InferenceProviderConfig resources:
- Filter engines by per-engine compatibility with the deployment:
- GPU/CPU: each engine declares
gpuSupportandcpuSupportindependently - Serving mode: each engine declares its supported
servingModes
- GPU/CPU: each engine declares
- Rank available engines by preference:
vllm>sglang>trtllm>llamacpp - Pick the first available engine by preference
The selected engine is stored in status.engine.type with a reason in status.engine.selectedReason.
Provider Auto-Selection
With the engine resolved, provider selection evaluates CEL rules from each InferenceProviderConfig:
Default selection behavior depends on the InferenceProviderConfig resources installed in the cluster. With the provider configs bundled in this repository, the shipped rules are:
IF gpu.count == 0 OR resources.gpu is omitted:
→ KAITO (CPU-capable provider), engine auto-selected to llamacpp when needed
IF engine == "trtllm" OR engine == "sglang":
→ Dynamo
IF engine == "llamacpp":
→ KAITO
IF mode == "disaggregated":
→ Dynamo
IF gpu.count > 1 AND engine == "vllm":
→ KubeRay
IF gpu.count > 0 AND engine == "vllm":
→ Dynamo
Note: Provider auto-selection is driven by registered InferenceProviderConfig.selectionRules; the core selector does not hard-code KubeRay, llm-d, or Direct vLLM. Providers with empty or no matching rules are explicit-only unless their installed config makes them selectable.
The selection reason is recorded in status.provider.selectedReason for observability.
Provider Capability Matrix
| Criteria | KAITO | Dynamo | KubeRay | llm-d | Direct vLLM |
|---|---|---|---|---|---|
| CPU inference | Yes | No | No | No | No |
| GPU inference | Yes | Yes | Yes | Yes | Yes |
| vLLM engine | Yes | Yes | Yes | Yes | Yes |
| sglang engine | No | Yes | No | No | No |
| trtllm engine | No | Yes | No | No | No |
| llamacpp engine | Yes | No | No | No | No |
| Disaggregated P/D | No | Yes | Yes | Yes | No |
| Self-managed InferencePool | No | Yes | No | No | No |
| Self-managed EPP | No | Yes | No | No | No |
| Customizable EPP image/config | No | No | No | Yes | No |
| Auto-selection | Yes | Yes | Via selection rules | Explicit/config rules only | Explicit only |
Provider Abstraction
AI Runway supports two deployment methods, both using the provider abstraction pattern:
CRD-Based Deployment (Recommended)
Users create ModelDeployment CRs, and the controller + provider controllers handle the rest:
- Automatic provider selection based on capabilities
- Unified status reporting
- Provider-agnostic lifecycle management
Web UI Deployment
The Web UI backend reads provider information (capabilities, installation steps, Helm charts) from InferenceProviderConfig CRDs in the cluster. These CRDs are created by provider shims — each provider shim must be installed (e.g., kubectl apply -f providers/kaito/deploy/kaito.yaml) before its provider appears in the UI. Once visible, the UI can trigger Helm-based upstream provider installation and creates ModelDeployment CRs for model deployment, which are then handled by the controller and provider controllers.
Supported Providers
| Provider | Upstream CRD | Status | Shim YAML | Description |
|---|---|---|---|---|
| NVIDIA Dynamo | DynamoGraphDeployment | ✅ Available | dynamo.yaml | High-performance GPU inference with KV-cache routing and disaggregated serving |
| KubeRay | RayService | ✅ Available | kuberay.yaml | Ray-based distributed inference with autoscaling |
| KAITO | Workspace | ✅ Available | kaito.yaml | Flexible inference with vLLM (GPU) or llama.cpp (CPU/GPU) |
| llm-d | none | ✅ Available | llmd.yaml | Flexible inference with vLLM (GPU) with KV-cache routing and disaggregated serving |
| Direct vLLM | Deployment | ✅ Available | vllm.yaml | Direct vLLM OpenAI-compatible server deployments using spec.engine.image; see Direct vLLM guide |
KAITO Provider
The KAITO provider enables flexible inference with multiple backends:
- vLLM Mode: GPU inference using vLLM engine with full HuggingFace model support
- Pre-made GGUF: Ready-to-deploy quantized models from
ghcr.io/kaito-project/aikit/* - HuggingFace GGUF: Run any GGUF model from HuggingFace directly (no build required)
- CPU/GPU Flexibility: llama.cpp models can run on CPU nodes (no GPU required) or GPU nodes
| Mode | Engine | Compute | Use Case |
|---|---|---|---|
| vLLM | vLLM | GPU | High-performance GPU inference |
| Pre-made GGUF | llama.cpp | CPU/GPU | Ready-to-deploy quantized models |
| HuggingFace GGUF | llama.cpp | CPU/GPU | Run any HuggingFace GGUF model |
Build Infrastructure
For HuggingFace GGUF models, KAITO uses in-cluster image building:
┌────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ HuggingFace │────▶│ BuildKit │────▶│ In-Cluster │
│ GGUF Model │ │ (K8s Driver)│ │ Registry │
└────────────────┘ └──────────────┘ └─────────────────┘
│
▼
┌─────────────────┐
│ KAITO Pod │
│ (llama.cpp) │
└─────────────────┘
Related Services
- RegistryService (
backend/src/services/registry.ts): Manages in-cluster registry - BuildKitService (
backend/src/services/buildkit.ts): Manages BuildKit builder - AikitService (
backend/src/services/aikit.ts): Handles GGUF image building