API Reference

AI Runway provides two APIs for managing deployments:

  1. CRD API (Recommended) - Create ModelDeployment custom resources directly via kubectl
  2. REST API - Web UI backend API for browser-based management

CRD API (Kubernetes Native)

The preferred way to deploy models is via the ModelDeployment CRD:

# Create a deployment
kubectl apply -f - <<EOF
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-demo
  namespace: default
spec:
  model:
    id: "Qwen/Qwen3-0.6B"
    source: huggingface
  engine:
    type: vllm
  resources:
    gpu:
      count: 1
  scaling:
    replicas: 1
EOF

# List deployments
kubectl get modeldeployments

# Check status
kubectl describe modeldeployment qwen-demo

# Delete deployment
kubectl delete modeldeployment qwen-demo

See controller-architecture.md for controller internals and providers.md for provider selection.

ModelDeployment Spec Reference

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| model.id | string | Yes (when source=huggingface) | | HuggingFace model ID |
| model.source | string | No | huggingface | huggingface or custom |
| model.servedName | string | No | Model ID basename | API-facing model name |
| engine.type | string | No | Auto-selected | vllm, sglang, trtllm, or llamacpp. If omitted, auto-selected from provider capabilities |
| engine.image | string | No | Provider default | Engine-specific image override; preferred for Direct vLLM/custom vLLM images |
| engine.contextLength | int | No | Model default | Max context length |
| engine.trustRemoteCode | bool | No | false | Allow remote code (vLLM/SGLang only) |
| engine.args | map[string]string | No | {} | Engine-specific named arguments/CLI flags |
| engine.extraArgs | []string | No | [] | Additional raw engine flags |
| provider.name | string | No | Auto-selected | dynamo, kaito, kuberay, llmd, or vllm |
| provider.overrides | object | No | {} | Provider-specific escape hatch |
| serving.mode | string | No | aggregated | aggregated or disaggregated |
| scaling.replicas | int | No | 1 | Replicas (aggregated mode) |
| scaling.prefill | object | No | | Prefill scaling (disaggregated mode) |
| scaling.decode | object | No | | Decode scaling (disaggregated mode) |
| resources.gpu.count | int | No | 0 | GPU count |
| resources.gpu.type | string | No | nvidia.com/gpu | GPU resource name |
| resources.memory | string | No | | Memory request |
| resources.cpu | string | No | | CPU request |
| image | string | No | Provider default | Legacy provider-level custom image override; prefer engine.image for Direct vLLM |
| env | []EnvVar | No | [] | Environment variables |
| podTemplate.metadata.labels | map | No | {} | Labels for pods |
| podTemplate.metadata.annotations | map | No | {} | Annotations for pods |
| secrets.huggingFaceToken | string | No | | K8s secret name for HF token |
| nodeSelector | map | No | {} | Node selector |
| tolerations | []Toleration | No | [] | Tolerations |
| gateway.enabled | *bool | No | true (when Gateway detected) | Enable/disable gateway integration |
| gateway.modelName | string | No | Model served name or ID | Override model name for gateway routing |
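
For scripted workflows, the manifest can also be generated programmatically. The following is a minimal sketch, not part of AI Runway itself: `build_model_deployment` is a hypothetical helper, and it only covers a handful of the fields listed in the spec reference.

```python
import json


def build_model_deployment(name, model_id, namespace="default",
                           engine=None, gpu_count=0, replicas=1):
    """Return a ModelDeployment manifest as a plain dict (hypothetical helper)."""
    spec = {
        "model": {"id": model_id, "source": "huggingface"},
        "scaling": {"replicas": replicas},
        "resources": {"gpu": {"count": gpu_count}},
    }
    # engine.type is optional; omitting it leaves engine selection to the controller.
    if engine is not None:
        spec["engine"] = {"type": engine}
    return {
        "apiVersion": "airunway.ai/v1alpha1",
        "kind": "ModelDeployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


manifest = build_model_deployment("qwen-demo", "Qwen/Qwen3-0.6B",
                                  engine="vllm", gpu_count=1)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped into `kubectl apply -f -`.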

Update Semantics

When updating a ModelDeployment, changes are handled based on field type:

Identity fields — changing these triggers delete + recreate (brief downtime):

  • model.id, model.source, engine.type (once set), provider.name, serving.mode

Config fields — changed in-place without recreation:

  • model.servedName, scaling.*, env, resources, engine.image, engine.args, engine.extraArgs, engine.contextLength, legacy image, secrets.*, podTemplate.metadata, nodeSelector, tolerations, provider.overrides
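
The identity/config split can be checked before applying an update. The sketch below is an illustration of the rule, not the controller's implementation: `flatten` and `update_strategy` are hypothetical names, defaults are not applied before comparing, and "once set" is read here as "first-time assignment of engine.type does not count as an identity change", which is an assumption.

```python
IDENTITY_FIELDS = {"model.id", "model.source", "engine.type",
                   "provider.name", "serving.mode"}


def flatten(spec, prefix=""):
    """Flatten a nested spec dict into dotted paths, e.g. {'model.id': ...}."""
    out = {}
    for key, value in spec.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out


def update_strategy(old_spec, new_spec):
    """Return 'recreate' if any identity field changed, else 'in-place'."""
    old, new = flatten(old_spec), flatten(new_spec)
    changed = {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}
    # Assumption: setting engine.type for the first time is not an identity change.
    if "engine.type" not in old:
        changed.discard("engine.type")
    return "recreate" if changed & IDENTITY_FIELDS else "in-place"


old = {"model": {"id": "Qwen/Qwen3-0.6B"}, "scaling": {"replicas": 1}}
scaled = {"model": {"id": "Qwen/Qwen3-0.6B"}, "scaling": {"replicas": 3}}
swapped = {"model": {"id": "meta-llama/Llama-3.2-1B-Instruct"}, "scaling": {"replicas": 1}}
print(update_strategy(old, scaled), update_strategy(old, swapped))
```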

API Versioning

| Version | Status | Stability |
|---|---|---|
| v1alpha1 | Current | Experimental: breaking changes allowed |
| v1beta1 | Planned | Feature complete: breaking changes with deprecation warnings |
| v1 | Future | Stable: no breaking changes, long-term support |

Engine-Specific Parameters

Common concepts are abstracted via engine.contextLength and engine.trustRemoteCode. For engine-specific flags, use engine.args:

Context length mapping:

| Engine | CLI flag | Default |
|---|---|---|
| vLLM | --max-model-len | Model default |
| SGLang | --context-length | Model default |
| TensorRT-LLM | Build-time config | |
| llama.cpp | --ctx-size | Model max |
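
The mapping above can be expressed as a small lookup. This is a sketch only: `context_length_args` is a hypothetical helper, and how the controller actually renders the flag onto the engine command line is not documented here.

```python
# Flag names are taken from the mapping above; TensorRT-LLM has no runtime flag
# because its context length is fixed at build time.
CONTEXT_FLAGS = {
    "vllm": "--max-model-len",
    "sglang": "--context-length",
    "llamacpp": "--ctx-size",
}


def context_length_args(engine, context_length):
    """Return the CLI arguments for engine.contextLength, or [] if none apply."""
    if context_length is None:
        return []  # fall back to the engine's own default
    flag = CONTEXT_FLAGS.get(engine)
    if flag is None:
        return []  # e.g. trtllm: configured when the engine is built
    return [flag, str(context_length)]


print(context_length_args("vllm", 8192))
print(context_length_args("trtllm", 8192))
```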

Quantization (via engine.args):

engine:
  type: vllm
  args:
    quantization: "awq" # awq, gptq, squeezellm, fp8

GPU memory utilization (via engine.args):

| Engine | Arg key | Default |
|---|---|---|
| vLLM | gpu-memory-utilization | 0.9 |
| SGLang | mem-fraction-static | 0.88 |

Example Transformations

GPU Deployment → Dynamo (auto-selected)

# User creates:
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-8b
spec:
  model:
    id: "meta-llama/Llama-3.1-8B-Instruct"
    source: huggingface
  engine:
    type: vllm
    contextLength: 8192
  scaling:
    replicas: 1
  resources:
    gpu:
      count: 1
    memory: "32Gi"
  secrets:
    huggingFaceToken: "hf-token"

# Controller creates DynamoGraphDeployment with:
# - Frontend service (router)
# - VllmWorker with 1 GPU, 32Gi memory
# - vLLM runtime image
# Status:
#   provider.name: dynamo
#   provider.selectedReason: "default → dynamo (GPU inference default)"
#   endpoint.service: llama-8b-frontend, port: 8000

CPU Deployment → KAITO (auto-selected)

# User creates:
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: gemma-cpu
spec:
  model:
    id: "google/gemma-3-1b-it-qat-q8_0-gguf"
    source: huggingface
  engine:
    type: llamacpp
  scaling:
    replicas: 1
  resources:
    gpu:
      count: 0
    memory: "16Gi"
    cpu: "8"
  image: "ghcr.io/sozercan/llama-cpp-runner:latest"

# Controller creates KAITO Workspace with:
# - llama.cpp container, CPU-only
# Status:
#   provider.name: kaito
#   provider.selectedReason: "no GPU requested → kaito (only CPU provider)"
#   endpoint.service: gemma-cpu, port: 80

Disaggregated P/D → Dynamo (explicit)

# User creates:
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-70b-pd
spec:
  model:
    id: "meta-llama/Llama-3.1-70B-Instruct"
  provider:
    name: dynamo
    overrides:
      routerMode: "kv"
      frontend:
        replicas: 2
  engine:
    type: vllm
  serving:
    mode: disaggregated
  scaling:
    prefill:
      replicas: 2
      gpu:
        count: 4
      memory: "128Gi"
    decode:
      replicas: 4
      gpu:
        count: 2
      memory: "64Gi"
  secrets:
    huggingFaceToken: "hf-token"

# Controller creates DynamoGraphDeployment with:
# - Frontend (2 replicas, KV routing)
# - VllmPrefillWorker (2 replicas, 4 GPUs each)
# - VllmDecodeWorker (4 replicas, 2 GPUs each)
# Status:
#   provider.name: dynamo
#   replicas.desired: 6 (2 prefill + 4 decode)
#   endpoint.service: llama-70b-pd-frontend, port: 8000
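
The replica and GPU arithmetic for the disaggregated example can be worked through as follows. This is a sketch; `disaggregated_totals` is a hypothetical helper. Following the status shown above, frontend replicas are not counted in replicas.desired.

```python
def disaggregated_totals(scaling):
    """Sum worker replicas and GPUs across the prefill and decode pools."""
    prefill, decode = scaling["prefill"], scaling["decode"]
    desired = prefill["replicas"] + decode["replicas"]
    gpus = (prefill["replicas"] * prefill["gpu"]["count"]
            + decode["replicas"] * decode["gpu"]["count"])
    return {"desired_replicas": desired, "total_gpus": gpus}


scaling = {
    "prefill": {"replicas": 2, "gpu": {"count": 4}},
    "decode": {"replicas": 4, "gpu": {"count": 2}},
}
totals = disaggregated_totals(scaling)
print(totals)  # {'desired_replicas': 6, 'total_gpus': 16}
```

The example therefore needs 16 GPUs in total: 2 × 4 for prefill plus 4 × 2 for decode.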

REST API (Web UI)

Base URL: http://localhost:3001/api

Health & Status

GET /health

Health check endpoint.

Response:

{
"status": "healthy",
"timestamp": "2025-01-15T10:30:00.000Z"
}

GET /health/version

Get build version information.

Response:

{
"version": "v1.0.0",
"buildTime": "2025-01-15T10:00:00.000Z",
"gitCommit": "abc1234"
}

GET /cluster/status

Get Kubernetes cluster connection status.

Response:

{
"connected": true,
"namespace": "airunway-system",
"providerId": "dynamo",
"providerInstalled": true
}

GET /cluster/nodes

Get list of cluster nodes with GPU information.

Response:

{
"nodes": [
{
"name": "gpu-node-1",
"ready": true,
"gpuCount": 2
},
{
"name": "cpu-node-1",
"ready": true,
"gpuCount": 0
}
]
}

Settings

GET /settings

Get current settings and available providers.

Response:

{
"config": {
"defaultNamespace": "airunway-system"
},
"auth": {
"enabled": false
}
}

PUT /settings

Update application settings.

Request Body:

{
"defaultNamespace": "my-namespace"
}

Installation

GET /installation/helm/status

Check if Helm CLI is available.

Response:

{
"available": true,
"version": "v3.14.0"
}

GET /installation/gateway/status

Check whether the Gateway API and Gateway API Inference Extension (GAIE) CRDs are installed in the cluster, and report the pinned GAIE version the controller expects.

Response:

{
"gatewayApiInstalled": true,
"inferenceExtInstalled": true,
"gatewayApiVersion": "v1.2.1",
"inferenceExtVersion": "v1.5.0",
"pinnedVersion": "v1.5.0",
"gatewayAvailable": true,
"gatewayEndpoint": "10.0.0.50",
"message": "Gateway API and Inference Extension CRDs are installed. Gateway is available.",
"installCommands": [
"kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/latest/download/standard-install.yaml",
"kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.5.0/manifests.yaml"
]
}

Notes:

  • pinnedVersion is the GAIE version the controller is built against (PINNED_GAIE_VERSION from @airunway/shared, sourced from versions.env).
  • installCommands are the manual fallback commands; prefer the POST endpoint below when Helm/kubectl is available server-side.

POST /installation/gateway/install-crds

Install (or re-apply) the Gateway API CRDs and the pinned Gateway API Inference Extension CRDs. This is a prerequisite for every provider install and only needs to be run once per cluster.

Response:

{
"success": true,
"message": "Gateway API and Inference Extension CRDs installed successfully",
"results": [
{
"step": "gateway-api-crds",
"success": true,
"output": "customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created"
},
{
"step": "inference-extension-crds",
"success": true,
"output": "customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created"
}
]
}

GET /installation/providers/:id/status

Get provider installation status.

Response:

{
"providerId": "dynamo",
"providerName": "Dynamo",
"installed": true,
"crdFound": true,
"operatorRunning": true,
"version": "dynamo-provider:v0.2.0",
"message": "Dynamo is installed and running"
}

GET /installation/providers/:id/commands

Get manual installation commands for a provider.

Prerequisite — install the Gateway API Inference Extension (GAIE) CRDs first. The commands returned here only cover the provider's own Helm install. GAIE is a shared dependency required by every provider and is installed through a separate flow:

  1. Check status with GET /installation/gateway/status.
  2. If inferenceExtInstalled is false, install via POST /installation/gateway/install-crds (or run the kubectl apply lines returned in installCommands).

Only then run the commands from this endpoint.

Response:

{
"commands": [
"helm repo add nvidia-ai-dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo",
"helm repo update",
"helm install dynamo-platform https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-1.1.1.tgz --namespace dynamo-system --create-namespace --set-json global.grove.install=true"
]
}
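
The prerequisite flow described above can be sketched as a decision on the gateway status response. `next_install_step` is a hypothetical helper; it only inspects the inferenceExtInstalled field named in step 2 and returns the endpoint to call next, without making any HTTP request itself.

```python
def next_install_step(gateway_status):
    """Pick the next endpoint to call, given GET /installation/gateway/status output."""
    if not gateway_status.get("inferenceExtInstalled", False):
        # GAIE CRDs are missing: install them before any provider.
        return "POST /installation/gateway/install-crds"
    # Prerequisite satisfied: fetch the provider's own install commands.
    return "GET /installation/providers/:id/commands"


print(next_install_step({"gatewayApiInstalled": True, "inferenceExtInstalled": False}))
print(next_install_step({"gatewayApiInstalled": True, "inferenceExtInstalled": True}))
```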

POST /installation/providers/:id/install

Install a provider via Helm.

Response:

{
"success": true,
"message": "Provider installed successfully"
}

POST /installation/providers/:id/uninstall

Uninstall a provider (preserves CRDs by default).

Response:

{
"success": true,
"message": "Provider uninstalled (CRDs preserved - use 'Uninstall CRDs' for complete removal)",
"installationStatus": {
"installed": false,
"crdFound": true,
"operatorRunning": false
},
"results": [
{
"step": "Uninstall Helm chart: kaito-workspace",
"success": true,
"output": "release \"kaito-workspace\" uninstalled"
}
]
}

POST /installation/providers/:id/uninstall-crds

Delete CRDs for a provider (complete removal).

Response:

{
"success": true,
"message": "Provider CRDs uninstalled",
"installationStatus": {
"installed": false,
"crdFound": false,
"operatorRunning": false
},
"results": [
{
"step": "Delete CRD: workspaces.kaito.sh",
"success": true,
"output": "CRD workspaces.kaito.sh deleted"
}
]
}

Notes:

  • This is a destructive operation - existing workloads using the CRDs will be affected
  • Use regular uninstall first to remove Helm releases while preserving CRDs
  • Use this endpoint only when you want complete removal

GET /installation/gpu-operator/status

Check NVIDIA GPU Operator installation status and GPU availability.

Response:

{
"installed": true,
"crdFound": true,
"operatorRunning": true,
"gpusAvailable": true,
"totalGPUs": 4,
"gpuNodes": ["node-1", "node-2"],
"message": "GPUs enabled: 4 GPU(s) on 2 node(s)",
"helmCommands": [
"helm repo add nvidia https://helm.ngc.nvidia.com/nvidia",
"helm repo update",
"helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace"
]
}

GET /installation/gpu-capacity

Get detailed GPU capacity information for the cluster.

Response:

{
"totalGpus": 4,
"allocatedGpus": 1,
"availableGpus": 3,
"maxContiguousAvailable": 2,
"totalMemoryGb": 80,
"nodes": [
{
"nodeName": "gpu-node-1",
"totalGpus": 2,
"allocatedGpus": 1,
"availableGpus": 1
},
{
"nodeName": "gpu-node-2",
"totalGpus": 2,
"allocatedGpus": 0,
"availableGpus": 2
}
]
}

Notes:

  • totalMemoryGb is read from the nvidia.com/gpu.memory node label (reported in MiB and converted to GB)
  • If that label is missing, memory is inferred from the nvidia.com/gpu.product label
  • Used by frontend to show GPU fit indicators for HuggingFace search results
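
A rough fit check against this response could look like the sketch below. It rests on two assumptions that the example data is consistent with but the API does not state: maxContiguousAvailable is the largest free GPU count on any single node, and all GPUs of one replica must land on the same node. `summarize_capacity` and `fits` are hypothetical names.

```python
def summarize_capacity(nodes):
    """Derive cluster-wide free GPUs and the largest single-node free count."""
    free = [n["availableGpus"] for n in nodes]
    return {
        "availableGpus": sum(free),
        "maxContiguousAvailable": max(free, default=0),
    }


def fits(summary, gpus_per_replica, replicas=1):
    """Necessary (not sufficient) condition for the deployment to schedule."""
    return (gpus_per_replica <= summary["maxContiguousAvailable"]
            and gpus_per_replica * replicas <= summary["availableGpus"])


nodes = [
    {"nodeName": "gpu-node-1", "totalGpus": 2, "allocatedGpus": 1, "availableGpus": 1},
    {"nodeName": "gpu-node-2", "totalGpus": 2, "allocatedGpus": 0, "availableGpus": 2},
]
summary = summarize_capacity(nodes)
print(summary, fits(summary, 2), fits(summary, 4))
```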

POST /installation/gpu-operator/install

Install the NVIDIA GPU Operator via Helm.

Response:

{
"success": true,
"message": "NVIDIA GPU Operator installed successfully",
"status": {
"installed": true,
"crdFound": true,
"operatorRunning": true,
"gpusAvailable": false,
"totalGPUs": 0,
"gpuNodes": [],
"message": "GPU Operator installed but no GPUs detected on nodes"
}
}

GET /installation/gpu-capacity/detailed

Get detailed GPU capacity with node pool breakdown.

Response:

{
"totalGpus": 4,
"allocatedGpus": 1,
"availableGpus": 3,
"maxContiguousAvailable": 2,
"maxNodeGpuCapacity": 2,
"gpuNodeCount": 2,
"totalMemoryGb": 80,
"nodePools": [
{
"name": "gpu",
"gpuCount": 4,
"nodeCount": 2,
"availableGpus": 3,
"gpuModel": "NVIDIA-A100-SXM4-80GB"
}
]
}

Notes:

  • Groups nodes by node pool (agentpool, kubernetes.azure.com/agentpool, etc.)
  • Shows per-pool GPU capacity and availability
  • Used for capacity planning and autoscaler guidance

Autoscaler

GET /autoscaler/detection

Detect cluster autoscaler type and health status.

Response:

{
"detected": true,
"type": "aks-managed",
"healthy": true,
"message": "Cluster Autoscaler running on 1 node group(s)",
"nodeGroupCount": 1
}

Autoscaler Types:

  • aks-managed - AKS managed cluster autoscaler (Azure)
  • cluster-autoscaler - Self-managed cluster autoscaler (any cloud)
  • none - No autoscaler detected

Detection Logic:

  • Primary: Checks for cluster-autoscaler-status ConfigMap in kube-system
  • Fallback: Checks for cluster-autoscaler Deployment
  • Health: reported as healthy when the ConfigMap timestamp is less than 5 minutes old
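
The freshness rule can be sketched as a timestamp comparison. This is an illustration rather than the server's code; `autoscaler_healthy` is a hypothetical helper, and it assumes the timestamp is an ISO 8601 UTC string like the lastUpdated value returned by GET /autoscaler/status.

```python
from datetime import datetime, timedelta, timezone

HEALTHY_WINDOW = timedelta(minutes=5)


def autoscaler_healthy(last_updated, now=None):
    """True if the status timestamp is less than 5 minutes old."""
    now = now or datetime.now(timezone.utc)
    # fromisoformat() on older Python versions does not accept a trailing "Z".
    updated = datetime.fromisoformat(last_updated.replace("Z", "+00:00"))
    return now - updated < HEALTHY_WINDOW


reference = datetime(2025, 1, 15, 10, 33, tzinfo=timezone.utc)
print(autoscaler_healthy("2025-01-15T10:30:00Z", reference))  # True
print(autoscaler_healthy("2025-01-15T10:20:00Z", reference))  # False
```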

GET /autoscaler/status

Get detailed autoscaler status from ConfigMap.

Response:

{
"health": "Healthy",
"lastUpdated": "2025-01-15T10:30:00Z",
"nodeGroups": [
{
"name": "gpu",
"health": "Healthy",
"minSize": 1,
"maxSize": 10,
"currentSize": 2
}
]
}

Models

GET /models

Get the curated model catalog.

Response:

{
"models": [
{
"id": "Qwen/Qwen3-0.6B",
"name": "Qwen3 0.6B",
"description": "Small, efficient model ideal for development",
"size": "0.6B",
"task": "text-generation",
"contextLength": 32768,
"supportedEngines": ["vllm", "sglang", "trtllm"],
"minGpuMemory": "4GB",
"gated": false
},
{
"id": "meta-llama/Llama-3.2-1B-Instruct",
"name": "Llama 3.2 1B Instruct",
"description": "Compact Llama model optimized for instruction following",
"size": "1B",
"task": "chat",
"contextLength": 131072,
"supportedEngines": ["vllm", "sglang", "trtllm"],
"minGpuMemory": "4GB",
"gated": true
}
]
}

Model Fields:

  • id - HuggingFace model ID (e.g., "Qwen/Qwen3-0.6B")
  • name - Display name
  • description - Brief description
  • size - Parameter count (e.g., "0.6B")
  • task - Model task type ("text-generation", "chat", "fill-mask")
  • contextLength - Maximum context length
  • supportedEngines - Compatible inference engines
  • minGpuMemory - Minimum GPU memory required
  • minGpus - Minimum number of GPUs required (default: 1)
  • gated - Whether model requires HuggingFace authentication (true for Llama, Mistral, etc.)
  • estimatedGpuMemory - Estimated GPU memory from HF search (e.g., "16GB")
  • estimatedGpuMemoryGb - Numeric GPU memory for capacity comparisons
  • parameterCount - Parameter count from safetensors metadata
  • fromHfSearch - True if model came from HuggingFace search

GET /models/:modelId/gguf-files

Get available GGUF files for a HuggingFace model.

Headers:

  • X-HF-Token (optional) - HuggingFace token for gated models

Response:

{
"files": [
{
"filename": "model-Q8_0.gguf",
"size": 1340000000
}
]
}

GET /models/search

Search HuggingFace Hub for compatible models.

Query Parameters:

  • q (required) - Search query
  • limit (optional) - Number of results (default: 20, max: 50)
  • offset (optional) - Pagination offset

Headers:

  • Authorization: Bearer <hf_token> (optional) - For accessing gated models

Response:

{
"models": [
{
"id": "meta-llama/Llama-3.1-8B-Instruct",
"name": "Llama-3.1-8B-Instruct",
"author": "meta-llama",
"downloads": 1500000,
"likes": 2500,
"pipelineTag": "text-generation",
"gated": true,
"supportedEngines": ["vllm", "sglang", "trtllm"],
"estimatedGpuMemory": "19.2GB",
"estimatedGpuMemoryGb": 19.2,
"parameterCount": 8000000000
}
],
"total": 150,
"offset": 0,
"limit": 20
}

Notes:

  • Only returns models with text-generation pipeline tag
  • Filters out models with incompatible architectures
  • GPU memory is estimated as (parameter count × 2 bytes) × 1.2 for FP16 inference, e.g. 8B parameters → 16 GB × 1.2 = 19.2 GB
  • Results cached client-side for 60 seconds
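
The memory estimate in the notes can be reproduced with a few lines. This sketch assumes 2 bytes per parameter for FP16, the 1.2 overhead factor, and 1 GB = 10^9 bytes; `estimate_gpu_memory_gb` is a hypothetical name, not the backend's function.

```python
BYTES_PER_PARAM_FP16 = 2
OVERHEAD_FACTOR = 1.2  # headroom on top of the raw weights


def estimate_gpu_memory_gb(parameter_count):
    """Estimate FP16 inference memory in GB from a parameter count."""
    weights_gb = parameter_count * BYTES_PER_PARAM_FP16 / 1e9
    return round(weights_gb * OVERHEAD_FACTOR, 1)


print(estimate_gpu_memory_gb(8_000_000_000))  # 19.2
```

This matches the estimatedGpuMemoryGb of 19.2 shown for the 8B-parameter model in the example response.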

Deployments

GET /deployments

List all deployments for the active provider.

Query Parameters:

  • namespace (optional) - Filter by namespace

Response:

{
"deployments": [
{
"name": "qwen-deployment",
"namespace": "airunway-system",
"modelId": "Qwen/Qwen3-0.6B",
"engine": "vllm",
"phase": "Running",
"replicas": { "desired": 1, "ready": 1, "available": 1 },
"createdAt": "2024-01-15T10:30:00Z"
}
]
}

POST /deployments

Create a new deployment.

Request Body:

{
"name": "qwen-deployment",
"namespace": "airunway-system",
"provider": "vllm",
"modelId": "Qwen/Qwen3-0.6B",
"engine": "vllm",
"mode": "aggregated",
"replicas": 1,
"hfTokenSecret": "hf-token-secret",
"enforceEager": true,
"enablePrefixCaching": false,
"trustRemoteCode": false,
"imageRef": "vllm/vllm-openai:cu130-nightly",
"engineArgs": {
"trust-remote-code": ""
},
"engineExtraArgs": []
}

Key Fields:

  • name - Kubernetes resource name
  • namespace - Target namespace
  • provider - Runtime provider (dynamo, kuberay, kaito, llmd, or vllm). Omit to let the controller auto-select.
  • modelId - HuggingFace model ID
  • engine - Inference engine (vllm, sglang, trtllm, or llamacpp). Providers support different subsets; Direct vLLM and llm-d use vllm.
  • hfTokenSecret - Name of the Kubernetes secret containing HuggingFace token
  • imageRef - Optional custom image. For provider/runtime vllm, maps to spec.engine.image; for other providers, maps to legacy top-level spec.image.
  • engineArgs - Optional object mapped to spec.engine.args.
  • engineExtraArgs - Optional string array mapped to spec.engine.extraArgs.

Response:

{
"message": "Deployment created successfully",
"name": "qwen-deployment",
"namespace": "airunway-system",
"provider": "vllm"
}
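
The imageRef mapping described under Key Fields can be sketched as follows. `apply_image` is a hypothetical helper that illustrates the documented rule; it is not the backend's actual code.

```python
def apply_image(spec, provider, image_ref):
    """Place a custom image on the spec according to the provider."""
    if provider == "vllm":
        # Direct vLLM: imageRef maps to spec.engine.image.
        spec.setdefault("engine", {})["image"] = image_ref
    else:
        # Other providers: imageRef maps to the legacy top-level spec.image.
        spec["image"] = image_ref
    return spec


print(apply_image({"engine": {"type": "vllm"}}, "vllm", "vllm/vllm-openai:cu130-nightly"))
print(apply_image({}, "kaito", "ghcr.io/sozercan/llama-cpp-runner:latest"))
```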

GET /deployments/:name

Get deployment details including pod status.

Query Parameters:

  • namespace (required)

Response:

{
"name": "qwen-deployment",
"namespace": "airunway-system",
"modelId": "Qwen/Qwen3-0.6B",
"engine": "vllm",
"provider": "dynamo",
"phase": "Running",
"replicas": { "desired": 1, "ready": 1, "available": 1 },
"pods": [
{
"name": "qwen-deployment-worker-0",
"phase": "Running",
"ready": true,
"restarts": 0
}
],
"createdAt": "2024-01-15T10:30:00Z"
}

GET /deployments/:name/manifest

Get the Kubernetes manifest resources for a deployment.

Query Parameters:

  • namespace (optional)

Response:

{
"resources": [
{
"kind": "ModelDeployment",
"apiVersion": "airunway.ai/v1alpha1",
"name": "qwen-deployment",
"manifest": { }
}
],
"primaryResource": {
"kind": "ModelDeployment",
"apiVersion": "airunway.ai/v1alpha1"
}
}

Runtimes

GET /runtimes/status

Get installation and health status of all runtimes.

Response:

{
"runtimes": [
{
"id": "dynamo",
"name": "Dynamo",
"installed": true,
"healthy": true,
"version": "dynamo-provider:v0.2.0",
"message": "Provider ready"
},
{
"id": "kuberay",
"name": "KubeRay",
"installed": false,
"healthy": false,
"message": "CRD not found"
},
{
"id": "kaito",
"name": "KAITO",
"installed": true,
"healthy": true,
"version": "0.6.0",
"message": "KAITO is installed and running"
},
{
"id": "llmd",
"name": "llm-d",
"installed": false,
"healthy": false,
"message": "Provider config not found"
},
{
"id": "vllm",
"name": "Direct vLLM",
"installed": false,
"healthy": false,
"message": "Provider config not found"
}
]
}

Fields:

  • id - Runtime identifier (dynamo, kuberay, kaito, llmd, or vllm)
  • name - Display name
  • installed - Whether the runtime/provider is ready to use
  • healthy - Whether runtime health checks pass
  • version - Detected version (if available)
  • message - Status message

Notes:

  • Used by the frontend to show available runtimes in the deployment wizard
  • Checks provider configuration and available health signals for each provider/runtime; Direct vLLM is registered by the repo-local providers/vllm shim

DELETE /deployments/:name

Delete a deployment.

Query Parameters:

  • namespace (required)

Response:

{
"success": true,
"message": "Deployment deleted"
}

GET /deployments/:name/pods

Get pods for a deployment.

Query Parameters:

  • namespace (optional)

Response:

{
"pods": [
{
"name": "qwen-deployment-worker-0",
"phase": "Running",
"ready": true,
"restarts": 0,
"node": "gpu-node-1"
}
]
}

GET /deployments/:name/logs

Get logs from a deployment's pods.

Query Parameters:

  • namespace (optional) - Deployment namespace
  • podName (optional) - Specific pod to get logs from (defaults to first pod)
  • container (optional) - Specific container name
  • tailLines (optional) - Number of lines to return (default: 100, max: 10000)
  • timestamps (optional) - Include timestamps in log lines (true/false)

Response:

{
"logs": "[INFO] Model loaded successfully\n[INFO] Server started on port 8000\n...",
"podName": "qwen-deployment-worker-0",
"container": "model"
}

Notes:

  • ANSI color codes are automatically stripped from logs
  • If no pods exist for the deployment, returns empty logs with a message
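
A client can assemble the logs request from the parameters above. The sketch below only builds the URL and sends nothing. `logs_url` is a hypothetical helper, and clamping tailLines on the client side (including the lower bound of 1) is an assumption, since the server's handling of out-of-range values is not described here.

```python
from urllib.parse import quote, urlencode


def logs_url(name, namespace=None, pod_name=None, container=None,
             tail_lines=100, timestamps=False,
             base_url="http://localhost:3001/api"):
    """Build the GET /deployments/:name/logs URL with optional query parameters."""
    params = {"tailLines": max(1, min(int(tail_lines), 10000))}
    if namespace:
        params["namespace"] = namespace
    if pod_name:
        params["podName"] = pod_name
    if container:
        params["container"] = container
    if timestamps:
        params["timestamps"] = "true"
    return f"{base_url}/deployments/{quote(name, safe='')}/logs?{urlencode(params)}"


print(logs_url("qwen-deployment", namespace="airunway-system", tail_lines=50000))
```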

GET /deployments/:name/metrics

Get Prometheus metrics from a deployment's inference service.

Query Parameters:

  • namespace (optional) - Deployment namespace

Response (available):

{
"available": true,
"timestamp": "2025-01-15T10:30:00.000Z",
"metrics": [
{
"name": "vllm:num_requests_running",
"value": 5,
"labels": { "model": "Qwen/Qwen3-0.6B" }
},
{
"name": "vllm:gpu_cache_usage_perc",
"value": 45.2,
"labels": {}
}
]
}

Response (off-cluster):

{
"available": false,
"error": "Metrics are only available when AI Runway is deployed inside the Kubernetes cluster.",
"timestamp": "2025-01-15T10:30:00.000Z",
"metrics": [],
"runningOffCluster": true
}

Notes:

  • Metrics require AI Runway to be running inside the cluster
  • Supports both vLLM and llama.cpp metric formats
  • Returns runningOffCluster: true when running locally

GET /deployments/:name/pending-reasons

Get reasons why deployment pods are pending (unschedulable).

Query Parameters:

  • namespace (optional) - Deployment namespace

Response:

{
"reasons": [
{
"reason": "FailedScheduling",
"message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu",
"isResourceConstraint": true,
"resourceType": "gpu",
"canAutoscalerHelp": true
}
]
}

Resource Types:

  • gpu - Insufficient GPU resources
  • cpu - Insufficient CPU resources
  • memory - Insufficient memory

Notes:

  • Only returns reasons for pending pods
  • canAutoscalerHelp indicates if cluster autoscaler can provision resources
  • Taint and node selector issues will have canAutoscalerHelp: false
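
One way to use this response is to decide whether waiting for the autoscaler is worthwhile. The sketch below treats a deployment as unblockable only if every pending reason has canAutoscalerHelp set; that policy and the name `autoscaler_can_unblock` are assumptions, not behavior of the API.

```python
def autoscaler_can_unblock(reasons):
    """True if there are pending reasons and the autoscaler can help with all of them."""
    return bool(reasons) and all(r.get("canAutoscalerHelp", False) for r in reasons)


gpu_shortage = {
    "reason": "FailedScheduling",
    "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu",
    "isResourceConstraint": True,
    "resourceType": "gpu",
    "canAutoscalerHelp": True,
}
taint_issue = {
    "reason": "FailedScheduling",
    "isResourceConstraint": False,
    "canAutoscalerHelp": False,
}
print(autoscaler_can_unblock([gpu_shortage]))               # True
print(autoscaler_can_unblock([gpu_shortage, taint_issue]))  # False
```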

HuggingFace OAuth

AI Runway supports HuggingFace OAuth with PKCE for secure token acquisition. This enables access to gated models (e.g., Llama, Mistral) without manually managing tokens.

GET /oauth/huggingface/config

Get OAuth configuration for initiating HuggingFace sign-in.

Response:

{
"clientId": "e05817a1-7053-4b9e-b292-29cd219fccf8",
"authorizeUrl": "https://huggingface.co/oauth/authorize",
"scopes": ["openid", "profile", "read-repos"]
}

POST /oauth/huggingface/start

Start an OAuth flow with PKCE. Generates a code verifier and state parameter.

Request Body:

{
"redirectUri": "http://localhost:3000/oauth/callback/huggingface"
}

Response:

{
"authorizationUrl": "https://huggingface.co/oauth/authorize?client_id=...&state=...",
"state": "random-state-string"
}

GET /oauth/huggingface/verifier/:state

Retrieve the PKCE code verifier for a given OAuth state. One-time use — the verifier is deleted after retrieval.

Response:

{
"codeVerifier": "pkce_code_verifier_string",
"redirectUri": "http://localhost:3000/oauth/callback/huggingface"
}

POST /oauth/huggingface/token

Exchange OAuth authorization code for access token using PKCE.

Request Body:

{
"code": "authorization_code_from_callback",
"codeVerifier": "pkce_code_verifier_min_43_chars",
"redirectUri": "http://localhost:3000/oauth/callback/huggingface"
}

Response:

{
"accessToken": "hf_xxxxx",
"tokenType": "Bearer",
"expiresIn": 3600,
"scope": "openid profile read-repos",
"user": {
"id": "user123",
"name": "username",
"fullname": "Full Name",
"email": "user@example.com",
"avatarUrl": "https://huggingface.co/avatars/xxx.png"
}
}
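
For readers unfamiliar with PKCE, the relationship between a code verifier and its challenge can be sketched as below. This follows RFC 7636 and assumes the S256 challenge method. In AI Runway the verifier is generated server-side by POST /oauth/huggingface/start, so this is an illustration of the mechanism, not the backend's implementation.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes=64):
    """Random URL-safe verifier; 64 random bytes yield 86 characters (allowed: 43-128)."""
    return secrets.token_urlsafe(n_bytes)


def code_challenge_s256(verifier):
    """BASE64URL(SHA256(verifier)) without padding, per RFC 7636."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
print(len(verifier), code_challenge_s256(verifier))
```

The challenge goes into the authorization URL, while the verifier is kept back and sent only with the token exchange.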

HuggingFace Secrets

Manages HuggingFace tokens as Kubernetes secrets across provider namespaces.

GET /secrets/huggingface/status

Get the status of HuggingFace token secrets across namespaces.

Response:

{
"configured": true,
"namespaces": [
{ "name": "dynamo-system", "exists": true },
{ "name": "ray-system", "exists": true },
{ "name": "kuberay-system", "exists": true },
{ "name": "default", "exists": true }
],
"user": {
"id": "user123",
"name": "username",
"fullname": "Full Name"
}
}

POST /secrets/huggingface

Save HuggingFace access token as Kubernetes secrets in all required namespaces.

Request Body:

{
"accessToken": "hf_xxxxx"
}

Response:

{
"success": true,
"message": "HuggingFace token saved successfully",
"user": {
"id": "user123",
"name": "username",
"fullname": "Full Name"
},
"results": [
{ "namespace": "dynamo-system", "success": true },
{ "namespace": "ray-system", "success": true },
{ "namespace": "kuberay-system", "success": true },
{ "namespace": "default", "success": true }
]
}

DELETE /secrets/huggingface

Delete HuggingFace token secrets from all namespaces.

Response:

{
"success": true,
"message": "HuggingFace secrets deleted successfully",
"results": [
{ "namespace": "dynamo-system", "success": true },
{ "namespace": "ray-system", "success": true },
{ "namespace": "kuberay-system", "success": true },
{ "namespace": "default", "success": true }
]
}

AIKit (KAITO Image Building)

Endpoints for building and managing KAITO/AIKit images for GGUF model deployment.

GET /aikit/models

List available pre-made AIKit models.

Response:

{
"models": [
{
"id": "llama3.2-1b",
"modelName": "Llama 3.2 1B",
"image": "ghcr.io/kaito-project/aikit/llama3.2-1b:0.0.1",
"license": "Llama"
},
{
"id": "phi4-14b",
"modelName": "Phi 4 14B",
"image": "ghcr.io/kaito-project/aikit/phi4-14b:0.0.1",
"license": "MIT"
}
],
"total": 15
}

GET /aikit/models/:id

Get details for a specific pre-made model.

Response:

{
"id": "llama3.2-1b",
"modelName": "Llama 3.2 1B",
"image": "ghcr.io/kaito-project/aikit/llama3.2-1b:0.0.1",
"license": "Llama"
}

POST /aikit/build

Build an AIKit image from a HuggingFace GGUF model or get pre-made image reference.

Request Body (Pre-made):

{
"modelSource": "premade",
"premadeModel": "llama3.2-1b"
}

Request Body (HuggingFace GGUF):

{
"modelSource": "huggingface",
"modelId": "bartowski/gemma-3-1b-it-GGUF",
"ggufFile": "gemma-3-1b-it-Q8_0.gguf",
"imageName": "my-model",
"imageTag": "v1"
}

Response:

{
"success": true,
"imageRef": "registry.airunway-system.svc.cluster.local:5000/my-model:v1",
"buildTime": 120,
"wasPremade": false,
"message": "AIKit image built successfully"
}

POST /aikit/build/preview

Preview what image would be built (dry-run, no actual build).

Response:

{
"imageRef": "registry.airunway-system.svc.cluster.local:5000/my-model:v1",
"wasPremade": false,
"requiresBuild": true,
"registryUrl": "registry.airunway-system.svc.cluster.local:5000"
}

GET /aikit/infrastructure/status

Check build infrastructure (registry and BuildKit) status.

Response:

{
"ready": true,
"registry": {
"ready": true,
"url": "registry.airunway-system.svc.cluster.local:5000",
"message": "Registry is running"
},
"builder": {
"exists": true,
"ready": true,
"running": true,
"message": "BuildKit builder is ready"
}
}

POST /aikit/infrastructure/setup

Set up build infrastructure (deploy registry and BuildKit if needed).

Response:

{
"success": true,
"message": "Build infrastructure is ready",
"registry": {
"url": "registry.airunway-system.svc.cluster.local:5000",
"ready": true
},
"builder": {
"name": "buildkit-airunway",
"ready": true
}
}

AI Configurator

Endpoints for NVIDIA AI Configurator integration to get optimal inference configurations.

GET /aiconfigurator/status

Check if AI Configurator CLI is available on the system.

Response (available):

{
"available": true,
"version": "0.4.0"
}

Response (unavailable):

{
"available": false,
"error": "AI Configurator CLI not found"
}

Response (running in-cluster):

{
"available": false,
"runningInCluster": true,
"error": "AI Configurator is only available when running AI Runway locally"
}

Notes:

  • Status is cached for 5 minutes to avoid repeated CLI calls
  • AI Configurator must be installed locally: github.com/ai-dynamo/aiconfigurator
  • When running inside Kubernetes, returns runningInCluster: true (AI Configurator is local-only)

POST /aiconfigurator/analyze

Analyze a model + GPU combination and return optimal configuration.

Request Body:

{
"modelId": "Qwen/Qwen3-0.6B",
"gpuType": "H100-80GB",
"gpuCount": 2,
"optimizeFor": "throughput",
"maxLatencyMs": 100
}

Required Fields:

  • modelId - HuggingFace model ID (validated format: org/model-name or model-name)
  • gpuType - GPU type (e.g., "A100-80GB", "H100", "L40S")
  • gpuCount - Number of GPUs available (minimum: 1)

Optional Fields:

  • optimizeFor - Optimization target: "throughput" (default) or "latency"
  • maxLatencyMs - Target time-to-first-token latency constraint in milliseconds

Response (success):

{
"success": true,
"config": {
"tensorParallelDegree": 1,
"pipelineParallelDegree": 1,
"maxBatchSize": 256,
"maxNumSeqs": 256,
"gpuMemoryUtilization": 0.8,
"maxModelLen": 5000
},
"mode": "aggregated",
"replicas": 1,
"warnings": [],
"estimatedPerformance": {
"throughputTokensPerSec": 8901.5,
"latencyP50Ms": 187.99,
"latencyP99Ms": 281.98,
"gpuUtilization": 0.8
},
"backend": "vllm",
"supportedBackends": ["vllm", "sglang", "trtllm"]
}

Response (disaggregated mode):

{
"success": true,
"config": {
"tensorParallelDegree": 1,
"pipelineParallelDegree": 1,
"maxBatchSize": 256,
"maxNumSeqs": 256,
"gpuMemoryUtilization": 0.8,
"maxModelLen": 5000,
"prefillTensorParallel": 1,
"decodeTensorParallel": 1,
"prefillReplicas": 1,
"decodeReplicas": 1
},
"mode": "disaggregated",
"replicas": 1,
"warnings": [],
"estimatedPerformance": {
"throughputTokensPerSec": 8405.12,
"latencyP50Ms": 25.42,
"latencyP99Ms": 38.13,
"gpuUtilization": 0.8
},
"backend": "vllm",
"supportedBackends": ["vllm", "sglang", "trtllm"]
}

Response (CLI unavailable - returns defaults):

{
"success": false,
"config": {
"tensorParallelDegree": 2,
"maxBatchSize": 256,
"gpuMemoryUtilization": 0.9,
"maxModelLen": 4096
},
"mode": "aggregated",
"replicas": 1,
"error": "AI Configurator CLI not found",
"warnings": ["AI Configurator not available, using default configuration"]
}

Modes:

  • aggregated - Traditional serving where prefill and decode run on same GPUs
  • disaggregated - Prefill and decode separated for lower latency (NVIDIA Dynamo feature)

Supported Backends by GPU:

  • H100: vLLM, SGLang, TensorRT-LLM
  • A100, H200, L40S, B200, GB200: TensorRT-LLM only (vLLM data not available in AI Configurator)

POST /aiconfigurator/normalize-gpu

Normalize a GPU product string to AI Configurator format.

Request Body:

{
"gpuProduct": "nvidia-a100-sxm4-80gb"
}

Response:

{
"gpuProduct": "nvidia-a100-sxm4-80gb",
"normalized": "A100-80GB"
}

Notes:

  • Useful for converting Kubernetes node GPU labels to the format AI Configurator expects
  • Handles various formats: NVIDIA prefixes, SXM/PCIe variants, Tesla prefixes
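
To illustrate the kind of normalization described in the notes above, here is a rough sketch that strips vendor prefixes and form-factor tokens. It is an assumption about the general approach, not the server's actual logic, which may rely on a lookup table and cover more cases; `normalize_gpu_label` is a hypothetical name.

```python
import re

# Tokens dropped during normalization (assumed from the notes: NVIDIA and
# Tesla prefixes, SXM/PCIe variants).
_VENDOR_TOKENS = {"NVIDIA", "TESLA"}
_FORM_FACTOR = re.compile(r"SXM\d*|PCIE")


def normalize_gpu_label(label: str) -> str:
    """Reduce a node GPU label to a short model name (illustrative only)."""
    tokens = re.split(r"[-_\s]+", label.strip().upper())
    kept = [
        t for t in tokens
        if t and t not in _VENDOR_TOKENS and not _FORM_FACTOR.fullmatch(t)
    ]
    return "-".join(kept)
```

With the request body shown above, `normalize_gpu_label("nvidia-a100-sxm4-80gb")` yields `A100-80GB`, matching the sample response.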

Cost Estimation

Endpoints for real-time cloud pricing and cost estimation for GPU node pools.

POST /costs/estimate

Estimate deployment cost based on GPU configuration (static estimate).

Request Body:

{
"gpuType": "A100-80GB",
"gpuCount": 1,
"replicas": 1,
"hoursPerMonth": 730
}

Required Fields:

  • gpuType - GPU model name (e.g., "A100-80GB", "H100", "T4")
  • gpuCount - Number of GPUs per replica (minimum: 1)
  • replicas - Number of replicas (minimum: 1)

Optional Fields:

  • hoursPerMonth - Hours per month for cost calculation (1-744, default: 730)
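
The field constraints above can be checked on the client before the request is sent. The sketch below builds the request body and enforces the documented ranges; `build_estimate_request` is a hypothetical helper, and the server remains the authority on validation.

```python
from typing import Any, Dict


def build_estimate_request(
    gpu_type: str,
    gpu_count: int,
    replicas: int,
    hours_per_month: int = 730,
) -> Dict[str, Any]:
    """Build a POST /costs/estimate body, enforcing the documented limits."""
    if not gpu_type:
        raise ValueError("gpuType is required")
    if gpu_count < 1:
        raise ValueError("gpuCount must be at least 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 1 <= hours_per_month <= 744:
        raise ValueError("hoursPerMonth must be between 1 and 744")
    return {
        "gpuType": gpu_type,
        "gpuCount": gpu_count,
        "replicas": replicas,
        "hoursPerMonth": hours_per_month,
    }
```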

Response:

{
"success": true,
"breakdown": {
"totalGpus": 1,
"gpuModel": "A100-80GB",
"normalizedGpuModel": "A100-80GB",
"perGpu": { "hourly": 0, "monthly": 0 },
"estimate": {
"hourly": 0,
"monthly": 0,
"currency": "USD",
"source": "static",
"confidence": "low"
},
"notes": ["Use real-time pricing via /costs/node-pools for accurate cloud pricing"]
}
}

Notes:

  • Static pricing is deprecated; use /costs/node-pools for real-time cloud pricing
  • Returns confidence: "low" to indicate static estimates should not be relied upon

GET /costs/node-pools

Get cost estimates for all node pools using real-time cloud pricing.

Query Parameters:

  • gpuCount (optional) - Number of GPUs per deployment (default: 1)
  • replicas (optional) - Number of replicas (default: 1)
  • realtime (optional) - Use real-time pricing; set to "false" for static estimates (default: true)
  • computeType (optional) - Filter by "gpu" or "cpu" (default: "gpu")
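
A short sketch of how these query parameters combine into a request URL. The base URL, including any path prefix in front of `/costs/node-pools`, is an assumption that depends on how the API is exposed in your environment; `node_pools_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def node_pools_url(
    base_url: str,
    gpu_count: int = 1,
    replicas: int = 1,
    realtime: bool = True,
    compute_type: str = "gpu",
) -> str:
    """Build the GET /costs/node-pools URL with the documented parameters."""
    query = urlencode({
        "gpuCount": gpu_count,
        "replicas": replicas,
        # The API expects the literal strings "true" or "false".
        "realtime": "true" if realtime else "false",
        "computeType": compute_type,
    })
    return f"{base_url.rstrip('/')}/costs/node-pools?{query}"
```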

Response:

{
"success": true,
"nodePoolCosts": [
{
"poolName": "gpu",
"gpuModel": "A100-80GB",
"availableGpus": 4,
"costBreakdown": {
"totalGpus": 1,
"gpuModel": "A100-80GB",
"normalizedGpuModel": "A100-80GB",
"perGpu": { "hourly": 0, "monthly": 0 },
"estimate": {
"hourly": 3.5,
"monthly": 2555,
"currency": "USD",
"source": "cloud-api",
"confidence": "high"
},
"notes": ["Real-time pricing from AZURE"]
},
"realtimePricing": {
"instanceType": "Standard_NC24ads_A100_v4",
"hourlyPrice": 3.5,
"monthlyPrice": 2555,
"currency": "USD",
"region": "eastus",
"source": "realtime"
}
}
],
"pricingSource": "realtime-with-fallback",
"cacheStats": {
"size": 5,
"ttlMs": 3600000,
"maxEntries": 1000
}
}

Notes:

  • Fetches real-time pricing from the Azure Retail Prices API
  • Falls back to static estimates if cloud pricing is unavailable
  • Pricing is cached for 1 hour to reduce API calls
  • AWS and GCP pricing are not yet implemented
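
Because the endpoint can fall back to static estimates, a consumer comparing pools should check the estimate's confidence before trusting the price. The sketch below picks the cheapest pool with enough free GPUs and skips low-confidence estimates by default. It assumes the response shape shown above and that fallback estimates carry `confidence: "low"` as in the /costs/estimate sample; `cheapest_pool` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def cheapest_pool(
    response: Dict[str, Any],
    gpus_needed: int = 1,
    include_low_confidence: bool = False,
) -> Optional[Dict[str, Any]]:
    """Return the cheapest suitable pool as a small summary dict, or None."""
    best: Optional[Dict[str, Any]] = None
    for pool in response.get("nodePoolCosts", []):
        if pool.get("availableGpus", 0) < gpus_needed:
            continue
        estimate = pool["costBreakdown"]["estimate"]
        confidence = estimate.get("confidence", "low")
        # Static estimates may report 0 and would otherwise always "win".
        if confidence == "low" and not include_low_confidence:
            continue
        if best is None or estimate["hourly"] < best["hourly"]:
            best = {
                "poolName": pool["poolName"],
                "hourly": estimate["hourly"],
                "confidence": confidence,
            }
    return best
```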

GET /costs/instance-price

Get real-time pricing for a specific instance type.

Query Parameters:

  • instanceType (required) - Cloud instance type (e.g., "Standard_NC24ads_A100_v4")
  • region (optional) - Cloud region (e.g., "eastus")

Response (success):

{
"success": true,
"price": {
"instanceType": "Standard_NC24ads_A100_v4",
"provider": "azure",
"region": "eastus",
"hourlyPrice": 3.5,
"currency": "USD",
"priceType": "ondemand",
"gpuCount": 1,
"gpuModel": "A100-80GB",
"lastUpdated": "2025-01-01T00:00:00.000Z"
},
"cached": false
}

Response (provider not detected):

{
"success": false,
"error": "Could not detect cloud provider for instance type: unknown-instance"
}

Response (pricing not found):

{
"success": false,
"error": "Price not found",
"provider": "azure"
}

Provider Detection:

  • Azure: Instance types starting with Standard_ or Basic_
  • AWS: Instance types such as p4d.24xlarge, g5.xlarge (not yet implemented)
  • GCP: Instance types such as n1-standard-4, a2-highgpu-1g (not yet implemented)
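
The detection rules above can be approximated with prefix and pattern checks. The sketch below is illustrative: the Azure prefixes come from this section, while the AWS and GCP regular expressions are assumptions inferred from the example names and will not cover every instance family. `detect_provider` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed patterns, derived only from the examples in this section.
_AWS_PATTERN = re.compile(r"[a-z]+\d+[a-z-]*\.[a-z0-9]+")      # p4d.24xlarge
_GCP_PATTERN = re.compile(r"[a-z]\d[a-z]?-[a-z]+-\d+[a-z]*")   # a2-highgpu-1g


def detect_provider(instance_type: str) -> Optional[str]:
    """Guess the cloud provider from an instance type name, or return None."""
    if instance_type.startswith(("Standard_", "Basic_")):
        return "azure"
    if _AWS_PATTERN.fullmatch(instance_type):
        return "aws"
    if _GCP_PATTERN.fullmatch(instance_type):
        return "gcp"
    return None
```

Returning `None` for an unrecognized name corresponds to the "provider not detected" response shown above.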

GET /costs/gpu-models

Get the list of supported GPU models with their specifications.

Response:

{
"success": true,
"models": [
{
"model": "A100-80GB",
"memoryGb": 80,
"generation": "Ampere"
},
{
"model": "H100-80GB",
"memoryGb": 80,
"generation": "Hopper"
},
{
"model": "T4",
"memoryGb": 16,
"generation": "Turing"
}
],
"note": "For actual pricing, use /costs/node-pools or /costs/instance-price for real-time cloud provider pricing"
}

Notes:

  • Returns GPU specifications only (memory, generation)
  • For real-time pricing, use /costs/node-pools or /costs/instance-price endpoints
  • GPU models are used for normalization and capacity planning

GET /costs/normalize-gpu

Normalize a GPU label to a standard GPU model name.

Query Parameters:

  • label (required) - GPU label from Kubernetes node (e.g., "NVIDIA-A100-SXM4-80GB")

Response:

{
"success": true,
"originalLabel": "NVIDIA-A100-SXM4-80GB",
"normalizedModel": "A100-80GB",
"gpuInfo": {
"memoryGb": 80,
"generation": "Ampere"
}
}

Notes:

  • Handles various GPU label formats: NVIDIA prefixes, SXM/PCIe variants, Tesla prefixes
  • Returns GPU specifications when available

Gateway

Authentication: When authentication is enabled, gateway endpoints require a valid Bearer token, as do all other non-public API routes. Access control is governed by Kubernetes TokenReview. When authentication is disabled (the default single-cluster mode), these endpoints are publicly accessible.
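
As a sketch of the client side, the helper below attaches the Bearer token only when one is supplied, so the same code works with authentication enabled or disabled. The base URL and how the token is obtained are assumptions; `gateway_request` is a hypothetical helper built on the standard library.

```python
import urllib.request
from typing import Optional


def gateway_request(
    base_url: str,
    path: str,
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Build a request for a gateway endpoint such as /gateway/status."""
    headers = {"Accept": "application/json"}
    if token:
        # Sent only when auth is enabled and a token is available.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{base_url.rstrip('/')}{path}", headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` to perform the call.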

GET /gateway/status

Get Gateway API Inference Extension availability and endpoint.

Response:

{
"available": true,
"endpoint": "http://10.0.0.1"
}

GET /gateway/models

List all models accessible through the unified gateway endpoint.

Response:

[
{
"name": "llama-3-8b",
"deploymentName": "my-llama",
"provider": "kaito",
"ready": true
}
]

Error Responses

All endpoints return errors in this format:

{
"error": "Error message",
"code": "ERROR_CODE",
"details": {}
}

Common error codes:

  • CLUSTER_UNAVAILABLE - Cannot connect to Kubernetes
  • PROVIDER_NOT_INSTALLED - Active provider not installed
  • VALIDATION_ERROR - Invalid request body
  • NOT_FOUND - Resource not found
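
A client can map this error format onto an exception type so that callers branch on `code` rather than on message text. The sketch below is hypothetical: `ApiError` and `parse_error` are illustrative names, and treating CLUSTER_UNAVAILABLE as retryable is an assumption rather than documented behavior.

```python
import json
from typing import Any, Dict, Optional

# Assumption: only connectivity failures are worth retrying.
RETRYABLE_CODES = {"CLUSTER_UNAVAILABLE"}


class ApiError(Exception):
    """Error raised for a non-success API response."""

    def __init__(self, status: int, message: str,
                 code: Optional[str] = None,
                 details: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.status = status
        self.code = code
        self.details = details or {}

    @property
    def retryable(self) -> bool:
        return self.code in RETRYABLE_CODES


def parse_error(status: int, body: str) -> ApiError:
    """Convert an HTTP status and raw response body into an ApiError."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    # Fall back to the raw body when the response is not the documented shape.
    message = payload.get("error") or body or "Unknown error"
    return ApiError(status, message, payload.get("code"), payload.get("details"))
```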