Gateway API Inference Extension Integration
Pinned versions: the GAIE_VERSION referenced in this document is sourced from /versions.env at the repo root. Substitute that value (currently v1.5.0) when running the commands below, or source the file in your shell: set -a; source versions.env; set +a.
Overview
AI Runway integrates with the Gateway API Inference Extension to provide a unified inference gateway. Instead of accessing each model's Service individually, you deploy a single Gateway and call all models through one endpoint using the standard OpenAI-compatible API. The Gateway routes requests to the correct model based on the model field in the request body.
When gateway integration is active, AI Runway automatically creates an InferencePool, Endpoint Picker (EPP), and an HTTPRoute for each ModelDeployment. You only need to provide the Gateway itself.
Architecture
                   ┌───────────────────────────────────────────────┐
                   │              Kubernetes Cluster               │
                   │                                               │
┌────────┐         │  ┌─────────┐       ┌───────────┐              │
│ Client │────────▶│  │ Gateway │──────▶│ HTTPRoute │              │
│ (curl/ │         │  │  + BBR  │       │           │              │
│ openai)│         │  └─────────┘       └─────┬─────┘              │
└────────┘         │                          │                    │
                   │                          ▼                    │
                   │                  ┌───────────────┐            │
                   │                  │ InferencePool │            │
                   │                  │ (auto-created)│            │
                   │                  └───────┬───────┘            │
                   │                          │                    │
                   │                          ▼                    │
                   │                  ┌───────────────┐            │
                   │                  │ EPP (Endpoint │            │
                   │                  │ Picker Proxy) │            │
                   │                  │ (auto-created)│            │
                   │                  └───────┬───────┘            │
                   │                          │                    │
                   │                          ▼                    │
                   │                  ┌───────────────┐            │
                   │                  │ Model Server  │            │
                   │                  │ Pod (vLLM,    │            │
                   │                  │ sglang, etc.) │            │
                   │                  └───────────────┘            │
                   └───────────────────────────────────────────────┘
Request flow: Client → Gateway (+BBR) → HTTPRoute → InferencePool → Endpoint Picker (EPP) → Model Server Pod
What AI Runway creates automatically (when gateway.enabled is true or omitted, and Gateway CRDs are detected):
- InferencePool: selects pods labeled with airunway.ai/model-deployment: <name> on the model's serving port
- HTTPRoute: routes from the Gateway to the InferencePool (unless httpRouteRef is set)
- EPP: Endpoint Picker Proxy for intelligent endpoint selection
What you provide:
- A Gateway resource (with any compatible implementation)
Prerequisites
- Kubernetes cluster with Gateway API CRDs installed
- Gateway API Inference Extension CRDs installed (provides InferencePool)
- A compatible gateway implementation (see below)
Gateway Implementations
AI Runway works with any Gateway API implementation that supports the Inference Extension. You are responsible for installing and managing your own gateway. Some known implementations:
| Implementation | gatewayClassName | Status | Docs |
|---|---|---|---|
| Envoy Gateway | eg | Not tested | Inference Extension guide |
| Istio | istio | Tested | Inference Extension guide |
| kgateway | kgateway | Tested (still requires the X-Gateway-Model-Name header) | Inference Extension guide |
| GKE Gateway | gke-l7-rilb | Not tested | GKE Inference guide |
Note: The only difference between implementations is the gatewayClassName in your Gateway resource. All AI Runway-managed resources (InferencePool, HTTPRoute) are identical regardless of which gateway you use.
Setup
[!TIP] Istio shortcut: make setup-gateway (from the repo root) performs the entire manual Istio setup below in one shot. It installs the Gateway API CRDs (Step 1), the Gateway API Inference Extension (GAIE) CRDs (Step 2), Istio with the inference extension enabled (Step 3), the inference-gateway Gateway resource (Step 4), and the Body-Based Router (see Body-Based Routing). The GATEWAY_API_VERSION, ISTIO_VERSION, and GAIE_VERSION it uses are pinned in /versions.env, and istioctl must be on your PATH. For other gateway implementations, follow the manual steps below.
Step 1: Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/latest/download/standard-install.yaml
Step 2: Install Gateway API Inference Extension CRDs
kubectl apply -f "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${GAIE_VERSION}/manifests.yaml"
Step 3: Install a Gateway Implementation
Follow the installation guide for your chosen implementation:
- Envoy Gateway: quickstart
- Istio: getting started
- kgateway: quickstart
[!NOTE] Istio: Inference Extension support must be explicitly enabled by setting ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true on the istiod deployment (or passing --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true during istioctl install). Without this, Istio ignores InferencePool backend refs in HTTPRoutes. The minimal profile is sufficient: Istio auto-creates a gateway deployment and LoadBalancer Service when you create a Gateway resource. See the Istio Inference Extension guide for full details.
Step 4: Create a Gateway Resource
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: default
spec:
  gatewayClassName: eg # Change to match your implementation
  infrastructure:
    annotations:
      # Required on AKS with Istio. Azure otherwise probes GET / on port 80,
      # but the gateway returns 404 there and the public IP can time out.
      service.beta.kubernetes.io/port_80_health-probe_protocol: tcp
  listeners:
    - name: http
      protocol: HTTP
      port: 80
If you have multiple Gateways in the cluster, label the one to use for inference:
metadata:
  labels:
    airunway.ai/inference-gateway: "true"
[!NOTE] AKS with Istio: Keep the spec.infrastructure.annotations.service.beta.kubernetes.io/port_80_health-probe_protocol: tcp setting in your Gateway. Azure otherwise configures an HTTP health probe for / on port 80, but Istio's generated gateway returns 404 on /. The result is a public IP that times out even though the gateway works through kubectl port-forward or from inside the cluster.
Step 5: Deploy Models
Deploy models as usual. AI Runway automatically creates the InferencePool, EPP, and HTTPRoute:
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3
  namespace: default
spec:
  model:
    id: "Qwen/Qwen3-0.6B"
  gateway:
    enabled: true # Optional: enabled by default when Gateway is detected; set to false to explicitly disable
The ModelDeployment status will show gateway information once ready:
kubectl get modeldeployment qwen3 -o jsonpath='{.status.gateway}'
Configuration
Auto-detection
The controller auto-detects Gateway API Inference Extension CRDs at startup by querying the Kubernetes discovery API. If the CRDs (InferencePool, HTTPRoute, Gateway) are present, gateway integration is enabled. If not, it is silently disabled — no errors, no resources created.
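The detection rule amounts to a set-membership check. The following is a minimal Python sketch of that idea, not the controller's code: the function name gateway_integration_enabled and the input shape (a collection of kind names) are assumptions made for illustration, while the three required kinds come from the paragraph above.

```python
# Kinds this document names as required for gateway integration.
REQUIRED_KINDS = frozenset({"InferencePool", "HTTPRoute", "Gateway"})


def gateway_integration_enabled(discovered_kinds):
    """Return True only when every required kind was reported by discovery.

    `discovered_kinds` is a hypothetical iterable of kind names; the real
    controller queries the Kubernetes discovery API at startup instead.
    """
    return REQUIRED_KINDS.issubset(set(discovered_kinds))


if __name__ == "__main__":
    # All three kinds present: integration would be enabled.
    print(gateway_integration_enabled({"Gateway", "HTTPRoute", "InferencePool", "Pod"}))
    # InferencePool missing: integration would be silently disabled.
    print(gateway_integration_enabled({"Gateway", "HTTPRoute"}))
```

Because detection happens at startup, a missing kind keeps the result false until the controller is restarted.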
Explicit Gateway Selection
If you have multiple Gateways or want deterministic behavior, use controller flags:
--gateway-name=inference-gateway
--gateway-namespace=default
When set, the controller always uses the specified Gateway as the HTTPRoute parent instead of auto-detecting.
Endpoint Picker (EPP) Configuration
The controller automatically deploys an EPP (Endpoint Picker Proxy) per ModelDeployment, named <deployment-name>-epp. The EPP handles intelligent request routing to model server pods.
--epp-service-port=9002 # EPP Service port (default: 9002)
--epp-image=<image> # EPP container image (default: upstream GAIE image)
--patch-gateway-allowed-routes=true # Patch Gateway allowedRoutes for cross-namespace routing (default: true)
Body-Based Routing (BBR)
When serving multiple models through a single Gateway, a Body-Based Router (BBR) is needed to extract the model field from the request body and route to the correct InferencePool. BBR is a separate component deployed via the upstream GAIE helm chart.
Install BBR using the upstream helm chart:
helm install body-based-router \
--set provider.name=istio \
--version "${GAIE_VERSION}" \
oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
[!NOTE] The BBR chart version should match the GAIE version used by AI Runway. The pinned value lives in
/versions.env; update both at the same time when bumping.
Set provider.name to match your gateway implementation (istio or gke; omit it for others). The chart deploys the BBR container and any provider-specific resources (e.g. EnvoyFilter for Istio).
See the upstream multi-model guide for full details.
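To make the routing idea concrete, here is a rough Python sketch of what a body-based router does conceptually: read the model field from the JSON body, expose it as a header, and let the route table match on that header. It is not the upstream BBR implementation. The header name X-Gateway-Model-Name is the one mentioned in this document; the function names and the routes mapping are hypothetical.

```python
import json

# Header name taken from this document; everything else here is illustrative.
MODEL_HEADER = "X-Gateway-Model-Name"


def extract_model_header(body: bytes) -> dict:
    """Return the header a body-based router would add for this request body."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not valid JSON (or not valid UTF-8): no header can be derived.
        return {}
    model = payload.get("model") if isinstance(payload, dict) else None
    if isinstance(model, str) and model:
        return {MODEL_HEADER: model}
    return {}


def pick_pool(headers: dict, routes: dict):
    """Look up the InferencePool name for the header value, or None if unmatched."""
    return routes.get(headers.get(MODEL_HEADER))


if __name__ == "__main__":
    routes = {"Qwen/Qwen3-0.6B": "qwen3"}  # hypothetical model -> pool mapping
    headers = extract_model_header(b'{"model": "Qwen/Qwen3-0.6B", "messages": []}')
    print(headers, pick_pool(headers, routes))
```

A request whose body carries no usable model field yields no header and therefore no pool match in this sketch.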
Known limitation: BBR restart on each new model. BBR builds its model registry only at startup and does not dynamically watch InferencePools, so the controller triggers a rolling restart of the shared BBR Deployment once per new ModelDeployment (tracked by the airunway.ai/bbr-restarted annotation). The restart is not zero-downtime: while BBR is restarting, its registry is incomplete, so an in-flight request for an already-serving model can miss its X-Gateway-Model-Name header and mis-route to another model's InferencePool. With disaggregated Dynamo serving this surfaces as a Worker ID required (--direct-route) 500 on a concurrent aggregated request. This mainly affects deploying multiple models close together; once all models are settled, routing is correct and stable. A zero-downtime BBR reload (or a BBR that watches InferencePools) would remove the window. The GPU e2e suite leaves disaggregated serving out of its default matrix for this reason.
Auto-detection with Multiple Gateways
When no explicit gateway is configured and multiple Gateway resources exist in the cluster, the controller looks for one labeled with:
airunway.ai/inference-gateway: "true"
If no labeled Gateway is found, the controller skips gateway reconciliation and sets the GatewayReady condition to False.
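The selection order described in this section and under Explicit Gateway Selection can be sketched as follows. This is an illustrative Python model, not the controller's logic. Several details are assumptions: that a lone unlabeled Gateway is used as-is, that a missing explicit Gateway is reported like NoGateway, and the reason strings other than NoGateway.

```python
INFERENCE_LABEL = "airunway.ai/inference-gateway"


def select_gateway(gateways, explicit_name=None, explicit_namespace=None):
    """Return (gateway, reason); gateway is None when reconciliation is skipped.

    `gateways` is a list of dicts shaped like Kubernetes objects:
    {"metadata": {"name": ..., "namespace": ..., "labels": {...}}}.
    """
    if explicit_name:
        # Flags win: only the named Gateway is considered.
        for gw in gateways:
            meta = gw["metadata"]
            if meta["name"] == explicit_name and meta.get("namespace") == explicit_namespace:
                return gw, "Explicit"
        return None, "NoGateway"  # assumption: a missing explicit Gateway is reported this way
    if not gateways:
        return None, "NoGateway"
    if len(gateways) == 1:
        return gateways[0], "SingleGateway"  # assumption: a lone Gateway needs no label
    labeled = [
        gw for gw in gateways
        if (gw["metadata"].get("labels") or {}).get(INFERENCE_LABEL) == "true"
    ]
    if labeled:
        return labeled[0], "Labeled"
    return None, "MultipleGateways"  # GatewayReady would be set to False here
```

In the last branch the sketch returns no Gateway, which corresponds to the controller skipping gateway reconciliation.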
Cross-namespace Gateway
When the Gateway is in a different namespace than the ModelDeployment, the controller automatically patches each Gateway listener to allow HTTPRoutes from the ModelDeployment's namespace using a namespace selector:
allowedRoutes:
  namespaces:
    from: Selector
    selector:
      matchLabels:
        kubernetes.io/metadata.name: <modeldeployment-namespace>
This is required because Gateway API uses allowedRoutes on the listener to control cross-namespace route binding. Without it, the Gateway will reject HTTPRoutes from other namespaces.
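As a rough illustration of the patch, the Python sketch below builds the allowedRoutes stanza shown above and applies it to every listener of a Gateway represented as a plain dict. The function names are hypothetical, and the sketch handles a single ModelDeployment namespace only; how the controller merges several namespaces is not covered here.

```python
import copy


def allowed_routes_for(namespace: str) -> dict:
    """Build the allowedRoutes stanza that admits HTTPRoutes from one namespace."""
    return {
        "namespaces": {
            "from": "Selector",
            "selector": {"matchLabels": {"kubernetes.io/metadata.name": namespace}},
        }
    }


def patch_listeners(gateway: dict, md_namespace: str) -> dict:
    """Return a patched copy of the Gateway; the input object is left untouched."""
    patched = copy.deepcopy(gateway)
    if patched["metadata"].get("namespace") == md_namespace:
        return patched  # same namespace: no cross-namespace binding needed
    for listener in patched["spec"]["listeners"]:
        listener["allowedRoutes"] = allowed_routes_for(md_namespace)
    return patched


if __name__ == "__main__":
    gw = {
        "metadata": {"name": "inference-gateway", "namespace": "gateway-system"},
        "spec": {"listeners": [{"name": "http", "protocol": "HTTP", "port": 80}]},
    }
    print(patch_listeners(gw, "default")["spec"]["listeners"][0]["allowedRoutes"])
```

The kubernetes.io/metadata.name label is set automatically on every namespace by Kubernetes, which is why it works as a selector without extra labeling.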
Opting out of Gateway patching: In security-conscious environments where a Gateway admin manages allowedRoutes independently, start the controller with --patch-gateway-allowed-routes=false. The controller will skip patching the Gateway globally, and the admin is responsible for configuring the listener to accept HTTPRoutes from ModelDeployment namespaces.
[!NOTE] When --patch-gateway-allowed-routes=false is set and the Gateway does not allow routes from the ModelDeployment's namespace, the HTTPRoute will not be accepted by the Gateway and the model will not be reachable through the gateway endpoint.
Per-deployment Configuration
Each ModelDeployment can override gateway behavior:
spec:
  gateway:
    # Disable gateway integration for this specific deployment
    enabled: false
    # Override the model name used in routing (defaults to auto-discovered from /v1/models, or spec.model.id)
    modelName: "my-custom-model-name"
| Field | Default | Description |
|---|---|---|
| spec.gateway.enabled | true (when Gateway detected) | Set to false to skip InferencePool/HTTPRoute creation |
| spec.gateway.modelName | Auto-discovered or spec.model.id | Model name used for routing and in API requests |
Provider-Managed Gateway Resources
Some inference providers (e.g., NVIDIA Dynamo, llm-d) have native Gateway API Inference Extension support with their own InferencePool and Endpoint Picker (EPP). These providers deploy specialized EPPs with capabilities beyond the generic upstream EPP — for example, Dynamo's EPP uses KV-cache-aware scoring to route requests to endpoints with the highest KV cache hit probability.
When a provider declares gateway capabilities in its InferenceProviderConfig, the controller adapts what it creates. Two extension points exist:
- Full delegation (managesInferencePool: true): the provider owns both the InferencePool and the EPP. The controller skips creating either and only wires the HTTPRoute. Used by Dynamo.
- EPP customization (endpointPicker: { image, configData }): the controller still creates the InferencePool, the EPP, and the supporting scaffolding, but substitutes the provider's EPP image and plugin configuration. Used by llm-d.
endpointPicker is ignored when managesInferencePool: true — full delegation supersedes any EPP override.
How It Works
Providers declare gateway capabilities in their InferenceProviderConfig:
apiVersion: airunway.ai/v1alpha1
kind: InferenceProviderConfig
metadata:
  name: dynamo
spec:
  capabilities:
    engines:
      - name: vllm
        gateway:
          managesInferencePool: true # Provider creates and owns the InferencePool/EPP
          inferencePoolNamePattern: "{name}-pool" # Pattern for the pool name
          inferencePoolNamespace: "{namespace}" # Namespace where the pool is created
      - name: sglang
        gateway:
          managesInferencePool: true
          inferencePoolNamePattern: "{name}-pool"
          inferencePoolNamespace: "{namespace}"
      - name: trtllm
        gateway:
          managesInferencePool: true
          inferencePoolNamePattern: "{name}-pool"
          inferencePoolNamespace: "{namespace}"
The controller adapts its reconciliation based on these fields:
| Field | When set | When unset / absent |
|---|---|---|
| managesInferencePool | When set to true, the controller waits for the provider's InferencePool to exist, then uses it as the HTTPRoute backend. Skips reconcileInferencePool(), reconcileEPP(), and labelModelPods(). | Controller creates and owns the InferencePool and the EPP (default behavior). |
| endpointPicker.image / endpointPicker.configData | Controller still creates the InferencePool and EPP Deployment/Service, but the EPP container uses the provider's image and the EPP ConfigMap carries configData as default-plugins.yaml. | Controller deploys the generic upstream GAIE EPP image with an empty plugin config. |
The HTTPRoute is always managed by the controller regardless of provider capabilities.
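The table can be read as a small decision function. The Python sketch below is one way to express it, under stated assumptions: reconcile_plan and its return shape are invented for illustration, and "upstream-default" is a placeholder standing in for the generic upstream GAIE EPP image rather than a real image reference.

```python
def reconcile_plan(gateway_caps=None):
    """Map a provider's gateway capabilities to who owns each resource.

    `gateway_caps` mirrors the `gateway` block of an engine entry in
    InferenceProviderConfig; None means the provider declares nothing.
    """
    caps = gateway_caps or {}
    if caps.get("managesInferencePool"):
        # Full delegation: the provider owns pool and EPP; endpointPicker is ignored.
        return {
            "InferencePool": "provider",
            "EPP": "provider",
            "HTTPRoute": "controller",
            "eppImage": None,
        }
    picker = caps.get("endpointPicker") or {}
    return {
        "InferencePool": "controller",
        "EPP": "controller",
        "HTTPRoute": "controller",
        # Placeholder string for the generic upstream image when no override is set.
        "eppImage": picker.get("image", "upstream-default"),
    }


if __name__ == "__main__":
    print(reconcile_plan({"managesInferencePool": True}))
    print(reconcile_plan({"endpointPicker": {"image": "ghcr.io/llm-d/llm-d-inference-scheduler:v0.6.0"}}))
    print(reconcile_plan())
```

In all three branches the HTTPRoute stays with the controller, matching the sentence above.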
Cross-Namespace Routing
Provider-managed resources often live in a different namespace than the ModelDeployment (e.g., Dynamo pods and InferencePool are in dynamo-system). The controller handles this by:
- Setting the HTTPRoute backend ref with the provider pool's namespace
- Creating a ReferenceGrant in the pool's namespace to allow cross-namespace HTTPRoute references
Single Gateway
├─ HTTPRoute "llama-70b" → Dynamo InferencePool (dynamo-system) → KV-aware EPP
├─ HTTPRoute "phi-4" → Controller InferencePool (default) → generic EPP → KAITO
└─ HTTPRoute "mistral" → Controller InferencePool (default) → generic EPP → KubeRay
Pool Name Resolution
The inferencePoolNamePattern supports {name} and {namespace} placeholders, substituted with the ModelDeployment's name and namespace:
| Pattern | ModelDeployment (namespace/name) | Resolved Pool Name |
|---|---|---|
| {namespace}-{name}-pool | default/llama-70b | default-llama-70b-pool |
| {name}-pool | default/llama-70b | llama-70b-pool |
| (empty) | default/llama-70b | llama-70b (fallback to MD name) |
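The substitution in the table can be reproduced with plain string replacement. This is a minimal Python sketch of the rule as documented, with a hypothetical function name; the controller's own implementation may differ in details such as validation.

```python
def resolve_pool_name(pattern: str, name: str, namespace: str) -> str:
    """Expand {name} and {namespace}; an empty pattern falls back to the MD name."""
    if not pattern:
        return name
    return pattern.replace("{name}", name).replace("{namespace}", namespace)


if __name__ == "__main__":
    for pattern in ("{namespace}-{name}-pool", "{name}-pool", ""):
        print(repr(pattern), "->", resolve_pool_name(pattern, "llama-70b", "default"))
```

The three patterns printed by the example are the same three rows as the table.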
Cleanup Behavior
When gateway resources are cleaned up (e.g., gateway.enabled: false):
- Controller-managed InferencePool and EPP resources are deleted normally
- Provider-managed InferencePool and EPP resources are not deleted — they are owned by the provider and cleaned up when the underlying provider CRD (e.g., DynamoGraphDeployment) is deleted
- The HTTPRoute is always deleted by the controller (it always owns the HTTPRoute)
Dynamo Provider Gateway Support
The Dynamo provider registers full gateway capabilities. When a ModelDeployment uses Dynamo with gateway enabled:
- The Dynamo operator creates a DynamoGraphDeployment with an Epp service configured for KV-cache-aware scoring
- The Dynamo operator creates an InferencePool pointing at its managed EPP
- The AI Runway controller detects the provider's gateway capabilities, waits for the InferencePool, then creates the ReferenceGrant and HTTPRoute
- Requests are routed through Dynamo's KV-cache-aware EPP; the generic EPP is never created for these deployments
llm-d Provider Gateway Support
The llm-d provider takes the EPP-customization path: the controller still owns the InferencePool and the EPP Deployment/Service, but uses llm-d's scheduler image and plugin chain. The provider declares only endpointPicker on the vLLM engine — managesInferencePool stays false:
apiVersion: airunway.ai/v1alpha1
kind: InferenceProviderConfig
metadata:
  name: llmd
spec:
  capabilities:
    engines:
      - name: vllm
        gateway:
          endpointPicker:
            image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.6.0
            configData: |
              apiVersion: inference.networking.x-k8s.io/v1alpha1
              kind: EndpointPickerConfig
              plugins:
                - type: prefix-cache-scorer
                - type: decode-filter
                - type: max-score-picker
                - type: single-profile-handler
              schedulingProfiles:
                - name: default
                  plugins:
                    - pluginRef: decode-filter
                    - pluginRef: max-score-picker
                    - pluginRef: prefix-cache-scorer
                      weight: 2
When a ModelDeployment uses llm-d with gateway enabled:
- The llm-d provider creates the model server Deployment + Service in the ModelDeployment's namespace
- The AI Runway controller creates the InferencePool, the EPP Deployment + Service (using the llm-d image), the EPP ConfigMap (containing configData as default-plugins.yaml), and the HTTPRoute
- Requests are routed through the llm-d scheduler's plugin chain (prefix-cache-aware scoring, decode-filter, max-score-picker) instead of the generic EPP defaults
Model Name Resolution
The controller resolves the gateway model name using this priority, highest first:
- spec.gateway.modelName: explicit override, always wins
- spec.model.servedName: user-specified served name
- Auto-discovered from /v1/models: the controller probes the running model server's OpenAI-compatible /v1/models endpoint and uses the first model ID returned. This handles baked-in images where the served name differs from spec.model.id.
- spec.model.id: final fallback
Auto-discovery runs only when the deployment reaches Running phase. If the probe fails (timeout, error, no models), it silently falls through to the next level.
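The priority order reads naturally as a chain of fallbacks. The Python sketch below illustrates it under a few assumptions: the function name is hypothetical, spec is a plain dict mirroring the ModelDeployment spec, and discovered_ids stands for whatever the /v1/models probe returned (None or empty when the probe failed or has not run yet).

```python
def resolve_model_name(spec: dict, discovered_ids=None) -> str:
    """Pick the gateway model name, highest-priority source first."""
    gateway = spec.get("gateway") or {}
    model = spec.get("model") or {}
    if gateway.get("modelName"):
        return gateway["modelName"]   # explicit override
    if model.get("servedName"):
        return model["servedName"]    # user-specified served name
    if discovered_ids:
        return discovered_ids[0]      # first ID from the /v1/models probe
    return model["id"]                # final fallback


if __name__ == "__main__":
    spec = {"model": {"id": "Qwen/Qwen3-0.6B"}}
    print(resolve_model_name(spec))                     # probe not available
    print(resolve_model_name(spec, ["qwen3-served"]))   # probe succeeded
```

A failed probe is represented here simply by passing nothing, so the function falls through to spec.model.id.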
Using the Gateway
Finding the Gateway Endpoint
# Get the Gateway address
kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}'
# Or check the ModelDeployment status
kubectl get modeldeployment qwen3 -o jsonpath='{.status.gateway.endpoint}'
Calling Models via curl
GATEWAY_IP=$(kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl http://${GATEWAY_IP}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Calling Models via Python (OpenAI SDK)
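import os

from openai import OpenAI

# Gateway address from "Finding the Gateway Endpoint"; export GATEWAY_IP in your shell first.
GATEWAY_IP = os.environ["GATEWAY_IP"]

client = OpenAI(
    base_url=f"http://{GATEWAY_IP}/v1",
    api_key="unused",  # No auth by default
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)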
Multiple Models, One Endpoint
The gateway routes to the correct model based on the model field in the request body. Deploy multiple models and call them all through the same endpoint:
# Call model A
curl http://${GATEWAY_IP}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hi"}]}'
# Call model B through the same endpoint
curl http://${GATEWAY_IP}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'
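The same two calls can be made from Python without extra dependencies. The sketch below only builds the requests with the standard library; build_chat_request is a hypothetical helper, and 203.0.113.10 is a documentation-range placeholder for your Gateway address. Both requests share one URL and differ only in the model field of the body.

```python
import json
import urllib.request


def build_chat_request(gateway_ip: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the shared endpoint."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"http://{gateway_ip}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    for model in ("Qwen/Qwen3-0.6B", "meta-llama/Llama-3.1-8B-Instruct"):
        req = build_chat_request("203.0.113.10", model, "Hi")
        print(req.get_method(), req.full_url, json.loads(req.data)["model"])
```

To actually send a request against a live Gateway, pass it to urllib.request.urlopen(req) and read the JSON response.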
Troubleshooting
Gateway integration is not activating
Symptom: No InferencePool or HTTPRoute created for deployments.
- Check that CRDs are installed:
  kubectl api-resources | grep -E "inferencepools|httproutes|gateways"
- Check controller logs for detection messages:
  kubectl logs -n airunway-system deploy/airunway-controller-manager | grep -i gateway
- If CRDs were installed after the controller started, restart the controller to refresh detection.
GatewayReady condition is False
Symptom: ModelDeployment has GatewayReady=False.
- Check the condition message:
  kubectl get modeldeployment <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="GatewayReady")'
- Common reasons:
  - NoGateway: No Gateway resource found. Create one or set --gateway-name/--gateway-namespace.
  - Multiple Gateways: Multiple Gateways exist but none is labeled airunway.ai/inference-gateway=true.
  - InferencePoolFailed / HTTPRouteFailed: RBAC issue or CRD version mismatch.
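If jq is not available, the same filter is easy to reproduce in Python. This is a small sketch with a hypothetical helper name; it assumes the conditions arrive as a JSON array, and the sample message text is made up for the demonstration.

```python
import json


def find_condition(conditions_json: str, cond_type: str = "GatewayReady"):
    """Return the first condition of the given type, or None if it is absent."""
    for cond in json.loads(conditions_json):
        if cond.get("type") == cond_type:
            return cond
    return None


if __name__ == "__main__":
    # Sample input shaped like .status.conditions; values are illustrative only.
    sample = json.dumps([
        {"type": "Ready", "status": "True"},
        {"type": "GatewayReady", "status": "False", "reason": "NoGateway",
         "message": "example message"},
    ])
    cond = find_condition(sample)
    print(cond["status"], cond["reason"])
```

The reason field of the returned condition is what to compare against the list of common reasons.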
Requests return 404 or connection refused
- Verify the Gateway has an address:
  kubectl get gateway inference-gateway -o jsonpath='{.status.addresses}'
- Verify the HTTPRoute is accepted:
  kubectl get httproute <deployment-name> -o yaml
- Verify the InferencePool matches running pods:
  kubectl get inferencepool <deployment-name> -o yaml
  kubectl get pods -l airunway.ai/model-deployment=<deployment-name>
- If the Gateway has a public IP on AKS but requests to that IP time out, make sure the Gateway sets:
  spec:
    infrastructure:
      annotations:
        service.beta.kubernetes.io/port_80_health-probe_protocol: tcp
  Azure can otherwise probe GET / on port 80. Istio's gateway returns 404 there, so the load balancer marks the backend unhealthy even though requests succeed through kubectl port-forward.