Gateway API Inference Extension Integration

Pinned versions: the GAIE_VERSION referenced in this document is sourced from /versions.env at the repo root. Substitute that value (currently v1.5.0) when running the commands below, or source the file in your shell: set -a; source versions.env; set +a.

Overview

AI Runway integrates with the Gateway API Inference Extension to provide a unified inference gateway. Instead of accessing each model's Service individually, you deploy a single Gateway and call all models through one endpoint using the standard OpenAI-compatible API. The Gateway routes requests to the correct model based on the model field in the request body.

When gateway integration is active, AI Runway automatically creates an InferencePool, Endpoint Picker (EPP), and an HTTPRoute for each ModelDeployment. You only need to provide the Gateway itself.

Architecture

                    ┌───────────────────────────────────────────────┐
                    │              Kubernetes Cluster               │
                    │                                               │
┌─────────┐         │  ┌─────────┐       ┌───────────┐              │
│ Client  │────────▶│  │ Gateway │──────▶│ HTTPRoute │              │
│ (curl/  │         │  │ + BBR   │       │           │              │
│ openai) │         │  └─────────┘       └─────┬─────┘              │
└─────────┘         │                          │                    │
                    │                          ▼                    │
                    │                  ┌───────────────┐            │
                    │                  │ InferencePool │            │
                    │                  │ (auto-created)│            │
                    │                  └───────┬───────┘            │
                    │                          │                    │
                    │                          ▼                    │
                    │                  ┌───────────────┐            │
                    │                  │ EPP (Endpoint │            │
                    │                  │ Picker Proxy) │            │
                    │                  │ (auto-created)│            │
                    │                  └───────┬───────┘            │
                    │                          │                    │
                    │                          ▼                    │
                    │                  ┌───────────────┐            │
                    │                  │ Model Server  │            │
                    │                  │ Pod (vLLM,    │            │
                    │                  │ sglang, etc.) │            │
                    │                  └───────────────┘            │
                    └───────────────────────────────────────────────┘

Request flow: Client → Gateway (+BBR) → HTTPRoute → InferencePool → Endpoint Picker (EPP) → Model Server Pod

What AI Runway creates automatically (when gateway.enabled is true or omitted, and Gateway CRDs are detected):

  • InferencePool — selects pods labeled with airunway.ai/model-deployment: <name> on the model's serving port
  • HTTPRoute — routes from the Gateway to the InferencePool (unless httpRouteRef is set)
  • EPP — Endpoint Picker Proxy for intelligent endpoint selection

What you provide:

  • A Gateway resource (with any compatible implementation)

Prerequisites

Gateway Implementations

AI Runway works with any Gateway API implementation that supports the Inference Extension. You are responsible for installing and managing your own gateway. Some known implementations:

| Implementation | gatewayClassName | Status | Docs |
| --- | --- | --- | --- |
| Envoy Gateway | eg | Not tested | Inference Extension guide |
| Istio | istio | Tested | Inference Extension guide |
| kgateway | kgateway | Tested (still requires the X-Gateway-Model-Name header) | Inference Extension guide |
| GKE Gateway | gke-l7-rilb | Not tested | GKE Inference guide |

Note: The only difference between implementations is the gatewayClassName in your Gateway resource. All AI Runway-managed resources (InferencePool, HTTPRoute) are identical regardless of which gateway you use.

Setup

[!TIP] Istio shortcut: make setup-gateway (from the repo root) performs the entire manual Istio setup below in one shot — it installs the Gateway API CRDs (Step 1), the Gateway API Inference Extension (GAIE) CRDs (Step 2), Istio with the inference extension enabled (Step 3), the inference-gateway Gateway resource (Step 4), and the Body-Based Router (see Body-Based Routing). The GATEWAY_API_VERSION, ISTIO_VERSION, and GAIE_VERSION it uses are pinned in /versions.env, and istioctl must be on your PATH. For other gateway implementations, follow the manual steps below.

Step 1: Install Gateway API CRDs

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/latest/download/standard-install.yaml

Step 2: Install Gateway API Inference Extension CRDs

kubectl apply -f "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${GAIE_VERSION}/manifests.yaml"

Step 3: Install a Gateway Implementation

Follow the installation guide for your chosen implementation, linked in the Docs column of the table under Gateway Implementations.

[!NOTE] Istio: Inference Extension support must be explicitly enabled by setting ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true on the istiod deployment (or passing --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true during istioctl install). Without this, Istio ignores InferencePool backend refs in HTTPRoutes. The minimal profile is sufficient — Istio auto-creates a gateway deployment and LoadBalancer Service when you create a Gateway resource. See the Istio Inference Extension guide for full details.

Step 4: Create a Gateway Resource

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: default
spec:
  gatewayClassName: eg # Change to match your implementation
  infrastructure:
    annotations:
      # Required on AKS with Istio. Azure otherwise probes GET / on port 80,
      # but the gateway returns 404 there and the public IP can time out.
      service.beta.kubernetes.io/port_80_health-probe_protocol: tcp
  listeners:
  - name: http
    protocol: HTTP
    port: 80

If you have multiple Gateways in the cluster, label the one to use for inference:

metadata:
  labels:
    airunway.ai/inference-gateway: "true"

[!NOTE] AKS with Istio: Keep the spec.infrastructure.annotations.service.beta.kubernetes.io/port_80_health-probe_protocol: tcp setting in your Gateway. Azure otherwise configures an HTTP health probe for / on port 80, but Istio's generated gateway returns 404 on /. The result is a public IP that times out even though the gateway works through kubectl port-forward or from inside the cluster.

Step 5: Deploy Models

Deploy models as usual. AI Runway automatically creates the InferencePool, EPP, and HTTPRoute:

apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3
  namespace: default
spec:
  model:
    id: "Qwen/Qwen3-0.6B"
  gateway:
    enabled: true # Optional: enabled by default when Gateway is detected; set to false to explicitly disable

The ModelDeployment status will show gateway information once ready:

kubectl get modeldeployment qwen3 -o jsonpath='{.status.gateway}'

Configuration

Auto-detection

The controller auto-detects Gateway API Inference Extension CRDs at startup by querying the Kubernetes discovery API. If the CRDs (InferencePool, HTTPRoute, Gateway) are present, gateway integration is enabled. If not, it is silently disabled — no errors, no resources created.
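
The detection rule above can be sketched as a small pure function. This is an illustration only: the real controller queries the Kubernetes discovery API, while the function name and the plain set of kind names used here are hypothetical stand-ins.

```python
# Illustrative sketch only; not the controller's actual code.
# REQUIRED_KINDS mirrors the three CRD kinds named in this section.
REQUIRED_KINDS = frozenset({"InferencePool", "HTTPRoute", "Gateway"})


def gateway_integration_enabled(discovered_kinds):
    """Return True only if every required kind was discovered.

    `discovered_kinds` is a hypothetical stand-in for the kinds reported
    by the Kubernetes discovery API when the controller starts.
    """
    return REQUIRED_KINDS.issubset(discovered_kinds)


# All three kinds present: integration would be enabled.
all_present = gateway_integration_enabled(
    {"Gateway", "HTTPRoute", "InferencePool", "Pod"}
)
# InferencePool missing: integration would be silently disabled.
pool_missing = gateway_integration_enabled({"Gateway", "HTTPRoute"})
```

Because the check runs once at startup, installing the CRDs later has no effect until the controller restarts.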

Explicit Gateway Selection

If you have multiple Gateways or want deterministic behavior, use controller flags:

--gateway-name=inference-gateway
--gateway-namespace=default

When set, the controller always uses the specified Gateway as the HTTPRoute parent instead of auto-detecting.

Endpoint Picker (EPP) Configuration

The controller automatically deploys an EPP (Endpoint Picker Proxy) per ModelDeployment, named <deployment-name>-epp. The EPP handles intelligent request routing to model server pods.

--epp-service-port=9002 # EPP Service port (default: 9002)
--epp-image=<image> # EPP container image (default: upstream GAIE image)
--patch-gateway-allowed-routes=true # Patch Gateway allowedRoutes for cross-namespace routing (default: true)

Body-Based Routing (BBR)

When serving multiple models through a single Gateway, a Body-Based Router (BBR) is needed to extract the model field from the request body and route to the correct InferencePool. BBR is a separate component deployed via the upstream GAIE helm chart.
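
To make the idea concrete, here is a minimal sketch of what "body-based routing" means: read the model field from an OpenAI-style JSON body and surface it as a routing header. The header name X-Gateway-Model-Name comes from this document; the function itself is hypothetical and is not the upstream BBR implementation.

```python
import json

# Header name taken from this document; everything else is illustrative.
MODEL_HEADER = "X-Gateway-Model-Name"


def model_header_from_body(raw_body):
    """Derive the routing header from an OpenAI-style request body.

    Returns an empty dict when the body is not a JSON object or carries
    no non-empty string `model` field, leaving the fallback to the caller.
    """
    try:
        payload = json.loads(raw_body)
    except (ValueError, TypeError):
        return {}
    if not isinstance(payload, dict):
        return {}
    model = payload.get("model")
    if not isinstance(model, str) or not model:
        return {}
    return {MODEL_HEADER: model}


headers = model_header_from_body(
    b'{"model": "Qwen/Qwen3-0.6B", "messages": []}'
)
```

The HTTPRoute for each model can then match on that header, which is why a request whose header is missing may land on the wrong InferencePool.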

Install BBR using the upstream helm chart:

helm install body-based-router \
  --set provider.name=istio \
  --version "${GAIE_VERSION}" \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing

[!NOTE] The BBR chart version should match the GAIE version used by AI Runway. The pinned value lives in /versions.env; update both at the same time when bumping.

Set provider.name to match your gateway implementation (istio or gke), or omit the flag for other implementations. The chart deploys the BBR container and any provider-specific resources (e.g. EnvoyFilter for Istio).

See the upstream multi-model guide for full details.

Known limitation — BBR restart on each new model. BBR builds its model registry only at startup and does not dynamically watch InferencePools, so the controller triggers a rolling restart of the shared BBR Deployment once per new ModelDeployment (tracked by the airunway.ai/bbr-restarted annotation). The restart is not zero-downtime: while BBR is restarting, its registry is incomplete, so an in-flight request for an already-serving model can miss its X-Gateway-Model-Name header and mis-route to another model's InferencePool. With disaggregated Dynamo serving this surfaces as a Worker ID required (--direct-route) 500 on a concurrent aggregated request. This mainly affects deploying multiple models close together; once all models are settled, routing is correct and stable. A zero-downtime BBR reload (or a BBR that watches InferencePools) would remove the window. The GPU e2e suite leaves disaggregated serving out of its default matrix for this reason.

Auto-detection with Multiple Gateways

When no explicit gateway is configured and multiple Gateway resources exist in the cluster, the controller looks for one labeled with:

airunway.ai/inference-gateway: "true"

If no labeled Gateway is found, the controller skips gateway reconciliation and sets the GatewayReady condition to False.
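
The selection rules described in this section and under Explicit Gateway Selection can be summarized in a short sketch. The function and data shapes are hypothetical. Two behaviors are assumptions not stated in this document: that a single unlabeled Gateway is used as-is, and that the first labeled Gateway wins when several carry the label.

```python
LABEL = "airunway.ai/inference-gateway"


def select_gateway(gateways, explicit=None):
    """Pick the Gateway to use as the HTTPRoute parent, or None.

    gateways: list of dicts with "name", "namespace", optional "labels".
    explicit: optional (name, namespace) tuple standing in for the
              --gateway-name / --gateway-namespace controller flags.
    """
    if explicit is not None:
        # Explicit flags always win over auto-detection.
        name, namespace = explicit
        return {"name": name, "namespace": namespace}
    labeled = [
        g for g in gateways if (g.get("labels") or {}).get(LABEL) == "true"
    ]
    if labeled:
        # Assumption: the first labeled Gateway is chosen.
        first = labeled[0]
        return {"name": first["name"], "namespace": first["namespace"]}
    if len(gateways) == 1:
        # Assumption: a lone unlabeled Gateway is used as-is.
        only = gateways[0]
        return {"name": only["name"], "namespace": only["namespace"]}
    # No Gateway, or several unlabeled ones: reconciliation is skipped
    # and GatewayReady would be set to False.
    return None
```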

Cross-namespace Gateway

When the Gateway is in a different namespace than the ModelDeployment, the controller automatically patches each Gateway listener to allow HTTPRoutes from the ModelDeployment's namespace using a namespace selector:

allowedRoutes:
  namespaces:
    from: Selector
    selector:
      matchLabels:
        kubernetes.io/metadata.name: <modeldeployment-namespace>

This is required because Gateway API uses allowedRoutes on the listener to control cross-namespace route binding. Without it, the Gateway will reject HTTPRoutes from other namespaces.
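
As a rough sketch of the patch described above, the function below applies the same allowedRoutes selector to every listener of a Gateway held as a plain dict. The function name is hypothetical, and the sketch simply overwrites any existing allowedRoutes; how the controller merges selectors when several namespaces need access is not covered by this document.

```python
import copy


def allow_namespace(gateway, namespace):
    """Return a copy of `gateway` whose listeners admit routes from `namespace`."""
    patched = copy.deepcopy(gateway)  # leave the caller's object untouched
    for listener in patched.get("spec", {}).get("listeners", []):
        listener["allowedRoutes"] = {
            "namespaces": {
                "from": "Selector",
                "selector": {
                    "matchLabels": {"kubernetes.io/metadata.name": namespace}
                },
            }
        }
    return patched


gateway = {
    "spec": {"listeners": [{"name": "http", "protocol": "HTTP", "port": 80}]}
}
patched_gateway = allow_namespace(gateway, "team-a")
```

The selector relies on the kubernetes.io/metadata.name label, which Kubernetes sets automatically on every namespace.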

Opting out of Gateway patching: In security-conscious environments where a Gateway admin manages allowedRoutes independently, start the controller with --patch-gateway-allowed-routes=false. The controller will skip patching the Gateway globally, and the admin is responsible for configuring the listener to accept HTTPRoutes from ModelDeployment namespaces.

[!NOTE] When --patch-gateway-allowed-routes=false is set and the Gateway does not allow routes from the ModelDeployment's namespace, the HTTPRoute will not be accepted by the Gateway and the model will not be reachable through the gateway endpoint.

Per-deployment Configuration

Each ModelDeployment can override gateway behavior:

spec:
  gateway:
    # Disable gateway integration for this specific deployment
    enabled: false
    # Override the model name used in routing (defaults to auto-discovered from /v1/models, or spec.model.id)
    modelName: "my-custom-model-name"

| Field | Default | Description |
| --- | --- | --- |
| spec.gateway.enabled | true (when Gateway detected) | Set to false to skip InferencePool/HTTPRoute creation |
| spec.gateway.modelName | Auto-discovered or spec.model.id | Model name used for routing and in API requests |

Provider-Managed Gateway Resources

Some inference providers (e.g., NVIDIA Dynamo, llm-d) have native Gateway API Inference Extension support with their own InferencePool and Endpoint Picker (EPP). These providers deploy specialized EPPs with capabilities beyond the generic upstream EPP — for example, Dynamo's EPP uses KV-cache-aware scoring to route requests to endpoints with the highest KV cache hit probability.

When a provider declares gateway capabilities in its InferenceProviderConfig, the controller adapts what it creates. Two extension points exist:

  1. Full delegation (managesInferencePool: true): the provider owns both the InferencePool and the EPP. The controller skips creating either and only wires the HTTPRoute. Used by Dynamo.
  2. EPP customization (endpointPicker: { image, configData }): the controller still creates the InferencePool and the EPP Deployment, Service, and ConfigMap, but substitutes the provider's EPP image and plugin configuration. Used by llm-d.

endpointPicker is ignored when managesInferencePool: true — full delegation supersedes any EPP override.

How It Works

Providers declare gateway capabilities in their InferenceProviderConfig:

apiVersion: airunway.ai/v1alpha1
kind: InferenceProviderConfig
metadata:
  name: dynamo
spec:
  capabilities:
    engines:
    - name: vllm
      gateway:
        managesInferencePool: true # Provider creates and owns the InferencePool/EPP
        inferencePoolNamePattern: "{name}-pool" # Pattern for the pool name
        inferencePoolNamespace: "{namespace}" # Namespace where the pool is created
    - name: sglang
      gateway:
        managesInferencePool: true
        inferencePoolNamePattern: "{name}-pool"
        inferencePoolNamespace: "{namespace}"
    - name: trtllm
      gateway:
        managesInferencePool: true
        inferencePoolNamePattern: "{name}-pool"
        inferencePoolNamespace: "{namespace}"

The controller adapts its reconciliation based on these fields:

| Field | When set | When unset / absent |
| --- | --- | --- |
| managesInferencePool | When set to true, controller waits for the provider's InferencePool to exist, then uses it as the HTTPRoute backend. Skips reconcileInferencePool(), reconcileEPP(), and labelModelPods(). | Controller creates and owns the InferencePool and the EPP (default behavior). |
| endpointPicker.image / endpointPicker.configData | Controller still creates the InferencePool and EPP Deployment/Service, but the EPP container uses the provider's image and the EPP ConfigMap carries configData as default-plugins.yaml. | Controller deploys the generic upstream GAIE EPP image with an empty plugin config. |

The HTTPRoute is always managed by the controller regardless of provider capabilities.
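
The decision table above can be restated as a small function that reports who owns each resource for a given engine's gateway capabilities. This is a hypothetical sketch, not the controller's reconciliation code; the return shape and the "upstream-default" placeholder string are made up for illustration.

```python
def plan_gateway_resources(gateway_caps=None):
    """Map each gateway resource to its owner for one ModelDeployment.

    gateway_caps: the engine's `gateway` block as a dict, or None when
                  the provider declares no gateway capabilities.
    """
    caps = gateway_caps or {}
    if caps.get("managesInferencePool"):
        # Full delegation: any endpointPicker override is ignored.
        return {
            "InferencePool": "provider",
            "EPP": "provider",
            "HTTPRoute": "controller",
            "eppImage": None,
        }
    picker = caps.get("endpointPicker") or {}
    return {
        "InferencePool": "controller",
        "EPP": "controller",
        "HTTPRoute": "controller",
        # "upstream-default" is a placeholder for the generic GAIE image.
        "eppImage": picker.get("image", "upstream-default"),
    }


dynamo_plan = plan_gateway_resources({"managesInferencePool": True})
llmd_plan = plan_gateway_resources(
    {"endpointPicker": {"image": "ghcr.io/llm-d/llm-d-inference-scheduler:v0.6.0"}}
)
default_plan = plan_gateway_resources()
```

In all three cases the HTTPRoute stays with the controller, matching the rule stated above.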

Cross-Namespace Routing

Provider-managed resources often live in a different namespace than the ModelDeployment (e.g., Dynamo pods and InferencePool are in dynamo-system). The controller handles this by:

  1. Setting the HTTPRoute backend ref with the provider pool's namespace
  2. Creating a ReferenceGrant in the pool's namespace to allow cross-namespace HTTPRoute references

Single Gateway
├─ HTTPRoute "llama-70b" → Dynamo InferencePool (dynamo-system) → KV-aware EPP
├─ HTTPRoute "phi-4"     → Controller InferencePool (default)   → generic EPP → KAITO
└─ HTTPRoute "mistral"   → Controller InferencePool (default)   → generic EPP → KubeRay
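
For orientation, the ReferenceGrant mentioned in step 2 could look roughly like the manifest this sketch builds. The ReferenceGrant shape follows the Gateway API; the object name, the function, and the InferencePool API group default (inference.networking.k8s.io) are assumptions, so check the group against the GAIE version you have installed.

```python
def reference_grant(route_namespace, pool_name, pool_namespace,
                    pool_group="inference.networking.k8s.io"):
    """Build a ReferenceGrant letting HTTPRoutes in one namespace
    reference an InferencePool in another.

    The grant lives in the namespace of the referenced pool.
    `pool_group` is an assumption; adjust it to your installed CRDs.
    """
    return {
        "apiVersion": "gateway.networking.k8s.io/v1beta1",
        "kind": "ReferenceGrant",
        "metadata": {
            # Hypothetical name; the controller's naming is not documented here.
            "name": f"allow-httproutes-from-{route_namespace}",
            "namespace": pool_namespace,
        },
        "spec": {
            "from": [{
                "group": "gateway.networking.k8s.io",
                "kind": "HTTPRoute",
                "namespace": route_namespace,
            }],
            "to": [{
                "group": pool_group,
                "kind": "InferencePool",
                "name": pool_name,
            }],
        },
    }


grant = reference_grant("default", "llama-70b-pool", "dynamo-system")
```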

Pool Name Resolution

The inferencePoolNamePattern supports {name} and {namespace} placeholders, substituted with the ModelDeployment's name and namespace:

| Pattern | ModelDeployment (namespace/name) | Resolved Pool Name |
| --- | --- | --- |
| {namespace}-{name}-pool | default/llama-70b | default-llama-70b-pool |
| {name}-pool | default/llama-70b | llama-70b-pool |
| (empty) | default/llama-70b | llama-70b (falls back to the ModelDeployment name) |
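
The substitution rule in the table is simple enough to express in a few lines. The function name is hypothetical, and this is a sketch of the documented behavior rather than the controller's code.

```python
def resolve_pool_name(pattern, name, namespace):
    """Expand {name} and {namespace} in an inferencePoolNamePattern.

    An empty or missing pattern falls back to the ModelDeployment name.
    """
    if not pattern:
        return name
    return pattern.replace("{namespace}", namespace).replace("{name}", name)


# The three rows of the table, for ModelDeployment default/llama-70b.
pool_a = resolve_pool_name("{namespace}-{name}-pool", "llama-70b", "default")
pool_b = resolve_pool_name("{name}-pool", "llama-70b", "default")
pool_c = resolve_pool_name("", "llama-70b", "default")
```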

Cleanup Behavior

When gateway resources are cleaned up (e.g., gateway.enabled: false):

  • Controller-managed InferencePool and EPP resources are deleted normally
  • Provider-managed InferencePool and EPP resources are not deleted — they are owned by the provider and cleaned up when the underlying provider CRD (e.g., DynamoGraphDeployment) is deleted
  • The HTTPRoute is always deleted by the controller (it always owns the HTTPRoute)

Dynamo Provider Gateway Support

The Dynamo provider registers full gateway capabilities. When a ModelDeployment uses Dynamo with gateway enabled:

  1. The Dynamo operator creates a DynamoGraphDeployment with an Epp service configured for KV-cache-aware scoring
  2. The Dynamo operator creates an InferencePool pointing at its managed EPP
  3. The AI Runway controller detects the provider's gateway capabilities, waits for the InferencePool, and then creates the ReferenceGrant and HTTPRoute
  4. Requests are routed through Dynamo's KV-cache-aware EPP. The generic EPP is never created for this deployment, because the controller skips that step under full delegation.

llm-d Provider Gateway Support

The llm-d provider takes the EPP-customization path: the controller still owns the InferencePool and the EPP Deployment/Service, but uses llm-d's scheduler image and plugin chain. The provider declares only endpointPicker on the vLLM engine — managesInferencePool stays false:

apiVersion: airunway.ai/v1alpha1
kind: InferenceProviderConfig
metadata:
  name: llmd
spec:
  capabilities:
    engines:
    - name: vllm
      gateway:
        endpointPicker:
          image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.6.0
          configData: |
            apiVersion: inference.networking.x-k8s.io/v1alpha1
            kind: EndpointPickerConfig
            plugins:
            - type: prefix-cache-scorer
            - type: decode-filter
            - type: max-score-picker
            - type: single-profile-handler
            schedulingProfiles:
            - name: default
              plugins:
              - pluginRef: decode-filter
              - pluginRef: max-score-picker
              - pluginRef: prefix-cache-scorer
                weight: 2

When a ModelDeployment uses llm-d with gateway enabled:

  1. The llm-d provider creates the model server Deployment + Service in the ModelDeployment's namespace
  2. The AI Runway controller creates the InferencePool, the EPP Deployment + Service (using the llm-d image), the EPP ConfigMap (containing configData as default-plugins.yaml), and the HTTPRoute
  3. Requests are routed through the llm-d scheduler's plugin chain (prefix-cache-aware scoring, decode-filter, max-score-picker) instead of the generic EPP defaults

Model Name Resolution

The controller resolves the gateway model name using this priority:

  1. spec.gateway.modelName — explicit override, always wins
  2. spec.model.servedName — user-specified served name
  3. Auto-discovered from /v1/models — the controller probes the running model server's OpenAI-compatible /v1/models endpoint and uses the first model ID returned. This handles baked-in images where the served name differs from spec.model.id.
  4. spec.model.id — final fallback

Auto-discovery runs only once the deployment reaches the Running phase. If the probe fails (timeout, error, or an empty model list), resolution silently falls through to the next level.
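
The priority order above amounts to a first-match chain. The sketch below works on a ModelDeployment spec held as a plain dict; the function name and the `discovered_ids` parameter (standing in for the model IDs returned by the /v1/models probe) are hypothetical.

```python
def resolve_model_name(spec, discovered_ids=None):
    """Return the gateway model name using the documented priority.

    spec: the ModelDeployment `spec` as a dict.
    discovered_ids: model IDs from the /v1/models probe, or None/empty
                    if the probe failed or has not run yet.
    """
    gateway = spec.get("gateway") or {}
    model = spec.get("model") or {}
    if gateway.get("modelName"):      # 1. explicit override
        return gateway["modelName"]
    if model.get("servedName"):       # 2. user-specified served name
        return model["servedName"]
    if discovered_ids:                # 3. first auto-discovered ID
        return discovered_ids[0]
    return model.get("id")            # 4. final fallback


spec = {"model": {"id": "Qwen/Qwen3-0.6B"}}
before_probe = resolve_model_name(spec)
after_probe = resolve_model_name(spec, ["qwen3-baked"])
```

One consequence of this ordering is that the resolved name can change once the deployment reaches Running and the probe succeeds, unless modelName or servedName pins it.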

Using the Gateway

Finding the Gateway Endpoint

# Get the Gateway address
kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}'

# Or check the ModelDeployment status
kubectl get modeldeployment qwen3 -o jsonpath='{.status.gateway.endpoint}'

Calling Models via curl

GATEWAY_IP=$(kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}')

curl http://${GATEWAY_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Calling Models via Python (OpenAI SDK)

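import os

from openai import OpenAI

# Assumes GATEWAY_IP is exported in the environment
# (for example, `export GATEWAY_IP` after setting it in the shell).
GATEWAY_IP = os.environ["GATEWAY_IP"]

client = OpenAI(
    base_url=f"http://{GATEWAY_IP}/v1",
    api_key="unused",  # No auth by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)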

Multiple Models, One Endpoint

The gateway routes to the correct model based on the model field in the request body. Deploy multiple models and call them all through the same endpoint:

# Call model A
curl http://${GATEWAY_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hi"}]}'

# Call model B through the same endpoint
curl http://${GATEWAY_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'
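
The same pattern can be sketched in Python with only the standard library: one URL, with the model field as the only thing that varies between requests. The helper name is hypothetical and 203.0.113.10 is a documentation placeholder address, not a real gateway. The requests are built but not sent.

```python
import json
import urllib.request


def chat_request(gateway_ip, model, prompt):
    """Build (but do not send) a chat completion request for `model`."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"http://{gateway_ip}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address; substitute your Gateway's address.
GATEWAY_IP = "203.0.113.10"
reqs = [
    chat_request(GATEWAY_IP, model, "Hi")
    for model in ("Qwen/Qwen3-0.6B", "meta-llama/Llama-3.1-8B-Instruct")
]
# Both requests target the same URL; only the body differs.
```

To send one of them against a reachable gateway, pass it to urllib.request.urlopen and read the JSON response.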

Troubleshooting

Gateway integration is not activating

Symptom: No InferencePool or HTTPRoute created for deployments.

  1. Check that CRDs are installed:

    kubectl api-resources | grep -E "inferencepools|httproutes|gateways"

  2. Check controller logs for detection messages:

    kubectl logs -n airunway-system deploy/airunway-controller-manager | grep -i gateway

  3. If CRDs were installed after the controller started, restart the controller to refresh detection.

GatewayReady condition is False

Symptom: ModelDeployment has GatewayReady=False.

  1. Check the condition message:

    kubectl get modeldeployment <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="GatewayReady")'

  2. Common reasons:

    • NoGateway — No Gateway resource found. Create one or set --gateway-name/--gateway-namespace.
    • Multiple Gateways — Multiple Gateways exist but none is labeled airunway.ai/inference-gateway=true.
    • InferencePoolFailed / HTTPRouteFailed — RBAC issue or CRD version mismatch.
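
If jq is not available, the same filter is easy to reproduce in Python. The helper below is a hypothetical sketch that takes the JSON array printed by the kubectl command in step 1; the sample condition, including its message text, is made up for illustration.

```python
import json


def find_condition(conditions_json, cond_type="GatewayReady"):
    """Return the first condition of the given type, or None."""
    for cond in json.loads(conditions_json):
        if cond.get("type") == cond_type:
            return cond
    return None


# Fabricated sample of `.status.conditions`; real output will differ.
sample = json.dumps([
    {"type": "Ready", "status": "True"},
    {
        "type": "GatewayReady",
        "status": "False",
        "reason": "NoGateway",
        "message": "example message only",
    },
])
gateway_ready = find_condition(sample)
```

The reason field of the returned condition is the value to compare against the list above.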

Requests return 404 or connection refused

  1. Verify the Gateway has an address:

    kubectl get gateway inference-gateway -o jsonpath='{.status.addresses}'

  2. Verify the HTTPRoute is accepted:

    kubectl get httproute <deployment-name> -o yaml

  3. Verify the InferencePool matches running pods:

    kubectl get inferencepool <deployment-name> -o yaml
    kubectl get pods -l airunway.ai/model-deployment=<deployment-name>

  4. If the Gateway has a public IP on AKS but requests to that IP time out, make sure the Gateway sets:

    spec:
      infrastructure:
        annotations:
          service.beta.kubernetes.io/port_80_health-probe_protocol: tcp

    Azure can otherwise probe GET / on port 80. Istio's gateway returns 404 there, so the load balancer marks the backend unhealthy even though requests succeed through kubectl port-forward.