Observability

AI Runway exposes Prometheus metrics from both the controller and inference provider pods, with a pre-built Grafana dashboard for visualization.

What's tracked

Controller metrics - reconciliation duration, errors, provider selection events
Deployment state - aggregate deployment phase counts by provider/phase, replica counts
Platform engineering indicators - phase transitions (deployment frequency, change failure rate), lead time (creation → ready), infrastructure provisioning duration
Inference engine metrics - vLLM request queues, time-to-first-token, KV-cache utilization, token throughput (via provider PodMonitors)

[!NOTE]

The controller emits both classic histogram buckets and native histograms for improved compatibility. When using Prometheus with native histogram support enabled (--enable-feature=native-histograms), users may need to configure their ServiceMonitor with scrapeClassicHistograms: true.

Getting started

See the Observability Demo for a complete walkthrough covering:

Installing kube-prometheus-stack
Configuring Prometheus to scrape the controller (ServiceMonitor + RBAC)
Setting up PodMonitors for each inference provider (KAITO, Dynamo, KubeRay, llm-d)
Importing the Grafana dashboard

Kubernetes Events

The controller also emits Kubernetes events for key lifecycle moments:

Events:
  Type    Reason              Message
  ----    ------              -------
  Normal  ProviderSelected    Selected provider 'kaito': matched capabilities
  Normal  ResourceCreated     Created Workspace 'my-model'
  Warning ProviderError       Provider resource in error state: insufficient GPUs
  Warning DriftDetected       Provider resource was modified directly, reconciling

What's tracked​

Getting started​

Kubernetes Events​

See also​

What's tracked

Getting started

Kubernetes Events

See also