Monitoring
Inference service exposes Prometheus metrics for monitoring system stats.
vLLM runtime
vLLM exposes Prometheus metrics at the /metrics endpoint. These metrics provide detailed insights into the system's performance, resource utilization, and request processing statistics. The following table lists all available metrics:
vLLM version: 0.8.2
note
vLLM V1 engine is now enabled by default. Check here for details.
| Category | Metric Name | Type | Description |
|---|---|---|---|
| System Stats - Scheduler | vllm:num_requests_running | Gauge | Number of requests currently running on GPU |
| System Stats - Scheduler | vllm:num_requests_waiting | Gauge | Number of requests waiting to be processed |
| System Stats - Scheduler | vllm:lora_requests_info | Gauge | Running stats on lora requests |
| System Stats - Scheduler | vllm:num_requests_swapped | Gauge | Number of requests swapped to CPU (DEPRECATED: KV cache offloading is not used in V1) |
| System Stats - Cache | vllm:gpu_cache_usage_perc | Gauge | GPU KV-cache usage. 1 means 100 percent usage |
| System Stats - Cache | vllm:cpu_cache_usage_perc | Gauge | CPU KV-cache usage. 1 means 100 percent usage (DEPRECATED: KV cache offloading is not used in V1) |
| System Stats - Cache | vllm:cpu_prefix_cache_hit_rate | Gauge | CPU prefix cache block hit rate (DEPRECATED: KV cache offloading is not used in V1) |
| System Stats - Cache | vllm:gpu_prefix_cache_hit_rate | Gauge | GPU prefix cache block hit rate (DEPRECATED: use vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits in V1) |
| System Stats - Cache | vllm:gpu_cache_usage_perc | Counter | GPU prefix cache queries, in terms of number of queried blocks (V1 only) |
| System Stats - Cache | vllm:gpu_prefix_cache_hits | Counter | GPU prefix cache hits, in terms of number of cached blocks (V1 only) |
| Iteration Stats | vllm:num_preemptions_total | Counter | Cumulative number of preemption from the engine |
| Iteration Stats | vllm:prompt_tokens_total | Counter | Number of prefill tokens processed |
| Iteration Stats | vllm:generation_tokens_total | Counter | Number of generation tokens processed |
| Iteration Stats | vllm:iteration_tokens_total | Histogram | Number of tokens per engine_step |
| Iteration Stats | vllm:time_to_first_token_seconds | Histogram | Time to first token in seconds |
| Iteration Stats | vllm:time_per_output_token_seconds | Histogram | Time per output token in seconds |
| Request Stats | vllm:e2e_request_latency_seconds | Histogram | End to end request latency in seconds |
| Request Stats | vllm:request_queue_time_seconds | Histogram | Time spent in WAITING phase for request |
| Request Stats | vllm:request_inference_time_seconds | Histogram | Time spent in RUNNING phase for request |
| Request Stats | vllm:request_prefill_time_seconds | Histogram | Time spent in PREFILL phase for request |
| Request Stats | vllm:request_decode_time_seconds | Histogram | Time spent in DECODE phase for request |
| Request Stats | vllm:request_prompt_tokens | Histogram | Number of prefill tokens processed per request |
| Request Stats | vllm:request_generation_tokens | Histogram | Number of generation tokens processed per request |
| Request Stats | vllm:request_max_num_generation_tokens | Histogram | Maximum number of requested generation tokens |
| Request Stats | vllm:request_params_n | Histogram | The 'n' request parameter |
| Request Stats | vllm:request_params_max_tokens | Histogram | The 'max_tokens' request parameter |
| Request Stats | vllm:request_success_total | Counter | Count of successfully processed requests |
| Speculative Decoding | vllm:spec_decode_draft_acceptance_rate | Gauge | Speculative token acceptance rate (DEPRECATED: Unused in V1) |
| Speculative Decoding | vllm:spec_decode_efficiency | Gauge | Speculative decoding system efficiency (DEPRECATED: Unused in V1) |
| Speculative Decoding | vllm:spec_decode_num_accepted_tokens_total | Counter | Number of accepted tokens |
| Speculative Decoding | vllm:spec_decode_num_draft_tokens_total | Counter | Number of draft tokens |
| Speculative Decoding | vllm:spec_decode_num_emitted_tokens_total | Counter | Number of emitted tokens (DEPRECATED: Unused in V1) |