# Multi-Node Inference
This document explains how to configure and use multi-node distributed inference in KAITO for large models that require more GPU resources than a single node can provide.
## Overview
Multi-node inference allows you to deploy large AI models across multiple nodes (servers) when the model is too large to fit on a single node. KAITO supports different parallelism strategies depending on your model's requirements:
| Strategy | Use Case | Supported |
|---|---|---|
| Single GPU | Small models that fit on one GPU | ✅ |
| Single-Node Multi-GPU | Models that need multiple GPUs but fit on one node | ✅ |
| Multi-Node Multi-GPU | Very large models (400B+ parameters) requiring multiple nodes | ✅ |
## When to Use Multi-Node Inference
Consider multi-node inference when:
- Your model has 400B+ parameters and cannot fit on a single node
- You need to serve very large language models whose memory requirements exceed single-node capacity
- You have specific performance requirements that benefit from distributed processing
Multi-node inference introduces additional complexity and network overhead. Only use it when your model truly requires more resources than a single node can provide.
## Supported Models
The following preset models support multi-node distributed inference:
- Llama3: `llama-3.3-70b-instruct`
Check the presets documentation for the complete list and their specific requirements.
## Configuration

### Basic Multi-Node Setup
To deploy a model across multiple nodes, specify the node count in your Workspace configuration:
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-large-model
resource:
  count: 2  # Number of nodes to use
  instanceType: "Standard_NC80adis_H100_v5"
  labelSelector:
    matchLabels:
      apps: large-model
inference:
  preset:
    name: "llama-3.3-70b-instruct"
```
### Pre-Provisioned Nodes
If you're using pre-provisioned GPU nodes, specify them explicitly:
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-large-model
resource:
  count: 2
  preferredNodes:
    - gpu-node-1
    - gpu-node-2
  labelSelector:
    matchLabels:
      apps: large-model
inference:
  preset:
    name: "llama-3.3-70b-instruct"
```
Pre-provisioned nodes must have the same matching labels as specified in the resource spec, and each node must report available GPU resources.
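As an illustration (node names and the `apps: large-model` label follow the example above), the nodes can be labeled and their GPU capacity verified with standard kubectl commands:

```bash
# Label the pre-provisioned GPU nodes so they match the workspace labelSelector
kubectl label node gpu-node-1 apps=large-model
kubectl label node gpu-node-2 apps=large-model

# Confirm that each node reports allocatable NVIDIA GPUs
for node in gpu-node-1 gpu-node-2; do
  echo -n "$node: "
  kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
  echo " GPU(s)"
done
```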
### Custom vLLM Parameters for Multi-Node
You can customize vLLM runtime parameters for distributed inference using a ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: distributed-inference-config
data:
  inference_config.yaml: |
    vllm:
      gpu-memory-utilization: 0.95
      max-model-len: 131072
---
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-large-model
resource:
  count: 2
  instanceType: "Standard_NC80adis_H100_v5"
  labelSelector:
    matchLabels:
      apps: large-model
inference:
  preset:
    name: "llama-3.3-70b-instruct"
  config: "distributed-inference-config"
```
Key parameters for multi-node inference:
- `tensor-parallel-size`: Automatically set by KAITO based on the number of GPUs per node
- `pipeline-parallel-size`: Automatically set by KAITO based on the number of nodes
- `gpu-memory-utilization`: Fraction of GPU memory to use (0.0-1.0); user configurable
- `max-model-len`: Maximum sequence length; user configurable

The `tensor-parallel-size` and `pipeline-parallel-size` parameters are automatically managed by KAITO based on your cluster configuration and do not need to be specified in the ConfigMap.
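For intuition only, here is a sketch of the vLLM flags these settings map to, assuming 2 GPUs per node and `workspace.resource.count: 2`. KAITO constructs the actual command, so this is not something you run yourself:

```bash
# Illustrative only: approximate vLLM invocation implied by the configuration above,
# assuming 2 GPUs per node and workspace.resource.count = 2.
#   --tensor-parallel-size   : set by KAITO to the GPUs per node
#   --pipeline-parallel-size : set by KAITO to the number of nodes
#   --gpu-memory-utilization : taken from the ConfigMap
#   --max-model-len          : taken from the ConfigMap
vllm serve <model-path> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 131072
```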
## Architecture

### Single-Node Multi-GPU
When using multiple GPUs on a single node, KAITO uses tensor parallelism to split the model across GPUs within that node:
```
┌─────────────────────────────────────┐
│               Node 1                │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐    │
│  │GPU 1│ │GPU 2│ │GPU 3│ │GPU 4│    │
│  └─────┘ └─────┘ └─────┘ └─────┘    │
│             Model Split             │
└─────────────────────────────────────┘
```
### Multi-Node Multi-GPU
For multi-node deployments, KAITO combines pipeline parallelism between nodes and tensor parallelism within each node:
```
┌─────────────────────┐     ┌─────────────────────┐
│       Node 1        │     │       Node 2        │
│  ┌─────┐  ┌─────┐   │     │  ┌─────┐  ┌─────┐   │
│  │GPU 1│  │GPU 2│   │◄───►│  │GPU 3│  │GPU 4│   │
│  └─────┘  └─────┘   │     │  └─────┘  └─────┘   │
│    Layers 1-N/2     │     │   Layers N/2+1-N    │
└─────────────────────┘     └─────────────────────┘
```
## Resource Validation
KAITO automatically validates that your configuration provides sufficient resources:
- GPU Count: `(GPUs per instance) × (workspace.resource.count) ≥ (Required GPUs for model)`
- Memory: `(GPU memory) × (Total GPUs) ≥ (Required model memory)`
If validation fails, you'll receive an error message when creating or updating the workspace.
KAITO may use fewer nodes than specified in `workspace.resource.count` if the model can fit efficiently on fewer nodes. This optimizes GPU utilization and reduces network overhead, but be mindful of the costs when provisioning many nodes.
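As a worked example of this check (all figures below are illustrative assumptions, not published model requirements):

```bash
# Illustrative arithmetic only; actual figures depend on the instance type
# and the preset's published requirements.
GPUS_PER_INSTANCE=2      # assumed GPUs per node for the chosen instance type
NODE_COUNT=2             # workspace.resource.count
GPU_MEMORY_GB=94         # assumed memory per GPU, in GB
REQUIRED_GPUS=4          # assumed requirement for the model
REQUIRED_MEMORY_GB=300   # assumed requirement for the model

TOTAL_GPUS=$((GPUS_PER_INSTANCE * NODE_COUNT))
TOTAL_MEMORY_GB=$((GPU_MEMORY_GB * TOTAL_GPUS))

# Both conditions must hold for the workspace to pass validation
[ "$TOTAL_GPUS" -ge "$REQUIRED_GPUS" ] && echo "GPU count OK: $TOTAL_GPUS >= $REQUIRED_GPUS"
[ "$TOTAL_MEMORY_GB" -ge "$REQUIRED_MEMORY_GB" ] && echo "Memory OK: ${TOTAL_MEMORY_GB}GB >= ${REQUIRED_MEMORY_GB}GB"
```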
## Service Architecture
Multi-node inference uses Kubernetes StatefulSets to ensure stable pod identity and coordination:
- Leader Pod (index 0): Coordinates the distributed inference and serves the API
- Worker Pods (index 1+): Join the Ray cluster and participate in model serving
The service endpoint points to the leader pod, which handles all incoming requests and coordinates with worker pods.
KAITO will support LeaderWorkerSet in the future to provide better management of leader-worker topologies and improved fault tolerance for multi-node deployments.
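To see this topology on a running workspace, you can inspect the StatefulSet, pods, and Service directly; the names below follow the example workspace and may differ in your cluster:

```bash
# The StatefulSet and Service are typically named after the workspace in this example
# (illustrative; check the actual names in your cluster).
kubectl get statefulset workspace-large-model
kubectl get pods -o wide | grep workspace-large-model   # index 0 is the leader
kubectl get service workspace-large-model               # routes traffic to the leader
```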
## Health Monitoring
Multi-node deployments use specialized health checks:
- Liveness Probes: Monitor Ray cluster health and detect dead actors
- Readiness Probes: Check service availability via the leader pod's `/health` endpoint
If worker pods fail, the leader will restart to reinitialize the entire cluster, ensuring all pods are synchronized.
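To inspect the probe configuration and recent probe results for the leader pod (the pod name below is illustrative), standard kubectl inspection works:

```bash
# Show the configured liveness/readiness probes and any recent probe-related events
# for the leader pod (pod name is illustrative)
kubectl describe pod workspace-large-model-0 | grep -iE 'liveness|readiness'
kubectl get events --field-selector involvedObject.name=workspace-large-model-0
```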
## Best Practices
- Resource Planning: Carefully plan your GPU and memory requirements before deployment
- Network Bandwidth: Ensure sufficient network bandwidth between nodes for optimal performance
- Monitoring: Monitor both individual node health and overall cluster performance
- Cost Management: Be aware that multi-node deployments can be expensive; only use when necessary
## Troubleshooting

### Service Unavailable After Deployment
If the service becomes unavailable:
- Check if all pods are running: `kubectl get pods -l app=<your-app-label>`
- Verify Ray cluster health in leader pod logs (see the sketch below)
- Ensure network connectivity between nodes
- Check resource allocation and GPU availability
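A quick sketch for the Ray health check above, assuming the leader pod is index 0 of the StatefulSet and that the `ray` CLI is available inside the inference container (pod name is illustrative):

```bash
# Tail the leader pod's logs for Ray-related errors (pod name is illustrative)
kubectl logs workspace-large-model-0 | grep -i ray | tail -n 50

# If the ray CLI is available in the container, summarize cluster membership
kubectl exec workspace-large-model-0 -- ray status
```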
### Worker Pod Failures
Worker pod failures trigger a restart of the leader pod to reinitialize the cluster. When this happens:
- Monitor pod restart events
- Check for resource constraints (memory, GPU)
- Verify node-to-node network connectivity
- Review pod logs for Ray cluster connection issues
### Performance Issues
If you experience poor performance:
- Monitor network latency between nodes
- Check GPU utilization across all nodes (see the example below)
- Review memory usage and potential bottlenecks
- Consider adjusting parallelism parameters
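A simple way to spot-check GPU utilization, assuming `nvidia-smi` is available in the inference containers (pod names are illustrative):

```bash
# Spot-check GPU utilization in the leader and worker pods (names are illustrative)
for pod in workspace-large-model-0 workspace-large-model-1; do
  echo "=== $pod ==="
  kubectl exec "$pod" -- nvidia-smi \
    --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
done
```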
## API Usage
Multi-node inference services expose the same API as single-node deployments. The vLLM runtime provides OpenAI-compatible endpoints:
```bash
# Get the cluster IP of your service
kubectl get services

# Check service health
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- \
  curl http://<CLUSTER-IP>:80/health

# Generate text using the chat completions API
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- \
  curl -X POST http://<CLUSTER-IP>:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "max_tokens": 100
  }'
```
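For quick local testing, port-forwarding the service is an alternative to running a curl pod; the service name below follows the example workspace and is an assumption:

```bash
# Forward the service to your machine (service name is illustrative), then call it locally
kubectl port-forward svc/workspace-large-model 8080:80 &
curl http://localhost:8080/health
```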
For detailed API specifications, see the inference documentation.
## Limitations
- Custom Models: Multi-node inference is currently only supported for preset models
- Fault Tolerance: The system requires leader restart when worker pods fail
- Network Dependency: Performance heavily depends on inter-node network quality
- Complexity: Debugging and monitoring are more complex than single-node deployments
## Related Documentation
- Inference - General inference documentation
- Presets - Supported models and their requirements
- Custom Model - Using custom models (single-node only)