# Multi-Node Inference
This document explains how to configure and use multi-node distributed inference in KAITO for large models that require more GPU resources than a single node can provide.
## Overview
Multi-node inference allows you to deploy large AI models across multiple nodes (servers) when the model is too large to fit on a single node. KAITO supports different parallelism strategies depending on your model's requirements:
| Strategy | Use Case | Supported |
|---|---|---|
| Single GPU | Small models that fit on one GPU | ✅ |
| Single-Node Multi-GPU | Models that need multiple GPUs but fit on one node | ✅ |
| Multi-Node Multi-GPU | Very large models (400B+ parameters) requiring multiple nodes | ✅ |
## When to Use Multi-Node Inference
Consider multi-node inference when:
- Your model has 400B+ parameters and cannot fit on a single node
- You need to serve very large language models whose memory requirements exceed single-node capacity
- You have specific performance requirements that benefit from distributed processing
Multi-node inference introduces additional complexity and network overhead. Only use it when your model truly requires more resources than a single node can provide.
## Supported Models
The following preset models support multi-node distributed inference:
- Llama3: `llama-3.3-70b-instruct`
Check the presets documentation for the complete list and their specific requirements.
## Configuration

### Basic Multi-Node Setup
To deploy a model across multiple nodes, specify the node count in your Workspace configuration:
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-large-model
resource:
  count: 2  # Number of nodes to use
  instanceType: "Standard_NC80adis_H100_v5"
  labelSelector:
    matchLabels:
      apps: large-model
inference:
  preset:
    name: "llama-3.3-70b-instruct"
```
### Pre-Provisioned Nodes
If you're using pre-provisioned GPU nodes, specify them explicitly:
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-large-model
resource:
  count: 2
  preferredNodes:
    - gpu-node-1
    - gpu-node-2
  labelSelector:
    matchLabels:
      apps: large-model
inference:
  preset:
    name: "llama-3.3-70b-instruct"
```
Pre-provisioned nodes must have the same matching labels as specified in the resource spec, and each node must report available GPU resources.
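As an illustration (node names and the `apps: large-model` label follow the example above), the nodes can be labeled and their GPU capacity verified with standard kubectl commands:

```bash
# Label the pre-provisioned GPU nodes so they match the workspace labelSelector
kubectl label node gpu-node-1 apps=large-model
kubectl label node gpu-node-2 apps=large-model

# Confirm that each node reports allocatable NVIDIA GPUs
for node in gpu-node-1 gpu-node-2; do
  echo -n "$node: "
  kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
  echo " GPU(s)"
done
```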
### Custom vLLM Parameters for Multi-Node
You can customize vLLM runtime parameters for distributed inference using a ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: distributed-inference-config
data:
  inference_config.yaml: |
    vllm:
      gpu-memory-utilization: 0.95
      max-model-len: 131072
---
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-large-model
resource:
  count: 2
  instanceType: "Standard_NC80adis_H100_v5"
  labelSelector:
    matchLabels:
      apps: large-model
inference:
  preset:
    name: "llama-3.3-70b-instruct"
  config: "distributed-inference-config"
```
Key parameters for multi-node inference:
- `tensor-parallel-size`: Automatically set by KAITO based on the number of GPUs per node
- `pipeline-parallel-size`: Automatically set by KAITO based on the number of nodes
- `gpu-memory-utilization`: Fraction of GPU memory to use (0.0-1.0); user configurable
- `max-model-len`: Maximum sequence length; user configurable

The `tensor-parallel-size` and `pipeline-parallel-size` parameters are automatically managed by KAITO based on your cluster configuration and do not need to be specified in the ConfigMap.
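For intuition only, here is a sketch of the vLLM flags these settings map to, assuming 2 GPUs per node and `workspace.resource.count: 2`. KAITO constructs the actual command, so this is not something you run yourself:

```bash
# Illustrative only: approximate vLLM invocation implied by the configuration above,
# assuming 2 GPUs per node and workspace.resource.count = 2.
#   --tensor-parallel-size   : set by KAITO to the GPUs per node
#   --pipeline-parallel-size : set by KAITO to the number of nodes
#   --gpu-memory-utilization : taken from the ConfigMap
#   --max-model-len          : taken from the ConfigMap
vllm serve <model-path> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 131072
```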
## Architecture

### Single-Node Multi-GPU
When using multiple GPUs on a single node, KAITO uses tensor parallelism to split the model across GPUs within that node:
```
┌─────────────────────────────────────┐
│               Node 1                │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐    │
│  │GPU 1│ │GPU 2│ │GPU 3│ │GPU 4│    │
│  └─────┘ └─────┘ └─────┘ └─────┘    │
│             Model Split             │
└─────────────────────────────────────┘
```
### Multi-Node Multi-GPU
For multi-node deployments, KAITO combines pipeline parallelism between nodes and tensor parallelism within each node:
```
┌─────────────────────┐     ┌─────────────────────┐
│       Node 1        │     │       Node 2        │
│  ┌─────┐  ┌─────┐   │     │  ┌─────┐  ┌─────┐   │
│  │GPU 1│  │GPU 2│   │◄───►│  │GPU 3│  │GPU 4│   │
│  └─────┘  └─────┘   │     │  └─────┘  └─────┘   │
│    Layers 1-N/2     │     │   Layers N/2+1-N    │
└─────────────────────┘     └─────────────────────┘
```
## Resource Validation
KAITO automatically validates that your configuration provides sufficient resources:
- GPU Count: `(GPUs per instance) × (workspace.resource.count) ≥ (Required GPUs for model)`
- Memory: `(GPU memory) × (Total GPUs) ≥ (Required model memory)`
If validation fails, you'll receive an error message when creating or updating the workspace.
KAITO may use fewer nodes than specified in `workspace.resource.count` if the model can fit efficiently on fewer nodes. This optimizes GPU utilization and reduces network overhead, but be mindful of the costs when provisioning many nodes.
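As a worked example of this check (all figures below are illustrative assumptions, not published model requirements):

```bash
# Illustrative arithmetic only; actual figures depend on the instance type
# and the preset's published requirements.
GPUS_PER_INSTANCE=2      # assumed GPUs per node for the chosen instance type
NODE_COUNT=2             # workspace.resource.count
GPU_MEMORY_GB=94         # assumed memory per GPU, in GB
REQUIRED_GPUS=4          # assumed requirement for the model
REQUIRED_MEMORY_GB=300   # assumed requirement for the model

TOTAL_GPUS=$((GPUS_PER_INSTANCE * NODE_COUNT))
TOTAL_MEMORY_GB=$((GPU_MEMORY_GB * TOTAL_GPUS))

# Both conditions must hold for the workspace to pass validation
[ "$TOTAL_GPUS" -ge "$REQUIRED_GPUS" ] && echo "GPU count OK: $TOTAL_GPUS >= $REQUIRED_GPUS"
[ "$TOTAL_MEMORY_GB" -ge "$REQUIRED_MEMORY_GB" ] && echo "Memory OK: ${TOTAL_MEMORY_GB}GB >= ${REQUIRED_MEMORY_GB}GB"
```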
## Service Architecture
Multi-node inference uses Kubernetes StatefulSets to ensure stable pod identity and coordination:
- Leader Pod (index 0): Coordinates the distributed inference and serves the API
- Worker Pods (index 1+): Join the Ray cluster and participate in model serving
The service endpoint points to the leader pod, which handles all incoming requests and coordinates with worker pods.
KAITO will support LeaderWorkerSet in the future to provide better management of leader-worker topologies and improved fault tolerance for multi-node deployments.
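To see this topology on a running workspace, you can inspect the StatefulSet, pods, and Service directly; the names below follow the example workspace and may differ in your cluster:

```bash
# The StatefulSet and Service are typically named after the workspace in this example
# (illustrative; check the actual names in your cluster).
kubectl get statefulset workspace-large-model
kubectl get pods -o wide | grep workspace-large-model   # index 0 is the leader
kubectl get service workspace-large-model               # routes traffic to the leader
```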
## Health Monitoring
Multi-node deployments use specialized health checks:
- Liveness Probes: Monitor Ray cluster health and detect dead actors
- Readiness Probes: Check service availability via the leader pod's `/health` endpoint
If worker pods fail, the leader will restart to reinitialize the entire cluster, ensuring all pods are synchronized.
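To inspect the probe configuration and recent probe results for the leader pod (the pod name below is illustrative), standard kubectl inspection works:

```bash
# Show the configured liveness/readiness probes and any recent probe-related events
# for the leader pod (pod name is illustrative)
kubectl describe pod workspace-large-model-0 | grep -iE 'liveness|readiness'
kubectl get events --field-selector involvedObject.name=workspace-large-model-0
```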
## Best Practices
- Resource Planning: Carefully plan your GPU and memory requirements before deployment
- Network Bandwidth: Ensure sufficient network bandwidth between nodes for optimal performance
- Monitoring: Monitor both individual node health and overall cluster performance
- Cost Management: Be aware that multi-node deployments can be expensive; only use when necessary
## Troubleshooting

### Service Unavailable After Deployment
If the service becomes unavailable:
- Check if all pods are running: `kubectl get pods -l app=<your-app-label>`
- Verify Ray cluster health in leader pod logs (see the sketch below)
- Ensure network connectivity between nodes
- Check resource allocation and GPU availability
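A quick sketch for the Ray health check above, assuming the leader pod is index 0 of the StatefulSet and that the `ray` CLI is available inside the inference container (pod name is illustrative):

```bash
# Tail the leader pod's logs for Ray-related errors (pod name is illustrative)
kubectl logs workspace-large-model-0 | grep -i ray | tail -n 50

# If the ray CLI is available in the container, summarize cluster membership
kubectl exec workspace-large-model-0 -- ray status
```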
### Worker Pod Failures
Worker pod failures trigger a restart of the leader pod to reinitialize the cluster. When this happens:
- Monitor pod restart events
- Check for resource constraints (memory, GPU)
- Verify node-to-node network connectivity
- Review pod logs for Ray cluster connection issues
### Performance Issues
If you experience poor performance:
- Monitor network latency between nodes
- Check GPU utilization across all nodes (see the example below)
- Review memory usage and potential bottlenecks
- Consider adjusting parallelism parameters
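A simple way to spot-check GPU utilization, assuming `nvidia-smi` is available in the inference containers (pod names are illustrative):

```bash
# Spot-check GPU utilization in the leader and worker pods (names are illustrative)
for pod in workspace-large-model-0 workspace-large-model-1; do
  echo "=== $pod ==="
  kubectl exec "$pod" -- nvidia-smi \
    --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
done
```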
## API Usage
Multi-node inference services expose the same API as single-node deployments. The vLLM runtime provides OpenAI-compatible endpoints:
```bash
# Get the cluster IP of your service
kubectl get services

# Check service health
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- \
  curl http://<CLUSTER-IP>:80/health

# Generate text using the chat completions API
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- \
  curl -X POST http://<CLUSTER-IP>:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "max_tokens": 100
  }'
```
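For quick local testing, port-forwarding the service is an alternative to running a curl pod; the service name below follows the example workspace and is an assumption:

```bash
# Forward the service to your machine (service name is illustrative), then call it locally
kubectl port-forward svc/workspace-large-model 8080:80 &
curl http://localhost:8080/health
```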
For detailed API specifications, see the inference documentation.
## Limitations
- Custom Models: Multi-node inference is currently only supported for preset models
- Fault Tolerance: The system requires leader restart when worker pods fail
- Network Dependency: Performance heavily depends on inter-node network quality
- Complexity: Debugging and monitoring are more complex than single-node deployments
## Related Documentation
- Inference - General inference documentation
- Presets - Supported models and their requirements
- Custom Model - Using custom models (single-node only)