Multi-Node Distributed Inference

The Headlamp-KAITO plugin provides powerful multi-node distributed inference capabilities, enabling you to deploy large AI models across multiple GPU nodes for improved performance, scalability, and resource utilization.

Overview

Multi-node distributed inference allows you to:

  • Scale horizontally by distributing model workloads across multiple GPU nodes
  • Handle larger models that don't fit on a single node
  • Increase availability with redundancy across nodes

Deployment Strategies

1. Explicit Node Selection

Select specific nodes for precise control over your deployment:

resource:
  count: 3
  preferredNodes:
    - gpu-node-1
    - gpu-node-2
    - gpu-node-3
  instanceType: Standard_NC80adis_H100_v5
  labelSelector:
    matchLabels:
      node.kubernetes.io/instance-type: Standard_NC80adis_H100_v5
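For reference, this resource block is only one part of a KAITO Workspace custom resource. A minimal sketch of a complete manifest follows; the API version, workspace name, and preset name are illustrative assumptions, so check the schema and preset catalog of your installed KAITO version:

apiVersion: kaito.sh/v1beta1          # assumption: adjust to your installed KAITO CRD version
kind: Workspace
metadata:
  name: workspace-llama-3-8b          # illustrative name
resource:
  count: 3
  preferredNodes:
    - gpu-node-1
    - gpu-node-2
    - gpu-node-3
  instanceType: Standard_NC80adis_H100_v5
  labelSelector:
    matchLabels:
      node.kubernetes.io/instance-type: Standard_NC80adis_H100_v5
inference:
  preset:
    name: llama-3-8b-instruct         # illustrative preset name; use one available in your KAITO catalog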

Use cases:

  • High-performance workloads requiring specific hardware
  • Testing on known node configurations

2. Count-Based Auto-Provisioning

Specify the number of nodes while letting KAITO handle the selection:

resource:
  count: 4
  instanceType: Standard_NC80adis_H100_v5
  labelSelector:
    matchLabels:
      apps: llama-3-8b
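If you intend to reuse GPU nodes that already exist in the cluster, the usual assumption is that they must carry the label referenced in matchLabels. Assuming an existing node named gpu-node-1 (a placeholder), the label can be added with kubectl:

kubectl label node gpu-node-1 apps=llama-3-8b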

Use cases:

  • Predictable scaling requirements
  • Cost optimization with specific node counts
  • Load balancing across available resources

3. Full Auto-Provisioning

Let KAITO determine the optimal configuration:

resource:
  instanceType: Standard_NC80adis_H100_v5
  labelSelector:
    matchLabels:
      apps: llama-3-8b

Use cases:

  • Development and testing environments
  • Dynamic workloads with varying requirements
  • Simplified deployment workflows

Node Selection Interface

The plugin provides an intuitive interface for managing multi-node deployments:

Node Selection Features

  • GPU Node Filtering: Automatically shows only GPU-enabled nodes
  • Instance Type Detection: Dynamically extracts instance types from selected nodes
  • Node Status Indicators: Real-time health and availability status
  • Taint Awareness: Displays node scheduling restrictions
  • Label-Based Filtering: Advanced filtering using Kubernetes labels

Quick Selectors

Pre-configured options for common scenarios (an example selector is sketched after the list below):

  • Standard NC24 Instances
  • Standard NC96 Instances
  • GPU + AMD64 architecture
  • Custom label combinations
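As a sketch of what such a quick selector expands to, the GPU + AMD64 option combined with an NC-series instance type can be expressed with the well-known Kubernetes node labels (the instance type value is just an example):

labelSelector:
  matchLabels:
    kubernetes.io/arch: amd64
    node.kubernetes.io/instance-type: Standard_NC24ads_A100_v4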

Benefits of Multi-Node Inference

Performance Improvements

  • Parallel Processing: Distribute inference requests across multiple nodes
  • Reduced Latency: Less queuing delay when many requests arrive concurrently
  • Higher Throughput: Aggregate processing power of multiple GPUs

Scalability Advantages

  • Horizontal Scaling: Add more nodes to handle increased load
  • Model Sharding: Split large models across multiple nodes
  • Dynamic Scaling: Adjust node count based on demand

Cost Optimization

  • Instance Type Flexibility: Use smaller, more cost-effective instances
  • Spot Instance Support: Leverage spot pricing across multiple nodes
  • Resource Efficiency: Better GPU utilization across the cluster

High Availability

  • Fault Tolerance: Continue operation if individual nodes fail
  • Load Distribution: Prevent single points of failure
  • Graceful Degradation: Maintain service with reduced capacity

Integration with KAITO

The multi-node features integrate seamlessly with KAITO's capabilities:

  • Automatic Model Sharding: KAITO handles model distribution
  • Service Discovery: A unified endpoint for the distributed deployment (see the example after this list)
  • Health Monitoring: Built-in monitoring and alerting
  • Dynamic Scaling: Runtime adjustment of node configurations
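Assuming KAITO follows its convention of exposing the inference endpoint through a Kubernetes Service named after the workspace (the service name and port below are assumptions to adjust for your deployment), you can locate and reach the endpoint like this:

kubectl get svc workspace-llama-3-8b                    # find the Service fronting the distributed inference pods
kubectl port-forward svc/workspace-llama-3-8b 8080:80   # forward a local port; adjust the service port to your preset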

Getting Started

  1. Access the Deployment Dialog: Select a model from the catalog
  2. Configure Node Selection: Choose your preferred deployment strategy
  3. Review Configuration: Verify the generated YAML configuration
  4. Deploy: Apply the configuration to your cluster
  5. Monitor: Track deployment status and performance metrics
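A minimal command-line version of steps 4 and 5, assuming the YAML generated by the dialog was saved to workspace.yaml and the workspace is named workspace-llama-3-8b (both names are placeholders), might look like:

kubectl apply -f workspace.yaml                  # deploy: apply the generated configuration to the cluster
kubectl get workspace workspace-llama-3-8b -w    # monitor: watch the workspace status until it reports ready
kubectl get pods -o wide                         # confirm inference pods landed on the selected GPU nodes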

The multi-node distributed inference feature makes it easy to deploy and manage large-scale AI workloads while maintaining the simplicity and user-friendliness that KAITO is known for.