Title

Add meta-llama/Llama-3.3-70B-Instruct to KAITO supported model list

Glossary

N/A

Summary

  • Model description: The Meta Llama 3.3 multilingual large language model (LLM) is a 70B-parameter instruction-tuned generative model (text in/text out). The Llama 3.3 instruction-tuned, text-only model is optimized for multilingual dialogue use cases and outperforms many of the available open-source and closed chat models on common industry benchmarks.
  • Model usage statistics: 700,643 downloads in the last 30 days as of 2025-05-29, according to Hugging Face.
  • Model license: The model is licensed under the LLAMA 3.3 COMMUNITY LICENSE. Therefore, users must agree to the terms of the license and supply Hugging Face credentials to download the model files.
For example, the token can be provided through a Kubernetes Secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: default
data:
  HF_TOKEN: <base64-encoded-huggingface-token>
type: Opaque
```
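
Equivalently, if kubectl is available, you can let it create the Secret and base64-encode the token for you (the token value below is a placeholder):

```bash
# Creates the same Secret as the manifest above; kubectl handles the base64 encoding.
kubectl create secret generic hf-token \
  --namespace default \
  --from-literal=HF_TOKEN=<your-huggingface-token>
```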

In the Workspace configuration, specify the Secret name so that the preset can download the model with the Hugging Face token:

```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-llama-3-3-70b-instruct
resource:
  instanceType: Standard_ND96isr_H100_v5
  labelSelector:
    matchLabels:
      apps: llama-3-3-70b-instruct
inference:
  preset:
    name: llama-3.3-70b-instruct
    presetOptions:
      modelAccessSecret: hf-token
```
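
Once the workspace is ready, the vLLM runtime exposes an OpenAI-compatible HTTP API. The following is a minimal smoke test, a sketch that assumes KAITO's usual behavior of creating a ClusterIP service named after the workspace on port 80 (verify the service name and port in your cluster; the served model name can be confirmed via GET /v1/models):

```bash
# Call the OpenAI-compatible chat endpoint from a throwaway pod inside the cluster.
kubectl run -it --rm curl --image=curlimages/curl --restart=Never -- \
  curl -s http://workspace-llama-3-3-70b-instruct/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama-3.3-70b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```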

Requirements

The following table describes the basic model characteristics and the resource requirements for running it.

| Field | Notes |
|-------|-------|
| Family name | Llama 3 |
| Type | conversational |
| Download site | https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/tree/main |
| Version | 6f6073b423013f6a7d4d9f39144961bfbfbc386b |
| Storage size | 140GB |
| GPU count | 4 |
| Total GPU memory | 320GB (~16k context length; more GPUs are required for longer context lengths) |
| Per GPU memory | 80GB |
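
These figures are consistent with a quick back-of-envelope check (assuming fp16/bf16 weights at 2 bytes per parameter): 70B × 2 bytes ≈ 140GB of weights, matching the storage size, or ~35GB per GPU when the model is sharded across 4 GPUs. The remaining 320GB − 140GB ≈ 180GB of GPU memory is available for the KV cache, activations, and runtime overhead at the default ~16k context length; longer contexts grow the KV cache, which is why more GPUs are required.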

Runtimes

This section describes how to configure the runtime framework to support inference calls.

| Options | Notes |
|---------|-------|
| Runtime | vLLM |
| Distributed Inference | True |
| Custom configurations | `--max_model_len=16384` by default; users with more GPUs can override it in the Workspace configuration to use a longer context length (see the sketch after this table). |
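
As a sketch of such an override (assuming the inference ConfigMap mechanism in recent KAITO releases; the exact file name and keys may differ by version), a ConfigMap with custom vLLM parameters can be referenced from the Workspace's `inference.config` field:

```yaml
# Hypothetical ConfigMap overriding the default vLLM context length.
# The inference_config.yaml layout follows KAITO's convention; verify it
# against the KAITO version you run.
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-inference-params
  namespace: default
data:
  inference_config.yaml: |
    vllm:
      max-model-len: 32768  # longer contexts need more GPU memory than the 4-GPU default
```

Set `inference.config: llama-inference-params` in the Workspace spec to apply it.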

History

  • 2025-05-29: Open proposal PR.
  • 2025-06-05: Start model integration.
  • 2025-06-05: Complete model support.