Version: v0.8.x

Retrieval-Augmented Generation (RAG)

This document describes how to use the KAITO RAGEngine Custom Resource Definition (CRD) for a retrieval-augmented generation (RAG) workflow. By creating a RAGEngine resource, you can quickly stand up a service that indexes documents and queries them in conjunction with an existing LLM inference endpoint, without building custom pipelines. This lets your large language model answer questions based on your own private content.

Installation

Make sure you have cloned this repo and followed the KAITO workspace installation if you plan to use a local embedding model. RAGEngine needs the gpu-provisioner component to provision GPU nodes.

helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
helm repo update
helm upgrade --install kaito-ragengine kaito/ragengine \
--namespace kaito-ragengine \
--create-namespace

Verify installation

You can run the following commands to verify that the controllers were installed successfully.

Check the status of the Helm chart installation.

helm list -n kaito-ragengine

Check the status of the ragengine deployment.

kubectl describe deploy ragengine -n kaito-ragengine
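You can also list the pods in the namespace to confirm the controller is running:

kubectl get pods -n kaito-ragengine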

Clean up

helm uninstall kaito-ragengine -n kaito-ragengine

Usage

Prerequisite

Before creating a RAGEngine, ensure you have an accessible model inference endpoint; a quick connectivity check is sketched after the list below. This endpoint can be:

  1. A model deployed through KAITO Workspace CRD (e.g., a local Hugging Face model, a vLLM instance, etc.).
  2. An external API (e.g., the Hugging Face Inference API or another REST-based LLM provider).
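
Before wiring the endpoint into a RAGEngine, you can sanity-check that it accepts OpenAI-style completion requests. The sketch below assumes an OpenAI-compatible endpoint; <inference-url> and <model-name> are placeholders for your environment, and the request body may need adjusting for your provider.

# Hedged connectivity check against the inference endpoint.
curl -X POST "<inference-url>/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 16}'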

Define the RAGEngine

Create a YAML manifest defining your RAGEngine. Key fields under spec include:

Embedding: specifies how vector embeddings are generated for your documents. You may choose either remote or local (leave the other unset):

embedding:
  local:
    modelID: "BAAI/bge-small-en-v1.5"

InferenceService: points to the LLM endpoint that RAGEngine will call for final text generation.

inferenceService:
  url: "<inference-url>/v1/completions"

You also need to specify the GPU SKU used for inference in the compute spec. For example:

apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-start
spec:
  compute:
    instanceType: "Standard_NC4as_T4_v3"
    labelSelector:
      matchLabels:
        apps: ragengine-example
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"
  inferenceService:
    url: "<inference-url>/v1/completions"
    contextWindowSize: 512 # Modify to fit the model's context window.

Persistent Storage (Optional)

RAGEngine supports persistent storage for vector indexes using Kubernetes PersistentVolumeClaims (PVCs). When configured, indexed documents are automatically saved to persistent storage and restored on pod restarts. You can also manually persist and load indexes using the RAG service API endpoints (/persist/{index_name} and /load/{index_name}).
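
For example, after port-forwarding to the RAG service you can trigger a manual snapshot and reload it later. This is only a sketch: the service name, port, and index name ("my-docs") are assumptions for illustration; run kubectl get svc to find the Service created for your RAGEngine.

# Forward a local port to the RAG service (service name and port are assumptions).
kubectl port-forward svc/<ragengine-name> 8080:80 &

# Manually persist the hypothetical index "my-docs" to the mounted volume.
curl -X POST "http://localhost:8080/persist/my-docs"

# Load the same index back, for example after a restart.
curl -X POST "http://localhost:8080/load/my-docs"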

Example with Azure Disk PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ragengine-vector-db
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi-premium
  resources:
    requests:
      storage: 50Gi
---
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-with-pvc
spec:
  compute:
    instanceType: "Standard_NC4as_T4_v3"
    labelSelector:
      matchLabels:
        apps: ragengine-example
  storage:
    persistentVolumeClaim: pvc-ragengine-vector-db
    mountPath: /mnt/vector-db
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"
  inferenceService:
    url: "<inference-url>/v1/completions"
    contextWindowSize: 512

Key points:

  • Indexes are automatically persisted when the pod terminates (via PreStop lifecycle hook)
  • Indexes are automatically restored when the pod starts (via PostStart lifecycle hook)
  • Snapshots are stored with timestamps and the 5 most recent snapshots are retained
  • Storage class should support ReadWriteOnce access mode
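
After applying the manifests above, you can confirm that the claim was provisioned and bound before indexing documents:

kubectl get pvc pvc-ragengine-vector-db

The STATUS column should show Bound once the volume has been provisioned.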

Apply the manifest

After you create your YAML configuration, run:

kubectl apply -f examples/RAG/kaito_ragengine_phi_3.yaml
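
Once the RAGEngine reports ready, you can port-forward to its service and try the index/query workflow. The snippet below is a hedged sketch: the service name, port, index name, and request fields are assumptions for illustration and may differ from the actual RAG API, so consult the API reference for the exact schema.

# Confirm the RAGEngine resource was created and check its status.
kubectl get ragengine

# Forward a local port to the RAG service (service name and port are assumptions;
# run kubectl get svc to find the Service created for your RAGEngine).
kubectl port-forward svc/<ragengine-name> 8080:80 &

# Index a document into a hypothetical index named "my-docs"
# (request fields are assumptions; check the RAG API reference for the exact schema).
curl -X POST "http://localhost:8080/index" \
  -H "Content-Type: application/json" \
  -d '{"index_name": "my-docs", "documents": [{"text": "KAITO simplifies AI model deployment on Kubernetes."}]}'

# Query the index; RAGEngine retrieves relevant chunks and calls the inference endpoint.
curl -X POST "http://localhost:8080/query" \
  -H "Content-Type: application/json" \
  -d '{"index_name": "my-docs", "query": "What does KAITO do?"}'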