# Retrieval-Augmented Generation (RAG)
This document describes how to use the KAITO RAGEngine Custom Resource Definition (CRD) for retrieval-augmented generation (RAG) workflows. By creating a RAGEngine resource, you can quickly stand up a service that indexes documents and queries them in conjunction with an existing LLM inference endpoint, with no need to build custom pipelines. This enables your large language model to answer questions based on your own private content.
## Installation
Make sure you have cloned this repository and completed the KAITO workspace installation. Then install the RAGEngine Helm chart:

```bash
helm install ragengine ./charts/kaito/ragengine --namespace kaito-ragengine --create-namespace
```
### Verify installation
You can run the following commands to verify that the controllers were installed successfully.

Check the status of the Helm chart installation:

```bash
helm list -n kaito-ragengine
```

Check the status of the RAGEngine deployment:

```bash
kubectl describe deploy ragengine -n kaito-ragengine
```
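You can also confirm that the controller pods are running (a generic check; the exact pod names depend on the chart):

```bash
kubectl get pods -n kaito-ragengine
```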
### Clean up
Since the chart was installed above as release `ragengine` in the `kaito-ragengine` namespace, uninstall it with:

```bash
helm uninstall ragengine -n kaito-ragengine
```
## Usage
### Prerequisite
Before creating a RAGEngine, ensure you have an accessible model inference endpoint. This endpoint can be:
- A model deployed through the KAITO Workspace CRD (e.g., a local Hugging Face model, a vLLM instance, etc.); see the sketch after this list.
- An external API (e.g., a Hugging Face inference service or another REST-based LLM provider).
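For the first option, a minimal Workspace manifest might look like the sketch below. The API version, preset name, instance type, and labels are illustrative assumptions; consult the KAITO workspace documentation for the presets available in your installation:

```yaml
# Hypothetical sketch: a KAITO Workspace serving an inference endpoint.
# Preset name, instance type, and labels are assumptions; substitute your own.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-3-mini
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3
inference:
  preset:
    name: phi-3-mini-4k-instruct
```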
### Define the RAGEngine
Create a YAML manifest defining your RAGEngine. Key fields under `spec` include:

`embedding`: how to generate vector embeddings for your documents. You may choose `remote` or `local`; set exactly one and leave the other unset:

```yaml
embedding:
  local:
    modelID: "BAAI/bge-small-en-v1.5"
```
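For illustration, a remote variant might look like the following sketch. The `url` and `accessSecret` field names are taken from the v1alpha1 API and should be verified against the CRD installed in your cluster; the values are placeholders:

```yaml
# Hypothetical remote embedding configuration; values are placeholders.
embedding:
  remote:
    url: "<embedding-service-url>"
    accessSecret: "embedding-api-key-secret"  # Kubernetes Secret holding the API key
```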
`inferenceService`: points to the LLM endpoint that RAGEngine will call for final text generation:

```yaml
inferenceService:
  url: "<inference-url>/v1/completions"
```
You also need to specify the GPU SKU used for inference in the `compute` spec. For example:
```yaml
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-start
spec:
  compute:
    instanceType: "Standard_NC6s_v3"
    labelSelector:
      matchLabels:
        apps: ragengine-example
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"
  inferenceService:
    url: "<inference-url>/v1/completions"
```
### Splitting Documents with CodeSplitter
By default, RAGEngine splits documents into sentences. However, you can instruct the engine to split documents using the `CodeSplitter` (for code-aware chunking) by providing metadata in your API request.

To use the `CodeSplitter`, set `split_type` to `"code"` and specify the programming language in the `language` field of the document metadata. For example, when calling the RAGEngine API to index documents:
```json
{
  "documents": [
    {
      "text": "def foo():\n    return 42\n\n# Another function\ndef bar():\n    pass",
      "metadata": {
        "split_type": "code",
        "language": "python"
      }
    }
  ]
}
```
This instructs the RAGEngine to use code-aware splitting for the provided document. If `split_type` is not set, or is set to any other value, sentence splitting is used by default.
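As a sketch of how you might send this payload, the commands below assume the RAGEngine service is named `ragengine-start`, serves on port 80, and exposes an `/index` route; verify the service name, port, and route for your deployment:

```bash
# Save the JSON payload above as documents.json, then forward a local port
# to the RAGEngine service (service name and port are assumptions).
kubectl port-forward svc/ragengine-start 8000:80 &

# POST the payload to the (assumed) /index endpoint.
curl -X POST http://localhost:8000/index \
  -H "Content-Type: application/json" \
  -d @documents.json
```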
### Apply the Manifest
After you create your YAML configuration, run:

```bash
kubectl apply -f examples/RAG/kaito_ragengine_phi_3.yaml
```
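You can then watch the resource come up. `kubectl get` and `kubectl describe` work for any installed CRD; the resource name below assumes the `ragengine-start` example shown earlier:

```bash
kubectl get ragengine ragengine-start
kubectl describe ragengine ragengine-start
```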