Retrieval-Augmented Generation (RAG)
This document describes how to use the KAITO ragengine Custom Resource Definition (CRD) for retrieval-augmented generation (RAG) workflows. By creating a RAGEngine resource, you can quickly stand up a service that indexes documents and queries them in conjunction with an existing LLM inference endpoint, with no need to build custom pipelines. This enables your large language model to answer questions based on your own private content.
Installation
Be sure you have cloned this repo and followed the KAITO workspace installation if you plan to use a local embedding model. RAGEngine needs the gpu-provisioner component to provision GPU nodes.
```bash
helm install ragengine ./charts/kaito/ragengine --namespace kaito-ragengine --create-namespace
```
Verify installation
You can run the following commands to verify that the controllers were installed successfully.
Check the status of the Helm chart installation.

```bash
helm list -n kaito-ragengine
```
Check the status of the ragengine deployment.

```bash
kubectl describe deploy ragengine -n kaito-ragengine
```
Clean up
```bash
helm uninstall ragengine -n kaito-ragengine
```
Usage
Prerequisite
Before creating a RAGEngine, ensure you have an accessible model inference endpoint. This endpoint can be:
- A model deployed through KAITO Workspace CRD (e.g., a local Hugging Face model, a vLLM instance, etc.).
- An external API (e.g., a Hugging Face service or other REST-based LLM provider).
Define the RAGEngine
Create a YAML manifest defining your RAGEngine. Key fields under `spec` include:
`embedding`: how to generate vector embeddings for your documents. You may choose `remote` or `local` (one must be left unset if you pick the other):

```yaml
embedding:
  local:
    modelID: "BAAI/bge-small-en-v1.5"
```
`inferenceService`: points to the LLM endpoint that RAGEngine will call for final text generation.

```yaml
inferenceService:
  url: "<inference-url>/v1/completions"
```
Users also need to specify the GPU SKU used for inference in the `compute` spec. For example:
```yaml
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-start
spec:
  compute:
    instanceType: "Standard_NC4as_T4_v3"
    labelSelector:
      matchLabels:
        apps: ragengine-example
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"
  inferenceService:
    url: "<inference-url>/v1/completions"
```
Apply the manifest
After you create your YAML configuration, run:
```bash
kubectl apply -f examples/RAG/kaito_ragengine_phi_3.yaml
```
API definitions and examples
A RAGEngine index is a logical collection that organizes and stores your documents for retrieval-augmented generation workflows. The relationship between indexes, documents, and document nodes is as follows:
- Index: An index is a named container that holds a set of documents. Each index is independent and can be created, updated, queried, persisted, loaded, or deleted via the API.
- Documents: Documents are the primary units of content that you add to an index. Each document contains a `text` field (the content to be indexed) and optional `metadata` (such as author, source, or custom tags). When you add documents to an index, each is assigned a unique `doc_id`.
- Document Nodes: When a document is ingested, it is automatically split into smaller chunks called nodes. The splitting strategy depends on the document type and metadata:
  - By default, documents are split into sentences.
  - If you specify code-aware splitting (using the `split_type` and `language` metadata), the document is split into code blocks or logical code units.
  - Each node represents a chunk of text that is indexed and can be retrieved as part of a query.
How it works in practice:
- When you index a document, it is divided into nodes for efficient retrieval and semantic search.
- When you query an index, the engine retrieves the most relevant nodes (not necessarily whole documents) and can use them to generate answers or summaries.
- The `source_nodes` field in query responses contains the actual nodes that matched your query, along with their scores and metadata.
This design enables fine-grained retrieval and more accurate, context-aware responses from your LLM-powered applications.
Creating an Index With Documents
To add documents to an index or create a new index, use the `/index` API route. This endpoint accepts a POST request with the index name and a list of documents to be indexed.
Request Example:
```
POST /index
{
    "index_name": "rag_index",
    "documents": [
        {
            "text": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system that provides grounding data.",
            "metadata": {
                "author": "Microsoft",
                "source": "https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=docs"
            }
        }
    ]
}
```
- `index_name`: The name of the index to create or update.
- `documents`: A list of documents, each with a `text` field and optional `metadata`.
Response Example:
```
[
    {
        "doc_id": "123456",
        "text": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system that provides grounding data.",
        "hash_value": "text_hash_value",
        "metadata": {
            "author": "Microsoft",
            "source": "https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=docs"
        },
        "is_truncated": false
    }
]
```
Each returned document includes its unique `doc_id`, the original text, metadata, and a flag indicating whether the text was truncated. The `doc_id` will be important for document update and delete calls.
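For reference, here is a minimal Python sketch of the same call using the `requests` library. The base URL is an assumption (for example, a port-forwarded RAGEngine service); adjust it to wherever your RAGEngine API is exposed. The same assumption applies to the later examples in this document.

```python
import requests

# Assumption: the RAGEngine API has been exposed locally (e.g. via kubectl
# port-forward); replace with your actual service address.
BASE_URL = "http://localhost:8000"

payload = {
    "index_name": "rag_index",
    "documents": [
        {
            "text": "Retrieval Augmented Generation (RAG) is an architecture that augments "
                    "the capabilities of a Large Language Model (LLM) by adding an "
                    "information retrieval system that provides grounding data.",
            "metadata": {"author": "Microsoft"},
        }
    ],
}

# POST /index creates the index if it does not exist and ingests the documents.
resp = requests.post(f"{BASE_URL}/index", json=payload)
resp.raise_for_status()
for doc in resp.json():
    print(doc["doc_id"], doc["is_truncated"])
```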
Splitting Documents with CodeSplitter
By default, RAGEngine splits documents into sentences. However, you can instruct the engine to split documents using the `CodeSplitter` (for code-aware chunking) by providing metadata in your API request.
To use the `CodeSplitter`, set the `split_type` to `"code"` and specify the programming language in the `language` field of the document metadata. For example, when calling the RAGEngine API to index documents:
```
{
    "documents": [
        {
            "text": "def foo():\n return 42\n\n# Another function\ndef bar():\n pass",
            "metadata": {
                "split_type": "code",
                "language": "python"
            }
        }
    ]
}
```
This instructs the RAGEngine to use code-aware splitting for the provided document. If `split_type` is not set or is set to any other value, sentence splitting is used by default.
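As a sketch, the same request can be sent from Python. As before, the base URL is an assumption, and the `index_name` field is included here to match the `/index` route described above.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint

payload = {
    "index_name": "rag_index",
    "documents": [
        {
            "text": "def foo():\n    return 42\n\n# Another function\ndef bar():\n    pass",
            # split_type/language trigger code-aware chunking instead of sentence splitting.
            "metadata": {"split_type": "code", "language": "python"},
        }
    ],
}

resp = requests.post(f"{BASE_URL}/index", json=payload)
resp.raise_for_status()
print(resp.json())
```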
List Documents
To retrieve a paginated list of documents from a specific index, use the `/indexes/{index_name}/documents` API route. This endpoint accepts a GET request with optional query parameters for pagination, text truncation, and metadata filtering.
Request Example:
```
GET /indexes/rag_index/documents?limit=5&offset=0&max_text_length=500
```
- `limit`: (optional) Maximum number of documents to return (default: 10, max: 100).
- `offset`: (optional) Starting point for the document list (default: 0).
- `max_text_length`: (optional) Maximum text length to return per document (default: 1000).
- `metadata_filter`: (optional) A JSON string representing key-value pairs to filter documents by their metadata.
Response Example:
```
{
    "documents": [
        {
            "doc_id": "123456",
            "text": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system that provides grounding data.",
            "hash_value": "text_hash_value",
            "metadata": {
                "author": "Microsoft",
                "source": "https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=docs"
            },
            "is_truncated": false
        }
    ],
    "count": 1
}
```
Each document in the response includes its unique `doc_id`, the (possibly truncated) text, metadata, and a flag indicating whether the text was truncated.
Note: If you want to filter documents by metadata, provide the `metadata_filter` parameter as a JSON string. For example:

```
GET /indexes/rag_index/documents?metadata_filter={"author":"Microsoft"}
```
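A minimal Python sketch of the listing call follows (base URL assumed as before); note that `metadata_filter` is passed as a JSON-encoded string.

```python
import json
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint

params = {
    "limit": 5,
    "offset": 0,
    "max_text_length": 500,
    # metadata_filter must be a JSON string of key-value pairs.
    "metadata_filter": json.dumps({"author": "Microsoft"}),
}

resp = requests.get(f"{BASE_URL}/indexes/rag_index/documents", params=params)
resp.raise_for_status()
body = resp.json()
print(f"{body['count']} document(s) returned")
for doc in body["documents"]:
    print(doc["doc_id"], doc["metadata"])
```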
Updating Documents
To update existing documents in a specific index, use the `/indexes/{index_name}/documents` API route. This endpoint accepts a POST request with the index name in the URL and a list of documents to update in the request body.
Request Example:
```
POST /indexes/rag_index/documents
{
    "documents": [
        {
            "doc_id": "123456",
            "text": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system that provides grounding data. Adding an information retrieval system gives you control over grounding data used by an LLM when it formulates a response.",
            "hash_value": "text_hash_value",
            "metadata": {
                "author": "Microsoft",
                "source": "https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=docs"
            }
        }
    ]
}
```
- `doc_id`: The unique identifier of the document to update.
- `text`: The new or updated text for the document.
- `metadata`: (Optional) Updated metadata for the document.
Response Example:
```
{
    "updated_documents": [
        {
            "doc_id": "123456",
            "text": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system that provides grounding data. Adding an information retrieval system gives you control over grounding data used by an LLM when it formulates a response.",
            "hash_value": "text_hash_value",
            "metadata": {
                "author": "Microsoft",
                "source": "https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=docs"
            }
        }
    ],
    "unchanged_documents": [],
    "not_found_documents": []
}
```
- `updated_documents`: Documents that were successfully updated.
- `unchanged_documents`: Documents that were provided but did not require changes.
- `not_found_documents`: Documents with IDs that were not found in the index.
Use this endpoint to keep your indexed documents up to date with the latest content or metadata.
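A short Python sketch of an update call, under the same base URL assumption; the `doc_id` and text shown are placeholders taken from the example above.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint
INDEX = "rag_index"

payload = {
    "documents": [
        {
            # doc_id comes from the original /index (or list documents) response.
            "doc_id": "123456",
            "text": "Retrieval Augmented Generation (RAG) is an architecture ... (updated text)",
            "metadata": {"author": "Microsoft"},
        }
    ]
}

resp = requests.post(f"{BASE_URL}/indexes/{INDEX}/documents", json=payload)
resp.raise_for_status()
result = resp.json()
print("updated:", [d["doc_id"] for d in result["updated_documents"]])
print("not found:", result["not_found_documents"])
```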
Delete Documents
To delete one or more documents from a specific index, use the `/indexes/{index_name}/documents/delete` API route. This endpoint accepts a POST request with the index name in the URL and a list of document IDs to delete in the request body.
Request Example:
```
POST /indexes/rag_index/documents/delete
{
    "doc_ids": ["123456"]
}
```
- `doc_ids`: A list of document IDs to delete from the specified index.
Response Example:
```
{
    "deleted_doc_ids": ["123456"],
    "not_found_doc_ids": []
}
```
- `deleted_doc_ids`: Document IDs that were successfully deleted.
- `not_found_doc_ids`: Document IDs that were not found in the index.
Use this endpoint to remove documents that are no longer needed from your index.
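A minimal Python sketch of the delete call, with the base URL assumed as in the earlier examples:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint

resp = requests.post(
    f"{BASE_URL}/indexes/rag_index/documents/delete",
    json={"doc_ids": ["123456"]},
)
resp.raise_for_status()
result = resp.json()
print("deleted:", result["deleted_doc_ids"])
print("not found:", result["not_found_doc_ids"])
```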
Persist Index
To save (persist) the data of an index to disk, use the `/persist/{index_name}` API route. This endpoint accepts a POST request with the index name in the URL and an optional `path` query parameter specifying where to save the index data.
Request Example:
```
POST /persist/rag_index?path=./custom_path
```
- `index_name`: The name of the index to persist.
- `path`: (optional) The directory path where the index will be saved. If not provided, the default directory is used.
Response Example:
```
{
    "message": "Successfully persisted index rag_index to ./custom_path/rag_index."
}
```
Use this endpoint to ensure your indexed data is safely stored on disk.
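In Python, the persist call is a simple POST with an optional query parameter (base URL assumed as before):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint

# Persist the index to disk; the path query parameter is optional.
resp = requests.post(f"{BASE_URL}/persist/rag_index", params={"path": "./custom_path"})
resp.raise_for_status()
print(resp.json()["message"])
```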
Load Index
To load an existing index from disk, use the `/load/{index_name}` API route. This endpoint accepts a POST request with the index name in the URL, an optional `path` query parameter specifying where to load the index from, and an optional `overwrite` flag.
Request Example:
```
POST /load/rag_index?path=./custom_path/rag_index
```
- `index_name`: The name of the index to load.
- `path`: (optional) The path to load the index from. If not provided, the default directory is used.
- `overwrite`: (optional, default: false) If true, will overwrite the existing index if it already exists in memory.
Response Example:
```
{
    "message": "Successfully loaded index rag_index from ./custom_path/rag_index."
}
```
Use this endpoint to restore previously persisted indexes into memory for querying and updates.
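A matching Python sketch for loading a persisted index (base URL assumed as before); both query parameters are optional.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint

resp = requests.post(
    f"{BASE_URL}/load/rag_index",
    # overwrite replaces an index with the same name that is already in memory.
    params={"path": "./custom_path/rag_index", "overwrite": "true"},
)
resp.raise_for_status()
print(resp.json()["message"])
```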
Delete Index
To delete an entire index and all of its documents, use the `/indexes/{index_name}` API route. This endpoint accepts a DELETE request with the index name in the URL. Deleting an index is irreversible and will remove all associated documents from memory.
Request Example:
```
DELETE /indexes/rag_index
```
- `index_name`: The name of the index to delete.
Response Example:
```
{
    "message": "Successfully deleted index rag_index."
}
```
Use this endpoint to permanently remove an index and all its data when it is no longer needed.
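In Python, deleting an index is a single DELETE request (base URL assumed as before):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint

# Irreversibly removes the index and all of its documents from memory.
resp = requests.delete(f"{BASE_URL}/indexes/rag_index")
resp.raise_for_status()
print(resp.json()["message"])
```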
Query Index
To query a specific index for relevant documents, use the `/query` API route. This endpoint accepts a POST request with the index name, query string, and optional parameters for result count and LLM generation.
Request Example:
```
POST /query
{
    "index_name": "rag_index",
    "query": "What is RAG?",
    "top_k": 5,
    "llm_params": {
        "temperature": 0.7,
        "max_tokens": 2048
    }
}
```
- `index_name`: The name of the index to query.
- `query`: The query string.
- `top_k`: (optional) Number of top documents to retrieve (default: 5).
- `llm_params`: (optional) Parameters for LLM-based generation (e.g., temperature, max_tokens).
Response Example:
```
{
    "response": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model...",
    "source_nodes": [
        {
            "doc_id": "123456",
            "node_id": "2853a565-8c1f-4982-acaa-a0ab52691435",
            "text": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model...",
            "score": 0.95,
            "metadata": {
                "author": "Microsoft",
                "source": "https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=docs"
            }
        }
    ],
    "metadata": {
        "2853a565-8c1f-4982-acaa-a0ab52691435": {
            "author": "Microsoft",
            "source": "https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=docs"
        }
    }
}
```
- `response`: The generated answer or summary from the LLM (if enabled).
- `source_nodes`: List of source nodes with their text, score, and metadata.
- `metadata`: Additional metadata about the query or response.
Use this endpoint to retrieve relevant information from your indexed documents and optionally generate answers using an LLM.
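A minimal Python sketch of a query, printing the generated answer and the retrieved source nodes (base URL assumed as in the earlier examples):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: adjust to your RAGEngine endpoint

payload = {
    "index_name": "rag_index",
    "query": "What is RAG?",
    "top_k": 5,
    # llm_params are forwarded to the configured inference service.
    "llm_params": {"temperature": 0.7, "max_tokens": 2048},
}

resp = requests.post(f"{BASE_URL}/query", json=payload)
resp.raise_for_status()
result = resp.json()
print(result["response"])
for node in result["source_nodes"]:
    print(f"{node['score']:.2f}  {node['text'][:80]}...")
```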
Example Client
You can use the `example_rag_client.py` as a starting point for a RAG client with inputs that match the route documentation above.