AI Agent RAG Tutorial: Build a Knowledge Retrieval Agent with Your Own Data

Tutorials · By Ivern AI Team · 15 min read

ChatGPT knows about the internet. But it doesn't know about your company's internal docs, your research papers, or your customer database. RAG (Retrieval-Augmented Generation) fixes this by letting AI agents search your own documents before answering.

This tutorial walks you through building a complete RAG agent: from ingesting documents to answering questions with citations. You'll learn the architecture, implement each component, and understand the tradeoffs.

Related tutorials: Build AI Agent From Scratch · AI Agent Python Tutorial · AI Agent Tools Tutorial

How RAG Works

RAG adds a retrieval step before the AI generates a response:

User Question
     │
     ▼
[1. Embed the question]
     │
     ▼
[2. Search vector database for similar documents]
     │
     ▼
[3. Send question + retrieved documents to LLM]
     │
     ▼
[4. LLM generates answer based on retrieved context]

Without RAG, the LLM answers from its training data (which may be outdated or generic). With RAG, the LLM answers from your specific documents.
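
In code, the whole loop is only a few lines. A minimal sketch (retrieve and generate are placeholders for the components built in the rest of this tutorial):

def rag_answer(question: str) -> str:
    docs = retrieve(question)                       # steps 1-2: embed the question, search the vector store
    context = "\n\n".join(d["text"] for d in docs)
    return generate(question, context)              # steps 3-4: LLM answers from the retrieved context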

Why RAG Beats Fine-Tuning

| Approach | Cost | Setup Time | Data Freshness | Accuracy on Your Data |
|---|---|---|---|---|
| Fine-tuning | $100-10,000+ | Hours to days | Static (retrain to update) | High for trained domain |
| RAG | $1-50/month | Minutes | Real-time (just update docs) | High with good retrieval |

RAG is faster to set up, cheaper to maintain, and always current. Fine-tuning is better only when you need the model to learn a new style or domain deeply.

Setting Up the Document Pipeline

Prerequisites

pip install openai chromadb pypdf python-dotenv tiktoken
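
The OpenAI client reads your key from the OPENAI_API_KEY environment variable. Since python-dotenv is installed above, a minimal setup sketch (assuming the key lives in a local .env file):

from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY from .env into the environment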

Step 1: Load Documents

import os
from pypdf import PdfReader

def load_pdf(file_path: str) -> str:
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        # extract_text() can return None for pages with no extractable text
        text += (page.extract_text() or "") + "\n"
    return text

def load_text(file_path: str) -> str:
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

def load_documents(directory: str) -> list[dict]:
    documents = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            content = load_pdf(filepath)
        elif filename.endswith(".txt") or filename.endswith(".md"):
            content = load_text(filepath)
        else:
            continue
        
        documents.append({
            "filename": filename,
            "content": content,
            "source": filepath
        })
    return documents

Step 2: Chunk Documents

Large documents need to be split into smaller chunks for effective retrieval:

import tiktoken

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        start += max_tokens - overlap
    
    return chunks

def process_documents(documents: list[dict]) -> list[dict]:
    chunks = []
    for doc in documents:
        doc_chunks = chunk_text(doc["content"])
        for i, chunk in enumerate(doc_chunks):
            chunks.append({
                "text": chunk,
                "source": doc["filename"],
                "chunk_index": i
            })
    return chunks

Why chunking matters: chunks that are too large dilute relevance (the matching passage is buried in off-topic text), while chunks that are too small lose the surrounding context the LLM needs. 300-500 tokens is a good starting point for most use cases.
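
A quick sanity check of the chunker on synthetic text (illustrative numbers; exact token counts depend on the tokenizer):

sample = "word " * 2000  # roughly 2,000 tokens of filler
chunks = chunk_text(sample, max_tokens=500, overlap=50)
print(len(chunks))       # ~5 chunks: each step advances 500 - 50 = 450 tokens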

Building the Vector Store

Using ChromaDB (Free, Local)

import chromadb
from openai import OpenAI

client = OpenAI()

chroma = chromadb.Client()
collection = chroma.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)

def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
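
The embeddings endpoint has per-request limits, so large corpora should be embedded in batches. A minimal wrapper sketch (the batch size of 100 is an arbitrary conservative choice, not an API constant):

def get_embeddings_batched(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    embeddings = []
    for i in range(0, len(texts), batch_size):
        # Embed one slice at a time to stay under per-request limits
        embeddings.extend(get_embeddings(texts[i:i + batch_size]))
    return embeddings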

def index_documents(chunks: list[dict]):
    texts = [chunk["text"] for chunk in chunks]
    embeddings = get_embeddings(texts)
    ids = [f"{chunk['source']}_chunk_{chunk['chunk_index']}" for chunk in chunks]
    metadatas = [{"source": chunk["source"]} for chunk in chunks]
    
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=texts,
        metadatas=metadatas
    )

Index Your Documents

documents = load_documents("./my_documents")
chunks = process_documents(documents)
index_documents(chunks)
print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")

Implementing Retrieval

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_embedding = get_embeddings([query])[0]
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
    retrieved = []
    for i in range(len(results["ids"][0])):
        retrieved.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i]
        })
    
    return retrieved
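
A quick usage sketch with a hypothetical question:

hits = retrieve("What is the vacation policy?", top_k=3)
for hit in hits:
    print(f"{hit['source']} (distance {hit['distance']:.3f}): {hit['text'][:80]}")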

Reranking for Better Results

Basic vector search returns semantically similar documents, but similarity doesn't always mean relevance. Reranking fixes this:

def rerank(query: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    prompt = f"""Rate how relevant each document is to the query on a scale of 1-10.

Query: {query}

Documents:
"""
    for i, doc in enumerate(documents):
        prompt += f"\n[{i+1}] (from {doc['source']}): {doc['text'][:200]}...\n"
    
    prompt += "\nReturn JSON: {\"rankings\": [{\"index\": 1, \"score\": 8}, ...]}"
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    
    import json
    rankings = json.loads(response.choices[0].message.content)["rankings"]
    rankings.sort(key=lambda x: x["score"], reverse=True)
    
    return [documents[r["index"] - 1] for r in rankings[:top_k]]
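
This uses the LLM itself as a reranker, which is simple but adds a model call per query; dedicated cross-encoder rerankers (such as Cohere Rerank) are a common lower-latency alternative at scale.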

Building the RAG Agent

Now let's combine retrieval with generation:

def ask_rag_agent(question: str, use_reranking: bool = True) -> str:
    retrieved = retrieve(question, top_k=10)
    
    if use_reranking:
        retrieved = rerank(question, retrieved, top_k=5)
    else:
        retrieved = retrieved[:5]
    
    context = "\n\n".join([
        f"[Source: {doc['source']}]\n{doc['text']}" 
        for doc in retrieved
    ])
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a knowledgeable assistant that answers questions based on the provided documents.

Rules:
1. Only answer based on the provided context
2. If the context doesn't contain the answer, say "I don't have enough information to answer this question"
3. Always cite the source document for each claim
4. Be concise and specific
5. Quote relevant passages when possible"""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    
    return response.choices[0].message.content

Testing the Agent

answer = ask_rag_agent("What is our company's vacation policy?")
print(answer)

The agent will:

  1. Embed the question
  2. Search for relevant document chunks
  3. Rerank by relevance
  4. Generate an answer citing specific documents

Improving Retrieval Quality

Technique 1: Query Expansion

Generate multiple search queries to improve recall:

def expand_query(query: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Generate 3 different search queries that would find relevant documents for this question. Return JSON array of strings."},
            {"role": "user", "content": query}
        ],
        response_format={"type": "json_object"}
    )
    
    import json
    expanded = json.loads(response.choices[0].message.content)
    return [query] + expanded.get("queries", [])

def retrieve_with_expansion(query: str, top_k: int = 5) -> list[dict]:
    queries = expand_query(query)
    all_results = []
    
    for q in queries:
        results = retrieve(q, top_k=top_k)
        all_results.extend(results)
    
    seen = set()
    unique = []
    for r in all_results:
        key = r["text"][:100]
        if key not in seen:
            seen.add(key)
            unique.append(r)
    
    return unique[:top_k * 2]

Technique 2: Hybrid Search

Combine vector search with keyword matching:

def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    vector_results = retrieve(query, top_k=top_k * 2)
    
    query_words = set(query.lower().split())
    scored = []
    for result in vector_results:
        doc_words = set(result["text"].lower().split())
        keyword_overlap = len(query_words & doc_words) / max(len(query_words), 1)
        
        vector_score = 1 - result["distance"]
        hybrid_score = 0.7 * vector_score + 0.3 * keyword_overlap
        
        scored.append({**result, "hybrid_score": hybrid_score})
    
    scored.sort(key=lambda x: x["hybrid_score"], reverse=True)
    return scored[:top_k]
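
The 0.7/0.3 weighting is a starting point, not a rule: tune it against your own queries, since the right balance depends on how much exact terminology (product names, error codes) matters in your domain.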

Technique 3: Metadata Filtering

Filter by document type, date, or category:

def retrieve_with_filters(query: str, source_filter: str = None, top_k: int = 5) -> list[dict]:
    query_embedding = get_embeddings([query])[0]
    
    # ChromaDB metadata filters match values exactly; use an equality filter on source
    where_filter = {"source": source_filter} if source_filter else None
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=where_filter
    )
    
    return [
        {"text": results["documents"][0][i], "source": results["metadatas"][0][i]["source"]}
        for i in range(len(results["ids"][0]))
    ]

Production Deployment

Scaling the Vector Store

For production, use a persistent vector database:

| Database | Best For | Cost |
|---|---|---|
| ChromaDB (local) | Development, prototyping | Free |
| Pinecone | Production, managed | $25+/month |
| Weaviate | Self-hosted production | Free (self-hosted) |
| pgvector (Postgres) | Already using Postgres | Free |
| Qdrant | High performance | Free (self-hosted) |
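
Even if you stay on ChromaDB, switching from the in-memory client used above to the persistent client keeps the index across restarts. A minimal sketch (the ./chroma_db path is arbitrary):

import chromadb

# Persists the index to disk instead of holding it only in memory
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)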

Production RAG Pipeline

class ProductionRAGAgent:
    def __init__(self, doc_directory: str):
        self.documents = load_documents(doc_directory)
        self.chunks = process_documents(self.documents)
        index_documents(self.chunks)
    
    def ask(self, question: str) -> dict:
        results = hybrid_search(question, top_k=8)
        top_results = rerank(question, results, top_k=5)
        
        context = "\n\n".join(f"[{r['source']}]: {r['text']}" for r in top_results)
        
        # Generate from the hybrid-search context directly; calling ask_rag_agent
        # here would re-run retrieval and ignore these results
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer only from the provided context. Cite the source document for each claim."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        
        return {
            "answer": response.choices[0].message.content,
            "sources": [r["source"] for r in top_results],
            "chunks_used": len(top_results)
        }

agent = ProductionRAGAgent("./company_docs")
result = agent.ask("What is our refund policy for enterprise customers?")
print(result["answer"])
print(f"Sources: {result['sources']}")

Keeping the Index Updated

import hashlib

def get_file_hash(filepath: str) -> str:
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def update_index(directory: str, known_hashes: dict) -> dict:
    current_files = {}
    changed = []
    
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        file_hash = get_file_hash(filepath)
        current_files[filename] = file_hash
        
        # Collect files that are new or whose contents changed
        if known_hashes.get(filename) != file_hash:
            changed.append(filename)
    
    # Re-chunk and re-index only the changed documents, not the entire directory
    if changed:
        docs = [d for d in load_documents(directory) if d["filename"] in changed]
        index_documents(process_documents(docs))
    
    return current_files
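
A usage sketch that persists the hashes between runs (the hashes.json filename is arbitrary):

import json

try:
    with open("hashes.json") as f:
        known_hashes = json.load(f)
except FileNotFoundError:
    known_hashes = {}  # first run: index everything

known_hashes = update_index("./company_docs", known_hashes)

with open("hashes.json", "w") as f:
    json.dump(known_hashes, f)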

Skip the Setup: Use Ivern AI

Building a RAG pipeline from scratch takes hours. Ivern AI provides knowledge-augmented agents out of the box:

  • Upload your documents -- PDFs, text files, markdown
  • Agents search your data -- automatic retrieval and citation
  • No vector database management -- it's handled for you
  • BYOK pricing -- use your API key, no markup on retrieval costs

Try knowledge-augmented agents: ivern.ai/signup

Key Takeaways

  1. RAG = Retrieve + Generate -- search your documents first, then let the LLM answer
  2. Chunking matters -- 300-500 tokens per chunk is the sweet spot for most use cases
  3. Reranking improves relevance -- vector similarity alone isn't enough
  4. Cite sources -- always show where the answer came from
  5. Keep indexes fresh -- rebuild when documents change

Next tutorials: AI Agent Tools Tutorial · Build AI Agent From Scratch · AI Agent Python Tutorial
