AI Agent RAG Tutorial: Build a Knowledge Retrieval Agent with Your Own Data
ChatGPT knows about the internet. But it doesn't know about your company's internal docs, your research papers, or your customer database. RAG (Retrieval-Augmented Generation) fixes this by letting AI agents search your own documents before answering.
This tutorial walks you through building a complete RAG agent: from ingesting documents to answering questions with citations. You'll learn the architecture, implement each component, and understand the tradeoffs.
In this tutorial:
- How RAG works
- Setting up the document pipeline
- Building the vector store
- Implementing retrieval
- Building the RAG agent
- Improving retrieval quality
- Production deployment
Related tutorials: Build AI Agent From Scratch · AI Agent Python Tutorial · AI Agent Tools Tutorial
How RAG Works
RAG adds a retrieval step before the AI generates a response:
User Question
│
▼
[1. Embed the question]
│
▼
[2. Search vector database for similar documents]
│
▼
[3. Send question + retrieved documents to LLM]
│
▼
[4. LLM generates answer based on retrieved context]
Without RAG, the LLM answers from its training data (which may be outdated or generic). With RAG, the LLM answers from your specific documents.
Why RAG Beats Fine-Tuning
| Approach | Cost | Setup Time | Data Freshness | Accuracy on Your Data |
|---|---|---|---|---|
| Fine-tuning | $100-10,000+ | Hours to days | Static (retrain to update) | High for trained domain |
| RAG | $1-50/month | Minutes | Real-time (just update docs) | High with good retrieval |
RAG is faster to set up, cheaper to maintain, and always current. Fine-tuning is better only when you need the model to learn a new style or domain deeply.
Setting Up the Document Pipeline
Prerequisites
pip install openai chromadb pypdf python-dotenv tiktoken
Step 1: Load Documents
import os
from pypdf import PdfReader

def load_pdf(file_path: str) -> str:
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        # extract_text() can return None for image-only pages
        text += (page.extract_text() or "") + "\n"
    return text

def load_text(file_path: str) -> str:
    with open(file_path, "r") as f:
        return f.read()

def load_documents(directory: str) -> list[dict]:
    documents = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            content = load_pdf(filepath)
        elif filename.endswith(".txt") or filename.endswith(".md"):
            content = load_text(filepath)
        else:
            continue
        documents.append({
            "filename": filename,
            "content": content,
            "source": filepath
        })
    return documents
Step 2: Chunk Documents
Large documents need to be split into smaller chunks for effective retrieval:
import tiktoken

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        # Overlapping windows preserve context across chunk boundaries
        start += max_tokens - overlap
    return chunks

def process_documents(documents: list[dict]) -> list[dict]:
    chunks = []
    for doc in documents:
        doc_chunks = chunk_text(doc["content"])
        for i, chunk in enumerate(doc_chunks):
            chunks.append({
                "text": chunk,
                "source": doc["filename"],
                "chunk_index": i
            })
    return chunks
Why chunking matters: chunks that are too large dilute relevance, while chunks that are too small lose context. 300-500 tokens per chunk is a good starting point for most use cases.
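To see the tradeoff concretely, you can compare how many chunks different sizes produce for one of your own documents. The filename here is a hypothetical example:

# "handbook.txt" is a hypothetical example file; swap in one of your own.
text = load_text("handbook.txt")
for size in (200, 500, 1000):
    n = len(chunk_text(text, max_tokens=size, overlap=50))
    print(f"max_tokens={size}: {n} chunks")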
Building the Vector Store
Using ChromaDB (Free, Local)
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)

def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
def index_documents(chunks: list[dict]):
    texts = [chunk["text"] for chunk in chunks]
    embeddings = get_embeddings(texts)
    ids = [f"{chunk['source']}_chunk_{chunk['chunk_index']}" for chunk in chunks]
    metadatas = [{"source": chunk["source"]} for chunk in chunks]
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=texts,
        metadatas=metadatas
    )
Index Your Documents
documents = load_documents("./my_documents")
chunks = process_documents(documents)
index_documents(chunks)
print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")
Implementing Retrieval
Basic Similarity Search
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_embedding = get_embeddings([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    retrieved = []
    for i in range(len(results["ids"][0])):
        retrieved.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i]
        })
    return retrieved
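A quick smoke test, using an example query (adjust it to match your own documents):

# Print the closest chunks for a sample query; lower distance = closer match.
for hit in retrieve("What is the vacation policy?", top_k=3):
    print(f"{hit['distance']:.3f}  {hit['source']}: {hit['text'][:80]}")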
Reranking for Better Results
Basic vector search returns semantically similar documents, but similarity doesn't always mean relevance. Reranking fixes this:
import json

def rerank(query: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    prompt = f"""Rate how relevant each document is to the query on a scale of 1-10.

Query: {query}

Documents:
"""
    for i, doc in enumerate(documents):
        prompt += f"\n[{i+1}] (from {doc['source']}): {doc['text'][:200]}...\n"
    prompt += "\nReturn JSON: {\"rankings\": [{\"index\": 1, \"score\": 8}, ...]}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    rankings = json.loads(response.choices[0].message.content)["rankings"]
    rankings.sort(key=lambda x: x["score"], reverse=True)
    return [documents[r["index"] - 1] for r in rankings[:top_k]]
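To exercise the reranker end to end, feed it raw retrieval output (the query is illustrative):

# Retrieve broadly, then let the LLM pick the most relevant chunks.
question = "How do I request parental leave?"
candidates = retrieve(question, top_k=10)
for doc in rerank(question, candidates, top_k=3):
    print(doc["source"], "->", doc["text"][:80])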
Building the RAG Agent
Now let's combine retrieval with generation:
def ask_rag_agent(question: str, use_reranking: bool = True) -> str:
    retrieved = retrieve(question, top_k=10)
    if use_reranking:
        retrieved = rerank(question, retrieved, top_k=5)
    else:
        retrieved = retrieved[:5]
    context = "\n\n".join([
        f"[Source: {doc['source']}]\n{doc['text']}"
        for doc in retrieved
    ])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a knowledgeable assistant that answers questions based on the provided documents.

Rules:
1. Only answer based on the provided context
2. If the context doesn't contain the answer, say "I don't have enough information to answer this question"
3. Always cite the source document for each claim
4. Be concise and specific
5. Quote relevant passages when possible"""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content
Testing the Agent
answer = ask_rag_agent("What is our company's vacation policy?")
print(answer)
The agent will:
- Embed the question
- Search for relevant document chunks
- Rerank by relevance
- Generate an answer citing specific documents
Improving Retrieval Quality
Technique 1: Query Expansion
Generate multiple search queries to improve recall:
import json

def expand_query(query: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # json_object mode requires a JSON object, so ask for a "queries" key
            {"role": "system", "content": "Generate 3 different search queries that would find relevant documents for this question. Return a JSON object: {\"queries\": [\"...\", \"...\", \"...\"]}."},
            {"role": "user", "content": query}
        ],
        response_format={"type": "json_object"}
    )
    expanded = json.loads(response.choices[0].message.content)
    return [query] + expanded.get("queries", [])

def retrieve_with_expansion(query: str, top_k: int = 5) -> list[dict]:
    queries = expand_query(query)
    all_results = []
    for q in queries:
        results = retrieve(q, top_k=top_k)
        all_results.extend(results)
    # Deduplicate near-identical chunks by their first 100 characters
    seen = set()
    unique = []
    for r in all_results:
        key = r["text"][:100]
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique[:top_k * 2]
Technique 2: Hybrid Search
Combine vector search with keyword matching:
def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    vector_results = retrieve(query, top_k=top_k * 2)
    query_words = set(query.lower().split())
    scored = []
    for result in vector_results:
        doc_words = set(result["text"].lower().split())
        keyword_overlap = len(query_words & doc_words) / max(len(query_words), 1)
        vector_score = 1 - result["distance"]
        # Weight semantic similarity more heavily than exact keyword overlap
        hybrid_score = 0.7 * vector_score + 0.3 * keyword_overlap
        scored.append({**result, "hybrid_score": hybrid_score})
    scored.sort(key=lambda x: x["hybrid_score"], reverse=True)
    return scored[:top_k]
Technique 3: Metadata Filtering
Filter by document type, date, or category:
def retrieve_with_filters(query: str, source_filter: str = None, top_k: int = 5) -> list[dict]:
    query_embedding = get_embeddings([query])[0]
    where_filter = None
    if source_filter:
        # Chroma metadata filters match exact values; substring matching
        # ($contains) only applies to document text via where_document
        where_filter = {"source": {"$eq": source_filter}}
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=where_filter
    )
    return [
        {"text": results["documents"][0][i], "source": results["metadatas"][0][i]["source"]}
        for i in range(len(results["ids"][0]))
    ]
Production Deployment
Scaling the Vector Store
For production, use a persistent vector database:
| Database | Best For | Cost |
|---|---|---|
| ChromaDB (local) | Development, prototyping | Free |
| Pinecone | Production, managed | $25+/month |
| Weaviate | Self-hosted production | Free (self-hosted) |
| pgvector (Postgres) | Already using Postgres | Free |
| Qdrant | High performance | Free (self-hosted) |
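Even without leaving Chroma, you can move past the in-memory client from earlier: the persistent client writes the index to disk so it survives restarts. A minimal sketch; the storage path is an example, and PersistentClient requires a recent chromadb version:

import chromadb

# PersistentClient stores the index on disk instead of in memory,
# so you don't need to re-index on every process restart.
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)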
Production RAG Pipeline
class ProductionRAGAgent:
    def __init__(self, doc_directory: str):
        self.documents = load_documents(doc_directory)
        self.chunks = process_documents(self.documents)
        index_documents(self.chunks)

    def ask(self, question: str) -> dict:
        results = hybrid_search(question, top_k=8)
        top_results = rerank(question, results, top_k=5)
        context = "\n\n".join([f"[{r['source']}]: {r['text']}" for r in top_results])
        # Generate directly from the hybrid-searched, reranked context
        # (calling ask_rag_agent here would redo its own retrieval)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer only from the provided context and cite the source for each claim."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        return {
            "answer": response.choices[0].message.content,
            "sources": [r["source"] for r in top_results],
            "chunks_used": len(top_results)
        }

agent = ProductionRAGAgent("./company_docs")
result = agent.ask("What is our refund policy for enterprise customers?")
print(result["answer"])
print(f"Sources: {result['sources']}")
Keeping the Index Updated
import hashlib

def get_file_hash(filepath: str) -> str:
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def update_index(directory: str, known_hashes: dict) -> dict:
    current_files = {}
    changed = False
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        file_hash = get_file_hash(filepath)
        current_files[filename] = file_hash
        # Flag new or modified files instead of re-indexing inside the loop
        if known_hashes.get(filename) != file_hash:
            changed = True
    if changed:
        # Re-index once after scanning, not once per changed file
        documents = load_documents(directory)
        chunks = process_documents(documents)
        index_documents(chunks)
    return current_files
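To make change detection survive restarts, persist the hash map between runs. A minimal sketch; the JSON filename is an assumption:

import json

HASH_FILE = "index_hashes.json"  # hypothetical location for the hash map

# Load previous hashes (if any), refresh the index, then save the new hashes.
try:
    with open(HASH_FILE) as f:
        known_hashes = json.load(f)
except FileNotFoundError:
    known_hashes = {}

known_hashes = update_index("./company_docs", known_hashes)

with open(HASH_FILE, "w") as f:
    json.dump(known_hashes, f)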
Skip the Setup: Use Ivern AI
Building a RAG pipeline from scratch takes hours. Ivern AI provides knowledge-augmented agents out of the box:
- Upload your documents -- PDFs, text files, markdown
- Agents search your data -- automatic retrieval and citation
- No vector database management -- it's handled for you
- BYOK pricing -- use your API key, no markup on retrieval costs
Try knowledge-augmented agents: ivern.ai/signup
Key Takeaways
- RAG = Retrieve + Generate -- search your documents first, then let the LLM answer
- Chunking matters -- 300-500 tokens per chunk is the sweet spot for most use cases
- Reranking improves relevance -- vector similarity alone isn't enough
- Cite sources -- always show where the answer came from
- Keep indexes fresh -- rebuild when documents change
Next tutorials: AI Agent Tools Tutorial · Build AI Agent From Scratch · AI Agent Python Tutorial