AI Agent RAG Tutorial: Build a Knowledge Retrieval Agent with Your Own Data
ChatGPT knows about the internet. But it doesn't know about your company's internal docs, your research papers, or your customer database. RAG (Retrieval-Augmented Generation) fixes this by letting AI agents search your own documents before answering.
This tutorial walks you through building a complete RAG agent: from ingesting documents to answering questions with citations. You'll learn the architecture, implement each component, and understand the tradeoffs.
In this tutorial:
- How RAG works
- Setting up the document pipeline
- Building the vector store
- Implementing retrieval
- Building the RAG agent
- Improving retrieval quality
- Production deployment
Related tutorials: Build AI Agent From Scratch · AI Agent Python Tutorial · AI Agent Tools Tutorial
How RAG Works
RAG adds a retrieval step before the AI generates a response:
User Question
│
▼
[1. Embed the question]
│
▼
[2. Search vector database for similar documents]
│
▼
[3. Send question + retrieved documents to LLM]
│
▼
[4. LLM generates answer based on retrieved context]
Without RAG, the LLM answers from its training data (which may be outdated or generic). With RAG, the LLM answers from your specific documents.
Why RAG Beats Fine-Tuning
| Approach | Cost | Setup Time | Data Freshness | Accuracy on Your Data |
|---|---|---|---|---|
| Fine-tuning | $100-10,000+ | Hours to days | Static (retrain to update) | High for trained domain |
| RAG | $1-50/month | Minutes | Real-time (just update docs) | High with good retrieval |
RAG is faster to set up, cheaper to maintain, and always current. Fine-tuning is better only when you need the model to learn a new style or domain deeply.
Setting Up the Document Pipeline
Prerequisites
pip install openai chromadb pypdf python-dotenv tiktoken
Step 1: Load Documents
import os
from pypdf import PdfReader

def load_pdf(file_path: str) -> str:
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        # extract_text() can return None for image-only pages
        text += (page.extract_text() or "") + "\n"
    return text

def load_text(file_path: str) -> str:
    with open(file_path, "r") as f:
        return f.read()

def load_documents(directory: str) -> list[dict]:
    documents = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            content = load_pdf(filepath)
        elif filename.endswith(".txt") or filename.endswith(".md"):
            content = load_text(filepath)
        else:
            continue
        documents.append({
            "filename": filename,
            "content": content,
            "source": filepath
        })
    return documents
Step 2: Chunk Documents
Large documents need to be split into smaller chunks for effective retrieval:
import tiktoken

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        # Overlapping windows preserve context across chunk boundaries
        start += max_tokens - overlap
    return chunks

def process_documents(documents: list[dict]) -> list[dict]:
    chunks = []
    for doc in documents:
        doc_chunks = chunk_text(doc["content"])
        for i, chunk in enumerate(doc_chunks):
            chunks.append({
                "text": chunk,
                "source": doc["filename"],
                "chunk_index": i
            })
    return chunks
Why chunking matters: chunks that are too large dilute relevance, while chunks that are too small lose context. 300-500 tokens per chunk is a good starting point for most use cases.
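To see the tradeoff concretely, you can compare how many chunks different sizes produce for one of your own documents. The filename here is a hypothetical example:

# "handbook.txt" is a hypothetical example file; swap in one of your own.
text = load_text("handbook.txt")
for size in (200, 500, 1000):
    n = len(chunk_text(text, max_tokens=size, overlap=50))
    print(f"max_tokens={size}: {n} chunks")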
Building the Vector Store
Using ChromaDB (Free, Local)
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)

def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
def index_documents(chunks: list[dict]):
    texts = [chunk["text"] for chunk in chunks]
    embeddings = get_embeddings(texts)
    ids = [f"{chunk['source']}_chunk_{chunk['chunk_index']}" for chunk in chunks]
    metadatas = [{"source": chunk["source"]} for chunk in chunks]
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=texts,
        metadatas=metadatas
    )
Index Your Documents
documents = load_documents("./my_documents")
chunks = process_documents(documents)
index_documents(chunks)
print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")
Implementing Retrieval
Basic Similarity Search
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_embedding = get_embeddings([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    retrieved = []
    for i in range(len(results["ids"][0])):
        retrieved.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i]
        })
    return retrieved
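A quick smoke test, using an example query (adjust it to match your own documents):

# Print the closest chunks for a sample query; lower distance = closer match.
for hit in retrieve("What is the vacation policy?", top_k=3):
    print(f"{hit['distance']:.3f}  {hit['source']}: {hit['text'][:80]}")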
Reranking for Better Results
Basic vector search returns semantically similar documents, but similarity doesn't always mean relevance. Reranking fixes this:
import json

def rerank(query: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    prompt = f"""Rate how relevant each document is to the query on a scale of 1-10.

Query: {query}

Documents:
"""
    for i, doc in enumerate(documents):
        prompt += f"\n[{i+1}] (from {doc['source']}): {doc['text'][:200]}...\n"
    prompt += "\nReturn JSON: {\"rankings\": [{\"index\": 1, \"score\": 8}, ...]}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    rankings = json.loads(response.choices[0].message.content)["rankings"]
    rankings.sort(key=lambda x: x["score"], reverse=True)
    return [documents[r["index"] - 1] for r in rankings[:top_k]]
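To exercise the reranker end to end, feed it raw retrieval output (the query is illustrative):

# Retrieve broadly, then let the LLM pick the most relevant chunks.
question = "How do I request parental leave?"
candidates = retrieve(question, top_k=10)
for doc in rerank(question, candidates, top_k=3):
    print(doc["source"], "->", doc["text"][:80])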
Building the RAG Agent
Now let's combine retrieval with generation:
def ask_rag_agent(question: str, use_reranking: bool = True) -> str:
    retrieved = retrieve(question, top_k=10)
    if use_reranking:
        retrieved = rerank(question, retrieved, top_k=5)
    else:
        retrieved = retrieved[:5]
    context = "\n\n".join([
        f"[Source: {doc['source']}]\n{doc['text']}"
        for doc in retrieved
    ])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a knowledgeable assistant that answers questions based on the provided documents.

Rules:
1. Only answer based on the provided context
2. If the context doesn't contain the answer, say "I don't have enough information to answer this question"
3. Always cite the source document for each claim
4. Be concise and specific
5. Quote relevant passages when possible"""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content
Testing the Agent
answer = ask_rag_agent("What is our company's vacation policy?")
print(answer)
The agent will:
- Embed the question
- Search for relevant document chunks
- Rerank by relevance
- Generate an answer citing specific documents
Improving Retrieval Quality
Technique 1: Query Expansion
Generate multiple search queries to improve recall:
import json

def expand_query(query: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # json_object mode requires a JSON object, so ask for a "queries" key
            {"role": "system", "content": "Generate 3 different search queries that would find relevant documents for this question. Return a JSON object: {\"queries\": [\"...\", \"...\", \"...\"]}."},
            {"role": "user", "content": query}
        ],
        response_format={"type": "json_object"}
    )
    expanded = json.loads(response.choices[0].message.content)
    return [query] + expanded.get("queries", [])

def retrieve_with_expansion(query: str, top_k: int = 5) -> list[dict]:
    queries = expand_query(query)
    all_results = []
    for q in queries:
        results = retrieve(q, top_k=top_k)
        all_results.extend(results)
    # Deduplicate near-identical chunks by their first 100 characters
    seen = set()
    unique = []
    for r in all_results:
        key = r["text"][:100]
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique[:top_k * 2]
Technique 2: Hybrid Search
Combine vector search with keyword matching:
def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    vector_results = retrieve(query, top_k=top_k * 2)
    query_words = set(query.lower().split())
    scored = []
    for result in vector_results:
        doc_words = set(result["text"].lower().split())
        keyword_overlap = len(query_words & doc_words) / max(len(query_words), 1)
        vector_score = 1 - result["distance"]
        # Weight semantic similarity more heavily than exact keyword overlap
        hybrid_score = 0.7 * vector_score + 0.3 * keyword_overlap
        scored.append({**result, "hybrid_score": hybrid_score})
    scored.sort(key=lambda x: x["hybrid_score"], reverse=True)
    return scored[:top_k]
Technique 3: Metadata Filtering
Filter by document type, date, or category:
def retrieve_with_filters(query: str, source_filter: str = None, top_k: int = 5) -> list[dict]:
    query_embedding = get_embeddings([query])[0]
    where_filter = None
    if source_filter:
        # Chroma metadata filters match exact values; substring matching
        # ($contains) only applies to document text via where_document
        where_filter = {"source": {"$eq": source_filter}}
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=where_filter
    )
    return [
        {"text": results["documents"][0][i], "source": results["metadatas"][0][i]["source"]}
        for i in range(len(results["ids"][0]))
    ]
Production Deployment
Scaling the Vector Store
For production, use a persistent vector database:
| Database | Best For | Cost |
|---|---|---|
| ChromaDB (local) | Development, prototyping | Free |
| Pinecone | Production, managed | $25+/month |
| Weaviate | Self-hosted production | Free (self-hosted) |
| pgvector (Postgres) | Already using Postgres | Free |
| Qdrant | High performance | Free (self-hosted) |
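Even without leaving Chroma, you can move past the in-memory client from earlier: the persistent client writes the index to disk so it survives restarts. A minimal sketch; the storage path is an example, and PersistentClient requires a recent chromadb version:

import chromadb

# PersistentClient stores the index on disk instead of in memory,
# so you don't need to re-index on every process restart.
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)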
Production RAG Pipeline
class ProductionRAGAgent:
    def __init__(self, doc_directory: str):
        self.documents = load_documents(doc_directory)
        self.chunks = process_documents(self.documents)
        index_documents(self.chunks)

    def ask(self, question: str) -> dict:
        results = hybrid_search(question, top_k=8)
        top_results = rerank(question, results, top_k=5)
        context = "\n\n".join([f"[{r['source']}]: {r['text']}" for r in top_results])
        # Generate directly from the hybrid-searched, reranked context
        # (calling ask_rag_agent here would redo its own retrieval)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer only from the provided context and cite the source for each claim."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        return {
            "answer": response.choices[0].message.content,
            "sources": [r["source"] for r in top_results],
            "chunks_used": len(top_results)
        }

agent = ProductionRAGAgent("./company_docs")
result = agent.ask("What is our refund policy for enterprise customers?")
print(result["answer"])
print(f"Sources: {result['sources']}")
Keeping the Index Updated
import hashlib

def get_file_hash(filepath: str) -> str:
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def update_index(directory: str, known_hashes: dict) -> dict:
    current_files = {}
    changed = False
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        file_hash = get_file_hash(filepath)
        current_files[filename] = file_hash
        # Flag new or modified files instead of re-indexing inside the loop
        if known_hashes.get(filename) != file_hash:
            changed = True
    if changed:
        # Re-index once after scanning, not once per changed file
        documents = load_documents(directory)
        chunks = process_documents(documents)
        index_documents(chunks)
    return current_files
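To make change detection survive restarts, persist the hash map between runs. A minimal sketch; the JSON filename is an assumption:

import json

HASH_FILE = "index_hashes.json"  # hypothetical location for the hash map

# Load previous hashes (if any), refresh the index, then save the new hashes.
try:
    with open(HASH_FILE) as f:
        known_hashes = json.load(f)
except FileNotFoundError:
    known_hashes = {}

known_hashes = update_index("./company_docs", known_hashes)

with open(HASH_FILE, "w") as f:
    json.dump(known_hashes, f)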
Skip the Setup: Use Ivern AI
Building a RAG pipeline from scratch takes hours. Ivern AI provides knowledge-augmented agents out of the box:
- Upload your documents -- PDFs, text files, markdown
- Agents search your data -- automatic retrieval and citation
- No vector database management -- it's handled for you
- BYOK pricing -- use your API key, no markup on retrieval costs
Try knowledge-augmented agents: ivern.ai/signup
Key Takeaways
- RAG = Retrieve + Generate -- search your documents first, then let the LLM answer
- Chunking matters -- 300-500 tokens per chunk is the sweet spot for most use cases
- Reranking improves relevance -- vector similarity alone isn't enough
- Cite sources -- always show where the answer came from
- Keep indexes fresh -- rebuild when documents change
Next tutorials: AI Agent Tools Tutorial · Build AI Agent From Scratch · AI Agent Python Tutorial