How to Build Your First RAG App in Python (Step-by-Step)

If you can write basic Python, you can ship a working Retrieval-Augmented Generation (RAG) app in an afternoon. This is the same architecture we teach in the 8-Week Python + AI Systems Lab — no fluff, no buzzwords, just the seven steps that actually matter.

By the ThinkPythonAI Team · Updated May 2026 · Live cohorts on Zoom

What is RAG, in one paragraph?

A RAG app answers questions using your own documents instead of just the LLM's baked-in knowledge. You break your documents into chunks, turn each chunk into a numeric vector, store them in a vector database, and when a user asks a question you find the most relevant chunks and feed them to an LLM as context. The LLM then writes an answer grounded in your data. That's it. The rest is engineering.

Step 1: Set up your Python environment

You need Python 3.10 or newer, a virtual environment, and a handful of libraries. Open a terminal and run:

python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install langchain langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf python-dotenv

Create a .env file with OPENAI_API_KEY=sk-... and add .env to your .gitignore. Never commit keys.
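A missing key usually surfaces later as a confusing authentication error, so it can help to fail fast at startup. The following is a minimal sketch, not part of any library: the helper name `require_api_key` is hypothetical, and it assumes the key has already been loaded into the environment (for example by python-dotenv's `load_dotenv()`).

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key


# A plain dict stands in for os.environ so the demo runs without a real key.
print(require_api_key({"OPENAI_API_KEY": "sk-demo"}))
```

In a real script you would call `require_api_key()` with no arguments, right after loading the .env file.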

Step 2: Load and chunk your document

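LLMs have context limits, and retrieval rewards relevance over volume. Smaller chunks tend to make retrieval more precise, but chunks that are too small lose the surrounding context. A safe default is 500–1000 tokens per chunk with a 50–100 token overlap so context isn't cut mid-thought. Note that RecursiveCharacterTextSplitter counts characters by default, not tokens, so the chunk_size=800 below is roughly 200 tokens of English text. Raise it, or use a token-based length function, if you want token-sized chunks.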

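from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()  # reads OPENAI_API_KEY from .env into the environment

# Load the PDF; PyPDFLoader returns one Document per page
docs = PyPDFLoader("your-document.pdf").load()
# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")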
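To see what overlap actually does, here is a simplified sliding-window chunker in plain Python. It is an illustration only: the function name `chunk_text` is hypothetical, and the real RecursiveCharacterTextSplitter is smarter, since it tries to split on paragraph and sentence boundaries before falling back to raw character positions.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=80):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.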

Step 3: Generate embeddings

An embedding is a vector of numbers that captures the meaning of a chunk. Two chunks about the same topic land near each other in vector space. OpenAI's text-embedding-3-small is fast, inexpensive, and a solid default for English text. A free local alternative is sentence-transformers/all-MiniLM-L6-v2 from Hugging Face.
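"Near each other" is usually measured with cosine similarity. The sketch below computes it by hand on made-up three-dimensional vectors; the vectors and variable names are invented for illustration, and real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three chunks.
refund_policy = [0.9, 0.1, 0.0]
return_window = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_policy, return_window))  # close to 1: similar topics
print(cosine_similarity(refund_policy, office_hours))   # close to 0: unrelated topics
```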

Step 4: Store vectors in a vector DB

For a first project, FAISS (in-memory) or Chroma (local file) is plenty. Move to Pinecone or Weaviate only when you need horizontal scaling.

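from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Embed every chunk and build the in-memory FAISS index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
# Persist the index to the rag_index/ folder so it isn't rebuilt on every run
vectorstore.save_local("rag_index")
# To reload later (only for index files you created yourself):
# vectorstore = FAISS.load_local("rag_index", embeddings, allow_dangerous_deserialization=True)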

Step 5: Retrieve relevant chunks

When a question comes in, embed it the same way and ask the store for the top-k most similar chunks. Start with k=4. If answers feel thin, raise to 6; if they hallucinate, lower to 3.
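With the FAISS store from Step 4, retrieval is a single call such as `vectorstore.similarity_search(question, k=4)`. To show what top-k means underneath, here is a plain-Python sketch over a toy index. The vectors, texts, and function names are made up, and cosine similarity is used for illustration; your vector store may rank by a different distance metric.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k(query_vec, indexed_chunks, k=4):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = ((cosine(query_vec, vec), text) for text, vec in indexed_chunks)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy index of (chunk text, made-up embedding) pairs.
index = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Returns need the original receipt.", [0.7, 0.3, 0.1]),
    ("Our office opens at 9am.", [0.0, 0.2, 0.9]),
]
question_vec = [1.0, 0.0, 0.0]  # stands in for the embedded question

for score, text in top_k(question_vec, index, k=2):
    print(f"{score:.2f}  {text}")
```

Raising k pulls in lower-scoring chunks such as the office-hours line, which is why a large k can add noise as well as coverage.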

Step 6: Generate a grounded answer

The system prompt is where most beginners go wrong. Be strict. Tell the model what to do when context is missing — otherwise it will guess.

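from langchain_openai import ChatOpenAI
# RetrievalQA is a legacy chain. If this import fails on a recent LangChain
# release, it may have moved to the separate langchain-classic package.
from langchain.chains import RetrievalQA

# temperature=0 keeps answers as repeatable as possible
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,  # keep the retrieved chunks for citations
)

result = qa.invoke({"query": "What does the document say about refunds?"})
print(result["result"])
# List where each supporting chunk came from (PyPDFLoader pages are 0-indexed)
for doc in result["source_documents"]:
    print("-", doc.metadata.get("source"), doc.metadata.get("page"))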
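The chain above does not set a prompt of its own, so it falls back to the library's default. Below is a sketch of what a strict prompt could look like, written in plain Python so the wording is easy to inspect. The template text and the names `STRICT_TEMPLATE` and `build_prompt` are illustrative assumptions, not LangChain's built-in prompt.

```python
STRICT_TEMPLATE = (
    "Answer using only the context below. "
    "If the answer is not present, say \"I don't know\".\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, chunks):
    """Number each retrieved chunk and slot it into the strict template."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return STRICT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "What does the document say about refunds?",
    ["Refunds are issued within 14 days.", "Returns need the original receipt."],
)
print(prompt)
```

To use wording like this with RetrievalQA, the usual route is a PromptTemplate with `context` and `question` variables passed through `chain_type_kwargs={"prompt": ...}`; verify that against your installed LangChain version before relying on it.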

Step 7: Evaluate and iterate

Build a small set of 10–20 real questions with known correct answers. Run them every time you change chunk size, top-k, or the prompt. Two levers do 80% of the work:

Chunking strategy. If retrieval is missing answers that exist in the doc, your chunks are too big or your overlap is too small.
Prompt discipline. A single instruction prevents many hallucinations: "Answer using only the context below. If the answer is not present, say 'I don't know.'"

Reach for fine-tuning only after you've maxed out retrieval and prompting. For most apps, you never will.
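The test-set loop from this step can be sketched in a few lines. The questions, expected phrases, and the `fake_answer` stand-in are all hypothetical; the check is a simple case-insensitive substring match, which is crude but enough to catch regressions.

```python
def evaluate(answer_fn, test_set):
    """Run each question through answer_fn and check for the expected phrase."""
    failures = []
    for question, expected in test_set:
        answer = answer_fn(question)
        if expected.lower() not in answer.lower():
            failures.append((question, expected, answer))
    passed = len(test_set) - len(failures)
    return passed, failures


# Hypothetical test set: (question, phrase a correct answer must contain).
TEST_SET = [
    ("How long do refunds take?", "14 days"),
    ("When does the office open?", "9am"),
]


def fake_answer(question):
    """Stand-in for the real chain so the sketch runs offline."""
    canned = {"How long do refunds take?": "Refunds are issued within 14 days."}
    return canned.get(question, "I don't know")


passed, failures = evaluate(fake_answer, TEST_SET)
print(f"{passed}/{len(TEST_SET)} passed")
for question, expected, answer in failures:
    print(f"FAIL: {question!r} expected {expected!r}, got {answer!r}")
```

To run it against the chain from Step 6, replace `fake_answer` with something like `lambda q: qa.invoke({"query": q})["result"]`.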

Common mistakes to avoid

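Letting chunks get huge (2000+ tokens). Retrieval accuracy tends to drop because each chunk mixes several topics.
Using a different embedding model for indexing than for querying. Vectors from different models aren't comparable, so always use the same one.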
Storing the LLM API key in your code or committing the vector index to a public repo.
Skipping evaluation. Without a test set, you're tuning blind.
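One way to guard against the embedding-model mismatch is to record the model name next to the index and check it before querying. This is a sketch of that idea, not a LangChain feature: the `manifest.json` file name and both helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(index_dir, model_name):
    """Record which embedding model built the index."""
    path = Path(index_dir) / "manifest.json"
    path.write_text(json.dumps({"embedding_model": model_name}))


def check_manifest(index_dir, model_name):
    """Raise if the query-time model differs from the one used for indexing."""
    path = Path(index_dir) / "manifest.json"
    recorded = json.loads(path.read_text())["embedding_model"]
    if recorded != model_name:
        raise ValueError(f"index was built with {recorded!r}, not {model_name!r}")


# A temporary folder stands in for the real index directory.
with tempfile.TemporaryDirectory() as index_dir:
    write_manifest(index_dir, "text-embedding-3-small")
    check_manifest(index_dir, "text-embedding-3-small")  # passes silently
    try:
        check_manifest(index_dir, "all-MiniLM-L6-v2")
    except ValueError as err:
        print(err)
```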

Where to go next

Once your basic RAG works, the next upgrades — in order of impact — are: hybrid search (BM25 + vector), re-ranking with a cross-encoder, query rewriting, and finally agents that can call tools and decide when to retrieve. We cover all of these end-to-end in the 8-Week Python + AI Systems Lab.

Want to build this with live guidance?

ThinkPythonAI runs small live cohorts where you build real Python + AI projects with direct feedback. Most professionals go directly into the 8-Week Python + AI Systems Lab. Kids (Grades 5-12) have their own track.
