Retrieval-Augmented Generation (RAG) has quickly become the dominant architecture for grounding large language models in custom enterprise data. By combining the generative power of LLMs with a retrieval step over your own documents, RAG mitigates two critical limitations: hallucination and the knowledge-cutoff problem. This RAG pipeline tutorial walks you through building a production-ready system using LangChain and OpenAI.
Whether you're building an AI assistant for customer support, an internal knowledge base, or a research tool, the pattern you'll learn here applies directly. Demand for practical guides like this has surged as companies race to deploy AI assistants that actually know their data.
By the end of this guide, you'll have a fully functional RAG pipeline that can ingest documents, index them into a vector store, and answer natural-language questions with citations. Let's dive in.
Prerequisites
Before you begin, ensure you have the following:
- Python 3.10+ installed
- An OpenAI API key (set as the `OPENAI_API_KEY` environment variable)
- Basic familiarity with Python and command-line tools
- LangChain (`langchain`, `langchain-openai`, `langchain-community`) installed via pip
- A vector store: we'll use ChromaDB (lightweight and local)

Run `python -m venv rag-env && source rag-env/bin/activate` to create and activate a virtual environment before you start.
What is RAG?
Retrieval-Augmented Generation is a technique that enhances LLM outputs by first retrieving relevant information from a knowledge base, then feeding that context into the model's prompt. Instead of relying solely on the LLM's internal knowledge, RAG grounds every answer in your data.
The core workflow is:
- Ingest — Load documents from PDFs, websites, databases, etc.
- Chunk & Embed — Split documents into manageable pieces and convert them to vector embeddings.
- Index — Store embeddings in a vector database for fast similarity search.
- Retrieve — When a user asks a question, find the most relevant chunks via semantic search.
- Generate — Feed the retrieved context + the question to an LLM to produce a grounded answer.
This approach is significantly more reliable than fine-tuning for most use cases, because you can update the knowledge base without retraining the model.
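To make that last point concrete, here's a minimal sketch of a knowledge-base refresh. It jumps ahead to the `vectorstore` we build in Step 3, and the filename is hypothetical; the point is that updating knowledge is an index write, not a training run.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and split a newly published document (hypothetical filename).
new_docs = PyPDFLoader("policy_update.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
new_chunks = splitter.split_documents(new_docs)

# Writing to the existing index makes the new knowledge searchable immediately.
vectorstore.add_documents(new_chunks)
```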
RAG Pipeline Architecture
Our pipeline will use LangChain as the orchestration layer — it gives us reusable components for document loading, text splitting, embeddings, vector stores, and prompt chaining. OpenAI provides the embeddings model (text-embedding-3-small) and the generation model (gpt-4o or gpt-3.5-turbo).
Here's the high-level flow:
User Question
     │
     ▼
[Embedder] ──→ similarity search ──→ Vector DB (Chroma)
                                          │
                                          ▼
                                 [Retrieved Chunks] ──┐
                                                      │
     ┌────────────────────────────────────────────────┘
     ▼
[Prompt Template] ──→ [LLM (OpenAI)] ──→ Grounded Answer
Each component is modular and swappable — you can replace Chroma with Pinecone or Weaviate, or swap OpenAI for Anthropic or a local model.
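As a rough sketch of what "swappable" means in practice (the commented-out alternatives are illustrative and assume you've installed the matching integration packages):

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma

# Default stack used throughout this tutorial.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")
llm = ChatOpenAI(model="gpt-4o")

# Swapping a component is a local change, e.g. (illustrative):
# from langchain_pinecone import PineconeVectorStore   # replaces Chroma
# from langchain_anthropic import ChatAnthropic        # replaces ChatOpenAI

# Downstream code only ever sees the generic retriever interface.
retriever = vectorstore.as_retriever()
```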
Step-by-Step Implementation
Step 1: Install Dependencies
pip install langchain langchain-openai langchain-community langchain-chroma chromadb pypdf tiktoken

We'll use pypdf for PDF loading, tiktoken for token-aware chunking, and langchain-chroma for the Chroma vector-store integration (needed for the import in Step 3).
Step 2: Load and Split Documents
Use LangChain's PyPDFLoader to ingest PDFs, then split them into overlapping chunks with RecursiveCharacterTextSplitter.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = PyPDFLoader("enterprise_handbook.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ".", " "],  # try paragraph breaks first, then finer splits
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")
Step 3: Create Embeddings and Vector Store
We'll embed each chunk using OpenAI's embedding model and store them in Chroma.
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",  # persist the index to disk so it survives restarts
)
text-embedding-3-small is cost-effective and produces 1536-dimensional vectors. For higher accuracy, use text-embedding-3-large (3072 dimensions).
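Before wiring up the LLM, it's worth sanity-checking retrieval on its own (the query text here is illustrative):

```python
# Raw semantic search against the index -- no LLM involved yet.
results = vectorstore.similarity_search("remote work policy", k=2)
for doc in results:
    print(doc.metadata.get("page"), doc.page_content[:120])
```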
Step 4: Build the Retrieval QA Chain
Now connect the retriever to the OpenAI LLM using LangChain's create_retrieval_chain pattern.
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
# Prompt template that forces grounded answers
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant. Use the provided context to answer the user's question. "
     "If you cannot find the answer in the context, say so clearly. Cite relevant passages."),
    ("human", "Context: {context}\n\nQuestion: {input}"),
])
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(
    vectorstore.as_retriever(search_kwargs={"k": 4}),  # fetch the top 4 chunks per query
    combine_docs_chain,
)
# Test it
response = retrieval_chain.invoke({"input": "What is the company's remote work policy?"})
print(response["answer"])
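The chain's output also includes the retrieved documents under the "context" key, which is the raw material for citations:

```python
# Show which pages grounded the answer (a simple citation trail).
for doc in response["context"]:
    print(f'- {doc.metadata.get("source")}, page {doc.metadata.get("page")}')
```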
Step 5: Add Memory (for conversations)
For multi-turn conversations, combine the retriever and LLM in a ConversationalRetrievalChain backed by a ConversationBufferMemory. (These are LangChain's legacy conversational components; newer releases favor create_history_aware_retriever, but the pattern below still works.)
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",  # the key the chain reads history from
    return_messages=True,       # store history as message objects, not one string
)
convo_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory,
)

result = convo_chain.invoke({"question": "What about travel reimbursement?"})
print(result["answer"])
You now have a fully functional RAG pipeline with conversational memory. Each turn retrieves fresh context and maintains conversation history.
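To see the memory in action, ask a follow-up that only makes sense given the previous turn (the question text is illustrative):

```python
# "it" resolves against chat history, while retrieval still pulls
# fresh chunks for the new question.
followup = convo_chain.invoke({"question": "Is there a cap on how much of it I can claim?"})
print(followup["answer"])
```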
Optimization & Best Practices
A basic RAG pipeline works, but production systems demand more. Here are key optimizations:
- Chunking strategy: Experiment with chunk sizes between 500 and 1500 tokens. Overlap of 10–15% prevents context fragmentation.
- Metadata filtering: Store document source, page number, and date in metadata. Use the `filter` option in the retriever's search_kwargs for scoped queries.
- Hybrid search: Combine vector similarity with keyword (BM25) search for better recall. LangChain supports this via `langchain_community.retrievers.BM25Retriever` combined with your vector retriever in an `EnsembleRetriever`; see the sketch after this list.
- Re-ranking: Use a cross-encoder (e.g., Cohere's reranker or `cross-encoder/ms-marco-MiniLM-L-6-v2`) to re-rank retrieved chunks and keep only the top 2–3 most relevant.
- Prompt engineering: Explicitly instruct the LLM to say "I don't know" when context is insufficient. This drastically reduces hallucination.
- Caching: Cache embeddings and LLM responses for repeated queries to reduce latency and cost.
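Here's a rough sketch of two of these ideas together: hybrid search via BM25 plus an ensemble, and a metadata-scoped vector retriever. The filter value is illustrative (it assumes a `source` key stored at ingestion, as PyPDFLoader does), and BM25Retriever requires the `rank_bm25` package.

```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword retriever built over the same chunks (pip install rank_bm25).
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 4

# Vector retriever scoped to a single document via a metadata filter.
vector_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "enterprise_handbook.pdf"}}
)

# Blend keyword and semantic recall; the weights are a starting point to tune.
hybrid = EnsembleRetriever(retrievers=[bm25, vector_retriever], weights=[0.4, 0.6])
docs = hybrid.invoke("remote work policy")
```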
Conclusion & Next Steps
In this RAG pipeline tutorial, you built a complete retrieval-augmented generation system using LangChain and OpenAI. You learned how to ingest documents, create embeddings, store them in a vector database, and retrieve relevant context to generate grounded, trustworthy answers.
This architecture is the foundation for enterprise AI assistants, knowledge management systems, and customer-facing chatbots. The same pattern scales from a single PDF to millions of documents with the right vector store and infrastructure.
Next steps to deepen your RAG expertise:
- Explore RAG evaluation metrics — faithfulness, relevance, and precision.
- Learn about agentic RAG — combining retrieval with tool use for multi-step reasoning.
- Check our advanced LangChain patterns for conditional retrieval and dynamic chunking.
- Deploy your pipeline with Docker and FastAPI for production serving.
RAG is evolving fast — stay tuned to AI Science Hub for the latest research, benchmarks, and practical guides. If you built something cool with this tutorial, let us know — we'd love to feature it.