Retrieval-Augmented Generation (RAG) has quickly become the dominant architecture for grounding large language models in custom enterprise data. By combining the generative power of LLMs with a retrieval step over your own documents, RAG effectively addresses two critical limitations: hallucination and the knowledge cutoff problem. This RAG pipeline tutorial walks you through building a production-ready system using LangChain and OpenAI.

Whether you're building an AI assistant for customer support, an internal knowledge base, or a research tool, the pattern you'll learn here applies directly.

By the end of this guide, you'll have a fully functional RAG pipeline that can ingest documents, index them into a vector store, and answer natural-language questions with citations. Let's dive in.

Prerequisites

Before you begin, ensure you have the following:

  • Python 3.10+ installed
  • An OpenAI API key (set as OPENAI_API_KEY environment variable)
  • Basic familiarity with Python and command-line tools
  • LangChain (langchain, langchain-openai, langchain-community) installed via pip
  • A vector store — we'll use ChromaDB (lightweight and local)
💡 Pro tip: Use a Python virtual environment to keep dependencies clean. Run python -m venv rag-env && source rag-env/bin/activate to get started.

What is RAG?

Retrieval-Augmented Generation is a technique that enhances LLM outputs by first retrieving relevant information from a knowledge base, then feeding that context into the model's prompt. Instead of relying solely on the LLM's internal knowledge, RAG grounds every answer in your data.

The core workflow is:

  1. Ingest — Load documents from PDFs, websites, databases, etc.
  2. Chunk & Embed — Split documents into manageable pieces and convert them to vector embeddings.
  3. Index — Store embeddings in a vector database for fast similarity search.
  4. Retrieve — When a user asks a question, find the most relevant chunks via semantic search.
  5. Generate — Feed the retrieved context + the question to an LLM to produce a grounded answer.

This approach is significantly more reliable than fine-tuning for most use cases, because you can update the knowledge base without retraining the model.
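Before writing any real code, it helps to see the whole loop in miniature. The sketch below is a toy, dependency-free version of the five steps: word overlap stands in for embedding similarity, a plain list plays the vector DB, and `answer` stands in for the LLM call. Every name here is made up for illustration only.

```python
# Toy RAG loop: ingest -> chunk -> index -> retrieve -> generate.

def chunk(text, size=50):
    """Step 2: split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def score(query, chunk_text):
    """Crude relevance score: shared-word count (stands in for
    embedding similarity)."""
    return len(set(query.lower().split()) & set(chunk_text.lower().split()))

def retrieve(query, index, k=2):
    """Step 4: return the k highest-scoring chunks."""
    return sorted(index, key=lambda c: score(query, c), reverse=True)[:k]

def answer(query, context):
    """Step 5: stand-in for the LLM call; just echoes its grounding."""
    return f"Based on: {' | '.join(context)}"

# Steps 1-3: ingest one "document" and index it (a list plays the vector DB).
doc = "Remote work is allowed two days per week. Travel must be approved."
index = chunk(doc, size=40)

context = retrieve("What is the remote work policy?", index)
print(answer("What is the remote work policy?", context))
```

The real pipeline below replaces each stand-in with a production component, but the data flow is identical.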

RAG Pipeline Architecture

Our pipeline will use LangChain as the orchestration layer — it gives us reusable components for document loading, text splitting, embeddings, vector stores, and prompt chaining. OpenAI provides the embeddings model (text-embedding-3-small) and the generation model (gpt-4o or gpt-3.5-turbo).

Here's the high-level flow:

User Question
       │
       ▼
[Embedder] → similarity search → Vector DB (Chroma)
       │
       ▼
[Retrieved Chunks] ──┐
                      │
       ┌──────────────┘
       ▼
[Prompt Template] ──→ [LLM (OpenAI)] ──→ Grounded Answer

Each component is modular and swappable — you can replace Chroma with Pinecone or Weaviate, or swap OpenAI for Anthropic or a local model.

Step-by-Step Implementation

Step 1: Install Dependencies

pip install langchain langchain-openai langchain-community langchain-chroma chromadb pypdf tiktoken

We'll use pypdf for PDF loading and tiktoken for token counting (the splitter below measures chunks in characters; switch to RecursiveCharacterTextSplitter.from_tiktoken_encoder if you want token-aware chunking). The langchain-chroma package provides the Chroma integration used in Step 3.

Step 2: Load and Split Documents

Use LangChain's PyPDFLoader to ingest PDFs, then split them into overlapping chunks with RecursiveCharacterTextSplitter.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("enterprise_handbook.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")
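To see what chunk_size and chunk_overlap actually mean, here's a bare-bones sliding-window splitter. RecursiveCharacterTextSplitter is smarter (it prefers the listed separators), but the overlap mechanics are the same; this toy version is for illustration only:

```python
# Each window starts (size - overlap) characters after the previous one,
# so consecutive chunks share `overlap` characters of context.

def sliding_chunks(text, size=1000, overlap=200):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = sliding_chunks("abcdefghij" * 100, size=300, overlap=60)
print(len(chunks))              # number of windows over the 1000-char text
print(chunks[0][-60:] == chunks[1][:60])  # adjacent chunks share 60 chars
```

The shared tail is what prevents a sentence that straddles a boundary from being lost to both chunks.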

Step 3: Create Embeddings and Vector Store

We'll embed each chunk using OpenAI's embedding model and store them in Chroma.

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
🔍 Note: text-embedding-3-small is cost-effective and produces 1536-dimensional vectors. For higher accuracy, use text-embedding-3-large (3072 dimensions).
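Under the hood, the vector store ranks chunks by cosine similarity between the query embedding and each chunk embedding. Here's a quick stdlib sketch with made-up 3-dimensional vectors (real embeddings are 1536-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny made-up "embeddings" (real ones have 1536 dimensions).
query = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # points the same way as the query -> high score
chunk_b = [0.0, 0.1, 0.9]   # points a different way -> low score

print(cosine(query, chunk_a) > cosine(query, chunk_b))  # True
```

Retrieval is just this comparison run against every indexed vector, with the top-k winners returned as context.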

Step 4: Build the Retrieval QA Chain

Now connect the retriever to the OpenAI LLM using LangChain's create_retrieval_chain pattern.

from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

# Prompt template that forces grounded answers
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the provided context to answer the user's question. "
               "If you cannot find the answer in the context, say so clearly. Cite relevant passages."),
    ("human", "Context: {context}\n\nQuestion: {input}"),
])

combine_docs_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(
    vectorstore.as_retriever(search_kwargs={"k": 4}),
    combine_docs_chain
)

# Test it
response = retrieval_chain.invoke({"input": "What is the company's remote work policy?"})
print(response["answer"])

Step 5: Add Memory (for conversations)

For multi-turn conversations, combine the retriever with a ConversationBufferMemory using ConversationalRetrievalChain.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

convo_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory
)

result = convo_chain.invoke({"question": "What about travel reimbursement?"})
print(result["answer"])

You now have a fully functional RAG pipeline with conversational memory. Each turn retrieves fresh context and maintains conversation history.
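Conceptually, ConversationBufferMemory just accumulates (question, answer) turns and renders them as a transcript that gets prepended to the next prompt. A minimal stand-in in plain Python (the class and method names here are invented for illustration):

```python
class BufferMemory:
    """Minimal stand-in for ConversationBufferMemory: store every turn
    and render the transcript that precedes the next question."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def history(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)

memory = BufferMemory()
memory.save("What is the remote work policy?", "Two days per week.")
memory.save("What about travel reimbursement?", "Submit receipts within 30 days.")
print(memory.history())
```

Because the buffer grows with every turn, long conversations eventually crowd out retrieved context; that's why LangChain also offers windowed and summarizing memory variants.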

Optimization & Best Practices

A basic RAG pipeline works, but production systems demand more. Here are key optimizations:

  • Chunking strategy: Experiment with chunk sizes between 500 and 1500 tokens. Overlap of 10–15% prevents context fragmentation.
  • Metadata filtering: Store document source, page number, and date in metadata. Use filter in the retriever for scoped queries.
  • Hybrid search: Combine vector similarity with keyword (BM25) retrieval for better recall. Pair the Chroma retriever with langchain_community.retrievers.BM25Retriever using an EnsembleRetriever.
  • Re-ranking: Use a cross-encoder (e.g., Cohere or cross-encoder/ms-marco-MiniLM-L-6-v2) to re-rank retrieved chunks and keep only the top 2–3 most relevant.
  • Prompt engineering: Explicitly instruct the LLM to say "I don't know" when context is insufficient. This drastically reduces hallucination.
  • Caching: Cache embeddings and LLM responses for repeated queries to reduce latency and cost.
📈 Pro tip: Set up monitoring with LangSmith to trace retrievals, track latency, and debug failures in production.
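As a sketch of the caching idea, `functools.lru_cache` alone is enough to deduplicate embedding calls for repeated strings. The `embed_text` function below is hypothetical, and its body is a placeholder rather than a real embedding model:

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how often the "API" is actually hit

@lru_cache(maxsize=10_000)
def embed_text(text: str) -> tuple:
    """Hypothetical wrapper around an embedding call; cached results
    are returned without re-invoking the body."""
    CALLS["count"] += 1
    # Placeholder "embedding": character-code features, not a real model.
    return tuple(ord(c) % 7 for c in text[:8])

embed_text("remote work policy")
embed_text("remote work policy")   # served from cache, no second call
print(CALLS["count"])  # 1
```

For a persistent, shared cache across processes, the same pattern applies with Redis or a database keyed on a hash of the input text.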

Conclusion & Next Steps

In this RAG pipeline tutorial, you built a complete retrieval-augmented generation system using LangChain and OpenAI. You learned how to ingest documents, create embeddings, store them in a vector database, and retrieve relevant context to generate grounded, trustworthy answers.

This architecture is the foundation for enterprise AI assistants, knowledge management systems, and customer-facing chatbots. The same pattern scales from a single PDF to millions of documents with the right vector store and infrastructure.

Next steps to deepen your RAG expertise:

  • Swap Chroma for a managed vector store like Pinecone or Weaviate and test at scale.
  • Add hybrid search and cross-encoder re-ranking from the optimization list above, then measure recall.
  • Instrument the pipeline with LangSmith to trace retrievals and latency in production.

RAG is evolving fast — stay tuned to AI Science Hub for the latest research, benchmarks, and practical guides. If you built something cool with this tutorial, let us know — we'd love to feature it.