Inside the Cookbook
- What is RAG?
- How to Build a PDF Q&A RAG
- Tools We’ll Use
- Let’s Build It Step-by-Step
- Step 1: Install Dependencies
- Step 2: Load the PDF Document
- Step 3: Chunk the Text
- Step 4: Convert Text Chunks into Embeddings
- Step 5: Store Chunks in a Vector Database (Chroma)
- Step 6: Ask Questions Using the Retrieved Context
- Try it out!
- Try Gemini or Hugging Face Instead of OpenAI
- Skip the complexity with QuickML RAG
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language generation to create responses that are both fluent and grounded in real data.
At its core, RAG works in two steps:
- Retrieve: It searches a collection of documents like PDFs, reports, or notes to find relevant content based on a user’s query.
- Generate: It feeds that retrieved content into a language model to generate an answer that is accurate, relevant, and context-aware.
This approach allows developers to build applications where the AI doesn’t just generate language; it reasons over real, user-provided information.
RAG is especially useful when:
- You need responses based on specific internal data
- You want to ensure answers are factual and source-grounded
- You’re building tools like document assistants, custom chatbots, or enterprise search interfaces
By combining the strengths of search systems and language models, RAG provides a powerful foundation for building intelligent, context-aware AI applications.
Let’s break it down:
| Step | What Happens | Example |
|---|---|---|
| Retrieve | Use a search system to find relevant pieces of your data (e.g. text chunks from a PDF) | Find the paragraph in a report that lists the financial risks |
| Augment | Add that info to the LLM prompt so the model knows what to talk about | “Here is the text from the report… Now answer this question:” |
| Generate | The LLM writes an answer using that context | “Top risks include: market volatility, supply chain instability…” |
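In code, that loop looks roughly like this (a conceptual sketch only; search_index and llm are placeholders for the vector store and language model we build step by step below):

def rag_answer(question, search_index, llm):
    # Retrieve: find the chunks most similar to the question
    relevant_chunks = search_index.similarity_search(question, k=4)
    # Augment: paste the retrieved text into the prompt
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
    prompt = f"Here is the text from the report:\n{context}\n\nNow answer this question: {question}"
    # Generate: the LLM writes an answer using only that context
    return llm.predict(prompt)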
How to Build a PDF Q&A RAG
- Load a PDF document (like a company report or manual)
- Break it into small text chunks
- Convert the chunks into embeddings
- Store those embeddings in a vector database
- Embed each question the same way and find the most relevant chunks
- Ask the LLM to generate a factual answer using only those chunks
Tools We’ll Use
| Tool | Purpose |
|---|---|
| LangChain | Framework for developing applications powered by large language models (LLMs); connects all parts of the pipeline (retrieval + generation) |
| Chroma | Lightweight vector store for storing and searching documents |
| OpenAI Embeddings | Converts text into embedding vectors for semantic search |
| PyPDFLoader | Extracts text from PDF pages |
| Text Splitter | Splits long documents into smaller parts |
Let’s Build It Step-by-Step
Step 1: Install Dependencies
Before writing any code, let’s install the required libraries.
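For example (the exact package set depends on your LangChain version; this is the minimal set the snippets below assume, plus langchain-google-genai only if you try the Gemini variant at the end):

pip install langchain chromadb openai pypdf tiktoken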
Step 2: Load the PDF Document
PDFs aren’t plain text: they contain structured data, formatting, and pages. So we need a tool to extract the actual readable text from each page.
LangChain provides a plugin called PyPDFLoader that turns PDFs into LangChain-friendly Document objects.
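A minimal loading snippet looks like this (the file name is just a placeholder; point it at your own PDF):

from langchain.document_loaders import PyPDFLoader

# Load the PDF; each page becomes a LangChain Document with its text and metadata
loader = PyPDFLoader("sample_report.pdf")  # replace with the path to your PDF
documents = loader.load()
print(f"Loaded {len(documents)} pages")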
- "documents"is now a list of pages, each as a text chunk we can process.
- Stratus is Catalyst’s scalable object storage service for storing files like PDFs, images, or documents. It integrates seamlessly with serverless functions, enabling automated workflows and secure, cloud-based file management.
- Check out : Stratus Catalyst Native Object Storage
Step 3: Chunk the Text
Why chunk the document?
LLMs can’t handle huge amounts of text all at once. If we try to pass a 100-page document to GPT-4, the request will either fail or the text will be truncated.
So we break the text into smaller, overlapping chunks just like scanning through parts of a book instead of reading the whole thing at once.
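Here is a typical splitter setup, assuming the documents loaded in Step 2 (the chunk size and overlap values are illustrative; tune them for your content):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split each page into overlapping chunks small enough for the LLM's context window
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk (illustrative value)
    chunk_overlap=200   # overlap so sentences aren't cut in half between chunks
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")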
Step 4: Convert Text Chunks into Embeddings
What are embeddings?
Think of embeddings as a way to represent text as numbers in a high-dimensional space.
Similar pieces of text will have similar embeddings. This lets us search semantically, not just by keyword.
from langchain.embeddings import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(openai_api_key="your-openai-key")
Now we can convert any text (a chunk or a question) into a numerical vector for comparison.
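For instance, embedding a sample question returns a plain vector of floats (a quick sanity check using the embedding_model above; the question itself is arbitrary):

# An embedding is just a list of numbers; similar texts get nearby vectors
query_vector = embedding_model.embed_query("What risks does the report mention?")
print(len(query_vector))  # vector dimensionality, e.g. 1536 for OpenAI's ada-002 embeddings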
Step 5: Store Chunks in a Vector Database (Chroma)
What is a vector store?
A vector store lets you store embeddings and search by similarity. When the user asks a question, we convert that into an embedding and search the database for chunks that are “close” to it in vector space.
We’ll use Chroma, a lightweight and fast vector database.
from langchain.vectorstores import Chroma

db = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="vector_db"
)
db.persist()  # saves the data to disk
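If you restart your script later, you can reopen the persisted store instead of re-embedding everything (a small sketch assuming the same embedding_model as before):

# Reload the persisted vector store from disk without recomputing embeddings
db = Chroma(persist_directory="vector_db", embedding_function=embedding_model)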
Step 6: Ask Questions Using the Retrieved Context
The heart of RAG
Now, we can build a function that:
- Embeds the user’s question
- Finds the most relevant chunks from the database
- Sends those chunks (as context) to the LLM
- Returns the answer
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

def answer_question(question: str):
    # Step 1: Search the most relevant document chunks
    results = db.similarity_search(question, k=4)
    context = "\n\n".join([doc.page_content for doc in results])

    # Step 2: Create a clear prompt with the context and question
    prompt_template = """
    Use ONLY the information in the context to answer the question.
    Do not make anything up or include external knowledge.

    Context:
    {context}

    Question:
    {question}

    Answer:
    """
    prompt = PromptTemplate.from_template(prompt_template).format(
        context=context,
        question=question
    )

    # Step 3: Send prompt to LLM
    model = ChatOpenAI(openai_api_key="your-openai-key")
    return model.predict(prompt)
Try it out!
response = answer_question("What is this PDF about?")
print(response)
You’ll get a concise, factual answer pulled directly from the PDF content.
Response
This PDF is about **technological advancements, specifically focusing on chatbots**.
It discusses:
- The impact of technology (like smartphones) on daily life.
- The utility and functions of chatbots (e.g., consumer satisfaction, healthcare duties, data collection).
- The developer's perspective on building chatbots.
- The definition, capabilities, training, and embedding of chatbots within applications, mentioning "Catalyst" in this context.

Try Gemini or Hugging Face Instead of OpenAI
You’re not locked into OpenAI. Here’s how to swap out providers.
Gemini (Google)
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI

embedding_model = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key="your-gemini-key"
)
llm = ChatGoogleGenerativeAI(model="gemini-pro", google_api_key="your-gemini-key")
Hugging Face
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.llms import HuggingFaceHub

embedding_model = HuggingFaceInferenceAPIEmbeddings(
    api_key="your-hf-token",
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-Instruct-v0.1",
    huggingfacehub_api_token="your-hf-token",  # or set the HUGGINGFACEHUB_API_TOKEN env var
    model_kwargs={"temperature": 0.5}
)
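One caveat when swapping providers: embedding vectors from different models are not interchangeable, so after changing the embedding_model you need to rebuild the Chroma store from your chunks before querying, and pass the new llm in place of ChatOpenAI inside answer_question.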
Skip the complexity with QuickML RAG
After working through all these steps, you might wonder: is there an easier way? Absolutely. With Catalyst’s QuickML RAG feature, you don’t need to piece together multiple frameworks, manage infrastructure, or write lengthy pipelines.
In just one click, upload your documents and instantly enable context-aware RAG for your applications. Build and deploy a production-ready RAG pipeline in minutes, saving time, reducing complexity, and ensuring scalability effortlessly.
Excited to try this? Contact our support team: support@zohocatalyst.com