Data Dialogues, Episode 3 – RAG in production: The parts you'll learn the hard way
- Last Updated: January 29, 2026
- 8 Min Read

Data Dialogues brings unfiltered AI insights. Join hosts Poonam Singh and Ramki R as they break down machine learning—from data prep to deployment—with real-world stories and expert perspectives.
🔹 Poonam Singh leads marketing for Catalyst, making complex tech feel simple (and actually useful). With 18+ years in B2B marketing, she's all about cutting through the hype and helping businesses move beyond buzzwords to real-world impact.
🔹 Ramki R is a machine learning expert at Catalyst with nearly a decade in AI, data science, and automation. From coding to cracking AI challenges, he's played a pivotal role in making ML accessible for businesses. He's also the Head of AI at Zoho CRM and Catalyst.
Why RAG exists (and why it's misunderstood)
Poonam: Alright, Ramki, we're back for episode three. Today we're tackling something that's gotten way too much hype lately: RAG. Retrieval-augmented generation. Everyone's talking about it like it's the ultimate cure for LLM hallucinations. So, as someone who's actually built these systems in production, tell me: is RAG the real deal, or is it just another trendy buzzword?
Ramki: (laughs) Oh, RAG is absolutely real. But here's the thing: it's not magic. I've seen too many teams slap a vector database on top of an LLM, call it RAG, and then wonder why their chatbot is still making stuff up or returning completely irrelevant answers.
Poonam: Okay, so for the uninitiated: what actually is RAG?
Ramki: Right. So, here's the core problem: LLMs are trained on massive datasets, but that training occurred months or years ago. They don't know about your internal documents, your latest product specs, or anything that happened after their training cutoff. And when you ask them something they don't know? They hallucinate—confidently! That's the scary part. It’s terrifying for a business.
RAG is basically saying: "Before you answer, let me get you the actual documents that might contain the answer." You take the user's question, search your knowledge base, retrieve the relevant chunks, and feed those to the LLM as context. Then the LLM generates an answer grounded in that retrieved content.
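Here's a minimal sketch of that loop in Python. The in-memory chunk list, the OpenAI model names, and the prompt wording are placeholders for illustration only; the same idea carries over to whatever embedding model and vector database you actually use.

```python
# Minimal RAG loop: embed the question, retrieve the closest chunks,
# and ask the LLM to answer only from those chunks.
import numpy as np
from openai import OpenAI  # any embedding/chat provider works the same way

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# In production this lives in a vector database; a list is enough to show the idea.
chunks = [
    "Refunds are issued within 14 days of an approved return.",
    "Password resets are handled from the account settings page.",
]
chunk_vectors = embed(chunks)

def answer(question: str, top_k: int = 2) -> str:
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every chunk.
    sims = chunk_vectors @ q_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:top_k])
    prompt = (
        "Answer ONLY from the documents below. If the answer is not there, "
        "say you don't have that information.\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

print(answer("What's your refund policy?"))
```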
Poonam: So it's like giving the LLM an open-book exam instead of making it rely on memory?
Ramki: Exactly! And that's powerful. But here's where teams go wrong: They think RAG is just embedding documents, throwing them in a vector DB, and doing a similarity search. That's maybe 30% of the solution. The other 70% is all the stuff nobody talks about in the blog posts.
The parts you'll learn the hard way
Poonam: Alright, I'm listening. What are the parts nobody talks about?
The indexing problem
Ramki: Let's start with indexing. It's not just throwing your documents into a vector database. Real indexing is a multi-stage pipeline, and most teams mess up at least three of the stages.
Poonam: Wait, stages? Can you walk us through those?
Ramki: Sure. First, you've got document chunking. Say you have a 150-page technical manual. You need to split it into chunks that are small enough to be relevant but large enough to preserve context. You can't just chunk blindly. You need overlapping chunks with contextual links. Each chunk needs to know what came before it and what comes after. Otherwise, when someone asks a question that spans multiple sections, your system can't piece together the answer.
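A minimal sketch of that kind of chunking, splitting on characters for brevity and keeping a link to each chunk's neighbours; a production pipeline would usually split on sentence or section boundaries instead.

```python
# Split text into overlapping chunks and record neighbour links so each
# chunk "knows" what came before and after it.
def chunk_with_overlap(text: str, chunk_size: int = 800, overlap: int = 200):
    step = chunk_size - overlap
    pieces = [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
    chunks = []
    for idx, piece in enumerate(pieces):
        chunks.append({
            "id": idx,
            "text": piece,
            "prev_id": idx - 1 if idx > 0 else None,                # contextual link backwards
            "next_id": idx + 1 if idx < len(pieces) - 1 else None,  # and forwards
        })
    return chunks
```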
Poonam: So, chunks need to be aware of each other?
Ramki: Exactly. And that's just step one. Step two is question generation. For each chunk, you generate targeted questions that the chunk could answer. We do that because users don't query in the same language as your documents. A user asks, "How do I reset my password?" but your docs say "Account credential management." Embedding those targeted, context-aware questions alongside the content makes retrieval far more accurate, especially for precise queries.
Poonam: That's clever, but doesn't that mean you're indexing a lot more data?
Ramki: You are. And step three, which is summarization, adds even more. Each chunk gets a document-level summary attached to it. This way, the embedding captures both local details and global context. A chunk about "thermal tolerance specifications" is much more useful when the embedding also knows it's from a "manufacturing quality control manual."
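Steps two and three might look something like the sketch below, which enriches the chunk dictionaries from the earlier chunking sketch with generated questions and a document-level summary before anything is embedded. The prompt wording and helper names are illustrative, not the exact pipeline described here.

```python
# Enrich each chunk with generated questions and a document-level summary
# before embedding, so retrieval matches how users actually ask.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # any instruction-following model works here

def generate_questions(chunk_text: str, n: int = 3) -> list[str]:
    prompt = (
        f"Write {n} questions a user might ask that this passage answers, "
        f"one per line:\n\n{chunk_text}"
    )
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def enrich(chunk: dict, doc_summary: str) -> dict:
    chunk["questions"] = generate_questions(chunk["text"])
    chunk["summary"] = doc_summary  # global context travels with the local chunk
    # The text that actually gets embedded combines all three signals.
    chunk["embed_text"] = "\n".join([doc_summary, chunk["text"], *chunk["questions"]])
    return chunk
```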
Poonam: Okay, so chunking, question generation, and summarization. That's already way more complex than just embedding your docs.
Ramki: And we're not done. Step four is embedding generation, where all content, including questions and summaries, is converted into embeddings using the Snowflake English monolingual model.
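A sketch of step four, assuming one of the Snowflake Arctic Embed English checkpoints loaded through sentence-transformers; the episode doesn't say which variant or serving setup is used in production.

```python
# Generate embeddings for every enriched chunk, including its questions and summary.
from sentence_transformers import SentenceTransformer

# Assumption: one of the Snowflake Arctic Embed English checkpoints.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

def embed_chunks(chunks: list[dict]) -> list[dict]:
    vectors = model.encode([c["embed_text"] for c in chunks], normalize_embeddings=True)
    for chunk, vector in zip(chunks, vectors):
        chunk["vector"] = vector.tolist()
    return chunks
```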
Poonam: And then finally into the vector database?
Ramki: Step five, yes. The embeddings are batch-inserted into Weaviate, and even ingestion has catches: you can't just fire everything at the database in one go. You need a controlled retry mechanism, and base chunks have to be stored before their associated question chunks.
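The retry-and-ordering idea, sketched without committing to a specific Weaviate client API: the insert_batch callable below is a stand-in for whatever bulk-write call your vector database exposes.

```python
# Controlled ingestion: write base chunks first, then their question chunks,
# retrying failed batches with exponential backoff.
import time

def insert_with_retry(insert_batch, objects, batch_size=100, max_retries=3):
    """insert_batch is whatever your vector DB client exposes for bulk writes
    (e.g. a Weaviate batch call); it is a stand-in here, not a real API name."""
    for start in range(0, len(objects), batch_size):
        batch = objects[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                insert_batch(batch)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s...

def ingest(insert_batch, base_chunks, question_chunks):
    # Base chunks must exist before the question chunks that point at them.
    insert_with_retry(insert_batch, base_chunks)
    insert_with_retry(insert_batch, question_chunks)
```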
Poonam: So indexing is actually five different problems?
Ramki: At minimum. And here's what I've noticed: most RAG tutorials out there stop at chunking your documents and embedding them. They skip question generation, summarization, contextual linking, and proper ingestion, which are the pieces that actually make the system work. Then teams wonder why their retrieval accuracy is terrible.
The retrieval problem
Ramki: Then you've got the retrieval problem itself. Semantic similarity search is great, but it's not perfect. Like, the user asks "What's your refund policy?" and it returns documents about our return shipping process because the embeddings are close. Semantically similar, contextually useless.
Poonam: How do you fix that?
Ramki: Hybrid search. Combine dense vector search with traditional keyword search. Use re-ranking models to sort the results. Implement metadata filters such as date ranges, document types, and departments. We also do query expansion sometimes, where you rephrase the user's question multiple ways before searching. And honestly? Sometimes you need to fine-tune your embedding model on your domain-specific data.
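One simple way to combine the keyword and vector result lists is reciprocal rank fusion, sketched below; the episode doesn't name a specific fusion method, and re-ranking models or metadata filters would sit on top of this.

```python
# Hybrid search sketch: fuse a keyword (BM25-style) ranking and a vector ranking
# with reciprocal rank fusion (RRF), one common fusion choice.
def reciprocal_rank_fusion(keyword_hits: list[str], vector_hits: list[str], k: int = 60):
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: documents ranked differently by each retriever.
keyword_hits = ["doc_refunds", "doc_shipping", "doc_warranty"]
vector_hits = ["doc_shipping", "doc_refunds", "doc_faq"]
print(reciprocal_rank_fusion(keyword_hits, vector_hits))
# doc_refunds and doc_shipping rise to the top because both retrievers agree on them.
```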
Poonam: Wait, so you're saying the out-of-the-box embeddings from OpenAI or wherever aren't good enough?
Ramki: Not always. Look, general-purpose embeddings are trained on internet text. If your domain is highly specialized like legal, medical, or manufacturing, you'll get better results with domain-adapted embeddings. We've seen retrieval accuracy jump significantly just from fine-tuning embeddings on industry-specific documents.
The prompt engineering problem
Poonam: Okay, but even if retrieval is perfect, doesn't the LLM still need to, you know, actually use the context correctly?
Ramki: (laughs) Yeah, this is where prompt engineering becomes critical. You can't just dump 10 retrieved documents into the context and say, "Answer this." The LLM needs instructions: "Only use the following documents. If the answer isn't in the documents, say 'I don't have that information.' Cite which document you're using."
For example, we built a customer support RAG system, and the retrieval was working perfectly, pulling exactly the right documents every time. But the bot kept giving confident answers that contradicted our documentation. Turns out, the LLM was ignoring the retrieved context and just using its pre-trained knowledge instead. We had to completely restructure the prompt to explicitly say "ONLY answer from the provided documents. Your training data may be outdated; ignore it." That simple change dropped our hallucination rate by 80%.
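A grounding prompt along those lines might look like the template below; the exact wording is paraphrased from the conversation rather than taken from any production prompt.

```python
# A grounding prompt of the kind described above: the model is told to answer
# only from retrieved documents and to cite them. Wording is illustrative.
GROUNDED_PROMPT = """You are a support assistant.

ONLY answer from the provided documents below. Your training data may be
outdated; ignore it. If the answer is not in the documents, reply exactly:
"I don't have that information."
Cite the document ID you used, e.g. [doc-3].

Documents:
{documents}

Question: {question}
"""

def build_prompt(question: str, retrieved: list[tuple[str, str]]) -> str:
    docs = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved)
    return GROUNDED_PROMPT.format(documents=docs, question=question)
```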
The data pipeline problem
Poonam: Ouch. So it's not just the AI; it's the entire data pipeline.
Ramki: Exactly! RAG isn't an AI problem; it's a data engineering problem. You need document ingestion pipelines, version control, access control, and refresh schedules. If your knowledge base has stale data, RAG will confidently serve stale answers.
The cost and latency problem
Poonam: What about the practical stuff: cost, latency, and scale?
Ramki: Glad you asked. Every RAG call is multiple API calls: embedding the query, searching the vector DB, embedding the retrieved docs if needed, and running the LLM inference. That adds up fast. We've seen systems where RAG was costing 10 times more per query than a simple LLM call because the retrieval step was pulling 50 documents every time.
Latency is the other killer. Users expect instant responses. If your retrieval takes two seconds and your LLM takes three, that's way too slow for a chatbot. You need to optimize every step: fast vector DB, efficient embedding, and streaming responses from the LLM.
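A back-of-the-envelope way to see why pulling 50 documents per query hurts: every retrieved chunk becomes prompt tokens the LLM has to read. All prices and token counts below are placeholders, not real rates.

```python
# Rough per-query cost: every stage adds tokens and API calls.
# All prices and token counts are placeholders; plug in your own.
def rag_query_cost(
    retrieved_docs: int = 50,
    tokens_per_doc: int = 400,
    question_tokens: int = 50,
    answer_tokens: int = 300,
    embed_price_per_1k: float = 0.0001,   # placeholder
    llm_in_price_per_1k: float = 0.005,   # placeholder
    llm_out_price_per_1k: float = 0.015,  # placeholder
) -> float:
    embed_cost = question_tokens / 1000 * embed_price_per_1k
    prompt_tokens = question_tokens + retrieved_docs * tokens_per_doc
    llm_cost = (prompt_tokens / 1000 * llm_in_price_per_1k
                + answer_tokens / 1000 * llm_out_price_per_1k)
    return embed_cost + llm_cost

# Pulling 50 documents vs. 5 per query: the prompt cost dominates.
print(rag_query_cost(retrieved_docs=50), rag_query_cost(retrieved_docs=5))
```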
What good RAG actually looks like
Poonam: So what's your advice for teams implementing RAG for the first time?
Ramki: First, start small. Don't try to index your entire company's knowledge base on day one. Pick one well-defined domain, like your support docs or product FAQs. Get that working well before expanding.
Second, measure everything. Track retrieval accuracy separately from answer quality. If your system is giving bad answers, is it because retrieval failed or because the LLM misunderstood the context? You can't fix what you can't measure.
Third, implement transparency. Show users which documents were used to generate the answer. Let them verify the sources. This builds trust and makes debugging way easier. When someone reports a bad answer, you can trace back exactly which documents the system retrieved.
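Ramki's second point, measuring retrieval separately from answer quality, can start as something as simple as a hit-rate check over a hand-labelled evaluation set; the retrieve function below is a stand-in for your own search code.

```python
# Measuring retrieval separately from answer quality: a simple hit-rate check
# over a hand-labelled set of (question, relevant_doc_id) pairs.
def retrieval_hit_rate(eval_set, retrieve, k: int = 5) -> float:
    """eval_set: list of (question, expected_doc_id) pairs; retrieve: your
    search function returning ranked doc ids. Both are stand-ins."""
    hits = 0
    for question, expected_doc_id in eval_set:
        if expected_doc_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_set)

# If hit rate is high but answers are still bad, look at the prompt and the LLM;
# if it's low, fix indexing and search first.
```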
Poonam: That actually brings up something I saw in the QuickML announcement. They're doing exactly that with their RAG feature: transparent breakdowns showing which documents contributed to each answer.
Ramki: Yeah, and that's the right approach. Look, RAG is powerful when done right, but "done right" means treating it as a full-stack problem: data engineering, retrieval optimization, prompt engineering, evaluation, and monitoring. It's not a weekend project.
The teams that succeed with RAG are the ones who understand it's not about the algorithm; it's about the entire system. You need clean data, good chunking, hybrid searching, metadata filtering, re-ranking, careful prompting, proper evaluation, version control, access control, monitoring, and continuous improvement.
Poonam: So, bottom line: RAG isn't overhyped, but it's underhyped in terms of complexity?
Ramki: (laughs) That's a good way to put it. Grounding LLMs in real documents solves real problems, but the implementation is messy. If someone tells you they built a production RAG system in two weeks, either they're lying, or it's going to break spectacularly in production.
Poonam: And I'm guessing you've seen both scenarios?
Ramki: Oh, multiple times. But when you get it right... When retrieval is tight, the LLM is well-prompted, and the data pipeline is solid, RAG is genuinely transformative. You can build systems that answer questions with accuracy and transparency that pure LLMs can't match.
When not to use RAG
Poonam: Before we wrap up, you mentioned earlier that RAG isn't always the answer. When should teams not use RAG?
Ramki: Great question. Don't use RAG when:
- Your use case doesn't need external knowledge. If you just need text summarization or creative writing, RAG adds complexity without value.
- Your document corpus is tiny, say under 50 documents. Just include them in the system prompt. The overhead of RAG isn't worth it.
- You need real-time data that changes by the second. Stock prices, live sports scores... RAG's retrieval won't be fast enough. You need direct API integration.
- Your documents are highly unstructured or of low quality. RAG amplifies your data quality problems. If your docs are poorly written or inconsistent, fix that first.
Poonam: So RAG is powerful but not universal.
Ramki: Right. And honestly, we built LLM serving in QuickML alongside RAG specifically for this reason. Sometimes you just need a tuned LLM without retrieval. The key is picking the right tool for the problem, not forcing RAG into every solution because it's trendy.
Poonam: Alright, any parting wisdom for the developers trying to implement this?
Ramki: Yeah! Be skeptical of demo-ware. Every vendor will show you a slick demo where RAG works perfectly. Ask them about document versioning, retrieval failures, conflicting information in documents, access control, cost at scale, and latency under load. The real work isn't in the happy path; it's in handling all the edge cases.
If you're just starting out, use a platform that's thought through these problems already.
Poonam: Well, that's probably the most honest conversation about RAG I've heard in a while. Thanks for keeping it real, Ramki.
Ramki: Anytime. Just doing my part to save people from avoidable 3 AM debugging sessions.
Poonam: (laughs) A noble cause. Alright, everyone, if you want to explore RAG, check out QuickML's RAG feature.
Want to build RAG systems that actually work?
Try QuickML and connect your LLMs to real knowledge.