RAG Explained: How AI Answers From Your Documents
A plain-English explanation of Retrieval-Augmented Generation. How RAG works, why it usually beats fine-tuning, and what makes a good RAG pipeline.
You've probably used ChatGPT. You type a question, it gives you an answer from its training data. But what if you want AI to answer questions about your documents? Your company's policies, your product documentation, your internal wiki?
You have two options: fine-tune a model on your data, or use RAG. Here's why RAG wins for most use cases, and how it actually works.
What is RAG?
RAG stands for Retrieval-Augmented Generation. Instead of training the AI on your documents (expensive, slow, hard to update), you:
- Store your documents in a searchable format
- Retrieve the most relevant passages when someone asks a question
- Generate an answer using the LLM, but only from those retrieved passages
The AI doesn't memorize your documents. It reads the relevant parts on demand, like a researcher looking up sources before writing an answer.
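The three steps above can be sketched in a few lines. This is a toy, not a real implementation: retrieval here is simple word overlap standing in for vector search, and `generate` is a stub standing in for an LLM call.

```python
# Toy RAG loop: store -> retrieve -> generate.
# Word-overlap retrieval and a stub "LLM" are illustrative stand-ins.

DOCUMENTS = [
    "Employees accrue 20 days of paid time off (PTO) per year.",
    "Returns are accepted within 30 days with a receipt.",
    "The office is closed on public holidays.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank passages by word overlap with the question (toy retrieval)."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def generate(question: str, context: list[str]) -> str:
    """Stand-in for an LLM call: answers only from the retrieved context."""
    return f"Based on the provided context: {context[0]}"

question = "How many days of PTO do employees get?"
passages = retrieve(question, DOCUMENTS)
print(generate(question, passages))
```

A real system swaps the overlap score for embedding similarity and the stub for an API call, but the shape of the loop stays the same.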
Why not just fine-tune?
Fine-tuning means training the AI model on your specific data. It sounds like the right approach, but:
| Factor | Fine-tuning | RAG |
|---|---|---|
| Cost | $100s-$1000s per training run | Near zero (just storage + API calls) |
| Update speed | Hours to retrain | Seconds (re-index the document) |
| Source citations | Unreliable (model can't trace what it learned) | Built-in (you know which passage was used) |
| Hallucination control | Hard (model might blend training data) | Easier (answer is grounded in specific passages) |
| Data freshness | Stale until retrained | Always current |
| Scale | Each model is a separate training job | Add documents without retraining |
RAG gives you grounded answers with citations. Fine-tuning gives you a model that "knows" your data but can't tell you where it learned something.
How RAG works, step by step
Step 1: Document ingestion
Your documents (PDFs, Word files, web pages) are split into small passages called "chunks." Each chunk is typically 300-500 characters, roughly a paragraph.
Why split? Because when someone asks "What's the return policy?", you don't want to send the entire 50-page handbook to the AI. You want to send the 2-3 paragraphs that actually discuss returns.
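A minimal fixed-size chunker with overlap looks like this. The sizes match the range mentioned above; real splitters usually also respect sentence and paragraph boundaries.

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into ~size-character chunks with a small overlap,
    so a sentence cut at a chunk boundary still appears whole in
    at least one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping some overlap
    return chunks

doc = "a" * 1000
print(len(chunk_text(doc)))  # 1000 characters -> 3 chunks of up to 400
```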
Step 2: Embedding
Each chunk is converted into a vector, a list of numbers that captures the meaning of the text. The word "vacation" and the phrase "paid time off" produce similar vectors because they mean similar things, even though the words are different.
These vectors are stored in a database with a vector index (like pgvector for PostgreSQL) that enables fast similarity search.
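Similarity between two vectors is usually measured with cosine similarity (pgvector exposes this as the `<=>` distance operator). Here is the idea with hand-made 3-dimensional "embeddings"; real models emit hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, hand-made for illustration (not real model output).
vacation      = [0.9, 0.1, 0.0]
paid_time_off = [0.8, 0.2, 0.1]
return_policy = [0.0, 0.1, 0.9]

print(cosine_similarity(vacation, paid_time_off))  # high: similar meaning
print(cosine_similarity(vacation, return_policy))  # low: different meaning
```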
Step 3: Question arrives
A visitor types: "What is the PTO policy?"
The question is also converted to a vector using the same embedding model.
Step 4: Retrieval
The system searches for chunks whose vectors are most similar to the question's vector. This is "semantic search" because it searches by meaning, not keywords.
A good RAG system also runs keyword search in parallel (for exact terms like product names or policy numbers) and fuses the results. This hybrid approach catches things that either search alone would miss.
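The standard way to fuse the two result lists is Reciprocal Rank Fusion: each list contributes a score of 1/(k + rank) per document, so items near the top of either list rise to the top of the merged list.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each ranked list contributes
    1/(k + rank) per document. k=60 is the commonly used default."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["c2", "c7", "c1"]   # semantic search results (chunk ids)
keyword_hits = ["c7", "c9", "c2"]   # keyword search results
print(rrf_fuse([vector_hits, keyword_hits]))
# c7 ranks first: it appears near the top of both lists
```

Notice that a chunk found by only one search still makes the fused list; it just scores lower than chunks found by both.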
Step 5: Reranking
The initial search returns 10-20 candidates. A specialized neural model (cross-encoder) re-scores each one by reading the question and passage together. This is more accurate than the initial search because it considers the full interaction between question and passage.
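The reranking step itself is simple once you have a scoring model. In this sketch the cross-encoder is replaced by a toy word-overlap scorer; in practice you would call a real model (e.g. a BERT-style cross-encoder) at that line.

```python
def cross_encoder_score(question: str, passage: str) -> float:
    """Placeholder for a real cross-encoder that reads question and
    passage jointly. Here: toy word-overlap score in [0, 1]."""
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(question: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Re-score every candidate against the question, keep the best."""
    scored = sorted(candidates,
                    key=lambda p: cross_encoder_score(question, p),
                    reverse=True)
    return scored[:top_n]

candidates = [
    "shipping rates for international orders",
    "the return policy allows returns within 30 days",
    "holiday schedule for the warehouse",
]
print(rerank("what is the return policy", candidates, top_n=1))
```

Because the scorer reads each (question, passage) pair individually, reranking is slower than the initial search, which is why it runs only on the 10-20 candidates rather than the whole corpus.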
Step 6: Generation
The top passages, along with the system prompt and conversation history, are sent to the LLM (like Gemini or Claude). The LLM generates an answer grounded in those passages.
The key: the LLM is instructed to answer only from the provided context. If the passages don't contain the answer, it says so rather than making something up.
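A grounded prompt might be assembled like this. The wording is an illustrative assumption, not Ask360's actual system prompt; numbering the passages also makes it easy for the model to cite its sources.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a context-only prompt (illustrative wording)."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say "
        '"I don\'t know based on the provided documents."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the PTO policy?",
    ["Employees accrue 20 days of PTO per year.",
     "PTO requests need manager approval."],
)
print(prompt)
```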
Step 7: Confidence scoring
The system evaluates how well the retrieved passages support the answer using multiple signals: cross-encoder relevance scores and vector similarity. This produces a confidence badge (Verified, Grounded, Mixed, or AI Generated) so the visitor knows how trustworthy the answer is.
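Combining the signals into a badge can be as simple as a weighted blend with thresholds. The weights and cutoffs below are illustrative assumptions, not Ask360's real values.

```python
def confidence_badge(rerank_score: float, vector_sim: float) -> str:
    """Blend two retrieval signals (each in [0, 1]) into a badge.
    Weights and thresholds are illustrative, not production values."""
    combined = 0.6 * rerank_score + 0.4 * vector_sim
    if combined >= 0.85:
        return "Verified"
    if combined >= 0.65:
        return "Grounded"
    if combined >= 0.4:
        return "Mixed"
    return "AI Generated"

print(confidence_badge(0.9, 0.8))  # strong support from both signals
print(confidence_badge(0.2, 0.3))  # weak support: flagged as AI Generated
```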
What makes a good RAG pipeline?
Not all RAG implementations are equal. The difference between "demo quality" and "production quality" is significant:
Hybrid search over single-signal
Using only vector search misses exact keyword matches. Using only keyword search misses semantic matches. Combining both with Reciprocal Rank Fusion (RRF) catches what either alone would miss.
Cross-encoder reranking
The initial search is fast but approximate. A cross-encoder that reads the question and passage together produces dramatically better precision, especially for nuanced questions.
Neighbor expansion
Documents are chunked for search accuracy, but the AI needs full context. After finding the best chunks, grabbing the surrounding chunks ensures the AI sees complete paragraphs and sections, not isolated fragments.
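If chunks are stored with their position in the document, expansion is just index arithmetic: for every matched chunk, pull in its neighbors, clamped to the document's bounds and deduplicated.

```python
def expand_neighbors(hit_indices: list[int], total_chunks: int,
                     window: int = 1) -> list[int]:
    """For each matched chunk index, include +/- window neighbors,
    clamped to [0, total_chunks), deduplicated and in reading order."""
    expanded = set()
    for i in hit_indices:
        for j in range(max(0, i - window), min(total_chunks, i + window + 1)):
            expanded.add(j)
    return sorted(expanded)

print(expand_neighbors([3, 7], total_chunks=10))  # [2, 3, 4, 6, 7, 8]
```

Returning the indices in order matters: the expanded chunks are concatenated back into contiguous passages before being sent to the LLM.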
Multi-signal confidence
A single similarity score can be misleading. Combining cross-encoder relevance (good for factoid questions) with vector similarity (good for topical questions) avoids blind spots.
Guardrails
Production RAG needs input protection (rate limiting, content filtering, prompt injection detection) and output protection (PII redaction, hallucination flagging, response length limits).
Common misconceptions
"RAG is just vector search + ChatGPT" That's the demo version. Production RAG adds hybrid search, reranking, neighbor expansion, confidence scoring, caching, guardrails, and failover. The vector search part is maybe 10% of the work.
"Bigger context windows replace RAG" Models now accept 100K+ tokens, so why not paste the entire document? Cost (you pay per token), latency (more tokens = slower), and precision (the model performs worse with more irrelevant context). RAG retrieves only what's relevant.
"RAG hallucinates less than fine-tuning" Partially true. RAG grounds answers in specific passages, which helps. But the LLM can still misinterpret or overextend from the context. Confidence scoring and hallucination flagging address this.
"You need a GPU for RAG" Not for search or generation (those use APIs). Local embedding models benefit from a GPU, but API-based embeddings (Gemini, OpenAI) work on any server with minimal CPU usage.
Try it yourself
Ask360 is a RAG platform that handles the entire pipeline: document ingestion, chunking, embedding, hybrid search, reranking, confidence scoring, and guardrails. Upload your documents and see RAG in action with the free tier.