Retrieval Augmented Generation (RAG) is the most practical architecture for building enterprise AI applications today. It lets you combine the power of LLMs with your organisation's proprietary data — without fine-tuning a model.
If you're building an AI-powered internal search, document Q&A system, or customer support bot, RAG is almost certainly the right starting point.
What is RAG?
RAG is a two-step process:
- Retrieve — Find the most relevant documents/chunks from your knowledge base
- Generate — Feed those documents to an LLM along with the user's question, and let the LLM generate an answer grounded in your data
User Question → Retrieval (Vector DB) → Relevant Chunks → LLM → Grounded Answer

This solves the two biggest problems with using LLMs in enterprise:
- Hallucination — The LLM is grounded in real documents, not just its training data
- Data freshness — Your knowledge base can be updated without retraining the model
RAG Architecture Components
1. Document Ingestion Pipeline
Your documents need to be processed before they can be retrieved:
Raw Documents → Chunking → Embedding → Vector Database

Chunking strategies:
- Fixed-size chunks (e.g., 500 tokens with 50 token overlap) — Simple, works well for homogeneous documents
- Semantic chunking — Split on paragraph/section boundaries for better context preservation
- Recursive splitting — Try larger chunks first, split further only if needed
[TIP] Chunk size matters more than most teams realise. Too small and you lose context. Too large and you dilute relevance. Start with 500-800 tokens and experiment.
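The fixed-size strategy above can be sketched in a few lines. This version splits on whitespace tokens as a rough proxy for model tokens (a real pipeline would use the embedding model's own tokenizer); the function name and defaults are illustrative:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into whitespace-token chunks, with overlap between neighbours."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        # Stop once a chunk has reached the end of the document
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence falling on a boundary still appears whole in at least one chunk.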
2. Embedding Model
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors.
| Model | Dimensions | Best For |
| --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | General purpose, easy to start |
| Cohere embed-v3 | 1024 | Multilingual (good for Indian languages) |
| BGE / E5 (open source) | 768-1024 | Self-hosted, no API costs |
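"Similar texts produce similar vectors" is typically measured with cosine similarity, which is what most vector databases compute under the hood. A dependency-free sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-D "embeddings": in practice these are 768-1536 dimensions
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```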
3. Vector Database
Stores embeddings and enables fast similarity search.
| Database | Type | Good For |
| --- | --- | --- |
| Pinecone | Managed cloud | Quick start, zero ops |
| Weaviate | Self-hosted/cloud | Hybrid search (vector + keyword) |
| pgvector | PostgreSQL extension | Already using PostgreSQL |
| ChromaDB | Embedded | Prototyping, small datasets |
4. Retrieval
When a user asks a question:
- Embed the question using the same embedding model
- Search the vector database for the K most similar chunks
- Return those chunks as context
Advanced retrieval techniques:
- Hybrid search — Combine vector similarity with keyword matching (BM25)
- Re-ranking — Use a cross-encoder to re-rank the top results for better relevance
- Query expansion — Rewrite the user's query to improve retrieval
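One common way to combine the ranked lists from vector and BM25 search is reciprocal rank fusion (RRF): documents that appear near the top of several lists float upward. A minimal sketch, using the conventional k=60 smoothing constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one; higher fused score wins."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # 1-indexed rank; k dampens the advantage of the very top positions
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]   # from vector similarity search
keyword_hits = ["b", "d", "a"]  # from BM25 keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['b', 'a', 'd', 'c']
```

Because "b" ranks well in both lists, it beats "a", which tops only the vector list.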
5. Generation
Pass the retrieved context + user question to the LLM:
System: You are a helpful assistant. Answer based on the provided context.
If the answer is not in the context, say "I don't have information about that."
Context:
{retrieved_chunks}
User: {question}

Implementation Example
Here's a simplified RAG pipeline in Python:
```python
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("knowledge_base")

def ingest(documents):
    for i, doc in enumerate(documents):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["text"]
        )
        collection.add(
            ids=[f"doc_{i}"],
            embeddings=[response.data[0].embedding],
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}]
        )

def query(question, k=5):
    # Embed the question with the same model used at ingestion
    q_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Retrieve the k most similar chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=k
    )
    context = "\n\n".join(results["documents"][0])

    # Generate an answer grounded in the retrieved context
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
```

Common RAG Pitfalls
1. Poor Chunking
If your chunks split mid-sentence or mid-paragraph, the LLM gets fragmented context. Always use overlap and respect document structure.
2. Ignoring Metadata
Don't just store text — store metadata (source document, date, author, section). This enables filtering and attribution.
3. No Evaluation Framework
You need to measure RAG quality systematically:
- Retrieval quality — Are the right chunks being retrieved? (Precision/Recall)
- Generation quality — Is the answer accurate and well-formed? (Faithfulness, Relevance)
- End-to-end — Does the system answer user questions correctly?
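Retrieval quality is the easiest of these to quantify: against a labelled test set, compare the top-k retrieved chunk IDs with the chunks a human marked as relevant. A minimal sketch (the chunk IDs are illustrative):

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for one query.

    retrieved: ranked list of chunk IDs returned by the system
    relevant:  set of chunk IDs labelled as correct for this query
    """
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant)
    return precision, recall

p, r = precision_recall_at_k(["c1", "c2", "c3", "c4", "c5"], {"c2", "c9"}, k=5)
print(p, r)  # 0.2 0.5 — one of five retrieved is relevant; one of two relevant found
```

Averaging these over 50-100 labelled queries gives a baseline you can track as you tune chunking and retrieval.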
4. Skipping Hybrid Search
Pure vector search misses exact keyword matches. If a user asks about "SEBI circular 2024", vector search might return conceptually similar but wrong documents. Adding keyword search catches exact matches.
[WARNING] Don't go straight to production without evaluation. Build a test set of 50-100 question-answer pairs and measure your system's accuracy before deploying.
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
| --- | --- | --- |
| Data freshness | Real-time (update knowledge base) | Requires retraining |
| Cost | Lower (no training compute) | Higher (GPU hours) |
| Transparency | Can show source documents | Black box |
| Setup complexity | Moderate | High |
| Best for | Q&A, search, document analysis | Style/tone adaptation, specialised tasks |
For most enterprise use cases, start with RAG. Only fine-tune if RAG isn't sufficient for your specific requirements.
Production Considerations
When moving RAG to production in an Indian enterprise context:
- Latency — Host your vector database in the same region (ap-south-1 for AWS Mumbai) to minimise retrieval latency
- Security — Ensure document-level access controls. Not every user should see every document.
- Cost — Embedding and LLM API calls add up. Cache frequent queries and batch embed during off-peak hours.
- Multilingual — If your documents are in Hindi or regional languages, choose embedding models with strong multilingual support.
- Monitoring — Log every query, retrieval result, and generated answer for debugging and improvement.
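On the cost point, even a simple in-memory cache keyed by a hash of the input text avoids re-embedding repeated queries. A sketch with a hypothetical `embed_fn` backend (swap in your real embedding API call; production systems would use Redis or similar rather than a dict):

```python
import hashlib

class EmbeddingCache:
    """Memoise embedding calls so identical texts hit the API only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the real API call, e.g. OpenAI embeddings
        self.store = {}
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

Repeated questions ("what is our leave policy?") are common in support and internal-search workloads, so hit rates are often high enough to matter.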
RAG is not a set-and-forget system. Plan for continuous improvement based on user feedback and evaluation metrics.


