RAG Architecture: Building AI Apps with Your Data

A technical guide to Retrieval Augmented Generation for enterprise applications.

Ritesh Vajariya


Retrieval Augmented Generation (RAG) is the most practical architecture for building enterprise AI applications today. It lets you combine the power of LLMs with your organisation's proprietary data — without fine-tuning a model.

If you're building an AI-powered internal search, document Q&A system, or customer support bot, RAG is almost certainly the right starting point.

What is RAG?

RAG is a two-step process:

  1. Retrieve — Find the most relevant documents/chunks from your knowledge base
  2. Generate — Feed those documents to an LLM along with the user's question, and let the LLM generate an answer grounded in your data

User Question → Retrieval (Vector DB) → Relevant Chunks → LLM → Grounded Answer

This solves the two biggest problems with using LLMs in enterprise:

  • Hallucination — The LLM is grounded in real documents, not just its training data
  • Data freshness — Your knowledge base can be updated without retraining the model

RAG Architecture Components

1. Document Ingestion Pipeline

Your documents need to be processed before they can be retrieved:

Raw Documents → Chunking → Embedding → Vector Database

Chunking strategies:

  • Fixed-size chunks (e.g., 500 tokens with 50 token overlap) — Simple, works well for homogeneous documents
  • Semantic chunking — Split on paragraph/section boundaries for better context preservation
  • Recursive splitting — Try larger chunks first, split further only if needed

[TIP] Chunk size matters more than most teams realise. Too small and you lose context. Too large and you dilute relevance. Start with 500-800 tokens and experiment.
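The fixed-size strategy above can be sketched in a few lines. This is a minimal illustration that splits on words as a stand-in for tokens — a real pipeline would count tokens with the embedding model's tokenizer (e.g. tiktoken for OpenAI models):

```python
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into fixed-size overlapping chunks.

    Uses words as a rough proxy for tokens; swap in a real
    tokenizer for production use.
    """
    words = text.split()
    step = chunk_size - overlap  # each chunk starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached the end of the document
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.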

2. Embedding Model

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors.

Model                           Dimensions   Best For
OpenAI text-embedding-3-small   1536         General purpose, easy to start
Cohere embed-v3                 1024         Multilingual (good for Indian languages)
BGE / E5 (open source)          768-1024     Self-hosted, no API costs
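"Similar vectors" is usually measured with cosine similarity. A minimal sketch, independent of any particular embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Vector databases compute this (or the closely related dot product) at scale using approximate nearest-neighbour indexes, so you rarely call it directly — but it is the metric your retrieval quality ultimately rests on.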

3. Vector Database

Stores embeddings and enables fast similarity search.

Database   Type                   Good For
Pinecone   Managed cloud          Quick start, zero ops
Weaviate   Self-hosted/cloud      Hybrid search (vector + keyword)
pgvector   PostgreSQL extension   Already using PostgreSQL
ChromaDB   Embedded               Prototyping, small datasets

4. Retrieval

When a user asks a question:

  1. Embed the question using the same embedding model
  2. Search the vector database for the K most similar chunks
  3. Return those chunks as context

Advanced retrieval techniques:

  • Hybrid search — Combine vector similarity with keyword matching (BM25)
  • Re-ranking — Use a cross-encoder to re-rank the top results for better relevance
  • Query expansion — Rewrite the user's query to improve retrieval
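One common way to implement hybrid search is reciprocal rank fusion (RRF), which merges the ranked lists from vector search and BM25 without needing to normalise their scores. A minimal sketch (the constant k=60 is a conventional default, not a tuned value):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.

    rankings: list of ranked lists (best first), e.g. [vector_results, bm25_results].
    Each document scores 1 / (k + rank) per list it appears in; documents
    ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses rank positions, it sidesteps the problem that cosine similarities and BM25 scores live on incompatible scales.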

5. Generation

Pass the retrieved context + user question to the LLM:

System: You are a helpful assistant. Answer based on the provided context.
If the answer is not in the context, say "I don't have information about that."

Context:
{retrieved_chunks}

User: {question}
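Assembling that template in code is a one-function job. A minimal sketch (the function name is illustrative, not a library API):

```python
def build_messages(question, chunks):
    """Fill the RAG prompt template with retrieved chunks and the user's question."""
    context = "\n\n".join(chunks)
    system = (
        "You are a helpful assistant. Answer based on the provided context.\n"
        'If the answer is not in the context, say "I don\'t have information about that."\n\n'
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Keeping the "say you don't know" instruction in the system message is what turns retrieval into grounding — without it, the LLM will happily answer from its training data when the context falls short.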

Implementation Example

Here's a simplified RAG pipeline in Python:

from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("knowledge_base")

def ingest(documents):
    for i, doc in enumerate(documents):
        # Embed each chunk with the same model used at query time
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["text"]
        )
        # Store the vector alongside the raw text and its metadata
        collection.add(
            ids=[f"doc_{i}"],
            embeddings=[response.data[0].embedding],
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}]
        )

def query(question, k=5):
    # Embed the question
    q_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=k
    )

    context = "\n\n".join(results["documents"][0])

    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )

    return response.choices[0].message.content

Common RAG Pitfalls

1. Poor Chunking

If your chunks split mid-sentence or mid-paragraph, the LLM gets fragmented context. Always use overlap and respect document structure.

2. Ignoring Metadata

Don't just store text — store metadata (source document, date, author, section). This enables filtering and attribution.

3. No Evaluation Framework

You need to measure RAG quality systematically:

  • Retrieval quality — Are the right chunks being retrieved? (Precision/Recall)
  • Generation quality — Is the answer accurate and well-formed? (Faithfulness, Relevance)
  • End-to-end — Does the system answer user questions correctly?
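Retrieval quality is the easiest of these to measure automatically. A minimal sketch of recall@k against a hand-labelled test set (the function name and labelling scheme are illustrative):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)
```

Run this over your question set after every change to chunking, embedding model, or retrieval parameters — if recall@k drops, no amount of prompt engineering downstream will recover the missing context.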

4. Skipping Hybrid Search

Pure vector search misses exact keyword matches. If a user asks about "SEBI circular 2024", vector search might return conceptually similar but wrong documents. Adding keyword search catches exact matches.

[WARNING] Don't go straight to production without evaluation. Build a test set of 50-100 question-answer pairs and measure your system's accuracy before deploying.

RAG vs Fine-Tuning

Aspect             RAG                                 Fine-Tuning
Data freshness     Real-time (update knowledge base)   Requires retraining
Cost               Lower (no training compute)         Higher (GPU hours)
Transparency       Can show source documents           Black box
Setup complexity   Moderate                            High
Best for           Q&A, search, document analysis      Style/tone adaptation, specialised tasks

For most enterprise use cases, start with RAG. Only fine-tune if RAG isn't sufficient for your specific requirements.

Production Considerations

When moving RAG to production in an Indian enterprise context:

  1. Latency — Host your vector database in the same region (ap-south-1 for AWS Mumbai) to minimise retrieval latency
  2. Security — Ensure document-level access controls. Not every user should see every document.
  3. Cost — Embedding and LLM API calls add up. Cache frequent queries and batch embed during off-peak hours.
  4. Multilingual — If your documents are in Hindi or regional languages, choose embedding models with strong multilingual support.
  5. Monitoring — Log every query, retrieval result, and generated answer for debugging and improvement.
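Query caching (point 3 above) can start as simple as an in-memory map keyed on the normalised question. A minimal sketch — the class name is illustrative, and a production system would add TTLs and a shared store such as Redis:

```python
import hashlib

class QueryCache:
    """In-memory cache of generated answers, keyed on the normalised question."""

    def __init__(self):
        self._store = {}

    def _key(self, question):
        # Normalise so trivially different phrasings of the same question hit the cache
        return hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()

    def get(self, question):
        return self._store.get(self._key(question))

    def put(self, question, answer):
        self._store[self._key(question)] = answer
```

Even this naive version avoids repeated embedding and LLM calls for the high-frequency questions that dominate most internal Q&A traffic.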

RAG is not a set-and-forget system. Plan for continuous improvement based on user feedback and evaluation metrics.

RAG is the 80/20 of enterprise AI — it gets you 80% of the value with 20% of the complexity of fine-tuning. Start with RAG, build a solid evaluation framework, and only consider fine-tuning if RAG hits a ceiling for your specific use case.