Knowledge Base
Technical · LLMs · 6 min read

RAG Architecture: Building AI Apps with Your Data

A technical guide to Retrieval Augmented Generation for enterprise applications.


Ritesh Vajariya

10 January 2025 · Updated 15 February 2025

Retrieval Augmented Generation (RAG) is the most practical architecture for building enterprise AI applications today. It lets you combine the power of LLMs with your organisation's proprietary data — without fine-tuning a model.

If you're building an AI-powered internal search, document Q&A system, or customer support bot, RAG is almost certainly the right starting point.

What is RAG?

RAG is a two-step process:

  1. Retrieve — Find the most relevant documents/chunks from your knowledge base
  2. Generate — Feed those documents to an LLM along with the user's question, and let the LLM generate an answer grounded in your data

User Question → Retrieval (Vector DB) → Relevant Chunks → LLM → Grounded Answer

This solves the two biggest problems with using LLMs in enterprise:

  • Hallucination — The LLM is grounded in real documents, not just its training data
  • Data freshness — Your knowledge base can be updated without retraining the model

RAG Architecture Components

1. Document Ingestion Pipeline

Your documents need to be processed before they can be retrieved:

Raw Documents → Chunking → Embedding → Vector Database

Chunking strategies:

  • Fixed-size chunks (e.g., 500 tokens with 50 token overlap) — Simple, works well for homogeneous documents
  • Semantic chunking — Split on paragraph/section boundaries for better context preservation
  • Recursive splitting — Try larger chunks first, split further only if needed
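As a minimal sketch of the fixed-size strategy, here is a word-based splitter with overlap. It counts words as a rough proxy for tokens; a real pipeline would count tokens with the model's tokenizer (e.g. tiktoken for OpenAI models):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with overlap.

    Word counts stand in for token counts here; swap in a real
    tokenizer before relying on the chunk sizes.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1200-word document becomes 3 overlapping chunks
doc = " ".join(f"word{i}" for i in range(1200))
print(len(chunk_text(doc)), "chunks")  # 3 chunks
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.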

Pro Tip

Chunk size matters more than most teams realise. Too small and you lose context. Too large and you dilute relevance. Start with 500-800 tokens and experiment.

2. Embedding Model

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors.

Model                            Dimensions   Best For
OpenAI text-embedding-3-small    1536         General purpose, easy to start
Cohere embed-v3                  1024         Multilingual (good for Indian languages)
BGE / E5 (open source)           768-1024     Self-hosted, no API costs
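"Similar vectors" can be made concrete with cosine similarity, the standard way to compare embeddings. A toy example with hand-written 3-dimensional vectors (real models output hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — imagine the query and two candidate chunks
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]  # semantically similar text
doc_far = [0.0, 0.1, 0.9]    # unrelated text

print(cosine_similarity(query, doc_close))  # close to 1.0
print(cosine_similarity(query, doc_far))    # close to 0.0
```

Vector databases run essentially this comparison (or an approximate version of it) across millions of stored vectors.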

3. Vector Database

Stores embeddings and enables fast similarity search.

Database   Type                  Good For
Pinecone   Managed cloud         Quick start, zero ops
Weaviate   Self-hosted/cloud     Hybrid search (vector + keyword)
pgvector   PostgreSQL extension  Already using PostgreSQL
ChromaDB   Embedded              Prototyping, small datasets

4. Retrieval

When a user asks a question:

  1. Embed the question using the same embedding model
  2. Search the vector database for the K most similar chunks
  3. Return those chunks as context

Advanced retrieval techniques:

  • Hybrid search — Combine vector similarity with keyword matching (BM25)
  • Re-ranking — Use a cross-encoder to re-rank the top results for better relevance
  • Query expansion — Rewrite the user's query to improve retrieval
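One common way to combine vector and keyword rankings is reciprocal rank fusion (RRF). This sketch merges two hypothetical result lists; the doc IDs are made up, and k=60 is the conventional constant from the RRF literature:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one.

    Each document scores sum(1 / (k + rank)) across the lists
    it appears in, so documents ranked well by both retrievers
    rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 lists where the two retrievers disagree
vector_hits = ["doc_3", "doc_1", "doc_7"]   # semantic ranking
keyword_hits = ["doc_1", "doc_9", "doc_3"]  # BM25 ranking

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_1', 'doc_3', 'doc_9', 'doc_7']
```

Because RRF only needs ranks, not scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.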

5. Generation

Pass the retrieved context + user question to the LLM:

System: You are a helpful assistant. Answer based on the provided context.
If the answer is not in the context, say "I don't have information about that."

Context:
{retrieved_chunks}

User: {question}

Implementation Example

Here's a simplified RAG pipeline in Python:

from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("knowledge_base")

def ingest(documents):
    # Embed each document and store it alongside its source metadata
    for i, doc in enumerate(documents):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["text"]
        )
        collection.add(
            ids=[f"doc_{i}"],
            embeddings=[response.data[0].embedding],
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}]
        )

def query(question, k=5):
    # Embed the question
    q_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=k
    )

    context = "\n\n".join(results["documents"][0])

    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )

    return response.choices[0].message.content

Common RAG Pitfalls

1. Poor Chunking

If your chunks split mid-sentence or mid-paragraph, the LLM gets fragmented context. Always use overlap and respect document structure.

2. Ignoring Metadata

Don't just store text — store metadata (source document, date, author, section). This enables filtering and attribution.
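Most vector databases support metadata filters natively (e.g. ChromaDB's `where` parameter), but the pattern is easy to see in plain Python. This sketch filters retrieved chunks before they go into the prompt; the field names ("source", "date") are illustrative, not a fixed schema:

```python
def filter_chunks(chunks, source=None, after_date=None):
    """Keep only chunks matching the given metadata constraints.

    Dates are ISO-8601 strings, so lexicographic comparison
    matches chronological order.
    """
    out = []
    for chunk in chunks:
        meta = chunk["meta"]
        if source is not None and meta.get("source") != source:
            continue
        if after_date is not None and meta.get("date", "") < after_date:
            continue
        out.append(chunk)
    return out

# Hypothetical retrieved chunks with attached metadata
chunks = [
    {"text": "Q3 revenue grew 12%", "meta": {"source": "earnings.pdf", "date": "2024-10-01"}},
    {"text": "Leave policy update", "meta": {"source": "hr_handbook.pdf", "date": "2023-05-10"}},
]

print(filter_chunks(chunks, source="earnings.pdf"))  # keeps only the earnings chunk
```

The same metadata also powers attribution: because each chunk carries its source document, the final answer can cite exactly where it came from.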

3. No Evaluation Framework

You need to measure RAG quality systematically:

  • Retrieval quality — Are the right chunks being retrieved? (Precision/Recall)
  • Generation quality — Is the answer accurate and well-formed? (Faithfulness, Relevance)
  • End-to-end — Does the system answer user questions correctly?
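Retrieval quality is the easiest of the three to automate. A minimal recall@k harness, using hypothetical chunk IDs as ground truth, might look like this:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of ground-truth relevant chunk IDs found in the top-k results."""
    hits = sum(1 for doc_id in relevant if doc_id in retrieved[:k])
    return hits / len(relevant)

# Hypothetical test set: for each question, the chunk IDs your
# retriever returned and the IDs a human marked as relevant
test_set = [
    {"retrieved": ["c4", "c1", "c9", "c2", "c7"], "relevant": ["c1", "c2"]},
    {"retrieved": ["c3", "c8", "c5", "c6", "c0"], "relevant": ["c5", "c2"]},
]

avg = sum(recall_at_k(t["retrieved"], t["relevant"]) for t in test_set) / len(test_set)
print(f"mean recall@5: {avg:.2f}")  # mean recall@5: 0.75
```

Generation quality (faithfulness, relevance) usually needs either human review or an LLM-as-judge setup on top of a harness like this.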

4. Skipping Hybrid Search

Pure vector search misses exact keyword matches. If a user asks about "SEBI circular 2024", vector search might return conceptually similar but wrong documents. Adding keyword search catches exact matches.

Warning

Don't go straight to production without evaluation. Build a test set of 50-100 question-answer pairs and measure your system's accuracy before deploying.

RAG vs Fine-Tuning

Aspect            RAG                                 Fine-Tuning
Data freshness    Real-time (update knowledge base)   Requires retraining
Cost              Lower (no training compute)         Higher (GPU hours)
Transparency      Can show source documents           Black box
Setup complexity  Moderate                            High
Best for          Q&A, search, document analysis      Style/tone adaptation, specialised tasks

For most enterprise use cases, start with RAG. Only fine-tune if RAG isn't sufficient for your specific requirements.

Production Considerations

When moving RAG to production in an Indian enterprise context:

  1. Latency — Host your vector database in the same region (ap-south-1 for AWS Mumbai) to minimise retrieval latency
  2. Security — Ensure document-level access controls. Not every user should see every document.
  3. Cost — Embedding and LLM API calls add up. Cache frequent queries and batch embed during off-peak hours.
  4. Multilingual — If your documents are in Hindi or regional languages, choose embedding models with strong multilingual support.
  5. Monitoring — Log every query, retrieval result, and generated answer for debugging and improvement.
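For the cost point, caching frequent queries can be sketched in a few lines. This in-process version keys on a normalised form of the question (the answer text is a made-up example); a production deployment would more likely use Redis with a TTL:

```python
import hashlib

class QueryCache:
    """In-memory cache of generated answers keyed by normalised query text."""

    def __init__(self):
        self._store = {}

    def _key(self, question):
        # Lowercase and collapse whitespace so trivial variants share a key
        normalised = " ".join(question.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def get(self, question):
        return self._store.get(self._key(question))

    def set(self, question, answer):
        self._store[self._key(question)] = answer

cache = QueryCache()
cache.set("What is the leave policy?", "Employees get 24 days of paid leave.")
print(cache.get("what is  the leave policy?"))  # cache hit despite case/spacing
```

Note the trade-off: exact-match caching only catches repeated questions; semantic caching (matching on embedding similarity) catches paraphrases but risks serving a stale answer to a subtly different question.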

RAG is not a set-and-forget system. Plan for continuous improvement based on user feedback and evaluation metrics.

Key Takeaway

RAG is the 80/20 of enterprise AI — it gets you 80% of the value with 20% of the complexity of fine-tuning. Start with RAG, build a solid evaluation framework, and only consider fine-tuning if RAG hits a ceiling for your specific use case.

Go Deeper

AI Engineering

Move from reading to doing — hands-on, instructor-led training with real enterprise case studies.

View Program Details

Tags

RAG · LLM · Architecture · Engineering · Vector Database
