RAG Architecture: Building AI Apps with Your Data

A technical guide to Retrieval Augmented Generation for enterprise applications.

Ritesh Vajariya


Retrieval Augmented Generation (RAG) is the most practical architecture for building enterprise AI applications today. It lets you combine the power of LLMs with your organisation's proprietary data — without fine-tuning a model.

If you're building an AI-powered internal search, document Q&A system, or customer support bot, RAG is almost certainly the right starting point.

What is RAG?

RAG is a two-step process:

  1. Retrieve — Find the most relevant documents/chunks from your knowledge base
  2. Generate — Feed those documents to an LLM along with the user's question, and let the LLM generate an answer grounded in your data

User Question → Retrieval (Vector DB) → Relevant Chunks → LLM → Grounded Answer

This solves the two biggest problems with using LLMs in enterprise:

  • Hallucination — The LLM is grounded in real documents, not just its training data
  • Data freshness — Your knowledge base can be updated without retraining the model

RAG Architecture Components

1. Document Ingestion Pipeline

Your documents need to be processed before they can be retrieved:

Raw Documents → Chunking → Embedding → Vector Database

Chunking strategies:

  • Fixed-size chunks (e.g., 500 tokens with 50 token overlap) — Simple, works well for homogeneous documents
  • Semantic chunking — Split on paragraph/section boundaries for better context preservation
  • Recursive splitting — Try larger chunks first, split further only if needed

[TIP] Chunk size matters more than most teams realise. Too small and you lose context. Too large and you dilute relevance. Start with 500-800 tokens and experiment.
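The fixed-size strategy above can be sketched in a few lines. This is a minimal illustration that splits on words as a stand-in for tokens — a real pipeline would count tokens with the embedding model's tokenizer (e.g. tiktoken for OpenAI models):

```python
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into fixed-size overlapping chunks.

    Uses words as a rough proxy for tokens; swap in a real
    tokenizer for production use.
    """
    words = text.split()
    step = chunk_size - overlap  # each chunk starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached the end of the document
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.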

2. Embedding Model

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors.

Model                           Dimensions   Best For
OpenAI text-embedding-3-small   1536         General purpose, easy to start
Cohere embed-v3                 1024         Multilingual (good for Indian languages)
BGE / E5 (open source)          768-1024     Self-hosted, no API costs
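"Similar vectors" is usually measured with cosine similarity. A minimal sketch, independent of any particular embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Vector databases compute this (or the closely related dot product) at scale using approximate nearest-neighbour indexes, so you rarely call it directly — but it is the metric your retrieval quality ultimately rests on.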

3. Vector Database

Stores embeddings and enables fast similarity search.

Database   Type                   Good For
Pinecone   Managed cloud          Quick start, zero ops
Weaviate   Self-hosted/cloud      Hybrid search (vector + keyword)
pgvector   PostgreSQL extension   Already using PostgreSQL
ChromaDB   Embedded               Prototyping, small datasets

4. Retrieval

When a user asks a question:

  1. Embed the question using the same embedding model
  2. Search the vector database for the K most similar chunks
  3. Return those chunks as context

Advanced retrieval techniques:

  • Hybrid search — Combine vector similarity with keyword matching (BM25)
  • Re-ranking — Use a cross-encoder to re-rank the top results for better relevance
  • Query expansion — Rewrite the user's query to improve retrieval
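One common way to implement hybrid search is reciprocal rank fusion (RRF), which merges the ranked lists from vector search and BM25 without needing to normalise their scores. A minimal sketch (the constant k=60 is a conventional default, not a tuned value):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.

    rankings: list of ranked lists (best first), e.g. [vector_results, bm25_results].
    Each document scores 1 / (k + rank) per list it appears in; documents
    ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses rank positions, it sidesteps the problem that cosine similarities and BM25 scores live on incompatible scales.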

5. Generation

Pass the retrieved context + user question to the LLM:

System: You are a helpful assistant. Answer based on the provided context.
If the answer is not in the context, say "I don't have information about that."

Context:
{retrieved_chunks}

User: {question}
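Assembling that template in code is a one-function job. A minimal sketch (the function name is illustrative, not a library API):

```python
def build_messages(question, chunks):
    """Fill the RAG prompt template with retrieved chunks and the user's question."""
    context = "\n\n".join(chunks)
    system = (
        "You are a helpful assistant. Answer based on the provided context.\n"
        'If the answer is not in the context, say "I don\'t have information about that."\n\n'
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Keeping the "say you don't know" instruction in the system message is what turns retrieval into grounding — without it, the LLM will happily answer from its training data when the context falls short.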

Implementation Example

Here's a simplified RAG pipeline in Python:

from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("knowledge_base")

def ingest(documents):
    for i, doc in enumerate(documents):
        # Embed each chunk with the same model used at query time
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["text"]
        )
        # Store the vector alongside the raw text and its metadata
        collection.add(
            ids=[f"doc_{i}"],
            embeddings=[response.data[0].embedding],
            documents=[doc["text"]],
            metadatas=[{"source": doc["source"]}]
        )

def query(question, k=5):
    # Embed the question
    q_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=k
    )

    context = "\n\n".join(results["documents"][0])

    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )

    return response.choices[0].message.content

Common RAG Pitfalls

1. Poor Chunking

If your chunks split mid-sentence or mid-paragraph, the LLM gets fragmented context. Always use overlap and respect document structure.

2. Ignoring Metadata

Don't just store text — store metadata (source document, date, author, section). This enables filtering and attribution.

3. No Evaluation Framework

You need to measure RAG quality systematically:

  • Retrieval quality — Are the right chunks being retrieved? (Precision/Recall)
  • Generation quality — Is the answer accurate and well-formed? (Faithfulness, Relevance)
  • End-to-end — Does the system answer user questions correctly?
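Retrieval quality is the easiest of these to measure automatically. A minimal sketch of recall@k against a hand-labelled test set (the function name and labelling scheme are illustrative):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)
```

Run this over your question set after every change to chunking, embedding model, or retrieval parameters — if recall@k drops, no amount of prompt engineering downstream will recover the missing context.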

4. Skipping Hybrid Search

Pure vector search misses exact keyword matches. If a user asks about "SEBI circular 2024", vector search might return conceptually similar but wrong documents. Adding keyword search catches exact matches.

[WARNING] Don't go straight to production without evaluation. Build a test set of 50-100 question-answer pairs and measure your system's accuracy before deploying.

RAG vs Fine-Tuning

Aspect             RAG                                 Fine-Tuning
Data freshness     Real-time (update knowledge base)   Requires retraining
Cost               Lower (no training compute)         Higher (GPU hours)
Transparency       Can show source documents           Black box
Setup complexity   Moderate                            High
Best for           Q&A, search, document analysis      Style/tone adaptation, specialised tasks

For most enterprise use cases, start with RAG. Only fine-tune if RAG isn't sufficient for your specific requirements.

Production Considerations

When moving RAG to production in an Indian enterprise context:

  1. Latency — Host your vector database in the same region (ap-south-1 for AWS Mumbai) to minimise retrieval latency
  2. Security — Ensure document-level access controls. Not every user should see every document.
  3. Cost — Embedding and LLM API calls add up. Cache frequent queries and batch embed during off-peak hours.
  4. Multilingual — If your documents are in Hindi or regional languages, choose embedding models with strong multilingual support.
  5. Monitoring — Log every query, retrieval result, and generated answer for debugging and improvement.
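Query caching (point 3 above) can start as simple as an in-memory map keyed on the normalised question. A minimal sketch — the class name is illustrative, and a production system would add TTLs and a shared store such as Redis:

```python
import hashlib

class QueryCache:
    """In-memory cache of generated answers, keyed on the normalised question."""

    def __init__(self):
        self._store = {}

    def _key(self, question):
        # Normalise so trivially different phrasings of the same question hit the cache
        return hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()

    def get(self, question):
        return self._store.get(self._key(question))

    def put(self, question, answer):
        self._store[self._key(question)] = answer
```

Even this naive version avoids repeated embedding and LLM calls for the high-frequency questions that dominate most internal Q&A traffic.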

RAG is not a set-and-forget system. Plan for continuous improvement based on user feedback and evaluation metrics.

RAG is the 80/20 of enterprise AI — it gets you 80% of the value with 20% of the complexity of fine-tuning. Start with RAG, build a solid evaluation framework, and only consider fine-tuning if RAG hits a ceiling for your specific use case.