
xTheus

RAG in Production: Retrieval-Augmented Generation Architectures for Enterprise

Why RAG and Not Fine-Tuning

Fine-tuning an LLM is expensive, hard to keep current, and still leaves the model prone to hallucinating about facts absent from its training set. RAG (Retrieval-Augmented Generation) solves this by separating knowledge from reasoning: the LLM reasons over dynamically retrieved documents rather than static memorization. For enterprise data that changes frequently (policies, manuals, regulations), RAG is the right architecture.

RAG Pipeline: From Document to Verified Answer
Ingestion & Preparation: Documents → Chunking → Embeddings → Vector Store
Query & Generation: Query → Hybrid Search → Re-Ranking → LLM + Context → Grounding Check
Key retrieval metrics: Recall@5 (retrieval coverage), MRR (rank of the first relevant result), NDCG (overall ranking quality)
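The Hybrid Search stage above merges a dense (embedding) ranking with a sparse (BM25) ranking. One common way to fuse the two result lists is reciprocal rank fusion (RRF); the sketch below is illustrative, not from the original article (the function name and inputs are assumptions):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one combined ranking.

    rankings: list of ranked lists (best doc first), e.g. [dense_hits, bm25_hits].
    k: smoothing constant; 60 is the value proposed in the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank); docs ranked high in
            # several lists accumulate the largest fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization between the dense and sparse retrievers, which is why it is a popular default for hybrid search.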

Chunking, Embeddings, and Retrieval

RAG quality depends on retrieval, not the LLM. Poor chunking (chunks that are too large dilute retrieval precision; chunks that are too small lose context) is the main cause of low-quality answers. Common strategies: sentence-based chunking with overlap, recursive splitting by headers, and parent-child chunking, where the small chunk is what gets retrieved but the full parent document is sent to the LLM. Hybrid search (dense embeddings + sparse BM25) consistently outperforms dense-only retrieval.
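As a concrete illustration, a sliding-window chunker with overlap might look like this (a minimal sketch; real pipelines usually split on sentences or tokens rather than raw characters):

```python
def chunk_text(text, max_chars=200, overlap_chars=50):
    """Split text into chunks of at most max_chars, each overlapping
    the previous chunk by overlap_chars so context spans boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap_chars  # step back to create the overlap
    return chunks
```

The overlap is what prevents a sentence that straddles a chunk boundary from being split away from the context needed to retrieve it.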

Retrieval Evaluation and Hallucination Detection

Retrieval metrics: Recall@k (were relevant documents retrieved?), MRR (is the most relevant one first?), and NDCG (complete ranking quality). For hallucinations, citation grounding verifies that each LLM claim is supported by a retrieved chunk. Claims without support are flagged as unverified. This requires a post-generation step with an evaluator model (a second LLM or lightweight classifier).
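All three retrieval metrics can be computed directly from a ranked result list and the set of relevant document IDs. A minimal sketch with binary relevance (function names are illustrative):

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0.0 if none appear)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

In practice these are averaged over a labeled query set and tracked per release, so a chunking or embedding change that degrades retrieval is caught before it degrades answers.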

Fine-Tuning vs. RAG: When to Use Each Approach
Dimension       | Fine-Tuning                  | RAG
Data update     | Re-train the entire model    | Update docs in the vector store
Startup cost    | High (GPU, labeled data)     | Low (embeddings + search)
Traceability    | Opaque (model weights)       | Transparent (source citations)
Hallucinations  | Hard to detect               | Detectable with grounding
Ideal case      | Style, tone, specific tasks  | Dynamic data, compliance
Google Cloud · RAG Production Stack
  • Embeddings: Vertex AI Embeddings, Cloud Storage (docs)
  • Vector Store: Vertex AI Vector Search, AlloyDB
  • Orchestrator: Cloud Run, Cloud Functions (guardrails)
  • Generation: Vertex AI Gemini
  • Eval Log: BigQuery, Looker

Key Takeaways

  • RAG quality depends on retrieval, not the LLM. Chunking and hybrid search are the most critical decisions.
  • Hybrid search (dense + sparse BM25) consistently outperforms embeddings-only.
  • Citation grounding with an evaluator model is essential for detecting hallucinations in production.