Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) extends LLMs by retrieving relevant documents from an external knowledge base at inference time and injecting them into the context. This grounds the model in up-to-date, domain-specific facts, which reduces hallucination and often removes the need to fine-tune the model just to add new knowledge.

Key Points

Problem it solves: LLMs have a knowledge cutoff and cannot access private/proprietary data
Embedding: text is converted to a dense vector (embedding) capturing semantic meaning
Vector Database: stores embeddings and enables fast similarity search (pgvector, Pinecone, Weaviate)
Chunking: split documents into segments (commonly 256–1024 tokens) before embedding; chunks that are too small lose surrounding context, while chunks that are too large dilute relevance
Retrieval: at query time, embed the query and find top-k most similar document chunks
Augmentation: inject retrieved chunks into the prompt as context before the user question
Generation: the LLM generates an answer grounded in the retrieved context
Re-ranking: use a cross-encoder to re-rank the top-k results before passing them to the LLM
Hybrid search: combine dense vector search with keyword (BM25) search for better recall
Chunking strategies: sliding window (fixed-size chunks with overlap), semantic chunking, sentence-level splitting
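
As a rough illustration of the sliding-window strategy, the sketch below splits text into fixed-size chunks that overlap their neighbours. It is a minimal example under stated assumptions, not a production chunker: the function name `sliding_window_chunks` and its parameters are hypothetical, and whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer).

```python
def sliding_window_chunks(text, chunk_size=256, overlap=32):
    """Split text into overlapping chunks of whitespace-delimited words.

    Assumption: words approximate tokens; swap in a real tokenizer
    when exact token counts matter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for chunk in sliding_window_chunks(sample, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.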

RAG pipeline: query → embed → retrieve from vector DB → augment prompt → LLM generates grounded answer
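
The pipeline above can be sketched end to end in a few lines. This is a toy illustration, not a reference implementation: `embed` is a hypothetical stand-in that builds a bag-of-words count vector instead of calling a real embedding model, an in-memory list plays the role of the vector database, and the final LLM call is left out, so the sketch stops at the augmented prompt. The names `cosine`, `retrieve`, and `build_prompt`, and the sample documents, are also assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Place the retrieved chunks before the user question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base, embedded once at indexing time.
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "The office is closed on public holidays.",
    "Return requests must include the original order number.",
]
index = [(c, embed(c)) for c in chunks]

question = "How long do refunds take after a return request?"
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top)  # this string would be sent to the LLM

if __name__ == "__main__":
    print(prompt)
```

In a real system the count vectors would be replaced by dense embeddings from a model, and the sorted scan by an approximate nearest-neighbour query against the vector database; the order of the steps stays the same.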

Real-World Example

AWS, Microsoft, and Google all offer managed RAG solutions (Bedrock Knowledge Bases, Azure AI Search, Vertex AI Search). Enterprise AI assistants, such as legal research tools, customer support bots, and internal knowledge bases, commonly use RAG to combine LLM reasoning with proprietary company knowledge.
