Retrieval-Augmented Generation
Vector embeddings, semantic search, chunking, re-ranking, hallucination reduction
Retrieval-Augmented Generation (RAG) extends LLMs by retrieving relevant documents from an external knowledge base at inference time and injecting them into the context. This grounds the model in up-to-date, domain-specific facts, which reduces hallucination and often avoids the need for fine-tuning when the goal is only to add knowledge.
Key Points
- Problem it solves: LLMs have a knowledge cutoff and cannot access private/proprietary data
- Embedding: text is converted to a dense vector (embedding) capturing semantic meaning
- Vector Database: stores embeddings and enables fast similarity search (pgvector, Pinecone, Weaviate)
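- Chunking: split documents into segments (typically 256–1024 tokens) before embedding; chunk size is a trade-off, since small chunks match queries more precisely but lose surrounding context, while large chunks keep context but dilute the match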
- Retrieval: at query time, embed the query and find top-k most similar document chunks
- Augmentation: inject retrieved chunks into the prompt as context before the user question
- Generation: LLM generates an answer grounded in the retrieved context
- Re-ranking: use a cross-encoder to score each (query, chunk) pair and re-order the top-k results before passing them to the LLM
- Hybrid search: combine dense vector search with keyword (BM25) search for better recall
- Chunking strategies: sliding window (overlapping chunks), semantic chunking, sentence-level splitting
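As a rough illustration of the sliding-window strategy listed above, the sketch below splits a token list into fixed-size chunks that overlap by a few tokens. The function name `sliding_window_chunks` and the parameters `chunk_size` and `overlap` are hypothetical, and whitespace-separated words stand in for real model tokens, so the sizes are not comparable to the 256–1024 token range.

```python
def sliding_window_chunks(tokens, chunk_size=256, overlap=32):
    """Split a list of tokens into overlapping fixed-size chunks.

    Assumption: `tokens` is any list; a real pipeline would use the
    tokenizer of its embedding model rather than str.split().
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input,
        # so no trailing chunk made up only of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "retrieval augmented generation grounds answers in retrieved text".split()
    for chunk in sliding_window_chunks(words, chunk_size=4, overlap=1):
        print(" ".join(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.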
RAG pipeline: query → embed → retrieve from vector DB → augment prompt → LLM generates grounded answer
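The pipeline above can be sketched end to end in a few lines. This is a minimal illustration under strong assumptions, not a production design: `embed` is a toy bag-of-words counter standing in for a real dense embedding model, an in-memory list stands in for the vector database, and the final LLM call is left out. All function names here (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): word counts.

    A real system would call a dense embedding model here instead.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Augment: place the retrieved chunks before the user question."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    chunks = [
        "the refund policy allows returns within 30 days",
        "our office is closed on public holidays",
        "shipping takes five business days",
    ]
    question = "what is the refund policy"
    prompt = build_prompt(question, retrieve(question, chunks, k=2))
    print(prompt)  # this string would be sent to the LLM for generation
```

In a real deployment the sorted scan in `retrieve` would be replaced by a similarity query against a vector database, and a re-ranking step could be applied to the returned chunks before `build_prompt`.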
Real-World Example
AWS, Microsoft, and Google all offer managed RAG solutions (Bedrock Knowledge Bases, Azure AI Search, Vertex AI Search). Enterprise AI assistants such as legal research tools, customer support bots, and internal knowledge bases commonly use RAG to combine LLM reasoning with proprietary company knowledge.