Retrieval-Augmented Generation (RAG) extends LLMs by retrieving relevant documents from an external knowledge base at inference time and injecting them into the context. This grounds the model in up-to-date, domain-specific facts, which reduces hallucination and often removes the need for fine-tuning.

Key Points

  • Problem it solves: LLMs have a knowledge cutoff and cannot access private/proprietary data
  • Embedding: text is converted to a dense vector (embedding) capturing semantic meaning
  • Vector Database: stores embeddings and enables fast similarity search (pgvector, Pinecone, Weaviate)
  • Chunking: split documents into segments (typically 256–1024 tokens) before embedding; chunks that are too small lose surrounding context, while chunks that are too large dilute relevance
  • Retrieval: at query time, embed the query and find top-k most similar document chunks
  • Augmentation: inject retrieved chunks into the prompt as context before the user question
  • Generation: LLM generates an answer grounded in the retrieved context
  • Re-ranking: use a cross-encoder to re-rank top-k results before passing to LLM
  • Hybrid search: combine dense vector search with keyword (BM25) search for better recall
  • Chunking strategy matters: sliding window, semantic chunking, sentence-level splitting
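
The sliding-window strategy named in the list can be sketched as follows. This is a minimal illustration, not a reference implementation: it approximates tokens with whitespace-separated words (a real pipeline would count tokens with the embedding model's tokenizer), and the function name and the `chunk_size` / `overlap` parameters are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into overlapping chunks of up to chunk_size words.

    Assumption: words stand in for tokens; swap in a real tokenizer as needed.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still retrievable in one piece from at least one chunk.
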

[Diagram: User Query ("What is RAG?") → Embed Query ([0.23, 0.89...]) → Vector DB (top-k similar chunks retrieved; Pinecone / pgvector) → LLM (context + query → answer) → Grounded Answer]

RAG pipeline: query → embed → retrieve from vector DB → augment prompt → LLM generates grounded answer
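
A toy version of the embed, retrieve, and augment steps might look like the sketch below. It makes several loud assumptions: `embed` is a bag-of-words stand-in for a learned dense embedding model, a Python list stands in for the vector database, and the final LLM call is omitted so that the example only builds the augmented prompt. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute-force search)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Place the retrieved chunks before the user question."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


chunks = [
    "RAG retrieves relevant documents and injects them into the prompt.",
    "BM25 is a keyword-based ranking function.",
    "Bananas are rich in potassium.",
]
query = "How does RAG use retrieved documents?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)  # this string would be sent to the LLM
```

In a production system the brute-force `sorted` call would be replaced by an approximate nearest-neighbour query against a vector database, and the embeddings for the chunks would be computed once at indexing time rather than on every query.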

Real-World Example

AWS, Microsoft, and Google all offer managed RAG solutions (Bedrock Knowledge Bases, Azure AI Search, Vertex AI Search). Enterprise AI assistants, such as legal research tools, customer support bots, and internal knowledge bases, commonly use RAG to combine LLM reasoning with proprietary company knowledge.