Large Language Models are deep neural networks trained on massive text corpora to predict the next token in a sequence. Through self-supervised pre-training at scale, they develop rich linguistic and world knowledge that can be applied to a vast range of tasks without task-specific training.
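The next-token objective can be written out explicitly. As a sketch in notation the text itself does not define (x_t for the t-th token, T for the sequence length, θ for the model parameters), pre-training minimises the negative log-likelihood of each token given the tokens before it:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

No labels are needed because the targets are the text's own next tokens, which is why the objective counts as self-supervised.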

Key Points

  • Architecture: Transformer with self-attention, introduced in "Attention Is All You Need" (Vaswani et al., 2017)
  • Pre-training: predict the next token (GPT-style) or reconstruct masked tokens (BERT-style), on corpora of up to trillions of tokens
  • Tokenisation: text is split into subword units (e.g. with BPE), so "tokenisation" might become ["token", "isation"]
  • Parameters: learnable weights — GPT-2 has 1.5B, GPT-3 has 175B, GPT-4 reportedly ~1.76T
  • Context window: tokens the model can process at once — GPT-4 Turbo: 128K, Gemini 1.5: 1M
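  • Temperature: scales the logits before softmax to control output randomness (0 = greedy and effectively deterministic, 1 = sample from the unmodified distribution, >1 = flatter and more random)
  • Top-P (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability reaches P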
  • Instruction fine-tuning: fine-tune on (instruction, response) pairs to make model helpful
  • RLHF: use human preference feedback to further align model responses
  • Model families: GPT (OpenAI), Claude (Anthropic), Gemini (Google), LLaMA (Meta), Mistral
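The temperature and Top-P points above can be made concrete with a small sampler. This is a minimal pure-Python sketch, not any vendor's implementation: the function names and the example logits are hypothetical, and treating temperature 0 as a greedy argmax is an assumption that matches common practice.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalised nucleus


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index; temperature 0 falls back to greedy decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Hypothetical scores for a four-token vocabulary.
example_logits = [2.0, 1.0, 0.5, -1.0]
token = sample_next_token(example_logits, temperature=0.7, top_p=0.9)
```

Lowering the temperature concentrates probability on the highest-scoring token, while lowering top_p removes the low-probability tail before sampling. The two controls are independent and are often used together.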

Model             | Company    | Parameters | Context Window
GPT-4o            | OpenAI     | Unknown    | 128K tokens
Claude 3.5 Sonnet | Anthropic  | Unknown    | 200K tokens
Gemini 1.5 Pro    | Google     | Unknown    | 1M tokens
LLaMA 3 70B       | Meta       | 70B        | 8K tokens (128K in Llama 3.1)
Mistral Large 2   | Mistral AI | 123B       | 128K tokens
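The subword tokenisation point in the list above can be illustrated with a toy byte-pair encoding (BPE) trainer. This is a sketch on a made-up five-word corpus: the function names are hypothetical, and production tokenisers typically work on bytes, learn tens of thousands of merges, and handle whitespace and special tokens, none of which is modelled here.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = {tuple(word): 1}
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(next(iter(symbols)))


merges = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)                   # [('l', 'o'), ('lo', 'w')]
print(encode("lowly", merges))  # ['low', 'l', 'y']
```

Frequent character sequences become single tokens, while rare words fall back to smaller pieces. This is how a fixed vocabulary can still represent text it has never seen.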

Real-World Example

GitHub Copilot — powered by a code-specialised LLM — writes ~46% of code in the files it is enabled for, according to GitHub's own data. It was trained on public GitHub repositories and has been adopted by millions of developers.