Large Language Models
Transformer architecture, pre-training, tokenization, context windows, model families
Large Language Models are deep neural networks trained on massive text corpora to predict the next token in a sequence. Through self-supervised pre-training at scale, they develop rich linguistic and world knowledge that can be applied to a vast range of tasks without task-specific training.
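To make "predict the next token" concrete, here is a minimal sketch using a toy bigram counter. It is not a Transformer and not how any production LLM is built; the corpus and the function names (`train_bigram`, `next_token_probs`, `average_nll`) are invented for illustration. It only shows the shape of the objective: estimate a distribution over the next token, then score it with the average negative log-likelihood, the quantity pre-training minimises.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


def average_nll(counts, tokens):
    """Average negative log-likelihood of each token given the one before it."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev)[nxt]
        losses.append(-math.log(p))
    return sum(losses) / len(losses)


# Hypothetical toy corpus, split on whitespace instead of a real tokenizer.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")  # distribution over tokens seen after "the"
loss = average_nll(model, corpus)       # lower means better next-token predictions
```

A real LLM replaces the count table with a neural network conditioned on the whole preceding context, but the training signal is the same kind of next-token loss.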
Key Points
- Architecture: Transformer with self-attention, introduced in "Attention Is All You Need" (Vaswani et al., 2017)
- Pre-training: predict the next token across trillions of tokens (GPT-style) or recover masked tokens (BERT-style)
- Tokenisation: text is split into subword units, e.g. with byte-pair encoding (BPE); "tokenisation" might become ["token", "isation"], though the exact split depends on the tokenizer's vocabulary
- Parameters: learnable weights — GPT-2 has 1.5B, GPT-3 has 175B, GPT-4 reportedly ~1.76T
- Context window: tokens the model can process at once — GPT-4 Turbo: 128K, Gemini 1.5: 1M
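- Temperature: scales the logits before sampling to control randomness (0 = greedy and effectively deterministic, 1 = the model's unmodified distribution, >1 = flatter and more random)
- Top-P (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability reaches P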
- Instruction fine-tuning: fine-tune on (instruction, response) pairs so the model follows instructions helpfully
- RLHF (reinforcement learning from human feedback): use human preference feedback to further align model responses
- Model families: GPT (OpenAI), Claude (Anthropic), Gemini (Google), LLaMA (Meta), Mistral (Mistral AI)
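The temperature and top-P points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not any vendor's sampling code: the function names are hypothetical, the logits are made-up numbers, and temperature 0 is treated as greedy argmax because dividing by zero is undefined.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalised nucleus


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token index using temperature scaling followed by nucleus sampling."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Hypothetical logits for a three-token vocabulary.
example_logits = [1.0, 3.0, 2.0]
greedy_choice = sample_next_token(example_logits, temperature=0)
sampled_choice = sample_next_token(example_logits, temperature=0.8, top_p=0.9)
```

Lowering the temperature concentrates probability on the highest-scoring token, while lowering `top_p` removes the low-probability tail before sampling; the two settings are often combined.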
| Model | Company | Parameters | Context Window |
|---|---|---|---|
| GPT-4o | OpenAI | Undisclosed | 128K tokens |
| Claude 3.5 Sonnet | Anthropic | Unknown | 200K tokens |
| Gemini 1.5 Pro | Google | Unknown | 1M tokens |
| LLaMA 3.1 70B | Meta | 70B | 128K tokens |
| Mistral Large 2 | Mistral AI | 123B | 128K tokens |
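The context window column can be read as a token budget. The sketch below assumes that prompt tokens and generated tokens share one window, which is the common convention; providers may also cap output length separately. The window sizes are rounded values from the table (so "128K" is taken as 128,000), and the function names are hypothetical.

```python
# Approximate context window sizes in tokens, rounded from the table above.
CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "Claude 3.5 Sonnet": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
}


def fits_in_context(model, prompt_tokens, max_output_tokens):
    """Return True if the prompt plus the requested output fits in the window."""
    window = CONTEXT_WINDOWS[model]
    return prompt_tokens + max_output_tokens <= window


def remaining_tokens(model, prompt_tokens):
    """Tokens left for generation after the prompt, never below zero."""
    return max(0, CONTEXT_WINDOWS[model] - prompt_tokens)


# A 120,000-token prompt leaves room for a 4,000-token reply in a 128K window.
ok = fits_in_context("GPT-4o", prompt_tokens=120_000, max_output_tokens=4_000)
# A 127,000-token prompt does not.
too_big = fits_in_context("GPT-4o", prompt_tokens=127_000, max_output_tokens=4_000)
```

Token counts have to come from the model's own tokenizer; character or word counts are only rough proxies.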
Real-World Example
GitHub Copilot — powered by a code-specialised LLM — writes ~46% of code in the files it is enabled for, according to GitHub's own data. It was trained on public GitHub repositories and has been adopted by millions of developers.