Large Language Models | Generative AI | AI / ML

Large Language Models are deep neural networks trained on massive text corpora to predict the next token in a sequence. Through self-supervised pre-training at scale, they develop rich linguistic and world knowledge that can be applied to a vast range of tasks without task-specific training.
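To make the next-token objective concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the scores (logits) are hypothetical values chosen for illustration, not output from any real model. The model assigns a score to every token in its vocabulary, a softmax turns the scores into probabilities, and the training loss is the negative log-probability of the token that actually came next.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy loss for one position: -log p(target token)."""
    probs = softmax(logits)
    return -math.log(probs[target_id])

# Hypothetical toy vocabulary and scores for the context "the cat sat on the"
vocab = ["mat", "dog", "moon", "sat"]
logits = [3.0, 0.5, 1.0, -1.0]

probs = softmax(logits)
loss = next_token_loss(logits, vocab.index("mat"))

for token, p in zip(vocab, probs):
    print(f"{token:>5}: {p:.3f}")
print(f"loss if the true next token is 'mat': {loss:.3f}")
```

Pre-training repeats this calculation at every position of every training sequence and adjusts the weights to lower the average loss.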

Key Points

Architecture: Transformer with self-attention, introduced in "Attention Is All You Need" (Vaswani et al., 2017)
Pre-training: predict the next token across trillions of tokens (GPT-style) or recover masked tokens (BERT-style)
Tokenisation: text is split into subword units, typically with byte-pair encoding (BPE), e.g. "tokenisation" → ["token", "isation"]; the exact split depends on the tokeniser's vocabulary
Parameters: learnable weights — GPT-2 has 1.5B, GPT-3 has 175B, GPT-4 reportedly ~1.76T
Context window: tokens the model can process at once — GPT-4 Turbo: 128K, Gemini 1.5: 1M
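Temperature: scales the randomness of sampling (0 = greedy and effectively deterministic, 1 = sample from the model's unmodified distribution, >1 = flatter distribution and increasingly erratic output)
Top-P (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability reaches P
Instruction fine-tuning: fine-tune on (instruction, response) pairs so the model follows instructions helpfully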
RLHF: use human preference feedback to further align model responses
Model families: GPT (OpenAI), Claude (Anthropic), Gemini (Google), LLaMA (Meta), Mistral
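The temperature and top-P points above can be sketched in a few lines of plain Python. This is an illustrative implementation under simple assumptions, not the decoding code of any particular model or API; the function names (`apply_temperature`, `top_p_filter`, `sample_token`) and the example logits are hypothetical.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalise so the kept probabilities sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token id using temperature scaling followed by nucleus sampling."""
    if temperature == 0:
        # Treat temperature 0 as greedy decoding: always take the top-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Hypothetical scores for a three-token vocabulary
example_logits = [2.0, 1.0, 0.1]
print(apply_temperature(example_logits, 0.5))  # sharper distribution
print(apply_temperature(example_logits, 2.0))  # flatter distribution
print(sample_token(example_logits, temperature=0.8, top_p=0.9))
```

Lower temperatures concentrate probability on the top-scoring token, while a smaller P trims the low-probability tail before sampling; the two settings are often used together.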

Model	Company	Parameters	Context Window
GPT-4o	OpenAI	Undisclosed	128K tokens
Claude 3.5 Sonnet	Anthropic	Unknown	200K tokens
Gemini 1.5 Pro	Google	Unknown	1M tokens
LLaMA 3 70B	Meta	70B	8K tokens (128K in Llama 3.1)
Mistral Large 2	Mistral AI	123B	128K tokens

Real-World Example

GitHub Copilot, powered by a code-specialised LLM, writes ~46% of the code in files where it is enabled, according to GitHub's own data. It was trained on public GitHub repositories and has been adopted by millions of developers.
