Reinforcement Learning (RL) trains an agent to make sequential decisions through trial and error: the agent acts in an environment, is rewarded for good behaviour and penalised for bad behaviour, and learns to maximise its cumulative reward. It is inspired by how humans and animals learn.

Key Points

  • Agent: the learner/decision maker (e.g., a robot, a game-playing AI)
  • Environment: the world the agent interacts with (game engine, simulator, real world)
  • State (s): current situation of the environment observed by the agent
  • Action (a): choices the agent can make at each state
  • Reward (r): scalar signal indicating how good the last action was
  • Policy (π): the agent's strategy, a mapping from states to actions (or to a probability distribution over actions)
  • Q-Learning: model-free RL algorithm that learns the value of (state, action) pairs
  • Deep Q-Network (DQN): uses a neural network to approximate Q-values
  • Actor-Critic (PPO, A3C): separates policy (actor) and value estimation (critic)
  • RLHF (Reinforcement Learning from Human Feedback): used to align LLMs with human preferences
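
To make the terms above concrete, here is a minimal sketch of tabular Q-learning. The corridor environment, the function names (step, greedy_action, train) and the hyperparameter values are assumptions made up for illustration; they do not come from any particular library.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells. The agent starts in
# cell 0 and receives reward +1 on reaching cell 4 (the terminal state).
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def greedy_action(q, state):
    """Policy derived from Q: highest-valued action, ties broken at random."""
    best = max(q[state])
    return random.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table by interacting with the environment."""
    random.seed(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = greedy_action(q, state)
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy policy for each non-terminal state.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1]: always move right, toward the reward
```

The Q-table plays the role of the value estimate, and the epsilon-greedy rule is the policy. DQN keeps the same update target but replaces the table with a neural network.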

Algorithm    Type              Best For
Q-Learning   Model-free        Discrete action spaces, simple environments
DQN          Deep RL           Atari games, visual inputs
PPO          Policy gradient   Continuous control, robotics
AlphaZero    MCTS + RL         Board games (Chess, Go, Shogi)
RLHF         Human feedback    LLM alignment (ChatGPT)
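
As a rough illustration of the actor-critic idea behind PPO, the sketch below computes a one-step advantage from the critic's value estimates and then the per-sample clipped surrogate objective that the actor maximises. The function names and default values (gamma=0.99, clip_eps=0.2) are assumptions chosen for this example. A real implementation would average the objective over batches and optimise it with gradient ascent.

```python
def td_advantage(reward, value, next_value, gamma=0.99, done=False):
    """One-step advantage from the critic: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratio` is pi_new(a|s) / pi_old(a|s), the probability ratio between
    the updated policy and the policy that collected the data.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical numbers: the critic predicted V(s) = 0.5 and V(s') = 0.6,
# and the agent received reward 1.0 for the action it took.
adv = td_advantage(reward=1.0, value=0.5, next_value=0.6)
print(ppo_clipped_objective(ratio=1.5, advantage=adv))
```

Clipping removes the incentive to move the ratio far from 1. This keeps each policy update small, which is one reason PPO is a common choice for continuous control.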

Real-World Example

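DeepMind's AlphaGo defeated Go world champion Lee Sedol 4-1 (2016) using deep neural networks trained with supervised and reinforcement learning, combined with Monte Carlo tree search. OpenAI Five beat the reigning Dota 2 world champions, team OG (2019). RLHF was central to fine-tuning a GPT-3.5-series model into ChatGPT, making the model more helpful and less harmful.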