Reinforcement Learning
Reward signals, policy, Q-learning, actor-critic, games and robotics
Reinforcement Learning (RL) trains an agent to make sequential decisions through trial and error in an environment: actions that lead to good outcomes are rewarded, actions that lead to bad outcomes are penalised, and the agent learns to maximise its cumulative reward over time. It is inspired by how humans and animals learn.
Key Points
- Agent: the learner/decision maker (e.g., a robot, a game-playing AI)
- Environment: the world the agent interacts with (game engine, simulator, real world)
- State (s): current situation of the environment observed by the agent
- Action (a): choices the agent can make at each state
- Reward (r): scalar signal indicating how good the last action was
- Policy (π): the agent's strategy; maps each state to an action (or to a probability distribution over actions)
- Q-Learning: model-free, off-policy algorithm that learns the value Q(s, a) of each (state, action) pair
- Deep Q-Network (DQN): uses a neural network to approximate Q-values
- Actor-Critic (e.g., A3C, and PPO as commonly implemented): learns a policy (actor) together with a value estimate (critic) that evaluates the actor's choices
- RLHF (Reinforcement Learning from Human Feedback): used to align LLMs with human preferences
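The terms above can be made concrete with a small tabular Q-learning sketch. Everything specific here is an assumption made for illustration: the five-cell corridor environment, the `step`, `greedy_action` and `train` helpers, and the hyperparameter values (`alpha`, `gamma`, `epsilon`) are hypothetical and not taken from any particular library.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells (states 0..4).
# The agent starts in cell 0; reaching cell 4 gives reward +1 and ends the episode.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def greedy_action(q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy policy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy policy read off the learned Q-table for the non-terminal states.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

With these settings the learned greedy policy should choose "move right" (action 1) in every non-terminal state. A DQN follows the same update rule but replaces the `q` table with a neural network, which is what makes large or visual state spaces tractable.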
| Algorithm | Type | Best For |
|---|---|---|
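| Q-Learning | Value-based, model-free (tabular) | Small discrete state and action spaces, simple envs |
| DQN | Value-based, model-free (deep) | Atari games, visual inputs, discrete actions |
| PPO | Policy gradient (actor-critic) | Continuous control, robotics |
| AlphaZero | Self-play RL + MCTS | Board games (Chess, Go, Shogi) |
| RLHF | RL on a reward model learned from human feedback | LLM alignment (e.g., ChatGPT) |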
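As an illustration of the policy-gradient row, here is a minimal sketch of the per-sample clipped surrogate objective that PPO is known for. The function name `ppo_clip_objective`, its parameter names, and the default `clip_eps=0.2` are assumptions chosen for this example; real implementations compute the same quantity over batches of tensors and add value-function and entropy terms.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximised).

    new_logp / old_logp: log-probability of the taken action under the
    current policy and under the policy that collected the data.
    advantage: estimate of how much better the action was than average,
    typically supplied by the critic.
    """
    # Probability ratio between the new and old policies.
    ratio = math.exp(new_logp - old_logp)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


# A good action (positive advantage) whose probability rose by 50%:
# the gain is capped by the upper clip bound.
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))

# A bad action (negative advantage) whose probability was halved:
# the objective is capped by the lower clip bound.
print(ppo_clip_objective(math.log(0.5), 0.0, advantage=-1.0))
```

The clipping removes the incentive to move the policy far from the one that gathered the data in a single update, which is a large part of why PPO is considered stable enough for continuous-control tasks.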
Real-World Example
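DeepMind's AlphaGo defeated Go world champion Lee Sedol 4-1 in 2016 by combining deep neural networks, reinforcement learning, and Monte Carlo tree search. OpenAI Five beat the reigning Dota 2 world champions, team OG, in 2019. RLHF was a key step in turning a GPT-3.5-series model into ChatGPT, steering the model toward helpful and harmless responses.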