Reinforcement Learning

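Reinforcement Learning (RL) trains an agent to make sequential decisions through trial and error: the agent acts in an environment, receives rewards for good behaviour and penalties for bad behaviour, and adjusts its strategy to maximise the total reward it collects over time. It is loosely inspired by how humans and animals learn from the consequences of their actions.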

Key Points

Agent: the learner/decision maker (e.g., a robot, a game-playing AI)
Environment: the world the agent interacts with (game engine, simulator, real world)
State (s): the current situation of the environment as observed by the agent
Action (a): a choice available to the agent in a given state
Reward (r): scalar signal indicating how good the last action was
Policy (π): the agent's strategy, a mapping from states to actions
Q-Learning: a model-free, off-policy algorithm that learns the value Q(s, a) of each (state, action) pair
Deep Q-Network (DQN): uses a neural network to approximate Q-values instead of storing them in a table
Actor-Critic (PPO, A3C): pairs a policy (actor) with a value estimator (critic) whose estimates guide the policy updates
RLHF (Reinforcement Learning from Human Feedback): used to align LLMs with human preferences
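The terms above can be tied together with a minimal sketch of tabular Q-learning. Everything specific here is an assumption made for illustration, not something from the text: the five-cell corridor environment, the function names (step, greedy_action, train), and the hyperparameter values alpha, gamma and epsilon. The agent starts in cell 0, can move left or right, and receives a reward of +1 only when it reaches the last cell.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells, numbered 0..4.
# The agent starts in cell 0; reaching cell 4 ends the episode with reward +1.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    if action == 1:
        next_state = min(state + 1, GOAL)
    else:
        next_state = max(state - 1, 0)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def greedy_action(q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Run tabular Q-learning and return the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy policy: explore occasionally, otherwise exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy policy read off the Q-table for every non-terminal state.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

In this toy setting the learned policy should choose "move right" (action 1) in every non-terminal cell, because that is the shortest route to the reward. A DQN follows the same update idea but replaces the q table with a neural network.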

Algorithm	Type	Best For
Q-Learning	Model-free, value-based (tabular)	Small discrete state and action spaces, simple environments
DQN	Model-free, value-based (deep)	Atari games, visual inputs
PPO	Policy gradient (actor-critic)	Continuous control, robotics
AlphaZero	MCTS + self-play RL	Board games (Chess, Go, Shogi)
RLHF	RL from human feedback	LLM alignment (e.g., ChatGPT)
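To contrast the value-based rows of the table with the policy-gradient row, here is a minimal sketch of a plain REINFORCE-style update. It is deliberately not PPO, which adds a clipped objective and a critic on top of this idea. The two-armed bandit, the names (REWARDS, softmax, train_policy) and the learning rate are all assumptions made for illustration. Instead of learning a value for each action, the agent adjusts the parameters of its policy directly so that rewarded actions become more probable.

```python
import math
import random

# Hypothetical two-armed bandit: arm 0 always pays 1.0, arm 1 always pays 0.0.
REWARDS = (1.0, 0.0)


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(steps=2000, lr=0.1, seed=0):
    """Adjust a softmax policy with REINFORCE-style updates; return its probabilities."""
    rng = random.Random(seed)
    prefs = [0.0, 0.0]  # policy parameters, one preference per action
    for _ in range(steps):
        probs = softmax(prefs)
        # Sample an action from the current policy.
        action = rng.choices((0, 1), weights=probs)[0]
        reward = REWARDS[action]
        # Policy-gradient step: reward * gradient of log pi(action).
        # For a softmax policy that gradient is (1[a == action] - probs[a]).
        for a in (0, 1):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += lr * reward * (indicator - probs[a])
    return softmax(prefs)


final_probs = train_policy()
print(final_probs)
```

After training, nearly all of the probability mass should sit on arm 0, the only arm that pays out. No Q-table appears anywhere: the policy itself is the thing being learned.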

Real-World Example

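DeepMind's AlphaGo defeated Go world champion Lee Sedol 4-1 in 2016 by combining deep neural networks, reinforcement learning, and Monte Carlo tree search. OpenAI Five beat the reigning Dota 2 world champions (team OG) in 2019. RLHF was central to fine-tuning a GPT-3.5-series model into ChatGPT, steering the model toward helpful and harmless responses.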
