Introduction: Reinforcement Learning, Elements of Reinforcement Learning, Limitations and Scope, An Extended Example: Tic-Tac-Toe. Multi-armed Bandits: A k-armed Bandit Problem, Action-value Methods, The 10-armed Testbed, Incremental Implementation, Tracking a Nonstationary Problem, Optimistic Initial Values, Upper-Confidence-Bound Action Selection, Gradient Bandit Algorithms.
Finite Markov Decision Processes: The Agent–Environment Interface, Goals and Rewards, Returns and Episodes, Unified Notation for Episodic and Continuing Tasks, Policies and Value Functions, Optimal Policies and Optimal Value Functions, Optimality and Approximation.
Review of Markov Processes and Dynamic Programming.
Temporal-Difference Learning: TD Prediction, Advantages of TD Prediction Methods, Optimality of TD(0), Sarsa: On-policy TD Control, Q-learning: Off-policy TD Control, Expected Sarsa, Maximization Bias and Double Learning.