Introduction: Reinforcement Learning, Elements of Reinforcement Learning, Limitations and Scope, An Extended Example: Tic-Tac-Toe.
Multi-armed Bandits: A k-armed Bandit Problem, Action-value Methods, The 10-armed Testbed, Incremental Implementation, Tracking a Nonstationary Problem, Optimistic Initial Values, Upper-Confidence-Bound Action Selection, Gradient Bandit Algorithms.
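The action-value and incremental-implementation topics above can be sketched together in a few lines: an epsilon-greedy agent that maintains sample-average estimates with the incremental update Q := Q + (R - Q)/N. This is a minimal illustration, not part of the syllabus; the function name `epsilon_greedy_bandit` and the Gaussian reward model are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(arm_means, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a stationary k-armed Gaussian bandit,
    with incremental sample-average action-value estimates."""
    rng = random.Random(seed)
    k = len(arm_means)
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # pull counts per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                  # explore uniformly
        else:
            a = max(range(k), key=Q.__getitem__)  # exploit greedily
        r = rng.gauss(arm_means[a], 1.0)          # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]   # incremental mean: Q += (R - Q)/N
        total += r
    return Q, total / steps

Q, avg_reward = epsilon_greedy_bandit([0.2, 0.8, 0.5])
```

Replacing the 1/N step size with a constant alpha gives the tracking rule used for the nonstationary case covered in this unit.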
Finite Markov Decision Processes: The Agent–Environment Interface, Goals and Rewards, Returns and Episodes, Unified Notation for Episodic and Continuing Tasks, Policies and Value Functions, Optimal Policies and Optimal Value Functions, Optimality and Approximation. Review of Markov Processes and Dynamic Programming.
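The optimal value functions and dynamic programming review in this unit can be illustrated with value iteration, which repeatedly applies the Bellman optimality backup V(s) := max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')] until convergence. A minimal tabular sketch, assuming a hypothetical two-state MDP; the transition/reward encoding is illustrative.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration on a tabular MDP.
    P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the
    expected immediate reward for taking action a in state s."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            # Bellman optimality backup for state s
            v = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Two-state example: action 0 stays put (reward 0); action 1 moves to the
# other state, earning reward 1 only when leaving state 0.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 1.0], [0.0, 0.0]]
V = value_iteration(P, R)
```

The fixed point satisfies V(0) = 1 + gamma * V(1) and V(1) = gamma * V(0), i.e. V(0) = 1 / (1 - gamma^2).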
Temporal-Difference Learning: TD Prediction, Advantages of TD Prediction Methods, Optimality of TD(0), Sarsa: On-policy TD Control, Q-learning: Off-policy TD Control, Expected Sarsa, Maximization Bias and Double Learning.
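The off-policy TD control method of this unit, Q-learning, updates toward the greedy bootstrap target r + gamma * max_a' Q(s', a') regardless of the behavior policy. A minimal tabular sketch on a hypothetical 4-state corridor task; the environment, `corridor_step`, and all parameter values are illustrative assumptions.

```python
import random

def q_learning(n_states, n_actions, step, alpha=0.1, gamma=0.99,
               epsilon=0.1, episodes=500, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy.
    `step(s, a, rng)` returns (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)            # explore
            else:
                a = max(range(n_actions), key=Q[s].__getitem__)
            r, s2, done = step(s, a, rng)
            # off-policy target: bootstrap from the greedy action in s2
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# Corridor: states 0..3; action 1 moves right, action 0 moves left
# (stuck at 0); reaching state 3 terminates with reward +1.
def corridor_step(s, a, rng):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return (1.0, s2, True) if s2 == 3 else (0.0, s2, False)

Q = q_learning(4, 2, corridor_step)
```

Replacing `max(Q[s2])` with the epsilon-greedy expectation over Q[s2] gives Expected Sarsa; keeping two independent Q tables and bootstrapping one from the other's argmax gives the Double Learning fix for maximization bias, both covered in this unit.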
Eligibility Traces, Function Approximation, Fitted Q-iteration, DQN and Policy Gradient Methods for Full RL, Hierarchical RL.
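Eligibility traces and linear function approximation combine in semi-gradient TD(lambda) prediction: a trace vector z accumulates feature gradients and decays by gamma * lambda, and every weight is nudged by alpha * delta * z. A minimal sketch on a hypothetical 5-state random walk with one-hot features (so the linear weights are just tabular values); the function name and all parameters are illustrative.

```python
import random

def td_lambda_linear(n_states, episodes=200, alpha=0.05, gamma=1.0,
                     lam=0.8, seed=0):
    """Semi-gradient TD(lambda) prediction with one-hot features on a
    random walk: terminate left with reward 0, right with reward 1."""
    rng = random.Random(seed)
    w = [0.0] * n_states            # linear weights (one weight per state)
    for _ in range(episodes):
        z = [0.0] * n_states        # accumulating eligibility trace
        s = n_states // 2           # start in the middle
        while True:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            if s2 < 0:
                r, v2, done = 0.0, 0.0, True     # left terminal
            elif s2 >= n_states:
                r, v2, done = 1.0, 0.0, True     # right terminal
            else:
                r, v2, done = 0.0, w[s2], False
            delta = r + gamma * v2 - w[s]        # TD error
            z[s] += 1.0   # grad of v(s) = w[s] under one-hot features
            for i in range(n_states):
                w[i] += alpha * delta * z[i]
                z[i] *= gamma * lam              # decay all traces
            if done:
                break
            s = s2
    return w

w = td_lambda_linear(5)
```

With general (non-one-hot) features the same update applies with z accumulating the feature vector of the current state; DQN and policy-gradient methods in this unit replace the linear approximator with a neural network.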