COS 435 / ECE 433: Reinforcement Learning

Princeton Polaris Lab
PRINCETON UNIVERSITY, SPRING 2026
Location: Friend Center 101
Time: Friday 1:30pm-4:20pm
Instructor

Prof. Peter Henderson

Assistant Professor in CS/SPIA

Office Hours: By appointment

Teaching Assistants

Zeyu Shen

Office Hours: Fri 9-10am

Sherrerd Hall, 3rd Floor

Kincaid MacDonald

Office Hours: Wed 3-4pm

Friend Center 010

Raj H. Ghugare

Office Hours: Mon 3-4pm

COS Building 003

Chongyi Zheng

Office Hours: Thu 3-4pm

COS Building 302

Course Description

This course provides an introductory overview of reinforcement learning (RL), a machine learning paradigm where agents learn to make decisions by interacting with their environment. We will cover fundamental concepts such as Markov Decision Processes, value functions, and policy optimization. Students will learn important RL algorithms including Q-learning, policy gradient methods, and actor-critic approaches. We will also address key challenges in RL such as exploration, generalization, and sample efficiency. Applications of RL to real-world problems—including robotics, healthcare, and molecular science—will be highlighted throughout the course. Assignments will involve implementing RL algorithms and conducting mathematical analyses. Students will complete an open-ended final group project.
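
The interaction loop at the heart of RL can be made concrete in a few lines. Below is a minimal sketch of an agent acting in an environment, written against the Gymnasium API (an assumption for illustration; the course's actual tooling may differ), with a random policy standing in for a learned one:

    # Toy agent-environment loop (Gymnasium API assumed; random policy for illustration).
    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)
    episode_return, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # RL replaces this with a learned policy
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        done = terminated or truncated
    env.close()
    print(f"episode return: {episode_return}")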

Prerequisites

Students should have a solid foundation in machine learning and mathematics, including familiarity with probability, statistics, and linear algebra. Prior completion of courses such as COS 324 (Introduction to Machine Learning) or equivalent is recommended. Programming experience in Python is required.

Course Expectations & Grading

Components

  • Participation (15%): Starting in week 3: in-class polling questions (via Google Form), breakout discussions on assigned papers, and reading reflections on assigned papers submitted with a marked-up PDF of the paper.
  • Problem Sets (15%): Three assignments of short theory problems, due every other week starting in week 3.
  • Programming Assignments (20%): Three assignments of small programming tasks, starting in week 3.
  • Final Project (50%): The biggest component! An open-ended research project on a topic in RL; aim for academic workshop-level quality.

Policies

  • Late Submissions: Late assignments will incur a penalty of 10% per day, up to a maximum of three days. After three days, assignments will not be accepted unless prior arrangements are made.
  • Academic Integrity: Students are expected to adhere to Princeton University's academic integrity policies. Using LLMs to solve assignments is NOT permitted; you must understand and be able to explain all code you submit. Limited LLM use is acceptable for building a basic understanding of concepts and ideas, and for help with code on more complicated projects, but for writing it should be used only minimally, as a post-draft check. In all cases, you are responsible for the content you submit.
  • Collaboration: You may discuss problem sets with classmates, but must write up solutions independently. List collaborators on your submission.

Resources

Lecture Notes

Lecture notes are posted on the course website.

Textbook

  • Required: None; see the lecture notes above.

Optional Textbooks

  • Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
  • Reinforcement Learning: Bit by Bit by Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, and Zheng Wen
  • Bandit Algorithms by Tor Lattimore and Csaba Szepesvári (if you're interested in bandits)
  • Algorithms for Reinforcement Learning by Csaba Szepesvári
  • Mathematical Foundations of Reinforcement Learning by Shiyu Zhao

Supplementary Materials

  • Selected research papers for advanced topics
  • OpenAI Spinning Up in Deep RL [Link]

Course Schedule

Schedule is tentative and subject to change. Check the course website for the most up-to-date information.

Week 1: Course Introduction & Foundations
Lecture 1 (Jan 30): Course intro, what is RL, the Markov Decision Process (MDP), value iteration, and policy iteration.
[Slides]
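
To preview the Week 1 material, here is a minimal value-iteration sketch on a small randomly generated MDP (illustrative only; the lecture's notation and conventions may differ):

    # Value iteration on a random finite MDP (a toy sketch, not the lecture's code).
    import numpy as np

    n_states, n_actions, gamma = 4, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over s'
    R = rng.random((n_states, n_actions))                             # reward R[s, a]

    V = np.zeros(n_states)
    for _ in range(1000):
        Q = R + gamma * P @ V      # Bellman backup: Q[s, a] = R[s, a] + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    greedy_policy = Q.argmax(axis=1)  # greedy policy with respect to the converged values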
Week 2: Value-based RL
Lecture 2 (Feb 6): Q-learning, value-based methods, and value function learning.
[Slides (pre-lecture)]
Optional readings: pick any two.
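
As a preview of Week 2, the core of tabular Q-learning fits in a few lines (a toy sketch; the deep variants covered in lecture replace the table with a neural network):

    # Tabular Q-learning update and epsilon-greedy action selection (toy sketch).
    import numpy as np

    n_states, n_actions = 16, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.99, 0.1
    rng = np.random.default_rng(0)

    def act(s):
        # epsilon-greedy: explore with probability eps, otherwise exploit
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(Q[s].argmax())

    def update(s, a, r, s_next, terminated):
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        target = r if terminated else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])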
Week 3: Value-based RL (cont'd)
Lecture 3 (Feb 13): Continuation of value-based methods and DDPG.
Optional readings: pick any two.

Week 4: Policy Gradient and Actor-Critic Methods
Lecture 4 (Feb 20): REINFORCE, policy gradients, and TRPO.
Optional readings: pick any two.
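
For a taste of Week 4, here is a bare-bones REINFORCE loss for a single episode in PyTorch (a sketch under simplifying assumptions: no baseline and no batching):

    # REINFORCE loss for one episode (toy sketch; no baseline, no batching).
    import torch

    def reinforce_loss(log_probs, rewards, gamma=0.99):
        # log_probs: list of log pi(a_t | s_t) tensors saved while acting
        # rewards:   list of scalar rewards r_t from the same episode
        returns, G = [], 0.0
        for r in reversed(rewards):       # G_t = r_t + gamma * G_{t+1}
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        returns = torch.tensor(returns)
        # ascend E[G_t * log pi(a_t | s_t)] by minimizing its negative
        return -(torch.stack(log_probs) * returns).sum()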
Week 5: Actor-Critic Methods
Lecture 5 (Feb 27): Bias-variance trade-offs, actor-critic methods, baselines as control variates, and GRPO.
Optional readings: pick any two.
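
And as a glimpse of Week 5's variance-reduction idea, subtracting a learned value baseline from the return turns it into an advantage (a sketch; lecture conventions may differ):

    # Actor and critic losses with a baseline as control variate (toy sketch).
    import torch
    import torch.nn.functional as F

    def actor_loss(log_probs, returns, values):
        # advantage A_t = G_t - V(s_t); detach V so this loss only trains the actor
        advantages = returns - values.detach()
        return -(log_probs * advantages).sum()

    def critic_loss(values, returns):
        # regress V(s_t) toward the observed returns
        return F.mse_loss(values, returns)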

Frequently Asked Questions

How do I enroll in this course?

This course is closed for enrollment.

What are the prerequisites?

Students should have completed COS 324 (Introduction to Machine Learning) or an equivalent course. Familiarity with probability, statistics, linear algebra, and Python programming is required.

Is this course suitable for graduate students?

Yes! This course is open to both undergraduate and graduate students. Graduate students may be expected to complete a more advanced final project.

What programming language will we use?

All assignments will be in Python using standard ML libraries (NumPy, PyTorch). Familiarity with these tools is helpful but not required—we will provide tutorials.

Can the final project be individual?

Generally no. Due to the size of the enrollment, we will require 3-5 students per group, except in exceptional circumstances.

Can I audit the course?

Formal auditing is not possible, but if there's room you can sit in on lectures.