Towards Optimal Offline Reinforcement Learning
In "Seminars and talks"

Speakers

Mengmeng Li

EPFL

Mengmeng Li is a Ph.D. candidate at the College of Management of Technology at EPFL, advised by Prof. Daniel Kuhn. She holds an M.Sc. in Mathematics from EPFL and a B.Sc. in Honors Mathematics from NYU (Shanghai & New York). Her research focuses on developing statistically and computationally efficient methods for data-driven sequential decision-making, with applications in operations management, business analytics, and sustainability. More information is available on her personal webpage: https://thatmengmengli.github.io/.


Date:
Friday, 6 March 2026
Time:
10:00 am - 11:30 am
Venue:
NUS Business School
Mochtar Riady Building BIZ1 0204
15 Kent Ridge Drive
Singapore 119245

Abstract

We study offline reinforcement learning problems with a long-run average reward objective. The state-action pairs generated by any fixed behavioral policy form a Markov chain, and the empirical state-action-next-state distribution therefore satisfies a large deviations principle. We use the rate function of this large deviations principle to construct an uncertainty set for the unknown true state-action-next-state distribution. We also construct a distribution shift transformation that maps any distribution in this uncertainty set to a state-action-next-state distribution of the Markov chain generated by a fixed evaluation policy, which may differ from the unknown behavioral policy. We prove that the worst-case average reward of the evaluation policy with respect to all distributions in the shifted uncertainty set provides, in a rigorous statistical sense, the least conservative estimator for the average reward under the unknown true distribution. This guarantee holds even if only a single trajectory of serially correlated state-action pairs is available. The emerging robust optimization problem can be viewed as a robust Markov decision process with a non-rectangular uncertainty set, and we adapt an efficient policy gradient algorithm to solve it. Numerical experiments show that our methods compare favorably with state-of-the-art methods.
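To give a flavor of the pessimistic-evaluation idea in the abstract, the toy sketch below estimates the transition probabilities of a two-state Markov chain from a single simulated trajectory and then computes a worst-case long-run average reward over a small KL-divergence ball around the empirical distribution. This is only an illustration under simplifying assumptions (a two-state chain, a KL ball in place of the paper's large-deviations rate function, and crude grid search in place of the paper's robust policy gradient method); all names and parameters here are hypothetical.

```python
# Illustrative sketch, NOT the paper's algorithm: pessimistic evaluation of the
# long-run average reward of a two-state Markov chain, using a KL uncertainty
# ball around transition probabilities estimated from one trajectory.
import math
import random

def stationary(p01, p10):
    # Stationary distribution of a 2-state chain with flip probs p01 = P(0->1),
    # p10 = P(1->0): pi = (p10, p01) / (p01 + p10).
    denom = p01 + p10
    return (p10 / denom, p01 / denom)

def avg_reward(p01, p10, r=(0.0, 1.0)):
    # Long-run average reward = stationary distribution dotted with rewards.
    pi0, pi1 = stationary(p01, p10)
    return pi0 * r[0] + pi1 * r[1]

def kl_bernoulli(q, p):
    # KL divergence between Bernoulli(q) and Bernoulli(p), clipped for safety.
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def worst_case_avg_reward(p01_hat, p10_hat, radius, grid=200):
    # Crude grid search over transition probabilities inside the KL ball; the
    # paper instead solves a robust MDP with a policy gradient method.
    worst = float("inf")
    for i in range(1, grid):
        for j in range(1, grid):
            q01, q10 = i / grid, j / grid
            if kl_bernoulli(q01, p01_hat) + kl_bernoulli(q10, p10_hat) <= radius:
                worst = min(worst, avg_reward(q01, q10))
    return worst

# Simulate a single serially correlated trajectory from the true chain and
# count observed transitions to build the empirical distribution.
random.seed(0)
true_p01, true_p10 = 0.3, 0.2
s, T = 0, 20000
counts = [[0, 0], [0, 0]]
for _ in range(T):
    flip = true_p01 if s == 0 else true_p10
    nxt = 1 - s if random.random() < flip else s
    counts[s][nxt] += 1
    s = nxt
p01_hat = counts[0][1] / (counts[0][0] + counts[0][1])
p10_hat = counts[1][0] / (counts[1][0] + counts[1][1])

nominal = avg_reward(p01_hat, p10_hat)
pessimistic = worst_case_avg_reward(p01_hat, p10_hat, radius=0.01)
print("nominal:", round(nominal, 3), "pessimistic:", round(pessimistic, 3))
```

Shrinking the radius toward zero recovers the plug-in (nominal) estimate, while a larger radius yields a more conservative lower bound; the abstract's result is that calibrating the uncertainty set with the large-deviations rate function gives the least conservative estimator with a valid statistical guarantee.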