Abstract
Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state-of-the-art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while the mean performance of algorithms that incorporate the uncertainty as a penalty on the action-value is higher, actively restricting the set of policies more consistently produces good policies and is, thus, safer.
| Original language | English |
|---|---|
| Pages (from-to) | 142-151 |
| Number of pages | 10 |
| Journal | International Conference on Agents and Artificial Intelligence |
| Volume | 2 |
| DOIs | |
| State | Published - 2022 |
| Event | 14th International Conference on Agents and Artificial Intelligence, ICAART 2022 - Virtual, Online. Duration: 3 Feb 2022 → 5 Feb 2022 |
Keywords
- Markov Decision Processes
- Risk-sensitive Reinforcement Learning
- Safe Policy Improvement
Fingerprint
Dive into the research topics of 'Safe Policy Improvement Approaches on Discrete Markov Decision Processes'. Together they form a unique fingerprint.