TY - JOUR
T1 - Safe Policy Improvement Approaches on Discrete Markov Decision Processes
AU - Scholl, Philipp
AU - Dietrich, Felix
AU - Otte, Clemens
AU - Udluft, Steffen
N1 - Publisher Copyright:
© 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Safe Policy Improvement (SPI) aims to provide provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state-of-the-art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while algorithms that incorporate the uncertainty as a penalty on the action-value achieve a higher mean performance, algorithms that actively restrict the set of policies produce good policies more consistently and are thus safer.
AB - Safe Policy Improvement (SPI) aims to provide provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state-of-the-art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while algorithms that incorporate the uncertainty as a penalty on the action-value achieve a higher mean performance, algorithms that actively restrict the set of policies produce good policies more consistently and are thus safer.
KW - Markov Decision Processes
KW - Risk-sensitive Reinforcement Learning
KW - Safe Policy Improvement
UR - http://www.scopus.com/inward/record.url?scp=85146882818&partnerID=8YFLogxK
U2 - 10.5220/0010786600003116
DO - 10.5220/0010786600003116
M3 - Conference article
AN - SCOPUS:85146882818
SN - 2184-3589
VL - 2
SP - 142
EP - 151
JO - International Conference on Agents and Artificial Intelligence
JF - International Conference on Agents and Artificial Intelligence
T2 - 14th International Conference on Agents and Artificial Intelligence, ICAART 2022
Y2 - 3 February 2022 through 5 February 2022
ER -