TY - CHAP
T1 - Safe Policy Improvement Approaches and Their Limitations
AU - Scholl, Philipp
AU - Dietrich, Felix
AU - Otte, Clemens
AU - Udluft, Steffen
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
AB - Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety-critical applications, as it improves the behavior policy with high probability. We classify various SPI approaches from the literature into two groups based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also examine the safety guarantees of the provably safe algorithms and show that huge amounts of data are necessary for the safety bounds to become useful in practice.
KW - Markov decision processes
KW - Risk-sensitive reinforcement learning
KW - Safe policy improvement
UR - http://www.scopus.com/inward/record.url?scp=85149650734&partnerID=8YFLogxK
DO - 10.1007/978-3-031-22953-4_4
M3 - Conference contribution
AN - SCOPUS:85149650734
SN - 9783031229527
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 74
EP - 98
BT - Agents and Artificial Intelligence - 14th International Conference, ICAART 2022, Revised Selected Papers
A2 - Rocha, Ana Paula
A2 - Steels, Luc
A2 - van den Herik, Jaap
PB - Springer Science and Business Media Deutschland GmbH
T2 - 14th International Conference on Agents and Artificial Intelligence, ICAART 2022
Y2 - 3 February 2022 through 5 February 2022
ER -