TY - GEN
T1 - Value iteration for long-run average reward in Markov decision processes
AU - Ashok, Pranav
AU - Chatterjee, Krishnendu
AU - Daca, Przemysław
AU - Křetínský, Jan
AU - Meggendorfer, Tobias
N1 - Publisher Copyright:
© Springer International Publishing AG 2017.
PY - 2017
Y1 - 2017
N2 - Markov decision processes (MDPs) are standard models for probabilistic systems with non-deterministic behaviours. Long-run average rewards provide a mathematically elegant formalism for expressing long-term performance. Value iteration (VI) is one of the simplest and most efficient algorithmic approaches to MDPs with other properties, such as reachability objectives. Unfortunately, a naive extension of VI does not work for MDPs with long-run average rewards, as there is no known stopping criterion. In this work, our contributions are threefold. (1) We refute a conjecture related to stopping criteria for MDPs with long-run average rewards. (2) We present two practical VI-based algorithms for MDPs with long-run average rewards. First, we show that a combination of applying VI locally for each maximal end-component (MEC) and VI for reachability objectives can provide approximation guarantees. Second, extending the above approach with a simulation-guided on-demand variant of VI, we present an anytime algorithm that is able to deal with very large models. (3) Finally, we present experimental results showing that our methods significantly outperform the standard approaches on several benchmarks.
UR - https://www.scopus.com/pages/publications/85026745865
U2 - 10.1007/978-3-319-63387-9_10
DO - 10.1007/978-3-319-63387-9_10
M3 - Conference contribution
AN - SCOPUS:85026745865
SN - 9783319633862
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 201
EP - 221
BT - Computer Aided Verification - 29th International Conference, CAV 2017, Proceedings
A2 - Kuncak, Viktor
A2 - Majumdar, Rupak
PB - Springer Verlag
T2 - 29th International Conference on Computer Aided Verification, CAV 2017
Y2 - 24 July 2017 through 28 July 2017
ER -