Abstract
We consider online learning of policies in Markov decision processes with the long-run average reward objective (also known as mean payoff). To ensure that the policies are implementable, we focus on policies with finite memory. First, we show that near-optimality can be achieved almost surely, using a counterintuitive gadget we call forgetfulness. Second, we extend the approach to a setting with partial knowledge of the system topology, introducing two optimality measures and providing near-optimal algorithms for these cases as well.
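To make the objective concrete, the following is a minimal sketch of how the long-run average reward of a fixed policy could be estimated by simulation. It is not the learning algorithm from the paper, and it does not implement the forgetfulness gadget. The toy two-state MDP, the names `TRANSITIONS`, `step`, and `mean_payoff`, and the memoryless policy are all assumptions made for illustration.

```python
import random

# Hypothetical toy MDP (not from the paper):
# TRANSITIONS[state][action] -> list of (next_state, probability, reward)
TRANSITIONS = {
    0: {"stay": [(0, 1.0, 0.0)], "go": [(1, 0.9, 1.0), (0, 0.1, 0.0)]},
    1: {"stay": [(1, 1.0, 2.0)], "go": [(0, 1.0, 0.0)]},
}


def step(state, action, rng):
    """Sample (next_state, reward) from the toy transition table."""
    outcomes = TRANSITIONS[state][action]
    weights = [prob for _, prob, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def mean_payoff(policy, steps, seed=0, start=0):
    """Average reward over a finite run, a proxy for the long-run average."""
    rng = random.Random(seed)
    state, total = start, 0.0
    for _ in range(steps):
        state, reward = step(state, policy[state], rng)
        total += reward
    return total / steps


# A memoryless policy, the simplest kind of finite-memory policy:
# it picks one fixed action per state.
policy = {0: "go", 1: "stay"}
estimate = mean_payoff(policy, steps=10_000)
```

In this toy model the policy moves to state 1 and then collects reward 2 at every step, so the estimate approaches 2 as the run gets longer. The reward collected before reaching state 1 has a vanishing effect on the average, which is what distinguishes the mean-payoff objective from a discounted one.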
Original language | English |
---|---|
Pages | 1149-1158 |
Number of pages | 10 |
State | Published - 2020 |
Event | 36th Conference on Uncertainty in Artificial Intelligence, UAI 2020 - Virtual, Online; Duration: 3 Aug 2020 → 6 Aug 2020 |
Conference
Conference | 36th Conference on Uncertainty in Artificial Intelligence, UAI 2020 |
---|---|
City | Virtual, Online |
Period | 3/08/20 → 6/08/20 |