Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

Publication: Contribution to book/report › Conference contribution › Peer-reviewed

45 citations (Scopus)

Abstract

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves O(L|X|√(|A|T)) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes. To our knowledge, our algorithm is the first to ensure O(√T) regret in this challenging setting; in fact it achieves the same regret as (Rosenberg and Mansour, 2019a) who consider the easier setting with full information. Our key contributions are two-fold: a tighter confidence set for the transition function; and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
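The inverse weighting mentioned in the abstract can be sketched as follows. The notation below is an assumption reconstructed from the abstract's description, not quoted from the paper: u_t denotes the upper occupancy bound and γ an exploration parameter.

```latex
% Hedged sketch of an optimistic, inversely weighted loss estimator.
% u_t(x,a): upper occupancy bound, i.e. the largest probability of
%   visiting (x,a) in episode t over all transition functions in the
%   current confidence set (names are illustrative assumptions).
% gamma > 0: a small bias parameter ensuring high-probability bounds.
\hat{\ell}_t(x,a)
  \;=\;
  \frac{\ell_t(x,a)}{u_t(x,a) + \gamma}\,
  \mathbb{1}\bigl\{(x,a)\ \text{is visited in episode } t\bigr\}
```

Dividing by an upper bound on the visitation probability (rather than the unknown true probability) keeps the estimator optimistic, i.e. biased downward, which is what permits the high-probability O(√T) analysis the abstract claims.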

Original language: English
Title: 37th International Conference on Machine Learning, ICML 2020
Editors: Hal Daume, Aarti Singh
Publisher: International Machine Learning Society (IMLS)
Pages: 4810-4819
Number of pages: 10
ISBN (electronic): 9781713821120
Publication status: Published - 2020
Externally published: Yes
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: 13 July 2020 - 18 July 2020

Publication series

Name: 37th International Conference on Machine Learning, ICML 2020
Volume: PartF168147-7

Conference

Conference: 37th International Conference on Machine Learning, ICML 2020
Location: Virtual, Online
Period: 13/07/20 - 18/07/20

