
Best Policy Identification in Linear MDPs

EasyChair Preprint 10990

57 pages
Date: September 30, 2023

Abstract

We consider the problem of best policy identification in discounted Linear Markov Decision Processes in the fixed-confidence setting, under both generative and forward models. We derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but it can serve as the starting point for devising simple and near-optimal sampling rules and algorithms, which we do. In the generative model, our algorithm exhibits a sample complexity upper bounded by ${\cal O}( (d(1-\gamma)^{-4}/ (\varepsilon+\Delta)^2) (\log(1/\delta)+d))$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$) and matches existing minimax and gap-dependent lower bounds. In the forward model, we determine a condition under which learning approximately optimal policies is possible; this condition is weak and requires the MDP to be neither ergodic nor communicating. Under this condition, the sample complexity of our algorithm is asymptotically (as $\delta$ approaches 0) upper bounded by ${\cal O}((\sigma^\star(1-\gamma)^{-4}/(\varepsilon+\Delta)^2) \log(1/\delta))$, where $\sigma^\star$ is an instance-specific constant, the value of an optimal experiment-design problem. To derive this bound, we establish novel concentration results for random matrices built on Markovian data.
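
For intuition, here is a minimal Python sketch (not from the paper; the function name and the omission of constant factors are illustrative assumptions) that evaluates how the generative-model sample-complexity bound scales with the instance parameters:

import math

def generative_sample_complexity(d, gamma, eps, delta, Delta=0.0):
    # Order of the abstract's generative-model bound:
    #   O( d (1-gamma)^{-4} / (eps + Delta)^2 * (log(1/delta) + d) )
    # Constant factors are dropped; this shows scaling only.
    return d * (1 - gamma) ** -4 / (eps + Delta) ** 2 * (math.log(1 / delta) + d)

# Example: d = 10 features, gamma = 0.9, eps = 0.1, delta = 0.05, gap Delta = 0.2
print(f"{generative_sample_complexity(10, 0.9, 0.1, 0.05, 0.2):.3e}")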

Keyphrases: Best policy identification, Linear MDP, forward model

BibTeX entry
BibTeX does not have a dedicated entry type for preprints; the following workaround produces the correct reference:
@booklet{EasyChair:10990,
  author       = {Jerome Taupin and Yassir Jedra and Alexandre Proutiere},
  title        = {Best Policy Identification in Linear MDPs},
  howpublished = {EasyChair Preprint 10990},
  year         = {EasyChair, 2023}}