Best Policy Identification in Linear MDPs

EasyChair Preprint 10990

57 pages • Date: September 30, 2023

Abstract

We consider the problem of best policy identification in discounted Linear Markov Decision Processes in the fixed-confidence setting, under both generative and forward models. We derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but it can be used as the starting point to devise simple and near-optimal sampling rules and algorithms, which we do. Under the generative model, our algorithm exhibits a sample complexity upper bounded by ${\cal O}\left( \frac{d}{(1-\gamma)^{4}(\varepsilon+\Delta)^2} \left(\log(1/\delta)+d\right)\right)$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$) and matches existing minimax and gap-dependent lower bounds. Under the forward model, we determine a condition under which learning approximately optimal policies is possible; this condition is weak and does not require the MDP to be ergodic or communicating. Under this condition, the sample complexity of our algorithm is asymptotically (as $\delta$ approaches 0) upper bounded by ${\cal O}\left(\frac{\sigma^\star}{(1-\gamma)^{4}(\varepsilon+\Delta)^2} \log(1/\delta)\right)$, where $\sigma^\star$ is an instance-specific constant, the value of an optimal experiment-design problem. To derive this bound, we establish novel concentration results for random matrices built from Markovian data.

Keyphrases: Best policy identification, Linear MDP, forward model
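For intuition on how the stated bounds scale, here is a minimal sketch that simply evaluates the two sample-complexity expressions from the abstract, assuming the hidden universal constant is 1 and using made-up parameter values (d, gamma, eps, Delta, delta, sigma_star are illustrative choices, not values from the paper):

```python
import math

def generative_bound(d, gamma, eps, Delta, delta):
    """Order of O( d (1-gamma)^-4 / (eps+Delta)^2 * (log(1/delta) + d) )."""
    return d / ((1 - gamma) ** 4 * (eps + Delta) ** 2) * (math.log(1 / delta) + d)

def forward_bound(sigma_star, gamma, eps, Delta, delta):
    """Order of O( sigma* (1-gamma)^-4 / (eps+Delta)^2 * log(1/delta) ), asymptotic in delta."""
    return sigma_star / ((1 - gamma) ** 4 * (eps + Delta) ** 2) * math.log(1 / delta)

# Hypothetical instance: d=10, gamma=0.9, eps=0.1, Delta=0.05, delta=0.01.
print(generative_bound(10, 0.9, 0.1, 0.05, 0.01))   # ~6.5e7 (order of magnitude only)
# sigma_star = 5.0 is a made-up value; it is instance-specific in the paper.
print(forward_bound(5.0, 0.9, 0.1, 0.05, 0.01))     # ~1.0e7 (order of magnitude only)
```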