If we convert this problem to a multi-armed bandit (MAB) and run UCB, what is the complexity?
Unique (deterministic, time-dependent) policies: $|A|^{|S|H}$ -> all treated as arms
Therefore UCB gives regret $\tilde{O}\big(\sqrt{|A|^{|S|H}\,K}\big)$, which is really bad (exponential in $|S|$ and $H$).
WHY? Policies should not be treated as independent arms!
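To make the blow-up concrete, here is a quick sanity check with toy sizes (the specific numbers below are illustrative, not from the notes):

```python
import math

# Toy MDP sizes (illustrative values only).
S, A, H, K = 10, 2, 10, 100_000

# A deterministic, time-dependent policy maps (state, step) -> action,
# so the naive MAB reduction creates |A|^(|S| * H) "arms".
num_arms = A ** (S * H)
print(f"number of arms: {num_arms:.3e}")              # ~1.27e+30

# UCB over N arms with K pulls suffers regret on the order of sqrt(N * K),
# which here is astronomically larger than the trivial bound of order K.
print(f"sqrt(N * K)   : {math.sqrt(num_arms * K):.3e}")
print(f"K             : {float(K):.3e}")
```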
UCB-VI
Use all previous data to estimate transitions
Design reward bonus
Optimistic planning with learned model:
Let us consider the very beginning of episode $k$:
Part 1: Model estimation
For each $h, s, a, s'$, define the counts from the first $k$ episodes,
$$N^k_h(s,a) = \sum_{i=0}^{k-1}\mathbf{1}\big\{(s^i_h, a^i_h) = (s,a)\big\}, \qquad N^k_h(s,a,s') = \sum_{i=0}^{k-1}\mathbf{1}\big\{(s^i_h, a^i_h, s^i_{h+1}) = (s,a,s')\big\},$$
and we have the empirical estimate:
$$\hat{P}^k_h(s'\mid s,a) = \frac{N^k_h(s,a,s')}{N^k_h(s,a)}.$$
Part 2: Reward Bonus Design and Value Iteration
We use a bonus to encourage exploration:
$$b^k_h(s,a) = 2H\sqrt{\frac{\ln(SAHK/\delta)}{N^k_h(s,a)}}.$$
Now we perform VI starting from $\hat{V}^k_H(s) = 0$ for all $s$:
$$\hat{Q}^k_h(s,a) = \min\Big\{r_h(s,a) + b^k_h(s,a) + \hat{P}^k_h(\cdot\mid s,a)\cdot\hat{V}^k_{h+1},\; H\Big\}, \qquad \hat{V}^k_h(s) = \max_a \hat{Q}^k_h(s,a), \qquad \pi^k_h(s) = \arg\max_a \hat{Q}^k_h(s,a).$$
Remarks: The truncation of $\hat{Q}^k_h$ at $H$ is due to the fact that the reward is bounded in $[0,1]$, and thus no policy's Q value can be larger than $H$.
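Putting Parts 1 and 2 together, here is a minimal sketch (in NumPy) of the planning step UCBVI performs at the beginning of an episode, assuming the tabular, time-dependent notation above with known rewards; the array layout and function name are implementation choices, not from the notes:

```python
import numpy as np

def ucbvi_plan(counts_sas, counts_sa, r, H, K, delta=0.05):
    """One UCBVI planning step at the start of an episode (sketch).

    counts_sas: (H, S, A, S) array of N_h^k(s, a, s')
    counts_sa : (H, S, A)    array of N_h^k(s, a)
    r         : (H, S, A)    array of known rewards in [0, 1]
    Returns the greedy policy pi (H, S) and optimistic values V_hat (H+1, S).
    """
    _, S, A, _ = counts_sas.shape
    n = np.maximum(counts_sa, 1)          # avoid division by zero for unvisited (h, s, a)
    P_hat = counts_sas / n[..., None]     # empirical model P_hat_h^k(s'|s,a); unvisited pairs
                                          # get a zero row, but their bonus below is large anyway

    # Hoeffding-style bonus: b_h^k(s,a) = 2H * sqrt(ln(S A H K / delta) / N_h^k(s,a)).
    bonus = 2 * H * np.sqrt(np.log(S * A * H * K / delta) / n)

    V_hat = np.zeros((H + 1, S))          # V_hat_H^k = 0
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):          # value iteration, h = H-1, ..., 0
        Q_hat = r[h] + bonus[h] + P_hat[h] @ V_hat[h + 1]
        Q_hat = np.minimum(Q_hat, H)      # truncate at H (rewards lie in [0, 1])
        pi[h] = Q_hat.argmax(axis=1)
        V_hat[h] = Q_hat.max(axis=1)
    return pi, V_hat
```

The returned greedy policy is executed for the whole episode, the new transitions are added to the counts, and the procedure repeats.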
Analysis
Theorem 7.1 (Regret Bound of UCBVI). UCBVI achieves the following regret bound:
$$\mathbb{E}\big[\mathrm{Regret}_K\big] := \mathbb{E}\Big[\sum_{k=0}^{K-1}\big(V^\star_0(s_0) - V^{\pi^k}_0(s_0)\big)\Big] \le \widetilde{O}\big(H^2 S\sqrt{AK}\big).$$
Remarks: A sharper bound can be obtained with a more refined analysis (see the improved regret bound below); the bound above is suboptimal in its dependence on $S$ and $H$.
Remarks: Note that the theorem is an expectation statement, but a high-probability version should not be hard to obtain with a martingale argument.
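For a sense of scale, the same toy sizes as in the MAB comparison above give the following (using the $H^2 S\sqrt{AK}$ rate from Theorem 7.1, dropping constants and log factors; purely illustrative):

```python
import math

S, A, H, K = 10, 2, 10, 100_000            # same toy sizes as before

naive_mab = math.sqrt((A ** (S * H)) * K)  # every policy treated as an arm
ucbvi     = H**2 * S * math.sqrt(A * K)    # Theorem 7.1 rate, log factors dropped
print(f"naive MAB/UCB regret scale: {naive_mab:.3e}")
print(f"UCBVI regret scale        : {ucbvi:.3e}")   # polynomial in S, A, H
```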
Sketch of proof:
The bonus is designed to dominate the model estimation error on $V^\star_{h+1}$: under the event of Lemma 7.3, $b^k_h(s,a) \ge \big|\big(\hat{P}^k_h(\cdot\mid s,a) - P_h(\cdot\mid s,a)\big)\cdot V^\star_{h+1}\big|$.
VI with the bonus inside the learned model gives optimism, i.e., $\hat{V}^k_0(s_0) \ge V^\star_0(s_0)$ for every episode $k$ (Lemma 7.4).
Upper bound the per-episode regret via optimism: $V^\star_0(s_0) - V^{\pi^k}_0(s_0) \le \hat{V}^k_0(s_0) - V^{\pi^k}_0(s_0)$.
Apply the simulation lemma to $\hat{V}^k_0(s_0) - V^{\pi^k}_0(s_0)$ and sum over episodes (see the decomposition sketched below).
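As a rough sketch of the last two steps (constants and log factors omitted; this reconstructs the standard calculation rather than quoting the notes), under optimism the per-episode regret decomposes as
$$\begin{aligned}
V^\star_0(s_0) - V^{\pi^k}_0(s_0)
&\le \hat{V}^k_0(s_0) - V^{\pi^k}_0(s_0) \qquad \text{(optimism, Lemma 7.4)}\\
&\le \sum_{h=0}^{H-1}\mathbb{E}_{(s_h,a_h)\sim d^{\pi^k}_h}\Big[b^k_h(s_h,a_h) + \big(\hat{P}^k_h(\cdot\mid s_h,a_h) - P_h(\cdot\mid s_h,a_h)\big)\cdot\hat{V}^k_{h+1}\Big] \qquad \text{(simulation lemma)}\\
&\le \sum_{h=0}^{H-1}\mathbb{E}_{(s_h,a_h)\sim d^{\pi^k}_h}\Big[b^k_h(s_h,a_h) + H\,\big\|\hat{P}^k_h(\cdot\mid s_h,a_h) - P_h(\cdot\mid s_h,a_h)\big\|_1\Big] \qquad \text{(Hölder)},
\end{aligned}$$
where $d^{\pi^k}_h$ is the state-action distribution at step $h$ under $\pi^k$ in the true MDP. Summing over $k$, applying Lemma 7.2 to the $\ell_1$ term and Lemma 7.5 to the resulting $\sum_{k,h} 1/\sqrt{N^k_h}$, gives a bound of order $H^2 S\sqrt{AK}$ up to log factors.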
Step 1: Concentration of the learned model.
Lemma 7.2 (State-action wise model error). Fix $\delta \in (0,1)$. With probability at least $1-\delta$, we have that for all $k \in [K]$, $h \in [H]$, and any $(s,a) \in S \times A$:
$$\big\|\hat{P}^k_h(\cdot\mid s,a) - P_h(\cdot\mid s,a)\big\|_1 \le c\,\sqrt{\frac{S\ln(SAHK/\delta)}{N^k_h(s,a)\vee 1}}$$
for an absolute constant $c$.
Proof: Azuma-Hoeffding + Union Bound
Remarks: Here the union bound includes the test vectors $f$ (since $\|\hat{P} - P\|_1 = \sup_{\|f\|_\infty \le 1} f\cdot(\hat{P} - P)$), so we need to utilize an $\epsilon$-net argument. Refer to https://arxiv.org/pdf/1011.3027.pdf Lemma 5.2 for why an $\epsilon$-net of size at most $(1 + 2R/\epsilon)^S$ exists for an L2 ball of radius $R$ in $\mathbb{R}^S$. (Note that $\|f\|_2 \le \sqrt{S}$ whenever $\|f\|_\infty \le 1$, so we essentially need an $\epsilon$-net for an L2 ball with radius $\sqrt{S}$.)
Remarks: After the union bound, we only need to bound the finitely many points in the $\epsilon$-net; an arbitrary $f$ is then handled by rounding to its nearest net point.
Lemma 7.3 (State-action wise average model error under $V^\star$). Fix $\delta \in (0,1)$. For all $k \in [K]$, $h \in [H]$, $(s,a) \in S\times A$, and considering the fixed function $V^\star_{h+1}$, with probability at least $1-\delta$ we have:
$$\Big|\big(\hat{P}^k_h(\cdot\mid s,a) - P_h(\cdot\mid s,a)\big)\cdot V^\star_{h+1}\Big| \le H\,\sqrt{\frac{\ln(SAHK/\delta)}{N^k_h(s,a)\vee 1}}.$$
Proof: This is simply (the proof of) Lemma 7.2 without the union bound over the test functions $f$, since $V^\star_{h+1}$ is a single fixed, data-independent function.
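As a quick numerical illustration of the gap between the two lemmas (an illustrative simulation, not part of the notes): the $\ell_1$ model error of Lemma 7.2 decays like $\sqrt{S/N}$, while the error against a single fixed function (Lemma 7.3) decays like $H\sqrt{1/N}$, i.e., without the $\sqrt{S}$ factor.

```python
import numpy as np

rng = np.random.default_rng(0)
S, H = 50, 10
p = rng.dirichlet(np.ones(S))          # "true" next-state distribution P(.|s,a)
f = rng.uniform(0, H, size=S)          # a fixed function f : S -> [0, H], playing the role of V*

for N in [100, 1_000, 10_000]:
    # Empirical model from N i.i.d. next-state samples.
    counts = np.bincount(rng.choice(S, size=N, p=p), minlength=S)
    p_hat = counts / N
    l1_err = np.abs(p_hat - p).sum()   # quantity bounded in Lemma 7.2
    fix_err = abs((p_hat - p) @ f)     # quantity bounded in Lemma 7.3
    print(f"N={N:>6}  ||P_hat-P||_1={l1_err:.3f} (sqrt(S/N)={np.sqrt(S/N):.3f})"
          f"  |(P_hat-P).f|={fix_err:.3f} (H/sqrt(N)={H/np.sqrt(N):.3f})")
```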
Step 2: Optimism and completing the regret bound.
Denote by $\mathcal{E}$ the event that the bounds in Lemmas 7.2 and 7.3 above hold simultaneously.
Lemma 7.4 (Optimism). Assume $\mathcal{E}$ is true. For all episodes $k$ and all $h, s$, we have:
$$\hat{V}^k_h(s) \ge V^\star_h(s),$$
where $\hat{V}^k_h$ is computed by the value iteration with bonus described in Part 2 above.
Proof: We use induction over $h$ (from $h = H$ down to $h = 0$) to exploit the fact that VI is an iterative algorithm. Proof at P76 of the book.
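A sketch of the key induction step (a reconstruction of the standard argument under event $\mathcal{E}$, not the book's exact display): assuming $\hat{V}^k_{h+1}(s) \ge V^\star_{h+1}(s)$ for all $s$,
$$\begin{aligned}
\hat{Q}^k_h(s,a)
&= \min\Big\{r_h(s,a) + b^k_h(s,a) + \hat{P}^k_h(\cdot\mid s,a)\cdot\hat{V}^k_{h+1},\; H\Big\}\\
&\ge \min\Big\{r_h(s,a) + b^k_h(s,a) + \hat{P}^k_h(\cdot\mid s,a)\cdot V^\star_{h+1},\; H\Big\} \qquad \text{(induction hypothesis)}\\
&\ge \min\Big\{r_h(s,a) + b^k_h(s,a) + P_h(\cdot\mid s,a)\cdot V^\star_{h+1} - \big|\big(\hat{P}^k_h - P_h\big)(\cdot\mid s,a)\cdot V^\star_{h+1}\big|,\; H\Big\}\\
&\ge \min\Big\{r_h(s,a) + P_h(\cdot\mid s,a)\cdot V^\star_{h+1},\; H\Big\} = Q^\star_h(s,a) \qquad \text{(Lemma 7.3: the bonus dominates the error)},
\end{aligned}$$
and taking the max over $a$ gives $\hat{V}^k_h(s) \ge V^\star_h(s)$; the base case is $\hat{V}^k_H = V^\star_H = 0$.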
Lemma 7.5. Consider an arbitrary sequence of trajectories $\{(s^k_h, a^k_h)\}_{h=0}^{H-1}$ for $k = 0, \dots, K-1$. We have
$$\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\frac{1}{\sqrt{N^k_h(s^k_h, a^k_h)\vee 1}} \le 2H\sqrt{SAK} + HSA.$$
Proof at P77 of the book
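The idea (sketched here; the notes defer to the book for details) is pigeonhole plus Cauchy-Schwarz. Fix a step $h$ and group the episodes by which $(s,a)$ pair is visited at step $h$:
$$\begin{aligned}
\sum_{k=0}^{K-1}\frac{1}{\sqrt{N^k_h(s^k_h,a^k_h)\vee 1}}
&= \sum_{s,a}\sum_{i=0}^{N^K_h(s,a)-1}\frac{1}{\sqrt{i\vee 1}} \qquad \text{(the $i$-th visit to $(s,a)$ at step $h$ sees count $i$)}\\
&\le \sum_{s,a}\Big(1 + 2\sqrt{N^K_h(s,a)}\Big)
\le SA + 2\sqrt{SA\textstyle\sum_{s,a}N^K_h(s,a)} = SA + 2\sqrt{SAK} \qquad \text{(Cauchy-Schwarz)},
\end{aligned}$$
and summing over $h = 0,\dots,H-1$ gives the lemma.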
Now we can prove the main theorem.
Proof at P77 of the book.
Remarks: The simulation lemma was covered in lecture (the proof is basically induction over $h$). Also check this note for the simulation lemma.
Remarks: Note that $\hat{V}^k_{h+1}$ is data-dependent (it is built from the same data as $\hat{P}^k_h$), so we cannot apply a concentration bound that only holds for a fixed function (as in Lemma 7.3) to the term $(\hat{P}^k_h - P_h)(\cdot\mid s,a)\cdot\hat{V}^k_{h+1}$. Instead, we need to use Hölder, $\big|(\hat{P}^k_h - P_h)(\cdot\mid s,a)\cdot\hat{V}^k_{h+1}\big| \le H\,\big\|\hat{P}^k_h(\cdot\mid s,a) - P_h(\cdot\mid s,a)\big\|_1$, together with Lemma 7.2.
An Improved Regret Bound
In this section, we show how to obtain a sharper regret bound than the one in Theorem 7.1. The core of the analysis is to use a sharper concentration inequality (Bernstein) than Hoeffding. Check appendix A.7.
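For intuition, here is the standard comparison (a sketch, not the book's exact statements): for a fixed $f$ with $\|f\|_\infty \le H$ and $N$ samples, with probability at least $1-\delta$,
$$\underbrace{\big|\big(\hat{P} - P\big)(\cdot\mid s,a)\cdot f\big| \lesssim H\sqrt{\frac{\ln(1/\delta)}{N}}}_{\text{Hoeffding}}
\qquad \text{vs.} \qquad
\underbrace{\big|\big(\hat{P} - P\big)(\cdot\mid s,a)\cdot f\big| \lesssim \sqrt{\frac{\mathrm{Var}_{s'\sim P(\cdot\mid s,a)}\big(f(s')\big)\ln(1/\delta)}{N}} + \frac{H\ln(1/\delta)}{N}}_{\text{Bernstein}}.$$
Since the variance term can be much smaller than $H^2$, and the per-step variances of the value function along a trajectory sum to roughly $H^2$ (rather than $H\cdot H^2$) by the law of total variance, Bernstein-based bonuses yield a regret bound with improved $H$ dependence.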