Overview

This is my note for Chapter 7 of Reinforcement Learning Theory.

Strategic Exploration in Tabular MDPs

Assumptions in this chapter: finite (tabular) state and action spaces $S$, $A$; episodic, finite-horizon setting with horizon $H$ and a fixed initial state $s_0$; the reward $r(s,a) \in [0,1]$ is known, while the transition $P$ is unknown.

Learning Protocol

  1. Learner initializes a policy $\pi^0$.

  2. At episode $n$, the learner executes $\pi^n$ to collect a trajectory $\{s_h^n, a_h^n, r_h^n\}_{h=0}^{H-1}$, with $a_h^n = \pi^n(s_h^n)$, $r_h^n = r(s_h^n, a_h^n)$, $s_{h+1}^n \sim P(\cdot \mid s_h^n, a_h^n)$.

  3. Learner updates the policy to $\pi^{n+1}$.

The performance measure for K episodes:

$$\mathrm{Regret} := \mathbb{E}\left[K V^\star(s_0) - \sum_{k=0}^{K-1}\sum_{h=0}^{H-1} r(s_h^k, a_h^k)\right] = \mathbb{E}\left[\sum_{k=0}^{K-1}\left(V^\star(s_0) - V^{\pi^k}(s_0)\right)\right].$$

If we convert this problem to a MAB and run UCB, what regret do we get?

Unique policies: $(|A|^{|S|})^{H}$ -> all treated as arms.

Therefore UCB gives a regret of $O\!\left(\sqrt{|A|^{|S|H}\, K}\right)$, which is really bad (exponential in $|S|H$).
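For a sense of scale (a quick numerical example of my own): even a small MDP with $|S| = 10$, $|A| = 2$, $H = 10$ has $(|A|^{|S|})^{H} = 2^{100} \approx 1.3 \times 10^{30}$ deterministic nonstationary policies, so a $\sqrt{\#\text{arms}\cdot K}$ regret is vacuous for any realistic $K$.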

WHY? Policies should not be treated as independent arms!

UCB-VI

 

Let us consider the very beginning of episode k:

Part 1: Model estimation

$$N_h^k(s,a,s') = \sum_{i=0}^{k-1} \mathbb{1}\left\{(s_h^i, a_h^i, s_{h+1}^i) = (s,a,s')\right\}, \qquad N_h^k(s,a) = \sum_{i=0}^{k-1} \mathbb{1}\left\{(s_h^i, a_h^i) = (s,a)\right\}, \qquad \forall h, s, a, s'.$$

And we have the empirical estimate of the transition model:

$$\widehat{P}_h^k(s' \mid s, a) = \frac{N_h^k(s,a,s')}{N_h^k(s,a)}, \qquad \forall h, s, a, s'.$$
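A minimal sketch of Part 1 in code. The function name `estimate_model` and the trajectory storage format are my own hypothetical choices, not from the book; states and actions are assumed to be integer indices.

```python
import numpy as np

def estimate_model(trajectories, S, A, H):
    """Empirical transition model P_hat_h^k(s'|s,a) and counts N_h^k(s,a).

    `trajectories`: a list of episodes, each a list of (s_h, a_h, s_{h+1})
    index triples for h = 0, ..., H-1 (hypothetical storage format).
    """
    N_sas = np.zeros((H, S, A, S))            # N_h^k(s, a, s')
    for episode in trajectories:
        for h, (s, a, s_next) in enumerate(episode):
            N_sas[h, s, a, s_next] += 1
    N_sa = N_sas.sum(axis=-1)                 # N_h^k(s, a)
    # Unvisited (h, s, a) pairs are left uniform (a convention, not the book's).
    P_hat = np.full((H, S, A, S), 1.0 / S)
    visited = N_sa > 0                        # boolean mask over (h, s, a)
    P_hat[visited] = N_sas[visited] / N_sa[visited][:, None]
    return P_hat, N_sa
```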

Part 2: Reward Bonus Design and Value Iteration

We use a bonus to encourage exploration:

$$b_h^k(s,a) = c\, H \sqrt{\frac{\ln(SAHK/\delta)}{N_h^k(s,a)}}$$

Now we perform VI backwards, starting from $h = H$:

$$\widehat{V}_H^k(s) = 0, \ \forall s, \qquad \widehat{Q}_h^k(s,a) = \min\left\{ r_h(s,a) + b_h^k(s,a) + \widehat{P}_h^k(\cdot \mid s,a) \cdot \widehat{V}_{h+1}^k,\ H \right\},$$
$$\widehat{V}_h^k(s) = \max_a \widehat{Q}_h^k(s,a), \qquad \pi_h^k(s) = \arg\max_a \widehat{Q}_h^k(s,a), \qquad \forall h, s, a.$$

Remarks: The truncation at $H$ for $\widehat{Q}_h^k(s,a)$ is because the reward is bounded, so no policy's Q value can be larger than $H$ (hence $\widehat{V}_h^k \le H$ for all $h, k$).
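A minimal sketch of Part 2, continuing the hypothetical `estimate_model` output above. The value of the constant `c` and the handling of unvisited pairs are my own choices, not from the book.

```python
import numpy as np

def ucbvi_plan(P_hat, N_sa, r, K, delta, c=2.0):
    """Optimistic value iteration with reward bonus (a sketch, not the book's code).

    P_hat: (H, S, A, S) empirical transitions; N_sa: (H, S, A) visit counts;
    r: (H, S, A) known rewards in [0, 1]. Returns greedy policy and V_hat.
    """
    H, S, A, _ = P_hat.shape
    log_term = np.log(S * A * H * K / delta)
    V_hat = np.zeros((H + 1, S))              # V_hat_H^k(s) = 0
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Bonus b_h^k(s, a); unvisited pairs get the maximal bonus (an assumption).
        bonus = c * H * np.sqrt(log_term / np.maximum(N_sa[h], 1))
        Q_hat = r[h] + bonus + P_hat[h] @ V_hat[h + 1]   # shape (S, A)
        Q_hat = np.minimum(Q_hat, H)                     # truncate at H
        V_hat[h] = Q_hat.max(axis=1)
        pi[h] = Q_hat.argmax(axis=1)
    return pi, V_hat
```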

Analysis

Theorem 7.1 (Regret Bound of UCBVI). UCBVI achieves the following regret bound

$$\mathrm{Regret} := \mathbb{E}\left[\sum_{k=0}^{K-1}\left(V^\star(s_0) - V^{\pi^k}(s_0)\right)\right] \le 10\, H^2 S \sqrt{A K \ln(SAH^2K^2)} = \widetilde{O}\!\left(H^2 S \sqrt{AK}\right)$$

Remarks: A sharper bound of $\widetilde{O}(H^2\sqrt{SAK})$ can be obtained (see the improved bound below). UCBVI is still suboptimal w.r.t. the $H^2$ dependence.

Remarks: Note that this is a statement in expectation, but a high-probability version should not be hard to obtain with a martingale argument.

Sketch of proof:

Step 1:

Lemma 7.2 (State-action wise model error). Fix $\delta \in (0,1)$. With probability at least $1-\delta$, for all $k \in \{0,\dots,K-1\}$, $s \in S$, $a \in A$, $h \in \{0,\dots,H-1\}$, and for any $f: S \to [0,H]$:

$$\left|\left(\widehat{P}_h^k(\cdot \mid s,a) - P_h(\cdot \mid s,a)\right) \cdot f\right| \le 8H\sqrt{\frac{S \ln(SAHK/\delta)}{N_h^k(s,a)}}.$$

Proof: Azuma-Hoeffding + Union Bound

Remarks: Here the union bound includes $f$, so we need to utilize an $\epsilon$-net argument. Refer to https://arxiv.org/pdf/1011.3027.pdf Lemma 5.2 for why we have $|N_\epsilon| \le (1 + 2H\sqrt{|S|}/\epsilon)^{|S|}$ for $f: S \to [0,H]$. (Note that $\|f\|_2 \le H\sqrt{|S|}$, so we essentially need an $\epsilon$-net for an $\ell_2$ ball with radius $H\sqrt{|S|}$.)

Remarks: After the union bound, we only need to bound the deviation at the points in the $\epsilon$-net.
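Roughly, the bookkeeping behind the extra $\sqrt{S}$ (my own back-of-the-envelope, constants ignored): the net size satisfies

$$\ln |N_\epsilon| \le |S| \ln\!\left(1 + \frac{2H\sqrt{|S|}}{\epsilon}\right),$$

so after the union bound over all $(k, h, s, a)$ and over the net, the Hoeffding log factor $\ln(SAHK/\delta)$ becomes roughly $S\ln(SAHK/(\epsilon\delta))$, which is where the $S$ inside the square root in Lemma 7.2 comes from. The discretization error is at most $\|\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\|_2 \cdot \|f - f_\epsilon\|_2 \le 2\epsilon$, negligible for small $\epsilon$.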


Lemma 7.3 (State-action wise average model error under $V^\star$). Fix $\delta \in (0,1)$. With probability at least $1-\delta$, for all $k \in \{0,\dots,K-1\}$, $s \in S$, $a \in A$, $h \in \{0,\dots,H-1\}$, and considering the (data-independent) optimal value function $V^\star_{h+1}: S \to [0,H]$, we have:

$$\left|\widehat{P}_h^k(\cdot \mid s,a) \cdot V^\star_{h+1} - P_h(\cdot \mid s,a) \cdot V^\star_{h+1}\right| \le 2H\sqrt{\frac{\ln(SAHK/\delta)}{N_h^k(s,a)}}$$

Proof: This is simply Lemma 7.2 without the union bound over $f$ (since $V^\star_{h+1}$ is a single fixed function), which is why there is no extra $\sqrt{S}$ factor.


Step 2:

Denote by $\mathcal{E}_{\text{model}}$ the event that the bounds in Lemmas 7.2 and 7.3 above hold.

Lemma 7.4 (Optimism). Assume $\mathcal{E}_{\text{model}}$ is true. For all episodes $k$, we have:

$$\widehat{V}_0^k(s_0) \ge V^\star_0(s_0), \qquad \forall s_0 \in S;$$

where $\widehat{V}_h^k$ is computed by the value iteration with bonus described above.


Proof: We use induction over $h$, exploiting the fact that VI is an iterative algorithm. Proof at p. 76 of the book.
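A rough sketch of the induction step (my own summary, ignoring the case where the truncation at $H$ is active, which is harmless since $Q^\star_h \le H$): the base case is $\widehat{V}_H^k = V^\star_H = 0$; assuming $\widehat{V}_{h+1}^k \ge V^\star_{h+1}$ pointwise,

$$\widehat{Q}_h^k(s,a) - Q^\star_h(s,a) = b_h^k(s,a) + \widehat{P}_h^k(\cdot \mid s,a)\cdot \widehat{V}_{h+1}^k - P_h(\cdot \mid s,a)\cdot V^\star_{h+1} \ge b_h^k(s,a) + \left(\widehat{P}_h^k(\cdot \mid s,a) - P_h(\cdot \mid s,a)\right)\cdot V^\star_{h+1} \ge 0,$$

where the last step uses Lemma 7.3 under $\mathcal{E}_{\text{model}}$ and a large enough constant $c$ in the bonus; taking $\max_a$ gives $\widehat{V}_h^k \ge V^\star_h$.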


Lemma 7.5. Consider an arbitrary sequence of $K$ trajectories $\tau^k = \{s_h^k, a_h^k\}_{h=0}^{H-1}$ for $k = 0, \dots, K-1$. We have

$$\sum_{k=0}^{K-1}\sum_{h=0}^{H-1} \frac{1}{\sqrt{N_h^k(s_h^k, a_h^k)}} \le 2H\sqrt{SAK}.$$

Proof at p. 77 of the book.
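A rough sketch (my own summary, ignoring the $N_h^k = 0$ corner case): group the sum by $(h, s, a)$, use $\sum_{i=1}^{N} 1/\sqrt{i} \le 2\sqrt{N}$, then Cauchy–Schwarz with $\sum_{s,a} N_h^K(s,a) = K$ for every $h$:

$$\sum_{k=0}^{K-1}\sum_{h=0}^{H-1} \frac{1}{\sqrt{N_h^k(s_h^k, a_h^k)}} = \sum_{h=0}^{H-1}\sum_{s,a}\sum_{i=1}^{N_h^K(s,a)} \frac{1}{\sqrt{i}} \le 2\sum_{h=0}^{H-1}\sum_{s,a}\sqrt{N_h^K(s,a)} \le 2\sum_{h=0}^{H-1}\sqrt{SA\,K} = 2H\sqrt{SAK}.$$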


Now we can prove the main theorem.


Proof at p. 77 of the book.

Remarks: The simulation lemma is covered in lecture (basically induction). Also check this note for the simulation lemma.

Remarks: Note that $\widehat{V}_{h+1}^k$ is data-dependent, so we cannot apply the fixed-function bound of Lemma 7.3 to it. Instead, we need to use Hölder (paying the extra $\sqrt{S}$, as in Lemma 7.2).
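Concretely, the Hölder step is (a sketch):

$$\left|\left(\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot \widehat{V}_{h+1}^k\right| \le \left\|\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right\|_1 \left\|\widehat{V}_{h+1}^k\right\|_\infty \le H\left\|\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right\|_1,$$

and the $\ell_1$ deviation is then bounded uniformly over $(k, h, s, a)$, as in (the proof of) Lemma 7.2, which is where the extra $\sqrt{S}$ enters the final bound.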


An Improved Regret Bound

In this section, we show a regret bound of $\widetilde{O}\!\left(H^2\sqrt{|S||A|K} + H^3|S|^2|A|\right)$. The core of the analysis is to use a sharper concentration inequality (Bernstein) than Hoeffding. Check Appendix A.7.
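For reference, a sketch of the form only (not the book's exact statement or constants): the Bernstein-style analogue of Lemma 7.3 replaces the range $H$ by a variance term,

$$\left|\left(\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot V^\star_{h+1}\right| \lesssim \sqrt{\frac{\mathrm{Var}_{s'\sim P_h(\cdot\mid s,a)}\!\left(V^\star_{h+1}(s')\right)\ln(SAHK/\delta)}{N_h^k(s,a)}} + \frac{H\ln(SAHK/\delta)}{N_h^k(s,a)};$$

see Appendix A.7 for how these bounds are combined to obtain the improved regret.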