Overview

These are my notes for Chapter 8 of Reinforcement Learning Theory.

Linearly Parameterized MDPs

The whole idea is similar to Chapter 6, where we extended from MABs to linear bandits to deal with a huge action space. Here we seek an analogous extension of the tabular (discrete) MDPs from Chapter 7.

Setting

The setting is similar to Chapter 7, i.e., finite horizon + episodic setting + minimizing regret:

$$\text{Regret} := \mathbb{E}\left[\sum_{k=0}^{K-1}\left(V^* - V^{\pi_k}\right)\right]$$

Low-Rank MDPs and Linear MDPs

Here $S$ and $A$ can be infinite. Without any further structural assumption, the lower bounds we saw in the Generalization lecture rule out a polynomial regret bound.

Remarks: Recall that in Chapter 7 we obtained an $\tilde{O}(H^2\sqrt{SAK})$ regret bound for UCBVI. If $S$ and $A$ are infinite or continuous, this is obviously quite bad. Also recall that in Chapter 6 we obtained an $\tilde{O}(d\sqrt{T})$ regret bound for LinUCB (here $d$ is the dimension and $T$ the number of iterations). We see that the main motivation is to impose some structural assumption under which we can still get a polynomial regret bound.

The structural assumption we make in this note is a linear structure in both the reward and the transition.

Definition 8.1 (Linear MDPs). Consider transitions $\{P_h\}$ and rewards $\{r_h\}$. A linear MDP has the following structure on $\{r_h\}$ and $\{P_h\}$:

$$r_h(s,a) = \theta_h^\top\phi(s,a), \qquad P_h(\cdot\mid s,a) = \mu_h\,\phi(s,a), \quad \forall h,$$

where $\phi$ is a known state-action feature map $\phi: S\times A \to \mathbb{R}^d$, and $\mu_h \in \mathbb{R}^{|S|\times d}$. Here $\phi$ and $\theta_h$ are known to the learner, while $\mu_h$ is unknown. We further assume the following norm bounds on the parameters: (1) $\sup_{s,a}\|\phi(s,a)\|_2 \le 1$; (2) $\|v^\top\mu_h\|_2 \le \sqrt{d}$ for any $v$ such that $\|v\|_\infty \le 1$ and all $h$; and (3) $\|\theta_h\|_2 \le W$ for all $h$. We also assume $r_h(s,a) \in [0,1]$ for all $h$ and $(s,a)$.


The point of this assumption is that we can now hope for bounds that scale with $\mathrm{poly}(d)$ instead of $|S|$ or $|A|$.

Remarks: The connection to tabular MDPs: if $\phi$ is a one-hot mapping over state-action pairs (so $d = |S||A|$), then $\mu_h\phi(s,a) = (\mu_h)_i$, where $(\mu_h)_i$ denotes the $i$-th column of $\mu_h$ (the column indexed by $(s,a)$), and the known reward is $r_h(s,a) = [\theta_h]_i$. This recovers tabular MDPs.
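
To make the reduction concrete, here is a minimal numerical sketch (the array names, sizes, and random distributions are my own illustration, not from the text): with a one-hot feature map, $\mu_h\phi(s,a)$ simply selects the column of $\mu_h$ indexed by $(s,a)$, which plays the role of the tabular transition vector $P_h(\cdot\mid s,a)$.

```python
import numpy as np

# Hypothetical sizes for illustration only.
S, A = 3, 2
d = S * A
rng = np.random.default_rng(0)

# Each column of mu_h is a distribution over next states, i.e. P_h(.|s,a) for one (s,a).
mu_h = rng.dirichlet(np.ones(S), size=d).T      # shape (|S|, d), columns sum to 1

def phi_onehot(s, a):
    """One-hot feature over (s, a) pairs."""
    e = np.zeros(d)
    e[s * A + a] = 1.0
    return e

s, a = 1, 0
# mu_h @ phi(s,a) picks out the column indexed by (s,a): the tabular P_h(.|s,a).
assert np.allclose(mu_h @ phi_onehot(s, a), mu_h[:, s * A + a])
```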

An example is latent variable models, see slides.
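
A minimal sketch of that example (all arrays and sizes here are made up for illustration): take $\phi(s,a)$ to be a distribution over $d$ latent states and let the columns of $\mu_h$ be next-state distributions conditioned on each latent state; then $\mu_h\phi(s,a)$ is automatically a valid distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 4, 3                                    # illustrative sizes
mu_h = rng.dirichlet(np.ones(S), size=d).T     # column z is P(s' | latent z), shape (|S|, d)
phi_sa = rng.dirichlet(np.ones(d))             # phi(s,a): a distribution over latent states

P_sa = mu_h @ phi_sa                           # P_h(.|s,a) = sum_z phi(s,a)[z] * P(s'|z)
assert np.all(P_sa >= 0) and np.isclose(P_sa.sum(), 1.0)
```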

Remarks: Linear MDP = low-rank MDP + known feature map $\phi$.

Planning in Linear MDPs

An important observation in linear MDPs is the linearity of everything, i.e.,

$$Q^*_h(s,a) = \phi(s,a)\cdot\left(\theta_h + \mu_h^\top V^*_{h+1}\right) =: \phi(s,a)\cdot w_h$$

where $\cdot$ denotes the inner product. Thus we can define $\pi^*_h(s) = \arg\max_a Q^*_h(s,a)$ and $V^*_h(s) = \max_a Q^*_h(s,a)$. More generally, we have

Claim 8.2. Consider any arbitrary function $f: S \to [0,H]$. At any time step $h \in [0, H-1]$ there must exist a $w \in \mathbb{R}^d$ such that, for all $(s,a) \in S\times A$:

$$(\mathcal{T}f)(s,a) = r_h(s,a) + P_h(\cdot\mid s,a)\cdot f = w^\top\phi(s,a),$$

where $\mathcal{T}$ is the Bellman operator.


Proof: By definition, $(\mathcal{T}f)(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[f(s')] = \theta_h^\top\phi(s,a) + f^\top\mu_h\phi(s,a) = \left(\theta_h + \mu_h^\top f\right)^\top\phi(s,a)$, so we can take $w = \theta_h + \mu_h^\top f$.


UCBVI?

  1. Learn a transition model $\{\widehat{P}_h^n\}_{h=0}^{H-1}$ from all previous data $\{s_h^i, a_h^i, s_{h+1}^i\}_{i=0}^{n-1}$

  2. Design a reward bonus $b_h^n(s,a)$ for all $s,a$

  3. Plan: $\pi^{n+1} = \text{Value-Iter}\left(\{\widehat{P}_h^n\}_h, \{r_h + b_h^n\}_h\right)$

  4. Execute $\pi^{n+1}$ for $H$ steps

Remarks: We can see that the general framework is identical to tabular UCBVI, so the question becomes how to design the bonus and how to learn the transition kernel.

Learning Transition using Ridge Linear Regression

Remarks: Recall that in tabular MDPs we estimate the transition empirically from visitation counts. In this setting that is infeasible, since $|S|$ and $|A|$ can be huge or continuous, so we need another approach.

In this section, we consider the following simple question: given a dataset of state-action-next state tuples, how can we learn the transition Ph for all h?

We consider a particular episode $n$. Similar to tabular UCBVI, we learn a model at the very beginning of episode $n$ using all data from the previous episodes (episodes $0$ through $n-1$). We denote this dataset as:

$$\mathcal{D}_h^n = \left\{s_h^i, a_h^i, s_{h+1}^i\right\}_{i=0}^{n-1}.$$

We maintain the following statistics using $\mathcal{D}_h^n$:

$$\Lambda_h^n = \sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top + \lambda I.$$

Remarks: when $\phi$ is a one-hot mapping, $\Lambda_h^n$ is a diagonal matrix with $N_h^n(s,a) + \lambda$ on the diagonal. (Recall that $N_h^n(s,a) = \sum_{i=0}^{n-1}\mathbb{1}\{(s_h^i,a_h^i) = (s,a)\}$.)
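
A quick sanity check of this remark (the sizes and visit sequence below are made up for illustration): with one-hot features, $\Lambda_h^n$ is diagonal and its diagonal entries are the visitation counts plus $\lambda$.

```python
import numpy as np

S, A, lam = 3, 2, 1.0                      # illustrative sizes
d = S * A

def phi_onehot(s, a):
    e = np.zeros(d)
    e[s * A + a] = 1.0
    return e

visits = [(0, 1), (0, 1), (2, 0)]          # hypothetical (s, a) visits at step h
Lam = lam * np.eye(d) + sum(np.outer(phi_onehot(s, a), phi_onehot(s, a)) for s, a in visits)

# Diagonal entries are N_h^n(s,a) + lambda; off-diagonal entries are zero.
print(np.diag(Lam))                        # -> [1. 3. 1. 1. 2. 1.] with the visits above
assert np.allclose(Lam, np.diag(np.diag(Lam)))
```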

Remarks: A useful bound: $\det(\Lambda_h^n) = \det\left(\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top + \lambda I\right) \le \det\left(nI + \lambda I\right) = (n+\lambda)^d$. The middle step holds because each rank-one term satisfies $\phi\phi^\top \preceq I$: $xx^\top$ and $x^\top x$ have the same non-zero eigenvalues, and $\|\phi\|_2^2 \le 1$.

We consider the following multi-variate linear regression problem. Denote by $\delta(s)$ the one-hot vector that is zero everywhere except for a one in the entry corresponding to $s$. Define the noise $\epsilon_h^i := \delta(s_{h+1}^i) - P_h(\cdot\mid s_h^i,a_h^i)$, so that $\delta(s_{h+1}^i) = P_h(\cdot\mid s_h^i,a_h^i) + \epsilon_h^i$. Conditioned on the history $\mathcal{H}_h^i$ (where $\mathcal{H}_h^i$ denotes all information from the very beginning of the learning process up to and including $(s_h^i,a_h^i)$), we have:

$$\mathbb{E}\left[\epsilon_h^i \mid \mathcal{H}_h^i\right] = 0$$

simply because $s_{h+1}^i$ is sampled from $P_h(\cdot\mid s_h^i,a_h^i)$ conditioned on $(s_h^i,a_h^i)$. (Remarks: In other words, $\mathbb{E}\left[\delta(s_{h+1}^i)\mid(s_h^i,a_h^i)\right] = P_h(\cdot\mid s_h^i,a_h^i)$.) Also note that $\|\epsilon_h^i\|_1 \le 2$ for all $h,i$.

Since $\mu_h\phi(s_h^i,a_h^i) = P_h(\cdot\mid s_h^i,a_h^i)$, and $\delta(s_{h+1}^i)$ is an unbiased estimate of $P_h(\cdot\mid s_h^i,a_h^i)$ conditioned on $(s_h^i,a_h^i)$, it is reasonable to learn $\mu_h$ via regression from $\phi(s_h^i,a_h^i)$ to $\delta(s_{h+1}^i)$. This leads us to the following ridge linear regression:

$$\widehat{\mu}_h^n = \arg\min_{\mu\in\mathbb{R}^{|S|\times d}}\sum_{i=0}^{n-1}\left\|\mu\,\phi(s_h^i,a_h^i) - \delta(s_{h+1}^i)\right\|_2^2 + \lambda\|\mu\|_F^2.$$

Ridge linear regression has the following closed-form solution:

$$\widehat{\mu}_h^n = \sum_{i=0}^{n-1}\delta(s_{h+1}^i)\,\phi(s_h^i,a_h^i)^\top\left(\Lambda_h^n\right)^{-1}.$$

Note that $\widehat{\mu}_h^n \in \mathbb{R}^{|S|\times d}$, so we never want to store it explicitly. We will always use $\widehat{\mu}_h^n$ together with a specific $(s,a)$ pair and a value function $V$ (think of the value iteration case), i.e., we care about $\widehat{P}_h^n(\cdot\mid s,a)\cdot V := \left(\widehat{\mu}_h^n\phi(s,a)\right)^\top V$, which can be rewritten as:

$$\widehat{P}_h^n(\cdot\mid s,a)\cdot V := \left(\widehat{\mu}_h^n\phi(s,a)\right)^\top V = \phi(s,a)^\top\sum_{i=0}^{n-1}\left(\Lambda_h^n\right)^{-1}\phi(s_h^i,a_h^i)\,V(s_{h+1}^i),$$

Thus we only ever need to compute $\widehat{P}_h^n(\cdot\mid s,a)\cdot V$, which costs $\mathrm{poly}(d,n)$ time and does not depend on $|S|$.
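
A minimal sketch of this computation (function and variable names are mine, not the note's): we never materialize $\widehat{\mu}_h^n$; we only combine the stored features, the next-state values, and $\Lambda_h^n$.

```python
import numpy as np

def phat_dot_V(phi_sa, Phis, V_next, lam=1.0):
    """Compute P_hat_h^n(.|s,a) . V  =  phi(s,a)^T (Lambda_h^n)^{-1} sum_i phi_i V(s'_i).

    phi_sa : (d,)   feature of the query pair (s, a)
    Phis   : (n, d) rows are phi(s_h^i, a_h^i) from episodes 0..n-1
    V_next : (n,)   values V(s_{h+1}^i) at the observed next states
    """
    d = Phis.shape[1]
    Lam = lam * np.eye(d) + Phis.T @ Phis              # Lambda_h^n
    weighted = np.linalg.solve(Lam, Phis.T @ V_next)   # (Lambda_h^n)^{-1} sum_i phi_i V(s'_i)
    return float(phi_sa @ weighted)                    # cost: poly(d, n), no |S| dependence
```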

Lemma 8.3 (Difference between $\widehat{\mu}_h^n$ and $\mu_h$). For all $n$ and $h$, we must have:

$$\widehat{\mu}_h^n - \mu_h = -\lambda\mu_h\left(\Lambda_h^n\right)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top\left(\Lambda_h^n\right)^{-1}$$

Proof: We start from the closed-form solution of $\widehat{\mu}_h^n$:

$$\begin{aligned}
\widehat{\mu}_h^n &= \sum_{i=0}^{n-1}\delta(s_{h+1}^i)\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \sum_{i=0}^{n-1}\left(P_h(\cdot\mid s_h^i,a_h^i) + \epsilon_h^i\right)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \sum_{i=0}^{n-1}\left(\mu_h\phi(s_h^i,a_h^i) + \epsilon_h^i\right)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \mu_h\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \mu_h\left(\Lambda_h^n - \lambda I\right)(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \mu_h - \lambda\mu_h(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1}.
\end{aligned}$$

Rearranging terms concludes the proof.


Lemma 8.4. Fix $V: S \to [0,H]$. With probability at least $1-\delta$, for all $n$ and $h$, we have (where $\|x\|_{A} := \sqrt{x^\top A x}$):

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$

Proof: We first check the noise terms $\{V^\top\epsilon_h^i\}_{h,i}$. Since $V$ is independent of the data (it is a pre-fixed function), by linearity of expectation we have:

$$\mathbb{E}\left[V^\top\epsilon_h^i \mid \mathcal{H}_h^i\right] = 0, \qquad \left|V^\top\epsilon_h^i\right| \le \|V\|_\infty\|\epsilon_h^i\|_1 \le 2H, \quad \forall h,i.$$

Hence, this is a martingale difference sequence. Using the self-normalized vector-valued martingale bound (Lemma A.9), we have that for all $n$, with probability at least $1-\delta$:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$

Applying a union bound over all $h \in [H]$, we get that with probability at least $1-\delta$, for all $n, h$:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$

Uniform Convergence via Covering

The reason to use covering here is that the function class is infinite.

Lemma 8.5. The $\epsilon$-covering number of the ball $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le R\}$, $R \in \mathbb{R}^+$, is upper bounded by $(1 + 2R/\epsilon)^d$.


Proof: We already showed this in Chapter 7. Refer to Lemma 5.2 of https://arxiv.org/pdf/1011.3027.pdf.


We can extend the above result to a function class. Specifically, we look at the following functions. For a triple $(w,\beta,\Lambda)$ where $w\in\mathbb{R}^d$ with $\|w\|_2 \le L$, $\beta\in[0,B]$, and $\Lambda$ such that $\sigma_{\min}(\Lambda)\ge\lambda$, we define $f_{w,\beta,\Lambda}: S \to \mathbb{R}$ as follows:

$$f_{w,\beta,\Lambda}(s) = \min\left\{\max_a\left(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right),\ H\right\}, \quad \forall s\in S.$$

We denote the function class $\mathcal{F}$ as:

$$\mathcal{F} = \left\{f_{w,\beta,\Lambda} : \|w\|_2 \le L,\ \beta\in[0,B],\ \sigma_{\min}(\Lambda)\ge\lambda\right\}.$$

Note that $\mathcal{F}$ contains infinitely many functions, as the parameters are continuous. However, we will show that it has a finite covering number that scales exponentially with the number of parameters in $(w,\beta,\Lambda)$.

Why do we look at $\mathcal{F}$? As we will see later in this chapter, $\mathcal{F}$ contains all possible $\widehat{V}_h^n$ functions one could encounter during the learning process.
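
For concreteness, here is a small sketch of evaluating $f_{w,\beta,\Lambda}$ at a state (assuming a finite action set we can enumerate; `phi` and `actions` are hypothetical stand-ins for the known feature map and action set):

```python
import numpy as np

def f_w_beta_Lam(s, w, beta, Lam, phi, actions, H):
    """f_{w,beta,Lambda}(s) = min{ max_a ( w^T phi(s,a) + beta * sqrt(phi^T Lambda^{-1} phi) ), H }."""
    Lam_inv = np.linalg.inv(Lam)
    vals = [w @ phi(s, a) + beta * np.sqrt(phi(s, a) @ Lam_inv @ phi(s, a)) for a in actions]
    return min(max(vals), H)
```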

Lemma 8.6 ($\epsilon$-covering number of $\mathcal{F}$). Consider $\mathcal{F}$ defined above. Denote its $\epsilon$-cover as $\mathcal{N}_\epsilon$ with the $\ell_\infty$ norm as the distance metric, i.e., $d(f_1,f_2) = \|f_1 - f_2\|_\infty$ for any $f_1,f_2\in\mathcal{F}$. We have that:

$$\ln(|\mathcal{N}_\epsilon|) \le d\ln(1+6L/\epsilon) + \ln\left(1+\frac{6B}{\sqrt{\lambda}\,\epsilon}\right) + d^2\ln\left(1+\frac{18B^2\sqrt{d}}{\lambda\epsilon^2}\right)$$

Note that the log covering number scales quadratically with respect to $d$.


Proof: We start by building a net over the parameter space $(w,\beta,\Lambda)$, and then convert this net over the parameter space into an $\epsilon$-net over $\mathcal{F}$ under the $\ell_\infty$ distance.

We pick two functions $f$ and $\hat{f}$ corresponding to parameters $(w,\beta,\Lambda)$ and $(\widehat{w},\widehat{\beta},\widehat{\Lambda})$.

$$\begin{aligned}
|f(s) - \hat{f}(s)| &\le \left|\max_a\left(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right) - \max_a\left(\widehat{w}^\top\phi(s,a) + \widehat{\beta}\sqrt{\phi(s,a)^\top\widehat{\Lambda}^{-1}\phi(s,a)}\right)\right| \\
&\le \max_a\left|\left(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right) - \left(\widehat{w}^\top\phi(s,a) + \widehat{\beta}\sqrt{\phi(s,a)^\top\widehat{\Lambda}^{-1}\phi(s,a)}\right)\right| \\
&\le \max_a\left|(w-\widehat{w})^\top\phi(s,a)\right| + \max_a\left|(\beta-\widehat{\beta})\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right| + \max_a\widehat{\beta}\left|\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)} - \sqrt{\phi(s,a)^\top\widehat{\Lambda}^{-1}\phi(s,a)}\right| \\
&\le \|w-\widehat{w}\|_2 + |\beta-\widehat{\beta}|/\sqrt{\lambda} + B\sqrt{\left|\phi(s,a)^\top\left(\Lambda^{-1}-\widehat{\Lambda}^{-1}\right)\phi(s,a)\right|} \\
&\le \|w-\widehat{w}\|_2 + |\beta-\widehat{\beta}|/\sqrt{\lambda} + B\sqrt{\left\|\Lambda^{-1}-\widehat{\Lambda}^{-1}\right\|_F}.
\end{aligned}$$

Remarks: Assume WLOG that $\max_a f(a) - \max_a g(a) \ge 0$ and let $a_0 = \arg\max_a f(a)$. Then $|\max_a f(a) - \max_a g(a)| = \max_a f(a) - \max_a g(a) \le f(a_0) - g(a_0)$. Thus we have $|\max_a f(a) - \max_a g(a)| \le \max_a|f(a) - g(a)|$.

Remarks: The fourth line uses the assumptions $\|\phi(s,a)\|_2 \le 1$, $\sigma_{\min}(\Lambda)\ge\lambda$, and $\widehat{\beta}\le B$, together with the elementary inequality $|\sqrt{u}-\sqrt{v}| \le \sqrt{|u-v|}$.

Note that $\Lambda^{-1}$ is a PD matrix with $\sigma_{\max}(\Lambda^{-1}) \le 1/\lambda$. Now we consider an $\epsilon/3$-net $\mathcal{N}_{\epsilon/3,w}$ over $\{w : \|w\|_2 \le L\}$, a $\sqrt{\lambda}\epsilon/3$-net $\mathcal{N}_{\sqrt{\lambda}\epsilon/3,\beta}$ over the interval $[0,B]$ for $\beta$, and an $\epsilon^2/(9B^2)$-net $\mathcal{N}_{\epsilon^2/(9B^2),\Lambda}$ (in Frobenius norm) over $\{\Lambda^{-1} : \|\Lambda^{-1}\|_F \le \sqrt{d}/\lambda\}$.

Remarks: This is just the usual $\epsilon/3$ argument, in covering-net form.

The product of these three nets provides an $\epsilon$-cover for $\mathcal{F}$, which means that the size of the $\epsilon$-net $\mathcal{N}_\epsilon$ for $\mathcal{F}$ is upper bounded as:

$$\ln|\mathcal{N}_\epsilon| \le \ln|\mathcal{N}_{\epsilon/3,w}| + \ln|\mathcal{N}_{\sqrt{\lambda}\epsilon/3,\beta}| + \ln|\mathcal{N}_{\epsilon^2/(9B^2),\Lambda}| \le d\ln(1+6L/\epsilon) + \ln\left(1+\frac{6B}{\sqrt{\lambda}\epsilon}\right) + d^2\ln\left(1+\frac{18B^2\sqrt{d}}{\lambda\epsilon^2}\right).$$

Now we can build a uniform convergence argument for all $f\in\mathcal{F}$. Yeyah :)!

Lemma 8.7 (Uniform Convergence). Set $\lambda = 1$ and fix $\delta\in(0,1)$. With probability at least $1-\delta$, for all $n,h$, all $(s,a)$, and all $f\in\mathcal{F}$, we have:

$$\left|\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot f\right| \lesssim H\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\sqrt{d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right) + \ln\frac{H}{\delta}}$$

Proof: Recall Lemma 8.4: with probability at least $1-\delta$, for all $n,h$, and for a pre-fixed $V$ (independent of the random process):

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta} \le 9H^2\left(\ln\frac{H}{\delta} + d\ln(1+N)\right)$$

where we have used the facts that $\|\phi\|_2\le 1$, $\lambda = 1$, and $\|\Lambda_h^n\|_2 \le N+1$.

Remarks: $N$ is the total number of episodes. Note that $\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2} \le (n+\lambda)^{d/2}\lambda^{-d/2} \le (N/\lambda+1)^{d/2} = (N+1)^{d/2}$.

Denote the $\epsilon$-cover of $\mathcal{F}$ by $\mathcal{N}_\epsilon$. Applying a union bound over all functions in $\mathcal{N}_\epsilon$, we have that with probability at least $1-\delta$, for all $V\in\mathcal{N}_\epsilon$ and all $n,h$:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\left(\ln\frac{H}{\delta} + \ln(|\mathcal{N}_\epsilon|) + d\ln(1+N)\right)$$

Recalling Lemma 8.6 and substituting the bound on $\ln|\mathcal{N}_\epsilon|$ into the above inequality, we get:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\left(\ln\frac{H}{\delta} + d\ln(1+6L/\epsilon) + d^2\ln\left(1+18B^2\sqrt{d}/\epsilon^2\right) + d\ln(1+N)\right).$$

Now consider an arbitrary $f\in\mathcal{F}$. By the definition of the $\epsilon$-cover, there exists a $V\in\mathcal{N}_\epsilon$ such that $\|f - V\|_\infty \le \epsilon$. Thus, we have:

$$\begin{aligned}
\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(f^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2
&\le 2\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 + 2\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left((V-f)^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \\
&\le 2\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 + 8\epsilon^2 N^2 \\
&\lesssim H^2\left(\ln\frac{H}{\delta} + d\ln(1+6L/\epsilon) + d^2\ln\left(1+18B^2\sqrt{d}/\epsilon^2\right) + d\ln(1+N)\right) + 8\epsilon^2 N^2,
\end{aligned}$$

where in the second inequality we use the fact that $\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left((V-f)^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 4\epsilon^2 N^2$, which follows from

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left((V-f)^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le \left(\sum_{i=0}^{n-1}\left\|\phi(s_h^i,a_h^i)\right\|_{(\Lambda_h^n)^{-1}}\cdot 2\epsilon\right)^2 \le \frac{4\epsilon^2}{\lambda}\left(\sum_{i=0}^{n-1}\|\phi(s_h^i,a_h^i)\|_2\right)^2 \le 4\epsilon^2 N^2.$$

Remarks: Recall that $\|\epsilon_h^i\|_1 \le 2$, which gives $\left|(V-f)^\top\epsilon_h^i\right| \le \|V-f\|_\infty\|\epsilon_h^i\|_1 \le 2\epsilon$.

Setting $\epsilon = 1/N$, we get:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(f^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \lesssim H^2\left(\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right) + d\ln(1+N)\right) + 8 \lesssim H^2\left(\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right)\right)$$

where we recall that $\lesssim$ ignores absolute constants. Now recall that we can express $\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot f = \phi(s,a)^\top\left(\widehat{\mu}_h^n - \mu_h\right)^\top f$. Recalling Lemma 8.3, we have:

$$\begin{aligned}
\left|\left(\widehat{\mu}_h^n\phi(s,a) - \mu_h\phi(s,a)\right)^\top f\right|
&\le \left|\lambda\,\phi(s,a)^\top(\Lambda_h^n)^{-1}\mu_h^\top f\right| + \left|\sum_{i=0}^{n-1}\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s_h^i,a_h^i)\,(\epsilon_h^i)^\top f\right| \\
&\le H\sqrt{d}\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}} + \|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\sqrt{H^2\left(\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right)\right)} \\
&\lesssim H\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\sqrt{\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right)}.
\end{aligned}$$

Remarks: For the second inequality, we have $\left|\lambda\,\phi(s,a)^\top(\Lambda_h^n)^{-1}\mu_h^\top f\right| = \lambda\left|\phi(s,a)^\top(\Lambda_h^n)^{-1/2}(\Lambda_h^n)^{-1/2}\mu_h^\top f\right| \le \lambda\left\|(\Lambda_h^n)^{-1/2}\phi(s,a)\right\|_2\left\|(\Lambda_h^n)^{-1/2}\mu_h^\top f\right\|_2 = \lambda\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\|\mu_h^\top f\|_{(\Lambda_h^n)^{-1}}$. Note that since $f\in\mathcal{F}$ we have $\|f\|_\infty \le H$, and assumption (2) yields $\|\mu_h^\top f\|_{(\Lambda_h^n)^{-1}} \le \|(\Lambda_h^n)^{-1/2}\|_2\|\mu_h^\top f\|_2 \le H\sqrt{d}/\sqrt{\lambda}$. (Note that the smallest eigenvalue of $\Lambda_h^n$ is at least $\lambda$.) Since $\lambda = 1$, we obtain the first part of the second inequality. The second part is similar.


Remarks: Note that this bound is $\tilde{O}(Hd)\cdot\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}$, which will be our bonus term.

Algorithm

Our algorithm, Upper Confidence Bound Value Iteration (UCBVI), will use a reward bonus to ensure optimism. Specifically, we will use the following reward bonus, which is motivated by the bonus used in linear bandits:

$$b_h^n(s,a) = \beta\sqrt{\phi(s,a)^\top\left(\Lambda_h^n\right)^{-1}\phi(s,a)},$$

where $\beta$ contains a polynomial in $H$ and $d$, plus other constants and log terms ($\beta = \tilde{O}(Hd)$). Again, to gain intuition, think about what this bonus looks like when we specialize linear MDPs to tabular MDPs.
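
As a small sketch (the function name is mine), the bonus is just the confidence width from LinUCB evaluated at $\phi(s,a)$. For the tabular specialization: when $\phi$ is one-hot and $\lambda = 1$, we have $\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a) = 1/(N_h^n(s,a)+1)$, so the bonus becomes $\beta/\sqrt{N_h^n(s,a)+1}$, mirroring the familiar $1/\sqrt{N}$-style tabular bonus.

```python
import numpy as np

def bonus(phi_sa, Lam, beta):
    """b_h^n(s,a) = beta * sqrt( phi(s,a)^T (Lambda_h^n)^{-1} phi(s,a) )."""
    return beta * np.sqrt(phi_sa @ np.linalg.solve(Lam, phi_sa))
```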

Algorithm 6 UCBVI for Linear MDPs

  1. Input: parameters $\beta, \lambda$

  2. for $n = 1 \to N$ do

  3. Compute $\widehat{P}_h^n$ for all $h$ (via the ridge-regression closed form above)

  4. Compute the reward bonus $b_h^n$ for all $h$ (as defined above)

  5. Run Value Iteration on $\{\widehat{P}_h^n, r_h + b_h^n\}_{h=0}^{H-1}$ (the truncated VI below)

  6. Set $\pi^n$ as the returned policy of VI.

  7. end for

With the above setup, we now describe the algorithm. In every episode $n$, we learn the model $\widehat{\mu}_h^n$ via ridge linear regression. We then form the (quadratic) reward bonus defined above. With that, we perform the following truncated Value Iteration (always truncating the value at $H$):

$$\begin{aligned}
&\widehat{V}_H^n(s) = 0, \quad \forall s, \\
&\widehat{Q}_h^n(s,a) = \theta_h^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n = \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \left(\theta_h + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right)^\top\phi(s,a), \\
&\widehat{V}_h^n(s) = \min\left\{\max_a\widehat{Q}_h^n(s,a),\ H\right\}, \qquad \pi_h^n(s) = \arg\max_a\widehat{Q}_h^n(s,a).
\end{aligned}$$

Note that $\widehat{Q}_h^n$ above contains two components: a quadratic (bonus) component and a linear component (with respect to $\phi(s,a)$). And $\widehat{V}_h^n$ has the form of $f_{w,\beta,\Lambda}$ defined earlier.
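
Below is a rough, self-contained sketch of this truncated value iteration, under my own illustrative interface assumptions (`data[h]` holds past `(phi(s,a), s_next)` pairs at step $h$, `theta[h]` is the known reward parameter, `phi` is the known feature map, and `actions` is a finite action set); it is a sketch, not the note's reference implementation.

```python
import numpy as np

def ucbvi_linear_plan(data, theta, phi, actions, H, beta, lam=1.0):
    """Truncated value iteration for UCBVI in linear MDPs (sketch).

    data[h] : list of (phi_sa, s_next) pairs from episodes 0..n-1 at step h
    theta[h]: known reward parameter theta_h, shape (d,)
    Returns w[h] and Lam[h] defining
        Q_hat_h(s,a) = w[h] @ phi(s,a) + beta * sqrt(phi(s,a)^T Lam[h]^{-1} phi(s,a)).
    """
    d = len(theta[0])
    w = [None] * H
    Lam = [None] * H

    def V_hat(h, s):
        if h == H:                                   # V_hat_H = 0
            return 0.0
        q = [w[h] @ phi(s, a)
             + beta * np.sqrt(phi(s, a) @ np.linalg.solve(Lam[h], phi(s, a)))
             for a in actions]
        return min(max(q), H)                        # truncate at H

    for h in reversed(range(H)):                     # h = H-1, ..., 0
        Phis = np.array([x for (x, _) in data[h]]).reshape(-1, d)
        Lam[h] = lam * np.eye(d) + Phis.T @ Phis
        # (mu_hat_h^n)^T V_hat_{h+1} = (Lambda_h^n)^{-1} sum_i phi_i V_hat_{h+1}(s'_i)
        v_next = np.array([V_hat(h + 1, s_next) for (_, s_next) in data[h]])
        mu_V = np.linalg.solve(Lam[h], Phis.T @ v_next) if len(data[h]) else np.zeros(d)
        w[h] = theta[h] + mu_V
    return w, Lam
```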

Now we show that $\widehat{V}_h^n$ falls into a class of the form $\mathcal{F}$:

Lemma 8.8. Assume $\beta\in[0,B]$. For all $n,h$, $\widehat{V}_h^n$ has the form of $f_{w,\beta,\Lambda}$ defined earlier, and $\widehat{V}_h^n$ falls into the following class:

$$\mathcal{V} = \left\{f_{w,\beta,\Lambda} : \|w\|_2 \le W + \frac{HN}{\lambda},\ \beta\in[0,B],\ \sigma_{\min}(\Lambda)\ge\lambda\right\}$$

Proof: We just need to show that $\theta_h + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n$ has bounded $\ell_2$ norm. This is easy to show since we always have $\|\widehat{V}_{h+1}^n\|_\infty \le H$, due to the truncation in Value Iteration:

$$\left\|\theta_h + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right\|_2 \le W + \left\|(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right\|_2$$

Remarks: This uses the triangle inequality and assumption (3) ($\|\theta_h\|_2 \le W$).

Now we use the closed form of $\widehat{\mu}_h^n$ derived above:

$$\left\|(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right\|_2 = \left\|\sum_{i=0}^{n-1}\widehat{V}_{h+1}^n(s_{h+1}^i)\,(\Lambda_h^n)^{-1}\phi(s_h^i,a_h^i)\right\|_2 \le H\left\|(\Lambda_h^n)^{-1}\right\|_2\sum_{i=0}^{n-1}\left\|\phi(s_h^i,a_h^i)\right\|_2 \le \frac{Hn}{\lambda},$$

where we use the facts that $\|\widehat{V}_{h+1}^n\|_\infty \le H$, $\sigma_{\max}\left((\Lambda_h^n)^{-1}\right) \le 1/\lambda$, and $\sup_{s,a}\|\phi(s,a)\|_2 \le 1$.


Analysis of UCBVI for Linear MDPs

We finally arrive at the regret bound!!

In this section, we prove the following regret bound for UCBVI.

Theorem 8.9 (Regret Bound). Set $\beta = \tilde{O}(Hd)$ and $\lambda = 1$. UCBVI (Algorithm 6) achieves the following regret bound:

$$\mathbb{E}\left[N V^* - \sum_{n=1}^{N} V^{\pi^n}\right] \le \tilde{O}\left(H^2\sqrt{d^3 N}\right)$$

The main steps of the proof are similar to those for UCBVI in tabular MDPs. We first prove optimism via induction, then use optimism to upper bound the per-episode regret, and finally use the simulation lemma to decompose the per-episode regret. To keep notation simple, we set $\lambda = 1$ throughout this section.

Proving Optimism

Here we use a different $\beta$, since the function class is slightly different from the one used in Lemma 8.7 (see page 93 of the book). We denote the event of Lemma 8.7 by $\mathcal{E}_{model}$.

Lemma 8.10 (Optimism). Assume the event $\mathcal{E}_{model}$ holds. Then for all $n$ and $h$,

$$\widehat{V}_h^n(s) \ge V_h^*(s), \quad \forall s.$$

Proof: Consider a fixed episode $n$. We prove the claim by induction on $h$; the base case $h = H$ is trivial since $\widehat{V}_H^n = V_H^* = 0$. Assume that $\widehat{V}_{h+1}^n(s) \ge V_{h+1}^*(s)$ for all $s$. For time step $h$, we have:

$$\begin{aligned}
\widehat{Q}_h^n(s,a) - Q_h^*(s,a) &= \theta_h^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n - \theta_h^\top\phi(s,a) - \phi(s,a)^\top\mu_h^\top V_{h+1}^* \\
&\ge \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top\left(\widehat{\mu}_h^n - \mu_h\right)^\top\widehat{V}_{h+1}^n,
\end{aligned}$$

where in the last inequality we use the inductive hypothesis $\widehat{V}_{h+1}^n(s) \ge V_{h+1}^*(s)$ and the fact that $\mu_h\phi(s,a)$ is a valid distribution (note that $\widehat{\mu}_h^n\phi(s,a)$ is not necessarily a valid distribution). We need to show that the bonus is big enough to offset the model error $\phi(s,a)^\top(\widehat{\mu}_h^n - \mu_h)^\top\widehat{V}_{h+1}^n$. Since the event $\mathcal{E}_{model}$ holds, we have that:

$$\left|\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot\widehat{V}_{h+1}^n\right| \le \beta\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}},$$

as by the construction of $\mathcal{V}$, we know that $\widehat{V}_{h+1}^n \in \mathcal{V}$. This shows $\widehat{Q}_h^n(s,a) \ge Q_h^*(s,a)$ for all $(s,a)$; since $V_h^*(s) \le H$, the truncation preserves this, i.e., $\widehat{V}_h^n(s) = \min\{\max_a\widehat{Q}_h^n(s,a), H\} \ge V_h^*(s)$. This concludes the proof.


Regret Decomposition

Now we can upper bound the per-episode regret as follows:

$$V_0^*(s_0) - V_0^{\pi^n}(s_0) \le \widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \qquad \text{(by optimism, Lemma 8.10)}.$$

We can further bound the RHS of the above inequality using the simulation lemma. Recall the decomposition we derived in the Chapter 7 note on tabular MDPs:

$$\widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \le \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\left[b_h^n(s,a) + \left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot\widehat{V}_{h+1}^n\right].$$

(Recall that the simulation lemma holds for any MDP!)

Under the event $\mathcal{E}_{model}$, we already know that for any $s,a,h,n$, we have $\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot\widehat{V}_{h+1}^n \le \beta\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}} = b_h^n(s,a)$. Hence, under $\mathcal{E}_{model}$, we have:

$$\widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \le \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\left[2b_h^n(s,a)\right] \lesssim \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\left[b_h^n(s,a)\right]$$

Summing over all episodes, we obtain the following statement.

Lemma 8.11 (Regret Bound). Assume the event $\mathcal{E}_{model}$ holds. We have:

$$\sum_{n=0}^{N-1}\left(V_0^*(s_0) - V_0^{\pi^n}(s_0)\right) \lesssim \sum_{n=0}^{N-1}\sum_{h=0}^{H-1}\mathbb{E}_{s_h^n,a_h^n\sim d_h^{\pi^n}}\left[b_h^n(s_h^n,a_h^n)\right]$$

Conclusion

Lemma 8.12 (Elliptical Potential). Consider an arbitrary sequence of state-action pairs $\{s_h^i, a_h^i\}$. Assume $\sup_{s,a}\|\phi(s,a)\|_2 \le 1$. Denote $\Lambda_h^n = I + \sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top$. We have:

$$\sum_{i=0}^{N-1}\phi(s_h^i,a_h^i)^\top\left(\Lambda_h^i\right)^{-1}\phi(s_h^i,a_h^i) \le 2\ln\left(\frac{\det(\Lambda_h^N)}{\det(I)}\right) \le 2d\ln(N).$$

Proof: By Lemmas 3.7 and 3.8 in the linear bandit lecture note,

$$\sum_{i=0}^{N-1}\phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i) \le 2\sum_{i=0}^{N-1}\ln\left(1 + \phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i)\right) \le 2\ln\left(\frac{\det(\Lambda_h^N)}{\det(I)}\right) \le 2d\ln(N)$$

where the first inequality uses the fact that $\ln(1+y) \ge y/2$ for $0 \le y \le 1$.

Remarks: We denote $\phi(s_h^i,a_h^i)$ by $\phi_i$ for simplicity. Here we use:

$$\begin{aligned}
\det(\Lambda_h^n) &= \det\left(\Lambda_h^{n-1} + \phi_{n-1}\phi_{n-1}^\top\right) \\
&= \det\left((\Lambda_h^{n-1})^{\frac12}\left(I + (\Lambda_h^{n-1})^{-\frac12}\phi_{n-1}\phi_{n-1}^\top(\Lambda_h^{n-1})^{-\frac12}\right)(\Lambda_h^{n-1})^{\frac12}\right) \\
&= \det(\Lambda_h^{n-1})\det\left(I + (\Lambda_h^{n-1})^{-\frac12}\phi_{n-1}\left((\Lambda_h^{n-1})^{-\frac12}\phi_{n-1}\right)^\top\right) \\
&= \det(\Lambda_h^{n-1})\left(1 + \phi_{n-1}^\top(\Lambda_h^{n-1})^{-1}\phi_{n-1}\right),
\end{aligned}$$

where we use $\det(I_p + AB) = \det(I_q + BA)$ in the last equality. Therefore, we have

$$\prod_{i=0}^{N-1}\left(1 + \phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i)\right) = \prod_{i=0}^{N-1}\frac{\det(\Lambda_h^{i+1})}{\det(\Lambda_h^i)} = \frac{\det(\Lambda_h^N)}{\det(I)},
$$

which gives the second inequality (after taking logs).
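
A quick numerical sanity check of the elliptical potential bound (purely illustrative; the random unit-norm features below are my own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 2000
Lam = np.eye(d)                                # Lambda_h^0 = I   (lambda = 1)
total = 0.0
for _ in range(N):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                     # sup ||phi||_2 <= 1
    total += x @ np.linalg.solve(Lam, x)       # phi_i^T (Lambda_h^i)^{-1} phi_i
    Lam += np.outer(x, x)                      # Lambda_h^{i+1}
print(total, 2 * np.log(np.linalg.det(Lam)), 2 * d * np.log(N))
# The running sum stays below 2*ln det(Lambda_h^N), which in turn is below 2*d*ln(N) here.
```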

We will now use Lemma 8.11 together with the elliptical potential lemma above to conclude the proof of the regret bound.


Now we give the proof of the regret bound.


Proof: [Proof of Theorem 8.9] We split the expected regret based on the event $\mathcal{E}_{model}$.

$$\begin{aligned}
\mathbb{E}\left[NV^* - \sum_{n=1}^{N}V^{\pi^n}\right] &= \mathbb{E}\left[\mathbb{1}\{\mathcal{E}_{model}\text{ holds}\}\left(NV^* - \sum_{n=1}^{N}V^{\pi^n}\right)\right] + \mathbb{E}\left[\mathbb{1}\{\mathcal{E}_{model}\text{ doesn't hold}\}\left(NV^* - \sum_{n=1}^{N}V^{\pi^n}\right)\right] \\
&\le \mathbb{E}\left[\mathbb{1}\{\mathcal{E}_{model}\text{ holds}\}\left(NV^* - \sum_{n=1}^{N}V^{\pi^n}\right)\right] + \delta NH \\
&\lesssim \mathbb{E}\left[\sum_{n=1}^{N}\sum_{h=0}^{H-1}b_h^n(s_h^n,a_h^n)\right] + \delta NH,
\end{aligned}$$

where the second step uses $NV^* - \sum_n V^{\pi^n} \le NH$ together with $\Pr[\mathcal{E}_{model}\text{ fails}] \le \delta$, and the last step uses optimism (Lemma 8.10) and Lemma 8.11.

Note that:

$$\begin{aligned}
\sum_{n=1}^{N}\sum_{h=0}^{H-1}b_h^n(s_h^n,a_h^n) &= \beta\sum_{n=1}^{N}\sum_{h=0}^{H-1}\sqrt{\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \\
&= \beta\sum_{h=0}^{H-1}\sum_{n=1}^{N}\sqrt{\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \\
&\le \beta\sum_{h=0}^{H-1}\sqrt{N\sum_{n=1}^{N}\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \\
&\lesssim \beta H\sqrt{Nd\ln(N)},
\end{aligned}$$

where the third step is Cauchy–Schwarz and the last step uses the elliptical potential lemma (Lemma 8.12). Recalling that $\beta = \tilde{O}(Hd)$, and choosing $\delta$ small (e.g., $\delta = 1/(NH)$, so that the $\delta NH$ term is a constant), the total expected regret is $\tilde{O}\left(\beta H\sqrt{dN}\right) = \tilde{O}\left(H^2 d\sqrt{dN}\right)$. This concludes the proof.


Remarks: We have thus derived a regret bound of $\tilde{O}(H^2 d\sqrt{dN}) = \tilde{O}(H^2\sqrt{d^3 N})$, with no dependence on $|S|$ or $|A|$!