Overview

These are my notes for Chapter 6 of Reinforcement Learning Theory.

Multi-Armed & Linear Bandits

$M = \{s_0, \{a_1, \ldots, a_K\}, H = 1, R\}$

Why is it important?

When $\gamma = 0$ or $H = 1$, the MDP problem reduces to the multi-armed bandit problem.

Note that the reward is stochastic in this chapter.

The K-Armed Bandit Problem

The setting is one where we have $K$ decisions (the "arms"): when we play arm $a \in \{1, 2, \ldots, K\}$, we obtain a random reward $r_a \in [-1, 1]$ drawn from $R(a) \in \Delta([-1, 1])$, which has mean reward:

$\mathbb{E}_{r_a \sim R(a)}[r_a] = \mu_a,$

where it is easy to see that $\mu_a \in [-1, 1]$.

More formally, we have the following interactive learning process:

For $t = 0, \ldots, T-1$:

  1. Learner pulls arm $I_t \in \{1, \ldots, K\}$ (based on historical information)

  2. Learner observes an i.i.d. reward $r_t \sim R(I_t)$ of arm $I_t$

Objective:

$R_T = T\max_i \mu_i - \sum_{t=0}^{T-1} \mu_{I_t}$

Goal: no-regret, i.e., $R_T / T \to 0$ as $T \to \infty$.

We denote $a^\star = \arg\max_i \mu_i$ as the optimal arm. We define the gap $\Delta_a = \mu_{a^\star} - \mu_a$ for any arm $a$.

Theorem 6.1. There exists an algorithm such that with probability at least $1 - \delta$, we have

$R_T = O\left(\min\left\{\sqrt{KT\ln(TK/\delta)},\ \sum_{a \neq a^\star} \frac{\ln(TK/\delta)}{\Delta_a}\right\} + K\right).$

Proof: shown later in P65, after Lemma 6.2.

Note that $\mu_{a^\star} \le \mathrm{UCB}(a^\star) \le \mathrm{UCB}(I_t)$.


Remarks: This is $\tilde{O}(\sqrt{KT})$.

Some attempts

For ease of analysis, here we assume the reward is in $[0, 1]$.

  1. Greedy algo: try each arm once and then keep pulling the arm with the highest observed reward. This algo can incur constant average regret ($R_T/T \ge C$) if the optimal arm does not look best after the first round.

  2. Explore and Commit algo (see the Python sketch after this list):

    For $k = 1, \ldots, K$:

    Pull arm $k$ $N$ times, observe $\{r_i\}_{i=1}^N \sim R(k)$

    Calculate arm $k$'s empirical mean: $\hat{\mu}_k = \sum_{i=1}^N r_i / N$

    For $t = NK, \ldots, T-1$:

    Pull the best empirical arm, $I_t = \arg\max_{i \in [K]} \hat{\mu}_i$
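
To make the pseudocode concrete, here is a minimal Python sketch of Explore and Commit. It assumes a hypothetical list `reward_fns` of sampling functions (one per arm, each returning a reward in $[0,1]$); this is only an illustration of the pseudocode above, not code from the book.

```python
import numpy as np

def explore_and_commit(reward_fns, T, N):
    """Explore-and-Commit sketch: pull each of the K arms N times,
    then commit to the empirically best arm for the remaining rounds."""
    K = len(reward_fns)
    rewards = []

    # Exploration phase: N pulls per arm, record empirical means.
    mu_hat = np.zeros(K)
    for k in range(K):
        samples = [reward_fns[k]() for _ in range(N)]
        mu_hat[k] = np.mean(samples)
        rewards.extend(samples)

    # Commit phase: always pull the best empirical arm.
    best = int(np.argmax(mu_hat))
    for _ in range(T - N * K):
        rewards.append(reward_fns[best]())

    return best, rewards
```

For example, two Bernoulli arms could be passed as `reward_fns = [lambda p=p: float(np.random.binomial(1, p)) for p in (0.3, 0.7)]`.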

Some light analysis: Hoeffding inequality (not very sharp but good for a sum of bounded r.v.)

Given a distribution $\mu \in \Delta([0, 1])$ and $N$ i.i.d. samples $\{r_i\}_{i=1}^N \sim \mu$, with probability at least $1 - \delta$ we have:

$\left|\sum_{i=1}^N r_i / N - \mu\right| \le O\left(\sqrt{\frac{\ln(1/\delta)}{N}}\right) \tag{1}$

Remarks: A sharper analysis is possible, yet Hoeffding is simple and straightforward in giving a confidence interval $\left(\hat{\mu} - \sqrt{\ln(1/\delta)/N},\ \hat{\mu} + \sqrt{\ln(1/\delta)/N}\right)$.

After the exploration phase, with probability at least $1 - \delta$, for all arms $k \in [K]$ we have:

$|\hat{\mu}_k - \mu_k| \le O\left(\sqrt{\frac{\ln(K/\delta)}{N}}\right)$

Remarks: By the union bound, with probability at least $1 - K\delta$, Equation (1) holds for all $k$. Substituting $K\delta \to \delta$ (i.e., replacing $\delta$ by $\delta/K$), we obtain the above statement.

Thus we have

$R_T = R_{\text{explore}} + R_{\text{exploit}},$

where $R_{\text{explore}} \le NK$ and $R_{\text{exploit}} \le (T - NK)(\mu_I - \mu_{\hat{I}})$, in which $I$ denotes the best arm and $\hat{I}$ denotes the best empirical arm.

Thus we only need to bound the second part:

$\mu_I - \mu_{\hat{I}} \le \left[\hat{\mu}_I + \sqrt{\ln(K/\delta)/N}\right] - \left[\hat{\mu}_{\hat{I}} - \sqrt{\ln(K/\delta)/N}\right] = \hat{\mu}_I - \hat{\mu}_{\hat{I}} + 2\sqrt{\ln(K/\delta)/N} \le 2\sqrt{\ln(K/\delta)/N},$

where the last inequality is due to $\hat{\mu}_{\hat{I}} \ge \hat{\mu}_i$ for all $i$.

Now we have

$R_T \le NK + 2T\sqrt{\frac{\ln(K/\delta)}{N}},$

where $N$ is a hyperparameter. Thus we minimize w.r.t. $N$ and obtain

$R_T \le O\left(T^{2/3} K^{1/3} \ln^{1/3}(K/\delta)\right)$

when $N$ is set as $\left(\frac{T\sqrt{\ln(K/\delta)}}{2K}\right)^{2/3}$.
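
To see where this choice of $N$ comes from (up to constant factors, which do not affect the order of the bound): treating $N$ as continuous and setting the derivative of the bound to zero,

$\frac{d}{dN}\left[NK + 2T\sqrt{\frac{\ln(K/\delta)}{N}}\right] = K - T\sqrt{\ln(K/\delta)}\, N^{-3/2} = 0 \;\Longrightarrow\; N = \left(\frac{T\sqrt{\ln(K/\delta)}}{K}\right)^{2/3},$

and plugging this back in makes both terms of order $T^{2/3} K^{1/3} \ln^{1/3}(K/\delta)$.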

This is not good enough, so can we get a $\sqrt{T}$ bound?

The Upper Confidence Bound (UCB) Algorithm

We will give the algo first and then dive into the analysis and the reason for the name.

Play each arm once and denote the received reward as $r_a$ for all $a \in \{1, 2, \ldots, K\}$.

for $t = 1, \ldots, T - K$:

Execute arm $I_t = \arg\max_{i \in [K]} \left(\hat{\mu}_i^t + \sqrt{\frac{\ln(TK/\delta)}{N_i^t}}\right)$

Observe $r_t := r_{I_t}$

where $N_a^t$ and $\hat{\mu}_a^t$ are random variables defined as

$N_a^t = 1 + \sum_{i=1}^{t-1} \mathbb{I}\{I_i = a\}$
$\hat{\mu}_a^t = \frac{1}{N_a^t}\left(r_a + \sum_{i=1}^{t-1} \mathbb{I}\{I_i = a\}\, r_i\right)$

which denote the count and empirical mean of each arm.
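
As a sanity check on the notation, here is a minimal Python sketch of this UCB rule, again assuming hypothetical per-arm sampling functions `reward_fns`; it mirrors the pseudocode above rather than any reference implementation.

```python
import numpy as np

def ucb(reward_fns, T, delta):
    """UCB sketch: play each arm once, then repeatedly pull the arm with the
    largest empirical mean plus bonus sqrt(ln(T*K/delta) / N_a)."""
    K = len(reward_fns)
    counts = np.ones(K)                      # N_a: one initial pull per arm
    sums = np.array([reward_fns[a]() for a in range(K)], dtype=float)
    history = list(sums)

    log_term = np.log(T * K / delta)
    for _ in range(T - K):
        scores = sums / counts + np.sqrt(log_term / counts)  # upper confidence bounds
        a = int(np.argmax(scores))           # arm with the highest UCB
        r = reward_fns[a]()
        sums[a] += r
        counts[a] += 1
        history.append(r)

    return history
```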

Just like the Explore and Commit algo, we also construct the confidence interval as follows:

Lemma 6.2 (Upper Confidence Bound). For all $t \in \{1, \ldots, T\}$ and $a \in \{1, 2, \ldots, K\}$, we have that with probability at least $1 - \delta$,

$|\hat{\mu}_a^t - \mu_a| \le 2\sqrt{\frac{\ln(TK/\delta)}{N_a^t}}.$

Proof at P64 of the book.

Remarks: Here we have to use the martingale version of Hoeffding's inequality, i.e., the Azuma-Hoeffding inequality (here I use the version from Wikipedia):

$P\left(|X_N - X_0| \ge \epsilon\right) \le 2\exp\left(\frac{-\epsilon^2}{2\sum_{k=1}^N c_k^2}\right)$

Notice that $c_k = \mathbb{I}\{I_k = a\}$, so $\sum_{k=0}^{t-1} c_k^2 = N_a^t$, and we obtain the version used in P64 by some simple calculation.

Directly checking appendix A.5 is more straightforward though.
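
To spell out that "simple calculation" a bit (a rough sketch, assuming as in the previous section that rewards lie in $[0,1]$, so each martingale increment is bounded by 1): for a fixed arm $a$ and time $t$, Azuma-Hoeffding with failure probability $\delta/(TK)$ gives $N_a^t\,|\hat{\mu}_a^t - \mu_a| \le \sqrt{2 N_a^t \ln(2TK/\delta)}$, i.e.

$|\hat{\mu}_a^t - \mu_a| \le \sqrt{\frac{2\ln(2TK/\delta)}{N_a^t}} \le 2\sqrt{\frac{\ln(TK/\delta)}{N_a^t}},$

where the last step uses $2\ln(2TK/\delta) \le 4\ln(TK/\delta)$ whenever $TK/\delta \ge 2$; a union bound over all $t \le T$ and $a \in [K]$ then gives Lemma 6.2.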


Remarks: Now it should be clear why this is called UCB, since it chooses the arm with the highest upper confidence bound every iteration.

What is the intuition behind UCB?

Remarks: When an arm's UCB value is large, there are two possible cases. Case 1: its uncertainty is high, so it needs exploration. Case 2: its uncertainty is low, so it is simply a good arm. Either way, pulling it is reasonable.

Linear Bandits: Handling Large Action Spaces

Assumptions for larger action spaces:

Let $D \subset \mathbb{R}^d$ be a compact (but otherwise arbitrary) set of decisions. On each round, we must choose a decision $x_t \in D$. Each such choice results in a reward $r_t \in [-1, 1]$.

Now we assume linearity:

$\mathbb{E}[r_t \mid x_t = x] = \mu^\top x \in [-1, 1],$

and define the noise sequence as:

$\eta_t = r_t - \mu^\top x_t,$

which is a martingale difference sequence.

Remarks: The proof is simple, since $\mathbb{E}[\eta_{t+1} \mid x_{t+1}] = \mathbb{E}\left[r_{t+1} - \mathbb{E}[r_{t+1} \mid x_{t+1}] \,\middle|\, x_{t+1}\right] = 0$.

If $x_0, \ldots, x_{T-1}$ are the decisions made in the game, then define the cumulative regret by

$R_T = T\,\mu^\top x^\star - \sum_{t=0}^{T-1} \mu^\top x_t$

where $x^\star \in D$ is an optimal decision for $\mu$, i.e.

$x^\star \in \arg\max_{x \in D} \mu^\top x.$

Remarks: Note that $x^\star$ exists because $D$ is compact in our assumption, so the continuous function $x \mapsto \mu^\top x$ attains its maximum on $D$.

Remarks: It might be hard to directly interpret this model as a linear model, though there exists some linearity in the conditional expectation. Personally, this formulation in 1.2 seems more natural as it is just linear regression in some sense.

 

LinUCB is based on "optimism in the face of uncertainty". At episode $t$, we use all previous experience to define an uncertainty region (an ellipsoid) $\mathrm{BALL}_t$. The center of this region, $\hat{\mu}_t$, is the solution of the following regularized least squares problem:

$\hat{\mu}_t = \arg\min_{\mu} \sum_{\tau=0}^{t-1} \left\|\mu^\top x_\tau - r_\tau\right\|_2^2 + \lambda\|\mu\|_2^2 = \Sigma_t^{-1} \sum_{\tau=0}^{t-1} r_\tau x_\tau,$

where $\lambda$ is a parameter and where

$\Sigma_t = \lambda I + \sum_{\tau=0}^{t-1} x_\tau x_\tau^\top, \quad \text{with } \Sigma_0 = \lambda I.$

The shape of the region $\mathrm{BALL}_t$ is defined through the feature covariance $\Sigma_t$. More precisely, we can define the uncertainty ball as

$\mathrm{BALL}_t = \left\{\mu \,\middle|\, (\hat{\mu}_t - \mu)^\top \Sigma_t (\hat{\mu}_t - \mu) \le \beta_t\right\},$

where $\beta_t$ is a hyperparameter.

Now we present the Linear UCB algo:

Input: $\lambda$, $\beta_t$

for $t = 0, 1, \ldots$ do

Execute $x_t = \arg\max_{x \in D} \max_{\mu \in \mathrm{BALL}_t} \mu^\top x$ and observe the reward $r_t$.

Update $\mathrm{BALL}_{t+1}$.
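
Here is a minimal Python sketch of LinUCB for a finite decision set (names such as `decisions`, `reward_fn`, `lam`, and `beta` are my own, and $\beta_t$ is taken to be a constant for simplicity). For a fixed $x$, the inner maximization over the ellipsoid has the closed form $\max_{\mu \in \mathrm{BALL}_t} \mu^\top x = \hat{\mu}_t^\top x + \sqrt{\beta_t}\,\sqrt{x^\top \Sigma_t^{-1} x}$, which is what the code uses; for a general compact $D$ the argmax step can be much harder (see the remark at the end of this section).

```python
import numpy as np

def lin_ucb(decisions, reward_fn, T, lam, beta):
    """LinUCB sketch for a finite set of decisions (rows of `decisions`).

    Maintains Sigma_t = lam*I + sum x_tau x_tau^T and the ridge solution
    mu_hat_t = Sigma_t^{-1} sum r_tau x_tau, then picks the decision with the
    largest optimistic score mu_hat_t^T x + sqrt(beta) * ||x||_{Sigma_t^{-1}}."""
    d = decisions.shape[1]
    Sigma = lam * np.eye(d)          # Sigma_0 = lambda * I
    b = np.zeros(d)                  # running sum of r_tau * x_tau
    chosen, rewards = [], []

    for _ in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b       # regularized least-squares center
        widths = np.sqrt(np.einsum("ij,jk,ik->i", decisions, Sigma_inv, decisions))
        scores = decisions @ mu_hat + np.sqrt(beta) * widths
        x = decisions[int(np.argmax(scores))]
        r = reward_fn(x)

        # Update the covariance and least-squares statistics for BALL_{t+1}.
        Sigma += np.outer(x, x)
        b += r * x
        chosen.append(x)
        rewards.append(r)

    return chosen, rewards
```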

Remarks: Though the book gives the bound results first, I suppose it is more natural to first see why the ball is an uncertainty region, so I moved this part here.

Theoretical results: valid confidence ball

Proposition 6.6 (Confidence). Let $\delta > 0$. We have that

$\Pr\left(\forall t,\ \mu \in \mathrm{BALL}_t\right) \ge 1 - \delta.$

Proof: see the next section.


Theoretical results: upper and lower bounds

Theorem 6.3. Suppose that the expected rewards are bounded, in magnitude, by 1, i.e. that $|\mu^\top x| \le 1$ for all $x \in D$; that $\|\mu\| \le W$ and $\|x\| \le B$ for all $x \in D$; and that the noise $\eta_t$ is $\sigma^2$ sub-Gaussian. Set

$\lambda = \sigma^2 / W^2, \qquad \beta_t := \sigma^2\left(2 + 4d\log\left(1 + \frac{tB^2W^2}{d}\right) + 8\log(4/\delta)\right).$

We have, with probability greater than $1 - \delta$, that (simultaneously) for all $T \ge 0$,

$R_T \le c\,\sigma\sqrt{T}\left(d\log\left(1 + \frac{TB^2W^2}{d\sigma^2}\right) + \log(4/\delta)\right)$

where $c$ is an absolute constant. In other words, we have that $R_T$ is $\tilde{O}(d\sqrt{T})$ with high probability.


Proof at P69 of the book.


Remarks: Note that there is no dependence on $|D|$, which is a great property.

Theorem 6.4 (Lower bound). There exists a distribution over linear bandit problems (i.e., a distribution over $\mu$) with the rewards bounded in magnitude by 1 and $\sigma^2 \le 1$, such that for every (randomized) algorithm, we have for $T \ge \max\{256, d^2/16\}$,

$\mathbb{E}_{\mu}\,\mathbb{E}\left[R_T\right] \ge \frac{1}{2500}\, d\sqrt{T},$

where the inner expectation is with respect to randomness in the problem and the algorithm.


Proof not provided.


Remarks: This shows that LinUCB is minimax optimal up to logarithmic factors.

LinUCB Analysis

This section aims to understand the growth of $(\hat{\mu}_t - \mu)^\top \Sigma_t (\hat{\mu}_t - \mu)$. The first analysis is about regret, and the second one is about confidence.

Proposition 6.7 (Sum of Squares Regret Bound). Suppose that $\|x\| \le B$ for all $x \in D$. Suppose $\beta_t$ is increasing and that $\beta_t \ge 1$. For LinUCB, if $\mu \in \mathrm{BALL}_t$ for all $t$, then

$\sum_{t=0}^{T-1} \mathrm{regret}_t^2 \le 8\,\beta_T\, d\log\left(1 + \frac{TB^2}{d\lambda}\right).$

The proof in this section is somewhat tedious, so I leave it to the book and only write remarks here.

Remarks: Recap on ellipsoids/balls: if $A = A^\top \succ 0$, the set

$E = \{x \mid x^\top A x \le 1\}$

is an ellipsoid in $\mathbb{R}^n$, centered at 0.

The semi-axes are given by $s_i = \lambda_i^{-1/2} q_i$, where $\lambda_i$ and $q_i$ are the eigenvalues and eigenvectors of $A$.

Define the width in the direction of a unit vector $q$ as

$\sup_{z \in E} q^\top z - \inf_{z \in E} q^\top z = 2\left\|A^{-1/2} q\right\|_2.$

Thus the minimum and maximum widths are given by $2\lambda_{\max}(A)^{-1/2}$ and $2\lambda_{\min}(A)^{-1/2}$, respectively.
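
A quick justification of the width formula, which I find helpful: since $E = \{A^{-1/2}u \mid \|u\|_2 \le 1\}$, we have

$\sup_{z \in E} q^\top z = \sup_{\|u\|_2 \le 1} (A^{-1/2}q)^\top u = \|A^{-1/2}q\|_2,$

and by symmetry the infimum is the negative of this, giving the factor of 2. The extreme widths $2\lambda_{\max}(A)^{-1/2}$ and $2\lambda_{\min}(A)^{-1/2}$ then follow by taking unit $q$ along the corresponding eigenvectors.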

Remarks: Now we can see why $w_t := \sqrt{x_t^\top \Sigma_t^{-1} x_t}$ is the "normalized width" in the direction $x_t$.

Remarks: Typo at P69: it should be "maximizes".

Remarks: Though not indicated explicitly, we only need to substitute $\beta_t$ with the upper bound obtained in 6.3.2 to verify that $\mathrm{BALL}_t$ is indeed the uncertainty ball.

However, the exact computational method for the argmax step in LinUCB is not clear in general; in some cases, it can be NP-hard.