Overview

These are my notes for Chapter 6 of Reinforcement Learning Theory.

Multi-Armed & Linear Bandits

$M = \{s_0, \{a_1, \ldots, a_K\}, H = 1, R\}$

Why is it important?

When $\gamma = 0$ or $H = 1$, the MDP problem reduces to the multi-armed bandit problem.

Note that the reward is stochastic in this chapter.

The K-Armed Bandit Problem

The setting is one where we have $K$ decisions (the "arms"): when we play arm $a \in \{1, 2, \ldots, K\}$, we obtain a random reward $r_a \in [-1, 1]$ drawn from $R(a) \in \Delta([-1, 1])$, which has mean reward:

$\mathbb{E}_{r_a \sim R(a)}[r_a] = \mu_a,$

where it is easy to see that $\mu_a \in [-1, 1]$.

More formally, we have the following interactive learning process:

For $t = 0, \ldots, T-1$:

  1. Learner pulls arm $I_t \in \{1, \ldots, K\}$ (based on historical information)

  2. Learner observes an i.i.d. reward $r_t \sim R(I_t)$ of arm $I_t$

Objective:

$R_T = T\max_i \mu_i - \sum_{t=0}^{T-1} \mu_{I_t}$

Goal: no-regret, i.e., $R_T / T \to 0$ as $T \to \infty$.

We denote $a^\star = \arg\max_i \mu_i$ as the optimal arm. We define the gap $\Delta_a = \mu_{a^\star} - \mu_a$ for any arm $a$.

Theorem 6.1. There exists an algorithm such that with probability at least $1 - \delta$, we have

$R_T = O\left(\min\left\{\sqrt{KT\ln(TK/\delta)},\ \sum_{a \neq a^\star} \frac{\ln(TK/\delta)}{\Delta_a}\right\} + K\right).$

Proof: shown later in P65, after Lemma 6.2.

Note that $\mu_{a^\star} \le \mathrm{UCB}(a^\star) \le \mathrm{UCB}(I_t)$.


Remarks: This is $\tilde{O}(\sqrt{KT})$.

Some attempts

For ease of analysis, here we assume the reward is in $[0, 1]$.

  1. Greedy algo: try each arm once and then keep pulling the arm with the highest observed reward. This algo can incur constant average regret ($R_T/T \ge C$) if the optimal arm does not look best after the first round.

  2. Explore and Commit algo (see the Python sketch after this list):

    For $k = 1, \ldots, K$:

    Pull arm $k$ $N$ times, observe $\{r_i\}_{i=1}^N \sim R(k)$

    Calculate arm $k$'s empirical mean: $\hat{\mu}_k = \sum_{i=1}^N r_i / N$

    For $t = NK, \ldots, T-1$:

    Pull the best empirical arm, $I_t = \arg\max_{i \in [K]} \hat{\mu}_i$
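
To make the pseudocode concrete, here is a minimal Python sketch of Explore and Commit. It assumes a hypothetical list `reward_fns` of sampling functions (one per arm, each returning a reward in $[0,1]$); this is only an illustration of the pseudocode above, not code from the book.

```python
import numpy as np

def explore_and_commit(reward_fns, T, N):
    """Explore-and-Commit sketch: pull each of the K arms N times,
    then commit to the empirically best arm for the remaining rounds."""
    K = len(reward_fns)
    rewards = []

    # Exploration phase: N pulls per arm, record empirical means.
    mu_hat = np.zeros(K)
    for k in range(K):
        samples = [reward_fns[k]() for _ in range(N)]
        mu_hat[k] = np.mean(samples)
        rewards.extend(samples)

    # Commit phase: always pull the best empirical arm.
    best = int(np.argmax(mu_hat))
    for _ in range(T - N * K):
        rewards.append(reward_fns[best]())

    return best, rewards
```

For example, two Bernoulli arms could be passed as `reward_fns = [lambda p=p: float(np.random.binomial(1, p)) for p in (0.3, 0.7)]`.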

Some light analysis: Hoeffding inequality (not very sharp but good for a sum of bounded r.v.)

Given a distribution $\mu \in \Delta([0, 1])$ and $N$ i.i.d. samples $\{r_i\}_{i=1}^N \sim \mu$, with probability at least $1 - \delta$ we have:

$\left|\sum_{i=1}^N r_i / N - \mu\right| \le O\left(\sqrt{\frac{\ln(1/\delta)}{N}}\right) \tag{1}$

Remarks: A sharper analysis is possible, yet Hoeffding is simple and straightforward in giving a confidence interval $\left(\hat{\mu} - \sqrt{\ln(1/\delta)/N},\ \hat{\mu} + \sqrt{\ln(1/\delta)/N}\right)$.

After the exploration phase, with probability at least $1 - \delta$, for all arms $k \in [K]$ we have:

$|\hat{\mu}_k - \mu_k| \le O\left(\sqrt{\frac{\ln(K/\delta)}{N}}\right)$

Remarks: By the union bound, with probability at least $1 - K\delta$, Equation (1) holds for all $k$. Substituting $K\delta \to \delta$ (i.e., replacing $\delta$ by $\delta/K$), we obtain the above statement.

Thus we have

$R_T = R_{\text{explore}} + R_{\text{exploit}},$

where $R_{\text{explore}} \le NK$ and $R_{\text{exploit}} \le (T - NK)(\mu_I - \mu_{\hat{I}})$, in which $I$ denotes the best arm and $\hat{I}$ denotes the best empirical arm.

Thus we only need to bound the second part:

$\mu_I - \mu_{\hat{I}} \le \left[\hat{\mu}_I + \sqrt{\ln(K/\delta)/N}\right] - \left[\hat{\mu}_{\hat{I}} - \sqrt{\ln(K/\delta)/N}\right] = \hat{\mu}_I - \hat{\mu}_{\hat{I}} + 2\sqrt{\ln(K/\delta)/N} \le 2\sqrt{\ln(K/\delta)/N},$

where the last inequality is due to $\hat{\mu}_{\hat{I}} \ge \hat{\mu}_i$ for all $i$.

Now we have

$R_T \le NK + 2T\sqrt{\frac{\ln(K/\delta)}{N}},$

where $N$ is a hyperparameter. Thus we minimize w.r.t. $N$ and obtain

$R_T \le O\left(T^{2/3} K^{1/3} \ln^{1/3}(K/\delta)\right)$

when $N$ is set as $\left(\frac{T\sqrt{\ln(K/\delta)}}{2K}\right)^{2/3}$.
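
To see where this choice of $N$ comes from (up to constant factors, which do not affect the order of the bound): treating $N$ as continuous and setting the derivative of the bound to zero,

$\frac{d}{dN}\left[NK + 2T\sqrt{\frac{\ln(K/\delta)}{N}}\right] = K - T\sqrt{\ln(K/\delta)}\, N^{-3/2} = 0 \;\Longrightarrow\; N = \left(\frac{T\sqrt{\ln(K/\delta)}}{K}\right)^{2/3},$

and plugging this back in makes both terms of order $T^{2/3} K^{1/3} \ln^{1/3}(K/\delta)$.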

This is not good enough, so can we get a $\sqrt{T}$ bound?

The Upper Confidence Bound (UCB) Algorithm

We will give the algo first and then dive into the analysis and the reason for the name.

Play each arm once and denote the received reward as $r_a$ for all $a \in \{1, 2, \ldots, K\}$.

for $t = 1, \ldots, T - K$:

Execute arm $I_t = \arg\max_{i \in [K]} \left(\hat{\mu}_i^t + \sqrt{\frac{\ln(TK/\delta)}{N_i^t}}\right)$

Observe $r_t := r_{I_t}$

where $N_a^t$ and $\hat{\mu}_a^t$ are random variables defined as

$N_a^t = 1 + \sum_{i=1}^{t-1} \mathbb{I}\{I_i = a\}$
$\hat{\mu}_a^t = \frac{1}{N_a^t}\left(r_a + \sum_{i=1}^{t-1} \mathbb{I}\{I_i = a\}\, r_i\right)$

which denote the count and empirical mean of each arm.
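
As a sanity check on the notation, here is a minimal Python sketch of this UCB rule, again assuming hypothetical per-arm sampling functions `reward_fns`; it mirrors the pseudocode above rather than any reference implementation.

```python
import numpy as np

def ucb(reward_fns, T, delta):
    """UCB sketch: play each arm once, then repeatedly pull the arm with the
    largest empirical mean plus bonus sqrt(ln(T*K/delta) / N_a)."""
    K = len(reward_fns)
    counts = np.ones(K)                      # N_a: one initial pull per arm
    sums = np.array([reward_fns[a]() for a in range(K)], dtype=float)
    history = list(sums)

    log_term = np.log(T * K / delta)
    for _ in range(T - K):
        scores = sums / counts + np.sqrt(log_term / counts)  # upper confidence bounds
        a = int(np.argmax(scores))           # arm with the highest UCB
        r = reward_fns[a]()
        sums[a] += r
        counts[a] += 1
        history.append(r)

    return history
```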

Just like the Explore and Commit algo, we also construct the confidence interval as follows:

Lemma 6.2 (Upper Confidence Bound). For all $t \in \{1, \ldots, T\}$ and $a \in \{1, 2, \ldots, K\}$, we have that with probability at least $1 - \delta$,

$|\hat{\mu}_a^t - \mu_a| \le 2\sqrt{\frac{\ln(TK/\delta)}{N_a^t}}.$

Proof at P64 of the book.

Remarks: Here we have to use the martingale version of Hoeffding's inequality, i.e., the Azuma-Hoeffding inequality (here I use the version from Wikipedia):

$P\left(|X_N - X_0| \ge \epsilon\right) \le 2\exp\left(\frac{-\epsilon^2}{2\sum_{k=1}^N c_k^2}\right)$

Notice that $c_k = \mathbb{I}\{I_k = a\}$, so $\sum_{k=0}^{t-1} c_k^2 = N_a^t$, and we obtain the version used in P64 by some simple calculation.

Directly checking appendix A.5 is more straightforward though.
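
To spell out that "simple calculation" a bit (a rough sketch, assuming as in the previous section that rewards lie in $[0,1]$, so each martingale increment is bounded by 1): for a fixed arm $a$ and time $t$, Azuma-Hoeffding with failure probability $\delta/(TK)$ gives $N_a^t\,|\hat{\mu}_a^t - \mu_a| \le \sqrt{2 N_a^t \ln(2TK/\delta)}$, i.e.

$|\hat{\mu}_a^t - \mu_a| \le \sqrt{\frac{2\ln(2TK/\delta)}{N_a^t}} \le 2\sqrt{\frac{\ln(TK/\delta)}{N_a^t}},$

where the last step uses $2\ln(2TK/\delta) \le 4\ln(TK/\delta)$ whenever $TK/\delta \ge 2$; a union bound over all $t \le T$ and $a \in [K]$ then gives Lemma 6.2.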


Remarks: Now it should be clear why this is called UCB, since it chooses the arm with the highest upper confidence bound every iteration.

What is the intuition behind UCB?

Remarks: When an arm's UCB value is large, there are two possible cases. Case 1: its uncertainty is high, so it needs exploration. Case 2: its uncertainty is low, so it is simply a good arm. Either way, pulling it is reasonable.

Linear Bandits: Handling Large Action Spaces

Assumptions for larger action spaces:

Let $D \subset \mathbb{R}^d$ be a compact (but otherwise arbitrary) set of decisions. On each round, we must choose a decision $x_t \in D$. Each such choice results in a reward $r_t \in [-1, 1]$.

Now we assume linearity:

$\mathbb{E}[r_t \mid x_t = x] = \mu^\top x \in [-1, 1],$

and define the noise sequence as:

$\eta_t = r_t - \mu^\top x_t,$

which is a martingale difference sequence.

Remarks: The proof is simple, since $\mathbb{E}[\eta_{t+1} \mid x_{t+1}] = \mathbb{E}\left[r_{t+1} - \mathbb{E}[r_{t+1} \mid x_{t+1}] \,\middle|\, x_{t+1}\right] = 0$.

If $x_0, \ldots, x_{T-1}$ are the decisions made in the game, then define the cumulative regret by

$R_T = T\,\mu^\top x^\star - \sum_{t=0}^{T-1} \mu^\top x_t$

where $x^\star \in D$ is an optimal decision for $\mu$, i.e.

$x^\star \in \arg\max_{x \in D} \mu^\top x.$

Remarks: Note that $x^\star$ exists because $D$ is compact in our assumption, so the continuous function $x \mapsto \mu^\top x$ attains its maximum on $D$.

Remarks: It might be hard to directly interpret this model as a linear model, though there exists some linearity in the conditional expectation. Personally, this formulation in 1.2 seems more natural as it is just linear regression in some sense.

 

LinUCB is based on "optimism in the face of uncertainty". At episode $t$, we use all previous experience to define an uncertainty region (an ellipsoid) $\mathrm{BALL}_t$. The center of this region, $\hat{\mu}_t$, is the solution of the following regularized least squares problem:

$\hat{\mu}_t = \arg\min_{\mu} \sum_{\tau=0}^{t-1} \left\|\mu^\top x_\tau - r_\tau\right\|_2^2 + \lambda\|\mu\|_2^2 = \Sigma_t^{-1} \sum_{\tau=0}^{t-1} r_\tau x_\tau,$

where $\lambda$ is a parameter and where

$\Sigma_t = \lambda I + \sum_{\tau=0}^{t-1} x_\tau x_\tau^\top, \quad \text{with } \Sigma_0 = \lambda I.$

The shape of the region $\mathrm{BALL}_t$ is defined through the feature covariance $\Sigma_t$. More precisely, we can define the uncertainty ball as

$\mathrm{BALL}_t = \left\{\mu \,\middle|\, (\hat{\mu}_t - \mu)^\top \Sigma_t (\hat{\mu}_t - \mu) \le \beta_t\right\},$

where $\beta_t$ is a hyperparameter.

Now we present the Linear UCB algo:

Input: $\lambda$, $\beta_t$

for $t = 0, 1, \ldots$ do

Execute $x_t = \arg\max_{x \in D} \max_{\mu \in \mathrm{BALL}_t} \mu^\top x$ and observe the reward $r_t$.

Update $\mathrm{BALL}_{t+1}$.
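
Here is a minimal Python sketch of LinUCB for a finite decision set (names such as `decisions`, `reward_fn`, `lam`, and `beta` are my own, and $\beta_t$ is taken to be a constant for simplicity). For a fixed $x$, the inner maximization over the ellipsoid has the closed form $\max_{\mu \in \mathrm{BALL}_t} \mu^\top x = \hat{\mu}_t^\top x + \sqrt{\beta_t}\,\sqrt{x^\top \Sigma_t^{-1} x}$, which is what the code uses; for a general compact $D$ the argmax step can be much harder (see the remark at the end of this section).

```python
import numpy as np

def lin_ucb(decisions, reward_fn, T, lam, beta):
    """LinUCB sketch for a finite set of decisions (rows of `decisions`).

    Maintains Sigma_t = lam*I + sum x_tau x_tau^T and the ridge solution
    mu_hat_t = Sigma_t^{-1} sum r_tau x_tau, then picks the decision with the
    largest optimistic score mu_hat_t^T x + sqrt(beta) * ||x||_{Sigma_t^{-1}}."""
    d = decisions.shape[1]
    Sigma = lam * np.eye(d)          # Sigma_0 = lambda * I
    b = np.zeros(d)                  # running sum of r_tau * x_tau
    chosen, rewards = [], []

    for _ in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b       # regularized least-squares center
        widths = np.sqrt(np.einsum("ij,jk,ik->i", decisions, Sigma_inv, decisions))
        scores = decisions @ mu_hat + np.sqrt(beta) * widths
        x = decisions[int(np.argmax(scores))]
        r = reward_fn(x)

        # Update the covariance and least-squares statistics for BALL_{t+1}.
        Sigma += np.outer(x, x)
        b += r * x
        chosen.append(x)
        rewards.append(r)

    return chosen, rewards
```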

Remarks: Though the book gives the bound results first, I suppose it is more natural to first see why the ball is an uncertainty region, so I moved this part here.

Theoretical results: valid confidence ball

Proposition 6.6 (Confidence). Let $\delta > 0$. We have that

$\Pr\left(\forall t,\ \mu \in \mathrm{BALL}_t\right) \ge 1 - \delta.$

Proof: see the next section.


Theoretical results: upper and lower bounds

Theorem 6.3. Suppose that the expected rewards are bounded, in magnitude, by 1, i.e. that $|\mu^\top x| \le 1$ for all $x \in D$; that $\|\mu\| \le W$ and $\|x\| \le B$ for all $x \in D$; and that the noise $\eta_t$ is $\sigma^2$ sub-Gaussian. Set

$\lambda = \sigma^2 / W^2, \qquad \beta_t := \sigma^2\left(2 + 4d\log\left(1 + \frac{tB^2W^2}{d}\right) + 8\log(4/\delta)\right).$

We have, with probability greater than $1 - \delta$, that (simultaneously) for all $T \ge 0$,

$R_T \le c\,\sigma\sqrt{T}\left(d\log\left(1 + \frac{TB^2W^2}{d\sigma^2}\right) + \log(4/\delta)\right)$

where $c$ is an absolute constant. In other words, we have that $R_T$ is $\tilde{O}(d\sqrt{T})$ with high probability.


Proof at P69 of the book.


Remarks: Note that there is no dependence on $|D|$, which is a great property.

Theorem 6.4 (Lower bound). There exists a distribution over linear bandit problems (i.e., a distribution over $\mu$) with the rewards bounded in magnitude by 1 and $\sigma^2 \le 1$, such that for every (randomized) algorithm, we have for $T \ge \max\{256, d^2/16\}$,

$\mathbb{E}_{\mu}\,\mathbb{E}\left[R_T\right] \ge \frac{1}{2500}\, d\sqrt{T},$

where the inner expectation is with respect to randomness in the problem and the algorithm.


Proof not provided.


Remarks: This shows that LinUCB is minimax optimal up to logarithmic factors.

LinUCB Analysis

This section aims to understand the growth of $(\hat{\mu}_t - \mu)^\top \Sigma_t (\hat{\mu}_t - \mu)$. The first analysis is about regret, and the second one is about confidence.

Proposition 6.7 (Sum of Squares Regret Bound). Suppose that $\|x\| \le B$ for all $x \in D$. Suppose $\beta_t$ is increasing and that $\beta_t \ge 1$. For LinUCB, if $\mu \in \mathrm{BALL}_t$ for all $t$, then

$\sum_{t=0}^{T-1} \mathrm{regret}_t^2 \le 8\,\beta_T\, d\log\left(1 + \frac{TB^2}{d\lambda}\right).$

The proof in this section is somewhat tedious, so I leave it to the book and only write remarks here.

Remarks: Recap on ellipsoids/balls: if $A = A^\top \succ 0$, the set

$E = \{x \mid x^\top A x \le 1\}$

is an ellipsoid in $\mathbb{R}^n$, centered at 0.

The semi-axes are given by $s_i = \lambda_i^{-1/2} q_i$, where $\lambda_i$ and $q_i$ are the eigenvalues and eigenvectors of $A$.

Define the width in the direction of a unit vector $q$ as

$\sup_{z \in E} q^\top z - \inf_{z \in E} q^\top z = 2\left\|A^{-1/2} q\right\|_2.$

Thus the minimum and maximum widths are given by $2\lambda_{\max}(A)^{-1/2}$ and $2\lambda_{\min}(A)^{-1/2}$, respectively.
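
A quick justification of the width formula, which I find helpful: since $E = \{A^{-1/2}u \mid \|u\|_2 \le 1\}$, we have

$\sup_{z \in E} q^\top z = \sup_{\|u\|_2 \le 1} (A^{-1/2}q)^\top u = \|A^{-1/2}q\|_2,$

and by symmetry the infimum is the negative of this, giving the factor of 2. The extreme widths $2\lambda_{\max}(A)^{-1/2}$ and $2\lambda_{\min}(A)^{-1/2}$ then follow by taking unit $q$ along the corresponding eigenvectors.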

Remarks: Now we can see why $w_t := \sqrt{x_t^\top \Sigma_t^{-1} x_t}$ is the "normalized width" in the direction $x_t$.

Remarks: Typo at P69: it should be "maximizes".

Remarks: Though not indicated explicitly, we only need to substitute $\beta_t$ with the upper bound obtained in 6.3.2 to verify that $\mathrm{BALL}_t$ is indeed the uncertainty ball.

However, the exact computational method for the argmax step in LinUCB is not clear in general; in some cases, it can be NP-hard.