Contents:
- Overview
- Multi-Armed & Linear Bandits
- The K-Armed Bandit Problem
- Some attempts
- The Upper Confidence Bound (UCB) Algorithm
- Linear Bandits: Handling Large Action Spaces
- LinUCB Analysis
This is my note for Chapter 6 of Reinforcement Learning Theory.
Why is it important?
when
Note that the reward is stochastic in this chapter.
The setting is that we have $K$ decisions (the “arms”); when we play arm $a \in \{1,\dots,K\}$, we receive a stochastic reward drawn from a fixed distribution with mean $\mu(a)$,
where it is easy to see that the best policy in hindsight is to always pull the arm with the highest mean.
More formally, we have the following interactive learning process:
For $t = 0, 1, \dots, T-1$:
Learner pulls arm $a_t \in \{1, \dots, K\}$.
Learner observes an i.i.d. reward $r_t$ drawn from the distribution of arm $a_t$, with mean $\mu(a_t)$.
Objective: maximize the expected cumulative reward $\mathbb{E}\big[\sum_{t=0}^{T-1} r_t\big]$, or equivalently minimize the regret $R_T = T\mu^\star - \mathbb{E}\big[\sum_{t=0}^{T-1} r_t\big]$.
Goal: no-regret, i.e., $R_T / T \to 0$ as $T \to \infty$.
We denote $\mu^\star = \max_a \mu(a)$ and $a^\star \in \arg\max_a \mu(a)$.
Theorem 6.1. There exists an algorithm such that, with probability at least $1-\delta$, the regret satisfies $R_T \le O\big(\sqrt{KT\log(TK/\delta)}\big)$.
Proof: shown later in P65, after Lemma 6.2.
Note that
Remarks: This is
For ease of analysis, here we assume the reward is in $[0, 1]$.
Greedy algo: try each arm once and then keep pulling the arm with the highest observed reward. This algo can incur constant per-step regret: with constant probability a single noisy sample makes us commit to a suboptimal arm, so the total regret can be $\Omega(T)$.
Explore and Commit algo (a code sketch is given right after this description):
For each arm $a = 1, \dots, K$:
Pull arm $a$ for $N$ times.
Calculate arm $a$'s empirical mean $\hat\mu(a) = \frac{1}{N}\sum_{i=1}^{N} r_{a,i}$.
For the remaining $T - NK$ rounds:
Pull the best empirical arm, $\hat a = \arg\max_a \hat\mu(a)$.
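Here is a minimal NumPy sketch of the procedure; the Bernoulli environment, the horizon, and the per-arm budget `N` are my own illustrative choices, not from the book.

```python
import numpy as np

def explore_and_commit(means, T, N, rng=np.random.default_rng(0)):
    """Explore-and-Commit for a K-armed Bernoulli bandit (illustrative sketch).

    means : true mean reward of each arm (unknown to the learner)
    T     : total number of rounds
    N     : number of pulls per arm in the exploration phase
    """
    K = len(means)
    pull = lambda a: rng.binomial(1, means[a])  # one stochastic reward in {0, 1}

    # Exploration phase: pull every arm N times and record empirical means.
    mu_hat = np.zeros(K)
    total_reward = 0.0
    for a in range(K):
        rewards = [pull(a) for _ in range(N)]
        mu_hat[a] = np.mean(rewards)
        total_reward += sum(rewards)

    # Commit phase: play the best empirical arm for the remaining rounds.
    best = int(np.argmax(mu_hat))
    for _ in range(T - N * K):
        total_reward += pull(best)

    # Realized regret relative to always playing the best arm in expectation.
    return T * max(means) - total_reward

# Example run with hypothetical arm means.
print(explore_and_commit(means=[0.3, 0.5, 0.7], T=10_000, N=200))
```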
Some light analysis: Hoeffding's inequality (not very sharp, but good for a sum of bounded r.v.s).
Given a distribution $\nu$ supported on $[0,1]$ with mean $\mu$, and $N$ i.i.d. samples $x_1,\dots,x_N \sim \nu$, with probability at least $1-\delta$ we have $\big|\frac{1}{N}\sum_{i=1}^{N} x_i - \mu\big| \le \sqrt{\frac{\log(2/\delta)}{2N}}$.
Remarks: A sharper analysis is possible, yet Hoeffding is simple and straightforward in giving a confidence interval
After the Exploration phase, with probability at least $1-\delta$, for a fixed arm $a$ we have $|\hat\mu(a) - \mu(a)| \le \sqrt{\frac{\log(2/\delta)}{2N}}$.
Remarks: By the union bound, with probability at least $1-\delta$ this holds simultaneously for all $K$ arms if we replace $\delta$ by $\delta/K$, i.e., for all $a$: $|\hat\mu(a) - \mu(a)| \le \sqrt{\frac{\log(2K/\delta)}{2N}}$.
Thus we have the decomposition
$$ R_T \;=\; \sum_{t \in \text{exploration}} \big(\mu^\star - \mu(a_t)\big) \;+\; (T - NK)\big(\mu^\star - \mu(\hat a)\big) \;\le\; NK + T\big(\mu^\star - \mu(\hat a)\big), $$
where the first part is the regret of the exploration phase, which is at most $NK$ since rewards (and hence the gaps) lie in $[0,1]$.
Thus we only need to bound the second part:
$$ \mu^\star - \mu(\hat a) \;=\; \big(\mu(a^\star) - \hat\mu(a^\star)\big) + \big(\hat\mu(a^\star) - \hat\mu(\hat a)\big) + \big(\hat\mu(\hat a) - \mu(\hat a)\big) \;\le\; 2\sqrt{\frac{\log(2K/\delta)}{2N}}, $$
where the inequality is due to $\hat\mu(\hat a) \ge \hat\mu(a^\star)$ (as $\hat a$ is the empirical maximizer) together with the uniform confidence interval above.
Now we have
$$ R_T \;\le\; NK + 2T\sqrt{\frac{\log(2K/\delta)}{2N}}, $$
where the first term increases with $N$ while the second decreases with $N$, so we should pick $N$ to balance the two.
When $N \asymp (T/K)^{2/3}\log^{1/3}(2K/\delta)$, the regret is $O\big(T^{2/3} K^{1/3} \log^{1/3}(2K/\delta)\big)$.
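To record where this choice of $N$ comes from (my own computation, keeping the same generic constants): equating the exploration and commit terms,
$$ NK \;\approx\; 2T\sqrt{\frac{\log(2K/\delta)}{2N}} \;\;\Longrightarrow\;\; N \;\asymp\; \Big(\frac{T}{K}\Big)^{2/3}\log^{1/3}(2K/\delta), $$
and plugging this back in gives
$$ R_T \;\lesssim\; NK \;\asymp\; T^{2/3} K^{1/3} \log^{1/3}(2K/\delta). $$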
This is not good enough, so can we get a $\sqrt{T}$-type regret bound instead?
We will give the algo first and then dive into the analysis and why the name UCB (Upper Confidence Bound) is fitting.
Play all $K$ arms once and denote the received reward of arm $a$ as $r_a$.
for $t = K, K+1, \dots, T-1$:
Execute arm $a_t = \arg\max_a \Big( \hat\mu_t(a) + \sqrt{\tfrac{\log(TK/\delta)}{N_t(a)}} \Big)$ (I drop the exact constants inside the square root).
Observe reward $r_t$.
where $N_t(a) = \sum_{i < t} \mathbf{1}\{a_i = a\}$ and $\hat\mu_t(a) = \frac{1}{N_t(a)} \sum_{i < t} \mathbf{1}\{a_i = a\}\, r_i$,
which denote the counts and empirical mean of each arm.
Just like the Explore and Commit algo, we also construct the confidence interval as follows:
Lemma 6.2 (Upper Confidence Bound). For all $t \in \{1,\dots,T\}$ and $a \in \{1,2,\dots,K\}$, we have that with probability at least $1-\delta$: $|\hat\mu_t(a) - \mu(a)| \le O\Big(\sqrt{\tfrac{\log(TK/\delta)}{N_t(a)}}\Big)$.
Proof at P64 of the book.
Remarks: Here we have to use the martingale version of Hoeffding's inequality, i.e., the Azuma-Hoeffding inequality (here I use the version from Wikipedia).
Notice that the number of pulls $N_t(a)$ and the arm choices depend on past rewards, so the plain i.i.d. Hoeffding bound does not directly apply.
Directly checking appendix A.5 is more straightforward though.
Remarks: Now it should be clear why this is called UCB, since it chooses the arm with the highest upper confidence bound every iteration.
What is the intuition behind UCB?
Remarks: When the UCB value of an arm is large, there are two cases. Case 1: the uncertainty is high, so the arm needs exploration. Case 2: the uncertainty is low, so it is simply a good arm. Either way, pulling it is reasonable.
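A minimal NumPy sketch of the rule above; the Bernoulli environment and the generic bonus $\sqrt{\log(TK/\delta)/N_t(a)}$ are my own illustrative choices (the book's exact constants differ).

```python
import numpy as np

def ucb(means, T, delta=0.05, rng=np.random.default_rng(0)):
    """UCB for a K-armed Bernoulli bandit (illustrative sketch)."""
    K = len(means)
    pull = lambda a: rng.binomial(1, means[a])

    counts = np.zeros(K)   # N_t(a): number of pulls of each arm
    sums = np.zeros(K)     # running sum of rewards of each arm
    total_reward = 0.0

    # Initialization: play every arm once.
    for a in range(K):
        r = pull(a)
        counts[a], sums[a] = 1, r
        total_reward += r

    # Main loop: pull the arm with the highest upper confidence bound.
    for t in range(K, T):
        mu_hat = sums / counts
        bonus = np.sqrt(np.log(T * K / delta) / counts)
        a = int(np.argmax(mu_hat + bonus))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total_reward += r

    # Realized regret relative to always playing the best arm in expectation.
    return T * max(means) - total_reward

print(ucb(means=[0.3, 0.5, 0.7], T=10_000))
```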
Assumptions for larger action spaces:
Let $\mathcal{D} \subseteq \mathbb{R}^d$ be a (possibly very large or infinite) compact set of decisions, and let $x_t \in \mathcal{D}$ denote the decision played at round $t$.
Now we assume linearity: there is an unknown parameter $\mu^\star \in \mathbb{R}^d$ such that $\mathbb{E}[r_t \mid x_t = x] = \langle \mu^\star, x \rangle$ for every $x \in \mathcal{D}$,
and define the noise sequence as $\eta_t = r_t - \langle \mu^\star, x_t \rangle$,
which is a martingale difference sequence.
Remarks: The proof is simple due to the tower rule: $\mathbb{E}[\eta_t \mid \mathcal{H}_{t-1}, x_t] = \mathbb{E}[r_t \mid x_t] - \langle \mu^\star, x_t \rangle = 0$, where $\mathcal{H}_{t-1}$ denotes the history up to round $t-1$.
If
where
Remarks: Note that
Remarks: It might be hard to directly interpret this model as a linear model, though there is some linearity in the conditional expectation. Personally, the formulation in 1.2 seems more natural to me, as it is essentially linear regression in some sense.
LinUCB is based on “optimism in the face of uncertainty”. At episode $t$, we maintain an uncertainty region $\mathrm{BALL}_t$ that contains $\mu^\star$ with high probability, and act optimistically: $x_t = \arg\max_{x \in \mathcal{D}} \max_{\mu \in \mathrm{BALL}_t} \langle \mu, x \rangle$,
where $\mathrm{BALL}_t = \{\mu : (\mu - \hat\mu_t)^\top \Sigma_t (\mu - \hat\mu_t) \le \beta_t\}$.
The shape of the region is an ellipsoid centered at $\hat\mu_t$ and determined by $\Sigma_t$,
where $\Sigma_t = \lambda I + \sum_{i=0}^{t-1} x_i x_i^\top$ and $\hat\mu_t = \Sigma_t^{-1} \sum_{i=0}^{t-1} r_i x_i$ is the regularized least squares (ridge regression) estimate.
Now we present the Linear UCB algo:
Input: regularization parameter $\lambda$ and confidence radii $\beta_t$.
for $t = 0, 1, \dots, T-1$:
Execute $x_t = \arg\max_{x \in \mathcal{D}} \max_{\mu \in \mathrm{BALL}_t} \langle \mu, x \rangle$ and observe the reward $r_t$.
Update $\Sigma_{t+1}$, $\hat\mu_{t+1}$, and hence $\mathrm{BALL}_{t+1}$.
Remarks: Though the book first gives the bound results, I suppose it is more natural to first see why the ball is an uncertainty region, so I moved this part here.
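To see the algorithm end to end, here is a minimal NumPy sketch for a finite action set, using the closed form $\max_{\mu \in \mathrm{BALL}_t} \langle \mu, x \rangle = \langle \hat\mu_t, x \rangle + \sqrt{\beta_t\, x^\top \Sigma_t^{-1} x}$ for the inner maximization. The environment, the fixed $\beta$, $\lambda = 1$, and the Gaussian noise are my own illustrative choices, not the book's prescriptions.

```python
import numpy as np

def linucb(actions, mu_star, T, beta=2.0, lam=1.0, noise_std=0.1,
           rng=np.random.default_rng(0)):
    """LinUCB over a finite action set (illustrative sketch).

    actions : (n, d) array of available decisions x in D
    mu_star : (d,) true parameter (unknown to the learner)
    """
    d = actions.shape[1]
    Sigma = lam * np.eye(d)   # Sigma_t = lam * I + sum_i x_i x_i^T
    b = np.zeros(d)           # sum_i r_i x_i
    total_reward = 0.0

    for t in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b    # ridge regression estimate
        # Optimistic value of each action: <mu_hat, x> + sqrt(beta * x^T Sigma^{-1} x)
        ucb_values = actions @ mu_hat + np.sqrt(
            beta * np.einsum("nd,dk,nk->n", actions, Sigma_inv, actions))
        x = actions[int(np.argmax(ucb_values))]

        r = float(x @ mu_star) + noise_std * rng.standard_normal()
        Sigma += np.outer(x, x)
        b += r * x
        total_reward += x @ mu_star   # track expected reward for (pseudo-)regret

    best = np.max(actions @ mu_star)
    return T * best - total_reward

# Example: d = 5, 50 random unit-norm actions (hypothetical setup).
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5)); A /= np.linalg.norm(A, axis=1, keepdims=True)
theta = rng.standard_normal(5); theta /= np.linalg.norm(theta)
print(linucb(A, theta, T=2_000))
```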
Theoretical results: valid confidence ball
Proposition 6.6 (Confidence). Let $\delta \in (0,1)$. With an appropriate choice of $\beta_t$ (growing only logarithmically in $t$), we have that with probability at least $1-\delta$, $\mu^\star \in \mathrm{BALL}_t$ for all $t$.
Proof: see the next section.
Theoretical results: upper and lower bounds
Theorem 6.3. Suppose that the expected costs are bounded, in magnitude, by $1$ (i.e. $|\langle \mu^\star, x \rangle| \le 1$ for all $x \in \mathcal{D}$), and that $\lambda$ and $\beta_t$ are set appropriately.
We have that with probability greater than $1-\delta$, for all $T$, the regret of LinUCB satisfies $R_T \le \widetilde{O}\big(d\sqrt{T}\big)$,
where $\widetilde{O}(\cdot)$ hides factors that are logarithmic in $T$, $1/\delta$, and the problem parameters.
Proof at P69 of the book.
Remarks: Note that there is no dependence on the number of actions $|\mathcal{D}|$; the regret only scales with the dimension $d$, which is what lets us handle large (even infinite) action spaces.
Theorem 6.4 (Lower bound). There exists a distribution over linear bandit problems (i.e. a distribution over $\mu^\star$) such that, for any algorithm, $\mathbb{E}_{\mu^\star}\big[\mathbb{E}[R_T]\big] \ge \Omega\big(d\sqrt{T}\big)$,
where the inner expectation is with respect to randomness in the problem and the algorithm.
Proof not provided.
Remarks: This shows that LinUCB is minimax optimal up to logarithmic factors.
This section is to understand the growth of the matrix $\Sigma_t$ (and hence how quickly the confidence ellipsoid shrinks), which is what drives the regret bound.
Proposition 6.7 (Sum of Squares Regret Bound). Suppose that $\mu^\star \in \mathrm{BALL}_t$ for all $t$ and that $\|x\| \le B$ for all $x \in \mathcal{D}$. Then the sum of squared instantaneous regrets satisfies $\sum_{t=0}^{T-1} \mathrm{regret}_t^2 \le O\big(\beta_T\, d \log\big(1 + \tfrac{TB^2}{d\lambda}\big)\big)$.
Proof in this section is somewhat tedious, so I leave that to the book and only write remarks here.
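Still, for my own reference, here is a rough sketch of the two standard ingredients as I understand them (constants dropped; $(\tilde\mu_t, x_t)$ denotes the optimistic pair chosen at episode $t$). Assuming $\mu^\star \in \mathrm{BALL}_t$, optimism plus Cauchy-Schwarz in the $\Sigma_t$-norm bounds each instantaneous regret by the width of the ball in the direction of $x_t$:
$$ \mathrm{regret}_t \;=\; \langle \mu^\star, x^\star \rangle - \langle \mu^\star, x_t \rangle \;\le\; \langle \tilde\mu_t - \mu^\star, x_t \rangle \;\le\; \|\tilde\mu_t - \mu^\star\|_{\Sigma_t}\, \|x_t\|_{\Sigma_t^{-1}} \;\le\; 2\sqrt{\beta_t \, x_t^\top \Sigma_t^{-1} x_t}, $$
using that both $\tilde\mu_t$ and $\mu^\star$ lie in $\mathrm{BALL}_t$, whose $\Sigma_t$-radius is $\sqrt{\beta_t}$. The elliptical potential (log-det) argument then controls how often these widths can be large:
$$ \sum_{t=0}^{T-1} \min\big(1,\; x_t^\top \Sigma_t^{-1} x_t\big) \;\le\; 2\log\frac{\det \Sigma_T}{\det \Sigma_0} \;\le\; 2 d \log\Big(1 + \frac{TB^2}{d\lambda}\Big). $$
Combining the two with Cauchy-Schwarz over $t$ yields the $\sqrt{T}$-type regret bound.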
Remarks: Recap for ellipsoid/ball
if $A \succ 0$ is positive definite with eigendecomposition $A = \sum_i \lambda_i q_i q_i^\top$ (eigenvalues $\lambda_i > 0$, orthonormal eigenvectors $q_i$), then the set $E = \{x : x^\top A x \le 1\}$
is an ellipsoid in $\mathbb{R}^d$ centered at the origin, whose
semi-axes are given by $s_i = \lambda_i^{-1/2} q_i$, i.e.,
the eigenvectors determine the directions of the semi-axes,
the eigenvalues determine the lengths of the semi-axes.
Define the width of $E$ in the direction $v$ (a unit vector) as its extent along $v$, which equals $2\sqrt{v^\top A^{-1} v}$.
Thus the minimum width and maximum width are given by $2\lambda_{\max}^{-1/2}$ and $2\lambda_{\min}^{-1/2}$, respectively.
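A tiny NumPy sanity check of this recap (the matrix $A$ below is an arbitrary example of mine): the tips of the semi-axes $\lambda_i^{-1/2} q_i$ lie exactly on the boundary $x^\top A x = 1$.

```python
import numpy as np

# An arbitrary 2x2 positive definite matrix A (hypothetical example).
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(A)   # A = sum_i lambda_i q_i q_i^T
semi_axis_lengths = eigvals ** -0.5    # lengths lambda_i^{-1/2}

for lam, q in zip(eigvals, eigvecs.T):
    x = (lam ** -0.5) * q              # tip of the corresponding semi-axis
    print(lam, q, x @ A @ x)           # x^T A x = 1: it lies on the boundary
```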
Remarks: Now we can see why $\mathrm{BALL}_t$ is an ellipsoid centered at $\hat\mu_t$: its width in a direction $x$ is $2\sqrt{\beta_t\, x^\top \Sigma_t^{-1} x}$, so directions that have been played often (large eigenvalues of $\Sigma_t$) have small remaining uncertainty.
Remarks: Typo at P69, should be maximizes
Remarks: Though not indicated, we only need to substitute
However, the exact computational method for LinUCB is not clear yet: the arg-max step requires maximizing over both $\mathcal{D}$ and $\mathrm{BALL}_t$, and in some cases this optimization can be NP-hard.