Overview

This is my note for Chapter 7 of Reinforcement Learning Theory.

Strategic Exploration in Tabular MDPs

Assumptions in this chapter: finite (tabular) state and action spaces $S$, $A$; episodic, finite-horizon setting with horizon $H$ and a fixed initial state $s_0$; the reward $r(s,a) \in [0,1]$ is known, while the transition $P$ is unknown.

Learning Protocol

  1. Learner initializes a policy $\pi^0$.

  2. At episode $n$, the learner executes $\pi^n$ to collect a trajectory $\{s_h^n, a_h^n, r_h^n\}_{h=0}^{H-1}$, with $a_h^n = \pi^n(s_h^n)$, $r_h^n = r(s_h^n, a_h^n)$, $s_{h+1}^n \sim P(\cdot \mid s_h^n, a_h^n)$.

  3. Learner updates the policy to $\pi^{n+1}$.

The performance measure for K episodes:

$$\mathrm{Regret} := \mathbb{E}\left[K V^\star(s_0) - \sum_{k=0}^{K-1}\sum_{h=0}^{H-1} r(s_h^k, a_h^k)\right] = \mathbb{E}\left[\sum_{k=0}^{K-1}\left(V^\star(s_0) - V^{\pi^k}(s_0)\right)\right].$$

If we convert this problem to a MAB and run UCB, what regret do we get?

Unique policies: $(|A|^{|S|})^{H}$ -> all treated as arms.

Therefore UCB gives a regret of $O\!\left(\sqrt{|A|^{|S|H}\, K}\right)$, which is really bad (exponential in $|S|H$).
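For a sense of scale (a quick numerical example of my own): even a small MDP with $|S| = 10$, $|A| = 2$, $H = 10$ has $(|A|^{|S|})^{H} = 2^{100} \approx 1.3 \times 10^{30}$ deterministic nonstationary policies, so a $\sqrt{\#\text{arms}\cdot K}$ regret is vacuous for any realistic $K$.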

WHY? Policies should not be treated as independent arms!

UCB-VI

 

Let us consider the very beginning of episode k:

Part 1: Model estimation

$$N_h^k(s,a,s') = \sum_{i=0}^{k-1} \mathbb{1}\left\{(s_h^i, a_h^i, s_{h+1}^i) = (s,a,s')\right\}, \qquad N_h^k(s,a) = \sum_{i=0}^{k-1} \mathbb{1}\left\{(s_h^i, a_h^i) = (s,a)\right\}, \qquad \forall h, s, a, s'.$$

And we have the empirical estimate of the transition model:

$$\widehat{P}_h^k(s' \mid s, a) = \frac{N_h^k(s,a,s')}{N_h^k(s,a)}, \qquad \forall h, s, a, s'.$$
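A minimal sketch of Part 1 in code. The function name `estimate_model` and the trajectory storage format are my own hypothetical choices, not from the book; states and actions are assumed to be integer indices.

```python
import numpy as np

def estimate_model(trajectories, S, A, H):
    """Empirical transition model P_hat_h^k(s'|s,a) and counts N_h^k(s,a).

    `trajectories`: a list of episodes, each a list of (s_h, a_h, s_{h+1})
    index triples for h = 0, ..., H-1 (hypothetical storage format).
    """
    N_sas = np.zeros((H, S, A, S))            # N_h^k(s, a, s')
    for episode in trajectories:
        for h, (s, a, s_next) in enumerate(episode):
            N_sas[h, s, a, s_next] += 1
    N_sa = N_sas.sum(axis=-1)                 # N_h^k(s, a)
    # Unvisited (h, s, a) pairs are left uniform (a convention, not the book's).
    P_hat = np.full((H, S, A, S), 1.0 / S)
    visited = N_sa > 0                        # boolean mask over (h, s, a)
    P_hat[visited] = N_sas[visited] / N_sa[visited][:, None]
    return P_hat, N_sa
```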

Part 2: Reward Bonus Design and Value Iteration

We use a bonus to encourage exploration:

$$b_h^k(s,a) = c\, H \sqrt{\frac{\ln(SAHK/\delta)}{N_h^k(s,a)}}$$

Now we perform VI backwards, starting from $h = H$:

$$\widehat{V}_H^k(s) = 0, \ \forall s, \qquad \widehat{Q}_h^k(s,a) = \min\left\{ r_h(s,a) + b_h^k(s,a) + \widehat{P}_h^k(\cdot \mid s,a) \cdot \widehat{V}_{h+1}^k,\ H \right\},$$
$$\widehat{V}_h^k(s) = \max_a \widehat{Q}_h^k(s,a), \qquad \pi_h^k(s) = \arg\max_a \widehat{Q}_h^k(s,a), \qquad \forall h, s, a.$$

Remarks: The truncation at $H$ for $\widehat{Q}_h^k(s,a)$ is because the reward is bounded, so no policy's Q value can be larger than $H$ (hence $\widehat{V}_h^k \le H$ for all $h, k$).
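A minimal sketch of Part 2, continuing the hypothetical `estimate_model` output above. The value of the constant `c` and the handling of unvisited pairs are my own choices, not from the book.

```python
import numpy as np

def ucbvi_plan(P_hat, N_sa, r, K, delta, c=2.0):
    """Optimistic value iteration with reward bonus (a sketch, not the book's code).

    P_hat: (H, S, A, S) empirical transitions; N_sa: (H, S, A) visit counts;
    r: (H, S, A) known rewards in [0, 1]. Returns greedy policy and V_hat.
    """
    H, S, A, _ = P_hat.shape
    log_term = np.log(S * A * H * K / delta)
    V_hat = np.zeros((H + 1, S))              # V_hat_H^k(s) = 0
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Bonus b_h^k(s, a); unvisited pairs get the maximal bonus (an assumption).
        bonus = c * H * np.sqrt(log_term / np.maximum(N_sa[h], 1))
        Q_hat = r[h] + bonus + P_hat[h] @ V_hat[h + 1]   # shape (S, A)
        Q_hat = np.minimum(Q_hat, H)                     # truncate at H
        V_hat[h] = Q_hat.max(axis=1)
        pi[h] = Q_hat.argmax(axis=1)
    return pi, V_hat
```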

Analysis

Theorem 7.1 (Regret Bound of UCBVI). UCBVI achieves the following regret bound

$$\mathrm{Regret} := \mathbb{E}\left[\sum_{k=0}^{K-1}\left(V^\star(s_0) - V^{\pi^k}(s_0)\right)\right] \le 10\, H^2 S \sqrt{A K \ln(SAH^2K^2)} = \widetilde{O}\!\left(H^2 S \sqrt{AK}\right)$$

Remarks: A sharper bound of $\widetilde{O}(H^2\sqrt{SAK})$ can be obtained (see the improved bound below). UCBVI is still suboptimal w.r.t. the $H^2$ dependence.

Remarks: Note that this is a statement in expectation, but a high-probability version should not be hard to obtain with a martingale argument.

Sketch of proof:

Step 1:

Lemma 7.2 (State-action wise model error). Fix $\delta \in (0,1)$. With probability at least $1-\delta$, for all $k \in \{0,\dots,K-1\}$, $s \in S$, $a \in A$, $h \in \{0,\dots,H-1\}$, and for any $f: S \to [0,H]$:

$$\left|\left(\widehat{P}_h^k(\cdot \mid s,a) - P_h(\cdot \mid s,a)\right) \cdot f\right| \le 8H\sqrt{\frac{S \ln(SAHK/\delta)}{N_h^k(s,a)}}.$$

Proof: Azuma-Hoeffding + Union Bound

Remarks: Here the union bound includes $f$, so we need to utilize an $\epsilon$-net argument. Refer to https://arxiv.org/pdf/1011.3027.pdf Lemma 5.2 for why we have $|N_\epsilon| \le (1 + 2H\sqrt{|S|}/\epsilon)^{|S|}$ for $f: S \to [0,H]$. (Note that $\|f\|_2 \le H\sqrt{|S|}$, so we essentially need an $\epsilon$-net for an $\ell_2$ ball with radius $H\sqrt{|S|}$.)

Remarks: After the union bound, we only need to bound the deviation at the points in the $\epsilon$-net.
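Roughly, the bookkeeping behind the extra $\sqrt{S}$ (my own back-of-the-envelope, constants ignored): the net size satisfies

$$\ln |N_\epsilon| \le |S| \ln\!\left(1 + \frac{2H\sqrt{|S|}}{\epsilon}\right),$$

so after the union bound over all $(k, h, s, a)$ and over the net, the Hoeffding log factor $\ln(SAHK/\delta)$ becomes roughly $S\ln(SAHK/(\epsilon\delta))$, which is where the $S$ inside the square root in Lemma 7.2 comes from. The discretization error is at most $\|\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\|_2 \cdot \|f - f_\epsilon\|_2 \le 2\epsilon$, negligible for small $\epsilon$.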


Lemma 7.3 (State-action wise average model error under $V^\star$). Fix $\delta \in (0,1)$. With probability at least $1-\delta$, for all $k \in \{0,\dots,K-1\}$, $s \in S$, $a \in A$, $h \in \{0,\dots,H-1\}$, and considering the (data-independent) optimal value function $V^\star_{h+1}: S \to [0,H]$, we have:

$$\left|\widehat{P}_h^k(\cdot \mid s,a) \cdot V^\star_{h+1} - P_h(\cdot \mid s,a) \cdot V^\star_{h+1}\right| \le 2H\sqrt{\frac{\ln(SAHK/\delta)}{N_h^k(s,a)}}$$

Proof: This is simply Lemma 7.2 without the union bound over $f$ (since $V^\star_{h+1}$ is a single fixed function), which is why there is no extra $\sqrt{S}$ factor.


Step 2:

Denote by $\mathcal{E}_{\text{model}}$ the event that the bounds in Lemmas 7.2 and 7.3 above hold.

Lemma 7.4 (Optimism). Assume $\mathcal{E}_{\text{model}}$ is true. For all episodes $k$, we have:

$$\widehat{V}_0^k(s_0) \ge V^\star_0(s_0), \qquad \forall s_0 \in S;$$

where $\widehat{V}_h^k$ is computed by the value iteration with bonus described above.


Proof: We use induction over $h$, exploiting the fact that VI is an iterative algorithm. Proof at p. 76 of the book.
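A rough sketch of the induction step (my own summary, ignoring the case where the truncation at $H$ is active, which is harmless since $Q^\star_h \le H$): the base case is $\widehat{V}_H^k = V^\star_H = 0$; assuming $\widehat{V}_{h+1}^k \ge V^\star_{h+1}$ pointwise,

$$\widehat{Q}_h^k(s,a) - Q^\star_h(s,a) = b_h^k(s,a) + \widehat{P}_h^k(\cdot \mid s,a)\cdot \widehat{V}_{h+1}^k - P_h(\cdot \mid s,a)\cdot V^\star_{h+1} \ge b_h^k(s,a) + \left(\widehat{P}_h^k(\cdot \mid s,a) - P_h(\cdot \mid s,a)\right)\cdot V^\star_{h+1} \ge 0,$$

where the last step uses Lemma 7.3 under $\mathcal{E}_{\text{model}}$ and a large enough constant $c$ in the bonus; taking $\max_a$ gives $\widehat{V}_h^k \ge V^\star_h$.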


Lemma 7.5. Consider an arbitrary sequence of $K$ trajectories $\tau^k = \{s_h^k, a_h^k\}_{h=0}^{H-1}$ for $k = 0, \dots, K-1$. We have

$$\sum_{k=0}^{K-1}\sum_{h=0}^{H-1} \frac{1}{\sqrt{N_h^k(s_h^k, a_h^k)}} \le 2H\sqrt{SAK}.$$

Proof at p. 77 of the book.
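A rough sketch (my own summary, ignoring the $N_h^k = 0$ corner case): group the sum by $(h, s, a)$, use $\sum_{i=1}^{N} 1/\sqrt{i} \le 2\sqrt{N}$, then Cauchy–Schwarz with $\sum_{s,a} N_h^K(s,a) = K$ for every $h$:

$$\sum_{k=0}^{K-1}\sum_{h=0}^{H-1} \frac{1}{\sqrt{N_h^k(s_h^k, a_h^k)}} = \sum_{h=0}^{H-1}\sum_{s,a}\sum_{i=1}^{N_h^K(s,a)} \frac{1}{\sqrt{i}} \le 2\sum_{h=0}^{H-1}\sum_{s,a}\sqrt{N_h^K(s,a)} \le 2\sum_{h=0}^{H-1}\sqrt{SA\,K} = 2H\sqrt{SAK}.$$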


Now we can prove the main theorem.


Proof at p. 77 of the book.

Remarks: The simulation lemma is covered in lecture (basically induction). Also check this note for the simulation lemma.

Remarks: Note that $\widehat{V}_{h+1}^k$ is data-dependent, so we cannot apply the fixed-function bound of Lemma 7.3 to it. Instead, we need to use Hölder (paying the extra $\sqrt{S}$, as in Lemma 7.2).
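Concretely, the Hölder step is (a sketch):

$$\left|\left(\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot \widehat{V}_{h+1}^k\right| \le \left\|\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right\|_1 \left\|\widehat{V}_{h+1}^k\right\|_\infty \le H\left\|\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right\|_1,$$

and the $\ell_1$ deviation is then bounded uniformly over $(k, h, s, a)$, as in (the proof of) Lemma 7.2, which is where the extra $\sqrt{S}$ enters the final bound.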


An Improved Regret Bound

In this section, we show a regret bound of $\widetilde{O}\!\left(H^2\sqrt{|S||A|K} + H^3|S|^2|A|\right)$. The core of the analysis is to use a sharper concentration inequality (Bernstein) than Hoeffding. Check Appendix A.7.
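For reference, a sketch of the form only (not the book's exact statement or constants): the Bernstein-style analogue of Lemma 7.3 replaces the range $H$ by a variance term,

$$\left|\left(\widehat{P}_h^k(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot V^\star_{h+1}\right| \lesssim \sqrt{\frac{\mathrm{Var}_{s'\sim P_h(\cdot\mid s,a)}\!\left(V^\star_{h+1}(s')\right)\ln(SAHK/\delta)}{N_h^k(s,a)}} + \frac{H\ln(SAHK/\delta)}{N_h^k(s,a)};$$

see Appendix A.7 for how these bounds are combined to obtain the improved regret.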