Overview

These are my notes for Chapter 8 of Reinforcement Learning Theory.

Linearly Parameterized MDPs

The whole idea is similar to Chapter 6, where we extended from MABs to linear bandits to deal with a huge action space. Here we seek an analogous extension of the tabular (discrete) MDPs from Chapter 7.

Setting

The setting is similar to Chapter 7, i.e., finite horizon + episodic setting + minimizing regret:

$$\text{Regret} := \mathbb{E}\left[\sum_{k=0}^{K-1}\left(V^* - V^{\pi_k}\right)\right]$$

Low-Rank MDPs and Linear MDPs

Here $S$ and $A$ can be infinite. Without any further structural assumption, the lower bounds we saw in the Generalization lecture rule out a polynomial regret bound.

Remarks: Recall that in Chapter 7 we obtained an $\tilde{O}(H^2\sqrt{SAK})$ regret bound for UCBVI. If $S$ and $A$ are infinite or continuous, this is obviously quite bad. Also recall that in Chapter 6 we obtained an $\tilde{O}(d\sqrt{T})$ regret bound for LinUCB (here $d$ is the dimension and $T$ the number of iterations). We see that the main motivation is to impose some structural assumption under which we can still get a polynomial regret bound.

The structural assumption we make in this note is a linear structure in both the reward and the transition.

Definition 8.1 (Linear MDPs). Consider transitions $\{P_h\}$ and rewards $\{r_h\}$. A linear MDP has the following structure on $\{r_h\}$ and $\{P_h\}$:

$$r_h(s,a) = \theta_h^\top\phi(s,a), \qquad P_h(\cdot\mid s,a) = \mu_h\,\phi(s,a), \quad \forall h,$$

where $\phi$ is a known state-action feature map $\phi: S\times A \to \mathbb{R}^d$, and $\mu_h \in \mathbb{R}^{|S|\times d}$. Here $\phi$ and $\theta_h$ are known to the learner, while $\mu_h$ is unknown. We further assume the following norm bounds on the parameters: (1) $\sup_{s,a}\|\phi(s,a)\|_2 \le 1$; (2) $\|v^\top\mu_h\|_2 \le \sqrt{d}$ for any $v$ such that $\|v\|_\infty \le 1$ and all $h$; and (3) $\|\theta_h\|_2 \le W$ for all $h$. We also assume $r_h(s,a) \in [0,1]$ for all $h$ and $(s,a)$.


The point of this assumption is that we can now hope for bounds that scale with $\mathrm{poly}(d)$ instead of $|S|$ or $|A|$.

Remarks: The connection to tabular MDPs: if $\phi$ is a one-hot mapping over state-action pairs (so $d = |S||A|$), then $\mu_h\phi(s,a) = (\mu_h)_i$, where $(\mu_h)_i$ denotes the $i$-th column of $\mu_h$ (the column indexed by $(s,a)$), and the known reward is $r_h(s,a) = [\theta_h]_i$. This recovers tabular MDPs.
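
To make the reduction concrete, here is a minimal numerical sketch (the array names, sizes, and random distributions are my own illustration, not from the text): with a one-hot feature map, $\mu_h\phi(s,a)$ simply selects the column of $\mu_h$ indexed by $(s,a)$, which plays the role of the tabular transition vector $P_h(\cdot\mid s,a)$.

```python
import numpy as np

# Hypothetical sizes for illustration only.
S, A = 3, 2
d = S * A
rng = np.random.default_rng(0)

# Each column of mu_h is a distribution over next states, i.e. P_h(.|s,a) for one (s,a).
mu_h = rng.dirichlet(np.ones(S), size=d).T      # shape (|S|, d), columns sum to 1

def phi_onehot(s, a):
    """One-hot feature over (s, a) pairs."""
    e = np.zeros(d)
    e[s * A + a] = 1.0
    return e

s, a = 1, 0
# mu_h @ phi(s,a) picks out the column indexed by (s,a): the tabular P_h(.|s,a).
assert np.allclose(mu_h @ phi_onehot(s, a), mu_h[:, s * A + a])
```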

An example is latent variable models, see slides.
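
A minimal sketch of that example (all arrays and sizes here are made up for illustration): take $\phi(s,a)$ to be a distribution over $d$ latent states and let the columns of $\mu_h$ be next-state distributions conditioned on each latent state; then $\mu_h\phi(s,a)$ is automatically a valid distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 4, 3                                    # illustrative sizes
mu_h = rng.dirichlet(np.ones(S), size=d).T     # column z is P(s' | latent z), shape (|S|, d)
phi_sa = rng.dirichlet(np.ones(d))             # phi(s,a): a distribution over latent states

P_sa = mu_h @ phi_sa                           # P_h(.|s,a) = sum_z phi(s,a)[z] * P(s'|z)
assert np.all(P_sa >= 0) and np.isclose(P_sa.sum(), 1.0)
```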

Remarks: Linear MDP = low-rank MDP + known feature map $\phi$.

Planning in Linear MDPs

An important observation in linear MDPs is the linearity of everything, i.e.,

$$Q^*_h(s,a) = \phi(s,a)\cdot\left(\theta_h + \mu_h^\top V^*_{h+1}\right) =: \phi(s,a)\cdot w_h$$

where $\cdot$ denotes the inner product. Thus we can define $\pi^*_h(s) = \arg\max_a Q^*_h(s,a)$ and $V^*_h(s) = \max_a Q^*_h(s,a)$. More generally, we have

Claim 8.2. Consider any arbitrary function $f: S \to [0,H]$. At any time step $h \in [0, H-1]$ there must exist a $w \in \mathbb{R}^d$ such that, for all $(s,a) \in S\times A$:

$$(\mathcal{T}f)(s,a) = r_h(s,a) + P_h(\cdot\mid s,a)\cdot f = w^\top\phi(s,a),$$

where $\mathcal{T}$ is the Bellman operator.


Proof: By definition, $(\mathcal{T}f)(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[f(s')] = \theta_h^\top\phi(s,a) + f^\top\mu_h\phi(s,a) = \left(\theta_h + \mu_h^\top f\right)^\top\phi(s,a)$, so we can take $w = \theta_h + \mu_h^\top f$.


UCBVI?

  1. Learn a transition model $\{\widehat{P}_h^n\}_{h=0}^{H-1}$ from all previous data $\{s_h^i, a_h^i, s_{h+1}^i\}_{i=0}^{n-1}$

  2. Design a reward bonus $b_h^n(s,a)$ for all $s,a$

  3. Plan: $\pi^{n+1} = \text{Value-Iter}\left(\{\widehat{P}_h^n\}_h, \{r_h + b_h^n\}_h\right)$

  4. Execute $\pi^{n+1}$ for $H$ steps

Remarks: We can see that the general framework is identical to tabular UCBVI, so the question becomes how to design the bonus and how to learn the transition kernel.

Learning Transition using Ridge Linear Regression

Remarks: Recall that in tabular MDPs we estimate the transition empirically from visitation counts. In this setting that is infeasible, since $|S|$ and $|A|$ can be huge or continuous, so we need another approach.

In this section, we consider the following simple question: given a dataset of state-action-next state tuples, how can we learn the transition Ph for all h?

We consider a particular episode $n$. Similar to tabular UCBVI, we learn a model at the very beginning of episode $n$ using all data from the previous episodes (episodes $0$ through $n-1$). We denote this dataset as:

$$\mathcal{D}_h^n = \left\{s_h^i, a_h^i, s_{h+1}^i\right\}_{i=0}^{n-1}.$$

We maintain the following statistics using $\mathcal{D}_h^n$:

$$\Lambda_h^n = \sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top + \lambda I.$$

Remarks: when $\phi$ is a one-hot mapping, $\Lambda_h^n$ is a diagonal matrix with $N_h^n(s,a) + \lambda$ on the diagonal. (Recall that $N_h^n(s,a) = \sum_{i=0}^{n-1}\mathbb{1}\{(s_h^i,a_h^i) = (s,a)\}$.)
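
A quick sanity check of this remark (the sizes and visit sequence below are made up for illustration): with one-hot features, $\Lambda_h^n$ is diagonal and its diagonal entries are the visitation counts plus $\lambda$.

```python
import numpy as np

S, A, lam = 3, 2, 1.0                      # illustrative sizes
d = S * A

def phi_onehot(s, a):
    e = np.zeros(d)
    e[s * A + a] = 1.0
    return e

visits = [(0, 1), (0, 1), (2, 0)]          # hypothetical (s, a) visits at step h
Lam = lam * np.eye(d) + sum(np.outer(phi_onehot(s, a), phi_onehot(s, a)) for s, a in visits)

# Diagonal entries are N_h^n(s,a) + lambda; off-diagonal entries are zero.
print(np.diag(Lam))                        # -> [1. 3. 1. 1. 2. 1.] with the visits above
assert np.allclose(Lam, np.diag(np.diag(Lam)))
```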

Remarks: A useful bound: $\det(\Lambda_h^n) = \det\left(\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top + \lambda I\right) \le \det\left(nI + \lambda I\right) = (n+\lambda)^d$. The middle step holds because each rank-one term satisfies $\phi\phi^\top \preceq I$: $xx^\top$ and $x^\top x$ have the same non-zero eigenvalues, and $\|\phi\|_2^2 \le 1$.

We consider the following multi-variate linear regression problem. Denote by $\delta(s)$ the one-hot vector that is zero everywhere except for a one in the entry corresponding to $s$. Define the noise $\epsilon_h^i := \delta(s_{h+1}^i) - P_h(\cdot\mid s_h^i,a_h^i)$, so that $\delta(s_{h+1}^i) = P_h(\cdot\mid s_h^i,a_h^i) + \epsilon_h^i$. Conditioned on the history $\mathcal{H}_h^i$ (where $\mathcal{H}_h^i$ denotes all information from the very beginning of the learning process up to and including $(s_h^i,a_h^i)$), we have:

$$\mathbb{E}\left[\epsilon_h^i \mid \mathcal{H}_h^i\right] = 0$$

simply because $s_{h+1}^i$ is sampled from $P_h(\cdot\mid s_h^i,a_h^i)$ conditioned on $(s_h^i,a_h^i)$. (Remarks: In other words, $\mathbb{E}\left[\delta(s_{h+1}^i)\mid(s_h^i,a_h^i)\right] = P_h(\cdot\mid s_h^i,a_h^i)$.) Also note that $\|\epsilon_h^i\|_1 \le 2$ for all $h,i$.

Since $\mu_h\phi(s_h^i,a_h^i) = P_h(\cdot\mid s_h^i,a_h^i)$, and $\delta(s_{h+1}^i)$ is an unbiased estimate of $P_h(\cdot\mid s_h^i,a_h^i)$ conditioned on $(s_h^i,a_h^i)$, it is reasonable to learn $\mu_h$ via regression from $\phi(s_h^i,a_h^i)$ to $\delta(s_{h+1}^i)$. This leads us to the following ridge linear regression:

$$\widehat{\mu}_h^n = \arg\min_{\mu\in\mathbb{R}^{|S|\times d}}\sum_{i=0}^{n-1}\left\|\mu\,\phi(s_h^i,a_h^i) - \delta(s_{h+1}^i)\right\|_2^2 + \lambda\|\mu\|_F^2.$$

Ridge linear regression has the following closed-form solution:

$$\widehat{\mu}_h^n = \sum_{i=0}^{n-1}\delta(s_{h+1}^i)\,\phi(s_h^i,a_h^i)^\top\left(\Lambda_h^n\right)^{-1}.$$

Note that $\widehat{\mu}_h^n \in \mathbb{R}^{|S|\times d}$, so we never want to store it explicitly. We will always use $\widehat{\mu}_h^n$ together with a specific $(s,a)$ pair and a value function $V$ (think of the value iteration case), i.e., we care about $\widehat{P}_h^n(\cdot\mid s,a)\cdot V := \left(\widehat{\mu}_h^n\phi(s,a)\right)^\top V$, which can be rewritten as:

$$\widehat{P}_h^n(\cdot\mid s,a)\cdot V := \left(\widehat{\mu}_h^n\phi(s,a)\right)^\top V = \phi(s,a)^\top\sum_{i=0}^{n-1}\left(\Lambda_h^n\right)^{-1}\phi(s_h^i,a_h^i)\,V(s_{h+1}^i),$$

Thus we only ever need to compute $\widehat{P}_h^n(\cdot\mid s,a)\cdot V$, which costs $\mathrm{poly}(d,n)$ time and does not depend on $|S|$.
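
A minimal sketch of this computation (function and variable names are mine, not the note's): we never materialize $\widehat{\mu}_h^n$; we only combine the stored features, the next-state values, and $\Lambda_h^n$.

```python
import numpy as np

def phat_dot_V(phi_sa, Phis, V_next, lam=1.0):
    """Compute P_hat_h^n(.|s,a) . V  =  phi(s,a)^T (Lambda_h^n)^{-1} sum_i phi_i V(s'_i).

    phi_sa : (d,)   feature of the query pair (s, a)
    Phis   : (n, d) rows are phi(s_h^i, a_h^i) from episodes 0..n-1
    V_next : (n,)   values V(s_{h+1}^i) at the observed next states
    """
    d = Phis.shape[1]
    Lam = lam * np.eye(d) + Phis.T @ Phis              # Lambda_h^n
    weighted = np.linalg.solve(Lam, Phis.T @ V_next)   # (Lambda_h^n)^{-1} sum_i phi_i V(s'_i)
    return float(phi_sa @ weighted)                    # cost: poly(d, n), no |S| dependence
```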

Lemma 8.3 (Difference between $\widehat{\mu}_h^n$ and $\mu_h$). For all $n$ and $h$, we must have:

$$\widehat{\mu}_h^n - \mu_h = -\lambda\mu_h\left(\Lambda_h^n\right)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top\left(\Lambda_h^n\right)^{-1}$$

Proof: We start from the closed-form solution of $\widehat{\mu}_h^n$:

$$\begin{aligned}
\widehat{\mu}_h^n &= \sum_{i=0}^{n-1}\delta(s_{h+1}^i)\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \sum_{i=0}^{n-1}\left(P_h(\cdot\mid s_h^i,a_h^i) + \epsilon_h^i\right)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \sum_{i=0}^{n-1}\left(\mu_h\phi(s_h^i,a_h^i) + \epsilon_h^i\right)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \mu_h\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \mu_h\left(\Lambda_h^n - \lambda I\right)(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} \\
&= \mu_h - \lambda\mu_h(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1}.
\end{aligned}$$

Rearranging terms concludes the proof.


Lemma 8.4. Fix $V: S \to [0,H]$. With probability at least $1-\delta$, for all $n$ and $h$, we have (where $\|x\|_{A} := \sqrt{x^\top A x}$):

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$

Proof: We first check the noise terms $\{V^\top\epsilon_h^i\}_{h,i}$. Since $V$ is independent of the data (it is a pre-fixed function), by linearity of expectation we have:

$$\mathbb{E}\left[V^\top\epsilon_h^i \mid \mathcal{H}_h^i\right] = 0, \qquad \left|V^\top\epsilon_h^i\right| \le \|V\|_\infty\|\epsilon_h^i\|_1 \le 2H, \quad \forall h,i.$$

Hence, this is a martingale difference sequence. Using the self-normalized vector-valued martingale bound (Lemma A.9), we have that for all $n$, with probability at least $1-\delta$:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$

Applying a union bound over all $h \in [H]$, we get that with probability at least $1-\delta$, for all $n, h$:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$

Uniform Convergence via Covering

The reason to use covering here is that the function class is infinite.

Lemma 8.5. The $\epsilon$-covering number of the ball $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le R\}$, $R \in \mathbb{R}^+$, is upper bounded by $(1 + 2R/\epsilon)^d$.


Proof: We already showed this in Chapter 7. Refer to Lemma 5.2 of https://arxiv.org/pdf/1011.3027.pdf.


We can extend the above result to a function class. Specifically, we look at the following functions. For a triple $(w,\beta,\Lambda)$ where $w\in\mathbb{R}^d$ with $\|w\|_2 \le L$, $\beta\in[0,B]$, and $\Lambda$ such that $\sigma_{\min}(\Lambda)\ge\lambda$, we define $f_{w,\beta,\Lambda}: S \to \mathbb{R}$ as follows:

$$f_{w,\beta,\Lambda}(s) = \min\left\{\max_a\left(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right),\ H\right\}, \quad \forall s\in S.$$

We denote the function class $\mathcal{F}$ as:

$$\mathcal{F} = \left\{f_{w,\beta,\Lambda} : \|w\|_2 \le L,\ \beta\in[0,B],\ \sigma_{\min}(\Lambda)\ge\lambda\right\}.$$

Note that $\mathcal{F}$ contains infinitely many functions, as the parameters are continuous. However, we will show that it has a finite covering number that scales exponentially with the number of parameters in $(w,\beta,\Lambda)$.

Why do we look at $\mathcal{F}$? As we will see later in this chapter, $\mathcal{F}$ contains all possible $\widehat{V}_h^n$ functions one could encounter during the learning process.
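
For concreteness, here is a small sketch of evaluating $f_{w,\beta,\Lambda}$ at a state (assuming a finite action set we can enumerate; `phi` and `actions` are hypothetical stand-ins for the known feature map and action set):

```python
import numpy as np

def f_w_beta_Lam(s, w, beta, Lam, phi, actions, H):
    """f_{w,beta,Lambda}(s) = min{ max_a ( w^T phi(s,a) + beta * sqrt(phi^T Lambda^{-1} phi) ), H }."""
    Lam_inv = np.linalg.inv(Lam)
    vals = [w @ phi(s, a) + beta * np.sqrt(phi(s, a) @ Lam_inv @ phi(s, a)) for a in actions]
    return min(max(vals), H)
```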

Lemma 8.6 ($\epsilon$-covering number of $\mathcal{F}$). Consider $\mathcal{F}$ defined above. Denote its $\epsilon$-cover as $\mathcal{N}_\epsilon$ with the $\ell_\infty$ norm as the distance metric, i.e., $d(f_1,f_2) = \|f_1 - f_2\|_\infty$ for any $f_1,f_2\in\mathcal{F}$. We have that:

$$\ln(|\mathcal{N}_\epsilon|) \le d\ln(1+6L/\epsilon) + \ln\left(1+\frac{6B}{\sqrt{\lambda}\,\epsilon}\right) + d^2\ln\left(1+\frac{18B^2\sqrt{d}}{\lambda\epsilon^2}\right)$$

Note that the log covering number scales quadratically with respect to $d$.


Proof: We start by building a net over the parameter space $(w,\beta,\Lambda)$, and then convert this net over the parameter space into an $\epsilon$-net over $\mathcal{F}$ under the $\ell_\infty$ distance.

We pick two functions $f$ and $\hat{f}$ corresponding to parameters $(w,\beta,\Lambda)$ and $(\widehat{w},\widehat{\beta},\widehat{\Lambda})$.

$$\begin{aligned}
|f(s) - \hat{f}(s)| &\le \left|\max_a\left(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right) - \max_a\left(\widehat{w}^\top\phi(s,a) + \widehat{\beta}\sqrt{\phi(s,a)^\top\widehat{\Lambda}^{-1}\phi(s,a)}\right)\right| \\
&\le \max_a\left|\left(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right) - \left(\widehat{w}^\top\phi(s,a) + \widehat{\beta}\sqrt{\phi(s,a)^\top\widehat{\Lambda}^{-1}\phi(s,a)}\right)\right| \\
&\le \max_a\left|(w-\widehat{w})^\top\phi(s,a)\right| + \max_a\left|(\beta-\widehat{\beta})\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\right| + \max_a\widehat{\beta}\left|\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)} - \sqrt{\phi(s,a)^\top\widehat{\Lambda}^{-1}\phi(s,a)}\right| \\
&\le \|w-\widehat{w}\|_2 + |\beta-\widehat{\beta}|/\sqrt{\lambda} + B\sqrt{\left|\phi(s,a)^\top\left(\Lambda^{-1}-\widehat{\Lambda}^{-1}\right)\phi(s,a)\right|} \\
&\le \|w-\widehat{w}\|_2 + |\beta-\widehat{\beta}|/\sqrt{\lambda} + B\sqrt{\left\|\Lambda^{-1}-\widehat{\Lambda}^{-1}\right\|_F}.
\end{aligned}$$

Remarks: Assume WLOG that $\max_a f(a) - \max_a g(a) \ge 0$ and let $a_0 = \arg\max_a f(a)$. Then $|\max_a f(a) - \max_a g(a)| = \max_a f(a) - \max_a g(a) \le f(a_0) - g(a_0)$. Thus we have $|\max_a f(a) - \max_a g(a)| \le \max_a|f(a) - g(a)|$.

Remarks: The fourth line uses the assumptions $\|\phi(s,a)\|_2 \le 1$, $\sigma_{\min}(\Lambda)\ge\lambda$, and $\widehat{\beta}\le B$, together with the elementary inequality $|\sqrt{u}-\sqrt{v}| \le \sqrt{|u-v|}$.

Note that $\Lambda^{-1}$ is a PD matrix with $\sigma_{\max}(\Lambda^{-1}) \le 1/\lambda$. Now we consider an $\epsilon/3$-net $\mathcal{N}_{\epsilon/3,w}$ over $\{w : \|w\|_2 \le L\}$, a $\sqrt{\lambda}\epsilon/3$-net $\mathcal{N}_{\sqrt{\lambda}\epsilon/3,\beta}$ over the interval $[0,B]$ for $\beta$, and an $\epsilon^2/(9B^2)$-net $\mathcal{N}_{\epsilon^2/(9B^2),\Lambda}$ (in Frobenius norm) over $\{\Lambda^{-1} : \|\Lambda^{-1}\|_F \le \sqrt{d}/\lambda\}$.

Remarks: This is just the usual $\epsilon/3$ argument, in covering-net form.

The product of these three nets provides an $\epsilon$-cover for $\mathcal{F}$, which means that the size of the $\epsilon$-net $\mathcal{N}_\epsilon$ for $\mathcal{F}$ is upper bounded as:

$$\ln|\mathcal{N}_\epsilon| \le \ln|\mathcal{N}_{\epsilon/3,w}| + \ln|\mathcal{N}_{\sqrt{\lambda}\epsilon/3,\beta}| + \ln|\mathcal{N}_{\epsilon^2/(9B^2),\Lambda}| \le d\ln(1+6L/\epsilon) + \ln\left(1+\frac{6B}{\sqrt{\lambda}\epsilon}\right) + d^2\ln\left(1+\frac{18B^2\sqrt{d}}{\lambda\epsilon^2}\right).$$

Now we can build a uniform convergence argument for all $f\in\mathcal{F}$. Yeyah :)!

Lemma 8.7 (Uniform Convergence). Set $\lambda = 1$ and fix $\delta\in(0,1)$. With probability at least $1-\delta$, for all $n,h$, all $(s,a)$, and all $f\in\mathcal{F}$, we have:

$$\left|\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot f\right| \lesssim H\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\sqrt{d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right) + \ln\frac{H}{\delta}}$$

Proof: Recall Lemma 8.4: with probability at least $1-\delta$, for all $n,h$, and for a pre-fixed $V$ (independent of the random process):

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta} \le 9H^2\left(\ln\frac{H}{\delta} + d\ln(1+N)\right)$$

where we have used the facts that $\|\phi\|_2\le 1$, $\lambda = 1$, and $\|\Lambda_h^n\|_2 \le N+1$.

Remarks: $N$ is the total number of episodes. Note that $\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2} \le (n+\lambda)^{d/2}\lambda^{-d/2} \le (N/\lambda+1)^{d/2} = (N+1)^{d/2}$.

Denote the $\epsilon$-cover of $\mathcal{F}$ by $\mathcal{N}_\epsilon$. Applying a union bound over all functions in $\mathcal{N}_\epsilon$, we have that with probability at least $1-\delta$, for all $V\in\mathcal{N}_\epsilon$ and all $n,h$:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\left(\ln\frac{H}{\delta} + \ln(|\mathcal{N}_\epsilon|) + d\ln(1+N)\right)$$

Recalling Lemma 8.6 and substituting the bound on $\ln|\mathcal{N}_\epsilon|$ into the above inequality, we get:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\left(\ln\frac{H}{\delta} + d\ln(1+6L/\epsilon) + d^2\ln\left(1+18B^2\sqrt{d}/\epsilon^2\right) + d\ln(1+N)\right).$$

Now consider an arbitrary $f\in\mathcal{F}$. By the definition of the $\epsilon$-cover, there exists a $V\in\mathcal{N}_\epsilon$ such that $\|f - V\|_\infty \le \epsilon$. Thus, we have:

$$\begin{aligned}
\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(f^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2
&\le 2\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 + 2\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left((V-f)^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \\
&\le 2\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(V^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 + 8\epsilon^2 N^2 \\
&\lesssim H^2\left(\ln\frac{H}{\delta} + d\ln(1+6L/\epsilon) + d^2\ln\left(1+18B^2\sqrt{d}/\epsilon^2\right) + d\ln(1+N)\right) + 8\epsilon^2 N^2,
\end{aligned}$$

where in the second inequality we use the fact that $\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left((V-f)^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le 4\epsilon^2 N^2$, which follows from

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left((V-f)^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \le \left(\sum_{i=0}^{n-1}\left\|\phi(s_h^i,a_h^i)\right\|_{(\Lambda_h^n)^{-1}}\cdot 2\epsilon\right)^2 \le \frac{4\epsilon^2}{\lambda}\left(\sum_{i=0}^{n-1}\|\phi(s_h^i,a_h^i)\|_2\right)^2 \le 4\epsilon^2 N^2.$$

Remarks: Recall that $\|\epsilon_h^i\|_1 \le 2$, which gives $\left|(V-f)^\top\epsilon_h^i\right| \le \|V-f\|_\infty\|\epsilon_h^i\|_1 \le 2\epsilon$.

Setting $\epsilon = 1/N$, we get:

$$\left\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\left(f^\top\epsilon_h^i\right)\right\|_{(\Lambda_h^n)^{-1}}^2 \lesssim H^2\left(\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right) + d\ln(1+N)\right) + 8 \lesssim H^2\left(\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right)\right)$$

where we recall that $\lesssim$ ignores absolute constants. Now recall that we can express $\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot f = \phi(s,a)^\top\left(\widehat{\mu}_h^n - \mu_h\right)^\top f$. Recalling Lemma 8.3, we have:

$$\begin{aligned}
\left|\left(\widehat{\mu}_h^n\phi(s,a) - \mu_h\phi(s,a)\right)^\top f\right|
&\le \left|\lambda\,\phi(s,a)^\top(\Lambda_h^n)^{-1}\mu_h^\top f\right| + \left|\sum_{i=0}^{n-1}\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s_h^i,a_h^i)\,(\epsilon_h^i)^\top f\right| \\
&\le H\sqrt{d}\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}} + \|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\sqrt{H^2\left(\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right)\right)} \\
&\lesssim H\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\sqrt{\ln\frac{H}{\delta} + d\ln(1+6LN) + d^2\ln\left(1+18B^2\sqrt{d}N^2\right)}.
\end{aligned}$$

Remarks: For the second inequality, we have $\left|\lambda\,\phi(s,a)^\top(\Lambda_h^n)^{-1}\mu_h^\top f\right| = \lambda\left|\phi(s,a)^\top(\Lambda_h^n)^{-1/2}(\Lambda_h^n)^{-1/2}\mu_h^\top f\right| \le \lambda\left\|(\Lambda_h^n)^{-1/2}\phi(s,a)\right\|_2\left\|(\Lambda_h^n)^{-1/2}\mu_h^\top f\right\|_2 = \lambda\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\|\mu_h^\top f\|_{(\Lambda_h^n)^{-1}}$. Note that since $f\in\mathcal{F}$ we have $\|f\|_\infty \le H$, and assumption (2) yields $\|\mu_h^\top f\|_{(\Lambda_h^n)^{-1}} \le \|(\Lambda_h^n)^{-1/2}\|_2\|\mu_h^\top f\|_2 \le H\sqrt{d}/\sqrt{\lambda}$. (Note that the smallest eigenvalue of $\Lambda_h^n$ is at least $\lambda$.) Since $\lambda = 1$, we obtain the first part of the second inequality. The second part is similar.


Remarks: Note that this bound is $\tilde{O}(Hd)\cdot\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}$, which will be our bonus term.

Algorithm

Our algorithm, Upper Confidence Bound Value Iteration (UCBVI), will use a reward bonus to ensure optimism. Specifically, we will use the following reward bonus, which is motivated by the bonus used in linear bandits:

$$b_h^n(s,a) = \beta\sqrt{\phi(s,a)^\top\left(\Lambda_h^n\right)^{-1}\phi(s,a)},$$

where $\beta$ contains a polynomial in $H$ and $d$, plus other constants and log terms ($\beta = \tilde{O}(Hd)$). Again, to gain intuition, think about what this bonus looks like when we specialize linear MDPs to tabular MDPs.
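
As a small sketch (the function name is mine), the bonus is just the confidence width from LinUCB evaluated at $\phi(s,a)$. For the tabular specialization: when $\phi$ is one-hot and $\lambda = 1$, we have $\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a) = 1/(N_h^n(s,a)+1)$, so the bonus becomes $\beta/\sqrt{N_h^n(s,a)+1}$, mirroring the familiar $1/\sqrt{N}$-style tabular bonus.

```python
import numpy as np

def bonus(phi_sa, Lam, beta):
    """b_h^n(s,a) = beta * sqrt( phi(s,a)^T (Lambda_h^n)^{-1} phi(s,a) )."""
    return beta * np.sqrt(phi_sa @ np.linalg.solve(Lam, phi_sa))
```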

Algorithm 6 UCBVI for Linear MDPs

  1. Input: parameters $\beta, \lambda$

  2. for $n = 1 \to N$ do

  3. Compute $\widehat{P}_h^n$ for all $h$ (via the ridge-regression closed form above)

  4. Compute the reward bonus $b_h^n$ for all $h$ (as defined above)

  5. Run Value Iteration on $\{\widehat{P}_h^n, r_h + b_h^n\}_{h=0}^{H-1}$ (the truncated VI below)

  6. Set $\pi^n$ as the returned policy of VI.

  7. end for

With the above setup, we now describe the algorithm. In every episode $n$, we learn the model $\widehat{\mu}_h^n$ via ridge linear regression. We then form the (quadratic) reward bonus defined above. With that, we perform the following truncated Value Iteration (always truncating the value at $H$):

$$\begin{aligned}
&\widehat{V}_H^n(s) = 0, \quad \forall s, \\
&\widehat{Q}_h^n(s,a) = \theta_h^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n = \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \left(\theta_h + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right)^\top\phi(s,a), \\
&\widehat{V}_h^n(s) = \min\left\{\max_a\widehat{Q}_h^n(s,a),\ H\right\}, \qquad \pi_h^n(s) = \arg\max_a\widehat{Q}_h^n(s,a).
\end{aligned}$$

Note that $\widehat{Q}_h^n$ above contains two components: a quadratic (bonus) component and a linear component (with respect to $\phi(s,a)$). And $\widehat{V}_h^n$ has the form of $f_{w,\beta,\Lambda}$ defined earlier.
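
Below is a rough, self-contained sketch of this truncated value iteration, under my own illustrative interface assumptions (`data[h]` holds past `(phi(s,a), s_next)` pairs at step $h$, `theta[h]` is the known reward parameter, `phi` is the known feature map, and `actions` is a finite action set); it is a sketch, not the note's reference implementation.

```python
import numpy as np

def ucbvi_linear_plan(data, theta, phi, actions, H, beta, lam=1.0):
    """Truncated value iteration for UCBVI in linear MDPs (sketch).

    data[h] : list of (phi_sa, s_next) pairs from episodes 0..n-1 at step h
    theta[h]: known reward parameter theta_h, shape (d,)
    Returns w[h] and Lam[h] defining
        Q_hat_h(s,a) = w[h] @ phi(s,a) + beta * sqrt(phi(s,a)^T Lam[h]^{-1} phi(s,a)).
    """
    d = len(theta[0])
    w = [None] * H
    Lam = [None] * H

    def V_hat(h, s):
        if h == H:                                   # V_hat_H = 0
            return 0.0
        q = [w[h] @ phi(s, a)
             + beta * np.sqrt(phi(s, a) @ np.linalg.solve(Lam[h], phi(s, a)))
             for a in actions]
        return min(max(q), H)                        # truncate at H

    for h in reversed(range(H)):                     # h = H-1, ..., 0
        Phis = np.array([x for (x, _) in data[h]]).reshape(-1, d)
        Lam[h] = lam * np.eye(d) + Phis.T @ Phis
        # (mu_hat_h^n)^T V_hat_{h+1} = (Lambda_h^n)^{-1} sum_i phi_i V_hat_{h+1}(s'_i)
        v_next = np.array([V_hat(h + 1, s_next) for (_, s_next) in data[h]])
        mu_V = np.linalg.solve(Lam[h], Phis.T @ v_next) if len(data[h]) else np.zeros(d)
        w[h] = theta[h] + mu_V
    return w, Lam
```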

Now we show that $\widehat{V}_h^n$ falls into a class of the form $\mathcal{F}$:

Lemma 8.8. Assume $\beta\in[0,B]$. For all $n,h$, $\widehat{V}_h^n$ has the form of $f_{w,\beta,\Lambda}$ defined earlier, and $\widehat{V}_h^n$ falls into the following class:

$$\mathcal{V} = \left\{f_{w,\beta,\Lambda} : \|w\|_2 \le W + \frac{HN}{\lambda},\ \beta\in[0,B],\ \sigma_{\min}(\Lambda)\ge\lambda\right\}$$

Proof: We just need to show that $\theta_h + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n$ has bounded $\ell_2$ norm. This is easy to show since we always have $\|\widehat{V}_{h+1}^n\|_\infty \le H$, due to the truncation in Value Iteration:

$$\left\|\theta_h + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right\|_2 \le W + \left\|(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right\|_2$$

Remarks: This uses the triangle inequality and assumption (3) ($\|\theta_h\|_2 \le W$).

Now we use the closed form of $\widehat{\mu}_h^n$ derived above:

$$\left\|(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\right\|_2 = \left\|\sum_{i=0}^{n-1}\widehat{V}_{h+1}^n(s_{h+1}^i)\,(\Lambda_h^n)^{-1}\phi(s_h^i,a_h^i)\right\|_2 \le H\left\|(\Lambda_h^n)^{-1}\right\|_2\sum_{i=0}^{n-1}\left\|\phi(s_h^i,a_h^i)\right\|_2 \le \frac{Hn}{\lambda},$$

where we use the facts that $\|\widehat{V}_{h+1}^n\|_\infty \le H$, $\sigma_{\max}\left((\Lambda_h^n)^{-1}\right) \le 1/\lambda$, and $\sup_{s,a}\|\phi(s,a)\|_2 \le 1$.


Analysis of UCBVI for Linear MDPs

We finally arrive at the regret bound!!

In this section, we prove the following regret bound for UCBVI.

Theorem 8.9 (Regret Bound). Set $\beta = \tilde{O}(Hd)$ and $\lambda = 1$. UCBVI (Algorithm 6) achieves the following regret bound:

$$\mathbb{E}\left[N V^* - \sum_{n=1}^{N} V^{\pi^n}\right] \le \tilde{O}\left(H^2\sqrt{d^3 N}\right)$$

The main steps of the proof are similar to those for UCBVI in tabular MDPs. We first prove optimism via induction, then use optimism to upper bound the per-episode regret, and finally use the simulation lemma to decompose the per-episode regret. To keep notation simple, we set $\lambda = 1$ throughout this section.

Proving Optimism

Here we use a different $\beta$, since the function class is slightly different from the one used in Lemma 8.7 (see page 93 of the book). We denote the event of Lemma 8.7 by $\mathcal{E}_{model}$.

Lemma 8.10 (Optimism). Assume the event $\mathcal{E}_{model}$ holds. Then for all $n$ and $h$,

$$\widehat{V}_h^n(s) \ge V_h^*(s), \quad \forall s.$$

Proof: Consider a fixed episode $n$. We prove the claim by induction on $h$; the base case $h = H$ is trivial since $\widehat{V}_H^n = V_H^* = 0$. Assume that $\widehat{V}_{h+1}^n(s) \ge V_{h+1}^*(s)$ for all $s$. For time step $h$, we have:

$$\begin{aligned}
\widehat{Q}_h^n(s,a) - Q_h^*(s,a) &= \theta_h^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n - \theta_h^\top\phi(s,a) - \phi(s,a)^\top\mu_h^\top V_{h+1}^* \\
&\ge \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top\left(\widehat{\mu}_h^n - \mu_h\right)^\top\widehat{V}_{h+1}^n,
\end{aligned}$$

where in the last inequality we use the inductive hypothesis $\widehat{V}_{h+1}^n(s) \ge V_{h+1}^*(s)$ and the fact that $\mu_h\phi(s,a)$ is a valid distribution (note that $\widehat{\mu}_h^n\phi(s,a)$ is not necessarily a valid distribution). We need to show that the bonus is big enough to offset the model error $\phi(s,a)^\top(\widehat{\mu}_h^n - \mu_h)^\top\widehat{V}_{h+1}^n$. Since the event $\mathcal{E}_{model}$ holds, we have that:

$$\left|\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot\widehat{V}_{h+1}^n\right| \le \beta\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}},$$

as by the construction of $\mathcal{V}$, we know that $\widehat{V}_{h+1}^n \in \mathcal{V}$. This shows $\widehat{Q}_h^n(s,a) \ge Q_h^*(s,a)$ for all $(s,a)$; since $V_h^*(s) \le H$, the truncation preserves this, i.e., $\widehat{V}_h^n(s) = \min\{\max_a\widehat{Q}_h^n(s,a), H\} \ge V_h^*(s)$. This concludes the proof.


Regret Decomposition

Now we can upper bound the per-episode regret as follows:

$$V_0^*(s_0) - V_0^{\pi^n}(s_0) \le \widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \qquad \text{(by optimism, Lemma 8.10)}.$$

We can further bound the RHS of the above inequality using the simulation lemma. Recall the decomposition we derived in the Chapter 7 note on tabular MDPs:

$$\widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \le \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\left[b_h^n(s,a) + \left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot\widehat{V}_{h+1}^n\right].$$

(Recall that the simulation lemma holds for any MDP!)

Under the event $\mathcal{E}_{model}$, we already know that for any $s,a,h,n$, we have $\left(\widehat{P}_h^n(\cdot\mid s,a) - P_h(\cdot\mid s,a)\right)\cdot\widehat{V}_{h+1}^n \le \beta\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}} = b_h^n(s,a)$. Hence, under $\mathcal{E}_{model}$, we have:

$$\widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \le \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\left[2b_h^n(s,a)\right] \lesssim \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\left[b_h^n(s,a)\right]$$

Summing over all episodes, we obtain the following statement.

Lemma 8.11 (Regret Bound). Assume the event $\mathcal{E}_{model}$ holds. We have:

$$\sum_{n=0}^{N-1}\left(V_0^*(s_0) - V_0^{\pi^n}(s_0)\right) \lesssim \sum_{n=0}^{N-1}\sum_{h=0}^{H-1}\mathbb{E}_{s_h^n,a_h^n\sim d_h^{\pi^n}}\left[b_h^n(s_h^n,a_h^n)\right]$$

Conclusion

Lemma 8.12 (Elliptical Potential). Consider an arbitrary sequence of state-action pairs $\{s_h^i, a_h^i\}$. Assume $\sup_{s,a}\|\phi(s,a)\|_2 \le 1$. Denote $\Lambda_h^n = I + \sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top$. We have:

$$\sum_{i=0}^{N-1}\phi(s_h^i,a_h^i)^\top\left(\Lambda_h^i\right)^{-1}\phi(s_h^i,a_h^i) \le 2\ln\left(\frac{\det(\Lambda_h^N)}{\det(I)}\right) \le 2d\ln(N).$$

Proof: By Lemmas 3.7 and 3.8 in the linear bandit lecture note,

$$\sum_{i=0}^{N-1}\phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i) \le 2\sum_{i=0}^{N-1}\ln\left(1 + \phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i)\right) \le 2\ln\left(\frac{\det(\Lambda_h^N)}{\det(I)}\right) \le 2d\ln(N)$$

where the first inequality uses the fact that $\ln(1+y) \ge y/2$ for $0 \le y \le 1$.

Remarks: We denote $\phi(s_h^i,a_h^i)$ by $\phi_i$ for simplicity. Here we use:

$$\begin{aligned}
\det(\Lambda_h^n) &= \det\left(\Lambda_h^{n-1} + \phi_{n-1}\phi_{n-1}^\top\right) \\
&= \det\left((\Lambda_h^{n-1})^{\frac12}\left(I + (\Lambda_h^{n-1})^{-\frac12}\phi_{n-1}\phi_{n-1}^\top(\Lambda_h^{n-1})^{-\frac12}\right)(\Lambda_h^{n-1})^{\frac12}\right) \\
&= \det(\Lambda_h^{n-1})\det\left(I + (\Lambda_h^{n-1})^{-\frac12}\phi_{n-1}\left((\Lambda_h^{n-1})^{-\frac12}\phi_{n-1}\right)^\top\right) \\
&= \det(\Lambda_h^{n-1})\left(1 + \phi_{n-1}^\top(\Lambda_h^{n-1})^{-1}\phi_{n-1}\right),
\end{aligned}$$

where we use $\det(I_p + AB) = \det(I_q + BA)$ in the last equality. Therefore, we have

$$\prod_{i=0}^{N-1}\left(1 + \phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i)\right) = \prod_{i=0}^{N-1}\frac{\det(\Lambda_h^{i+1})}{\det(\Lambda_h^i)} = \frac{\det(\Lambda_h^N)}{\det(I)},
$$

which gives the second inequality (after taking logs).
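
A quick numerical sanity check of the elliptical potential bound (purely illustrative; the random unit-norm features below are my own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 2000
Lam = np.eye(d)                                # Lambda_h^0 = I   (lambda = 1)
total = 0.0
for _ in range(N):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                     # sup ||phi||_2 <= 1
    total += x @ np.linalg.solve(Lam, x)       # phi_i^T (Lambda_h^i)^{-1} phi_i
    Lam += np.outer(x, x)                      # Lambda_h^{i+1}
print(total, 2 * np.log(np.linalg.det(Lam)), 2 * d * np.log(N))
# The running sum stays below 2*ln det(Lambda_h^N), which in turn is below 2*d*ln(N) here.
```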

We will now use Lemma 8.11 together with the elliptical potential lemma above to conclude the proof of the regret bound.


Now we give the proof of the regret bound.


Proof: [Proof of Theorem 8.9] We split the expected regret based on the event $\mathcal{E}_{model}$.

$$\begin{aligned}
\mathbb{E}\left[NV^* - \sum_{n=1}^{N}V^{\pi^n}\right] &= \mathbb{E}\left[\mathbb{1}\{\mathcal{E}_{model}\text{ holds}\}\left(NV^* - \sum_{n=1}^{N}V^{\pi^n}\right)\right] + \mathbb{E}\left[\mathbb{1}\{\mathcal{E}_{model}\text{ doesn't hold}\}\left(NV^* - \sum_{n=1}^{N}V^{\pi^n}\right)\right] \\
&\le \mathbb{E}\left[\mathbb{1}\{\mathcal{E}_{model}\text{ holds}\}\left(NV^* - \sum_{n=1}^{N}V^{\pi^n}\right)\right] + \delta NH \\
&\lesssim \mathbb{E}\left[\sum_{n=1}^{N}\sum_{h=0}^{H-1}b_h^n(s_h^n,a_h^n)\right] + \delta NH,
\end{aligned}$$

where the second step uses $NV^* - \sum_n V^{\pi^n} \le NH$ together with $\Pr[\mathcal{E}_{model}\text{ fails}] \le \delta$, and the last step uses optimism (Lemma 8.10) and Lemma 8.11.

Note that:

$$\begin{aligned}
\sum_{n=1}^{N}\sum_{h=0}^{H-1}b_h^n(s_h^n,a_h^n) &= \beta\sum_{n=1}^{N}\sum_{h=0}^{H-1}\sqrt{\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \\
&= \beta\sum_{h=0}^{H-1}\sum_{n=1}^{N}\sqrt{\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \\
&\le \beta\sum_{h=0}^{H-1}\sqrt{N\sum_{n=1}^{N}\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \\
&\lesssim \beta H\sqrt{Nd\ln(N)},
\end{aligned}$$

where the third step is Cauchy–Schwarz and the last step uses the elliptical potential lemma (Lemma 8.12). Recalling that $\beta = \tilde{O}(Hd)$, and choosing $\delta$ small (e.g., $\delta = 1/(NH)$, so that the $\delta NH$ term is a constant), the total expected regret is $\tilde{O}\left(\beta H\sqrt{dN}\right) = \tilde{O}\left(H^2 d\sqrt{dN}\right)$. This concludes the proof.


Remarks: We have thus derived a regret bound of $\tilde{O}(H^2 d\sqrt{dN}) = \tilde{O}(H^2\sqrt{d^3 N})$, with no dependence on $|S|$ or $|A|$!