Overview

These are my notes for Chapter 1 of Reinforcement Learning Theory.

Markov Decision Process

Discounted (Infinite-Horizon) MDP

$M = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu)$ (states, actions, transition kernel, reward, discount factor, initial state distribution)

Objective, Policies, and Values

Bellman Consistency Equations for Stationary Policies

Lemma 1.4. Suppose that $\pi$ is a stationary policy. Then $V^\pi$ and $Q^\pi$ satisfy the following Bellman consistency equations: for all $s \in \mathcal{S}$, $a \in \mathcal{A}$,

$$V^\pi(s) = Q^\pi(s, \pi(s)), \qquad Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big],$$

where for a stochastic $\pi$, $Q^\pi(s,\pi(s))$ is shorthand for $\mathbb{E}_{a\sim\pi(\cdot\mid s)}[Q^\pi(s,a)]$.

Proof:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[Q^\pi(s,a)\big] = Q^\pi(s,\pi(s)),$$

$$\begin{aligned}
Q^\pi(s,a) &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \,\Big|\, \pi, s_0=s, a_0=a\Big] \\
&= r(s,a) + \gamma\,\mathbb{E}\Big[\textstyle\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t,a_t) \,\Big|\, \pi, s_0=s, a_0=a\Big] \\
&= r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big].
\end{aligned}$$

It is helpful to view $V^\pi$ as a vector of length $|\mathcal{S}|$, and $Q^\pi$ and $r$ as vectors of length $|\mathcal{S}|\cdot|\mathcal{A}|$. We overload notation and let $P$ also refer to a matrix of size $(|\mathcal{S}||\mathcal{A}|) \times |\mathcal{S}|$ whose entry $P_{(s,a),s'}$ equals $P(s'\mid s,a)$.

Remarks: This is natural since a function on a finite domain is just a vector with one coordinate per input point (and, more generally, a function can be viewed as an infinite-dimensional vector indexed by its domain).

We will also define $P^\pi$ to be the transition matrix on state-action pairs induced by a stationary policy $\pi$, specifically

$$P^\pi_{(s,a),(s',a')} := P(s'\mid s,a)\,\pi(a'\mid s').$$

Remarks: $P^\pi \in \mathbb{R}^{(|\mathcal{S}||\mathcal{A}|)\times(|\mathcal{S}||\mathcal{A}|)}$.

In particular, for deterministic policies, we have:

$$P^\pi_{(s,a),(s',a')} := \begin{cases} P(s'\mid s,a) & \text{if } a' = \pi(s'), \\ 0 & \text{if } a' \neq \pi(s'). \end{cases}$$

With this notation, it is straightforward to verify

$$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi.$$

Proof: Notice that

$$[P V^\pi]_{(s,a)} = \sum_{s' \in \mathcal{S}} P(s'\mid s,a)\, V^\pi(s') = \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big].$$

Thus, we have

$$[Q^\pi]_{(s,a)} = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big].$$

Also, note that

$$\begin{aligned}
[P^\pi Q^\pi]_{(s,a)} &= \sum_{s',a'} P(s'\mid s,a)\,\pi(a'\mid s')\, Q^\pi(s',a') \\
&= \sum_{s'} P(s'\mid s,a)\, \mathbb{E}_{a' \sim \pi(\cdot\mid s')}\big[Q^\pi(s',a')\big] \\
&= \sum_{s'} P(s'\mid s,a)\, V^\pi(s').
\end{aligned}$$

Corollary 1.5. Suppose that $\pi$ is a stationary policy. Then

$$Q^\pi = (I - \gamma P^\pi)^{-1} r,$$

where I is the identity matrix.


Proof: We need to show that $I - \gamma P^\pi$ is invertible, i.e., that $(I - \gamma P^\pi)x \neq 0$ for every non-zero vector $x$. Invertibility suffices, since rearranging $Q^\pi = r + \gamma P^\pi Q^\pi$ gives $(I - \gamma P^\pi) Q^\pi = r$.

To see that $I - \gamma P^\pi$ is invertible, observe that for any non-zero vector $x \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

$$\begin{aligned}
\|(I - \gamma P^\pi)x\|_\infty &= \|x - \gamma P^\pi x\|_\infty \\
&\geq \|x\|_\infty - \gamma\,\|P^\pi x\|_\infty \\
&\geq \|x\|_\infty - \gamma\,\|x\|_\infty \quad \text{(each entry of } P^\pi x \text{ is a convex combination of entries of } x\text{)} \\
&= (1-\gamma)\,\|x\|_\infty > 0,
\end{aligned}$$

which implies that $I - \gamma P^\pi$ is full rank.
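To make the matrix notation concrete, here is a minimal numpy sketch (my own, not from the book): it builds a small random MDP with $P$ stored as an $(|\mathcal{S}||\mathcal{A}|) \times |\mathcal{S}|$ array in $(s,a)$ row order, forms $P^\pi$ for a deterministic policy, solves $Q^\pi = (I - \gamma P^\pi)^{-1} r$, and checks the Bellman consistency equation $Q^\pi = r + \gamma P V^\pi$. All names (`P`, `P_pi`, `Q_pi`, ...) are my own conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# A small random MDP: row (s, a) of P is the distribution P(.|s, a); rewards in [0, 1].
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)

pi = rng.integers(nA, size=nS)            # a deterministic stationary policy

# P^pi on state-action pairs: P^pi[(s,a), (s',a')] = P(s'|s,a) * 1{a' = pi(s')}
P_pi = np.zeros((nS * nA, nS * nA))
for sa in range(nS * nA):
    for s_next in range(nS):
        P_pi[sa, s_next * nA + pi[s_next]] = P[sa, s_next]

# Corollary 1.5: Q^pi = (I - gamma * P^pi)^{-1} r
Q_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r)

# Bellman consistency check: Q^pi = r + gamma * P V^pi, with V^pi(s) = Q^pi(s, pi(s)).
V_pi = Q_pi.reshape(nS, nA)[np.arange(nS), pi]
assert np.allclose(Q_pi, r + gamma * P @ V_pi)
```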


Lemma 1.6. We have that

$$\big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s,a),(s',a')} = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a),$$

so we can view the $(s,a)$-th row of this matrix as an induced (discounted) distribution over states and actions when following $\pi$ after starting with $s_0 = s$ and $a_0 = a$.


Proof: We need to show $\sum_{t=0}^{\infty} \gamma^t P_t^\pi\,(I - \gamma P^\pi) = I$, where we use $P_t^\pi$ to denote the $t$-step transition matrix with $[P_t^\pi]_{(s,a),(s',a')} = \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a)$.

Thus, for the entry at $\big((s,a),(s',a')\big)$, we need to show

$$\sum_{\tilde s,\tilde a}\; \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = \tilde s, a_t = \tilde a \mid s_0 = s, a_0 = a)\,\Big(\mathbb{1}\{(\tilde s,\tilde a) = (s',a')\} - \gamma\, P(s'\mid \tilde s,\tilde a)\,\pi(a'\mid s')\Big) = \mathbb{1}\{(s,a) = (s',a')\}.$$

For the LHS, we have

$$\begin{aligned}
\text{LHS} &= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a) - \sum_{t=0}^{\infty} \gamma^{t+1} \sum_{\tilde s,\tilde a} \mathbb{P}^\pi(s_t = \tilde s, a_t = \tilde a \mid s_0 = s, a_0 = a)\, P(s'\mid \tilde s,\tilde a)\,\pi(a'\mid s') \\
&= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a) - \sum_{t=0}^{\infty} \gamma^{t+1}\, \mathbb{P}^\pi(s_{t+1} = s', a_{t+1} = a' \mid s_0 = s, a_0 = a) \\
&= \mathbb{P}^\pi(s_0 = s', a_0 = a' \mid s_0 = s, a_0 = a),
\end{aligned}$$

where the second equality uses the Chapman-Kolmogorov equation. The proof is completed by noting that $\mathbb{P}^\pi(s_0 = s', a_0 = a' \mid s_0 = s, a_0 = a) = \mathbb{1}\{(s,a) = (s',a')\}$.
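As a quick sanity check of Lemma 1.6 (again a sketch with my own setup, not the book's code): for a random MDP and a deterministic policy, the matrix $(1-\gamma)(I-\gamma P^\pi)^{-1}$ should have nonnegative entries and rows that each sum to one.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
pi = rng.integers(nA, size=nS)

P_pi = np.zeros((nS * nA, nS * nA))
for sa in range(nS * nA):
    for s_next in range(nS):
        P_pi[sa, s_next * nA + pi[s_next]] = P[sa, s_next]

# Lemma 1.6: row (s, a) of (1 - gamma)(I - gamma P^pi)^{-1} is the discounted
# state-action visitation distribution started from (s_0, a_0) = (s, a).
D = (1 - gamma) * np.linalg.inv(np.eye(nS * nA) - gamma * P_pi)
assert np.all(D >= -1e-12)              # entries are nonnegative
assert np.allclose(D.sum(axis=1), 1.0)  # each row sums to one
```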

Bellman Optimality Equations

There exists a stationary and deterministic policy that simultaneously maximizes $V^\pi(s)$ for all $s$.

Theorem 1.7. Let $\Pi$ be the set of all non-stationary and randomized policies. Define

$$V^\star(s) := \sup_{\pi \in \Pi} V^\pi(s), \qquad Q^\star(s,a) := \sup_{\pi \in \Pi} Q^\pi(s,a),$$

which are finite since $V^\pi(s)$ and $Q^\pi(s,a)$ are bounded between $0$ and $1/(1-\gamma)$.

There exists a stationary and deterministic policy $\pi$ such that for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$,

$$V^\pi(s) = V^\star(s), \qquad Q^\pi(s,a) = Q^\star(s,a).$$

We refer to such a π as an optimal policy.


Proof at P9 of the book.


Remarks: The optimal policy used in the proof is $\tilde\pi(s) = \operatorname{arg\,sup}_{a \in \mathcal{A}} \mathbb{E}\big[r(s,a) + \gamma V^\star(s_1) \,\big|\, (s_0, a_0) = (s,a)\big]$.

Theorem 1.8 (Bellman optimality equations). We say that a vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ satisfies the Bellman optimality equations if:

$$Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q(s',a')\Big].$$

For any $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, we have that $Q = Q^\star$ if and only if $Q$ satisfies the Bellman optimality equations. Furthermore, the deterministic policy defined by $\pi(s) \in \arg\max_{a \in \mathcal{A}} Q^\star(s,a)$ is an optimal policy (where ties are broken in some arbitrary manner).


Proof at P10 of the book.

Notice that $V^\star(s) = V^{\pi^\star}(s) = Q^{\pi^\star}(s, \pi^\star(s)) \geq Q^\star(s,a)$ for all $a$, when $\pi^\star$ is the optimal stationary and deterministic policy.


Remarks: Here more notation is introduced: define the Bellman optimality operator $\mathcal{T} : \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \to \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ by

$$(\mathcal{T}Q)(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q(s',a')\Big].$$

This allows us to rewrite the Bellman optimality equation in the concise form

$$Q = \mathcal{T}Q,$$

and so the previous theorem states that $Q = Q^\star$ if and only if $Q$ is a fixed point of the operator $\mathcal{T}$.

Remarks: The equivalent form in $V$ can be written as

$$V^\star(s) = \max_{a}\Big\{ r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\star(s')\big] \Big\},$$

where $V^\star$ is a vector of length $|\mathcal{S}|$ (straightforward to verify since $V^\star(s) = \max_a Q^\star(s,a)$), and the equation can be written concisely as $V = \mathcal{T}V$.

The two theorems together establish that an optimal stationary and deterministic policy exists, and that $Q^\star$ is the unique solution of the Bellman optimality equations (equivalently, the unique fixed point of $\mathcal{T}$).

Finite-Horizon MDP

$M = (\mathcal{S}, \mathcal{A}, \{P_h\}_h, \{r_h\}_h, H, \mu)$

Changes compared with the discounted setting: the transitions $P_h$ and rewards $r_h$ may depend on the time step $h$, there is no discounting, the horizon is a finite $H$, and policies and values are indexed by $h$ (e.g. $\pi(s,h)$, $V_h^\pi$, $Q_h^\pi$).

Theorem 1.9 (Bellman optimality equations). Define

$$Q_h^\star(s,a) = \sup_{\pi \in \Pi} Q_h^\pi(s,a),$$

where the sup is over all non-stationary and randomized policies. Suppose that $Q_H = 0$. We have that $Q_h = Q_h^\star$ for all $h \in [H]$ if and only if for all $h \in [H]$,

$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q_{h+1}(s',a')\Big].$$

Furthermore, $\pi^\star(s,h) = \arg\max_{a \in \mathcal{A}} Q_h^\star(s,a)$ is an optimal policy.


Proof: The flavor of this proof is similar to the infinite-horizon case.

First note that Theorem 1.7 also applies to the finite-horizon setting, though the deterministic optimal policy now depends on $h$: $\pi^\star(\cdot, h)$. The proof is straightforward.

$$\begin{aligned}
Q_h^\star(s,a) = \sup_{\pi \in \Pi} Q_h^\pi(s,a) &= \sup_{\pi \in \Pi}\; r_h(s,a) + \mathbb{E}\Big[\textstyle\sum_{t=h+1}^{H-1} r_t(s_t,a_t) \,\Big|\, \pi, s_h = s, a_h = a\Big] \\
&= \sup_{\pi \in \Pi}\; r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\Big[\mathbb{E}\Big[\textstyle\sum_{t=h+1}^{H-1} r_t(s_t,a_t) \,\Big|\, \pi, s_{h+1} = s'\Big]\Big] \\
&= r_h(s,a) + \sup_{\pi \in \Pi}\; \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\big[V_{h+1}^\pi(s')\big].
\end{aligned}$$

Likewise, we define $V_{h+1}^\star(s') = \sup_{\pi \in \Pi} V_{h+1}^\pi(s')$ for all $s'$; the sup can be moved inside the expectation because the policy from step $h+1$ onward may be chosen as a function of $s'$. Thus we have

$$Q_h^\star(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\big[V_{h+1}^\star(s')\big].$$

So we only need to show that

$$V_h^\star(s) = \max_a Q_h^\star(s,a),$$

which follows the exact same proof as in the infinite-horizon MDP, using an optimal deterministic policy $\pi^\star(\cdot, h)$.

The reverse direction is very similar.


Overall, the finite-horizon MDP is a factor of $O(H)$ larger than the infinite-horizon one, since the transitions and rewards are specified separately for each time step $h$.

Computational Complexity

Suppose that $(P, r, \gamma)$ in our MDP $M$ is specified with rational entries. Let $L(P, r, \gamma)$ denote the total bit-size required to specify $M$, and assume that basic arithmetic operations $+, -, \times, \div$ take unit time. Here, we may hope for an algorithm which (exactly) returns an optimal policy whose runtime is polynomial in $L(P, r, \gamma)$ and the number of states and actions. More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want to explicitly restrict $(P, r, \gamma)$ to be specified by rationals. An algorithm is said to be strongly polynomial if it returns an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on $L(P, r, \gamma)$).

Remarks: For computational complexity, there are some standard conventions in optimization that are not spelled out here, such as what exactly $L(P, r, \gamma)$ is and how it relates to $\epsilon$; refer to the slides for a detailed introduction.

Value Iteration

Q-value iteration: starting from some $Q$, we iteratively apply the operator $\mathcal{T}$.

Lemma 1.10 (contraction). For any two vectors $Q, Q' \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

$$\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \leq \gamma\,\|Q - Q'\|_\infty.$$

Proof at P13 of the book.


Lemma 1.11 (Q-error amplification). For any vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

$$V^{\pi_Q} \geq V^\star - \frac{2\,\|Q - Q^\star\|_\infty}{1-\gamma}\,\mathbb{1},$$

where $\mathbb{1}$ denotes the vector of all ones and $\pi_Q(s) := \arg\max_a Q(s,a)$ is the greedy policy with respect to $Q$.


Proof at P14 of the book.


Theorem 1.12 (Q-value iteration convergence). Set $Q^{(0)} = 0$. For $k = 0, 1, \dots$, suppose:

$$Q^{(k+1)} = \mathcal{T}Q^{(k)}.$$

Let $\pi^{(k)} = \pi_{Q^{(k)}}$. For $k \geq \dfrac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}$,

$$V^{\pi^{(k)}} \geq V^\star - \epsilon\,\mathbb{1}.$$

This is actually a corollary of Lemmas 1.10 and 1.11, together with the fact that $(1-x)^k \leq \exp(-xk)$.


Remarks: Setting $\epsilon$ on the order of $2^{-L(P,r,\gamma)}$, and noting that each application of $\mathcal{T}$ costs $O(|\mathcal{S}|^2|\mathcal{A}|)$, we obtain the runtime reported in the book's table.
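Here is a minimal implementation sketch of Q-value iteration, under the same tabular representation as before ($P$ an $(|\mathcal{S}||\mathcal{A}|)\times|\mathcal{S}|$ array, $r$ a vector); the iteration count follows Theorem 1.12, but the function name and code are my own:

```python
import numpy as np

def q_value_iteration(P, r, gamma, eps):
    """Q-value iteration (Theorem 1.12): iterate Q <- T(Q) starting from Q = 0.

    P: (S*A, S) transition matrix with row order (s, a); r: (S*A,) rewards in [0, 1].
    Returns the greedy policy of the final iterate and the iterate itself.
    """
    nSA, nS = P.shape
    nA = nSA // nS
    Q = np.zeros(nSA)
    num_iters = int(np.ceil(np.log(2.0 / ((1 - gamma) ** 2 * eps)) / (1 - gamma)))
    for _ in range(num_iters):
        V = Q.reshape(nS, nA).max(axis=1)   # V(s) = max_a Q(s, a)
        Q = r + gamma * P @ V               # (T Q)(s, a) = r(s, a) + gamma * E[max_a' Q(s', a')]
    return Q.reshape(nS, nA).argmax(axis=1), Q
```

Each loop iteration is one application of $\mathcal{T}$, and the matrix-vector product `P @ V` is exactly the $O(|\mathcal{S}|^2|\mathcal{A}|)$ cost mentioned in the remark above.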

Policy Iteration

The policy iteration algorithm, for discounted MDPs, starts from an arbitrary policy $\pi_0$ and repeats the following iterative procedure: for $k = 0, 1, 2, \dots$

  1. Policy evaluation. Compute $Q^{\pi_k}$.

  2. Policy improvement. Update the policy:

$$\pi_{k+1} = \pi_{Q^{\pi_k}}.$$

In each iteration, we compute the Q-value function of πk, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value. The first step is often called policy evaluation, and the second step is often called policy improvement.

Remarks: Equation 0.2 is $Q^\pi = (I - \gamma P^\pi)^{-1} r$ (Corollary 1.5).
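A sketch of exact policy iteration under the same tabular representation ($P$ of shape $(|\mathcal{S}||\mathcal{A}|, |\mathcal{S}|)$, deterministic policies as integer arrays); the function name is mine. Policy evaluation is done via the $|\mathcal{S}|\times|\mathcal{S}|$ linear system, matching the per-iteration cost discussed below.

```python
import numpy as np

def policy_iteration(P, r, gamma, num_iters):
    """Exact policy iteration for a tabular discounted MDP.

    P: (S*A, S) transitions with row order (s, a); r: (S*A,) rewards.
    Each iteration evaluates Q^{pi_k} and then acts greedily with respect to it.
    """
    nSA, nS = P.shape
    nA = nSA // nS
    pi = np.zeros(nS, dtype=int)                          # arbitrary initial policy
    for _ in range(num_iters):
        # Policy evaluation: solve the |S| x |S| system (I - gamma P_pi) V = r_pi,
        # then lift to Q^pi(s, a) = r(s, a) + gamma * E_{s'}[V^pi(s')].
        P_pi = P.reshape(nS, nA, nS)[np.arange(nS), pi]   # P(.|s, pi(s)), shape (S, S)
        r_pi = r.reshape(nS, nA)[np.arange(nS), pi]
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        Q = r + gamma * P @ V
        # Policy improvement: greedy with respect to Q^{pi_k}.
        pi = Q.reshape(nS, nA).argmax(axis=1)
    return pi
```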

Lemma 1.13. We have that

$$Q^{\pi_{k+1}} \geq \mathcal{T}Q^{\pi_k} \geq Q^{\pi_k},$$
$$\|Q^{\pi_{k+1}} - Q^\star\|_\infty \leq \gamma\,\|Q^{\pi_k} - Q^\star\|_\infty.$$

Proof at P15 of the book.

Remarks: Recall $[P^\pi Q^\pi]_{(s,a)} = \sum_{s',a'} P(s'\mid s,a)\,\pi(a'\mid s')\,Q^\pi(s',a')$; when $\pi$ is deterministic, $[P^\pi Q^\pi]_{(s,a)} = \sum_{s'} P(s'\mid s,a)\,Q^\pi(s',\pi(s'))$. Since $\pi_{k+1}$ is the greedy policy with respect to $Q^{\pi_k}$, we have $P^{\pi_{k+1}} Q^{\pi_k} \geq P^{\pi_k} Q^{\pi_k}$. Unrolling the recursion gives

$$Q^{\pi_k} = r + \gamma P^{\pi_k} Q^{\pi_k} \leq r + \gamma P^{\pi_{k+1}} Q^{\pi_k} \leq r + \gamma P^{\pi_{k+1}}\big(r + \gamma P^{\pi_{k+1}} Q^{\pi_k}\big) \leq \cdots \leq \sum_{t=0}^{\infty} \gamma^t \big(P^{\pi_{k+1}}\big)^t r = Q^{\pi_{k+1}}.$$


Theorem 1.14 (policy iteration convergence). Let $\pi_0$ be any initial policy. For $k \geq \dfrac{\log\frac{1}{(1-\gamma)\epsilon}}{1-\gamma}$, the $k$-th policy in policy iteration has the following performance bound:

$$Q^{\pi_k} \geq Q^\star - \epsilon\,\mathbb{1}.$$

Proof: Notice that

$$\|Q^{\pi_k} - Q^\star\|_\infty \leq \gamma^k\,\|Q^{\pi_0} - Q^\star\|_\infty = \big(1-(1-\gamma)\big)^k\,\|Q^{\pi_0} - Q^\star\|_\infty \leq \frac{\exp\!\big(-(1-\gamma)k\big)}{1-\gamma},$$

where the last step uses $\|Q^{\pi_0} - Q^\star\|_\infty \leq \frac{1}{1-\gamma}$ and $(1-x)^k \leq \exp(-xk)$; for the stated $k$, the right-hand side is at most $\epsilon$.

Remarks: Note that PI and VI have the same convergence rate for $Q$. However, every iteration of PI is more expensive than one iteration of VI, as PI costs $O(|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|)$ per iteration. Since $\pi_k$ is deterministic, policy evaluation only needs the $|\mathcal{S}| \times |\mathcal{S}|$ state transition matrix $P_{\pi_k}(s' \mid s) = P(s' \mid s, \pi_k(s))$ rather than the full $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|$ matrix; the matrix inversion then contributes $O(|\mathcal{S}|^3)$, and computing $Q^{\pi_k}(s,a) = r(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V^{\pi_k}(s')$ for all $(s,a)$ contributes $O(|\mathcal{S}|^2|\mathcal{A}|)$.

PI can in fact be shown to be strongly polynomial (the intuition is that the number of deterministic policies is finite, $|\mathcal{A}|^{|\mathcal{S}|}$, and PI improves monotonically, so it never revisits a policy).

Value Iteration for Finite Horizon MDPs

Let us now specify the value iteration algorithm for finite-horizon MDPs. For the finite-horizon setting, it turns out that the analogues of value iteration and policy iteration lead to identical algorithms. The value iteration algorithm is specified as follows:

  1. Set $Q_{H-1}(s,a) = r_{H-1}(s,a)$.

  2. For $h = H-2, \dots, 0$, set:

$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q_{h+1}(s',a')\Big].$$

By Theorem 1.9, it follows that $Q_h(s,a) = Q_h^\star(s,a)$ and that $\pi^\star(s,h) = \arg\max_{a \in \mathcal{A}} Q_h^\star(s,a)$ is an optimal policy.

Remarks: Step 1 is just notation for simplicity, and there is a typo in step 2 in the book.
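A sketch of this backward recursion, assuming time-indexed arrays `P_h` of shape $(H, |\mathcal{S}||\mathcal{A}|, |\mathcal{S}|)$ and `r_h` of shape $(H, |\mathcal{S}||\mathcal{A}|)$ (my own representation, not the book's notation):

```python
import numpy as np

def finite_horizon_value_iteration(P_h, r_h, H):
    """Backward dynamic programming for a finite-horizon MDP (Theorem 1.9).

    P_h: (H, S*A, S) time-indexed transitions; r_h: (H, S*A) rewards; row order (s, a).
    Returns Q*_h for every h and the optimal policy table pi*(s, h).
    """
    _, nSA, nS = P_h.shape
    nA = nSA // nS
    Q = np.zeros((H, nS, nA))
    Q[H - 1] = r_h[H - 1].reshape(nS, nA)                  # step 1: Q_{H-1} = r_{H-1}
    for h in range(H - 2, -1, -1):                         # step 2: h = H-2, ..., 0
        V_next = Q[h + 1].max(axis=1)                      # V*_{h+1}(s) = max_a Q*_{h+1}(s, a)
        Q[h] = (r_h[h] + P_h[h] @ V_next).reshape(nS, nA)  # Bellman backup at stage h
    pi = Q.argmax(axis=2).T                                # pi[s, h] = argmax_a Q*_h(s, a)
    return Q, pi
```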

The Linear Programming Approach

VI and PI are not polynomial-time algorithms in the strict sense because their runtimes depend on $\gamma$ (through $\frac{1}{1-\gamma}$). With linear programming, however, we can obtain a polynomial-time algorithm.

By the Bellman optimality equation (with the V form), we have

$$V^\star(s) = \max_a \Big\{ r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\star(s')\big] \Big\},$$

which gives

$$V^\star(s) \geq r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\star(s')\big] \qquad \forall\, a, s.$$

Therefore the LP is

$$\begin{aligned}
\min_{V} \quad & \sum_s \mu(s)\, V(s) \\
\text{subject to} \quad & V(s) \geq r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V(s') \qquad \forall\, a \in \mathcal{A},\; s \in \mathcal{S},
\end{aligned}$$

or equivalently

$$\begin{aligned}
\min_{V} \quad & \mathbb{E}_{s \sim \mu}\big[V(s)\big] \\
\text{subject to} \quad & V(s) \geq r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V(s')\big] \qquad \forall\, a \in \mathcal{A},\; s \in \mathcal{S}.
\end{aligned}$$

Conceptually, LP provides a clean polynomial-time algorithm.
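A sketch of this primal LP using `scipy.optimize.linprog` (my own encoding of the constraints; the book does not prescribe a solver). Each constraint $V(s) \geq r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V(s')$ is rewritten as $(\gamma P_{(s,a),\cdot} - e_s)^\top V \leq -r(s,a)$:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal_lp(P, r, gamma, mu):
    """Primal LP: min_V mu^T V  s.t.  V(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) V(s').

    P: (S*A, S) transitions with row order (s, a); r: (S*A,); mu: (S,).
    Returns the optimal value vector V*.
    """
    nSA, nS = P.shape
    nA = nSA // nS
    E = np.repeat(np.eye(nS), nA, axis=0)      # row (s, a) is the indicator of s
    A_ub = gamma * P - E                       # (gamma P_{(s,a),.} - e_s)^T V <= -r(s,a)
    res = linprog(c=mu, A_ub=A_ub, b_ub=-r, bounds=[(None, None)] * nS)
    return res.x
```

Given the returned $V^\star$, acting greedily with respect to $r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V^\star(s')$ recovers an optimal policy.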

Comments from the slides:

Dual LP

For a fixed (possibly stochastic) policy $\pi$, let us define a visitation measure over states and actions induced by following $\pi$ after starting at $s_0$. Precisely, define this distribution, $d_{s_0}^\pi$, as follows:

$$d_{s_0}^\pi(s,a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr^\pi(s_t = s, a_t = a \mid s_0),$$

where $\Pr^\pi(s_t = s, a_t = a \mid s_0)$ is the probability that $s_t = s$ and $a_t = a$ after starting at state $s_0$ and following $\pi$ thereafter. It is straightforward to verify that $d_{s_0}^\pi$ is a distribution over $\mathcal{S} \times \mathcal{A}$. We also overload notation and write:

$$d_\mu^\pi(s,a) = \mathbb{E}_{s_0 \sim \mu}\big[d_{s_0}^\pi(s,a)\big]$$

for a distribution $\mu$ over $\mathcal{S}$. Recall that Lemma 1.6 provides a way to easily compute $d_\mu^\pi(s,a)$ through an appropriate vector-matrix multiplication.

Remarks: Lemma 1.6 gives

$$\big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s,a),(s',a')} = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a).$$

Thus, for a deterministic $\pi$ (so that $a_0 = \pi(s_0)$), we actually have

$$d_{s_0}^\pi(s,a) = \big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s_0,\pi(s_0)),(s,a)}$$

and

$$d_\mu^\pi(s,a) = \sum_{s_0} \mu(s_0)\,\big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s_0,\pi(s_0)),(s,a)}.$$

It is straightforward to verify that $d_\mu^\pi$ satisfies, for all states $s \in \mathcal{S}$,

$$\sum_a d_\mu^\pi(s,a) = (1-\gamma)\mu(s) + \gamma \sum_{s',a'} P(s \mid s', a')\, d_\mu^\pi(s',a').$$

Remarks: This follows directly from the definition:

$$\sum_a d_\mu^\pi(s,a) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr^\pi_\mu(s_t = s) = (1-\gamma)\mu(s) + \gamma \sum_{s',a'} P(s \mid s', a')\,(1-\gamma)\sum_{t=0}^{\infty} \gamma^{t} \Pr^\pi_\mu(s_{t} = s', a_{t} = a'),$$

using $\Pr^\pi_\mu(s_{t+1} = s) = \sum_{s',a'} \Pr^\pi_\mu(s_t = s', a_t = a')\, P(s \mid s', a')$.

Let us define the state-action polytope as follows:

$$K_\mu := \Big\{\, d \;:\; d \geq 0 \ \text{ and } \ \sum_a d(s,a) = (1-\gamma)\mu(s) + \gamma \sum_{s',a'} P(s \mid s', a')\, d(s',a') \,\Big\}.$$

We now see that this set precisely characterizes all state-action visitation distributions.

Proposition 1.15. We have that $K_\mu$ is equal to the set of all feasible state-action visitation distributions; i.e., $d \in K_\mu$ if and only if there exists a stationary (and possibly randomized) policy $\pi$ such that $d_\mu^\pi = d$.

With respect to the variables $d \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, the dual LP formulation is as follows:

$$\begin{aligned}
\max_{d} \quad & \frac{1}{1-\gamma} \sum_{s,a} d(s,a)\, r(s,a) \\
\text{subject to} \quad & d \in K_\mu.
\end{aligned}$$

If $d^\star$ is the solution to this LP, and provided that $\mu$ has full support, then we have that:

$$\pi^\star(a \mid s) = \frac{d^\star(s,a)}{\sum_{a'} d^\star(s,a')}$$

is an optimal policy. An alternative optimal policy is the deterministic policy $\pi(s) = \arg\max_a d^\star(s,a)$ (and these policies are identical if the optimal policy is unique).
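A corresponding sketch of the dual LP, again with `scipy.optimize.linprog` and my own constraint encoding: the flow constraints defining $K_\mu$ become equality constraints, and the policy is read off by normalizing $d^\star$ (this assumes $\mu$ has full support, as noted above).

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual_lp(P, r, gamma, mu):
    """Dual LP over visitation measures: max (1/(1-gamma)) sum_{s,a} d(s,a) r(s,a), d in K_mu.

    P: (S*A, S) transitions with row order (s, a); r: (S*A,); mu: (S,), assumed full support.
    Returns the optimal visitation d* and the induced policy pi*(a|s).
    """
    nSA, nS = P.shape
    nA = nSA // nS
    E = np.repeat(np.eye(nS), nA, axis=0)          # E[(s, a), s'] = 1{s' = s}
    A_eq = (E - gamma * P).T                       # flow constraints: (E - gamma P)^T d = (1 - gamma) mu
    res = linprog(c=-r / (1 - gamma), A_eq=A_eq, b_eq=(1 - gamma) * mu,
                  bounds=[(0, None)] * nSA)
    d = res.x.reshape(nS, nA)
    pi = d / d.sum(axis=1, keepdims=True)          # pi*(a|s) = d*(s, a) / sum_a' d*(s, a')
    return d, pi
```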

Sample Complexity and Sampling Models

We are interested in understanding the number of samples required to find a near-optimal policy, i.e. the sample complexity.

A generative model provides us with a sample $s' \sim P(\cdot\mid s,a)$ and the reward $r(s,a)$ upon input of a state-action pair $(s,a)$.

The offline RL setting. The offline RL setting is where the agent has access to an offline dataset, say generated under some policy (or a collection of policies). In the simplest of these settings, we may assume our dataset is of the form $\{(s, a, s', r)\}$, where $r$ is the reward (corresponding to $r(s,a)$ if the reward is deterministic) and $s' \sim P(\cdot\mid s,a)$. Furthermore, for simplicity it can be helpful to assume that the $(s,a)$ pairs in this dataset were sampled i.i.d. from some fixed distribution $\nu$ over $\mathcal{S} \times \mathcal{A}$.
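A small sketch of these two sampling models for a tabular MDP (the sampler name `generative_model`, the distribution `nu`, and the dataset format are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA = 4, 3
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)

def generative_model(s, a):
    """Generative model: given (s, a), return a sample s' ~ P(.|s, a) and the reward r(s, a)."""
    s_next = rng.choice(nS, p=P[s * nA + a])
    return s_next, r[s * nA + a]

# Offline dataset: (s, a) drawn i.i.d. from a fixed distribution nu over S x A,
# then s' sampled from the model; stored as tuples (s, a, s', r).
nu = np.full(nS * nA, 1.0 / (nS * nA))
dataset = []
for _ in range(1000):
    s, a = divmod(rng.choice(nS * nA, p=nu), nA)
    s_next, reward = generative_model(s, a)
    dataset.append((s, a, s_next, reward))
```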

Bonus: Advantages and The Performance Difference Lemma

$$V^\pi(\mu) = \mathbb{E}_{s \sim \mu}\big[V^\pi(s)\big].$$

The advantage $A^\pi(s,a)$ of a policy $\pi$ is defined as

$$A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s).$$

Note that:

$$A^\star(s,a) := A^{\pi^\star}(s,a) \leq 0$$

for all state-action pairs.

Analogous to the state-action visitation distribution (see Equation 0.8), we can define a visitation measure over just the states. When clear from context, we will overload notation and also denote this distribution by $d_{s_0}^\pi$, where:

$$d_{s_0}^\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr^\pi(s_t = s \mid s_0).$$

Here, $\Pr^\pi(s_t = s \mid s_0)$ is the state visitation probability under $\pi$, starting at state $s_0$. Again, we write:

$$d_\mu^\pi(s) = \mathbb{E}_{s_0 \sim \mu}\big[d_{s_0}^\pi(s)\big]$$

for a distribution μ over S.

Remarks: In Equation 0.8 we define $d_{s_0}^\pi(s,a) := (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr^\pi(s_t = s, a_t = a \mid s_0)$, so the only difference is the marginalization over $a$: $d_{s_0}^\pi(s) = \sum_a d_{s_0}^\pi(s,a)$.

Lemma 1.16 (the performance difference lemma). For all policies $\pi, \pi'$ and distributions $\mu$ over $\mathcal{S}$,

$$V^\pi(\mu) - V^{\pi'}(\mu) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d_\mu^\pi}\,\mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[A^{\pi'}(s,a)\big].$$

Proof at P19 of the book
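Since the proof is only referenced, here is a numerical sanity check of Lemma 1.16 on a small random MDP with two random stochastic policies (a sketch with my own names; it simply evaluates both sides of the identity with linear algebra):

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)
mu = np.full(nS, 1.0 / nS)

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Return (V^pi, Q^pi, P_pi) for a stochastic policy pi; P_pi is the S x S state chain."""
    P_pi = np.einsum('sa,sat->st', pi, P.reshape(nS, nA, nS))   # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    r_pi = (pi * r.reshape(nS, nA)).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V
    return V, Q, P_pi

pi1, pi2 = random_policy(), random_policy()
V1, Q1, P_pi1 = evaluate(pi1)
V2, Q2, _ = evaluate(pi2)

A2 = Q2.reshape(nS, nA) - V2[:, None]                               # advantage A^{pi'}(s, a)
d1 = (1 - gamma) * mu @ np.linalg.inv(np.eye(nS) - gamma * P_pi1)   # state visitation d_mu^{pi}

lhs = mu @ (V1 - V2)                                                # V^{pi}(mu) - V^{pi'}(mu)
rhs = (d1 * (pi1 * A2).sum(axis=1)).sum() / (1 - gamma)
assert np.isclose(lhs, rhs)
```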