$P: \mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition function (a conditional distribution over the next state), where $P(s'\mid s,a)$ is the probability of transitioning into $s'$ given $s$ and $a$. Note that $\Delta(\mathcal{S})$ denotes the probability simplex over $\mathcal{S}$.
$r: \mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function, bounded in $[0,1]$. Deterministic unless otherwise specified.
$\gamma\in[0,1)$ is the discount factor, bounded in $[0,1)$, which defines an effective horizon of $\frac{1}{1-\gamma}$.
Remarks: 1. This is why it is called a discounted MDP. 2. I assume it works like how the eigenvalues define the smoothness of an RKHS?
$\mu\in\Delta(\mathcal{S})$ is an initial state distribution.
Objective, Policies, and Values
A trajectory is a sequence $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$, where $s_0\sim\mu$, the actions are chosen by the policy, $r_t = r(s_t, a_t)$, and $s_{t+1}\sim P(\cdot\mid s_t, a_t)$.
Remarks: 1. It goes like state->action->reward->state->..., and each state is a sample. 2. Sort of like a filtration for a stochastic process
A policy is a mapping $\pi: \mathcal{H}\to\Delta(\mathcal{A})$ from the history of observations to a distribution over actions. A stationary policy is a mapping $\pi: \mathcal{S}\to\Delta(\mathcal{A})$. A deterministic, stationary policy is a mapping $\pi: \mathcal{S}\to\mathcal{A}$.
Remarks: 1. ? sort of weird
The value function $V^\pi: \mathcal{S}\to\mathbb{R}$ is defined as:
$$V^\pi(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t, a_t)\;\Big|\;\pi,\, s_0 = s\Big],$$
which is bounded by $\frac{1}{1-\gamma}$.
Similarly, the action-value (Q-value) function $Q^\pi: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is defined as
$$Q^\pi(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t, a_t)\;\Big|\;\pi,\, s_0 = s,\, a_0 = a\Big],$$
which is also bounded by $\frac{1}{1-\gamma}$.
Goal: find a policy that maximizes the value from a given starting state $s_0$:
$$\max_{\pi}\; V^\pi(s_0).$$
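As a concrete (if crude) illustration of these definitions, here is a minimal Monte Carlo sketch that estimates $V^\pi(s)$ by rolling out truncated trajectories. The tiny random MDP (3 states, 2 actions, $\gamma = 0.9$), the seed, and all variable names are my own assumptions for illustration, not anything from the notes or the book.

```python
# A minimal Monte Carlo sketch (my own, not from the book) of the discounted
# value V^pi(s) = E[sum_t gamma^t r(s_t, a_t) | s_0 = s], estimated by rolling
# out trajectories of a stochastic policy on a small random MDP.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s']
r = rng.random((S, A))                                          # r(s, a) in [0, 1]
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)    # pi(a | s)

def rollout_value(s0, n_traj=2000, horizon=200):
    """Average discounted return over trajectories started at s0."""
    total = 0.0
    for _ in range(n_traj):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):            # truncate: gamma^200 is negligible
            a = rng.choice(A, p=pi[s])
            ret += disc * r[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])
        total += ret
    return total / n_traj

print([round(rollout_value(s), 3) for s in range(S)])   # ~ V^pi(s) for each s
```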
Bellman Consistency Equations for Stationary Policies
Lemma 1.4. Suppose that $\pi$ is a stationary policy. Then $V^\pi$ and $Q^\pi$ satisfy the following Bellman consistency equations: for all $s\in\mathcal{S}$, $a\in\mathcal{A}$,
$$V^\pi(s) = Q^\pi(s, \pi(s)), \qquad Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V^\pi(s')\big],$$
where, for a stochastic $\pi$, $Q^\pi(s, \pi(s))$ denotes $\mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[Q^\pi(s,a)\big]$.
Proof: Condition on the first transition and use the tower property of expectation:
$$Q^\pi(s,a) = r(s,a) + \mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\;\Big|\;s_0=s, a_0=a\Big] = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V^\pi(s')\big],$$
where the last step uses that, conditioned on $s_1 = s'$, the remaining discounted sum is $\gamma V^\pi(s')$ by stationarity.
It is helpful to view $V^\pi$ as a vector of length $|\mathcal{S}|$ and $Q^\pi$ and $r$ as vectors of length $|\mathcal{S}|\cdot|\mathcal{A}|$. We overload notation and let $P$ also refer to a matrix of size $(|\mathcal{S}|\cdot|\mathcal{A}|)\times|\mathcal{S}|$ where the entry $P_{(s,a),s'}$ is equal to $P(s'\mid s,a)$.
Remarks: This is straightforward since a function is, in some sense, a (possibly infinite-dimensional) vector where each point of its domain is a coordinate; here the domain is finite, so the vectors are finite-dimensional.
We will also define $P^\pi$ to be the transition matrix on state-action pairs induced by a stationary policy $\pi$; specifically,
$$P^\pi_{(s,a),(s',a')} := P(s'\mid s,a)\,\pi(a'\mid s').$$
Remarks:
In particular, for deterministic policies, we have:
$$P^\pi_{(s,a),(s',a')} = P(s'\mid s,a)\,\mathbb{1}\{a' = \pi(s')\}.$$
With this notation, it is straightforward to verify
$$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi.$$
Proof: Notice that
$$(P V^\pi)_{(s,a)} = \sum_{s'} P(s'\mid s,a)\,V^\pi(s') = \mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V^\pi(s')\big].$$
Thus, we have $Q^\pi = r + \gamma P V^\pi$ by the Bellman consistency equation.
Also, note that
$$(P^\pi Q^\pi)_{(s,a)} = \sum_{s',a'} P(s'\mid s,a)\,\pi(a'\mid s')\,Q^\pi(s',a') = \mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V^\pi(s')\big],$$
so $Q^\pi = r + \gamma P^\pi Q^\pi$ as well.
Corollary 1.5. Suppose that $\pi$ is a stationary policy. We have that
$$Q^\pi = (I - \gamma P^\pi)^{-1} r,$$
where $I$ is the identity matrix.
Proof: we need to show that $I - \gamma P^\pi$ is invertible, i.e. that $(I - \gamma P^\pi)x \neq 0$ for every non-zero vector $x$. To see this, observe that for any non-zero vector $x\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,
$$\|(I - \gamma P^\pi)x\|_\infty \;\ge\; \|x\|_\infty - \gamma\,\|P^\pi x\|_\infty \;\ge\; \|x\|_\infty - \gamma\,\|x\|_\infty \;=\; (1-\gamma)\,\|x\|_\infty \;>\; 0,$$
where the middle step uses that each row of $P^\pi$ is a probability distribution; this implies $I - \gamma P^\pi$ is full rank.
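A minimal numeric sketch of Corollary 1.5: build $P^\pi$ for a small randomly generated MDP and solve the linear system $(I - \gamma P^\pi)Q = r$ exactly. The MDP sizes, seed, and variable names below are assumptions made purely for illustration.

```python
# Minimal sketch (not from the book's text): exact policy evaluation via
# Corollary 1.5, Q^pi = (I - gamma * P^pi)^{-1} r, on a small random MDP.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s']
r = rng.random((S, A))                                          # r(s, a) in [0, 1]
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)    # stochastic policy pi(a|s)

# Build P^pi on state-action pairs: P^pi_{(s,a),(s',a')} = P(s'|s,a) * pi(a'|s').
P_sa = P.reshape(S * A, S)                      # rows indexed by (s, a)
P_pi = (P_sa[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)

Q_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
V_pi = (pi * Q_pi.reshape(S, A)).sum(axis=1)    # V^pi(s) = E_{a~pi}[Q^pi(s,a)]

# Sanity check against the Bellman consistency equation Q^pi = r + gamma * P V^pi.
assert np.allclose(Q_pi, r.reshape(S * A) + gamma * P_sa @ V_pi)
print(np.round(Q_pi.reshape(S, A), 3))
```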
Lemma 1.6. We have that
$$(I - \gamma P^\pi)^{-1} = \sum_{t=0}^{\infty}\gamma^t\,(P^\pi)^t,$$
so we can view the $(s,a)$-th row of this matrix (normalized by $1-\gamma$) as an induced distribution over states and actions when following $\pi$ after starting with $s_0 = s$ and $a_0 = a$.
Proof: we need to show $\big[(I - \gamma P^\pi)^{-1}\big]_{(s,a),(s',a')} = \sum_{t=0}^{\infty}\gamma^t\,\big[P^\pi_t\big]_{(s,a),(s',a')}$, where we use $P^\pi_t$ to denote the $t$-step transition matrix with $\big[P^\pi_t\big]_{(s,a),(s',a')} = \Pr^\pi(s_t = s', a_t = a'\mid s_0 = s, a_0 = a)$.
Thus, for the element at $\big((s,a),(s',a')\big)$, we need to show $\big[(P^\pi)^t\big]_{(s,a),(s',a')} = \big[P^\pi_t\big]_{(s,a),(s',a')}$.
For the LHS, we have, by induction on $t$,
$$\big[(P^\pi)^t\big]_{(s,a),(s',a')} = \sum_{s'',a''}\Pr{}^\pi(s_{t-1}=s'', a_{t-1}=a''\mid s_0=s, a_0=a)\,P(s'\mid s'',a'')\,\pi(a'\mid s') = \Pr{}^\pi(s_t=s', a_t=a'\mid s_0=s, a_0=a),$$
where the first equality uses the induction hypothesis and the definition of $P^\pi$, and in the second equation we use the Chapman-Kolmogorov equation. Thus the proof is completed by noting that $(I - \gamma P^\pi)\sum_{t=0}^{\infty}\gamma^t(P^\pi)^t = \sum_{t=0}^{\infty}\gamma^t(P^\pi)^t - \sum_{t=1}^{\infty}\gamma^t(P^\pi)^t = I$.
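A small numeric check of Lemma 1.6 (my own sketch, again on an assumed toy MDP): the truncated geometric series matches the inverse, and each row of $(1-\gamma)(I-\gamma P^\pi)^{-1}$ sums to one, i.e. it is a distribution over state-action pairs.

```python
# Minimal numeric check (mine) of Lemma 1.6: the series sum_t gamma^t (P^pi)^t
# approaches (I - gamma P^pi)^{-1}, and each row of
# (1 - gamma) * (I - gamma P^pi)^{-1} is a distribution over (s, a) pairs.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
P_pi = (P.reshape(S * A, S)[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)

inv = np.linalg.inv(np.eye(S * A) - gamma * P_pi)
series, term = np.zeros_like(inv), np.eye(S * A)
for t in range(500):                       # geometric series sum_t gamma^t (P^pi)^t
    series += (gamma ** t) * term
    term = term @ P_pi

assert np.allclose(series, inv, atol=1e-6)
assert np.allclose((1 - gamma) * inv.sum(axis=1), 1.0)   # rows are distributions
```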
Bellman Optimality Equations
There exists a stationary and deterministic policy that simultaneously maximizes $V^\pi(s)$ for all $s\in\mathcal{S}$.
Theorem 1.7. Let $\Pi$ be the set of all non-stationary and randomized policies. Define
$$V^\star(s) := \sup_{\pi\in\Pi} V^\pi(s), \qquad Q^\star(s,a) := \sup_{\pi\in\Pi} Q^\pi(s,a),$$
which are finite since $V^\pi$ and $Q^\pi$ are bounded between $0$ and $\frac{1}{1-\gamma}$.
There exists a stationary and deterministic policy $\pi$ such that for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$,
$$V^\pi(s) = V^\star(s), \qquad Q^\pi(s,a) = Q^\star(s,a).$$
We refer to such a $\pi$ as an optimal policy.
Proof at P9 of the book.
Remarks: The optimal policy used in the proof is $\pi(s) := \arg\max_{a\in\mathcal{A}} Q^\star(s,a)$.
Theorem 1.8 (Bellman optimality equations). We say that a vector $Q\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ satisfies the Bellman optimality equations if:
$$Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\Big[\max_{a'\in\mathcal{A}} Q(s',a')\Big].$$
For any $Q\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, we have that $Q = Q^\star$ if and only if $Q$ satisfies the Bellman optimality equations. Furthermore, the deterministic policy defined by $\pi(s)\in\arg\max_{a\in\mathcal{A}} Q^\star(s,a)$ is an optimal policy (where ties are broken in some arbitrary manner).
Proof at P10 of the book.
Notice that $V^\star = V^{\pi^\star}$ and $Q^\star = Q^{\pi^\star}$, where $\pi^\star$ is the optimal stationary and deterministic policy.
Remarks: Here more notation is introduced:
greedy policy: $\pi_Q(s) := \arg\max_{a\in\mathcal{A}} Q(s,a)$, defined with respect to a vector $Q\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. (Note that $\pi_Q$ is different from $Q^\pi$; the latter is the Q-value vector corresponding to a policy $\pi$.) The optimal policy given by the above theorem can then be denoted as $\pi^\star = \pi_{Q^\star}$.
The Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}\to\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is defined as $\mathcal{T}Q := r + \gamma P V_Q$, where $V_Q(s) := \max_{a\in\mathcal{A}} Q(s,a)$ for $Q\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$.
This allows us to rewrite the Bellman optimality equation in the concise form:
$$Q = \mathcal{T}Q,$$
and, so, the previous theorem states that $Q = Q^\star$ if and only if $Q$ is a fixed point of the operator $\mathcal{T}$.
Remarks: The equivalent form in $V$ can be written as:
$$V(s) = \max_{a\in\mathcal{A}}\Big\{ r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big]\Big\},$$
where $V$ is a vector of length $|\mathcal{S}|$ (it is straightforward to verify, since $V^\star(s) = \max_{a\in\mathcal{A}} Q^\star(s,a)$), and the operator form of the equation is given by $V = \mathcal{T}V$.
The two theorems basically prove that:
finding an optimal policy = finding an optimal deterministic and stationary policy;
an optimal deterministic and stationary policy is a greedy policy (greedy with respect to $Q^\star$).
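To make the operator view concrete, here is a small sketch (mine, on an assumed toy MDP) that iterates $\mathcal{T}$ to a numerical fixed point and then checks that the greedy policy with respect to that fixed point is optimal, in the sense that its exactly evaluated Q-values reproduce $Q^\star$.

```python
# Small sketch (mine) of the Bellman optimality operator T Q = r + gamma * P V_Q
# with V_Q(s) = max_a Q(s, a), plus a check that the greedy policy with respect
# to the fixed point Q* is optimal (its own Q-values reproduce Q*).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def bellman_opt(Q):
    V = Q.max(axis=1)                        # V_Q(s) = max_a Q(s, a)
    return r + gamma * P @ V                 # (P @ V)[s, a] = E_{s'~P(.|s,a)} V(s')

Q = np.zeros((S, A))
for _ in range(2000):                        # iterate to a (numerical) fixed point
    Q = bellman_opt(Q)
assert np.allclose(Q, bellman_opt(Q), atol=1e-8)   # Q* = T Q*

greedy = Q.argmax(axis=1)                    # pi*(s) = argmax_a Q*(s, a)
# Exact evaluation of the greedy policy: Q^pi = (I - gamma P^pi)^{-1} r.
P_pi = np.zeros((S * A, S * A))
for s in range(S):
    for a in range(A):
        for s2 in range(S):
            P_pi[s * A + a, s2 * A + greedy[s2]] = P[s, a, s2]
Q_greedy = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
assert np.allclose(Q_greedy.reshape(S, A), Q, atol=1e-6)
```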
Finite-Horizon MDP
Changes:
$P_h(s'\mid s,a)$ is the probability of transitioning into state $s'$ upon taking action $a$ in state $s$ at time step $h$. Note that the time-dependent setting generalizes the stationary setting, where all steps share the same transition $P$.
A time-dependent reward function $r_h(s,a)$ is the immediate reward associated with taking action $a$ in state $s$ at time step $h$.
An integer $H$ which defines the horizon of the problem.
The value and Q-value functions also become time-indexed: $V^\pi_h(s) = \mathbb{E}\big[\sum_{t=h}^{H-1} r_t(s_t,a_t)\mid \pi, s_h = s\big]$ and $Q^\pi_h(s,a) = \mathbb{E}\big[\sum_{t=h}^{H-1} r_t(s_t,a_t)\mid \pi, s_h = s, a_h = a\big]$.
Theorem 1.9 (Bellman optimality equations). Define
$$Q^\star_h(s,a) := \sup_{\pi\in\Pi} Q^\pi_h(s,a),$$
where the sup is over all non-stationary and randomized policies. Suppose that $Q_H = 0$. We have that $Q_h = Q^\star_h$ for all $h\in[H]$ if and only if for all $h\in\{0,\ldots,H-1\}$,
$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\Big[\max_{a'\in\mathcal{A}} Q_{h+1}(s',a')\Big].$$
Furthermore, $\pi_h(s) = \arg\max_{a\in\mathcal{A}} Q^\star_h(s,a)$ is an optimal policy.
Proof: The flavor of this proof is similar to the infinite-horizon case.
First note that Theorem 1.7 also applies to the finite-horizon case, though the optimal deterministic policy is no longer stationary: it now depends on $h$, i.e. $\pi^\star = (\pi^\star_0, \ldots, \pi^\star_{H-1})$. The proof is straightforward.
Likewise, we define $V^\star_h(s) := \sup_{\pi\in\Pi} V^\pi_h(s)$ for all $h$; thus we have $V^\star_h(s) = \max_{a\in\mathcal{A}} Q^\star_h(s,a)$.
So we only need to show that
$$Q^\star_h(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\Big[\max_{a'\in\mathcal{A}} Q^\star_{h+1}(s',a')\Big],$$
which follows the exact same proof as in the infinite-horizon MDP, using an optimal deterministic policy $\pi^\star$.
The reverse is highly similar as well.
Overall, the finite-horizon MDP is a factor of $O(H)$ larger than the infinite-horizon one, since the values, policies, and Bellman equations all carry an extra time index $h$.
Computational Complexity
Suppose that $(P, r, \gamma)$ in our MDP is specified with rational entries. Let $L(P, r, \gamma)$ denote the total bit-size required to specify the MDP, and assume that basic arithmetic operations take unit time. Here, we may hope for an algorithm which (exactly) returns an optimal policy whose runtime is polynomial in $L(P, r, \gamma)$ and the number of states and actions. More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want to explicitly restrict the MDP to be specified by rationals. An algorithm is said to be strongly polynomial if it returns an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on $L(P, r, \gamma)$).
Remarks: For computational time, there are some "standards" in optimization that are not mentioned here, such as what exactly the bit-size $L(P, r, \gamma)$ is and how it relates to the runtime guarantees, and more; refer to the slides for a detailed introduction.
Value Iteration
Q-value iteration: starting at some $Q^{(0)}$, we iteratively apply the Bellman optimality operator: $Q^{(k+1)} = \mathcal{T}Q^{(k)}$.
Lemma 1.10 (contraction). For any two vectors $Q, Q'\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,
$$\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \le \gamma\,\|Q - Q'\|_\infty.$$
Proof at P13 of the book.
Lemma 1.11 (Q-error amplification). For any vector $Q\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,
$$V^{\pi_Q} \ge V^\star - \frac{2\,\|Q - Q^\star\|_\infty}{1-\gamma}\,\mathbb{1},$$
where $\mathbb{1}$ denotes the vector of all ones.
Proof at P14 of the book.
Theorem 1.12 (Q-value iteration convergence). Set $Q^{(0)} = 0$. For $\epsilon \ge 0$, suppose:
$$k \ge \frac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}.$$
Let $\pi^{(k)} = \pi_{Q^{(k)}}$. Then
$$V^{\pi^{(k)}} \ge V^\star - \epsilon\,\mathbb{1}.$$
It is actually a corollary of Lemmas 1.10 and 1.11, together with the facts that $\|Q^{(0)} - Q^\star\|_\infty \le \frac{1}{1-\gamma}$ and $\gamma^k \le e^{-(1-\gamma)k}$.
Remarks: Setting $\epsilon$ as in the theorem, and noticing that each iteration (one application of $\mathcal{T}$) costs $O(|\mathcal{S}|^2|\mathcal{A}|)$ time, gives a total complexity of $O\!\left(\frac{|\mathcal{S}|^2|\mathcal{A}|}{1-\gamma}\log\frac{2}{(1-\gamma)^2\epsilon}\right)$; this is the result in the table.
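A minimal sketch of Q-value iteration on an assumed toy MDP: it stops once $\|Q^{(k)} - Q^\star\|_\infty \le \epsilon$ and compares the iteration count against the crude contraction-based bound $\frac{1}{1-\gamma}\log\frac{1}{(1-\gamma)\epsilon}$, which has the same flavor as Theorem 1.12 (the reference $Q^\star$ here is itself obtained by iterating far past convergence; sizes, seed, and names are my own illustration).

```python
# Sketch (mine) of Q-value iteration: Q_{k+1} = T Q_k, stopping once
# ||Q_k - Q*||_inf <= eps. The iteration count is compared with the crude
# contraction-based bound log(1/(eps*(1-gamma))) / (1-gamma), which follows
# from Lemma 1.10 and ||Q_0 - Q*||_inf <= 1/(1-gamma).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 3, 2, 0.9, 1e-4
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def T(Q):                                    # Bellman optimality operator
    return r + gamma * P @ Q.max(axis=1)

# Reference Q* (iterate far past convergence just for the comparison).
Q_star = np.zeros((S, A))
for _ in range(5000):
    Q_star = T(Q_star)

Q, k = np.zeros((S, A)), 0
while np.abs(Q - Q_star).max() > eps:        # contraction: error shrinks by gamma each step
    Q, k = T(Q), k + 1

bound = np.log(1.0 / (eps * (1 - gamma))) / (1 - gamma)
print(f"converged in {k} iterations; crude bound ~ {bound:.0f}")
```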
Policy Iteration
The policy iteration algorithm, for discounted MDPs, starts from an arbitrary policy $\pi_0$, and repeats the following iterative procedure: for $k = 0, 1, 2, \ldots$
Policy evaluation. Compute $Q^{\pi_k}$.
Policy improvement. Update the policy: $\pi_{k+1} = \pi_{Q^{\pi_k}}$, i.e. $\pi_{k+1}(s) = \arg\max_{a\in\mathcal{A}} Q^{\pi_k}(s,a)$.
In each iteration, we compute the Q-value function of $\pi_k$, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value. The first step is often called policy evaluation, and the second step is often called policy improvement.
Remarks: Equation 0.2 is $Q^\pi = (I - \gamma P^\pi)^{-1} r$, i.e. the exact policy evaluation formula from Corollary 1.5.
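A minimal sketch of policy iteration on an assumed toy MDP, using the exact evaluation formula above and stopping when the greedy policy stops changing (the sizes, seed, and helper names are my own illustration).

```python
# Sketch (my own) of policy iteration: exact policy evaluation via
# Q^pi = (I - gamma P^pi)^{-1} r, then greedy improvement, repeated until the
# policy stops changing (which happens after finitely many iterations).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def evaluate(policy):
    """Exact Q^pi for a deterministic policy given as an array of actions."""
    P_pi = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                P_pi[s * A + a, s2 * A + policy[s2]] = P[s, a, s2]
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A)).reshape(S, A)

policy = np.zeros(S, dtype=int)
while True:
    Q = evaluate(policy)                    # policy evaluation
    new_policy = Q.argmax(axis=1)           # policy improvement (greedy w.r.t. Q^pi_k)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("optimal deterministic policy:", policy)
```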
Lemma 1.13. We have that:
$$Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k} \ge Q^{\pi_k}.$$
Proof at P15 of the book.
Remarks: By definition $\mathcal{T}Q^{\pi_k} = r + \gamma P V_{Q^{\pi_k}}$; when the policy is deterministic, we have $V_{Q^{\pi_k}}(s) = Q^{\pi_k}(s, \pi_{k+1}(s))$ since $\pi_{k+1}$ is the greedy policy with respect to $Q^{\pi_k}$, thus we have $\mathcal{T}Q^{\pi_k} = r + \gamma P^{\pi_{k+1}} Q^{\pi_k} \ge r + \gamma P^{\pi_k} Q^{\pi_k} = Q^{\pi_k}$. Recursion (repeatedly applying $Q \mapsto r + \gamma P^{\pi_{k+1}} Q$ and using monotonicity) means that $Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k}$.
Theorem 1.14 (Policy iteration convergence). Let $\pi_0$ be any initial policy. For $k \ge \frac{\log\frac{1}{(1-\gamma)\epsilon}}{1-\gamma}$, the $k$-th policy in policy iteration has the following performance bound:
$$Q^{\pi_k} \ge Q^\star - \epsilon\,\mathbb{1}.$$
Proof: Notice that
$$\|Q^\star - Q^{\pi_{k+1}}\|_\infty \le \|Q^\star - \mathcal{T}Q^{\pi_k}\|_\infty = \|\mathcal{T}Q^\star - \mathcal{T}Q^{\pi_k}\|_\infty \le \gamma\,\|Q^\star - Q^{\pi_k}\|_\infty,$$
where the first step uses Lemma 1.13 (and $Q^{\pi_{k+1}} \le Q^\star$) and the last step uses the contraction property (Lemma 1.10). Iterating gives $\|Q^\star - Q^{\pi_k}\|_\infty \le \frac{\gamma^k}{1-\gamma}$.
Remarks: Note that PI and VI have the same convergence rate in $k$. However, every iteration of PI is much more expensive than VI: PI costs $O(|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|)$ per iteration, whereas VI costs $O(|\mathcal{S}|^2|\mathcal{A}|)$. (Since the policy is greedy and deterministic, the matrix $P^\pi$ can be viewed as an $|\mathcal{S}|\times|\mathcal{S}|$ matrix instead of $|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}||\mathcal{A}|$, since each state uses only one action. Thus the inverse contributes $O(|\mathcal{S}|^3)$ and the multiplication with $P$ contributes $O(|\mathcal{S}|^2|\mathcal{A}|)$ if we again view $P$ as an $(|\mathcal{S}||\mathcal{A}|)\times|\mathcal{S}|$ matrix.)
PI can also be shown to be strongly polynomial via a more careful analysis. (The intuition is that the set of deterministic policies is finite, of size $|\mathcal{A}|^{|\mathcal{S}|}$, and PI improves monotonically.)
Value Iteration for Finite Horizon MDPs
Let us now specify the value iteration algorithm for finite-horizon MDPs. For the finite-horizon setting, it turns out that the analogues of value iteration and policy iteration lead to identical algorithms. The value iteration algorithm is specified as follows:
Set $Q_H(s,a) = 0$ for all $(s,a)$.
For $h = H-1, \ldots, 0$, set:
$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\Big[\max_{a'\in\mathcal{A}} Q_{h+1}(s',a')\Big], \qquad \pi_h(s) = \arg\max_{a\in\mathcal{A}} Q_h(s,a).$$
By Theorem 1.9, it follows that $Q_h = Q^\star_h$ and that $\pi = (\pi_0, \ldots, \pi_{H-1})$ is an optimal policy.
Remarks: 1 is just a notation for simplicity, and there is a typo in 2 in the book.
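A minimal sketch of this backward induction on an assumed toy time-dependent MDP (sizes, seed, and names are my own illustration):

```python
# Sketch (mine) of finite-horizon value iteration / backward induction:
# V_H = 0, then Q_h(s,a) = r_h(s,a) + E_{s'~P_h(.|s,a)} V_{h+1}(s'),
# V_h(s) = max_a Q_h(s,a), with the greedy time-dependent policy recorded.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 5
P = rng.random((H, S, A, S)); P /= P.sum(axis=3, keepdims=True)  # P_h[s, a, s']
r = rng.random((H, S, A))                                         # r_h(s, a)

V = np.zeros(S)                              # V_H = 0
policy = np.zeros((H, S), dtype=int)
for h in reversed(range(H)):                 # h = H-1, ..., 0
    Q_h = r[h] + P[h] @ V                    # (P[h] @ V)[s, a] = E_{s'} V_{h+1}(s')
    policy[h] = Q_h.argmax(axis=1)           # pi*_h(s) = argmax_a Q_h(s, a)
    V = Q_h.max(axis=1)                      # V_h(s) = max_a Q_h(s, a)

print("V_0:", np.round(V, 3))
print("time-dependent optimal policy:", policy)
```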
The Linear Programming Approach
VI and PI are not fully polynomial-time algorithms due to their dependence on $\frac{1}{1-\gamma}$. But with linear programming, we can have a fully polynomial algorithm!
By the Bellman optimality equation (in the $V$ form), we have
$$V^\star(s) = \max_{a\in\mathcal{A}}\Big\{r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V^\star(s')\big]\Big\},$$
which gives that $V^\star$ is the (elementwise) smallest $V$ satisfying
$$V(s) \ge r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big] \quad \text{for all } (s,a).$$
Therefore the LP is
$$\min_{V\in\mathbb{R}^{|\mathcal{S}|}}\ \sum_s \mu(s)\,V(s) \quad \text{s.t.} \quad V(s) \ge r(s,a) + \gamma\sum_{s'} P(s'\mid s,a)\,V(s') \ \ \forall s\in\mathcal{S},\, a\in\mathcal{A},$$
or equivalently, in vector form,
$$\min_{V}\ \mu^\top V \quad \text{s.t.} \quad (I - \gamma P_a)\,V \ge r_a \ \ \forall a\in\mathcal{A},$$
where $r_a\in\mathbb{R}^{|\mathcal{S}|}$ and $P_a\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ denote the reward vector and transition matrix restricted to action $a$.
Conceptually, LP provides a cool poly time algo.
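A minimal sketch of the primal LP solved with scipy.optimize.linprog on an assumed toy MDP; at the optimum the variable vector equals $V^\star$. The concrete MDP and the uniform $\mu$ are illustrative assumptions.

```python
# Sketch (my own) of the primal LP: minimize sum_s mu(s) V(s) subject to
# V(s) >= r(s,a) + gamma * E_{s'~P(.|s,a)} V(s') for every (s, a).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
mu = np.full(S, 1.0 / S)                     # any full-support initial distribution

# One inequality per (s, a): gamma * P(.|s,a) . V - V(s) <= -r(s, a).
A_ub = gamma * P.reshape(S * A, S)
for s in range(S):
    A_ub[s * A:(s + 1) * A, s] -= 1.0
b_ub = -r.reshape(S * A)

res = linprog(c=mu, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
V_star = res.x
print("V* from the LP:", np.round(V_star, 3))
```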
Comments from the slides:
VI is best thought of as a fixed point algorithm
PI is equivalent to a (block) simplex algorithm
Dual LP
For a fixed (possibly stochastic) policy $\pi$, let us define a visitation measure over states and actions induced by following $\pi$ after starting at $s_0$. Precisely, define this distribution, $d^\pi_{s_0}$, as follows:
$$d^\pi_{s_0}(s,a) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr{}^\pi(s_t = s, a_t = a \mid s_0),$$
where $\Pr^\pi(s_t = s, a_t = a\mid s_0)$ is the probability that $s_t = s$ and $a_t = a$ after starting at state $s_0$ and following $\pi$ thereafter. It is straightforward to verify that $d^\pi_{s_0}$ is a distribution over $\mathcal{S}\times\mathcal{A}$. We also overload notation and write:
$$d^\pi_\mu(s,a) := \mathbb{E}_{s_0\sim\mu}\big[d^\pi_{s_0}(s,a)\big]$$
for a distribution $\mu$ over $\mathcal{S}$. Recall that Lemma 1.6 provides a way to easily compute $d^\pi_{s_0}$ through an appropriate vector-matrix multiplication.
Remarks: Lemma 1.6 gives $(I-\gamma P^\pi)^{-1} = \sum_{t=0}^\infty \gamma^t (P^\pi)^t$, whose $\big((s_0,a_0),(s,a)\big)$ entry is $\sum_{t=0}^\infty\gamma^t\Pr^\pi(s_t = s, a_t = a\mid s_0, a_0)$.
Thus, actually we have
$$d^\pi_{s_0}(s,a) = (1-\gamma)\sum_{a_0}\pi(a_0\mid s_0)\,\big[(I-\gamma P^\pi)^{-1}\big]_{(s_0,a_0),(s,a)},$$
and
$$d^\pi_\mu(s,a) = (1-\gamma)\sum_{s_0,a_0}\mu(s_0)\,\pi(a_0\mid s_0)\,\big[(I-\gamma P^\pi)^{-1}\big]_{(s_0,a_0),(s,a)}.$$
It is straightforward to verify that $d^\pi_\mu$ satisfies, for all states $s\in\mathcal{S}$:
$$\sum_a d^\pi_\mu(s,a) = (1-\gamma)\,\mu(s) + \gamma\sum_{s',a'} P(s\mid s',a')\,d^\pi_\mu(s',a').$$
Remarks: Simply, we have
$$\sum_a d^\pi_\mu(s,a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\Pr{}^\pi(s_t=s;\mu) = (1-\gamma)\mu(s) + (1-\gamma)\sum_{t=1}^{\infty}\gamma^{t}\sum_{s',a'}\Pr{}^\pi(s_{t-1}=s', a_{t-1}=a';\mu)\,P(s\mid s',a') = (1-\gamma)\mu(s) + \gamma\sum_{s',a'}P(s\mid s',a')\,d^\pi_\mu(s',a').$$
Let us define the state-action polytope $K_\mu$ as follows:
$$K_\mu := \Big\{ d \;\Big|\; d \ge 0 \ \text{and}\ \sum_a d(s,a) = (1-\gamma)\,\mu(s) + \gamma\sum_{s',a'} P(s\mid s',a')\,d(s',a') \ \ \forall s\in\mathcal{S}\Big\}.$$
We now see that this set precisely characterizes all state-action visitation distributions.
Proposition 1.15. We have that $K_\mu$ is equal to the set of all feasible state-action distributions, i.e. $d\in K_\mu$ if and only if there exists a stationary (and possibly randomized) policy $\pi$ such that $d^\pi_\mu = d$.
With respect to the variables $d\in\mathbb{R}^{|\mathcal{S}|\cdot|\mathcal{A}|}$, the dual LP formulation is as follows:
$$\max_{d\in K_\mu}\ \sum_{s,a} d(s,a)\,r(s,a).$$
If $d^\star$ is the solution to this LP, and provided that $\mu$ has full support, then we have that:
$$\pi^\star(a\mid s) = \frac{d^\star(s,a)}{\sum_{a'} d^\star(s,a')}$$
is an optimal policy. An alternative optimal policy is $\arg\max_{a} d^\star(s,a)$ (and these policies are identical if the optimal policy is unique).
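A minimal sketch of the dual LP on an assumed toy MDP: optimize over visitation measures $d$ subject to the flow constraints of $K_\mu$, then read off the policy from $d^\star$ (again, the concrete MDP and names are my own illustration).

```python
# Sketch (mine) of the dual LP over state-action visitation measures d(s, a):
# maximize sum d(s,a) r(s,a) subject to the flow constraints
# sum_a d(s,a) = (1-gamma) mu(s) + gamma sum_{s',a'} P(s|s',a') d(s',a'), d >= 0,
# then read off the optimal policy pi*(a|s) proportional to d*(s, a).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
mu = np.full(S, 1.0 / S)

# Flow constraint matrix: A_eq[s, (s', a')] = 1{s' = s} - gamma * P(s | s', a').
A_eq = np.zeros((S, S * A))
for s in range(S):
    A_eq[s] = -gamma * P[:, :, s].reshape(S * A)
    A_eq[s, s * A:(s + 1) * A] += 1.0
b_eq = (1 - gamma) * mu

res = linprog(c=-r.reshape(S * A), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
d_star = res.x.reshape(S, A)
pi_star = d_star / d_star.sum(axis=1, keepdims=True)   # pi*(a|s) proportional to d*(s,a)
print("optimal (up to ties) deterministic policy:", pi_star.argmax(axis=1))
```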
Sample Complexity and Sampling Models
We are interested in understanding the number of samples required to find a near-optimal policy, i.e. the sample complexity.
A generative model provides us with a sample $s'\sim P(\cdot\mid s,a)$ and the reward $r(s,a)$ upon input of a state-action pair $(s,a)$.
The offline RL setting. The offline RL setting is where the agent has access to an offline dataset, say generated under some policy (or a collection of policies). In the simplest of these settings, we may assume our dataset is of the form $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^N$, where $r_i$ is the reward (corresponding to $r(s_i, a_i)$ if the reward is deterministic) and $s'_i \sim P(\cdot\mid s_i, a_i)$. Furthermore, for simplicity it can be helpful to assume that the $(s_i, a_i)$ pairs in this dataset were sampled i.i.d. from some fixed distribution over $\mathcal{S}\times\mathcal{A}$.
Bonus: Advantages and The Performance Difference Lemma
The advantage $A^\pi(s,a)$ of a policy $\pi$ is defined as
$$A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s).$$
Note that:
$$A^{\pi^\star}(s,a) \le 0$$
for all state-action pairs.
Analogous to the state-action visitation distribution (see Equation 0.8), we can define a visitation measure over just the states. When clear from context, we will overload notation and also denote this distribution by $d^\pi_{s_0}$, where:
$$d^\pi_{s_0}(s) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr{}^\pi(s_t = s \mid s_0).$$
Here, $\Pr^\pi(s_t = s\mid s_0)$ is the state visitation probability at time $t$, under $\pi$ starting at state $s_0$. Again, we write:
$$d^\pi_\mu(s) := \mathbb{E}_{s_0\sim\mu}\big[d^\pi_{s_0}(s)\big]$$
for a distribution $\mu$ over $\mathcal{S}$.
Remarks: In Equation 0.8 we define $d^\pi_{s_0}(s,a) = (1-\gamma)\sum_{t=0}^\infty\gamma^t\Pr^\pi(s_t = s, a_t = a\mid s_0)$, so the only difference is that the action is marginalized out, i.e. $d^\pi_{s_0}(s) = \sum_a d^\pi_{s_0}(s,a)$.
Lemma 1.16 (The performance difference lemma). For all policies $\pi, \pi'$ and distributions $\mu$ over $\mathcal{S}$,
$$V^\pi(\mu) - V^{\pi'}(\mu) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^\pi_\mu}\,\mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[A^{\pi'}(s,a)\big],$$
where $V^\pi(\mu) := \mathbb{E}_{s_0\sim\mu}\big[V^\pi(s_0)\big]$.
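A quick numeric sanity check of the lemma (my own sketch on an assumed toy MDP with two random stochastic policies; all names and sizes are illustrative):

```python
# Numeric sanity check (my own sketch) of the performance difference lemma:
# V^{pi_new}(mu) - V^{pi}(mu)
#   = 1/(1-gamma) * E_{s ~ d^{pi_new}_mu} E_{a ~ pi_new(.|s)} [A^{pi}(s, a)].
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
mu = np.full(S, 1.0 / S)
pi     = rng.random((S, A)); pi     /= pi.sum(axis=1, keepdims=True)
pi_new = rng.random((S, A)); pi_new /= pi_new.sum(axis=1, keepdims=True)

def q_values(policy):
    """Exact Q^pi via (I - gamma P^pi)^{-1} r for a stochastic policy."""
    P_pol = (P.reshape(S * A, S)[:, :, None] * policy[None, :, :]).reshape(S * A, S * A)
    return np.linalg.solve(np.eye(S * A) - gamma * P_pol, r.reshape(S * A)).reshape(S, A)

Q_pi = q_values(pi)
V_pi = (pi * Q_pi).sum(axis=1)
A_pi = Q_pi - V_pi[:, None]                         # advantage A^pi(s, a)
V_new = (pi_new * q_values(pi_new)).sum(axis=1)

# State visitation of pi_new: d(s) = (1-gamma) * sum_t gamma^t Pr(s_t = s | mu).
P_state = np.einsum('sap,sa->sp', P, pi_new)        # state-to-state kernel under pi_new
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_state.T, mu)

lhs = mu @ V_new - mu @ V_pi
rhs = (d[:, None] * pi_new * A_pi).sum() / (1 - gamma)
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```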