Overview

These are my notes for Chapter 1 of Reinforcement Learning Theory.

Markov Decision Process

Discounted (Infinite-Horizon) MDP

$M = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu)$ (states, actions, transition kernel, reward, discount factor, initial state distribution)

Objective, Policies, and Values

Bellman Consistency Equations for Stationary Policies

Lemma 1.4. Suppose that $\pi$ is a stationary policy. Then $V^\pi$ and $Q^\pi$ satisfy the following Bellman consistency equations: for all $s \in \mathcal{S}$, $a \in \mathcal{A}$,

$$V^\pi(s) = Q^\pi(s, \pi(s)), \qquad Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big],$$

where for a stochastic $\pi$, $Q^\pi(s,\pi(s))$ is shorthand for $\mathbb{E}_{a\sim\pi(\cdot\mid s)}[Q^\pi(s,a)]$.

Proof:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[Q^\pi(s,a)\big] = Q^\pi(s,\pi(s)),$$

$$\begin{aligned}
Q^\pi(s,a) &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \,\Big|\, \pi, s_0=s, a_0=a\Big] \\
&= r(s,a) + \gamma\,\mathbb{E}\Big[\textstyle\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t,a_t) \,\Big|\, \pi, s_0=s, a_0=a\Big] \\
&= r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big].
\end{aligned}$$

It is helpful to view $V^\pi$ as a vector of length $|\mathcal{S}|$, and $Q^\pi$ and $r$ as vectors of length $|\mathcal{S}|\cdot|\mathcal{A}|$. We overload notation and let $P$ also refer to a matrix of size $(|\mathcal{S}||\mathcal{A}|) \times |\mathcal{S}|$ whose entry $P_{(s,a),s'}$ equals $P(s'\mid s,a)$.

Remarks: This is natural since a function on a finite domain is just a vector with one coordinate per input point (and, more generally, a function can be viewed as an infinite-dimensional vector indexed by its domain).

We will also define $P^\pi$ to be the transition matrix on state-action pairs induced by a stationary policy $\pi$, specifically

$$P^\pi_{(s,a),(s',a')} := P(s'\mid s,a)\,\pi(a'\mid s').$$

Remarks: $P^\pi \in \mathbb{R}^{(|\mathcal{S}||\mathcal{A}|)\times(|\mathcal{S}||\mathcal{A}|)}$.

In particular, for deterministic policies, we have:

$$P^\pi_{(s,a),(s',a')} := \begin{cases} P(s'\mid s,a) & \text{if } a' = \pi(s'), \\ 0 & \text{if } a' \neq \pi(s'). \end{cases}$$

With this notation, it is straightforward to verify

$$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi.$$

Proof: Notice that

$$[P V^\pi]_{(s,a)} = \sum_{s' \in \mathcal{S}} P(s'\mid s,a)\, V^\pi(s') = \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big].$$

Thus, we have

$$[Q^\pi]_{(s,a)} = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\pi(s')\big].$$

Also, note that

$$\begin{aligned}
[P^\pi Q^\pi]_{(s,a)} &= \sum_{s',a'} P(s'\mid s,a)\,\pi(a'\mid s')\, Q^\pi(s',a') \\
&= \sum_{s'} P(s'\mid s,a)\, \mathbb{E}_{a' \sim \pi(\cdot\mid s')}\big[Q^\pi(s',a')\big] \\
&= \sum_{s'} P(s'\mid s,a)\, V^\pi(s').
\end{aligned}$$

Corollary 1.5. Suppose that $\pi$ is a stationary policy. Then

$$Q^\pi = (I - \gamma P^\pi)^{-1} r,$$

where I is the identity matrix.


Proof: We need to show that $I - \gamma P^\pi$ is invertible, i.e., that $(I - \gamma P^\pi)x \neq 0$ for every non-zero vector $x$. Invertibility suffices, since rearranging $Q^\pi = r + \gamma P^\pi Q^\pi$ gives $(I - \gamma P^\pi) Q^\pi = r$.

To see that $I - \gamma P^\pi$ is invertible, observe that for any non-zero vector $x \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

$$\begin{aligned}
\|(I - \gamma P^\pi)x\|_\infty &= \|x - \gamma P^\pi x\|_\infty \\
&\geq \|x\|_\infty - \gamma\,\|P^\pi x\|_\infty \\
&\geq \|x\|_\infty - \gamma\,\|x\|_\infty \quad \text{(each entry of } P^\pi x \text{ is a convex combination of entries of } x\text{)} \\
&= (1-\gamma)\,\|x\|_\infty > 0,
\end{aligned}$$

which implies that $I - \gamma P^\pi$ is full rank.
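To make the matrix notation concrete, here is a minimal numpy sketch (my own, not from the book): it builds a small random MDP with $P$ stored as an $(|\mathcal{S}||\mathcal{A}|) \times |\mathcal{S}|$ array in $(s,a)$ row order, forms $P^\pi$ for a deterministic policy, solves $Q^\pi = (I - \gamma P^\pi)^{-1} r$, and checks the Bellman consistency equation $Q^\pi = r + \gamma P V^\pi$. All names (`P`, `P_pi`, `Q_pi`, ...) are my own conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# A small random MDP: row (s, a) of P is the distribution P(.|s, a); rewards in [0, 1].
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)

pi = rng.integers(nA, size=nS)            # a deterministic stationary policy

# P^pi on state-action pairs: P^pi[(s,a), (s',a')] = P(s'|s,a) * 1{a' = pi(s')}
P_pi = np.zeros((nS * nA, nS * nA))
for sa in range(nS * nA):
    for s_next in range(nS):
        P_pi[sa, s_next * nA + pi[s_next]] = P[sa, s_next]

# Corollary 1.5: Q^pi = (I - gamma * P^pi)^{-1} r
Q_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r)

# Bellman consistency check: Q^pi = r + gamma * P V^pi, with V^pi(s) = Q^pi(s, pi(s)).
V_pi = Q_pi.reshape(nS, nA)[np.arange(nS), pi]
assert np.allclose(Q_pi, r + gamma * P @ V_pi)
```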


Lemma 1.6. We have that

$$\big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s,a),(s',a')} = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a),$$

so we can view the $(s,a)$-th row of this matrix as an induced (discounted) distribution over states and actions when following $\pi$ after starting with $s_0 = s$ and $a_0 = a$.


Proof: We need to show $\sum_{t=0}^{\infty} \gamma^t P_t^\pi\,(I - \gamma P^\pi) = I$, where we use $P_t^\pi$ to denote the $t$-step transition matrix with $[P_t^\pi]_{(s,a),(s',a')} = \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a)$.

Thus, for the entry at $\big((s,a),(s',a')\big)$, we need to show

$$\sum_{\tilde s,\tilde a}\; \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = \tilde s, a_t = \tilde a \mid s_0 = s, a_0 = a)\,\Big(\mathbb{1}\{(\tilde s,\tilde a) = (s',a')\} - \gamma\, P(s'\mid \tilde s,\tilde a)\,\pi(a'\mid s')\Big) = \mathbb{1}\{(s,a) = (s',a')\}.$$

For the LHS, we have

$$\begin{aligned}
\text{LHS} &= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a) - \sum_{t=0}^{\infty} \gamma^{t+1} \sum_{\tilde s,\tilde a} \mathbb{P}^\pi(s_t = \tilde s, a_t = \tilde a \mid s_0 = s, a_0 = a)\, P(s'\mid \tilde s,\tilde a)\,\pi(a'\mid s') \\
&= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a) - \sum_{t=0}^{\infty} \gamma^{t+1}\, \mathbb{P}^\pi(s_{t+1} = s', a_{t+1} = a' \mid s_0 = s, a_0 = a) \\
&= \mathbb{P}^\pi(s_0 = s', a_0 = a' \mid s_0 = s, a_0 = a),
\end{aligned}$$

where the second equality uses the Chapman-Kolmogorov equation. The proof is completed by noting that $\mathbb{P}^\pi(s_0 = s', a_0 = a' \mid s_0 = s, a_0 = a) = \mathbb{1}\{(s,a) = (s',a')\}$.
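As a quick sanity check of Lemma 1.6 (again a sketch with my own setup, not the book's code): for a random MDP and a deterministic policy, the matrix $(1-\gamma)(I-\gamma P^\pi)^{-1}$ should have nonnegative entries and rows that each sum to one.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
pi = rng.integers(nA, size=nS)

P_pi = np.zeros((nS * nA, nS * nA))
for sa in range(nS * nA):
    for s_next in range(nS):
        P_pi[sa, s_next * nA + pi[s_next]] = P[sa, s_next]

# Lemma 1.6: row (s, a) of (1 - gamma)(I - gamma P^pi)^{-1} is the discounted
# state-action visitation distribution started from (s_0, a_0) = (s, a).
D = (1 - gamma) * np.linalg.inv(np.eye(nS * nA) - gamma * P_pi)
assert np.all(D >= -1e-12)              # entries are nonnegative
assert np.allclose(D.sum(axis=1), 1.0)  # each row sums to one
```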

Bellman Optimality Equations

There exists a stationary and deterministic policy that simultaneously maximizes $V^\pi(s)$ for all $s$.

Theorem 1.7. Let $\Pi$ be the set of all non-stationary and randomized policies. Define

$$V^\star(s) := \sup_{\pi \in \Pi} V^\pi(s), \qquad Q^\star(s,a) := \sup_{\pi \in \Pi} Q^\pi(s,a),$$

which are finite since $V^\pi(s)$ and $Q^\pi(s,a)$ are bounded between $0$ and $1/(1-\gamma)$.

There exists a stationary and deterministic policy $\pi$ such that for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$,

$$V^\pi(s) = V^\star(s), \qquad Q^\pi(s,a) = Q^\star(s,a).$$

We refer to such a π as an optimal policy.


Proof at P9 of the book.


Remarks: The optimal policy used in the proof is $\tilde\pi(s) = \operatorname{arg\,sup}_{a \in \mathcal{A}} \mathbb{E}\big[r(s,a) + \gamma V^\star(s_1) \,\big|\, (s_0, a_0) = (s,a)\big]$.

Theorem 1.8 (Bellman optimality equations). We say that a vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ satisfies the Bellman optimality equations if:

$$Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q(s',a')\Big].$$

For any $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, we have that $Q = Q^\star$ if and only if $Q$ satisfies the Bellman optimality equations. Furthermore, the deterministic policy defined by $\pi(s) \in \arg\max_{a \in \mathcal{A}} Q^\star(s,a)$ is an optimal policy (where ties are broken in some arbitrary manner).


Proof at P10 of the book.

Notice that $V^\star(s) = V^{\pi^\star}(s) = Q^{\pi^\star}(s, \pi^\star(s)) \geq Q^\star(s,a)$ for all $a$, when $\pi^\star$ is the optimal stationary and deterministic policy.


Remarks: Here more notation is introduced: define the Bellman optimality operator $\mathcal{T} : \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \to \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ by

$$(\mathcal{T}Q)(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q(s',a')\Big].$$

This allows us to rewrite the Bellman optimality equation in the concise form

$$Q = \mathcal{T}Q,$$

and so the previous theorem states that $Q = Q^\star$ if and only if $Q$ is a fixed point of the operator $\mathcal{T}$.

Remarks: The equivalent form in $V$ can be written as

$$V^\star(s) = \max_{a}\Big\{ r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\star(s')\big] \Big\},$$

where $V^\star$ is a vector of length $|\mathcal{S}|$ (straightforward to verify since $V^\star(s) = \max_a Q^\star(s,a)$), and the equation can be written concisely as $V = \mathcal{T}V$.

The two theorems together establish that an optimal stationary and deterministic policy exists, and that $Q^\star$ is the unique solution of the Bellman optimality equations (equivalently, the unique fixed point of $\mathcal{T}$).

Finite-Horizon MDP

$M = (\mathcal{S}, \mathcal{A}, \{P_h\}_h, \{r_h\}_h, H, \mu)$

Changes compared with the discounted setting: the transitions $P_h$ and rewards $r_h$ may depend on the time step $h$, there is no discounting, the horizon is a finite $H$, and policies and values are indexed by $h$ (e.g. $\pi(s,h)$, $V_h^\pi$, $Q_h^\pi$).

Theorem 1.9 (Bellman optimality equations). Define

$$Q_h^\star(s,a) = \sup_{\pi \in \Pi} Q_h^\pi(s,a),$$

where the sup is over all non-stationary and randomized policies. Suppose that $Q_H = 0$. We have that $Q_h = Q_h^\star$ for all $h \in [H]$ if and only if for all $h \in [H]$,

$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q_{h+1}(s',a')\Big].$$

Furthermore, $\pi^\star(s,h) = \arg\max_{a \in \mathcal{A}} Q_h^\star(s,a)$ is an optimal policy.


Proof: The flavor of this proof is similar to the infinite-horizon case.

First note that Theorem 1.7 also applies to the finite-horizon setting, though the deterministic optimal policy now depends on $h$: $\pi^\star(\cdot, h)$. The proof is straightforward.

$$\begin{aligned}
Q_h^\star(s,a) = \sup_{\pi \in \Pi} Q_h^\pi(s,a) &= \sup_{\pi \in \Pi}\; r_h(s,a) + \mathbb{E}\Big[\textstyle\sum_{t=h+1}^{H-1} r_t(s_t,a_t) \,\Big|\, \pi, s_h = s, a_h = a\Big] \\
&= \sup_{\pi \in \Pi}\; r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\Big[\mathbb{E}\Big[\textstyle\sum_{t=h+1}^{H-1} r_t(s_t,a_t) \,\Big|\, \pi, s_{h+1} = s'\Big]\Big] \\
&= r_h(s,a) + \sup_{\pi \in \Pi}\; \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\big[V_{h+1}^\pi(s')\big].
\end{aligned}$$

Likewise, we define $V_{h+1}^\star(s') = \sup_{\pi \in \Pi} V_{h+1}^\pi(s')$ for all $s'$; the sup can be moved inside the expectation because the policy from step $h+1$ onward may be chosen as a function of $s'$. Thus we have

$$Q_h^\star(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\big[V_{h+1}^\star(s')\big].$$

So we only need to show that

$$V_h^\star(s) = \max_a Q_h^\star(s,a),$$

which follows the exact same proof as in the infinite-horizon MDP, using an optimal deterministic policy $\pi^\star(\cdot, h)$.

The reverse direction is very similar.


Overall, the finite-horizon MDP is a factor of $O(H)$ larger than the infinite-horizon one, since the transitions and rewards are specified separately for each time step $h$.

Computational Complexity

Suppose that $(P, r, \gamma)$ in our MDP $M$ is specified with rational entries. Let $L(P, r, \gamma)$ denote the total bit-size required to specify $M$, and assume that basic arithmetic operations $+, -, \times, \div$ take unit time. Here, we may hope for an algorithm which (exactly) returns an optimal policy whose runtime is polynomial in $L(P, r, \gamma)$ and the number of states and actions. More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want to explicitly restrict $(P, r, \gamma)$ to be specified by rationals. An algorithm is said to be strongly polynomial if it returns an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on $L(P, r, \gamma)$).

Remarks: For computational complexity, there are some standard conventions in optimization that are not spelled out here, such as what exactly $L(P, r, \gamma)$ is and how it relates to $\epsilon$; refer to the slides for a detailed introduction.

Value Iteration

Q-value iteration: starting from some $Q$, we iteratively apply the operator $\mathcal{T}$.

Lemma 1.10 (contraction). For any two vectors $Q, Q' \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

$$\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \leq \gamma\,\|Q - Q'\|_\infty.$$

Proof at P13 of the book.


Lemma 1.11 (Q-error amplification). For any vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

$$V^{\pi_Q} \geq V^\star - \frac{2\,\|Q - Q^\star\|_\infty}{1-\gamma}\,\mathbb{1},$$

where $\mathbb{1}$ denotes the vector of all ones and $\pi_Q(s) := \arg\max_a Q(s,a)$ is the greedy policy with respect to $Q$.


Proof at P14 of the book.


Theorem 1.12 (Q-value iteration convergence). Set $Q^{(0)} = 0$. For $k = 0, 1, \dots$, suppose:

$$Q^{(k+1)} = \mathcal{T}Q^{(k)}.$$

Let $\pi^{(k)} = \pi_{Q^{(k)}}$. For $k \geq \dfrac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}$,

$$V^{\pi^{(k)}} \geq V^\star - \epsilon\,\mathbb{1}.$$

This is actually a corollary of Lemmas 1.10 and 1.11, together with the fact that $(1-x)^k \leq \exp(-xk)$.


Remarks: Setting $\epsilon$ on the order of $2^{-L(P,r,\gamma)}$, and noting that each application of $\mathcal{T}$ costs $O(|\mathcal{S}|^2|\mathcal{A}|)$, we obtain the runtime reported in the book's table.
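Here is a minimal implementation sketch of Q-value iteration, under the same tabular representation as before ($P$ an $(|\mathcal{S}||\mathcal{A}|)\times|\mathcal{S}|$ array, $r$ a vector); the iteration count follows Theorem 1.12, but the function name and code are my own:

```python
import numpy as np

def q_value_iteration(P, r, gamma, eps):
    """Q-value iteration (Theorem 1.12): iterate Q <- T(Q) starting from Q = 0.

    P: (S*A, S) transition matrix with row order (s, a); r: (S*A,) rewards in [0, 1].
    Returns the greedy policy of the final iterate and the iterate itself.
    """
    nSA, nS = P.shape
    nA = nSA // nS
    Q = np.zeros(nSA)
    num_iters = int(np.ceil(np.log(2.0 / ((1 - gamma) ** 2 * eps)) / (1 - gamma)))
    for _ in range(num_iters):
        V = Q.reshape(nS, nA).max(axis=1)   # V(s) = max_a Q(s, a)
        Q = r + gamma * P @ V               # (T Q)(s, a) = r(s, a) + gamma * E[max_a' Q(s', a')]
    return Q.reshape(nS, nA).argmax(axis=1), Q
```

Each loop iteration is one application of $\mathcal{T}$, and the matrix-vector product `P @ V` is exactly the $O(|\mathcal{S}|^2|\mathcal{A}|)$ cost mentioned in the remark above.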

Policy Iteration

The policy iteration algorithm, for discounted MDPs, starts from an arbitrary policy $\pi_0$ and repeats the following iterative procedure: for $k = 0, 1, 2, \dots$

  1. Policy evaluation. Compute $Q^{\pi_k}$.

  2. Policy improvement. Update the policy:

$$\pi_{k+1} = \pi_{Q^{\pi_k}}.$$

In each iteration, we compute the Q-value function of πk, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value. The first step is often called policy evaluation, and the second step is often called policy improvement.

Remarks: Equation 0.2 is $Q^\pi = (I - \gamma P^\pi)^{-1} r$ (Corollary 1.5).
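A sketch of exact policy iteration under the same tabular representation ($P$ of shape $(|\mathcal{S}||\mathcal{A}|, |\mathcal{S}|)$, deterministic policies as integer arrays); the function name is mine. Policy evaluation is done via the $|\mathcal{S}|\times|\mathcal{S}|$ linear system, matching the per-iteration cost discussed below.

```python
import numpy as np

def policy_iteration(P, r, gamma, num_iters):
    """Exact policy iteration for a tabular discounted MDP.

    P: (S*A, S) transitions with row order (s, a); r: (S*A,) rewards.
    Each iteration evaluates Q^{pi_k} and then acts greedily with respect to it.
    """
    nSA, nS = P.shape
    nA = nSA // nS
    pi = np.zeros(nS, dtype=int)                          # arbitrary initial policy
    for _ in range(num_iters):
        # Policy evaluation: solve the |S| x |S| system (I - gamma P_pi) V = r_pi,
        # then lift to Q^pi(s, a) = r(s, a) + gamma * E_{s'}[V^pi(s')].
        P_pi = P.reshape(nS, nA, nS)[np.arange(nS), pi]   # P(.|s, pi(s)), shape (S, S)
        r_pi = r.reshape(nS, nA)[np.arange(nS), pi]
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        Q = r + gamma * P @ V
        # Policy improvement: greedy with respect to Q^{pi_k}.
        pi = Q.reshape(nS, nA).argmax(axis=1)
    return pi
```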

Lemma 1.13. We have that

$$Q^{\pi_{k+1}} \geq \mathcal{T}Q^{\pi_k} \geq Q^{\pi_k},$$
$$\|Q^{\pi_{k+1}} - Q^\star\|_\infty \leq \gamma\,\|Q^{\pi_k} - Q^\star\|_\infty.$$

Proof at P15 of the book.

Remarks: Recall $[P^\pi Q^\pi]_{(s,a)} = \sum_{s',a'} P(s'\mid s,a)\,\pi(a'\mid s')\,Q^\pi(s',a')$; when $\pi$ is deterministic, $[P^\pi Q^\pi]_{(s,a)} = \sum_{s'} P(s'\mid s,a)\,Q^\pi(s',\pi(s'))$. Since $\pi_{k+1}$ is the greedy policy with respect to $Q^{\pi_k}$, we have $P^{\pi_{k+1}} Q^{\pi_k} \geq P^{\pi_k} Q^{\pi_k}$. Unrolling the recursion gives

$$Q^{\pi_k} = r + \gamma P^{\pi_k} Q^{\pi_k} \leq r + \gamma P^{\pi_{k+1}} Q^{\pi_k} \leq r + \gamma P^{\pi_{k+1}}\big(r + \gamma P^{\pi_{k+1}} Q^{\pi_k}\big) \leq \cdots \leq \sum_{t=0}^{\infty} \gamma^t \big(P^{\pi_{k+1}}\big)^t r = Q^{\pi_{k+1}}.$$


Theorem 1.14 (policy iteration convergence). Let $\pi_0$ be any initial policy. For $k \geq \dfrac{\log\frac{1}{(1-\gamma)\epsilon}}{1-\gamma}$, the $k$-th policy in policy iteration has the following performance bound:

$$Q^{\pi_k} \geq Q^\star - \epsilon\,\mathbb{1}.$$

Proof: Notice that

$$\|Q^{\pi_k} - Q^\star\|_\infty \leq \gamma^k\,\|Q^{\pi_0} - Q^\star\|_\infty = \big(1-(1-\gamma)\big)^k\,\|Q^{\pi_0} - Q^\star\|_\infty \leq \frac{\exp\!\big(-(1-\gamma)k\big)}{1-\gamma},$$

where the last step uses $\|Q^{\pi_0} - Q^\star\|_\infty \leq \frac{1}{1-\gamma}$ and $(1-x)^k \leq \exp(-xk)$; for the stated $k$, the right-hand side is at most $\epsilon$.

Remarks: Note that PI and VI have the same convergence rate for $Q$. However, every iteration of PI is more expensive than one iteration of VI, as PI costs $O(|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|)$ per iteration. Since $\pi_k$ is deterministic, policy evaluation only needs the $|\mathcal{S}| \times |\mathcal{S}|$ state transition matrix $P_{\pi_k}(s' \mid s) = P(s' \mid s, \pi_k(s))$ rather than the full $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|$ matrix; the matrix inversion then contributes $O(|\mathcal{S}|^3)$, and computing $Q^{\pi_k}(s,a) = r(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V^{\pi_k}(s')$ for all $(s,a)$ contributes $O(|\mathcal{S}|^2|\mathcal{A}|)$.

PI can in fact be shown to be strongly polynomial (the intuition is that the number of deterministic policies is finite, $|\mathcal{A}|^{|\mathcal{S}|}$, and PI improves monotonically, so it never revisits a policy).

Value Iteration for Finite Horizon MDPs

Let us now specify the value iteration algorithm for finite-horizon MDPs. For the finite-horizon setting, it turns out that the analogues of value iteration and policy iteration lead to identical algorithms. The value iteration algorithm is specified as follows:

  1. Set $Q_{H-1}(s,a) = r_{H-1}(s,a)$.

  2. For $h = H-2, \dots, 0$, set:

$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q_{h+1}(s',a')\Big].$$

By Theorem 1.9, it follows that $Q_h(s,a) = Q_h^\star(s,a)$ and that $\pi^\star(s,h) = \arg\max_{a \in \mathcal{A}} Q_h^\star(s,a)$ is an optimal policy.

Remarks: Step 1 is just notation for simplicity, and there is a typo in step 2 in the book.
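A sketch of this backward recursion, assuming time-indexed arrays `P_h` of shape $(H, |\mathcal{S}||\mathcal{A}|, |\mathcal{S}|)$ and `r_h` of shape $(H, |\mathcal{S}||\mathcal{A}|)$ (my own representation, not the book's notation):

```python
import numpy as np

def finite_horizon_value_iteration(P_h, r_h, H):
    """Backward dynamic programming for a finite-horizon MDP (Theorem 1.9).

    P_h: (H, S*A, S) time-indexed transitions; r_h: (H, S*A) rewards; row order (s, a).
    Returns Q*_h for every h and the optimal policy table pi*(s, h).
    """
    _, nSA, nS = P_h.shape
    nA = nSA // nS
    Q = np.zeros((H, nS, nA))
    Q[H - 1] = r_h[H - 1].reshape(nS, nA)                  # step 1: Q_{H-1} = r_{H-1}
    for h in range(H - 2, -1, -1):                         # step 2: h = H-2, ..., 0
        V_next = Q[h + 1].max(axis=1)                      # V*_{h+1}(s) = max_a Q*_{h+1}(s, a)
        Q[h] = (r_h[h] + P_h[h] @ V_next).reshape(nS, nA)  # Bellman backup at stage h
    pi = Q.argmax(axis=2).T                                # pi[s, h] = argmax_a Q*_h(s, a)
    return Q, pi
```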

The Linear Programming Approach

VI and PI are not polynomial-time algorithms in the strict sense because their runtimes depend on $\gamma$ (through $\frac{1}{1-\gamma}$). With linear programming, however, we can obtain a polynomial-time algorithm.

By the Bellman optimality equation (with the V form), we have

$$V^\star(s) = \max_a \Big\{ r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\star(s')\big] \Big\},$$

which gives

$$V^\star(s) \geq r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^\star(s')\big] \qquad \forall\, a, s.$$

Therefore the LP is

$$\begin{aligned}
\min_{V} \quad & \sum_s \mu(s)\, V(s) \\
\text{subject to} \quad & V(s) \geq r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V(s') \qquad \forall\, a \in \mathcal{A},\; s \in \mathcal{S},
\end{aligned}$$

or equivalently

$$\begin{aligned}
\min_{V} \quad & \mathbb{E}_{s \sim \mu}\big[V(s)\big] \\
\text{subject to} \quad & V(s) \geq r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V(s')\big] \qquad \forall\, a \in \mathcal{A},\; s \in \mathcal{S}.
\end{aligned}$$

Conceptually, LP provides a clean polynomial-time algorithm.
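A sketch of this primal LP using `scipy.optimize.linprog` (my own encoding of the constraints; the book does not prescribe a solver). Each constraint $V(s) \geq r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V(s')$ is rewritten as $(\gamma P_{(s,a),\cdot} - e_s)^\top V \leq -r(s,a)$:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal_lp(P, r, gamma, mu):
    """Primal LP: min_V mu^T V  s.t.  V(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) V(s').

    P: (S*A, S) transitions with row order (s, a); r: (S*A,); mu: (S,).
    Returns the optimal value vector V*.
    """
    nSA, nS = P.shape
    nA = nSA // nS
    E = np.repeat(np.eye(nS), nA, axis=0)      # row (s, a) is the indicator of s
    A_ub = gamma * P - E                       # (gamma P_{(s,a),.} - e_s)^T V <= -r(s,a)
    res = linprog(c=mu, A_ub=A_ub, b_ub=-r, bounds=[(None, None)] * nS)
    return res.x
```

Given the returned $V^\star$, acting greedily with respect to $r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V^\star(s')$ recovers an optimal policy.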

Comments from the slides:

Dual LP

For a fixed (possibly stochastic) policy $\pi$, let us define a visitation measure over states and actions induced by following $\pi$ after starting at $s_0$. Precisely, define this distribution, $d_{s_0}^\pi$, as follows:

$$d_{s_0}^\pi(s,a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr^\pi(s_t = s, a_t = a \mid s_0),$$

where $\Pr^\pi(s_t = s, a_t = a \mid s_0)$ is the probability that $s_t = s$ and $a_t = a$ after starting at state $s_0$ and following $\pi$ thereafter. It is straightforward to verify that $d_{s_0}^\pi$ is a distribution over $\mathcal{S} \times \mathcal{A}$. We also overload notation and write:

$$d_\mu^\pi(s,a) = \mathbb{E}_{s_0 \sim \mu}\big[d_{s_0}^\pi(s,a)\big]$$

for a distribution $\mu$ over $\mathcal{S}$. Recall that Lemma 1.6 provides a way to easily compute $d_\mu^\pi(s,a)$ through an appropriate vector-matrix multiplication.

Remarks: Lemma 1.6 gives

$$\big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s,a),(s',a')} = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a).$$

Thus, for a deterministic $\pi$ (so that $a_0 = \pi(s_0)$), we actually have

$$d_{s_0}^\pi(s,a) = \big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s_0,\pi(s_0)),(s,a)}$$

and

$$d_\mu^\pi(s,a) = \sum_{s_0} \mu(s_0)\,\big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s_0,\pi(s_0)),(s,a)}.$$

It is straightforward to verify that $d_\mu^\pi$ satisfies, for all states $s \in \mathcal{S}$,

$$\sum_a d_\mu^\pi(s,a) = (1-\gamma)\mu(s) + \gamma \sum_{s',a'} P(s \mid s', a')\, d_\mu^\pi(s',a').$$

Remarks: This follows directly from the definition:

$$\sum_a d_\mu^\pi(s,a) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr^\pi_\mu(s_t = s) = (1-\gamma)\mu(s) + \gamma \sum_{s',a'} P(s \mid s', a')\,(1-\gamma)\sum_{t=0}^{\infty} \gamma^{t} \Pr^\pi_\mu(s_{t} = s', a_{t} = a'),$$

using $\Pr^\pi_\mu(s_{t+1} = s) = \sum_{s',a'} \Pr^\pi_\mu(s_t = s', a_t = a')\, P(s \mid s', a')$.

Let us define the state-action polytope as follows:

$$K_\mu := \Big\{\, d \;:\; d \geq 0 \ \text{ and } \ \sum_a d(s,a) = (1-\gamma)\mu(s) + \gamma \sum_{s',a'} P(s \mid s', a')\, d(s',a') \,\Big\}.$$

We now see that this set precisely characterizes all state-action visitation distributions.

Proposition 1.15. We have that $K_\mu$ is equal to the set of all feasible state-action visitation distributions; i.e., $d \in K_\mu$ if and only if there exists a stationary (and possibly randomized) policy $\pi$ such that $d_\mu^\pi = d$.

With respect to the variables $d \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, the dual LP formulation is as follows:

$$\begin{aligned}
\max_{d} \quad & \frac{1}{1-\gamma} \sum_{s,a} d(s,a)\, r(s,a) \\
\text{subject to} \quad & d \in K_\mu.
\end{aligned}$$

If $d^\star$ is the solution to this LP, and provided that $\mu$ has full support, then we have that:

$$\pi^\star(a \mid s) = \frac{d^\star(s,a)}{\sum_{a'} d^\star(s,a')}$$

is an optimal policy. An alternative optimal policy is the deterministic policy $\pi(s) = \arg\max_a d^\star(s,a)$ (and these policies are identical if the optimal policy is unique).
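A corresponding sketch of the dual LP, again with `scipy.optimize.linprog` and my own constraint encoding: the flow constraints defining $K_\mu$ become equality constraints, and the policy is read off by normalizing $d^\star$ (this assumes $\mu$ has full support, as noted above).

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual_lp(P, r, gamma, mu):
    """Dual LP over visitation measures: max (1/(1-gamma)) sum_{s,a} d(s,a) r(s,a), d in K_mu.

    P: (S*A, S) transitions with row order (s, a); r: (S*A,); mu: (S,), assumed full support.
    Returns the optimal visitation d* and the induced policy pi*(a|s).
    """
    nSA, nS = P.shape
    nA = nSA // nS
    E = np.repeat(np.eye(nS), nA, axis=0)          # E[(s, a), s'] = 1{s' = s}
    A_eq = (E - gamma * P).T                       # flow constraints: (E - gamma P)^T d = (1 - gamma) mu
    res = linprog(c=-r / (1 - gamma), A_eq=A_eq, b_eq=(1 - gamma) * mu,
                  bounds=[(0, None)] * nSA)
    d = res.x.reshape(nS, nA)
    pi = d / d.sum(axis=1, keepdims=True)          # pi*(a|s) = d*(s, a) / sum_a' d*(s, a')
    return d, pi
```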

Sample Complexity and Sampling Models

We are interested in understanding the number of samples required to find a near-optimal policy, i.e. the sample complexity.

A generative model provides us with a sample $s' \sim P(\cdot\mid s,a)$ and the reward $r(s,a)$ upon input of a state-action pair $(s,a)$.

The offline RL setting. The offline RL setting is where the agent has access to an offline dataset, say generated under some policy (or a collection of policies). In the simplest of these settings, we may assume our dataset is of the form $\{(s, a, s', r)\}$, where $r$ is the reward (corresponding to $r(s,a)$ if the reward is deterministic) and $s' \sim P(\cdot\mid s,a)$. Furthermore, for simplicity it can be helpful to assume that the $(s,a)$ pairs in this dataset were sampled i.i.d. from some fixed distribution $\nu$ over $\mathcal{S} \times \mathcal{A}$.
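A small sketch of these two sampling models for a tabular MDP (the sampler name `generative_model`, the distribution `nu`, and the dataset format are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA = 4, 3
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)

def generative_model(s, a):
    """Generative model: given (s, a), return a sample s' ~ P(.|s, a) and the reward r(s, a)."""
    s_next = rng.choice(nS, p=P[s * nA + a])
    return s_next, r[s * nA + a]

# Offline dataset: (s, a) drawn i.i.d. from a fixed distribution nu over S x A,
# then s' sampled from the model; stored as tuples (s, a, s', r).
nu = np.full(nS * nA, 1.0 / (nS * nA))
dataset = []
for _ in range(1000):
    s, a = divmod(rng.choice(nS * nA, p=nu), nA)
    s_next, reward = generative_model(s, a)
    dataset.append((s, a, s_next, reward))
```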

Bonus: Advantages and The Performance Difference Lemma

$$V^\pi(\mu) = \mathbb{E}_{s \sim \mu}\big[V^\pi(s)\big].$$

The advantage $A^\pi(s,a)$ of a policy $\pi$ is defined as

$$A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s).$$

Note that:

$$A^\star(s,a) := A^{\pi^\star}(s,a) \leq 0$$

for all state-action pairs.

Analogous to the state-action visitation distribution (see Equation 0.8), we can define a visitation measure over just the states. When clear from context, we will overload notation and also denote this distribution by $d_{s_0}^\pi$, where:

$$d_{s_0}^\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr^\pi(s_t = s \mid s_0).$$

Here, $\Pr^\pi(s_t = s \mid s_0)$ is the state visitation probability under $\pi$, starting at state $s_0$. Again, we write:

$$d_\mu^\pi(s) = \mathbb{E}_{s_0 \sim \mu}\big[d_{s_0}^\pi(s)\big]$$

for a distribution μ over S.

Remarks: In Equation 0.8 we define $d_{s_0}^\pi(s,a) := (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr^\pi(s_t = s, a_t = a \mid s_0)$, so the only difference is the marginalization over $a$: $d_{s_0}^\pi(s) = \sum_a d_{s_0}^\pi(s,a)$.

Lemma 1.16 (the performance difference lemma). For all policies $\pi, \pi'$ and distributions $\mu$ over $\mathcal{S}$,

$$V^\pi(\mu) - V^{\pi'}(\mu) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d_\mu^\pi}\,\mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[A^{\pi'}(s,a)\big].$$

Proof at P19 of the book
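Since the proof is only referenced, here is a numerical sanity check of Lemma 1.16 on a small random MDP with two random stochastic policies (a sketch with my own names; it simply evaluates both sides of the identity with linear algebra):

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)
mu = np.full(nS, 1.0 / nS)

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Return (V^pi, Q^pi, P_pi) for a stochastic policy pi; P_pi is the S x S state chain."""
    P_pi = np.einsum('sa,sat->st', pi, P.reshape(nS, nA, nS))   # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    r_pi = (pi * r.reshape(nS, nA)).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V
    return V, Q, P_pi

pi1, pi2 = random_policy(), random_policy()
V1, Q1, P_pi1 = evaluate(pi1)
V2, Q2, _ = evaluate(pi2)

A2 = Q2.reshape(nS, nA) - V2[:, None]                               # advantage A^{pi'}(s, a)
d1 = (1 - gamma) * mu @ np.linalg.inv(np.eye(nS) - gamma * P_pi1)   # state visitation d_mu^{pi}

lhs = mu @ (V1 - V2)                                                # V^{pi}(mu) - V^{pi'}(mu)
rhs = (d1 * (pi1 * A2).sum(axis=1)).sum() / (1 - gamma)
assert np.isclose(lhs, rhs)
```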