
Index

ϵ-greedy policy, 89

n-step Sarsa, 138

action, 2

action space, 2

action value, 30
    illustrative examples, 31
    relationship to state value, 30
    undiscounted case, 205

actor-critic, 216
    advantage actor-critic, 217
    deterministic actor-critic, 227
    off-policy actor-critic, 221
    QAC, 216

advantage actor-critic, 217
    advantage function, 220
    baseline invariance, 217
    optimal baseline, 218
    pseudocode, 221

agent, 12

Bellman equation, 20
    closed-form solution, 27
    elementwise expression, 21
    equivalent expressions, 22
    expression in action values, 32
    illustrative examples, 22
    iterative solution, 28
    matrix-vector expression, 26
    policy evaluation, 27

Bellman error, 173

Bellman expectation equation, 127

Bellman optimality equation, 38
    contraction property, 44
    elementwise expression, 38
    matrix-vector expression, 40
    optimal policy, 47
    optimal state value, 47
    solution and properties, 46

bootstrapping, 18

Cauchy sequence, 42

contraction mapping, 41

contraction mapping theorem, 42

deterministic actor-critic, 227
    policy gradient theorem, 228
    pseudocode, 235

deterministic policy gradient, 235

discount rate, 9

discounted return, 9

Dvoretzky's convergence theorem, 109

environment, 12

episode, 10

episodic tasks, 10

expected Sarsa, 137

experience replay, 183

exploration and exploitation, 92
    policy gradient, 212

feature vector, 152

fixed point, 41

grid world example, 1

importance sampling, 221
    illustrative examples, 223

importance weight, 222

law of large numbers, 80

least-squares TD, 177
    recursive least squares, 178

Markov decision process, 11
    model and dynamics, 11

Markov process, 12

Markov property, 11

mean estimation, 78
    incremental manner, 102

metrics for policy gradient
    average reward, 195
    average value, 193
    equivalent expressions, 197

metrics for value function approximation
    Bellman error, 173
    projected Bellman error, 174

Monte Carlo methods, 78
    MC ϵ-Greedy, 90
    MC Basic, 81
    MC Exploring Starts, 86
    comparison with TD learning, 129
    on-policy, 142
    off-policy, 141

off-policy actor-critic, 221
    importance sampling, 221
    policy gradient theorem, 224
    pseudocode, 226

on-policy, 141

online and offline, 130

optimal policy, 37
    greedy is optimal, 47
    impact of the discount rate, 49
    impact of the reward values, 51

optimal state value, 37

Poisson equation, 205

policy, 4
    function representation, 192
    deterministic policy, 5
    stochastic policy, 5
    tabular representation, 6

policy evaluation
    illustrative examples, 17
    solving the Bellman equation, 27

policy gradient theorem, 198
    deterministic case, 228
    off-policy case, 224

policy iteration algorithm, 62
    comparison with value iteration, 70
    convergence analysis, 64
    pseudocode, 66

projected Bellman error, 174

Q-learning (deep Q-learning), 182
    experience replay, 183
    illustrative examples, 184
    main network, 182
    pseudocode, 184
    replay buffer, 183
    target network, 182

Q-learning (function representation), 180

Q-learning (tabular representation), 140
    illustrative examples, 144
    pseudocode, 143
    off-policy, 141

QAC, 216

REINFORCE, 210

replay buffer, 183

return, 8

reward, 6

Robbins-Monro algorithm, 103
    application to mean estimation, 108
    convergence analysis, 106

Sarsa (function representation), 179

Sarsa (tabular representation), 133
    convergence analysis, 134
    on-policy, 141
    variant: n-step Sarsa, 138
    variant: expected Sarsa, 137
    algorithm, 133
    optimal policy learning, 135

state, 2

state space, 2

state transition, 3

state value, 19
    function representation, 152
    relationship to action value, 30
    undiscounted case, 205

stationary distribution, 157
    metrics for policy gradient, 193
    metrics for value function approximation, 156

stochastic gradient descent, 114
    application to mean estimation, 116
    comparison with batch gradient descent, 119
    convergence analysis, 121
    convergence pattern, 116
    deterministic formulation, 118

TD error, 128

TD target, 128

temporal-difference methods, 125
    n-step Sarsa, 138
    Q-learning, 140
    Sarsa, 133
    TD learning of state values, 126
    a unified viewpoint, 145
    expected Sarsa, 137
    value function approximation, 151

trajectory, 8

truncated policy iteration, 70
    comparison with value iteration and policy iteration, 74
    pseudocode, 72

value function approximation
    Q-learning with function approximation, 180
    Sarsa with function approximation, 179
    TD learning of state values, 155
    deep Q-learning, 182
    function approximators, 162
    illustrative examples, 164
    least-squares TD, 177
    linear function, 155
    theoretical analysis, 167

value iteration algorithm, 58
    comparison with policy iteration, 70
    pseudocode, 60