README
Index
ε-greedy policy, 89
n-step Sarsa, 138
action, 2
action space, 2
action value, 30
illustrative examples, 31
relationship to state value, 30
undiscounted case, 205
actor-critic, 216
advantage actor-critic, 217
deterministic actor-critic, 227
off-policy actor-critic, 221
QAC, 216
advantage actor-critic, 217
advantage function, 220
baseline invariance, 217
optimal baseline, 218
pseudocode, 221
agent, 12
Bellman equation, 20
closed-form solution, 27
elementwise expression, 21
equivalent expressions, 22
expression in action values, 32
illustrative examples, 22
iterative solution, 28
matrix-vector expression, 26
policy evaluation, 27
Bellman error, 173
Bellman expectation equation, 127
Bellman optimality equation, 38
contraction property, 44
elementwise expression, 38
matrix-vector expression, 40
optimal policy, 47
optimal state value, 47
solution and properties, 46
bootstrapping, 18
Cauchy sequence, 42
contraction mapping, 41
contraction mapping theorem, 42
deterministic actor-critic, 227
policy gradient theorem, 228
pseudocode, 235
deterministic policy gradient, 235
discount rate, 9
discounted return, 9
Dvoretzky's convergence theorem, 109
environment, 12
episode, 10
episodic tasks, 10
expected Sarsa, 137
experience replay, 183
exploration and exploitation, 92
policy gradient, 212
feature vector, 152
fixed point, 41
grid world example, 1
importance sampling, 221
illustrative examples, 223
importance weight, 222
law of large numbers, 80
least-squares TD, 177
recursive least squares, 178
Markov decision process, 11
model and dynamics, 11
Markov process, 12
Markov property, 11
mean estimation, 78
incremental manner, 102
metrics for policy gradient
average reward, 195
average value, 193
equivalent expressions, 197
metrics for value function approximation
Bellman error, 173
projected Bellman error, 174
Monte Carlo methods, 78
MC ε-Greedy, 90
MC Basic, 81
MC Exploring Starts, 86
comparison with TD learning, 129
on-policy, 142
off-policy, 141
off-policy actor-critic, 221
importance sampling, 221
policy gradient theorem, 224
pseudocode, 226
on-policy, 141
online and offline, 130
optimal policy, 37
greedy is optimal, 47
impact of the discount rate, 49
impact of the reward values, 51
optimal state value, 37
Poisson equation, 205
policy, 4
function representation, 192
deterministic policy, 5
stochastic policy, 5
tabular representation, 6
policy evaluation
illustrative examples, 17
solving the Bellman equation, 27
policy gradient theorem, 198
deterministic case, 228
off-policy case, 224
policy iteration algorithm, 62
comparison with value iteration, 70
convergence analysis, 64
pseudocode, 66
projected Bellman error, 174
Q-learning (deep Q-learning), 182
experience replay, 183
illustrative examples, 184
main network, 182
pseudocode, 184
replay buffer, 183
target network, 182
Q-learning (function representation), 180
Q-learning (tabular representation), 140
illustrative examples, 144
pseudocode, 143
off-policy, 141
QAC, 216
REINFORCE, 210
replay buffer, 183
return, 8
reward, 6
Robbins-Monro algorithm, 103
application to mean estimation, 108
convergence analysis, 106
Sarsa (function representation), 179
Sarsa (tabular representation), 133
convergence analysis, 134
on-policy, 141
variant: n-step Sarsa, 138
variant: expected Sarsa, 137
algorithm, 133
optimal policy learning, 135
state, 2
state space, 2
state transition, 3
state value, 19
function representation, 152
relationship to action value, 30
undiscounted case, 205
stationary distribution, 157
metrics for policy gradient, 193
metrics for value function approximation, 156
stochastic gradient descent, 114
application to mean estimation, 116
comparison with batch gradient descent, 119
convergence analysis, 121
convergence pattern, 116
deterministic formulation, 118
TD error, 128
TD target, 128
temporal-difference methods, 125
n-step Sarsa, 138
Q-learning, 140
Sarsa, 133
TD learning of state values, 126
a unified viewpoint, 145
expected Sarsa, 137
value function approximation, 151
trajectory, 8
truncated policy iteration, 70
comparison with value iteration and policy iteration, 74
pseudocode, 72
value function approximation
Q-learning with function approximation, 180
Sarsa with function approximation, 179
TD learning of state values, 155
deep Q-learning, 182
function approximators, 162
illustrative examples, 164
least-squares TD, 177
linear function, 155
theoretical analysis, 167
value iteration algorithm, 58
comparison with policy iteration, 70
pseudocode, 60