Table of Contents
Overview of this Book
1 Basic Concepts
1.1 A grid world example
1.2 State and action
1.3 State transition
1.4 Policy
1.5 Reward
1.6 Trajectories, returns, and episodes
1.7 Markov decision processes
1.8 Summary
1.9 Q&A
2 State Values and Bellman Equation
2.1 Motivating example 1: Why are returns important?
2.2 Motivating example 2: How to calculate returns?
2.3 State values
2.4 Bellman equation
2.5 Examples for illustrating the Bellman equation
2.6 Matrix-vector form of the Bellman equation
2.7 Solving state values from the Bellman equation
2.8 From state value to action value
2.9 Summary
2.10 Q&A
3 Optimal State Values and Bellman Optimality Equation
3.1 Motivating example: How to improve policies?
3.2 Optimal state values and optimal policies
3.3 Bellman optimality equation
3.4 Solving an optimal policy from the BOE
3.5 Factors that influence optimal policies
3.6 Summary
3.7 Q&A
4 Value Iteration and Policy Iteration
4.1 Value iteration
4.2 Policy iteration
4.3 Truncated policy iteration
4.4 Summary
4.5 Q&A
5 Monte Carlo Methods
5.1 Motivating example: Mean estimation
5.2 MC Basic: The simplest MC-based algorithm
5.3 MC Exploring Starts
5.4 MC epsilon-Greedy: Learning without exploring starts
5.5 Exploration and exploitation of epsilon-greedy policies
5.6 Summary
5.7 Q&A
6 Stochastic Approximation
6.1 Motivating example: Mean estimation
6.2 Robbins-Monro algorithm
6.3 Dvoretzky's convergence theorem
6.4 Stochastic gradient descent
6.5 Summary
6.6 Q&A
7 Temporal-Difference Methods
7.1 TD learning of state values
7.2 TD learning of action values: Sarsa
7.3 TD learning of action values: n-step Sarsa
7.4 TD learning of optimal action values: Q-learning
7.5 A unified viewpoint
7.6 Summary
7.7 Q&A
8 Value Function Methods
8.1 Value representation: From table to function
8.2 TD learning of state values based on function approximation
8.3 TD learning of action values based on function approximation
8.4 Deep Q-learning
8.5 Summary
8.6 Q&A
9 Policy Gradient Methods
9.1 Policy representation: From table to function
9.2 Metrics for defining optimal policies
9.3 Gradients of the metrics
9.4 Monte Carlo policy gradient: REINFORCE
9.5 Summary
9.6 Q&A
10 Actor-Critic Methods
10.1 The simplest actor-critic algorithm: QAC
10.2 Advantage actor-critic: A2C
10.3 Off-policy actor-critic
10.4 Deterministic actor-critic
10.5 Summary
10.6 Q&A
Appendix A: Preliminaries for Probability Theory
Appendix B: Measure-Theoretic Probability Theory
Appendix C: Convergence of Sequences
C.1 Convergence of deterministic sequences
C.2 Convergence of stochastic sequences
Appendix D: Preliminaries for Gradient Descent
Bibliography
Symbols
Index