README
Bibliography
[1] M. Pinsky and S. Karlin, An introduction to stochastic modeling (3rd Edition). Academic Press, 1998.
[2] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[3] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (2nd Edition). MIT Press, 2018.
[4] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge University Press, 2012.
[5] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming. Athena Scientific, 1996.
[6] H. K. Khalil, Nonlinear systems (3rd Edition). Prentice Hall, 2002.
[7] G. Strang, Calculus. Wellesley-Cambridge Press, 1991.
[8] A. Besenyei, “A brief history of the mean value theorem,” 2012. Lecture notes.
[9] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning, vol. 99, pp. 278–287, 1999.
[10] R. E. Bellman, Dynamic programming. Princeton University Press, 2010.
[11] R. E. Bellman and S. E. Dreyfus, Applied dynamic programming. Princeton University Press, 2015.
[12] J. Bibby, “Axiomatisations of the average and a further generalisation of monotonic sequences,” Glasgow Mathematical Journal, vol. 15, no. 1, pp. 63–65, 1974.
[13] A. S. Polydoros and L. Nalpantidis, “Survey of model-based reinforcement learning: Applications on robotics,” Journal of Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, 2017.
[14] T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker, “Model-based reinforcement learning: A survey,” Foundations and Trends in Machine Learning, vol. 16, no. 1, pp. 1–118, 2023.
[15] F.-M. Luo, T. Xu, H. Lai, X.-H. Chen, W. Zhang, and Y. Yu, “A survey on model-based reinforcement learning,” arXiv:2206.09328, 2022.
[16] X. Wang, Z. Zhang, and W. Zhang, “Model-based multi-agent reinforcement learning: Recent progress and prospects,” arXiv:2203.10603, 2022.
[17] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - solving sparse reward tasks from scratch,” in International Conference on Machine Learning, pp. 4344–4353, 2018.
[18] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine, “How to train your robot with deep reinforcement learning: Lessons we have learned,” The International Journal of Robotics Research, vol. 40, no. 4-5, pp. 698–721, 2021.
[19] S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone, “Curriculum learning for reinforcement learning domains: A framework and survey,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 7382–7431, 2020.
[20] C. Szepesvári, Algorithms for reinforcement learning. Springer, 2010.
[21] A. Maroti, “RBED: Reward based epsilon decay,” arXiv:1910.13701, 2019.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[23] W. Dabney, G. Ostrovski, and A. Barreto, “Temporally-extended epsilon-greedy exploration,” arXiv:2006.01782, 2020.
[24] H.-F. Chen, Stochastic approximation and its applications, vol. 64. Springer Science & Business Media, 2006.
[25] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
[26] J. Venter, “An extension of the Robbins-Monro procedure,” The Annals of Mathematical Statistics, vol. 38, no. 1, pp. 181–190, 1967.
[27] D. Ruppert, “Efficient estimations from a slowly convergent Robbins-Monro process,” tech. rep., Cornell University Operations Research and Industrial Engineering, 1988.
[28] J. Lagarias, “Euler's constant: Euler's work and modern developments,” Bulletin of the American Mathematical Society, vol. 50, no. 4, pp. 527–628, 2013.
[29] J. H. Conway and R. Guy, The book of numbers. Springer Science & Business Media, 1998.
[30] S. Ghosh, “The Basel problem,” arXiv:2010.03953, 2020.
[31] A. Dvoretzky, “On stochastic approximation,” in The Third Berkeley Symposium on Mathematical Statistics and Probability, 1956.
[32] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the convergence of stochastic iterative dynamic programming algorithms,” Neural Computation, vol. 6, no. 6, pp. 1185–1201, 1994.
[33] T. Kailath, A. H. Sayed, and B. Hassibi, Linear estimation. Prentice Hall, 2000.
[34] C. K. Chui and G. Chen, Kalman filtering. Springer, 2017.
[35] G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems,” tech. rep., Cambridge University, 1994.
[36] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering, “A theoretical and empirical analysis of Expected Sarsa,” in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177–184, 2009.
[37] M. Ganger, E. Duryea, and W. Hu, “Double Sarsa and double expected Sarsa with shallow and deep learning,” Journal of Data Analysis and Information Processing, vol. 4, no. 4, pp. 159–176, 2016.
[38] C. J. C. H. Watkins, Learning from delayed rewards. PhD thesis, King's College, 1989.
[39] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[40] T. C. Hesterberg, Advances in importance sampling. PhD thesis, Stanford University, 1988.
[41] H. van Hasselt, “Double Q-learning,” Advances in Neural Information Processing Systems, vol. 23, 2010.
[42] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI Conference on Artificial Intelligence, vol. 30, 2016.
[43] C. Dann, G. Neumann, and J. Peters, “Policy evaluation with temporal differences: A survey and comparison,” Journal of Machine Learning Research, vol. 15, pp. 809–883, 2014.
[44] J. Clifton and E. Laber, “Q-learning: Theory and applications,” Annual Review of Statistics and Its Application, vol. 7, pp. 279–301, 2020.
[45] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, “Q-learning algorithms: A comprehensive classification and applications,” IEEE Access, vol. 7, pp. 133653–133667, 2019.
[46] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[47] G. Strang, Linear algebra and its applications (4th Edition). Belmont, CA: Thomson Brooks/Cole, 2006.
[48] C. D. Meyer and I. Stewart, Matrix analysis and applied linear algebra. SIAM, 2023.
[49] M. Pinsky and S. Karlin, An introduction to stochastic modeling. Academic Press, 2010.
[50] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” The Journal of Machine Learning Research, vol. 4, pp. 1107–1149, 2003.
[51] R. Munos, “Error bounds for approximate policy iteration,” in International Conference on Machine Learning, vol. 3, pp. 560–567, 2003.
[52] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How, “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Foundations and Trends in Machine Learning, vol. 6, no. 4, pp. 375–451, 2013.
[53] B. Scherrer, “Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view,” in International Conference on Machine Learning, 2010.
[54] D. P. Bertsekas, Dynamic programming and optimal control: Approximate dynamic programming (Volume II). Athena Scientific, 2011.
[55] S. Abramovich, G. Jameson, and G. Sinnamon, “Refining Jensen's inequality,” Bulletin mathématique de la Société des Sciences Mathématiques de Roumanie, pp. 3–14, 2004.
[56] S. S. Dragomir, “Some reverses of the Jensen inequality with applications,” Bulletin of the Australian Mathematical Society, vol. 87, no. 2, pp. 177–194, 2013.
[57] S. J. Bradtke and A. G. Barto, “Linear least-squares algorithms for temporal difference learning,” Machine Learning, vol. 22, no. 1, pp. 33–57, 1996.
[58] K. S. Miller, “On the inverse of the sum of matrices,” Mathematics Magazine, vol. 54, no. 2, pp. 67–72, 1981.
[59] S. A. U. Islam and D. S. Bernstein, “Recursive least squares for real-time implementation,” IEEE Control Systems Magazine, vol. 39, no. 3, pp. 82–85, 2019.
[60] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv:1312.5602, 2013.
[61] J. Fan, Z. Wang, Y. Xie, and Z. Yang, “A theoretical analysis of deep Q-learning,” in Learning for Dynamics and Control, pp. 486–489, 2020.
[62] L.-J. Lin, Reinforcement learning for robots using neural networks. PhD thesis, Carnegie Mellon University, 1992.
[63] J. N. Tsitsiklis and B. Van Roy, “An analysis of temporal-difference learning with function approximation,” IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
[64] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, vol. 12, 1999.
[65] P. Marbach and J. N. Tsitsiklis, “Simulation-based optimization of Markov reward processes,” IEEE Transactions on Automatic Control, vol. 46, no. 2, pp. 191–209, 2001.
[66] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
[67] X.-R. Cao, “A basic formula for online policy gradient algorithms,” IEEE Transactions on Automatic Control, vol. 50, no. 5, pp. 696–699, 2005.
[68] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
[69] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
[70] E. Greensmith, P. L. Bartlett, and J. Baxter, “Variance reduction techniques for gradient estimates in reinforcement learning,” Journal of Machine Learning Research, vol. 5, no. 9, 2004.
[71] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1928–1937, 2016.
[72] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, “Reinforcement learning through asynchronous advantage actor-critic on a GPU,” arXiv:1611.06256, 2016.
[73] T. Degris, M. White, and R. S. Sutton, “Off-policy actor-critic,” arXiv:1205.4839, 2012.
[74] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning, pp. 387–395, 2014.
[75] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv:1509.02971, 2015.
[76] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning, pp. 1861–1870, 2018.
[77] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, and P. Abbeel, “Soft actor-critic algorithms and applications,” arXiv:1812.05905, 2018.
[78] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889–1897, 2015.
[79] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017.
[80] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning, pp. 1587–1596, 2018.
[81] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in AAAI Conference on Artificial Intelligence, vol. 32, 2018.
[82] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[83] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multi-agent reinforcement learning,” in International Conference on Machine Learning, pp. 5571–5580, 2018.
[84] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[85] Y. Yang and J. Wang, “An overview of multi-agent reinforcement learning from game theoretical perspective,” arXiv:2011.00583, 2020.
[86] S. Levine and V. Koltun, “Guided policy search,” in International Conference on Machine Learning, pp. 1–9, 2013.
[87] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-based policy optimization,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[88] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning, pp. 449–458, 2017.
[89] M. G. Bellemare, W. Dabney, and M. Rowland, Distributional Reinforcement Learning. MIT Press, 2023.
[90] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive dynamic programming for control: algorithms and stability. Springer Science & Business Media, 2012.
[91] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76–105, 2012.
[92] F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic programming for feedback control. John Wiley & Sons, 2013.
[93] Z.-P. Jiang, T. Bian, and W. Gao, “Learning-based control: A tutorial and some recent results,” Foundations and Trends in Systems and Control, vol. 8, no. 3, pp. 176–284, 2020.
[94] S. Meyn, Control systems and reinforcement learning. Cambridge University Press, 2022.
[95] S. E. Li, Reinforcement learning for sequential decision and optimal control. Springer, 2023.
[96] J. S. Rosenthal, A first look at rigorous probability theory (2nd Edition). World Scientific Publishing Company, 2006.
[97] D. Pollard, A user's guide to measure theoretic probability. Cambridge University Press, 2002.
[98] P. J. Spreij, “Measure theoretic probability,” UvA Course Notes, 2012.
[99] R. G. Bartle, The elements of integration and Lebesgue measure. John Wiley & Sons, 2014.
[100] M. Taboga, Lectures on probability theory and mathematical statistics (2nd Edition). CreateSpace Independent Publishing Platform, 2012.
[101] T. Kennedy, “Theory of probability,” 2007. Lecture notes.
[102] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 2000.
[103] L. Bottou, “Online learning and stochastic approximations,” Online Learning in Neural Networks, vol. 17, no. 9, p. 142, 1998.
[104] D. Williams, Probability with martingales. Cambridge University Press, 1991.
[105] M. Métivier, Semimartingales: A course on stochastic processes. Walter de Gruyter, 1982.
[106] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[107] S. Bubeck, “Convex optimization: Algorithms and complexity,” Foundations and Trends in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
[108] A. Jung, “A fixed-point of view on gradient methods for big data,” Frontiers in Applied Mathematics and Statistics, vol. 3, p. 18, 2017.