Preliminaries for Probability Theory
Reinforcement learning heavily relies on probability theory. We next summarize some concepts and results frequently used in this book.
⋄ Random variable: The term "variable" indicates that a random variable can take values from a set of numbers. The term "random" indicates that the value it takes must follow a probability distribution.
A random variable is usually denoted by a capital letter. Its value is usually denoted by a lowercase letter. For example, X is a random variable, and x is a value that X can take.
This book mainly considers the case where a random variable can only take a finite number of values. A random variable can be a scalar or a vector.
Like ordinary variables, random variables support the usual mathematical operations such as addition, multiplication, and absolute value. For example, if X, Y are two random variables, we can calculate X+Y, X+1, and XY.
⋄ Stochastic sequence: A stochastic sequence is a sequence of random variables.
One scenario we often encounter is collecting a sampling sequence {x_i}_{i=1}^n of a random variable X. For example, consider the task of tossing a die n times. Let x_i be a random variable representing the value obtained in the i-th toss. Then, {x_1, x_2, …, x_n} is a stochastic sequence.
It may be confusing to beginners why x_i is a random variable instead of a deterministic value. In fact, if the sampling sequence is {1, 6, 3, 5, …}, then this sequence is not a stochastic sequence because all the elements are already determined. However, if we use a variable x_i to represent the values that can possibly be sampled, it is a random variable since x_i can take any value in {1, …, 6}. Although x_i is a lowercase letter, it still represents a random variable.
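To make this concrete, the die-tossing experiment can be simulated; below is a minimal sketch in Python (numpy assumed; the number of tosses and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

n = 10  # number of tosses
# Each toss is a sample of the random variable X, uniform on {1, ..., 6}.
samples = rng.integers(low=1, high=7, size=n)
print(samples)  # one realization of the stochastic sequence {x_1, ..., x_n}
```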
⋄ Probability: The notation p(X=x) or p_X(x) describes the probability of the random variable X taking the value x. When the context is clear, p(X=x) is often written as p(x) for short.
⋄ Joint probability: The notation p(X=x,Y=y) or p(x,y) describes the probability of the random variable X taking the value x and Y taking the value y . One useful identity is as follows:
∑_y p(x,y) = p(x).
⋄ Conditional probability: The notation p(X=x∣A=a) describes the probability of the random variable X taking the value x given that the random variable A has already taken the value a. We often write p(X=x∣A=a) as p(x∣a) for short.
It holds that
p(x,a) = p(x∣a)p(a)
and
p(x∣a) = p(x,a)/p(a).
Since p(x) = ∑_a p(x,a), we have
p(x) = ∑_a p(x,a) = ∑_a p(x∣a)p(a),
which is called the law of total probability.
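As a quick numerical check, the following sketch verifies the law of total probability on an arbitrary joint distribution (the table below is an illustrative choice, not from the text):

```python
import numpy as np

# Arbitrary joint distribution p(x, a) over x in {0,1,2} and a in {0,1}.
p_xa = np.array([[0.10, 0.20],
                 [0.15, 0.25],
                 [0.05, 0.25]])  # rows: x, columns: a; entries sum to 1

p_a = p_xa.sum(axis=0)      # p(a) = sum_x p(x, a)
p_x_given_a = p_xa / p_a    # p(x|a) = p(x, a) / p(a), columnwise

# Law of total probability: p(x) = sum_a p(x|a) p(a).
p_x_marginal = p_xa.sum(axis=1)             # direct marginalization
p_x_total = (p_x_given_a * p_a).sum(axis=1)
assert np.allclose(p_x_marginal, p_x_total)
```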
⋄ Independence: Two random variables are independent if the sampling value of one random variable does not affect the other. Mathematically, X and Y are independent if
p(x,y) = p(x)p(y).
Since p(x,y) = p(x∣y)p(y), the above equation implies
p(x∣y) = p(x).
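As a sanity check, a joint distribution constructed as the outer product of two marginals is independent by construction, and p(x∣y) = p(x) follows (the marginals below are arbitrary choices):

```python
import numpy as np

p_x = np.array([0.2, 0.5, 0.3])   # arbitrary marginal of X
p_y = np.array([0.6, 0.4])        # arbitrary marginal of Y

# Independence: the joint factorizes as p(x, y) = p(x) p(y).
p_xy = np.outer(p_x, p_y)

# Then p(x|y) = p(x, y) / p(y) equals p(x) for every y.
p_x_given_y = p_xy / p_y
assert np.allclose(p_x_given_y, p_x[:, None])
```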
⋄ Conditional independence: Let X, A, B be three random variables. X is said to be conditionally independent of A given B if
p(X=x∣A=a, B=b) = p(X=x∣B=b).
In the context of reinforcement learning, consider three consecutive states s_t, s_{t+1}, s_{t+2}. Since they are obtained consecutively, s_{t+2} depends on s_{t+1} and also on s_t. However, if s_{t+1} is already given, then s_{t+2} is conditionally independent of s_t. That is,
p(s_{t+2}∣s_{t+1}, s_t) = p(s_{t+2}∣s_{t+1}).
This is the memoryless property of Markov processes.
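This property can be illustrated numerically. In the sketch below (an arbitrary two-state chain of our choosing), the joint distribution is built from the Markov factorization, and the conditional p(s_{t+2}∣s_{t+1}, s_t) indeed reduces to the transition probability p(s_{t+2}∣s_{t+1}) for every s_t:

```python
import numpy as np

# Arbitrary transition matrix of a 2-state Markov chain: P[i, j] = p(next=j | current=i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
mu = np.array([0.5, 0.5])  # arbitrary initial distribution of s_t

# Joint p(s_t, s_{t+1}, s_{t+2}) under the Markov factorization.
joint = mu[:, None, None] * P[:, :, None] * P[None, :, :]

# p(s_{t+2} | s_{t+1}, s_t) = joint / p(s_t, s_{t+1}).
p_next2_given_both = joint / joint.sum(axis=2, keepdims=True)

# Conditional independence: for every value of s_t, the result is P[s_{t+1}, s_{t+2}].
for i in range(2):
    assert np.allclose(p_next2_given_both[i], P)
```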
⋄ Law of total probability: The law of total probability was already mentioned when we introduced the concept of conditional probability. Due to its importance, we list it again below:
p(x) = ∑_y p(x,y)
and
p(x∣a) = ∑_y p(x,y∣a).
⋄ Chain rule of conditional probability and joint probability: By the definition of conditional probability, we have
p(a,b) = p(a∣b)p(b).
This can be extended to
p(a,b,c) = p(a∣b,c)p(b,c) = p(a∣b,c)p(b∣c)p(c),
and hence p(a,b,c)/p(c) = p(a,b∣c) = p(a∣b,c)p(b∣c). The fact that p(a,b∣c) = p(a∣b,c)p(b∣c) implies the following property:
p(x∣a) = ∑_b p(x,b∣a) = ∑_b p(x∣b,a)p(b∣a).
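The last identity can also be verified numerically; the following sketch (our illustration) uses a randomly generated joint distribution p(x, b, a):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Arbitrary joint p(x, b, a) over small finite sets, normalized to sum to 1.
p_xba = rng.random((3, 4, 2))
p_xba /= p_xba.sum()

p_a = p_xba.sum(axis=(0, 1))           # p(a)
p_x_given_a = p_xba.sum(axis=1) / p_a  # p(x|a)
p_b_given_a = p_xba.sum(axis=0) / p_a  # p(b|a)
p_ba = p_xba.sum(axis=0)               # p(b, a)
p_x_given_ba = p_xba / p_ba            # p(x|b, a)

# Check p(x|a) = sum_b p(x|b, a) p(b|a).
lhs = p_x_given_a
rhs = (p_x_given_ba * p_b_given_a).sum(axis=1)
assert np.allclose(lhs, rhs)
```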
⋄ Expectation/expected value/mean: Suppose that X is a random variable and the probability that it takes the value x is p(x). The expectation (also called expected value or mean) of X is defined as
E[X] = ∑_x p(x)x.
The linearity property of expectation is
E[X+Y] = E[X] + E[Y],  E[aX] = aE[X].
The second equation above can be trivially proven by definition. The first equation is proven below:
E[X+Y] = ∑_x ∑_y (x+y) p(X=x, Y=y) = ∑_x x ∑_y p(x,y) + ∑_y y ∑_x p(x,y) = ∑_x x p(x) + ∑_y y p(y) = E[X] + E[Y].
Due to the linearity of expectation, we have the following useful fact:
E[∑_i a_i X_i] = ∑_i a_i E[X_i].
Similarly, it can be proven that
E[AX] = AE[X],
where A ∈ R^{n×n} is a deterministic matrix and X ∈ R^n is a random vector.
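These linearity properties are easy to confirm by simulation. The sketch below uses arbitrary distributions of our choosing; note that X and Y are deliberately made dependent, since linearity does not require independence:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n = 1_000_000
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = 3.0 * X + rng.normal(size=n)   # Y is deliberately correlated with X

# Linearity holds regardless of dependence: E[X + Y] = E[X] + E[Y].
print(np.mean(X + Y), np.mean(X) + np.mean(Y))   # identical sample means

# E[AX] = A E[X] for a deterministic matrix A and a random vector X.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
V = rng.normal(size=(n, 2))        # n samples of a random vector in R^2
print((V @ A.T).mean(axis=0), A @ V.mean(axis=0))  # equal sample statistics
```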
⋄ Conditional expectation: The definition of conditional expectation is
E[X∣A=a] = ∑_x x p(x∣a).
Similar to the law of total probability, we have the law of total expectation:
E[X] = ∑_a E[X∣A=a] p(a).
The proof is as follows. By the definition of expectation, it holds that
∑_a E[X∣A=a] p(a) = ∑_a [∑_x p(x∣a)x] p(a) = ∑_x ∑_a p(x∣a)p(a) x = ∑_x [∑_a p(x∣a)p(a)] x = ∑_x p(x)x = E[X].
The law of total expectation is frequently used in reinforcement learning.
Similarly, conditional expectation satisfies
E[X∣A=a] = ∑_b E[X∣A=a, B=b] p(b∣a).
This equation is useful in the derivation of the Bellman equation. A hint of its proof is the chain rule: p(x∣a,b)p(b∣a) = p(x,b∣a).
Finally, it is worth noting that E[X∣A=a] is different from E[X∣A] . The former is a value, whereas the latter is a random variable. In fact, E[X∣A] is a function of the random variable A . We need rigorous probability theory to define E[X∣A] .
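A small numerical check of the law of total expectation, using an arbitrary joint distribution and arbitrary values of X (an illustration we choose here):

```python
import numpy as np

# Arbitrary joint p(x, a); rows index values of X, columns values of A.
p_xa = np.array([[0.1, 0.3],
                 [0.2, 0.1],
                 [0.2, 0.1]])
x_vals = np.array([1.0, 2.0, 5.0])  # values that X can take

p_a = p_xa.sum(axis=0)    # p(a)
p_x_given_a = p_xa / p_a  # p(x|a)

E_X = (p_xa.sum(axis=1) * x_vals).sum()                    # E[X] from the marginal
E_X_given_a = (p_x_given_a * x_vals[:, None]).sum(axis=0)  # E[X|A=a] for each a

# Law of total expectation: E[X] = sum_a E[X|A=a] p(a).
assert np.isclose(E_X, (E_X_given_a * p_a).sum())
```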
⋄ Gradient of expectation: Let f(X,β) be a scalar function of a random variable X and a deterministic parameter vector β . Then,
∇_β E[f(X,β)] = E[∇_β f(X,β)].
Proof: Since E[f(X,β)] = ∑_x f(x,β)p(x), where the distribution p(x) does not depend on β, we have
∇_β E[f(X,β)] = ∇_β ∑_x f(x,β)p(x) = ∑_x ∇_β f(x,β)p(x) = E[∇_β f(X,β)].
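The interchange of gradient and expectation can be checked numerically; the sketch below compares a finite-difference gradient of E[f(X,β)] with E[∇_β f(X,β)] for an arbitrary smooth f of our own choosing:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])      # arbitrary distribution of X
x_vals = np.array([-1.0, 0.0, 2.0])
beta = np.array([0.7, -0.4])       # arbitrary parameter vector

def f(x, b):
    # An arbitrary smooth scalar function of x and beta.
    return x * b[0] ** 2 + np.sin(b[1]) * x ** 2

def grad_f(x, b):
    # Analytic gradient of f with respect to beta.
    return np.array([2.0 * x * b[0], np.cos(b[1]) * x ** 2])

# E[grad_beta f(X, beta)]: average the gradient under p(x).
expected_grad = sum(p_i * grad_f(x_i, beta) for p_i, x_i in zip(p, x_vals))

# Finite-difference gradient of E[f(X, beta)] with respect to beta.
def E_f(b):
    return sum(p_i * f(x_i, b) for p_i, x_i in zip(p, x_vals))

eps = 1e-6
fd_grad = np.array([
    (E_f(beta + eps * e) - E_f(beta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
assert np.allclose(expected_grad, fd_grad, atol=1e-5)
```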
⋄ Variance, covariance, covariance matrix: For a single random variable X, its variance is defined as var(X) = E[(X − x̄)²], where x̄ = E[X]. For two random variables X, Y, their covariance is defined as cov(X,Y) = E[(X − x̄)(Y − ȳ)]. For a random vector X = [X_1, …, X_n]^T, the covariance matrix of X is defined as var(X) ≐ Σ = E[(X − x̄)(X − x̄)^T] ∈ R^{n×n}. The ij-th entry of Σ is [Σ]_ij = E[[X − x̄]_i [X − x̄]_j] = E[(X_i − x̄_i)(X_j − x̄_j)] = cov(X_i, X_j). One trivial property is var(a) = 0 if a is deterministic. Moreover, it can be verified that var(AX + a) = var(AX) = A var(X) A^T = AΣA^T.
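The identity var(AX + a) = AΣA^T can be verified with samples; in the sketch below, Σ, A, and a are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n = 1_000_000

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # arbitrary covariance matrix
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=n)

A = np.array([[1.0, 1.0],
              [0.0, 2.0]])      # arbitrary deterministic matrix
a = np.array([3.0, -2.0])       # arbitrary deterministic shift

Y = X @ A.T + a                       # samples of AX + a
sample_cov = np.cov(Y, rowvar=False)  # estimate of var(AX + a)
print(sample_cov)
print(A @ Sigma @ A.T)                # theory: A Sigma A^T, independent of a
```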
Some useful facts are summarized below.
Fact: E[(X − x̄)(Y − ȳ)] = E[XY] − x̄ȳ = E[XY] − E[X]E[Y].
Proof: E[(X − x̄)(Y − ȳ)] = E[XY − Xȳ − x̄Y + x̄ȳ] = E[XY] − E[X]ȳ − x̄E[Y] + x̄ȳ = E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y] = E[XY] − E[X]E[Y].
Fact: E[XY] = E[X]E[Y] if X, Y are independent.
Proof: E[XY] = ∑_x ∑_y p(x,y)xy = ∑_x ∑_y p(x)p(y)xy = ∑_x p(x)x ∑_y p(y)y = E[X]E[Y].
Fact: cov(X,Y)=0 if X,Y are independent.
Proof: When X,Y are independent, cov(X,Y)=E[XY]−E[X]E[Y]=E[X]E[Y]−E[X]E[Y]=0 .
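These facts can also be confirmed empirically by drawing independent samples and checking that the sample covariance is near zero (the distributions and seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 1_000_000

X = rng.uniform(0.0, 1.0, size=n)  # X and Y are drawn independently
Y = rng.normal(2.0, 1.0, size=n)

# For independent X, Y: E[XY] ≈ E[X]E[Y] and cov(X, Y) ≈ 0.
print(np.mean(X * Y), np.mean(X) * np.mean(Y))  # approximately equal
print(np.cov(X, Y)[0, 1])                       # close to zero
```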