
Preliminaries for Probability Theory

Reinforcement learning heavily relies on probability theory. We next summarize some concepts and results frequently used in this book.

\diamond Random variable: The term "variable" indicates that a random variable can take values from a set of numbers. The term "random" indicates that the value it takes is governed by a probability distribution.

A random variable is usually denoted by a capital letter. Its value is usually denoted by a lowercase letter. For example, $X$ is a random variable, and $x$ is a value that $X$ can take.

This book mainly considers the case where a random variable can only take a finite number of values. A random variable can be a scalar or a vector.

Like ordinary variables, random variables admit the usual mathematical operations such as summation, product, and absolute value. For example, if $X, Y$ are two random variables, we can calculate $X + Y$, $X + 1$, and $XY$.

\diamond A stochastic sequence is a sequence of random variables.

One scenario we often encounter is collecting a stochastic sampling sequence $\{x_i\}_{i=1}^{n}$ of a random variable $X$. For example, consider the task of tossing a die $n$ times. Let $x_i$ be a random variable representing the value obtained in the $i$th toss. Then, $\{x_1, x_2, \ldots, x_n\}$ is a stochastic sequence.

It may be confusing to beginners why $x_i$ is a random variable instead of a deterministic value. In fact, if the sampling sequence is $\{1, 6, 3, 5, \ldots\}$, then this sequence is not a stochastic sequence because all the elements are already determined. However, if we use a variable $x_i$ to represent the values that can possibly be sampled, it is a random variable since $x_i$ can take any value in $\{1, \dots, 6\}$. Although $x_i$ is a lowercase letter, it still represents a random variable.
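
To make the distinction concrete, the following minimal Python sketch draws one realization of the sequence $\{x_i\}_{i=1}^n$ for a fair die. The number of tosses, the random seed, and the use of numpy are illustrative assumptions, not part of the text.

```python
import numpy as np

# One realization of the stochastic sequence {x_i}_{i=1}^n for a fair die.
# Before sampling, each x_i is a random variable taking values in {1, ..., 6};
# the printed array is one concrete sample sequence such as {1, 6, 3, 5, ...}.
rng = np.random.default_rng(1)   # the seed is an arbitrary choice
n = 10                           # number of tosses (also arbitrary)
samples = rng.integers(1, 7, size=n)
print(samples)
```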

\diamond Probability: The notation $p(X = x)$ or $p_X(x)$ describes the probability of the random variable $X$ taking the value $x$. When the context is clear, $p(X = x)$ is often written as $p(x)$ for short.

\diamond Joint probability: The notation $p(X = x, Y = y)$ or $p(x, y)$ describes the probability of the random variable $X$ taking the value $x$ and $Y$ taking the value $y$. One useful identity is as follows:

\sum_{y} p(x, y) = p(x).
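
This identity can be checked numerically. The sketch below uses a small hypothetical joint table $p(x, y)$ (the numbers are made up for illustration) and verifies that summing over $y$ yields the marginal $p(x)$.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): x in {0, 1} (rows), y in {0, 1, 2} (columns).
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)   # a valid joint distribution

p_x = p_xy.sum(axis=1)               # summing over y gives the marginal p(x)
print(p_x)                           # [0.4 0.6]
```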

\diamond Conditional probability: The notation $p(X = x | A = a)$ describes the probability of the random variable $X$ taking the value $x$ given that the random variable $A$ has already taken the value $a$. We often write $p(X = x | A = a)$ as $p(x|a)$ for short.

It holds that

p(x, a) = p(x | a) p(a)

and

p(x | a) = \frac{p(x, a)}{p(a)}.

Since $p(x) = \sum_{a} p(x, a)$, we have

p(x) = \sum_{a} p(x, a) = \sum_{a} p(x | a) p(a),

which is called the law of total probability.
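
As a quick numerical illustration of the law of total probability, the sketch below builds $p(x)$ from hypothetical values of $p(a)$ and $p(x|a)$; the numbers are arbitrary and only serve to show the computation.

```python
import numpy as np

p_a = np.array([0.3, 0.7])            # hypothetical p(a), a in {0, 1}
p_x_given_a = np.array([[0.2, 0.8],   # p(x | a=0), x in {0, 1}
                        [0.5, 0.5]])  # p(x | a=1)

# Law of total probability: p(x) = sum_a p(x|a) p(a)
p_x = (p_x_given_a * p_a[:, None]).sum(axis=0)
print(p_x)                            # [0.41 0.59], which sums to 1
```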

\diamond Independence: Two random variables are independent if the sampled value of one does not affect the distribution of the other. Mathematically, $X$ and $Y$ are independent if

p(x, y) = p(x) p(y).

Since $p(x, y) = p(x|y) p(y)$, the above equation implies

p(x | y) = p(x).
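
The factorization can be verified directly on a joint table. In the sketch below, a hypothetical joint distribution is constructed as the outer product of two marginals, and the independence criterion $p(x, y) = p(x) p(y)$ is then checked against the marginals recovered from the joint.

```python
import numpy as np

# A joint table built as an outer product, so X and Y are independent by construction.
p_xy = np.outer([0.4, 0.6], [0.1, 0.3, 0.6])
p_x = p_xy.sum(axis=1)                         # marginal p(x)
p_y = p_xy.sum(axis=0)                         # marginal p(y)

# Independence criterion: p(x, y) = p(x) p(y) for every pair (x, y)
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # True
```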

\diamond Conditional independence: Let $X, A, B$ be three random variables. $X$ is said to be conditionally independent of $A$ given $B$ if

p(X = x \mid A = a, B = b) = p(X = x \mid B = b).

In the context of reinforcement learning, consider three consecutive states: $s_t$, $s_{t+1}$, $s_{t+2}$. Since they are obtained consecutively, $s_{t+2}$ is dependent on $s_{t+1}$ and also on $s_t$. However, if $s_{t+1}$ is already given, then $s_{t+2}$ is conditionally independent of $s_t$. That is,

p(s_{t+2} \mid s_{t+1}, s_t) = p(s_{t+2} \mid s_{t+1}).

This is also the memoryless property of Markov processes.
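
The following sketch illustrates this property with a hypothetical two-state Markov chain: the joint distribution of three consecutive states is built from a transition matrix $P$, and $p(s_{t+2} \mid s_{t+1}, s_t)$ is compared with $p(s_{t+2} \mid s_{t+1})$. The matrix entries and the initial distribution are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical 2-state transition matrix: P[i, j] = p(s_{t+1}=j | s_t=i).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
p0 = np.array([0.5, 0.5])   # distribution of s_t (arbitrary)

# joint[i, j, k] = p(s_t=i, s_{t+1}=j, s_{t+2}=k) = p0[i] P[i, j] P[j, k]
joint = p0[:, None, None] * P[:, :, None] * P[None, :, :]

for i in range(2):
    for j in range(2):
        cond = joint[i, j, :] / joint[i, j, :].sum()  # p(s_{t+2} | s_{t+1}=j, s_t=i)
        assert np.allclose(cond, P[j])                # equals p(s_{t+2} | s_{t+1}=j)
print("Markov property verified")
```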

\diamond Law of total probability: The law of total probability was already mentioned when we introduced the concept of conditional probability. Due to its importance, we list it again below:

p(x) = \sum_{y} p(x, y)

and

p(x | a) = \sum_{y} p(x, y | a).

\diamond Chain rule of conditional probability and joint probability: By the definition of conditional probability, we have

p(a, b) = p(a | b) p(b).

This can be extended to

p(a, b, c) = p(a | b, c) p(b, c) = p(a | b, c) p(b | c) p(c),

and hence $p(a, b, c) / p(c) = p(a, b | c) = p(a | b, c) p(b | c)$. The fact that $p(a, b | c) = p(a | b, c) p(b | c)$ implies the following property:

p(x | a) = \sum_{b} p(x, b | a) = \sum_{b} p(x | b, a) p(b | a).
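
This decomposition can also be checked numerically. The sketch below generates a hypothetical joint distribution $p(x, b, a)$ and verifies that conditioning in the two ways shown above gives the same $p(x|a)$.

```python
import numpy as np

# Hypothetical joint distribution p(x, b, a) stored as an array indexed [x, b, a].
rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))
p /= p.sum()                                    # normalize to a valid joint

p_a = p.sum(axis=(0, 1))                        # p(a)
p_ba = p.sum(axis=0)                            # p(b, a)
p_x_given_a = p.sum(axis=1) / p_a               # p(x|a) = p(x, a) / p(a)
p_x_given_ba = p / p_ba                         # p(x|b, a) = p(x, b, a) / p(b, a)
p_b_given_a = p_ba / p_a                        # p(b|a)

rhs = (p_x_given_ba * p_b_given_a).sum(axis=1)  # sum_b p(x|b,a) p(b|a)
print(np.allclose(p_x_given_a, rhs))            # True
```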

\diamond Expectation/expected value/mean: Suppose that $X$ is a random variable and the probability of taking the value $x$ is $p(x)$. The expectation, expected value, or mean of $X$ is defined as

\mathbb{E}[X] = \sum_{x} p(x) x.

The linearity property of expectation is

\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y],
\mathbb{E}[aX] = a\mathbb{E}[X].

The second equation above can be proven directly from the definition. The first equation is proven below:

\begin{aligned}
\mathbb{E}[X + Y] &= \sum_{x} \sum_{y} (x + y) p(X = x, Y = y) \\
&= \sum_{x} x \sum_{y} p(x, y) + \sum_{y} y \sum_{x} p(x, y) \\
&= \sum_{x} x p(x) + \sum_{y} y p(y) \\
&= \mathbb{E}[X] + \mathbb{E}[Y].
\end{aligned}

Due to the linearity of expectation, we have the following useful fact:

\mathbb{E}\left[\sum_{i} a_i X_i\right] = \sum_{i} a_i \mathbb{E}[X_i].

Similarly, it can be proven that

\mathbb{E}[AX] = A\mathbb{E}[X],

where $A \in \mathbb{R}^{n \times n}$ is a deterministic matrix and $X \in \mathbb{R}^n$ is a random vector.
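
The sketch below checks $\mathbb{E}[AX] = A\mathbb{E}[X]$ (and, implicitly, the linearity it rests on) for a small discrete random vector; the values, probabilities, and the matrix $A$ are hypothetical.

```python
import numpy as np

# Hypothetical discrete random vector X: rows of `values` are its possible values.
values = np.array([[1.0, 2.0],
                   [0.0, -1.0],
                   [3.0, 1.0]])
probs = np.array([0.2, 0.5, 0.3])   # p(X = values[k])
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])          # deterministic matrix

E_X = probs @ values                # E[X] = sum_k p_k values[k]
E_AX = probs @ (values @ A.T)       # E[AX], using (A x)^T = x^T A^T
print(np.allclose(E_AX, A @ E_X))   # True
```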

\diamond Conditional expectation: The definition of conditional expectation is

\mathbb{E}[X | A = a] = \sum_{x} x p(x | a).

Similar to the law of total probability, we have the law of total expectation:

\mathbb{E}[X] = \sum_{a} \mathbb{E}[X | A = a] p(a).

The proof is as follows. By the definition of expectation, it holds that

\begin{aligned}
\sum_{a} \mathbb{E}[X | A = a] p(a) &= \sum_{a} \left[ \sum_{x} p(x | a) x \right] p(a) \\
&= \sum_{x} \sum_{a} p(x | a) p(a) x \\
&= \sum_{x} \left[ \sum_{a} p(x | a) p(a) \right] x \\
&= \sum_{x} p(x) x \\
&= \mathbb{E}[X].
\end{aligned}

The law of total expectation is frequently used in reinforcement learning.
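
The law of total expectation can also be verified on a small example. The sketch below uses a hypothetical joint table $p(x, a)$ and compares $\mathbb{E}[X]$ computed directly with $\sum_a \mathbb{E}[X | A = a] p(a)$.

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])     # possible values of X
p_xa = np.array([[0.10, 0.15],         # hypothetical joint p(x, a), rows indexed by x
                 [0.20, 0.25],
                 [0.05, 0.25]])
p_a = p_xa.sum(axis=0)                 # p(a)
p_x = p_xa.sum(axis=1)                 # p(x)

E_X = x_vals @ p_x                     # E[X] computed directly
E_X_given_a = x_vals @ (p_xa / p_a)    # E[X|A=a] for each a
print(np.isclose(E_X, E_X_given_a @ p_a))   # True: law of total expectation
```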

Similarly, conditional expectation satisfies

\mathbb{E}[X | A = a] = \sum_{b} \mathbb{E}[X | A = a, B = b] p(b | a).

This equation is useful in the derivation of the Bellman equation. A hint for its proof is the chain rule: $p(x | a, b) p(b | a) = p(x, b | a)$.

Finally, it is worth noting that $\mathbb{E}[X | A = a]$ is different from $\mathbb{E}[X | A]$. The former is a value, whereas the latter is a random variable. In fact, $\mathbb{E}[X | A]$ is a function of the random variable $A$. We need rigorous probability theory to define $\mathbb{E}[X | A]$.

\diamond Gradient of expectation: Let $f(X, \beta)$ be a scalar function of a random variable $X$ and a deterministic parameter vector $\beta$. Then,

\nabla_{\beta} \mathbb{E}[f(X, \beta)] = \mathbb{E}[\nabla_{\beta} f(X, \beta)].

Proof: Since $\mathbb{E}[f(X, \beta)] = \sum_{x} f(x, \beta) p(x)$, we have $\nabla_{\beta} \mathbb{E}[f(X, \beta)] = \nabla_{\beta} \sum_{x} f(x, \beta) p(x) = \sum_{x} \nabla_{\beta} f(x, \beta) p(x) = \mathbb{E}[\nabla_{\beta} f(X, \beta)]$.
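
As a simple check that the gradient and the expectation can be exchanged, the sketch below uses the hypothetical function $f(x, \beta) = (x - \beta)^2$ with a scalar $\beta$ and a small discrete $X$, and compares the two sides of the identity; all numbers are illustrative.

```python
import numpy as np

# Hypothetical example: f(x, beta) = (x - beta)**2 with scalar beta, so that
# grad_beta f(x, beta) = 2 * (beta - x).
x_vals = np.array([1.0, 2.0, 4.0])
probs = np.array([0.2, 0.5, 0.3])
beta = 1.5

grad_of_expectation = 2 * (beta - probs @ x_vals)        # d/d(beta) of E[(X - beta)^2]
expectation_of_grad = probs @ (2 * (beta - x_vals))      # E[ d/d(beta) (X - beta)^2 ]
print(np.isclose(grad_of_expectation, expectation_of_grad))  # True
```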

\diamond Variance, covariance, covariance matrix: For a single random variable $X$, its variance is defined as $\operatorname{var}(X) = \mathbb{E}[(X - \bar{x})^2]$, where $\bar{x} = \mathbb{E}[X]$. For two random variables $X, Y$, their covariance is defined as $\operatorname{cov}(X, Y) = \mathbb{E}[(X - \bar{x})(Y - \bar{y})]$. For a random vector $X = [X_1, \ldots, X_n]^T$, the covariance matrix of $X$ is defined as $\operatorname{var}(X) \doteq \Sigma = \mathbb{E}[(X - \bar{x})(X - \bar{x})^T] \in \mathbb{R}^{n \times n}$. The $ij$th entry of $\Sigma$ is $[\Sigma]_{ij} = \mathbb{E}[[X - \bar{x}]_i [X - \bar{x}]_j] = \mathbb{E}[(X_i - \bar{x}_i)(X_j - \bar{x}_j)] = \operatorname{cov}(X_i, X_j)$. One trivial property is $\operatorname{var}(a) = 0$ if $a$ is deterministic. Moreover, it can be verified that $\operatorname{var}(AX + a) = \operatorname{var}(AX) = A\operatorname{var}(X)A^T = A\Sigma A^T$.
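
The identity $\operatorname{var}(AX + a) = A\Sigma A^T$ can be checked numerically. The sketch below computes the covariance matrix of a small discrete random vector exactly from its distribution and compares it with the covariance matrix of the transformed vector; the values, probabilities, $A$, and $a$ are hypothetical.

```python
import numpy as np

# Hypothetical discrete random vector X, deterministic matrix A and vector a.
values = np.array([[1.0, 0.0],
                   [2.0, 1.0],
                   [0.0, 3.0]])
probs = np.array([0.3, 0.4, 0.3])
A = np.array([[1.0, 2.0],
              [0.0, -1.0]])
a = np.array([5.0, -2.0])

def cov(vals, p):
    """Exact covariance matrix E[(X - x_bar)(X - x_bar)^T] of a discrete random vector."""
    centered = vals - p @ vals
    return (p[:, None] * centered).T @ centered

Sigma = cov(values, probs)
Sigma_AXa = cov(values @ A.T + a, probs)          # covariance matrix of AX + a
print(np.allclose(Sigma_AXa, A @ Sigma @ A.T))    # True
```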

Some useful facts are summarized below.

  • Fact: $\mathbb{E}[(X - \bar{x})(Y - \bar{y})] = \mathbb{E}[XY] - \bar{x}\bar{y} = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$.

Proof: $\mathbb{E}[(X - \bar{x})(Y - \bar{y})] = \mathbb{E}[XY - X\bar{y} - \bar{x}Y + \bar{x}\bar{y}] = \mathbb{E}[XY] - \mathbb{E}[X]\bar{y} - \bar{x}\mathbb{E}[Y] + \bar{x}\bar{y} = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[X]\mathbb{E}[Y] + \mathbb{E}[X]\mathbb{E}[Y] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$.

  • Fact: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ if $X, Y$ are independent.

Proof: $\mathbb{E}[XY] = \sum_x \sum_y p(x, y) x y = \sum_x \sum_y p(x) p(y) x y = \sum_x p(x) x \sum_y p(y) y = \mathbb{E}[X]\mathbb{E}[Y]$.

  • Fact: $\operatorname{cov}(X, Y) = 0$ if $X, Y$ are independent.

Proof: When $X, Y$ are independent, $\operatorname{cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[X]\mathbb{E}[Y] = 0$.
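
The last two facts can be illustrated with two independent fair dice, for which $\mathbb{E}[XY]$ coincides with $\mathbb{E}[X]\mathbb{E}[Y]$ and hence $\operatorname{cov}(X, Y) = 0$. The sketch below performs the exact computation over the joint distribution.

```python
import numpy as np

faces = np.arange(1, 7)          # values of each fair die
p = np.full(6, 1 / 6)            # uniform probabilities

E_X = E_Y = p @ faces            # 3.5 for a fair die
# For independent dice, p(x, y) = p(x) p(y), so E[XY] is an exact double sum.
E_XY = np.sum(np.outer(p, p) * np.outer(faces, faces))
print(np.isclose(E_XY, E_X * E_Y))   # True, hence cov(X, Y) = 0
```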
