Preliminaries for Probability Theory
Reinforcement learning heavily relies on probability theory. We next summarize some concepts and results frequently used in this book.
⋄ Random variable: The term "variable" indicates that a random variable can take values from a set of numbers. The term "random" indicates that the value it takes must follow a probability distribution.
A random variable is usually denoted by a capital letter. Its value is usually denoted by a lowercase letter. For example, X is a random variable, and x is a value that X can take.
This book mainly considers the case where a random variable can only take a finite number of values. A random variable can be a scalar or a vector.
Like ordinary variables, random variables support the usual mathematical operations such as addition, multiplication, and absolute value. For example, if X, Y are two random variables, we can calculate X+Y, X+1, and XY.
⋄ Stochastic sequence: A stochastic sequence is a sequence of random variables.
One scenario we often encounter is collecting a sampling sequence {x_i}_{i=1}^n of a random variable X. For example, consider the task of tossing a die n times. Let x_i be a random variable representing the value obtained in the i-th toss. Then, {x_1, x_2, …, x_n} is a stochastic sequence.
It may be confusing to beginners why x_i is a random variable instead of a deterministic value. In fact, if the sampling sequence is {1, 6, 3, 5, …}, then this sequence is not a stochastic sequence because all the elements are already determined. However, if we use a variable x_i to represent the values that can possibly be sampled, it is a random variable since x_i can take any value in {1, …, 6}. Although x_i is a lowercase letter, it still represents a random variable.
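To make this concrete, the die-tossing experiment can be simulated; below is a minimal sketch in Python (numpy assumed; the number of tosses and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

n = 10  # number of tosses
# Each toss is a sample of the random variable X, uniform on {1, ..., 6}.
samples = rng.integers(low=1, high=7, size=n)
print(samples)  # one realization of the stochastic sequence {x_1, ..., x_n}
```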
⋄ Probability: The notation p(X=x) or p_X(x) describes the probability of the random variable X taking the value x. When the context is clear, p(X=x) is often written as p(x) for short.
⋄ Joint probability: The notation p(X=x,Y=y) or p(x,y) describes the probability of the random variable X taking the value x and Y taking the value y . One useful identity is as follows:
∑_y p(x,y) = p(x).
⋄ Conditional probability: The notation p(X=x∣A=a) describes the probability of the random variable X taking the value x given that the random variable A has already taken the value a. We often write p(X=x∣A=a) as p(x∣a) for short.
It holds that
p(x,a) = p(x∣a)p(a)
and
p(x∣a) = p(x,a)/p(a).
Since p(x) = ∑_a p(x,a), we have
p(x) = ∑_a p(x,a) = ∑_a p(x∣a)p(a),
which is called the law of total probability.
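As a quick numerical check, the following sketch verifies the law of total probability on an arbitrary joint distribution (the table below is an illustrative choice, not from the text):

```python
import numpy as np

# Arbitrary joint distribution p(x, a) over x in {0,1,2} and a in {0,1}.
p_xa = np.array([[0.10, 0.20],
                 [0.15, 0.25],
                 [0.05, 0.25]])  # rows: x, columns: a; entries sum to 1

p_a = p_xa.sum(axis=0)      # p(a) = sum_x p(x, a)
p_x_given_a = p_xa / p_a    # p(x|a) = p(x, a) / p(a), columnwise

# Law of total probability: p(x) = sum_a p(x|a) p(a).
p_x_marginal = p_xa.sum(axis=1)             # direct marginalization
p_x_total = (p_x_given_a * p_a).sum(axis=1)
assert np.allclose(p_x_marginal, p_x_total)
```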
⋄ Independence: Two random variables are independent if the sampling value of one random variable does not affect the other. Mathematically, X and Y are independent if
p(x,y) = p(x)p(y).
Since p(x,y) = p(x∣y)p(y), the above equation implies
p(x∣y) = p(x).
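As a sanity check, a joint distribution constructed as the outer product of two marginals is independent by construction, and p(x∣y) = p(x) follows (the marginals below are arbitrary choices):

```python
import numpy as np

p_x = np.array([0.2, 0.5, 0.3])   # arbitrary marginal of X
p_y = np.array([0.6, 0.4])        # arbitrary marginal of Y

# Independence: the joint factorizes as p(x, y) = p(x) p(y).
p_xy = np.outer(p_x, p_y)

# Then p(x|y) = p(x, y) / p(y) equals p(x) for every y.
p_x_given_y = p_xy / p_y
assert np.allclose(p_x_given_y, p_x[:, None])
```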
⋄ Conditional independence: Let X, A, B be three random variables. X is said to be conditionally independent of A given B if
p(X=x∣A=a, B=b) = p(X=x∣B=b).
In the context of reinforcement learning, consider three consecutive states s_t, s_{t+1}, s_{t+2}. Since they are obtained consecutively, s_{t+2} depends on s_{t+1} and also on s_t. However, if s_{t+1} is already given, then s_{t+2} is conditionally independent of s_t. That is,
p(s_{t+2}∣s_{t+1}, s_t) = p(s_{t+2}∣s_{t+1}).
This is the memoryless property of Markov processes.
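This property can be illustrated numerically. In the sketch below (an arbitrary two-state chain of our choosing), the joint distribution is built from the Markov factorization, and the conditional p(s_{t+2}∣s_{t+1}, s_t) indeed reduces to the transition probability p(s_{t+2}∣s_{t+1}) for every s_t:

```python
import numpy as np

# Arbitrary transition matrix of a 2-state Markov chain: P[i, j] = p(next=j | current=i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
mu = np.array([0.5, 0.5])  # arbitrary initial distribution of s_t

# Joint p(s_t, s_{t+1}, s_{t+2}) under the Markov factorization.
joint = mu[:, None, None] * P[:, :, None] * P[None, :, :]

# p(s_{t+2} | s_{t+1}, s_t) = joint / p(s_t, s_{t+1}).
p_next2_given_both = joint / joint.sum(axis=2, keepdims=True)

# Conditional independence: for every value of s_t, the result is P[s_{t+1}, s_{t+2}].
for i in range(2):
    assert np.allclose(p_next2_given_both[i], P)
```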
⋄ Law of total probability: The law of total probability was already mentioned when we introduced the concept of conditional probability. Due to its importance, we list it again below:
p(x) = ∑_y p(x,y)
and
p(x∣a) = ∑_y p(x,y∣a).
⋄ Chain rule of conditional probability and joint probability: By the definition of conditional probability, we have
p(a,b) = p(a∣b)p(b).
This can be extended to
p(a,b,c) = p(a∣b,c)p(b,c) = p(a∣b,c)p(b∣c)p(c),
and hence p(a,b,c)/p(c) = p(a,b∣c) = p(a∣b,c)p(b∣c). The fact that p(a,b∣c) = p(a∣b,c)p(b∣c) implies the following property:
p(x∣a) = ∑_b p(x,b∣a) = ∑_b p(x∣b,a)p(b∣a).
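The last identity can also be verified numerically; the following sketch (our illustration) uses a randomly generated joint distribution p(x, b, a):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Arbitrary joint p(x, b, a) over small finite sets, normalized to sum to 1.
p_xba = rng.random((3, 4, 2))
p_xba /= p_xba.sum()

p_a = p_xba.sum(axis=(0, 1))           # p(a)
p_x_given_a = p_xba.sum(axis=1) / p_a  # p(x|a)
p_b_given_a = p_xba.sum(axis=0) / p_a  # p(b|a)
p_ba = p_xba.sum(axis=0)               # p(b, a)
p_x_given_ba = p_xba / p_ba            # p(x|b, a)

# Check p(x|a) = sum_b p(x|b, a) p(b|a).
lhs = p_x_given_a
rhs = (p_x_given_ba * p_b_given_a).sum(axis=1)
assert np.allclose(lhs, rhs)
```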
⋄ Expectation/expected value/mean: Suppose that X is a random variable and the probability that it takes the value x is p(x). The expectation (also called expected value or mean) of X is defined as
E[X] = ∑_x p(x)x.
The linearity property of expectation is
E[X+Y] = E[X] + E[Y],  E[aX] = aE[X].
The second equation above can be trivially proven by definition. The first equation is proven below:
E[X+Y] = ∑_x ∑_y (x+y) p(X=x, Y=y) = ∑_x x ∑_y p(x,y) + ∑_y y ∑_x p(x,y) = ∑_x x p(x) + ∑_y y p(y) = E[X] + E[Y].
Due to the linearity of expectation, we have the following useful fact:
E[∑_i a_i X_i] = ∑_i a_i E[X_i].
Similarly, it can be proven that
E[AX] = AE[X],
where A ∈ R^{n×n} is a deterministic matrix and X ∈ R^n is a random vector.
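These linearity properties are easy to confirm by simulation. The sketch below uses arbitrary distributions of our choosing; note that X and Y are deliberately made dependent, since linearity does not require independence:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n = 1_000_000
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = 3.0 * X + rng.normal(size=n)   # Y is deliberately correlated with X

# Linearity holds regardless of dependence: E[X + Y] = E[X] + E[Y].
print(np.mean(X + Y), np.mean(X) + np.mean(Y))   # identical sample means

# E[AX] = A E[X] for a deterministic matrix A and a random vector X.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
V = rng.normal(size=(n, 2))        # n samples of a random vector in R^2
print((V @ A.T).mean(axis=0), A @ V.mean(axis=0))  # equal sample statistics
```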
⋄ Conditional expectation: The definition of conditional expectation is
E[X∣A=a] = ∑_x x p(x∣a).
Similar to the law of total probability, we have the law of total expectation:
E[X] = ∑_a E[X∣A=a] p(a).
The proof is as follows. By the definition of expectation, it holds that
∑_a E[X∣A=a] p(a) = ∑_a [∑_x p(x∣a)x] p(a) = ∑_x ∑_a p(x∣a)p(a) x = ∑_x [∑_a p(x∣a)p(a)] x = ∑_x p(x)x = E[X].
The law of total expectation is frequently used in reinforcement learning.
Similarly, conditional expectation satisfies
E[X∣A=a] = ∑_b E[X∣A=a, B=b] p(b∣a).
This equation is useful in the derivation of the Bellman equation. A hint of its proof is the chain rule: p(x∣a,b)p(b∣a) = p(x,b∣a).
Finally, it is worth noting that E[X∣A=a] is different from E[X∣A] . The former is a value, whereas the latter is a random variable. In fact, E[X∣A] is a function of the random variable A . We need rigorous probability theory to define E[X∣A] .
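A small numerical check of the law of total expectation, using an arbitrary joint distribution and arbitrary values of X (an illustration we choose here):

```python
import numpy as np

# Arbitrary joint p(x, a); rows index values of X, columns values of A.
p_xa = np.array([[0.1, 0.3],
                 [0.2, 0.1],
                 [0.2, 0.1]])
x_vals = np.array([1.0, 2.0, 5.0])  # values that X can take

p_a = p_xa.sum(axis=0)    # p(a)
p_x_given_a = p_xa / p_a  # p(x|a)

E_X = (p_xa.sum(axis=1) * x_vals).sum()                    # E[X] from the marginal
E_X_given_a = (p_x_given_a * x_vals[:, None]).sum(axis=0)  # E[X|A=a] for each a

# Law of total expectation: E[X] = sum_a E[X|A=a] p(a).
assert np.isclose(E_X, (E_X_given_a * p_a).sum())
```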
⋄ Gradient of expectation: Let f(X,β) be a scalar function of a random variable X and a deterministic parameter vector β . Then,
∇_β E[f(X,β)] = E[∇_β f(X,β)].
Proof: Since E[f(X,β)] = ∑_x f(x,β)p(x), where the distribution p(x) does not depend on β, we have
∇_β E[f(X,β)] = ∇_β ∑_x f(x,β)p(x) = ∑_x ∇_β f(x,β)p(x) = E[∇_β f(X,β)].
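The interchange of gradient and expectation can be checked numerically; the sketch below compares a finite-difference gradient of E[f(X,β)] with E[∇_β f(X,β)] for an arbitrary smooth f of our own choosing:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])      # arbitrary distribution of X
x_vals = np.array([-1.0, 0.0, 2.0])
beta = np.array([0.7, -0.4])       # arbitrary parameter vector

def f(x, b):
    # An arbitrary smooth scalar function of x and beta.
    return x * b[0] ** 2 + np.sin(b[1]) * x ** 2

def grad_f(x, b):
    # Analytic gradient of f with respect to beta.
    return np.array([2.0 * x * b[0], np.cos(b[1]) * x ** 2])

# E[grad_beta f(X, beta)]: average the gradient under p(x).
expected_grad = sum(p_i * grad_f(x_i, beta) for p_i, x_i in zip(p, x_vals))

# Finite-difference gradient of E[f(X, beta)] with respect to beta.
def E_f(b):
    return sum(p_i * f(x_i, b) for p_i, x_i in zip(p, x_vals))

eps = 1e-6
fd_grad = np.array([
    (E_f(beta + eps * e) - E_f(beta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
assert np.allclose(expected_grad, fd_grad, atol=1e-5)
```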
⋄ Variance, covariance, covariance matrix: For a single random variable X, its variance is defined as var(X) = E[(X − x̄)²], where x̄ = E[X]. For two random variables X, Y, their covariance is defined as cov(X,Y) = E[(X − x̄)(Y − ȳ)]. For a random vector X = [X_1, …, X_n]^T, the covariance matrix of X is defined as var(X) ≐ Σ = E[(X − x̄)(X − x̄)^T] ∈ R^{n×n}. The ij-th entry of Σ is [Σ]_ij = E[[X − x̄]_i [X − x̄]_j] = E[(X_i − x̄_i)(X_j − x̄_j)] = cov(X_i, X_j). One trivial property is var(a) = 0 if a is deterministic. Moreover, it can be verified that var(AX + a) = var(AX) = A var(X) A^T = AΣA^T.
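The identity var(AX + a) = AΣA^T can be verified with samples; in the sketch below, Σ, A, and a are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n = 1_000_000

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # arbitrary covariance matrix
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=n)

A = np.array([[1.0, 1.0],
              [0.0, 2.0]])      # arbitrary deterministic matrix
a = np.array([3.0, -2.0])       # arbitrary deterministic shift

Y = X @ A.T + a                       # samples of AX + a
sample_cov = np.cov(Y, rowvar=False)  # estimate of var(AX + a)
print(sample_cov)
print(A @ Sigma @ A.T)                # theory: A Sigma A^T, independent of a
```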
Some useful facts are summarized below.
Fact: E[(X − x̄)(Y − ȳ)] = E[XY] − x̄ȳ = E[XY] − E[X]E[Y].
Proof: E[(X − x̄)(Y − ȳ)] = E[XY − Xȳ − x̄Y + x̄ȳ] = E[XY] − E[X]ȳ − x̄E[Y] + x̄ȳ = E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y] = E[XY] − E[X]E[Y].
Fact: E[XY] = E[X]E[Y] if X, Y are independent.
Proof: E[XY] = ∑_x ∑_y p(x,y)xy = ∑_x ∑_y p(x)p(y)xy = ∑_x p(x)x ∑_y p(y)y = E[X]E[Y].
Fact: cov(X,Y)=0 if X,Y are independent.
Proof: When X,Y are independent, cov(X,Y)=E[XY]−E[X]E[Y]=E[X]E[Y]−E[X]E[Y]=0 .
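These facts can also be confirmed empirically by drawing independent samples and checking that the sample covariance is near zero (the distributions and seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 1_000_000

X = rng.uniform(0.0, 1.0, size=n)  # X and Y are drawn independently
Y = rng.normal(2.0, 1.0, size=n)

# For independent X, Y: E[XY] ≈ E[X]E[Y] and cov(X, Y) ≈ 0.
print(np.mean(X * Y), np.mean(X) * np.mean(Y))  # approximately equal
print(np.cov(X, Y)[0, 1])                       # close to zero
```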