---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.11.5
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Probability theory & random variables
Last week was our first encounter with Monte Carlo simulations: the various incarnations of the pebble game used random numbers to estimate the ratio of the areas of a circle and a square. We have also witnessed that several factors affect the accuracy of the estimate: more samples means higher accuracy, while having independent samples like in the direct-sampling Monte Carlo method is better than the correlated samples of the Markov-chain approach. In order to reason properly about the simple examples and to be well equipped for the more involved problems we will encounter, it is essential to agree upon the probability theory background. This will be the focus of this week's lecture.
## Elements of probability theory
### Probability spaces, events, independence
The arena of probability theory is a **probability space** $(\Omega,\mathbb{P})$. Here $\Omega$ is the **sample space**, meaning the set of possible outcomes of an experiment, and $\mathbb{P}$ is a **probability measure** on $\Omega$. A subset $A\subset\Omega$ is called an **event** and to any event $\mathbb{P}$ associates a real number $\mathbb{P}(A) \in [0,1]$ telling us the probability of this event, i.e. the probability that the outcome of an experiment is in $A$. In particular it is required to satisfy $ \mathbb{P}(\Omega) = 1$, $\mathbb{P}(\emptyset) = 0$, and for any sequence of disjoint events $A_1,A_2,\ldots$ (meaning $A_i \cap A_j = \emptyset$ whenever $i\neq j$),
$$
\mathbb{P}( A_1 \cup A_2 \cup \cdots) =\mathbb{P}( A_1) + \mathbb{P}(A_2) + \cdots.
$$
In words: the chance of any of a set of mutually exclusive events to happen is the sum of the chances of the individual events.
:::{admonition} Example (direct-sampling pebble game)
The sample space of a single throw in the direct-sampling pebble game is the square $\Omega = (-1,1)^2$ and the probability measure $\mathbb{P}$ is the measure on $\Omega$ determined by the probability density $p(\mathbf{x}) = 1/4$, i.e. for any subset $A \subset \Omega$ we had $\mathbb{P}(A) = \int_A p(\mathbf{x}) \mathrm{d}\mathbf{x}$. We were particularly interested in the event $\text{hit} = \{ \mathbf{x} \in (-1,1)^2 : \mathbf{x}\cdot\mathbf{x} < 1\}$ and we convinced ourselves that $\mathbb{P}(\text{hit}) = \pi/4$.
:::
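The hit probability can be checked with a few lines of direct sampling, just as we did last week. This is a minimal sketch (sample size and seed chosen arbitrarily):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-1, 1, size=(n, 2))   # n uniform throws in the square (-1,1)^2
hits = (x * x).sum(axis=1) < 1        # the event hit: x·x < 1
print(hits.mean(), np.pi / 4)         # empirical estimate vs exact value
```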
:::{admonition} Example (die roll)
When rolling a $6$-sided die, the sample space is $\Omega = \{1,2,\ldots,6\}$ and the probability measure is the one that satisfies $\mathbb{P}(\{i\}) = 1/6$ for any $i \in \Omega$. An example of an event is $\text{even} = \{2,4,6\} \subset \Omega$ which by the additivity formula above has probability $\mathbb{P}(\text{even}) = \mathbb{P}(\{2\}) + \mathbb{P}(\{4\}) + \mathbb{P}(\{6\}) = 1/2$.
:::
Several properties of a probability space follow directly from the definition. If $A$ and $B$ are two events then
$$
\mathbb{P}(\Omega \smallsetminus A) = 1- \mathbb{P}(A)\qquad \text{"probability of $A$ not happening"}, \\
\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B) \qquad \text{"probability of $A$ or $B$"}, \\
\mathbb{P}(A \smallsetminus B) = \mathbb{P}(A) - \mathbb{P}(A \cap B) \qquad \text{"probability of $A$ and not $B$"}, \\
A \subset B\quad \implies\quad \mathbb{P}(A) \leq \mathbb{P}(B) \qquad\text{"if $A$ implies $B$ then $B$ cannot have smaller probability"}.
$$
Side note: If we want to be [mathematically rigorous](https://en.wikipedia.org/wiki/Probability_space) we should specify which subsets $A \subset \Omega$ constitute proper events (the so-called $\sigma$-algebra), because some subsets of a continuous space are simply too wild to assign a probability to. We will not delve into these measure-theoretic issues, as they play little role in the probability spaces we consider.
An important concept in probability is that of **conditional probabilities**. If $A$ and $B$ are events such that $B$ happens with non-zero probability $\mathbb{P}(B) > 0$ then the **conditional probability** of $A$ given $B$ is
$$
\mathbb{P}( A | B ) := \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.
$$
You may check that $\mathbb{P}(\cdot|B)$ is a proper probability measure on $\Omega$, which tells us what the probability of outcomes of an experiment is if we already know that event $B$ happens.
Note that in general this measure is different from $\mathbb{P}$ itself, i.e. typically $\mathbb{P}(A) \neq \mathbb{P}( A | B)$, meaning that knowing that $B$ happens changes the probability of $A$ happening. Sometimes this is not the case: events $A$ and $B$ are said to be **independent** if
$$
\mathbb{P}(A) = \mathbb{P}( A | B)\quad \iff \quad \mathbb{P}( A\cap B) = \mathbb{P}(A)\mathbb{P}(B) \quad \iff \quad \mathbb{P}(B) = \mathbb{P}( B | A).
$$
:::{admonition} Example (die roll)
The events $\{2,4,6\}$ ("even") and $\{1,2,3,4\}$ ("at most 4") are independent, but $\{1,2,3\}$ ("at most 3") is not independent of either of them.
:::
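The factorization criterion for independence is easy to probe numerically. The following sketch simulates many die rolls and compares $\mathbb{P}(A\cap B)$ with $\mathbb{P}(A)\mathbb{P}(B)$ for the pairs of events above (sample size and seed are arbitrary):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=200_000)       # fair die rolls in {1,...,6}
even = np.isin(rolls, [2, 4, 6])
at_most_4 = rolls <= 4
# independent pair: the joint probability factorizes (both sides ≈ 1/3)
print((even & at_most_4).mean(), even.mean() * at_most_4.mean())
at_most_3 = rolls <= 3
# dependent pair: P(even and at most 3) ≈ 1/6, but the product ≈ 1/4
print((even & at_most_3).mean(), even.mean() * at_most_3.mean())
```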
The notion of independence can be extended to a sequence of events. Events $A_1,A_2,\ldots,A_n$ are **independent** if
$$
\mathbb{P}( \cap_{i\in I} A_i ) = \prod_{i\in I} \mathbb{P}(A_i) \qquad \text{for all subsets }I\subset\{1,2,\ldots,n\}.
$$
:::{admonition} Example (direct-sampling pebble game)
When considering $N$ throws in this game, the sample space is naturally given by the $N$-fold copy of the square $\Omega = \big( (-1,1)^2 \big)^N = \{ (\mathbf{x}_1,\ldots,\mathbf{x}_N) \}$. Then the event of a hit in the $k$th trial is given by $\text{hit}_k = \{ (\mathbf{x}_1,\ldots,\mathbf{x}_N) : \mathbf{x}_k\cdot\mathbf{x}_k < 1\}$. The events $\text{hit}_1,\ldots,\text{hit}_N$ are then independent: the probability of a hit in one trial is not affected by knowing the hits or misses in the $N-1$ other trials.
:::
(random-variables)=
### Random variables
A (real) **random variable** $X$ is a function from the sample space $\Omega$ to the real numbers $\mathbb{R}$. Sometimes one considers also functions into other spaces like $\mathbb{C}$ or $\mathbb{R}^n$, giving rise to complex random variables or random vectors, but for now let us concentrate on real random variables. Sometimes the sample space $\Omega$ is quite complicated (as it contains all details about an experiment) and we would like to limit ourselves to describing the information one gets from just looking at the random variable $X$. In other words we consider the **distribution** of $X$, which is a probability measure on $\mathbb{R}$ given by
$$
\mathbb{P}( X \in B) := \mathbb{P}(\{ x\in \Omega : X(x) \in B\}) \qquad\text{for every (decent) subset } B\subset \mathbb{R}.
$$
A convenient way to characterize a distribution is via the **cumulative distribution function (CDF)**
$$
F_X(x) := \mathbb{P}( X \leq x), \qquad x\in\mathbb{R}.
$$
It exists for any random variable (discrete or continuous) and satisfies: $F_X$ is non-decreasing and [right-continuous](https://en.wikipedia.org/wiki/Continuous_function#Directional_and_semi-continuity); $\lim_{x\to -\infty} F_X(x) = 0$; $\lim_{x\to \infty} F_X(x)=1$. This is really a 1-to-1 characterization: for any real function with these properties there exists a random variable having that function as its CDF. In fact, we will see below an explicit way, called the **inversion method**, to construct such a random variable from the CDF. Two random variables (potentially on different probability spaces) are said to be **identically distributed** if they have the same distribution, or equivalently the same CDF.
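As a small preview of the inversion method, here is a sketch for one concrete CDF, $F(x) = 1 - e^{-x}$ for $x \geq 0$ (the exponential distribution, chosen purely for illustration): applying the explicit inverse $F^{-1}(u) = -\log(1-u)$ to uniform samples $u$ produces samples with that CDF.

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(0, 1, size=100_000)
samples = -np.log(1 - u)               # F^{-1}(u) for F(x) = 1 - exp(-x)
# the empirical CDF at x = 1 should approximate F(1) = 1 - exp(-1)
print((samples <= 1.0).mean(), 1 - np.exp(-1))
```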
### Expectation value
Formally, given a random variable $X$, the **expectation value** or **mean** of $X$ is
$$
\langle X \rangle \equiv \mathbb{E}[ X ] := \int_{\Omega} X \mathrm{d}\mathbb{P},
$$
but making sense of this requires abstract measure and integration theory. Luckily pretty much all random variables we will encounter are either **discrete** or **(absolutely) continuous**, in which case the expectation value amounts to a sum or standard calculus integral (see below). In full generality it satisfies the important property of **linearity**: if $X$ and $Y$ are random variables on some probability space and $a,b\in\mathbb{R}$ then
$$
\mathbb{E}[a X + b Y] = a\, \mathbb{E}[X] + b \,\mathbb{E}[Y].
$$
Note that this holds even when $X$ and $Y$ are highly dependent, making it an extremely powerful property that we will use again and again. Note also that if $X$ is constant (or deterministic), i.e. $\mathbb{P}(X = x) = 1$ for some $x\in \mathbb{R}$, then $\mathbb{E}[X] = x$.
:::{admonition} Example (Buffon's needle and Buffon's noodle)
A classic example in which the additivity of expectation values helps surprisingly is Buffon's needle experiment. Suppose the floor has parallel lines with an even spacing of distance $1$ (think of a wooden floor with boards of width $1$) and we drop a needle of length $\ell < 1$ from considerable height. What is the probability $\mathbb{P}(\text{across})$ that it will lie across a line? By parametrizing the position and orientation of the needle appropriately one may compute the appropriate integral to derive $\mathbb{P}(\text{across}) = 2\ell / \pi$. But there is a clever way to arrive at this answer by generalizing the problem somewhat. Let $C$ be a 1-dimensional needle (or noodle) of any shape or size and denote by $N_C$ the random variable counting the number of times $C$ intersects a line when thrown. We can then subdivide $C$ into smaller parts $C_1,\ldots,C_k$ and note that the number of intersections just add up, $N_C = N_{C_1} + \cdots + N_{C_k}$. Even though $N_{C_1}, \ldots, N_{C_k}$ are far from independent, linearity of expectation tells us that
$$
\mathbb{E}[N_C] = \mathbb{E}[N_{C_1}] + \cdots + \mathbb{E}[N_{C_k}].
$$
Suppose the total length of $C$ is $\ell$, and we divide $C$ into $k=\ell/\epsilon$ parts that all approximately have the shape of a straight segment of small length $\epsilon$ up to rotation and translation. Then each of those parts contributes the same quantity to the right-hand side, giving $\mathbb{E}[N_C] = \frac{\ell}{\epsilon} \mathbb{E}[N_{C_1}] = c \ell$ for some universal constant $c$. In particular, $\mathbb{E}[N_C]$ only depends on the total length of the curve $C$. To determine the proportionality constant $c$, we consider a circular needle $C$ of diameter $1$ and length $\ell = \pi$, for which we know that always $N_C = 2$ and thus $\mathbb{E}[N_C] = 2$. This fixes $c=2/\pi$. Returning to our straight needle $C$ of length $\ell$, we now know that $\mathbb{E}[N_C] = 2\ell / \pi$. But we also know that when $\ell < 1$, we can only have $N_C = 0$ or $N_C = 1$, meaning that the probability and expectation value agree
$$
\frac{2\ell}{\pi} = \mathbb{E}[N_C] = 1\cdot\mathbb{P}(\text{across}) + 0\cdot\mathbb{P}(\text{not across}) = \mathbb{P}(\text{across}).
$$
:::
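The prediction $\mathbb{P}(\text{across}) = 2\ell/\pi$ can be verified by simulation. In the sketch below the needle's center position between two neighboring lines and its orientation are sampled uniformly (needle length, sample size and seed are arbitrary choices):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(3)
l = 0.7                               # needle length, l < 1
n = 200_000
y = rng.uniform(0, 1, n)              # center position between two lines (spacing 1)
theta = rng.uniform(0, np.pi, n)      # orientation of the needle
half_span = 0.5 * l * np.sin(theta)   # vertical half-extent of the needle
across = (y < half_span) | (y > 1 - half_span)
print(across.mean(), 2 * l / np.pi)   # empirical estimate vs 2*l/pi
```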
Another important property holds for independent random variables. The notion of independence of random variables is similar to that of events: random variables $X_1,X_2,\ldots,X_n$ are called **independent** if the distribution of one of the random variables is not affected by knowing any of the values of the other $n-1$ variables. Mathematically this is expressed as
$$
\mathbb{P}( X_1 \in B_1, X_2 \in B_2, \ldots, X_n\in B_n) = \prod_{i=1}^n \mathbb{P}(X_i \in B_i),
$$
where the $B_i$ are arbitrary subsets of $\mathbb{R}$.
In this case
$$
\mathbb{E} \Big[ \prod_{i=1}^n X_i \Big] = \prod_{i=1}^n \mathbb{E}[ X_i].\qquad \text{(if $X_1,\ldots,X_n$ are independent)}
$$ (product-expectation)
In terms of the expectation value several other important quantities can be defined. The **variance** $\sigma_X^2\equiv \operatorname{Var}( X )$ of $X$ is defined as
$$
\operatorname{Var}( X ) = \mathbb{E}\Big[ (X- \mathbb{E}[X])^2 \Big].
$$
Using the linearity of expectation this can be equivalently expressed as
$$
\operatorname{Var}( X ) = \mathbb{E}\Big[ X^2 - 2\mathbb{E}[X]X + \mathbb{E}[X]^2\Big] = \mathbb{E}[ X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + \mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{E}[X]^2.
$$
An important property of the variance is that it is easily calculated for a sum of independent random variables,
$$
\operatorname{Var}\Big(\sum_{i=1}^n X_i\Big) = \sum_{i=1}^n \operatorname{Var}(X_i).\qquad \text{(if $X_1,\ldots,X_n$ are independent)}
$$
We can check this in the case $n=2$ (and conclude the general case by induction) as follows.
$$
\begin{split} \operatorname{Var}(X_1+X_2) &= \mathbb{E}[(X_1+X_2)^2] - \mathbb{E}[X_1+X_2]^2 \\
&= \mathbb{E}[X_1^2] + 2 \mathbb{E}[X_1 X_2]+\mathbb{E}[X_2^2] - \mathbb{E}[X_1]^2 - 2 \mathbb{E}[X_1]\mathbb{E}[X_2]-\mathbb{E}[X_2]^2 \\
&= \mathbb{E}[X_1^2] + \mathbb{E}[X_2^2] - \mathbb{E}[X_1]^2-\mathbb{E}[X_2]^2 \\
&= \operatorname{Var}(X_1) + \operatorname{Var}(X_2),
\end{split}
$$
where we used {eq}`product-expectation`, which gives $\mathbb{E}[X_1 X_2] = \mathbb{E}[X_1]\mathbb{E}[X_2]$ when $X_1$ and $X_2$ are independent.
The **standard deviation** $\sigma_X$ is of course just the square root of the variance $\sigma_X = \sqrt{\operatorname{Var}(X)}$.
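The additivity of the variance for independent random variables is easy to check numerically. A minimal sketch, using a uniform variable (variance $1/12$) and a normal variable with standard deviation $2$ (variance $4$) as arbitrary examples:

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
x1 = rng.uniform(0, 1, n)    # Var(X1) = 1/12
x2 = rng.normal(0, 2, n)     # Var(X2) = 4
# for independent samples the variance of the sum matches the sum of variances
print(np.var(x1 + x2), np.var(x1) + np.var(x2))
```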
### Discrete random variables
A **discrete random variable** $X$ is a random variable that takes on only a finite or countable number of values. In that case we can characterize its distribution by the **probability (mass) function** $p_X(x)$ given by
$$
p_X(x) = \mathbb{P}( X = x ).
$$
In that case the expectation value of $X$ is simply given by a sum
$$
\mathbb{E}[ X ] = \sum_{x} x \,p_X(x).
$$
If $g : \mathbb{R} \to \mathbb{R}$ is any function then we can also consider the expectation value of $g(X)$,
$$
\mathbb{E}[ g(X) ] = \sum_x g(x)\,p_X(x).
$$
:::{admonition} Example (Bernoulli random variable)
This is probably the simplest discrete random variable. It has a single parameter $p\in[0,1]$ and the random variable takes on only two values $0$ and $1$ with probability function
$$
p_X(0) = 1-p,\qquad p_X(1) = p.
$$
Note that it has expectation value $\mathbb{E}[X] = p$ and variance $\operatorname{Var}(X) = p(1-p)$.
Such random variables appear naturally when considering events on a probability space. If $A \subset \Omega$ is an event then the indicator function $\mathbf{1}_{A} : \Omega \to \mathbb{R}$ defines a random variable taking values $0$ and $1$, which is distributed as a Bernoulli random variable with $p = \mathbb{P}(A)$. We already encountered a random variable of this form in the direct-sampling pebble game, where we looked at an observable $\mathcal{O}$ that was the indicator function of the event $\text{hit} = \{ \mathbf{x}\cdot\mathbf{x} < 1 \}$ with $p=\pi/4$.
:::
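We can recover the Bernoulli mean $p$ and variance $p(1-p)$ directly from the pebble-game indicator, as a sketch (sample size and seed are arbitrary):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(5)
p = np.pi / 4
x = rng.uniform(-1, 1, size=(200_000, 2))
indicator = ((x * x).sum(axis=1) < 1).astype(float)  # Bernoulli with p = pi/4
print(indicator.mean(), p)                           # E[X] = p
print(indicator.var(), p * (1 - p))                  # Var(X) = p(1-p)
```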
It also makes sense to look at the joint distribution of several discrete random variables $X_1,\ldots,X_n$. In that case one considers the **joint probability (mass) function**
$$
p_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \mathbb{P}(X_1 = x_1,\cdots,X_n= x_n).
$$
### Continuous random variables
A random variable is **(absolutely) continuous** if there exists a **probability density function** $f_X : \mathbb{R} \to [0,\infty)$ such that
$$
\mathbb{P}( X \leq s ) = \int_{-\infty}^s f_X(x)\mathrm{d} x.
$$
It follows directly that we have the properties
$$
\begin{split}
\mathbb{P}( X = x ) &= 0\text{ for }x\in\mathbb{R},\\
\mathbb{P}(a\leq X\leq b) &= \int_a^b f_X(x)\mathrm{d}x,\\
\int_{-\infty}^\infty f_X(x) \mathrm{d}x &= 1,\\
f_X(x) &= F_X '(x).
\end{split}
$$
Now the expectation value is given by
$$
\mathbb{E}[X] = \int_{-\infty}^\infty x\,f_X(x) \mathrm{d}x
$$
and for any (decent) function $g : \mathbb{R} \to \mathbb{R}$,
$$
\mathbb{E}[g(X)] = \int_{-\infty}^\infty g(x)\,f_X(x) \mathrm{d}x.
$$
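For a concrete illustration of this formula, take $X$ uniform on $(0,1)$ and $g(x) = x^2$ (an arbitrary choice), so that $\mathbb{E}[g(X)] = \int_0^1 x^2 \,\mathrm{d}x = 1/3$; a sample average of $g$ over uniform draws should reproduce this:

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(6)
u = rng.uniform(0, 1, 500_000)
# sample average of g(U) = U^2 vs the exact integral 1/3
print(np.mean(u**2), 1/3)
```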
:::{admonition} Example (uniform random variable)
The uniform random variable $U$ on $(a,b)$, $a<b$, has probability density function
$$
f_U(x) = \begin{cases} \frac{1}{b-a} & \text{for } x \in (a,b), \\ 0 & \text{otherwise}, \end{cases}
$$
with mean $\mathbb{E}[U] = (a+b)/2$ and variance $\operatorname{Var}(U) = (b-a)^2/12$.
:::
### Sample mean and the law of large numbers
Let $X_1,X_2,\ldots,X_n$ be independent random variables, each identically distributed to some random variable $X$ with mean $\mathbb{E}[X]$ and variance $\sigma_X^2 = \operatorname{Var}(X)$. Their **sample mean** is the random variable
$$
\bar{X}_n := \frac{1}{n}\sum_{i=1}^n X_i.
$$
By linearity of expectation $\mathbb{E}[\bar{X}_n] = \mathbb{E}[X]$, while the additivity of the variance for independent random variables gives $\operatorname{Var}(\bar{X}_n) = \sigma_X^2/n$. We can then use **Chebyshev's inequality**
$$
\mathbb{P}( |Y - \mathbb{E}[Y]| > \epsilon ) \leq \frac{\operatorname{Var}(Y)}{\epsilon^2} \qquad \text{for } \epsilon > 0
$$
that holds for any random variable $Y$ with finite variance to derive the **(weak) law of large numbers** for the sample mean
$$
\mathbb{P}( |\bar{X}_n - \mathbb{E}[X]| > \epsilon ) \leq \frac{\sigma_X^2}{n\epsilon^2} \qquad \text{for }\epsilon>0.
$$ (weak-law-large-numbers)
In other words, $\bar{X}_n$ is very likely to approximate $\mathbb{E}[X]$ to an accuracy $\epsilon$ if we take the sample size $n \gg \sigma_X^2 / \epsilon^2$.
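The Chebyshev bound in {eq}`weak-law-large-numbers` can be checked empirically. The sketch below estimates the deviation probability of the sample mean for $X$ uniform on $(0,1)$ and compares it with the bound $\sigma_X^2/(n\epsilon^2)$ (sample sizes, $\epsilon$ and seed are arbitrary choices):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(7)
n, trials, eps = 500, 10_000, 0.03
# many independent sample means of n uniform draws each
sample_means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
deviation_prob = (np.abs(sample_means - 0.5) > eps).mean()
chebyshev_bound = (1/12) / (n * eps**2)    # Var(X) = 1/12 for uniform on (0,1)
print(deviation_prob, chebyshev_bound)     # the bound is typically far from tight
```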
(central-limit-theorem)=
### Central limit theorem
The law of large numbers tells us that the sample mean approximates the mean up to an error of order $1/\sqrt{n}$. We can be more precise and ask what is the distribution of the error in units of $\sigma_X/\sqrt{n}$ when $n$ becomes large,
$$
Z_n := \frac{\sqrt{n}}{\sigma_X} (\bar{X}_n - \mathbb{E}[X]).
$$
In the case when $X$ is uniformly distributed on $(0,1)$ (with mean $\mathbb{E}[X] = 1/2$ and standard deviation $\sigma_X = 1/\sqrt{12}$) the distribution of $Z_n$ looks like this:
```{code-cell} ipython3
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

def gaussian(x):
    return np.exp(-x*x/2)/np.sqrt(2*np.pi)

length = 500
sample_means = np.array([np.mean(rng.uniform(0,1,length)) for _ in range(40000)])
# the mean of X is 0.5 and the standard deviation is 1/np.sqrt(12)
normalized_sample_means = (sample_means - 0.5) * np.sqrt(12*length)
plt.hist(normalized_sample_means, bins=np.arange(-4,4,0.25), density=True, label="histogram")
xrange = np.arange(-4,4,0.1)
plt.plot(xrange, gaussian(xrange), label=r"$f_{\mathcal{N}}(x)$")
plt.xlabel(r"$Z_n$")
plt.ylabel("density")
plt.title("n = {}".format(length))
plt.legend()
plt.show()
```
The orange curve is the standard Gaussian, i.e. the probability density function of a **normal** random variable $\mathcal{N}$ of mean $0$ and standard deviation $1$,
$$
f_{\mathcal{N}}(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2 /2}.
$$
To be precise the **central limit theorem (CLT)** states that, provided $\mathbb{E}[|X|]$ and $\operatorname{Var}(X)$ are finite, the CDF of $Z_n$ converges to that of $\mathcal{N}$ as $n\to\infty$,
$$
\lim_{n\to\infty} F_{Z_n}(x) = F_{\mathcal{N}}(x) \qquad\text{for all }x\in\mathbb{R}.
$$
Such a convergence of random variables is called **convergence in distribution**.
It is instructive to see how the central limit theorem can be proved. To this end we introduce the **characteristic function** $\varphi_X(t)$ associated to any random variable $X$ as
$$
\varphi_X(t) = \mathbb{E}[ e^{itX}]\qquad \text{for }t\in\mathbb{R}.
$$
Due to absolute convergence of the sum or integral appearing in the expectation value, it is a continuous function satisfying $|\varphi_X(t)| \leq 1$ and it completely characterizes the distribution of $X$.
If $X$ has finite mean and finite variance, which we assume, then $\varphi_X(t)$ is twice continuously differentiable. Set $\hat{X} = (X - \mathbb{E}[X]) / \sigma_X$ to be the normalized version of $X$, such that $\mathbb{E}[\hat{X}] = 0$ and $\operatorname{Var}(\hat{X}) = \mathbb{E}[\hat{X}^2]=1$, and note that $Z_n = \tfrac{1}{\sqrt{n}} \sum_{i=1}^n \hat{X}_i$. Then we can Taylor expand the characteristic function of $\hat{X}$ around $t=0$ to obtain
$$
\varphi_{\hat{X}}(t) = \mathbb{E}[ 1 + it \hat{X} - \tfrac{1}{2} t^2 \hat{X}^2 + \cdots] = 1 - \tfrac{1}{2} t^2 + \cdots.
$$
On the other hand the characteristic function of $Z_n$ is
$$
\varphi_{Z_n}(t) = \mathbb{E}\Big[ \exp(it\tfrac{1}{\sqrt{n}}\sum_{i=1}^n \hat{X}_i)\Big] = \prod_{i=1}^n \mathbb{E}\Big[ \exp(it\tfrac{1}{\sqrt{n}}\hat{X}_i)\Big] = \varphi_{\hat{X}}(\tfrac{1}{\sqrt{n}}t)^{n},
$$
where we used {eq}`product-expectation`.
Plugging in the Taylor expansion gives
$$
\lim_{n\to\infty} \varphi_{Z_n}(t) = \lim_{n\to\infty} \left(1 - \tfrac{t^2}{2n} + \cdots\right)^{n} = e^{-t^2/2}.
$$
But we recognize the right-hand side as the characteristic function
$$
\varphi_{\mathcal{N}}(t) = \mathbb{E}[e^{i t \mathcal{N}}] = \int_{-\infty}^\infty \mathrm{d}x e^{itx} \frac{1}{\sqrt{2\pi}}e^{-x^2/2} = e^{-t^2/2}
$$
of the standard normal distribution. Convergence of characteristic functions can be shown ([Lévy's continuity theorem](https://en.wikipedia.org/wiki/L%C3%A9vy%27s_continuity_theorem)) to be equivalent to convergence in distribution and therefore this proves the central limit theorem.
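The convergence $\varphi_{Z_n}(t) \to e^{-t^2/2}$ can itself be probed by simulation: estimate $\mathbb{E}[e^{itZ_n}]$ by averaging over many realizations of $Z_n$. A sketch for uniform $X$ (values of $n$, $t$ and the seed are arbitrary):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(8)
n, trials = 100, 50_000
x = rng.uniform(0, 1, size=(trials, n))
z = (x.mean(axis=1) - 0.5) * np.sqrt(12 * n)    # realizations of Z_n
t = 1.0
# empirical characteristic function at t vs exp(-t^2/2)
print(np.mean(np.exp(1j * t * z)), np.exp(-t**2 / 2))
```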
Note that the central limit theorem does **not** hold when $\operatorname{Var}(X) = \infty$ and a variety of other limit distributions (called **stable distributions**) can arise in the large-$n$ limit. You will investigate this further in the {doc}`/exercises/exercise_sheet_02`.