Probability Theory — Chapters 1–10

Chapter 1 — The Problem of Uncertainty

1.1 Deterministic statements versus probabilistic statements

A deterministic statement has a truth value fixed by the state of the system and the rules of the model: “the derivative of $x^2$ is $2x$,” “this algorithm halts on this input,” or “the particle is at coordinate $x$” once the model’s state is fully specified. A probabilistic statement does not assert a single realized truth in advance; it assigns weights to possible truth-values or outcomes. The statement “the coin lands heads with probability $1/2$” is not equivalent to “the coin will land heads,” nor is it a weaker deterministic statement. It is a statement about a measure over alternatives.

The primitive move in probability is therefore not prediction but carrier construction. One must specify what can happen, which propositions about what can happen are legally measurable, and how much probability mass they carry. The deterministic claim has the form $\varphi$ or $\neg\varphi$ for a proposition $\varphi$. The probabilistic claim has the form

P(E)=p,

where $E$ is an event in some event space. Without the event space, the expression is syntactically suggestive but mathematically incomplete.

1.2 Events, outcomes, experiments, observations

An outcome is a primitive realized possibility in a model. In a die roll, the outcomes are $1,2,\ldots,6$. An event is a set of outcomes satisfying some property, such as “the die is even,” represented by $\{2,4,6\}$. An experiment is the procedure or formal random mechanism generating outcomes. An observation is the information actually registered; it may be coarser than the outcome. For example, if a die is rolled but only parity is observed, the observable events are built from $\{2,4,6\}$ (even) and $\{1,3,5\}$ (odd), not from the six singleton outcomes.

This distinction matters because probability attaches to events, not directly to linguistic descriptions. Two descriptions may denote the same event, and the same informal description may denote different events under different sample-space models. A formal probability model therefore separates $\Omega$ , the outcome space, from $\mathcal F$ , the admissible event family, and from $P$ , the probability law. The triple $(\Omega,\mathcal F,P)$ is the carrier; events are legal only when they belong to $\mathcal F$ .

1.3 Probability as weight, frequency, belief, symmetry, and measure

Probability has several interpretations. As weight, it is a normalized mass assigned to events. As frequency, it describes the limiting proportion of occurrence in repeated trials. As belief, it quantifies rational degrees of uncertainty. As symmetry, it assigns equal mass to indistinguishable alternatives. As measure, it becomes a mathematical function $P:\mathcal F\to[0,1]$ satisfying normalization and countable additivity:

P(\Omega)=1,\qquad P\Big(\bigcup_{n=1}^{\infty}E_n\Big)=\sum_{n=1}^{\infty}P(E_n)

for pairwise disjoint $E_n\in\mathcal F$ .

The measure interpretation is the formal engine, not necessarily the philosophical interpretation. A Bayesian may use $P$ to encode belief; a frequentist may use $P$ to model long-run sampling behavior; a physicist may use $P$ to describe ensemble uncertainty. All still need a calculus that supports events, random variables, products, conditioning, expectation, and limits. Measure theory supplies that common calculus.

1.4 Why finite probability is insufficient

Finite probability handles dice, cards, urns, and finite games well. If $\Omega=\{\omega_1,\ldots,\omega_n\}$, every event is a subset of $\Omega$, and a probability law is determined by masses $p_i=P(\{\omega_i\})$ with $p_i\ge0$ and $\sum_i p_i=1$. Then

P(E)=\sum_{\omega_i\in E}p_i.

This is clean, but it cannot model continuous random variables, infinite sequences, stochastic processes, Brownian motion, or limit theorems without extension.

The fatal issue is continuous probability. A uniform random point $X\in[0,1]$ should satisfy $P(X=x)=0$ for each individual $x$, but $P(X\in[0,1])=1$. This probability cannot be recovered by summing point masses. The model must assign mass to sets such as intervals and then extend to a suitable event family. Countable additivity, measurable sets, and integration become unavoidable once probability must survive infinite limiting operations.

1.5 The need for a carrier: sample space, event space, probability law

A probability claim requires three layers. The sample space $\Omega$ lists possible outcomes. The event space $\mathcal F$ specifies which subsets of $\Omega$ are measurable events. The probability law $P$ assigns mass to those events. The formal object is

(\Omega,\mathcal F,P).

Leaving out $\mathcal F$ is harmless only in finite or countable models, where one can take $\mathcal F=2^\Omega$. In continuous models, the full power set includes nonmeasurable subsets under standard set-theoretic assumptions, so $\mathcal F$ must be restricted.

The carrier also determines which random variables exist. A random variable is not just any function $X:\Omega\to S$ ; it must be measurable, meaning inverse images of measurable target events are events in $\mathcal F$ . For real-valued $X$ , this means

\{\omega:X(\omega)\le t\}\in\mathcal F

for every $t\in\mathbb R$ . This condition ensures that statements about $X$ have probabilities.

1.6 Common category errors: outcome ≠ event, random variable ≠ value, law ≠ sample

An outcome is a point $\omega\in\Omega$ . An event is a measurable subset $E\subseteq\Omega$ . Confusing the two causes errors such as assigning probability to a realized value instead of to the event that the value occurs. A random variable $X$ is a function on $\Omega$ , not the number $X(\omega)$ obtained after realization. Its value is random before $\omega$ is fixed and deterministic after $\omega$ is fixed.

The law of $X$ is also not the same thing as $X$ . The law is the pushforward measure

\mu_X(B)=P(X\in B)=P(X^{-1}(B)).

Two random variables can have the same law while living on different spaces or while being dependent in different ways on other variables. Equality in distribution,

X\stackrel d=Y,

does not imply $X=Y$ pointwise or almost surely. That implication requires a common carrier and an equality statement inside that carrier.
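A small finite example may make the gap between equality in distribution and almost sure equality concrete. The sketch below assumes a fair-coin carrier and two illustrative variables, `X` and `Y = 1 - X` (names introduced here, not taken from the text): they share a law yet never coincide.

```python
from fractions import Fraction

# Assumed carrier: a single fair coin toss with outcomes "H" and "T".
omega = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

def X(w):
    """Indicator of heads."""
    return 1 if w == "H" else 0

def Y(w):
    """Indicator of tails, defined on the same carrier as X."""
    return 1 - X(w)

def law(var):
    """Pushforward of the coin measure through `var`, as a dict value -> mass."""
    out = {}
    for w, p in omega.items():
        out[var(w)] = out.get(var(w), Fraction(0)) + p
    return out

same_law = law(X) == law(Y)                                   # equality in distribution
p_equal = sum(p for w, p in omega.items() if X(w) == Y(w))    # P(X = Y)
print(same_law, p_equal)
```

Here the two laws agree while the event $\{X=Y\}$ is empty, so the comparison of laws says nothing about pointwise agreement.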

Chapter 2 — Finite Probability Spaces

2.1 Sample spaces

In finite probability, the sample space is a finite set

\Omega=\{\omega_1,\ldots,\omega_n\}.

Each $\omega_i$ represents a possible outcome. For one die, $\Omega=\{1,2,3,4,5,6\}$ . For two ordered dice, $\Omega=\{1,\ldots,6\}^2$ . The choice of $\Omega$ encodes what distinctions the model regards as real. Ordered dice and unordered dice are different sample spaces; using the wrong one changes the probabilities unless the law is adjusted.

Finite sample spaces are useful because every subset can be treated as measurable:

\mathcal F=2^\Omega.

This removes measurability issues and lets probability be introduced as weighted counting. But the simplicity hides the later distinction between outcome space and event space, which becomes load-bearing in infinite models.

2.2 Events as subsets

An event is a subset $E\subseteq\Omega$ . If $\Omega=\{1,2,3,4,5,6\}$ , then “even” is $E=\{2,4,6\}$ , “greater than four” is $F=\{5,6\}$ , and “even and greater than four” is $E\cap F=\{6\}$ . Logical operations translate into set operations:

\text{not }E=E^c,\qquad E\text{ or }F=E\cup F,\qquad E\text{ and }F=E\cap F.

This translation is the first formalization step of probability.

In finite spaces, the event algebra is Boolean. It is closed under complements, finite unions, and finite intersections. Since the space is finite, countable operations reduce to finite operations after repetitions are removed. Thus finite probability avoids the countability boundary that later forces the use of σ-algebras.

2.3 Probability mass functions

A probability mass function assigns a number $p(\omega)\ge0$ to each outcome such that

\sum_{\omega\in\Omega}p(\omega)=1.

The probability of an event is then

P(E)=\sum_{\omega\in E}p(\omega).

This is the finite version of integration: events are measured by summing masses over their elements.

The mass function determines the law completely. If all masses are equal, $p(\omega)=1/|\Omega|$ , the model is uniform. If the masses differ, the model is weighted. For a biased coin with $\Omega=\{H,T\}$ , one may set $p(H)=q$ , $p(T)=1-q$ . The formalism is the same; only the mass function changes.

2.4 Uniform probability and counting

Uniform probability is the special case where all outcomes are equally likely. Then

P(E)=\frac{|E|}{|\Omega|}.

This is the bridge between counting and probability. For two fair dice, $|\Omega|=36$ , and the event “sum equals seven” has six outcomes:

(1,6),(2,5),(3,4),(4,3),(5,2),(6,1),

so its probability is $6/36=1/6$ .
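The counting rule can be checked by brute-force enumeration. The following is a minimal sketch; the helper name `uniform_prob` is an illustrative choice, and events are represented as predicates on ordered pairs.

```python
from fractions import Fraction
from itertools import product

# Sample space for two ordered fair dice: 36 equally likely pairs.
omega = list(product(range(1, 7), repeat=2))

def uniform_prob(event):
    """P(E) = |E| / |Omega| for a predicate `event` on outcomes."""
    favorable = [w for w in omega if event(w)]
    return Fraction(len(favorable), len(omega))

p_seven = uniform_prob(lambda w: w[0] + w[1] == 7)
print(p_seven)
```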

Uniform models are powerful but dangerous. Equal likelihood must be justified by the modeling carrier, not by aesthetic symmetry alone. Two ways of parametrizing outcomes may produce different apparent uniform distributions. For example, choosing a random chord by choosing endpoints uniformly is not the same as choosing a midpoint uniformly. Uniformity is always uniformity with respect to a declared carrier.

2.5 Complements, unions, intersections

The complement rule is

P(E^c)=1-P(E).

For two events, the union rule is

P(E\cup F)=P(E)+P(F)-P(E\cap F).

The subtraction prevents double-counting the overlap. If $E\cap F=\varnothing$ , the events are disjoint and the formula reduces to additivity:

P(E\cup F)=P(E)+P(F).

Intersections encode simultaneous occurrence. If $E$ is “even” and $F$ is “greater than four,” then $E\cap F$ is “even and greater than four.” Probability theory often reduces verbal problems to algebra over complements, unions, and intersections. The event grammar is not optional; it is the syntax of probabilistic reasoning.

2.6 Inclusion–exclusion

For three events,

P(A\cup B\cup C) = P(A)+P(B)+P(C) -P(A\cap B)-P(A\cap C)-P(B\cap C) +P(A\cap B\cap C).

The general inclusion–exclusion formula is

P\Big(\bigcup_{i=1}^n A_i\Big) = \sum_iP(A_i) -\sum_{i<j}P(A_i\cap A_j) +\sum_{i<j<k}P(A_i\cap A_j\cap A_k) -\cdots +(-1)^{n+1}P(A_1\cap\cdots\cap A_n).

Inclusion–exclusion is exact but can be computationally expensive, since it involves $2^n-1$ intersection terms. Truncating the alternating sum gives the Bonferroni bounds, the simplest of which is the union bound

P\Big(\bigcup_i A_i\Big)\le\sum_iP(A_i).

This inequality is often more important than the exact formula because it scales to large systems where full overlap data are unavailable.
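To see the exact formula and the union bound side by side, one can evaluate both on a small space. The sketch below assumes two fair dice and three illustrative events chosen here (first die shows 6, second die shows 6, sum at least 10); the function name `inclusion_exclusion` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

omega = frozenset(product(range(1, 7), repeat=2))

def prob(event):
    """Uniform probability of a set of outcomes."""
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events):
    """Alternating sum over all nonempty subfamilies of `events`."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)
        for group in combinations(events, r):
            total += sign * prob(frozenset.intersection(*group))
    return total

A = frozenset(w for w in omega if w[0] == 6)
B = frozenset(w for w in omega if w[1] == 6)
C = frozenset(w for w in omega if w[0] + w[1] >= 10)

exact = inclusion_exclusion([A, B, C])      # full alternating sum
direct = prob(A | B | C)                    # union counted directly
union_bound = prob(A) + prob(B) + prob(C)   # first-order truncation
print(exact, direct, union_bound)
```

The union bound needs only the three individual probabilities, whereas the exact value needs every intersection.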

2.7 Conditional probability

Conditional probability restricts attention to a known event $B$ with $P(B)>0$ . It is defined by

P(A\mid B)=\frac{P(A\cap B)}{P(B)}.

This is not a new primitive in finite probability; it is a renormalized probability law on $B$ . The conditional law satisfies $P(B\mid B)=1$ and assigns zero mass outside $B$ .

Conditioning changes the carrier. The relevant sample space becomes $B$ , and probabilities are rescaled. If a die roll is known to be even, the probability that it is greater than four is

P(\{5,6\}\mid \{2,4,6\})=\frac{P(\{6\})}{P(\{2,4,6\})}=\frac{1/6}{3/6}=\frac13.

The calculation is simple; the important point is that conditioning is event-space restriction plus normalization.
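Restriction plus normalization translates directly into set operations. The following sketch assumes a fair die and represents events as Python sets; `cond_prob` is an illustrative helper name.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Uniform probability on the die."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(a, b):
    """P(A | B): restrict to B, then renormalize by P(B)."""
    if prob(b) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
greater_than_four = {5, 6}
p = cond_prob(greater_than_four, even)
print(p)
```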

2.8 Bayes’ theorem

Bayes’ theorem follows directly from the symmetry of intersection:

P(A\mid B)P(B)=P(A\cap B)=P(B\mid A)P(A).

Thus

P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}.

If $\{A_i\}$ is a partition of $\Omega$ , then

P(A_i\mid B)= \frac{P(B\mid A_i)P(A_i)} {\sum_jP(B\mid A_j)P(A_j)}.

The theorem transports probability from causal or generative direction $P(B\mid A)$ into diagnostic direction $P(A\mid B)$ . The numerator combines likelihood and prior; the denominator normalizes over all alternatives. The main error is to confuse $P(A\mid B)$ with $P(B\mid A)$ . Bayes’ theorem states exactly how they differ.
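The partition form can be sketched in a few lines. The numbers below are hypothetical and chosen only for illustration: a condition with prior $1/100$, a test that fires with probability $95/100$ when the condition holds and $5/100$ when it does not.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes over a partition: priors[i] = P(A_i), likelihoods[i] = P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(B), by the law of total probability
    return [j / evidence for j in joint]

# Hypothetical inputs: A_1 = "condition present", A_2 = "condition absent".
priors = [Fraction(1, 100), Fraction(99, 100)]
likelihoods = [Fraction(95, 100), Fraction(5, 100)]

post = posterior(priors, likelihoods)
print(post[0])  # P(condition present | test fires)
```

Under these assumed inputs $P(B\mid A_1)$ is $0.95$ while $P(A_1\mid B)$ is roughly $0.16$, which is the confusion the text warns against.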

2.9 Independence of events

Events $A$ and $B$ are independent if

P(A\cap B)=P(A)P(B).

If $P(B)>0$ , this is equivalent to

P(A\mid B)=P(A).

Independence means learning $B$ does not change the probability of $A$ . It does not mean the events are disjoint. In fact, disjoint positive-probability events are negatively dependent, since $P(A\cap B)=0$ while $P(A)P(B)>0$ .

Independence is a structural certificate, not a feeling of unrelatedness. It must be derived from the model. In product spaces, events depending on different coordinates are independent when the probability law factors. Without such factorization, independence is an assertion requiring proof.

2.10 Pairwise versus mutual independence

A family $A_1,\ldots,A_n$ is pairwise independent if every pair satisfies

P(A_i\cap A_j)=P(A_i)P(A_j).

It is mutually independent if for every subfamily $I\subseteq\{1,\ldots,n\}$ ,

P\Big(\bigcap_{i\in I}A_i\Big)=\prod_{i\in I}P(A_i).

Mutual independence is stronger.

The distinction is not technical decoration. Three events can be pairwise independent but not mutually independent. For two fair coin tosses, let $A$ be “first toss heads,” $B$ be “second toss heads,” and $C$ be “the tosses agree.” Each event has probability $1/2$ and each pairwise intersection has probability $1/4$, so each pair is independent. But $A\cap B\cap C=A\cap B$, so $P(A\cap B\cap C)=1/4$ while $P(A)P(B)P(C)=1/8$. Pairwise checks do not certify full product structure.
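The two-coin example is small enough to verify by enumeration. A minimal sketch, with events written as predicates on the pair of tosses:

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))  # four equally likely outcomes

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"      # first toss heads
B = lambda w: w[1] == "H"      # second toss heads
C = lambda w: w[0] == w[1]     # the tosses agree

# Check the product rule for every pair.
pairwise = all(
    prob(lambda w: e(w) and f(w)) == prob(e) * prob(f)
    for e, f in [(A, B), (A, C), (B, C)]
)
triple = prob(lambda w: A(w) and B(w) and C(w))
triple_product = prob(A) * prob(B) * prob(C)
print(pairwise, triple, triple_product)
```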

2.11 Finite random variables

A finite random variable is a function $X:\Omega\to S$ , often with $S\subseteq\mathbb R$ . Its distribution is

P_X(s)=P(X=s)=\sum_{\omega:X(\omega)=s}p(\omega).

The variable compresses outcomes into values. Multiple outcomes may produce the same value, so the law of $X$ is generally coarser than the law on $\Omega$ .

A random variable is not random because the function changes; $X$ is fixed. Randomness enters through the random choice of $\omega$ . Once $\omega$ is realized, $X(\omega)$ is deterministic. This distinction is essential when defining functions of random variables: if $Y=g(X)$ , then $Y(\omega)=g(X(\omega))$ , and its law is obtained by pushing $P_X$ through $g$ .

2.12 Expectation as weighted average

For a real-valued finite random variable $X$ , expectation is

\mathbb E[X]=\sum_{\omega\in\Omega}X(\omega)p(\omega).

Equivalently, summing over values,

\mathbb E[X]=\sum_x x\,P(X=x).

Expectation is the center of mass of the distribution, not necessarily a typical or possible value. A fair die has expectation $3.5$, though no roll equals $3.5$.

Expectation is linear:

\mathbb E[aX+bY]=a\mathbb E[X]+b\mathbb E[Y],

whether or not $X$ and $Y$ are independent. This is one of the most powerful facts in elementary probability because it lets global averages be computed without full distributional knowledge.

2.13 Variance, covariance, correlation

Variance measures squared deviation from the mean:

\operatorname{Var}(X)=\mathbb E[(X-\mathbb E X)^2].

The computational identity is

\operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E X)^2.

Variance is nonnegative and equals zero exactly when $X$ is almost surely constant.

Covariance is

\operatorname{Cov}(X,Y)=\mathbb E[(X-\mathbb E X)(Y-\mathbb E Y)] =\mathbb E[XY]-\mathbb E[X]\mathbb E[Y].

Correlation normalizes covariance:

\rho(X,Y)=\frac{\operatorname{Cov}(X,Y)} {\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.

Independence implies zero covariance when expectations exist, but zero covariance does not imply independence except in special families such as jointly Gaussian variables.
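A standard counterexample shows why zero covariance is weaker than independence. The sketch below assumes $X$ uniform on $\{-1,0,1\}$ and $Y=X^2$; this pair is an illustration chosen here, not one from the text.

```python
from fractions import Fraction

# Assumed law: X uniform on {-1, 0, 1}; Y = X**2 is a function of X.
pmf = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

def expect(g):
    """E[g(X)] as a weighted sum over the values of X."""
    return sum(g(x) * p for x, p in pmf.items())

# Cov(X, Y) = E[XY] - E[X] E[Y] with Y = X**2.
cov = expect(lambda x: x * x**2) - expect(lambda x: x) * expect(lambda x: x**2)

# Compare P(X = 0, Y = 0) with P(X = 0) P(Y = 0).
p_joint = expect(lambda x: 1 if (x == 0 and x**2 == 0) else 0)
p_product = expect(lambda x: 1 if x == 0 else 0) * expect(lambda x: 1 if x**2 == 0 else 0)
print(cov, p_joint, p_product)
```

The covariance vanishes by symmetry, yet the joint probability does not factor, so $X$ and $Y$ are dependent.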

2.14 Indicator variables

The indicator of an event $A$ is

1_A(\omega)= \begin{cases} 1,&\omega\in A,\\ 0,&\omega\notin A. \end{cases}

Its expectation is

\mathbb E[1_A]=P(A).

This turns probability questions into expectation questions and counting questions into sums of indicators.

If $N$ counts how many events $A_i$ occur, then

N=\sum_i1_{A_i}, \qquad \mathbb E[N]=\sum_iP(A_i).

No independence is required. This method is the backbone of probabilistic combinatorics, occupancy estimates, random graph thresholds, and existence arguments.

2.15 Linearity of expectation

Linearity states that for finite random variables,

\mathbb E\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbb E[X_i].

It does not require independence. If $X_i$ are indicators, the formula says that the expected count of successful events equals the sum of their individual probabilities.

The power of linearity is that it avoids dependence. To compute the expected number of fixed points in a random permutation $\pi\in S_n$ , define $I_i=1_{\{\pi(i)=i\}}$ . Then $P(I_i=1)=1/n$ , so

\mathbb E\sum_{i=1}^n I_i=\sum_{i=1}^n\frac1n=1.

The indicators are dependent, but expectation ignores that dependence for first-moment purposes.
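The fixed-point computation can be confirmed by enumerating all permutations for small $n$. A minimal sketch (the function name is illustrative):

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of fixed points over all n! permutations of range(n)."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(p) if i == v) for p in perms)
    return Fraction(total, len(perms))

for n in range(1, 7):
    print(n, expected_fixed_points(n))
```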

2.16 First probabilistic existence arguments

The first moment method uses expectation to prove existence. If $X\ge0$ and $\mathbb E[X]<a$ , then there exists an outcome with $X<a$ ; otherwise $X\ge a$ everywhere would force $\mathbb E[X]\ge a$ . Similarly, if $\mathbb E[X]>0$ for a nonnegative integer-valued $X$ , then there is an outcome with $X>0$ .

This converts random construction into deterministic existence. One defines a random object, counts desired or bad features by a random variable, computes its expectation, and concludes that some object has at least or at most the expected performance. The method gives existence without necessarily giving an efficient construction.

Chapter 3 — Counting and Discrete Models

3.1 Permutations and combinations

A permutation is an ordering. The number of permutations of $n$ distinct objects is

n!=n(n-1)\cdots2\cdot1.

The number of ordered selections of $k$ distinct objects from $n$ is

(n)_k=n(n-1)\cdots(n-k+1)=\frac{n!}{(n-k)!}.

A combination is an unordered selection. The number of $k$ -element subsets of an $n$ -element set is

\binom nk=\frac{n!}{k!(n-k)!}.

The division by $k!$ removes the ordering of the selected elements. These formulas underlie finite uniform probability, since many events are counted by counting favorable selections divided by total selections.

3.2 Multinomial coefficients

The multinomial coefficient counts ways to split $n$ labeled objects into $r$ labeled boxes of sizes $k_1,\ldots,k_r$ , where $\sum_i k_i=n$ :

\binom{n}{k_1,\ldots,k_r}=\frac{n!}{k_1!\cdots k_r!}.

It appears in the expansion

(x_1+\cdots+x_r)^n = \sum_{k_1+\cdots+k_r=n} \binom{n}{k_1,\ldots,k_r}x_1^{k_1}\cdots x_r^{k_r}.

In probability, multinomial coefficients describe counts from repeated categorical trials. If each trial produces category $i$ with probability $p_i$ , then the probability of counts $N_i=k_i$ is

P(N_1=k_1,\ldots,N_r=k_r) = \frac{n!}{k_1!\cdots k_r!}p_1^{k_1}\cdots p_r^{k_r}.

3.3 Occupancy problems

Occupancy problems distribute balls into boxes. If $m$ balls are placed independently and uniformly into $n$ boxes, the sample space has size $n^m$ . The occupancy number $N_j$ of box $j$ has binomial distribution:

N_j\sim \operatorname{Binomial}(m,1/n).

The joint distribution of occupancies is multinomial: for nonnegative integers $k_1,\ldots,k_n$ with $k_1+\cdots+k_n=m$,

P(N_1=k_1,\ldots,N_n=k_n) = \frac{m!}{k_1!\cdots k_n!}\left(\frac1n\right)^m.

Occupancy models encode hashing, allocation, collisions, load balancing, coupon collection, and random mappings. They are simple enough for exact counting but rich enough to display threshold behavior. For example, the expected number of empty boxes is

n\left(1-\frac1n\right)^m\approx ne^{-m/n}.
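For small parameters the empty-box formula can be compared with a full enumeration of the $n^m$ placements. The sketch below uses illustrative function names and exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def expected_empty_exact(m, n):
    """Average number of empty boxes over all n**m equally likely placements."""
    total_empty = 0
    for placement in product(range(n), repeat=m):
        total_empty += n - len(set(placement))  # boxes never hit
    return Fraction(total_empty, n**m)

def expected_empty_formula(m, n):
    """n * (1 - 1/n)**m, from linearity over per-box indicators."""
    return n * (1 - Fraction(1, n)) ** m

print(expected_empty_exact(4, 3), expected_empty_formula(4, 3))
```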

3.4 Balls into bins

The balls-into-bins model studies the load profile after random allocation. With $m$ balls and $n$ bins, each bin load has mean $m/n$ . Indicator methods compute many structural statistics. Let $I_j$ indicate that bin $j$ is empty. Then

\mathbb E\sum_{j=1}^n I_j = n\left(1-\frac1n\right)^m.

The maximum load is subtler because it depends on tail probabilities and dependence. When $m=n$ , the maximum load is of order

\frac{\log n}{\log\log n}

with high probability. This is one of the first places where expectation alone is insufficient; concentration and tail bounds are needed to control extremes.

3.5 Coupon collector

The coupon collector problem asks how many independent uniform samples from $n$ coupon types are needed to see all types. Let $T$ be the completion time. Decompose

T=T_1+\cdots+T_n,

where $T_k$ is the waiting time to collect a new coupon after $k-1$ distinct coupons have already appeared. Then $T_k$ is geometric with success probability $(n-k+1)/n$ , so

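\mathbb E[T] = \sum_{k=1}^n \frac{n}{n-k+1} = n\sum_{j=1}^n\frac1j = nH_n = n\log n+\gamma n+O(1),

where $H_n$ is the $n$-th harmonic number and $\gamma\approx0.5772$ is the Euler–Mascheroni constant. The problem shows the difference between mean scale and high-probability completion. For fixed $c$, $P(T\le n\log n+cn)\to e^{-e^{-c}}$ as $n\to\infty$, so the probability of completion moves from near zero to near one inside a window of width order $n$ around $n\log n$. The final coupons dominate the waiting time.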
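The exact mean and its logarithmic estimate are easy to compare numerically. This is a sketch under the decomposition above; the constant `EULER_GAMMA` is typed in by hand because the Python standard library does not provide it.

```python
import math
from fractions import Fraction

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, entered manually

def expected_collection_time(n):
    """E[T] as the sum of geometric means n / (n - k + 1) for k = 1..n."""
    return sum(Fraction(n, n - k + 1) for k in range(1, n + 1))

def asymptotic_estimate(n):
    """Leading terms n log n + gamma n of n * H_n."""
    return n * math.log(n) + EULER_GAMMA * n

n = 100
print(float(expected_collection_time(n)), asymptotic_estimate(n))
```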

3.6 Birthday problem

The birthday problem asks for the probability of at least one collision among $k$ independent uniform samples from $n$ possible birthdays. The no-collision probability is

\frac{n(n-1)\cdots(n-k+1)}{n^k} = \prod_{j=0}^{k-1}\left(1-\frac jn\right).

For $k\ll n$, use $\log(1-x)\approx-x$ and $\sum_{j=0}^{k-1}j=k(k-1)/2$:

P(\text{no collision}) \approx \exp\left(-\frac{k(k-1)}{2n}\right).

The threshold occurs when $k^2/n$ is order one, so $k\sim\sqrt n$ . This square-root threshold appears in hashing, cryptography, random graphs, and collision search. The lesson is that rare pairwise coincidences become likely when the number of pairs is large.
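The exact product and the exponential approximation can be compared directly. The sketch below uses the classical illustrative choice $n=365$; the function names are introduced here.

```python
import math

def no_collision_exact(k, n):
    """Product of (1 - j/n) for j = 0..k-1."""
    prob = 1.0
    for j in range(k):
        prob *= 1 - j / n
    return prob

def no_collision_approx(k, n):
    """exp(-k(k-1)/(2n)), from log(1 - x) ~ -x."""
    return math.exp(-k * (k - 1) / (2 * n))

for k in (10, 22, 23, 40):
    print(k, 1 - no_collision_exact(k, 365), 1 - no_collision_approx(k, 365))
```

With $n=365$ the collision probability first exceeds one half at $k=23$, consistent with the $\sqrt n$ scale.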

3.7 Hypergeometric model

The hypergeometric distribution describes sampling without replacement. If a population has $N$ objects, $K$ successes, and $N-K$ failures, then drawing $n$ objects without replacement gives

P(X=x) = \frac{\binom Kx\binom{N-K}{n-x}}{\binom Nn}.

The expectation is

\mathbb E[X]=n\frac KN.

Unlike the binomial model, draws are dependent. Observing a success slightly reduces the chance of another success. Nevertheless, the expectation matches the binomial expectation with $p=K/N$ . When $N$ is large compared with $n$ , the hypergeometric distribution is close to binomial because sampling without replacement approximates sampling with replacement.
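The mass function and the mean identity can be checked exactly for a small population. The parameters below ($N=20$, $K=7$, $n=5$) are arbitrary illustrative values.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(X = x) when drawing n objects without replacement from N, K of them successes."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

N, K, n = 20, 7, 5
# Feasible success counts: cannot exceed K or n, and cannot leave more failures than exist.
support = range(max(0, n - (N - K)), min(n, K) + 1)

total_mass = sum(hypergeom_pmf(x, N, K, n) for x in support)
mean = sum(x * hypergeom_pmf(x, N, K, n) for x in support)
print(total_mass, mean)
```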

3.8 Binomial model

The binomial distribution counts successes in $n$ independent Bernoulli trials with success probability $p$ :

P(X=k)=\binom nkp^k(1-p)^{n-k}, \qquad k=0,\ldots,n.

Its mean and variance are

\mathbb E[X]=np,\qquad \operatorname{Var}(X)=np(1-p).

The binomial model is the canonical finite independent-sum model. Its generating function is

\mathbb E[z^X]=(1-p+pz)^n.

For large $n$ , it has different asymptotic regimes: normal approximation when $np(1-p)$ is large, Poisson approximation when $n\to\infty$ , $p\to0$ , and $np\to\lambda$ , and large-deviation behavior far from the mean.

3.9 Geometric and negative binomial models

The geometric distribution models the waiting time $T$ until the first success in independent Bernoulli trials. With success probability $p$ ,

P(T=k)=(1-p)^{k-1}p,\qquad k\ge1,

and

\mathbb E[T]=\frac1p.

It has the memoryless property:

P(T>s+t\mid T>s)=P(T>t).

The negative binomial distribution models the number of trials needed to obtain $r$ successes. If $T_r$ is the trial of the $r$ -th success,

P(T_r=k)=\binom{k-1}{r-1}p^r(1-p)^{k-r}, \qquad k\ge r.

The variable $T_r$ is a sum of $r$ independent geometric waiting times, each with success probability $p$.

3.10 Poisson approximation in finite settings

The Poisson distribution with parameter $\lambda>0$ is

P(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.

It approximates sums of many rare, weakly dependent indicators. The simplest limit is binomial:

\operatorname{Binomial}(n,\lambda/n)\Rightarrow \operatorname{Poisson}(\lambda).

Indeed,

\binom nk\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k} \to e^{-\lambda}\frac{\lambda^k}{k!}.

Finite Poisson approximation is often justified by Chen–Stein methods or by bounding dependence neighborhoods. The heuristic is that if events are individually rare and do not cluster too strongly, their count behaves approximately Poisson. The failure mode is hidden dependence that creates clumping.
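For the independent binomial case the quality of the approximation can be measured numerically. The sketch below computes a truncated total variation distance; the truncation point `cutoff` and the value $\lambda=2$ are assumptions made for illustration, and the neglected tail is negligible for small $\lambda$.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial(n, p) mass at k, zero outside 0..n."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson(lam) mass at k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def total_variation(n, lam, cutoff=60):
    """Half the L1 distance between Binomial(n, lam/n) and Poisson(lam), summed up to `cutoff`."""
    return 0.5 * sum(
        abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
        for k in range(cutoff + 1)
    )

for n in (10, 100, 1000):
    print(n, total_variation(n, 2.0))
```

The distance shrinks as $n$ grows with $np=\lambda$ fixed. This check covers only independent trials; it says nothing about the dependent settings where Chen–Stein methods are needed.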

3.11 Random graphs: first finite examples

The Erdős–Rényi graph $G(n,p)$ has vertex set $\{1,\ldots,n\}$ , with each possible edge included independently with probability $p$ . The number of edges is

E\sim\operatorname{Binomial}\left(\binom n2,p\right),

\mathbb E[E]=p\binom n2.

Subgraph counts are sums of indicators. For triangles,

T=\sum_{\{i,j,k\}}1_{\{ij,jk,ik\text{ present}\}}, \qquad \mathbb E[T]=\binom n3p^3.

Random graphs demonstrate thresholds. A property may be unlikely below a scale of $p$ and likely above it. For example, isolated vertices disappear around $p\sim(\log n)/n$ , and graph connectivity emerges at the same scale. Random graphs turn combinatorial existence into probabilistic mass.
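The triangle expectation can be verified by summing over every graph on a small vertex set. The sketch below enumerates all $2^{\binom n2}$ edge configurations, so it is feasible only for tiny $n$; the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def expected_triangles_exact(n, p):
    """E[T] in G(n, p) by weighting every edge configuration by its probability."""
    edges = list(combinations(range(n), 2))
    triples = list(combinations(range(n), 3))
    expectation = Fraction(0)
    for present in product([False, True], repeat=len(edges)):
        edge_set = {e for e, on in zip(edges, present) if on}
        weight = Fraction(1)
        for on in present:
            weight *= p if on else 1 - p
        triangles = sum(
            1 for (i, j, k) in triples
            if {(i, j), (j, k), (i, k)} <= edge_set
        )
        expectation += weight * triangles
    return expectation

p = Fraction(1, 3)
print(expected_triangles_exact(4, p), comb(4, 3) * p**3)
```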

3.12 Counting as probability and probability as counting

Counting and probability are dual in finite uniform spaces:

P(E)=\frac{|E|}{|\Omega|}, \qquad |E|=P(E)|\Omega|.

This means a probability estimate can imply a counting estimate and a counting estimate can imply a probability estimate. Many combinatorial arguments choose a random object uniformly, estimate the probability it has a property, and then multiply by the total number of objects.

The duality weakens in nonuniform spaces but survives through weighted counting. A probability law is a weighted counting scheme. The conceptual shift is from cardinality to measure: finite probability is counting with weights, while measure-theoretic probability is infinite weighted event geometry.

Chapter 4 — Why Measure Theory Enters Probability

4.1 Failure of purely discrete models

Purely discrete models assign probability by summing atomic masses:

P(E)=\sum_{\omega\in E}p(\omega).

This works when the sample space is finite or countable and the total mass is distributed over atoms. It fails for distributions such as uniform measure on $[0,1]$ , where each singleton should have mass zero but the whole interval has mass one. A countable sum of zero masses remains zero, so nonatomic probability cannot be represented by pointwise mass summation.

The issue is not merely continuous variables. Infinite sequences also strain discrete intuition. A sequence of independent fair coin flips has sample space $\{0,1\}^{\mathbb N}$ , uncountable in size. Each individual infinite sequence has probability zero, yet the full space has probability one. Events such as “infinitely many heads occur” require countable intersections and unions. Measure theory is the carrier that handles such events.

4.2 Continuous variables and zero-probability points

For a continuous random variable with density $f$ , probabilities are assigned by integration:

P(a\le X\le b)=\int_a^b f(x)\,dx.

For a point,

P(X=x)=\int_x^x f(t)\,dt=0.

Thus every individual outcome has probability zero, even though one of them must occur. This is not a contradiction; probability zero does not mean logical impossibility.

The event $\{X=x\}$ has zero mass for each fixed $x$ , but

\{X\in[0,1]\}=\bigcup_{x\in[0,1]}\{X=x\}

is an uncountable union. Countable additivity does not license summing over uncountably many null events. This is one of the fundamental reasons the σ-algebra and countability boundary matter.

4.3 Countable additivity versus finite additivity

Finite additivity states that for disjoint $A,B$ ,

P(A\cup B)=P(A)+P(B).

Countable additivity extends this to countably many disjoint events:

P\Big(\bigcup_{n=1}^{\infty}A_n\Big)=\sum_{n=1}^{\infty}P(A_n).

This property is essential for limits. If $A_n\uparrow A$ , then countable additivity implies continuity from below:

P(A)=\lim_{n\to\infty}P(A_n).

If $A_n\downarrow A$ , then

P(A)=\lim_{n\to\infty}P(A_n)

provided $P(A_1)<\infty$, which holds automatically in a probability space. This is continuity from above.
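The step from countable additivity to continuity from below is short enough to write out. The following is the standard argument in sketch form; the auxiliary sets $B_n$ are notation introduced here.

```latex
% Disjointify the increasing sequence A_1 \subseteq A_2 \subseteq \cdots with union A.
B_1 = A_1, \qquad B_n = A_n \setminus A_{n-1} \quad (n \ge 2).

% The B_n are pairwise disjoint, and
\bigcup_{k=1}^{n} B_k = A_n, \qquad \bigcup_{k=1}^{\infty} B_k = A.

% Countable additivity for A, then finite additivity for each partial sum:
P(A) = \sum_{k=1}^{\infty} P(B_k)
     = \lim_{n\to\infty} \sum_{k=1}^{n} P(B_k)
     = \lim_{n\to\infty} P(A_n).
```

The decreasing case follows by applying the same argument to the complements, since $A_n\downarrow A$ gives $A_n^c\uparrow A^c$.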

Finite additivity can support some decision-theoretic frameworks, but it does not give the same convergence machinery. Limit theorems, Borel–Cantelli, product processes, Lebesgue integration, and conditional expectation all depend on countable additivity.

4.4 Legal events and illegal events

In finite spaces, every subset is legal. In continuous spaces, not every subset can be assigned probability while preserving desirable properties such as countable additivity and translation invariance. Nonmeasurable sets exist under standard set-theoretic assumptions. Therefore probability is defined only on a σ-algebra $\mathcal F\subseteq2^\Omega$ .

A legal event is an element of $\mathcal F$ . An illegal event is a subset of $\Omega$ outside $\mathcal F$ . The statement $P(E)$ is meaningful only if $E\in\mathcal F$ . This restriction is not a defect but a consistency condition. It prevents the probability calculus from being forced to assign incompatible masses to pathological sets.

4.5 The σ-algebra as event grammar

A σ-algebra $\mathcal F$ on $\Omega$ is a family of subsets satisfying:

\Omega\in\mathcal F,\qquad E\in\mathcal F\Rightarrow E^c\in\mathcal F,\qquad E_1,E_2,\ldots\in\mathcal F\Rightarrow\bigcup_{n=1}^{\infty}E_n\in\mathcal F.

It follows that $\mathcal F$ is also closed under countable intersections. Thus σ-algebras support countable logical operations.

The σ-algebra is the grammar of probabilistic propositions. It determines which statements about the outcome are admissible. If $X$ is a random variable, then events such as $\{X\le t\}$ , $\{X\in B\}$ , and $\{\lim X_n \text{ exists}\}$ must belong to $\mathcal F$ before their probabilities can be discussed.

4.6 The probability measure as normalized measure

A probability measure is a measure with total mass one. Formally,

P:\mathcal F\to[0,1]

satisfies $P(\varnothing)=0$ , $P(\Omega)=1$ , and countable additivity over disjoint events. The normalization $P(\Omega)=1$ distinguishes probability from general measure theory, where total mass may be finite, infinite, or σ-finite.

This normalization makes probability a calculus of relative mass. Conditional probability, expectation, variance, and distribution all depend on the measure. The probability measure is not merely a list of event weights; it is the structural object that supports integration, pushforward, product construction, and convergence.

4.7 Null sets and almost sure reasoning

A null set is an event $N$ with $P(N)=0$ . A property holds almost surely if it fails only on a null set. Thus $X=Y$ almost surely means

P(\{\omega:X(\omega)=Y(\omega)\})=1.

Almost sure equality is not pointwise equality. It is equality in the quotient space where null differences are ignored.

This quotient is essential in analysis. $L^p$ spaces identify random variables that differ only on null sets. Conditional expectation is unique only almost surely. Sample-path properties of stochastic processes often hold almost surely but not for every $\omega$ . The null-set firewall prevents exporting measure-level statements as deterministic statements.

4.8 Probability theory as measure theory plus independence

Measure theory gives events, measures, functions, integration, products, and convergence. Probability theory adds independence, conditioning, stochastic processes, and asymptotic laws. Independence is not a primitive notion of general measure theory, but it is formalized there through products and factorization:

P(A\cap B)=P(A)P(B),

or for random variables,

\mu_{(X,Y)}=\mu_X\otimes\mu_Y.

The slogan “probability is measure theory plus independence” is accurate as a structural compression. Measure theory supplies the carrier; independence supplies probabilistic product structure; conditioning supplies information-relative projection; limit theorems supply asymptotic transport. Without measure theory, the infinite and continuous parts collapse into ad hoc rules.

Chapter 5 — Measurable Spaces and Events

5.1 Sets, algebras, σ-algebras

An algebra of sets on $\Omega$ is a family of subsets that contains $\Omega$ and is closed under complements and finite unions. A σ-algebra is additionally closed under countable unions. Every σ-algebra is an algebra, but not every algebra is a σ-algebra. Finite probability can often use algebras because only finite operations are needed. Modern probability needs σ-algebras because limiting events are naturally countable.

For example, if $A_n$ is the event that a process satisfies a condition at time $n$ , then “the condition occurs infinitely often” is

\limsup_{n\to\infty}A_n = \bigcap_{m=1}^{\infty}\bigcup_{n\ge m}A_n.

This expression requires countable unions and intersections. Closure under finite operations alone, as in an algebra, is not enough.

5.2 Generated σ-algebras

Given a collection of subsets $\mathcal C\subseteq2^\Omega$ , the σ-algebra generated by $\mathcal C$ , denoted $\sigma(\mathcal C)$ , is the smallest σ-algebra containing $\mathcal C$ . It is the intersection of all σ-algebras that contain $\mathcal C$ :

\sigma(\mathcal C)=\bigcap\{\mathcal F:\mathcal F\text{ is a σ-algebra and }\mathcal C\subseteq\mathcal F\}.

Generated σ-algebras let probability specify elementary observable events and then close them under legal countable operations. For a real random variable $X$ , the information it generates is

\sigma(X)=\{X^{-1}(B):B\in\mathcal B(\mathbb R)\}.

This is the σ-algebra of events determined by observing $X$ .
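On a finite sample space the closure in the definition of a generated σ-algebra can be carried out by brute force, which makes the idea concrete. The sketch below is illustrative only. The helper names `generate_sigma_algebra` and `sigma_of_map` are hypothetical, and because the space is finite, countable operations reduce to finite ones, so the loop closes the family under complements and pairwise unions.

```python
from itertools import combinations

def generate_sigma_algebra(omega, generators):
    """Smallest family containing `generators` that is closed under
    complement and union inside the finite set `omega`."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:                      # close under complements
            if omega - a not in family:
                family.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):  # close under pairwise unions
            if a | b not in family:
                family.add(a | b)
                changed = True
    return family

def sigma_of_map(omega, X):
    """sigma(X) on a finite space: generated by the level sets {X = v}."""
    values = {X(w) for w in omega}
    level_sets = [frozenset(w for w in omega if X(w) == v) for v in values]
    return generate_sigma_algebra(omega, level_sets)

# Die roll observed only through parity: sigma(X) has four events.
die = range(1, 7)
parity_events = sigma_of_map(die, lambda w: w % 2)

# Generators {1} and {1,2} on {1,2,3,4} yield atoms {1}, {2}, {3,4}.
small = generate_sigma_algebra(range(1, 5), [{1}, {1, 2}])
```

Here `parity_events` contains only the empty set, the full space, the odd outcomes, and the even outcomes, matching the idea that observing parity cannot distinguish individual faces.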

5.3 Borel σ-algebra on topological spaces

For a topological space $S$ , the Borel σ-algebra $\mathcal B(S)$ is generated by the open sets:

\mathcal B(S)=\sigma(\{U\subseteq S:U\text{ open}\}).

On $\mathbb R$ , it is also generated by open intervals, closed intervals, half-lines $(-\infty,t]$ , or intervals with rational endpoints. This flexibility is useful for proving measurability.

The Borel σ-algebra is the standard event space for random variables taking values in metric or topological spaces. A real-valued random variable $X$ is measurable if $X^{-1}(B)\in\mathcal F$ for every Borel set $B$ . It suffices to check

\{X\le t\}\in\mathcal F

for all $t\in\mathbb R$ .

5.4 Product σ-algebras

If $(S,\mathcal S)$ and $(T,\mathcal T)$ are measurable spaces, the product σ-algebra is

\mathcal S\otimes\mathcal T = \sigma(\{A\times B:A\in\mathcal S,\ B\in\mathcal T\}).

It is the smallest σ-algebra making coordinate projections measurable.

Product σ-algebras are required for joint random variables. If $X:\Omega\to S$ and $Y:\Omega\to T$ are measurable, then $(X,Y):\Omega\to S\times T$ is measurable with respect to $\mathcal S\otimes\mathcal T$ . The joint law of $(X,Y)$ is a probability measure on this product space. Independence is then expressed by factorization of that joint law.

5.5 Completion of a measure space

A measure space is complete if every subset of every null set is measurable. Given $(\Omega,\mathcal F,P)$ , its completion adds all subsets of null sets to $\mathcal F$ . The completed σ-algebra is

\overline{\mathcal F} = \{E\cup N:E\in\mathcal F,\ N\subseteq Z\text{ for some }Z\in\mathcal F,\ P(Z)=0\}.

The measure extends by $\overline P(E\cup N)=P(E)$ .

Completion is natural because null subsets cannot affect probabilities. However, completion can interact subtly with product spaces and regular conditional probabilities. One must track whether the model uses Borel σ-algebras, Lebesgue completions, or completed filtrations.

5.6 Measurable subsets

A measurable subset is an event belonging to the chosen σ-algebra. In $\mathbb R^n$ , Borel sets include open sets, closed sets, countable intersections of open sets, countable unions of closed sets, and many more. The Lebesgue measurable sets form the completion of the Borel σ-algebra under Lebesgue measure: each one differs from a Borel set by a subset of a Borel null set.

Measurability is a legality condition, not a size condition. A set can be dense, fractal, uncountable, or topologically complicated and still be measurable. Conversely, nonmeasurable sets are not necessarily visually exotic; their pathology lies in incompatibility with the desired measure properties. In probability, no event probability exists until measurability is established.

5.7 Countable operations on events

If $A_n\in\mathcal F$ , then

\bigcup_{n=1}^{\infty}A_n,\qquad \bigcap_{n=1}^{\infty}A_n

belong to $\mathcal F$ . This allows one to define events such as “at least one $A_n$ occurs,” “all $A_n$ occur,” “infinitely many $A_n$ occur,” and “eventually all $A_n$ occur.”

The two standard limiting events are

\limsup A_n=\bigcap_{m=1}^{\infty}\bigcup_{n\ge m}A_n

and

\liminf A_n=\bigcup_{m=1}^{\infty}\bigcap_{n\ge m}A_n.

The first means infinitely many $A_n$ occur; the second means all but finitely many $A_n$ occur.

5.8 Uncountable operation traps

A σ-algebra is not generally closed under uncountable unions or intersections. If $\{A_t:t\in T\}\subseteq\mathcal F$ and $T$ is uncountable, it does not follow that

\bigcup_{t\in T}A_t

is measurable. Sometimes it is measurable for structural reasons, but not automatically.

This matters in continuous probability. For each $t$ , the event $\{X=t\}$ may be measurable and null, but the union over all $t\in\mathbb R$ is $\Omega$ if $X$ is real-valued. Countable additivity does not apply to that union. Any argument that sums probabilities over uncountably many events is invalid unless replaced by integration, separability, or a countable reduction.

5.9 Tail events

For a sequence of random variables $X_1,X_2,\ldots$ , the tail σ-algebra is

\mathcal T=\bigcap_{n=1}^{\infty}\sigma(X_n,X_{n+1},\ldots).

It contains events unaffected by changing finitely many initial coordinates. Examples include convergence of averages, infinitely many occurrences, and limiting frequencies.

For independent sequences, Kolmogorov’s zero-one law states that every tail event has probability $0$ or $1$ . This is a structural theorem: events depending only on the infinite tail cannot have intermediate probability under full independence. Tail σ-algebras encode asymptotic information stripped of finite initial noise.

5.10 Germ σ-algebras

A germ σ-algebra records information arbitrarily close to a point, time, or boundary. For a stochastic process, the germ at time $t$ may be represented as an intersection of σ-algebras over shrinking neighborhoods:

\mathcal G_t=\bigcap_{\varepsilon>0}\sigma(X_s:|s-t|<\varepsilon).

It captures infinitesimal local information rather than global trajectory information.

Germ σ-algebras arise in Markov processes, Brownian motion, stochastic calculus, and local field behavior. Their analysis often requires right-continuity, completion, separability, or regularity assumptions. The danger is to assume that infinitesimal information is trivial or maximal without proving the relevant zero-one or regularity law.

5.11 Event equivalence modulo null sets

Events $A$ and $B$ are equivalent modulo null sets if

P(A\triangle B)=0,

where $A\triangle B=(A\setminus B)\cup(B\setminus A)$ . In probability, such events are often indistinguishable because they have the same probability and differ only on a null set.

This quotient viewpoint is essential for $L^p$ spaces and conditional expectation. However, modulo-null equivalence must not be exported as literal equality when pointwise structure matters. In stochastic processes, two versions may agree at each fixed time almost surely but fail to have indistinguishable sample paths unless stronger conditions are imposed.

Chapter 6 — Probability Spaces

6.1 Probability space $(\Omega,\mathcal F,P)$

A probability space consists of a sample space $\Omega$ , a σ-algebra $\mathcal F$ , and a probability measure $P$ . It is the formal carrier for all events and random variables in a model. The axioms are:

P(\Omega)=1,\qquad P(E)\ge0,\qquad P\Big(\bigcup_{n=1}^{\infty}E_n\Big)=\sum_{n=1}^{\infty}P(E_n)

for pairwise disjoint $E_n$ .

Every probability expression must be interpretable in this carrier or in a declared extension/quotient. If $X$ and $Y$ are random variables on different spaces, then $X+Y$ is undefined until a joint space or coupling is specified. The probability space is not background decoration; it is the domain of legal probabilistic syntax.

6.2 Atoms and nonatomic spaces

An atom is a measurable set $A$ with $P(A)>0$ such that every measurable $B\subseteq A$ has $P(B)=0$ or $P(B)=P(A)$ . Discrete probability spaces are atomic; the atoms are often singleton outcomes with positive mass. A nonatomic space has no atoms. Lebesgue probability on $[0,1]$ is nonatomic.

Atomic and nonatomic spaces behave differently. In atomic spaces, probabilities decompose into point masses. In nonatomic spaces, mass can be split continuously: by a theorem of Sierpiński, in a nonatomic probability space every $t\in[0,1]$ is the probability of some event. Continuous randomization, uniform variables, and many coupling constructions rely on nonatomic structure.

6.3 Discrete probability measures

A discrete probability measure is concentrated on a finite or countable set. If $S=\{x_i\}$ , then

\mu=\sum_i p_i\delta_{x_i}, \qquad p_i\ge0,\quad \sum_i p_i=1.

For an event $A$ ,

\mu(A)=\sum_{i:x_i\in A}p_i.

Discrete measures are computationally transparent. Expectation becomes summation:

\mathbb E[f(X)]=\sum_i f(x_i)p_i.

However, discrete methods fail when no atoms carry mass, as with continuous distributions. Many probability models combine discrete and continuous components, so one must not assume all laws have densities or all laws have mass functions.
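A discrete measure can be represented directly as a table of atoms and masses, and both formulas above become finite sums. The following is a minimal sketch using exact rational arithmetic. The fair-die example and the helper names `measure` and `expectation` are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die as a table {x_i: p_i}.
die = {x: Fraction(1, 6) for x in range(1, 7)}

def measure(mu, event):
    """mu(A): sum of the masses p_i over atoms x_i satisfying `event`."""
    return sum((p for x, p in mu.items() if event(x)), Fraction(0))

def expectation(mu, f):
    """E[f(X)]: sum of f(x_i) * p_i over the atoms."""
    return sum((f(x) * p for x, p in mu.items()), Fraction(0))

total_mass = sum(die.values())                 # normalization check
p_even = measure(die, lambda x: x % 2 == 0)    # mass of {2, 4, 6}
mean = expectation(die, lambda x: x)           # (1 + ... + 6) / 6
```

The same two functions work for any finite table of nonnegative masses summing to one. They have nothing to say about a law with no atoms, which is the limitation noted above.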

6.4 Continuous probability measures

The term continuous probability measure is used in two senses. Broadly, it means a measure with no atoms. More narrowly, it means a measure that admits a density $f$ with respect to Lebesgue measure:

\mu(A)=\int_A f(x)\,dx.

The density must satisfy $f\ge0$ and $\int f=1$ . If $X$ has density $f$ , then

P(a\le X\le b)=\int_a^b f(x)\,dx.

Not every continuous measure has a density. The Cantor distribution is nonatomic but singular with respect to Lebesgue measure. Thus “continuous” and “has a density” are different properties. The correct carrier distinction is atomic, absolutely continuous, singular, and mixtures thereof.

6.5 Singular measures

A probability measure $\mu$ is singular with respect to another measure $\nu$ , written $\mu\perp\nu$ , if there exists a measurable set $A$ such that $\mu(A)=1$ and $\nu(A)=0$ . The Cantor measure is singular with respect to Lebesgue measure: it is concentrated on the Cantor set, which has Lebesgue measure zero, while having no atoms.

Singular measures show why densities are not universal. A random variable may have a continuous distribution function but no density. Any argument that differentiates a CDF or writes probabilities as $\int_A f\,dx$ must verify absolute continuity. Otherwise it silently changes carrier.
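The singularity of the Cantor measure can be checked numerically at a finite level of the construction. The sketch below evaluates the Cantor distribution function from ternary digits and compares two quantities on the level-$n$ cover of the Cantor set: total Lebesgue length and total Cantor mass. The function names and the choice of level 5 are illustrative assumptions, and the digit loop is truncated at a fixed depth, which is exact for the triadic endpoints used here.

```python
from fractions import Fraction

def cantor_cdf(x, depth=60):
    """Cantor distribution function on [0, 1], read off ternary digits."""
    x = Fraction(x)
    if x <= 0:
        return Fraction(0)
    if x >= 1:
        return Fraction(1)
    total, weight = Fraction(0), Fraction(1, 2)
    for _ in range(depth):
        x *= 3
        digit = int(x)            # next ternary digit: 0, 1, or 2
        x -= digit
        if digit == 1:            # first digit 1: F equals total + weight here
            return total + weight
        total += weight * (digit // 2)
        weight /= 2
    return total

def level_intervals(n):
    """The 2^n closed intervals left after n middle-third removals."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        refined = []
        for a, b in intervals:
            third = (b - a) / 3
            refined += [(a, a + third), (b - third, b)]
        intervals = refined
    return intervals

cover = level_intervals(5)
lebesgue_length = sum(b - a for a, b in cover)                        # (2/3)^5
cantor_mass = sum(cantor_cdf(b) - cantor_cdf(a) for a, b in cover)    # stays 1
```

As the level grows, the Lebesgue length of the cover shrinks geometrically while the Cantor mass it carries remains one, which is the finite-level picture behind $\mu\perp\lambda$.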

6.6 Mixed distributions

A mixed distribution has both discrete and continuous components. For example,

\mu=p\delta_0+(1-p)\lambda_{[0,1]},

where $\lambda_{[0,1]}$ is uniform measure on $[0,1]$ . Then $P(X=0)=p$ , while conditional on the continuous component, $X$ spreads over $[0,1]$ .

The general Lebesgue decomposition separates a measure into absolutely continuous, singular continuous, and atomic components relative to a reference measure. Mixed distributions appear in survival models, censored data, spike-and-slab priors, default models, and random variables with boundary masses. Treating them as purely discrete or purely continuous loses mass.
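For the mixture $p\delta_0+(1-p)\lambda_{[0,1]}$ above, the distribution function and the mean can be written down in closed form. The sketch below does so with exact arithmetic. The value $p=1/4$ and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def mixed_cdf(t, p):
    """CDF of mu = p * delta_0 + (1 - p) * Uniform[0, 1]."""
    if t < 0:
        return Fraction(0)
    if t >= 1:
        return Fraction(1)
    return p + (1 - p) * t       # atom at 0 plus the uniform part on [0, t]

def mixed_mean(p):
    """E[X]: the atom contributes 0 * p, the uniform part (1 - p) * 1/2."""
    return (1 - p) * Fraction(1, 2)

p = Fraction(1, 4)
jump_at_zero = mixed_cdf(Fraction(0), p) - mixed_cdf(Fraction(-1), p)  # = P(X = 0)
density_part_mass = 1 - p      # total mass seen by a density-only description
```

A density-only description accounts for mass `density_part_mass`, which is strictly less than one. The missing amount is exactly the jump of the CDF at zero.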

6.7 Pushforward measures

If $X:\Omega\to S$ is measurable and $P$ is a probability measure on $\Omega$ , the pushforward law $X_*P$ on $S$ is defined by

X_*P(B)=P(X^{-1}(B))=P(X\in B).

This is the distribution of $X$ . It allows one to study $X$ without retaining the entire original sample space.

Pushforward is the correct abstraction behind transformation of variables. If $Y=g(X)$ , then

\mu_Y=g_*\mu_X.

When densities exist and $g$ is smooth and invertible, this yields the familiar Jacobian formula. But the pushforward definition works more generally, including discrete, singular, and mixed laws.

6.8 Pullback of events

Given a measurable map $X:\Omega\to S$ , an event $B\subseteq S$ pulls back to

X^{-1}(B)=\{\omega:X(\omega)\in B\}.

Measurability of $X$ means Borel or measurable events in the target pull back to legal events in $\Omega$ . Thus statements about $X$ become events in the original probability space.

Pullback and pushforward are dual. Pullback turns target propositions into source events; pushforward transports source probability to target laws. The equation

X_*P(B)=P(X^{-1}(B))

is the bridge. Probability of random-variable statements is always computed by pulling back the statement to the sample space or by using the pushed-forward law.
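On a finite sample space, both directions of this duality are a few lines of code. The sketch below is illustrative. The names `pushforward` and `pullback`, the die-based sample space, and the map $X(\omega)=\omega \bmod 3$ are assumptions introduced for the example.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(P, X):
    """Law of X: transport each mass P({omega}) to the value X(omega)."""
    law = defaultdict(Fraction)
    for omega, p in P.items():
        law[X(omega)] += p
    return dict(law)

def pullback(P, X, B):
    """X^{-1}(B): the set of sample points whose image lies in B."""
    return {omega for omega in P if X(omega) in B}

P = {w: Fraction(1, 6) for w in range(1, 7)}   # fair die as sample space

def X(w):
    return w % 3

law = pushforward(P, X)
B = {0, 1}
via_law = sum(law[b] for b in B)                       # X_* P (B)
via_pullback = sum(P[w] for w in pullback(P, X, B))    # P(X^{-1}(B))
```

Both computations return the same number, which is the bridge equation evaluated on a concrete target event.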

6.9 Model extension and sample-space enlargement

A probability model may need enlargement to include new randomness. If $(\Omega,\mathcal F,P)$ models $X$ , then adding an independent $Y$ may require

(\Omega',\mathcal F',P')=(\Omega\times S,\mathcal F\otimes\mathcal S,P\otimes\nu).

Old events lift by projection:

E\mapsto E\times S.

Their probabilities are preserved:

P'(E\times S)=P(E).

Model extension shows that sample spaces are representations, not the probabilistic objects themselves. The same event may have different set-theoretic representatives in different carriers while preserving probability. What must be invariant are the laws and joint relations explicitly required by the claim.

6.10 Product probability spaces

Given probability spaces $(\Omega_1,\mathcal F_1,P_1)$ and $(\Omega_2,\mathcal F_2,P_2)$ , the product space is

(\Omega_1\times\Omega_2,\mathcal F_1\otimes\mathcal F_2,P_1\otimes P_2),

where

(P_1\otimes P_2)(A\times B)=P_1(A)P_2(B).

Product measure extends this rectangle rule to the product σ-algebra.

Product spaces provide the canonical carrier for independent random objects. Coordinate variables $X(\omega_1,\omega_2)=\omega_1$ and $Y(\omega_1,\omega_2)=\omega_2$ are independent because their joint law factors. Dependence requires a non-product joint law.
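For finite spaces the rectangle rule defines the product measure completely, and independence of the coordinates can be verified directly. The following sketch assumes a fair coin and a fair die as the two factors. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def product_measure(P1, P2):
    """(P1 x P2)({(w1, w2)}) = P1({w1}) * P2({w2}) on a finite product."""
    return {(w1, w2): p1 * p2
            for (w1, p1), (w2, p2) in product(P1.items(), P2.items())}

def prob(P, event):
    return sum((p for w, p in P.items() if event(w)), Fraction(0))

coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
die = {k: Fraction(1, 6) for k in range(1, 7)}
joint = product_measure(coin, die)

def heads(w):          # event depending only on the first coordinate
    return w[0] == "H"

def high_roll(w):      # event depending only on the second coordinate
    return w[1] >= 5

p_both = prob(joint, lambda w: heads(w) and high_roll(w))
p_factored = prob(joint, heads) * prob(joint, high_roll)
```

The equality of `p_both` and `p_factored` is the factorization of the joint law on one rectangle. A dependent pair would require replacing `product_measure` by some other joint table with the same marginals.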

6.11 Infinite product spaces

For countably many spaces $(S_n,\mathcal S_n,\mu_n)$ , the infinite product space is

\prod_{n=1}^{\infty}S_n

with product σ-algebra generated by cylinder sets. A cylinder event depends on finitely many coordinates. The product measure is determined by

P(X_1\in A_1,\ldots,X_k\in A_k)=\prod_{i=1}^k\mu_i(A_i).

Infinite products model independent sequences. Events such as convergence, limiting frequencies, and infinitely many occurrences are not cylinder events, but they belong to the σ-algebra generated by cylinders through countable operations. Infinite product spaces are the foundation for laws of large numbers, coin-flip sequences, and many stochastic processes.

6.12 Kolmogorov extension theorem

The Kolmogorov extension theorem constructs a probability measure on an infinite product or path space from consistent finite-dimensional distributions. If for every finite index set $I$ there is a law $\mu_I$ on $S^I$ , and these laws are compatible under marginalization, then under suitable state-space hypotheses there exists a process $(X_t)$ with those finite-dimensional laws.

The theorem is the bridge from finite data to process-level probability. It lets one define stochastic processes by specifying all finite joint distributions. However, it does not automatically provide regular sample paths. Continuity, càdlàg paths, and other path properties require additional arguments such as Kolmogorov continuity criteria.

6.13 Standard Borel spaces

A standard Borel space is a measurable space isomorphic to the Borel space of a complete separable metric space. Examples include $\mathbb R^n$ , Polish spaces with their Borel σ-algebras, countable discrete spaces, and many function spaces.

Standard Borel spaces are the safe operating environment for regular conditional probabilities, disintegration, measurable selection, and extension theorems. Many pathologies disappear in this category. When probability theory states a theorem requiring “regularity of the state space,” standard Borel or Polish assumptions are often the hidden carrier condition.

6.14 Regularity of probability measures

On well-behaved topological spaces, probability measures can be approximated from outside by open sets and from inside by compact sets. A Borel probability measure $\mu$ on a Polish space is regular in this sense:

\mu(A)=\inf\{\mu(U):A\subseteq U,\ U\text{ open}\}

and

\mu(A)=\sup\{\mu(K):K\subseteq A,\ K\text{ compact}\}

for Borel $A$ .

Regularity lets measure theory interact with topology. It supports weak convergence, tightness, approximation by continuous functions, and compactness arguments. Without regularity, topological probability loses much of its analytic machinery.

Chapter 7 — Random Variables

7.1 Random variables as measurable maps

A random variable is a measurable map from a probability space into a measurable state space:

X:(\Omega,\mathcal F)\to(S,\mathcal S).

Measurability means

X^{-1}(B)\in\mathcal F

for every $B\in\mathcal S$ . This ensures that every legal target event has a probability.

The terminology “variable” can mislead. $X$ is a fixed function. Randomness comes from the random input $\omega$ . The law $X_*P$ describes the distribution of values. The carrier $\Omega$ may be changed or enlarged without changing the law of $X$ , provided the pushforward measure is preserved.

7.2 Real-valued random variables

A real-valued random variable is a measurable function $X:\Omega\to\mathbb R$ , where $\mathbb R$ has its Borel σ-algebra. It suffices to check that

\{X\le t\}\in\mathcal F

for every $t\in\mathbb R$ . The distribution function is

F_X(t)=P(X\le t).

Real-valued variables are central because they support ordering, integration, moments, quantiles, and convergence modes. Many non-real random objects are studied through real-valued probes. For a random vector $Z\in\mathbb R^d$ , linear functionals $\langle u,Z\rangle$ often determine distributional behavior.

7.3 Vector-valued random variables

A vector-valued random variable is a measurable map $X:\Omega\to\mathbb R^d$ . Its law is a probability measure on $\mathbb R^d$ . The coordinate variables $X_i$ are real-valued random variables, and $X$ is measurable iff all coordinates are measurable.

Joint distributions are naturally vector-valued laws. Covariance matrices, multivariate normal distributions, concentration inequalities, and random matrix theory all use vector-valued random variables. The key object is not merely the list of marginal laws but the joint law, which encodes dependence.

7.4 Random elements in measurable spaces

A random element generalizes random variables to arbitrary measurable spaces:

X:\Omega\to S.

Here $S$ may be a function space, graph space, space of measures, manifold, metric space, or distribution space. The law $X_*P$ is a probability measure on $S$ .

Random elements allow probability to model stochastic processes as single random objects taking values in path spaces. For example, Brownian motion can be treated as a random element of $C([0,\infty),\mathbb R)$ once continuity is established. This viewpoint shifts attention from coordinate distributions to process-level laws.

7.5 Simple random variables

A simple random variable takes finitely many values:

X=\sum_{i=1}^n a_i1_{A_i},

where $A_i\in\mathcal F$ are measurable. Its expectation is

\mathbb E[X]=\sum_{i=1}^n a_iP(A_i).

Simple variables are the building blocks of Lebesgue integration.

Every nonnegative measurable function can be approximated increasingly by simple functions. This is the construction route from finite weighted averages to general expectation. Simple variables therefore form the bridge between elementary probability and measure-theoretic probability.
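One standard choice of approximating sequence is the dyadic truncation $s_n=\min(\lfloor 2^nX\rfloor/2^n,\ n)$, which takes finitely many values and increases to $X$. The sketch below evaluates it at a single sample value to show the monotone approach. The function name and the value $22/7$ are illustrative assumptions, and other approximating sequences work equally well.

```python
import math
from fractions import Fraction

def dyadic_approx(x, n):
    """s_n(x) = min(floor(2^n * x) / 2^n, n) for x >= 0."""
    return min(Fraction(math.floor(x * 2**n), 2**n), Fraction(n))

x = Fraction(22, 7)     # a hypothetical value X(omega)
approximations = [dyadic_approx(x, n) for n in range(1, 11)]
gaps = [x - s for s in approximations]
```

Once $n$ exceeds the value being approximated, the gap is below $2^{-n}$. Before that point the cap at $n$ dominates, which is what keeps each $s_n$ bounded and finitely valued.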

7.6 Positive random variables

A positive or nonnegative random variable satisfies $X\ge0$ . Its expectation is always defined in the extended sense:

\mathbb E[X]\in[0,\infty].

It may be infinite. This avoids undefined expressions from subtracting infinities.

For nonnegative variables, Markov’s inequality holds:

P(X\ge a)\le\frac{\mathbb E[X]}{a},\qquad a>0.

This simple inequality is a fundamental bridge from expectation to tail probability. It is often the first step in concentration, moment methods, and existence proofs.
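Markov's inequality can be checked exactly on any small nonnegative law. The sketch below uses a hypothetical three-point distribution with a rare large value and compares the true tail with the bound at several thresholds.

```python
from fractions import Fraction

# Hypothetical nonnegative law: mostly small values, occasionally 10.
law = {0: Fraction(1, 2), 1: Fraction(1, 4), 10: Fraction(1, 4)}
mean = sum((x * p for x, p in law.items()), Fraction(0))

def tail(a):
    """P(X >= a) under `law`."""
    return sum((p for x, p in law.items() if x >= a), Fraction(0))

# (threshold, exact tail probability, Markov bound E[X] / a)
checks = [(a, tail(a), mean / a) for a in (1, 2, 5, 10, 20)]
markov_holds = all(exact <= bound for _, exact, bound in checks)
```

At the threshold 10 the bound is close to the exact tail, while at small thresholds it exceeds one and carries no information. This is typical: the inequality uses only the mean, so it is sharp only for laws concentrated at zero and at the threshold.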

7.7 Extended real-valued random variables

An extended real-valued random variable takes values in $[-\infty,\infty]$ . Such variables appear naturally as limits, suprema, hitting times, logarithms of zero, or infima of random sets. Measurability is defined using the Borel σ-algebra on the extended real line.

Expectation of an extended real-valued $X$ is handled by positive and negative parts:

X^+=\max(X,0),\qquad X^-=\max(-X,0), \qquad X=X^+-X^-.

The expectation is defined if $\mathbb E[X^+]$ and $\mathbb E[X^-]$ are not both infinite. Otherwise $\infty-\infty$ is undefined.

7.8 Distribution of a random variable

The distribution or law of $X$ is

\mu_X=X_*P.

For real $X$ ,

\mu_X(B)=P(X\in B).

If $X$ is discrete, $\mu_X$ is determined by masses $P(X=x)$ . If $X$ has density $f$ , then $\mu_X(B)=\int_Bf(x)\,dx$ .

The law forgets the original sample space and retains only the probabilities of events determined by $X$ . This quotient is useful but loses coupling information. Knowing $\mu_X$ and $\mu_Y$ separately does not determine the joint law $\mu_{(X,Y)}$ .

7.9 Cumulative distribution functions

For real-valued $X$ , the cumulative distribution function is

F_X(t)=P(X\le t).

It is nondecreasing, right-continuous, and satisfies

\lim_{t\to-\infty}F_X(t)=0,\qquad \lim_{t\to\infty}F_X(t)=1.

Conversely, every function with these properties is the CDF of a probability measure on $\mathbb R$ .

The CDF determines the law. Point masses appear as jumps:

P(X=t)=F_X(t)-F_X(t^-).

If $F$ is absolutely continuous, then it is differentiable almost everywhere and its derivative $f$ is a density. Differentiability almost everywhere alone is not enough, as the Cantor distribution function shows; absolute continuity is the correct condition.

7.10 Quantile functions

For a CDF $F$ , a quantile function may be defined by

Q(u)=\inf\{x:F(x)\ge u\},\qquad u\in(0,1).

If $U\sim\operatorname{Uniform}(0,1)$ , then $Q(U)$ has CDF $F$ . This is inverse-transform sampling.

Quantiles are law-level objects. They support simulation, stochastic ordering, coupling, and distributional construction. In general, $F(Q(u))$ need not equal $u$ exactly when $F$ has jumps or flat regions, but the pushforward law is still correct. The quantile construction gives a canonical coupling of many distributions using a shared uniform variable.
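For a law with atoms, the generalized inverse can be computed by scanning the cumulative masses. The sketch below uses a hypothetical three-point law and replaces random uniform draws by an evenly spaced grid of midpoints in $(0,1)$, so the result is deterministic. That grid is a stand-in for $U$ and not a sampling method.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

support = [0, 1, 3]                                        # hypothetical atoms
masses = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
cdf_values = list(accumulate(masses))                      # F at each atom

def quantile(u):
    """Q(u) = inf{x : F(x) >= u} for u in (0, 1)."""
    for x, F in zip(support, cdf_values):
        if F >= u:
            return x
    raise ValueError("u must lie in (0, 1)")

# Deterministic stand-in for Uniform(0, 1): 600 evenly spaced midpoints.
grid = [Fraction(2 * k + 1, 1200) for k in range(600)]
counts = Counter(quantile(u) for u in grid)
pushed_law = {x: Fraction(c, len(grid)) for x, c in counts.items()}

# F(Q(u)) differs from u when F jumps over u.
u = Fraction(1, 4)
F_of_Q = cdf_values[support.index(quantile(u))]
```

The pushed-forward frequencies reproduce the masses even though `F_of_Q` is not equal to `u`, which illustrates the remark that the law is correct while $F\circ Q$ is not the identity.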

7.11 Equality almost surely

Random variables $X$ and $Y$ on the same probability space are equal almost surely if

P(X=Y)=1.

They may differ on a null set. In measure-theoretic probability, many objects are defined only up to almost sure equality. For example, elements of $L^p$ are equivalence classes of random variables modulo a.s. equality.

Almost sure equality requires a common sample space. If $X$ and $Y$ live on different spaces, the expression $P(X=Y)$ is meaningless until a coupling is supplied. Equality almost surely is therefore stronger than equality in distribution and more carrier-dependent.

7.12 Equality in distribution

Random variables $X$ and $Y$ , possibly on different probability spaces, are equal in distribution if

\mu_X=\mu_Y.

For real variables, this is equivalent to

F_X(t)=F_Y(t)

for every $t$ . By right-continuity, it suffices to check equality on a dense set of $t$ , such as the common continuity points.

Equality in distribution permits comparison without a joint carrier. It is central to weak convergence and limit theorems. But it says nothing about pointwise equality, dependence, correlation, or joint relations. Exporting $X\stackrel d=Y$ as $X=Y$ is a category error unless a coupling with equality is constructed.

7.13 Joint distributions

For random variables $X:\Omega\to S$ and $Y:\Omega\to T$ , the joint distribution is the law $\mu_{(X,Y)}$ of the pair $(X,Y)$ , determined on measurable rectangles by

\mu_{(X,Y)}(A\times B)=P(X\in A,\ Y\in B).

It determines the marginals:

\mu_X(A)=\mu_{(X,Y)}(A\times T),\qquad \mu_Y(B)=\mu_{(X,Y)}(S\times B).

The marginals do not determine the joint distribution. Dependence lives in the gap between marginals and joint law. Coupling theory studies all possible joint laws with specified marginals. Independence is the special joint law $\mu_X\otimes\mu_Y$ .

7.14 Marginals

Marginals are projections of joint laws. If $\gamma$ is a probability measure on $S\times T$ , its marginals are

\gamma_S(A)=\gamma(A\times T),\qquad \gamma_T(B)=\gamma(S\times B).

For random variables, these are the laws of each coordinate.

Marginalization loses dependence information. Many different couplings share the same marginals: independent coupling, perfectly correlated coupling, antimonotone coupling, optimal transport coupling, and others. Any inference from marginal laws to joint behavior requires an additional coupling certificate.
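The smallest example of this loss uses two fair bits. The sketch below builds three joint tables with identical marginals and then compares a quantity that depends on the coupling. The table names and the `marginals` helper are illustrative assumptions.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Three couplings of two Bernoulli(1/2) marginals.
independent = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
comonotone = {(0, 0): half, (1, 1): half}      # Y = X
antimonotone = {(0, 1): half, (1, 0): half}    # Y = 1 - X

def marginals(gamma):
    """Project a joint table on S x T to its two coordinate laws."""
    first, second = {}, {}
    for (x, y), p in gamma.items():
        first[x] = first.get(x, 0) + p
        second[y] = second.get(y, 0) + p
    return first, second

def prob_equal(gamma):
    """P(X = Y), a quantity determined by the joint law only."""
    return sum((p for (x, y), p in gamma.items() if x == y), Fraction(0))

all_marginals = [marginals(g) for g in (independent, comonotone, antimonotone)]
agreement = [prob_equal(g) for g in (independent, comonotone, antimonotone)]
```

All three entries of `all_marginals` coincide, while `agreement` ranges from zero to one. No computation on the marginals alone can recover it.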

7.15 Transformations of random variables

If $Y=g(X)$ , then the law of $Y$ is the pushforward

\mu_Y=g_*\mu_X.

For discrete $X$ ,

P(Y=y)=\sum_{x:g(x)=y}P(X=x).

For continuous $X$ with density $f_X$ and smooth invertible $g$ ,

f_Y(y)=f_X(g^{-1}(y))\left|\frac{d}{dy}g^{-1}(y)\right|.

In higher dimensions, the Jacobian determinant appears. But the pushforward formula is more fundamental than density formulas. It works even when no density exists. The density formula is a special coordinate representation of measure transport.

7.16 Measurability traps

Common measurability traps include defining $X$ by a supremum over an uncountable family, projecting a measurable subset of a product space, or assuming every subset of a continuous space is measurable. Supremum over countably many measurable functions is measurable; supremum over uncountably many requires additional structure such as separability or joint measurability.

Another trap is confusing pointwise-defined modifications with measurable versions. A function equal almost everywhere to a measurable function need not be measurable unless the measure space is complete or the modification is controlled. In stochastic processes, path properties often require choosing versions with measurable or regular sample paths. Measurability is the legality gate for probability.

Chapter 8 — Expectation as Lebesgue Integration

8.1 Simple-function integration

For a nonnegative simple random variable

X=\sum_{i=1}^n a_i1_{A_i}, \qquad a_i\ge0,

define

\int X\,dP=\sum_{i=1}^n a_iP(A_i).

If the representation is refined, the value remains the same. This is the finite weighted-average formula expressed in measure language.

Simple-function integration is the primitive construction of the Lebesgue integral. General nonnegative measurable functions are approximated from below by simple functions. Thus expectation is not introduced by density or Riemann area; it is built by monotone approximation from event probabilities.

8.2 Nonnegative random variables

For $X\ge0$ , define

\mathbb E[X]=\int X\,dP = \sup\left\{\int s\,dP:0\le s\le X,\ s\text{ simple}\right\}.

The value may be $+\infty$ . This definition is stable under monotone limits and does not require cancellation.

A useful identity for nonnegative $X$ is the tail integral formula:

\mathbb E[X]=\int_0^\infty P(X>t)\,dt.

For integer-valued nonnegative $X$ ,

\mathbb E[X]=\sum_{k=1}^{\infty}P(X\ge k).

These formulas convert expectation into tail probabilities.
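The integer-valued version of the tail formula can be verified exactly on a small law. The sketch below uses a Binomial(3, 1/2) table as a hypothetical example and computes the mean both ways.

```python
from fractions import Fraction

# Hypothetical example: number of heads in three fair coin flips.
law = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def tail(k):
    """P(X >= k) under `law`."""
    return sum((p for j, p in law.items() if j >= k), Fraction(0))

mean_direct = sum((k * p for k, p in law.items()), Fraction(0))
mean_from_tails = sum((tail(k) for k in range(1, max(law) + 1)), Fraction(0))
```

The agreement reflects a rearrangement of the double sum: the mass at $k$ is counted once in each of the tails $P(X\ge1),\ldots,P(X\ge k)$, hence $k$ times in total.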

8.3 Integrable random variables

A real-valued random variable $X$ is integrable if

\mathbb E[|X|]<\infty.

Then

\mathbb E[X]=\mathbb E[X^+]-\mathbb E[X^-],

where both terms are finite. Integrability prevents the undefined expression $\infty-\infty$ .

Integrability is the gate for many operations. Linearity of expectation, conditional expectation in $L^1$ , convergence of averages, and martingale theory all require appropriate integrability. A random variable may be finite almost surely but not integrable; heavy-tailed distributions provide standard examples.

8.4 Positive and negative parts

Every real $X$ decomposes as

X=X^+-X^-, \qquad |X|=X^++X^-,

where

X^+=\max(X,0),\qquad X^-=\max(-X,0).

Both $X^+$ and $X^-$ are nonnegative measurable functions when $X$ is measurable.

This decomposition is not cosmetic. Lebesgue integration handles nonnegative functions first; signed integration is defined by subtracting two nonnegative integrals only when the subtraction is meaningful. If both $\mathbb E[X^+]$ and $\mathbb E[X^-]$ are infinite, $\mathbb E[X]$ is undefined.

8.5 Expectation as integral

Expectation is Lebesgue integration against probability:

\mathbb E[X]=\int_\Omega X(\omega)\,P(d\omega).

If $X$ has law $\mu_X$ , then

\mathbb E[g(X)]=\int_{\mathbb R}g(x)\,\mu_X(dx)

whenever the integral is defined. If $\mu_X$ has density $f$ ,

\mathbb E[g(X)]=\int_{\mathbb R}g(x)f(x)\,dx.

The law-level formula shows expectation depends only on the distribution of $X$ , not on the particular sample-space representation. But expectations of functions involving multiple variables depend on the joint law, not only the marginals.

8.6 Linearity of expectation

If $X,Y$ are integrable and $a,b\in\mathbb R$ , then

\mathbb E[aX+bY]=a\mathbb E[X]+b\mathbb E[Y].

If $X,Y\ge0$ , linearity also holds in the extended sense:

\mathbb E[X+Y]=\mathbb E[X]+\mathbb E[Y],

allowing $+\infty$ .

Linearity does not require independence. This remains one of probability’s most effective tools. Independence is needed for multiplicative identities such as

\mathbb E[XY]=\mathbb E[X]\mathbb E[Y],

not for additive identities.
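A strongly dependent pair makes the contrast visible. In the sketch below, $Y=7-X$ is a deterministic function of a fair die roll $X$, so the two are as far from independent as possible. The example and the helper `E` are assumptions chosen for illustration.

```python
from fractions import Fraction

P = {w: Fraction(1, 6) for w in range(1, 7)}   # fair die as sample space

def X(w):
    return w

def Y(w):
    return 7 - w          # fully determined by X, hence dependent on it

def E(f):
    """Expectation of f over the finite space P."""
    return sum((f(w) * p for w, p in P.items()), Fraction(0))

sum_lhs = E(lambda w: X(w) + Y(w))
sum_rhs = E(X) + E(Y)
product_lhs = E(lambda w: X(w) * Y(w))
product_rhs = E(X) * E(Y)
```

The additive identity holds exactly, while the product of means overstates $\mathbb E[XY]$. The difference is the covariance, which is negative here.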

8.7 Change of variables / pushforward formula

If $X:\Omega\to S$ has law $\mu=X_*P$ , then for measurable $g:S\to\mathbb R$ ,

\int_\Omega g(X(\omega))\,P(d\omega) = \int_S g(x)\,\mu(dx).

This is the abstract change-of-variables formula. It states that integrating a function of $X$ over the source space equals integrating that function over the distribution of $X$ .

Classical density transformations are special cases. If $Y=g(X)$ , then

\mathbb E[h(Y)] = \int h(g(x))\,\mu_X(dx) = \int h(y)\,\mu_Y(dy).

The pushforward law contains the transformed probabilities.

8.8 Expectation under distribution law

For a real random variable $X$ with distribution function $F$ , expectation can be written as a Lebesgue–Stieltjes integral:

\mathbb E[X]=\int_{\mathbb R}x\,dF(x),

when integrable. If $X$ is discrete,

\mathbb E[X]=\sum_x xP(X=x).

If $X$ has density $f$ ,

\mathbb E[X]=\int_{\mathbb R}xf(x)\,dx.

These are not different concepts of expectation; they are different representations of the same law-level integral. The correct representation depends on the measure type. For mixed or singular distributions, forcing a density or a mass function loses information.

8.9 Infinite expectations

A nonnegative random variable may have infinite expectation:

\mathbb E[X]=\infty.

For example, a Pareto-type tail with $P(X>t)\sim c/t$ has divergent expectation since

\mathbb E[X]=\int_0^\infty P(X>t)\,dt

diverges logarithmically. Infinite expectation is a legitimate value for nonnegative $X$ .

For signed variables, infinite positive and negative parts cannot be subtracted. A Cauchy random variable has no expectation in the Lebesgue sense, even though symmetric principal value calculations may give zero. Principal value is not expectation; it is a different limiting operation.

8.10 Integrability conditions

Integrability is often certified by tail bounds, domination, or moment estimates. If $|X|\le Y$ and $Y$ is integrable, then $X$ is integrable. If $P(|X|>t)\le Ct^{-\alpha}$ for large $t$ , then

\mathbb E[|X|]<\infty

when $\alpha>1$ , by the tail integral formula.

For $p>0$ ,

\mathbb E[|X|^p] = p\int_0^\infty t^{p-1}P(|X|>t)\,dt.

This formula gives moment criteria from tail decay. Moment assumptions in limit theorems are therefore tail assumptions in disguised integral form.

8.11 Expectation versus typical value

Expectation is an average, not necessarily a typical value. Heavy-tailed variables may have means dominated by rare extreme events. A variable may have expectation far from its median or mode. In skewed distributions, $\mathbb E[X]$ , median, and most likely value can be very different.

This distinction matters in risk, algorithms, and probabilistic method arguments. Expected runtime may be finite while typical runtime is much smaller, or median performance may be good while expectation is ruined by rare catastrophes. Concentration inequalities are needed when one wants typical behavior, not just average behavior.

8.12 Expectation under model extension

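If a probability space is extended by adding auxiliary randomness, old random variables lift by composition with projection. If $\pi:\Omega'\to\Omega$ is measurable and preserves probability, meaning that the probability of $\pi^{-1}(A)$ in the extended space equals $P(A)$ for every $A\in\mathcal F$ , and $X' = X\circ\pi$ , then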

\mathbb E_{\Omega'}[X']=\mathbb E_\Omega[X].

Thus expectation is invariant under probability-preserving model extension.

This invariance justifies changing sample spaces for convenience. One may add independent uniforms, construct couplings, or realize random variables on canonical spaces. The invariant object is the law and the relevant joint structure, not the raw sample-space representation.

Chapter 9 — Core Convergence Theorems

9.1 Monotone convergence theorem

If $0\le X_n\uparrow X$ pointwise, then

\mathbb E[X_n]\uparrow\mathbb E[X].

This is the monotone convergence theorem. It is one of the foundational results of Lebesgue integration and depends on countable additivity.

The theorem licenses passing limits through expectations for increasing nonnegative sequences without domination. It is used to define integrals, prove Tonelli’s theorem, derive tail integral formulas, and handle stopping times by approximation. The nonnegative monotone structure is the certificate; without monotonicity, the conclusion may fail.

9.2 Fatou’s lemma

For nonnegative random variables $X_n$ ,

\mathbb E[\liminf_{n\to\infty}X_n] \le \liminf_{n\to\infty}\mathbb E[X_n].

Fatou’s lemma gives a lower-semicontinuity principle for expectations. It is weaker than full convergence exchange but requires minimal hypotheses.

Fatou is often the correct tool when limits are available but domination is absent. It prevents mass from appearing in the limit without being accounted for, but it allows mass to escape. In probability, this “escape” corresponds to lack of uniform integrability or tightness.

9.3 Dominated convergence theorem

If $X_n\to X$ almost surely and $|X_n|\le Y$ for an integrable $Y$ , then

\mathbb E[X_n]\to\mathbb E[X].

The integrable dominating variable $Y$ prevents mass from escaping into rare large spikes.

Dominated convergence is one of the main liftback theorems from pointwise convergence to expectation convergence. The domination hypothesis is load-bearing. Pointwise convergence alone does not imply convergence of expectations. A common counterexample is $X_n=n1_{(0,1/n)}$ on $[0,1]$ : $X_n\to0$ almost everywhere, but $\mathbb E[X_n]=1$ .

9.4 Bounded convergence theorem

If $X_n\to X$ almost surely and $|X_n|\le M$ for a constant $M<\infty$ , then

\mathbb E[X_n]\to\mathbb E[X].

This is a special case of dominated convergence with $Y=M$ .

Bounded convergence is often used with probabilities because indicators are bounded. If $1_{A_n}\to1_A$ almost surely, then

P(A_n)=\mathbb E[1_{A_n}]\to\mathbb E[1_A]=P(A).

However, indicator convergence must be verified; set convergence can mean different things depending on limsup and liminf behavior.

9.5 Uniform integrability

A family $\mathcal X\subset L^1$ is uniformly integrable if

\lim_{K\to\infty}\sup_{X\in\mathcal X} \mathbb E[|X|1_{\{|X|>K\}}]=0.

Uniform integrability prevents mass from escaping to infinity uniformly over the family.

It is the correct bridge from convergence in probability to convergence in $L^1$ . If $X_n\to X$ in probability and $\{X_n\}$ is uniformly integrable, then

\mathbb E|X_n-X|\to0

and in particular $\mathbb E[X_n]\to\mathbb E[X]$ ; the limit $X$ is then automatically integrable. Without uniform integrability, convergence in probability does not imply convergence of expectations.

9.6 Vitali convergence theorem

Vitali’s theorem states that, for integrable $X_n$ , $X_n\to X$ in $L^1$ if and only if $X_n\to X$ in probability and the family $\{X_n\}$ is uniformly integrable; in either case the limit $X$ is itself integrable. More generally, for $p\ge1$ the $L^p$ -versions use uniform integrability of $|X_n|^p$ .

This theorem identifies the missing payload in many false expectation arguments. Convergence in probability controls typical deviations; uniform integrability controls rare large deviations. Both are needed for convergence of means. The theorem is therefore a precise decomposition of convergence into typical behavior plus tail control.

9.7 Interchanging limits and expectations

The formal question is when

\lim_{n\to\infty}\mathbb E[X_n]=\mathbb E[\lim_{n\to\infty}X_n].

Monotone convergence, dominated convergence, bounded convergence, and Vitali’s theorem are sufficient frameworks. Fatou gives one-sided control for nonnegative sequences.

The error pattern is to treat expectation as a finite sum after limits have entered. Infinite probability spaces allow mass to move, concentrate, vanish, or escape. Every interchange of limit and expectation requires a certificate: monotonicity, domination, boundedness, uniform integrability, or another convergence theorem.

9.8 Failure modes for limit-expectation exchange

A standard failure is rare spikes. On $[0,1]$ , define

X_n=n1_{(0,1/n)}.

Then $X_n\to0$ almost everywhere, but $\mathbb E[X_n]=1$ . The pointwise limit misses mass that escapes into narrower and taller regions.
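The spike example can be traced with exact arithmetic. This is a small sketch in which the sample point $x=1/50$ and the function names are arbitrary illustrative choices; the mean is computed from the closed form height times interval length.

```python
from fractions import Fraction

def spike(n, x):
    """X_n(x) = n on the interval (0, 1/n) and 0 elsewhere on [0, 1]."""
    return n if 0 < x < Fraction(1, n) else 0

def spike_mean(n):
    """E[X_n] under Lebesgue measure: height n times interval length 1/n."""
    return n * Fraction(1, n)

x = Fraction(1, 50)  # an arbitrary fixed sample point
# Values of X_n at the fixed point: nonzero only while 1/n > x.
print([spike(n, x) for n in (10, 49, 50, 51, 1000)])
# Means: constant in n.
print([spike_mean(n) for n in (10, 100, 1000)])
```

At any fixed point the sequence is eventually zero, yet the mean never moves: the unit of mass sits on an interval that shrinks past every fixed point.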

Another failure is lack of integrability in the limit. Variables may converge pointwise to a nonintegrable random variable while expectations fail to converge finitely. Oscillation can also break convergence if positive and negative parts are not controlled. The general counterkernel is missing tail control.

9.9 Tightness versus integrability

Tightness controls where probability mass lies. A family of probability measures $\{\mu_i\}$ on a metric space is tight if for every $\varepsilon>0$ , there exists compact $K$ such that

\sup_i\mu_i(K^c)<\varepsilon.

Uniform integrability controls the magnitude of random variables:

\lim_{M\to\infty}\sup_i\mathbb E[|X_i|1_{\{|X_i|>M\}}]=0.

Tightness is about mass not escaping spatially; uniform integrability is about weighted mass not escaping in expectation. A family can be tight without uniformly integrable first moments. For convergence of laws, tightness is central; for convergence of expectations, uniform integrability is the stronger bridge.

9.10 Expectation convergence counterexamples

Counterexamples are not peripheral; they define the gates. Let $X_n=n$ with probability $1/n$ and $0$ otherwise. Then $X_n\to0$ in probability, but $\mathbb E[X_n]=1$ . Thus convergence in probability does not imply expectation convergence.

Let $X_n=n^2$ with probability $1/n$ and $0$ otherwise. Then $X_n\to0$ in probability, but $\mathbb E[X_n]=n\to\infty$ . This shows that even vanishing probabilities of large deviations are insufficient; the magnitude of rare deviations matters. Uniform integrability is the exact missing condition.
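The role of uniform integrability can be made concrete by computing the tail term $\mathbb E[|X_n|1_{\{|X_n|>K\}}]$ for two-point variables. The sketch below compares the family $X_n=n$ with probability $1/n$ from the first counterexample against a hypothetical comparison family $Y_n=n$ with probability $1/n^2$ , which is not in the text. The supremum over $n$ is approximated by a maximum over $n\le10^4$ , so the output is only meaningful for $K$ well below that cutoff.

```python
from fractions import Fraction

def tail_term(value, prob, K):
    """E[|X| 1{|X| > K}] for a variable equal to `value` w.p. `prob`, else 0."""
    return value * prob if value > K else Fraction(0)

def sup_tail(family, K, n_max=10_000):
    """Maximum of the tail term over n = 1..n_max (a finite proxy for the sup)."""
    return max(tail_term(*family(n), K) for n in range(1, n_max + 1))

# Family from the text: X_n = n with probability 1/n.
spiky = lambda n: (n, Fraction(1, n))
# Hypothetical comparison family: Y_n = n with probability 1/n^2.
tame = lambda n: (n, Fraction(1, n * n))

for K in (1, 10, 100, 1000):
    print(K, sup_tail(spiky, K), sup_tail(tame, K))
```

For the first family the tail term stays at 1 for every threshold, so the family is not uniformly integrable; for the comparison family it equals $1/(K+1)$ and tends to zero.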

Chapter 10 — Moments and Inequalities

10.1 Moments

The $k$ -th raw moment of $X$ is

\mathbb E[X^k],

when it exists. The $k$ -th absolute moment is

\mathbb E[|X|^k].

Moments summarize distributional information. The first moment is the mean; the second raw moment contributes to variance; higher moments measure tail weight and shape.

Moment existence is not automatic. Heavy-tailed distributions may have some finite moments and some infinite moments. If $\mathbb E[|X|^q]<\infty$ , then $\mathbb E[|X|^p]<\infty$ for $0<p<q$ on probability spaces. Higher finite moments imply lower finite moments, but not conversely.

10.2 Absolute moments

Absolute moments avoid cancellation. The condition

\mathbb E[|X|^p]<\infty

defines $X\in L^p$ . For signed variables, $\mathbb E[X^p]$ may exist by cancellation in some improper sense while $\mathbb E[|X|^p]$ fails. Probability theory generally uses absolute integrability to certify legal operations.

The tail formula is

\mathbb E[|X|^p] = p\int_0^\infty t^{p-1}P(|X|>t)\,dt.

This makes clear that $p$ -th moments are tail decay conditions. They are not just algebraic averages; they control rare large values.

10.3 Variance and standard deviation

Variance is the second central moment:

\operatorname{Var}(X)=\mathbb E[(X-\mu)^2], \qquad \mu=\mathbb E[X].

The standard deviation is

\sigma=\sqrt{\operatorname{Var}(X)}.

Variance exists when $X\in L^2$ . It measures quadratic spread around the mean.

The identity

\operatorname{Var}(X)=\mathbb E[X^2]-\mu^2

is computationally useful. For sums,

\operatorname{Var}\Big(\sum_iX_i\Big) = \sum_i\operatorname{Var}(X_i) + 2\sum_{i<j}\operatorname{Cov}(X_i,X_j).

If the variables are pairwise uncorrelated, the covariance terms vanish.

10.4 Covariance and correlation

Covariance is

\operatorname{Cov}(X,Y)=\mathbb E[(X-\mathbb E X)(Y-\mathbb E Y)].

It measures linear co-fluctuation. Correlation normalizes covariance:

\rho(X,Y)= \frac{\operatorname{Cov}(X,Y)} {\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.

By Cauchy–Schwarz, $|\rho|\le1$ .

Covariance is not a general dependence measure. Variables can be dependent with zero covariance. For example, if $X$ is symmetric about zero with finite third moment and $Y=X^2$ , then $\operatorname{Cov}(X,Y)=\mathbb E[X^3]-\mathbb E[X]\mathbb E[X^2]=0$ , since symmetry forces $\mathbb E[X]=\mathbb E[X^3]=0$ , but $Y$ is determined by $X$ . Covariance detects linear dependence, not arbitrary dependence.
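A minimal exact check of this example, assuming the specific choice of $X$ uniform on $\{-1,0,1\}$ (an illustrative instance of a symmetric law; the variable names are hypothetical):

```python
from fractions import Fraction

# X uniform on {-1, 0, 1} (symmetric about zero) and Y = X^2.
third = Fraction(1, 3)
joint = {(x, x * x): third for x in (-1, 0, 1)}

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_joint = joint[(0, 0)]
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)

print(cov, p_joint, p_x0 * p_y0)
```

The covariance is exactly zero, while the joint probability of $\{X=0,Y=0\}$ differs from the product of the marginals, so the pair is uncorrelated but not independent.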

10.5 Jensen’s inequality

If $\varphi$ is convex and $X$ is integrable with $\varphi(X)$ integrable or nonnegative, then

\varphi(\mathbb E[X])\le\mathbb E[\varphi(X)].

For concave $\varphi$ , the inequality reverses. Jensen’s inequality says expectation commutes with convex functions only in an inequality direction.

Examples include

(\mathbb E[X])^2\le\mathbb E[X^2],

and for positive $X$ ,

\log\mathbb E[X]\ge\mathbb E[\log X]

because $\log$ is concave. Jensen is a convexity certificate inside probability: averaging before applying a convex function gives a value no larger than applying the convex function before averaging.

10.6 Markov’s inequality

For $X\ge0$ and $a>0$ ,

P(X\ge a)\le\frac{\mathbb E[X]}{a}.

Proof: $X\ge a1_{\{X\ge a\}}$ , so taking expectations gives $\mathbb E[X]\ge aP(X\ge a)$ .

Markov’s inequality is crude but universal. Applied to $|X|^p$ , it gives

P(|X|\ge a)\le\frac{\mathbb E[|X|^p]}{a^p}.

This converts moment bounds into tail bounds. It is often the first inequality in probabilistic estimates and the base of many concentration arguments.

10.7 Chebyshev’s inequality

If $X$ has finite variance, then

P(|X-\mathbb E X|\ge t) \le \frac{\operatorname{Var}(X)}{t^2}.

This is Markov’s inequality applied to $(X-\mathbb E X)^2$ .

Chebyshev gives a general concentration bound using only variance. For averages of independent variables with common mean $\mu$ and common variance $\sigma^2$ ,

\operatorname{Var}\left(\frac1n\sum_{i=1}^nX_i\right)=\frac{\sigma^2}{n},

P\left(\left|\frac1n\sum_iX_i-\mu\right|\ge\varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2}.

This proves a weak law of large numbers under finite variance.
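To see how loose or tight the Chebyshev bound is, one can compare it with an exact tail. The sketch below assumes fair coin flips, so $\mu=1/2$ and $\sigma^2=1/4$ ; the choice of $\varepsilon=1/10$ and the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def exact_tail(n, eps):
    """P(|S_n/n - 1/2| >= eps) for S_n ~ Binomial(n, 1/2), computed exactly."""
    half = Fraction(1, 2)
    hits = sum(comb(n, k) for k in range(n + 1) if abs(Fraction(k, n) - half) >= eps)
    return Fraction(hits, 2 ** n)

def chebyshev_bound(n, eps):
    """sigma^2 / (n eps^2) with sigma^2 = 1/4 for a fair coin."""
    return Fraction(1, 4) / (n * eps ** 2)

eps = Fraction(1, 10)
for n in (10, 100, 1000):
    print(n, float(exact_tail(n, eps)), float(chebyshev_bound(n, eps)))
```

The bound always dominates the exact probability and decays like $1/n$ , which is enough for the weak law even though the true tail decays much faster.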

10.8 Hölder’s inequality

If $p,q>1$ with $1/p+1/q=1$ , then

\mathbb E[|XY|]\le \left(\mathbb E[|X|^p]\right)^{1/p} \left(\mathbb E[|Y|^q]\right)^{1/q}.

More generally, products of several variables are bounded by corresponding $L^p$ norms whose reciprocal exponents sum to one.

Hölder is the core multiplicative inequality of $L^p$ spaces. It proves duality bounds, integrability of products, and moment interpolation. Cauchy–Schwarz is the case $p=q=2$ :

|\mathbb E[XY]|\le\sqrt{\mathbb E[X^2]\mathbb E[Y^2]}.

10.9 Minkowski’s inequality

For $p\ge1$ ,

\|X+Y\|_p\le\|X\|_p+\|Y\|_p,

where

\|X\|_p=(\mathbb E[|X|^p])^{1/p}.

This is the triangle inequality in $L^p$ .

Minkowski turns $L^p$ spaces into normed spaces. It allows probability to use functional analysis: completeness, projections, duality, and compactness methods. For $0<p<1$ , $\|\cdot\|_p$ is not a norm and Minkowski fails; this changes the geometry of the space.

10.10 Lyapunov inequality

On a probability space, if $0<p<q$ , then

(\mathbb E[|X|^p])^{1/p} \le (\mathbb E[|X|^q])^{1/q}.

Thus $L^q\subseteq L^p$ . Higher moments control lower moments.

Lyapunov’s inequality is a monotonicity principle for moment norms. It follows from Jensen or Hölder. It is frequently used to downgrade assumptions: if a theorem gives a fourth-moment bound, then second and first moments are automatically finite and bounded.

10.11 Paley–Zygmund inequality

For $X\ge0$ with finite second moment and $0<\theta<1$ ,

P(X\ge \theta\mathbb E[X]) \ge (1-\theta)^2\frac{(\mathbb E[X])^2}{\mathbb E[X^2]}.

This gives a lower bound on the probability that $X$ is not too small.

Paley–Zygmund is a second-moment existence tool. If $(\mathbb E[X])^2$ is comparable to $\mathbb E[X^2]$ , then $X$ is at least a fixed fraction of its mean with probability bounded away from zero. It is central in probabilistic combinatorics, random graphs, and branching processes, where the first moment alone may not prove existence with positive probability.
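The inequality can be checked exactly on a small example. The two-point law below is hypothetical and chosen so that the mean is carried entirely by a rare large value; the function name is likewise an illustrative choice.

```python
from fractions import Fraction

def paley_zygmund(pmf, theta):
    """Return (P(X >= theta E[X]), lower bound) for a nonnegative finite pmf."""
    m1 = sum(p * x for x, p in pmf.items())
    m2 = sum(p * x * x for x, p in pmf.items())
    lhs = sum(p for x, p in pmf.items() if x >= theta * m1)
    rhs = (1 - theta) ** 2 * m1 ** 2 / m2
    return lhs, rhs

# Hypothetical law: X = 0 w.p. 9/10 and X = 10 w.p. 1/10, so E[X] = 1, E[X^2] = 10.
pmf = {0: Fraction(9, 10), 10: Fraction(1, 10)}
for theta in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    print(theta, *paley_zygmund(pmf, theta))
```

Here $(\mathbb E[X])^2/\mathbb E[X^2]=1/10$ , which matches the probability that $X$ is nonzero, so the second-moment ratio correctly detects how concentrated the mass of the mean is.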

10.12 Moment generating functions

The moment generating function of $X$ is

M_X(t)=\mathbb E[e^{tX}],

for those $t$ at which the expectation is finite. When $M_X$ is finite in a neighborhood of zero, derivatives at zero give moments:

M_X^{(k)}(0)=\mathbb E[X^k].

MGFs transform sums of independent variables into products:

M_{X+Y}(t)=M_X(t)M_Y(t)

when $X,Y$ are independent.

MGFs support Chernoff bounds:

P(X\ge a)\le e^{-ta}M_X(t),\qquad t>0.

Optimizing over $t$ gives exponential tail estimates. The existence domain of $M_X$ is load-bearing; heavy-tailed variables may have infinite MGF for all $t>0$ .
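As an illustration of the optimization step, consider a standard normal $X$ , for which $M_X(t)=e^{t^2/2}$ . The bound $e^{-ta+t^2/2}$ is minimized at $t=a$ , giving $e^{-a^2/2}$ . The sketch below recovers this by a grid search and compares it with the exact tail; the Gaussian example, the grid, and the function names are illustrative assumptions.

```python
import math

def chernoff_bound(a, mgf, ts):
    """Minimum over the grid ts of exp(-t a) * M(t); each term bounds P(X >= a)."""
    return min(math.exp(-t * a) * mgf(t) for t in ts)

# Standard normal MGF: M(t) = exp(t^2 / 2).
normal_mgf = lambda t: math.exp(t * t / 2.0)
grid = [k / 100.0 for k in range(1, 1001)]  # t in (0, 10]

for a in (1.0, 2.0, 3.0):
    exact = 0.5 * math.erfc(a / math.sqrt(2.0))  # P(X >= a) for N(0, 1)
    print(a, exact, chernoff_bound(a, normal_mgf, grid), math.exp(-a * a / 2.0))
```

The grid minimum agrees with the closed form $e^{-a^2/2}$ and sits above the exact tail, as an upper bound must.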

10.13 Characteristic functions

The characteristic function of $X$ is

\phi_X(t)=\mathbb E[e^{itX}].

It always exists because $|e^{itX}|=1$ . Characteristic functions determine laws and convert convolution into multiplication:

\phi_{X+Y}(t)=\phi_X(t)\phi_Y(t)

for independent $X,Y$ .

Characteristic functions are central to the central limit theorem. If

\phi_{X_n}(t)\to\phi(t)

pointwise and $\phi$ is continuous at zero, then $\phi$ is the characteristic function of a probability law and $X_n$ converges in distribution to that law. This is Lévy’s continuity theorem.
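A simple instance of this convergence uses i.i.d. fair signs $\varepsilon_i=\pm1$ . A single sign has characteristic function $\cos t$ , so the normalized sum $(\varepsilon_1+\cdots+\varepsilon_n)/\sqrt n$ has characteristic function $\cos(t/\sqrt n)^n$ , which should approach the standard normal characteristic function $e^{-t^2/2}$ . The following sketch evaluates both; the choice of example and the function names are illustrative.

```python
import math

def phi_normalized_sum(t, n):
    """Characteristic function of (e_1 + ... + e_n) / sqrt(n) for i.i.d. signs e_i.

    A single fair sign has characteristic function cos(t); independence turns
    the sum into a product, and scaling by 1/sqrt(n) rescales the argument.
    """
    return math.cos(t / math.sqrt(n)) ** n

def phi_gauss(t):
    """Characteristic function of the standard normal law."""
    return math.exp(-t * t / 2.0)

for n in (1, 10, 100, 10_000):
    print(n, [round(phi_normalized_sum(t, n), 6) for t in (0.5, 1.0, 2.0)])
print("limit", [round(phi_gauss(t), 6) for t in (0.5, 1.0, 2.0)])
```

Pointwise convergence at each fixed $t$ is all that the continuity theorem requires, since the Gaussian limit is continuous at zero.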

10.14 Cumulants

The cumulant generating function is

K_X(t)=\log M_X(t),

when the MGF exists near zero. The $n$ -th cumulant is

\kappa_n=K_X^{(n)}(0).

The first cumulant is the mean; the second is variance; higher cumulants encode skewness, kurtosis, and non-Gaussian structure.

Cumulants add under independence:

K_{X+Y}(t)=K_X(t)+K_Y(t),

\kappa_n(X+Y)=\kappa_n(X)+\kappa_n(Y).

Gaussian variables have cumulants of order $n\ge3$ equal to zero. This makes cumulants useful in normal approximation, Edgeworth expansions, statistical mechanics, and random matrix theory.
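Additivity can be verified exactly for the first three cumulants, using the standard expressions in terms of raw moments $m_k=\mathbb E[X^k]$ : $\kappa_1=m_1$ , $\kappa_2=m_2-m_1^2$ , $\kappa_3=m_3-3m_2m_1+2m_1^3$ . The sketch below convolves two hypothetical finite laws and compares cumulants; all names and example values are illustrative.

```python
from fractions import Fraction
from itertools import product

def raw_moment(pmf, k):
    """E[X^k] for a finite pmf {value: probability}."""
    return sum(p * x ** k for x, p in pmf.items())

def cumulants3(pmf):
    """First three cumulants from raw moments m1, m2, m3."""
    m1, m2, m3 = (raw_moment(pmf, k) for k in (1, 2, 3))
    return (m1, m2 - m1 ** 2, m3 - 3 * m2 * m1 + 2 * m1 ** 3)

def convolve(pmf_x, pmf_y):
    """Law of X + Y for independent X and Y."""
    out = {}
    for (x, p), (y, q) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, Fraction(0)) + p * q
    return out

# Hypothetical example laws.
pmf_x = {0: Fraction(2, 3), 1: Fraction(1, 3)}
pmf_y = {-1: Fraction(1, 2), 0: Fraction(1, 4), 4: Fraction(1, 4)}

kx, ky, ks = cumulants3(pmf_x), cumulants3(pmf_y), cumulants3(convolve(pmf_x, pmf_y))
print(ks, tuple(a + b for a, b in zip(kx, ky)))
```

The cumulants of the convolution match the sums of the individual cumulants term by term, whereas raw moments of order two and higher do not add in this way.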

10.15 Moment determinacy and indeterminacy

A distribution is moment-determinate if its sequence of moments uniquely determines the law. Compactly supported distributions are moment-determinate. A sufficient condition on $\mathbb R$ is Carleman’s condition:

\sum_{n=1}^{\infty}m_{2n}^{-1/(2n)}=\infty,

where $m_{2n}=\mathbb E[X^{2n}]$ .

Moment indeterminacy means two different distributions can share all moments. The lognormal distribution is a standard example of a law not determined by its moments. Therefore “all moments match” is not always a distribution certificate unless determinacy is proved. Characteristic functions avoid this issue because they always determine the law.