Probability Theory — Chapters 1–10

 

Chapter 1 — The Problem of Uncertainty

1.1 Deterministic statements versus probabilistic statements

A deterministic statement has a truth value fixed by the state of the system and the rules of the model: “the derivative of $x^2$ is $2x$,” “this algorithm halts on this input,” or “the particle is at coordinate $x$” once the model’s state is fully specified. A probabilistic statement does not assert a single realized truth in advance; it assigns weights to possible truth-values or outcomes. The statement “the coin lands heads with probability $1/2$” is not equivalent to “the coin will land heads,” nor is it a weaker deterministic statement. It is a statement about a measure over alternatives.

The primitive move in probability is therefore not prediction but carrier construction. One must specify what can happen, which propositions about what can happen are legally measurable, and how much probability mass they carry. The deterministic claim has the form $P$ or $\neg P$. The probabilistic claim has the form

$$\mathbb P(E)=p,$$

where $E$ is an event in some event space. Without the event space, the expression is syntactically suggestive but mathematically incomplete.

1.2 Events, outcomes, experiments, observations

An outcome is a primitive realized possibility in a model. In a die roll, an outcome may be $1,2,\ldots,6$. An event is a set of outcomes satisfying some property, such as “the die is even,” represented by $\{2,4,6\}$. An experiment is the procedure or formal random mechanism generating outcomes. An observation is the information actually registered; it may be coarser than the outcome. For example, if a die is rolled but only parity is observed, the observable events are $\{\text{even}\}$, $\{\text{odd}\}$, not necessarily the full six singleton outcomes.

This distinction matters because probability attaches to events, not directly to linguistic descriptions. Two descriptions may denote the same event, and the same informal description may denote different events under different sample-space models. A formal probability model therefore separates $\Omega$, the outcome space, from $\mathcal F$, the admissible event family, and from $P$, the probability law. The triple $(\Omega,\mathcal F,P)$ is the carrier; events are legal only when they belong to $\mathcal F$.

1.3 Probability as weight, frequency, belief, symmetry, and measure

Probability has several interpretations. As weight, it is a normalized mass assigned to events. As frequency, it describes the limiting proportion of occurrence in repeated trials. As belief, it quantifies rational degrees of uncertainty. As symmetry, it assigns equal mass to indistinguishable alternatives. As measure, it becomes a mathematical function $P:\mathcal F\to[0,1]$ satisfying normalization and countable additivity:

$$P(\Omega)=1,\qquad P\Big(\bigcup_{n=1}^{\infty}E_n\Big)=\sum_{n=1}^{\infty}P(E_n)$$

for pairwise disjoint $E_n\in\mathcal F$.

The measure interpretation is the formal engine, not necessarily the philosophical interpretation. A Bayesian may use $P$ to encode belief; a frequentist may use $P$ to model long-run sampling behavior; a physicist may use $P$ to describe ensemble uncertainty. All still need a calculus that supports events, random variables, products, conditioning, expectation, and limits. Measure theory supplies that common calculus.

1.4 Why finite probability is insufficient

Finite probability handles dice, cards, urns, and finite games well. If $\Omega=\{\omega_1,\ldots,\omega_n\}$, every event is a subset of $\Omega$, and a probability law is determined by masses $p_i=P(\{\omega_i\})$ with $p_i\ge0$ and $\sum_i p_i=1$. Then

$$P(E)=\sum_{\omega_i\in E}p_i.$$

This is clean, but it cannot model continuous random variables, infinite sequences, stochastic processes, Brownian motion, or limit theorems without extension.

The fatal issue is continuous probability. A uniform random point $X\in[0,1]$ should satisfy $P(X=x)=0$ for each individual $x$, but $P(X\in[0,1])=1$. This probability cannot be recovered by summing point masses. The model must assign mass to sets such as intervals and then extend to a suitable event family. Countable additivity, measurable sets, and integration become unavoidable once probability must survive infinite limiting operations.

1.5 The need for a carrier: sample space, event space, probability law

A probability claim requires three layers. The sample space $\Omega$ lists possible outcomes. The event space $\mathcal F$ specifies which subsets of $\Omega$ are measurable events. The probability law $P$ assigns mass to those events. The formal object is

$$(\Omega,\mathcal F,P).$$

Leaving out $\mathcal F$ is harmless only in finite or countable models where $\mathcal F=2^\Omega$ is usually acceptable. In continuous models, the full power set may include nonmeasurable subsets, so $\mathcal F$ must be restricted.

The carrier also determines which random variables exist. A random variable is not just any function $X:\Omega\to S$; it must be measurable, meaning inverse images of measurable target events are events in $\mathcal F$. For real-valued $X$, this means

$$\{\omega:X(\omega)\le t\}\in\mathcal F$$

for every $t\in\mathbb R$. This condition ensures that statements about $X$ have probabilities.

1.6 Common category errors: outcome ≠ event, random variable ≠ value, law ≠ sample

An outcome is a point $\omega\in\Omega$. An event is a measurable subset $E\subseteq\Omega$. Confusing the two causes errors such as assigning probability to a realized value instead of to the event that the value occurs. A random variable $X$ is a function on $\Omega$, not the number $X(\omega)$ obtained after realization. Its value is random before $\omega$ is fixed and deterministic after $\omega$ is fixed.

The law of $X$ is also not the same thing as $X$. The law is the pushforward measure

$$\mu_X(B)=P(X\in B)=P(X^{-1}(B)).$$

Two random variables can have the same law while living on different spaces or while being dependent in different ways on other variables. Equality in distribution,

$$X\stackrel d=Y,$$

does not imply $X=Y$ pointwise or almost surely. That implication requires a common carrier and an equality statement inside that carrier.


Chapter 2 — Finite Probability Spaces

2.1 Sample spaces

In finite probability, the sample space is a finite set

$$\Omega=\{\omega_1,\ldots,\omega_n\}.$$

Each $\omega_i$ represents a possible outcome. For one die, $\Omega=\{1,2,3,4,5,6\}$. For two ordered dice, $\Omega=\{1,\ldots,6\}^2$. The choice of $\Omega$ encodes what distinctions the model regards as real. Ordered dice and unordered dice are different sample spaces; using the wrong one changes the probabilities unless the law is adjusted.

Finite sample spaces are useful because every subset can be treated as measurable:

$$\mathcal F=2^\Omega.$$

This removes measurability issues and lets probability be introduced as weighted counting. But the simplicity hides the later distinction between outcome space and event space, which becomes load-bearing in infinite models.

2.2 Events as subsets

An event is a subset $E\subseteq\Omega$. If $\Omega=\{1,2,3,4,5,6\}$, then “even” is $E=\{2,4,6\}$, “greater than four” is $F=\{5,6\}$, and “even and greater than four” is $E\cap F=\{6\}$. Logical operations translate into set operations:

$$\text{not }E=E^c,\qquad E\text{ or }F=E\cup F,\qquad E\text{ and }F=E\cap F.$$

This translation is the first formalization step of probability.

In finite spaces, the event algebra is Boolean. It is closed under complements, finite unions, and finite intersections. Since the space is finite, countable operations reduce to finite operations after repetitions are removed. Thus finite probability avoids the countability boundary that later forces the use of σ-algebras.

2.3 Probability mass functions

A probability mass function assigns a number $p(\omega)\ge0$ to each outcome such that

$$\sum_{\omega\in\Omega}p(\omega)=1.$$

The probability of an event is then

$$P(E)=\sum_{\omega\in E}p(\omega).$$

This is the finite version of integration: events are measured by summing masses over their elements.

The mass function determines the law completely. If all masses are equal, $p(\omega)=1/|\Omega|$, the model is uniform. If the masses differ, the model is weighted. For a biased coin with $\Omega=\{H,T\}$, one may set $p(H)=q$, $p(T)=1-q$. The formalism is the same; only the mass function changes.

2.4 Uniform probability and counting

Uniform probability is the special case where all outcomes are equally likely. Then

$$P(E)=\frac{|E|}{|\Omega|}.$$

This is the bridge between counting and probability. For two fair dice, $|\Omega|=36$, and the event “sum equals seven” has six outcomes:

$$(1,6),(2,5),(3,4),(4,3),(5,2),(6,1),$$

so its probability is $6/36=1/6$.
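The two-dice count can be checked by direct enumeration. The following is a minimal sketch, not part of the original text; the helper name `uniform_prob` is an illustrative assumption, and exact rational arithmetic is used so the result is not blurred by floating point.

```python
from fractions import Fraction
from itertools import product

# Sample space for two ordered fair dice: all 36 ordered pairs.
omega = list(product(range(1, 7), repeat=2))

def uniform_prob(event, space):
    """Hypothetical helper: P(E) = |E| / |Omega| under the uniform law."""
    return Fraction(len(event), len(space))

# The event "sum equals seven" as an explicit list of outcomes.
sum_is_seven = [w for w in omega if sum(w) == 7]
p_seven = uniform_prob(sum_is_seven, omega)

print(len(omega), len(sum_is_seven), p_seven)
```

The same enumeration over unordered pairs would give a 21-point sample space on which the uniform law is the wrong model, which is the ordered-versus-unordered caution from Section 2.1.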

Uniform models are powerful but dangerous. Equal likelihood must be justified by the modeling carrier, not by aesthetic symmetry alone. Two ways of parametrizing outcomes may produce different apparent uniform distributions. For example, choosing a random chord by choosing endpoints uniformly is not the same as choosing a midpoint uniformly. Uniformity is always uniformity with respect to a declared carrier.

2.5 Complements, unions, intersections

The complement rule is

$$P(E^c)=1-P(E).$$

For two events, the union rule is

$$P(E\cup F)=P(E)+P(F)-P(E\cap F).$$

The subtraction prevents double-counting the overlap. If $E\cap F=\varnothing$, the events are disjoint and the formula reduces to additivity:

$$P(E\cup F)=P(E)+P(F).$$

Intersections encode simultaneous occurrence. If $E$ is “even” and $F$ is “greater than four,” then $E\cap F$ is “even and greater than four.” Probability theory often reduces verbal problems to algebra over complements, unions, and intersections. The event grammar is not optional; it is the syntax of probabilistic reasoning.

2.6 Inclusion–exclusion

For three events,

$$P(A\cup B\cup C) = P(A)+P(B)+P(C) -P(A\cap B)-P(A\cap C)-P(B\cap C) +P(A\cap B\cap C).$$

The general inclusion–exclusion formula is

$$P\Big(\bigcup_{i=1}^n A_i\Big) = \sum_iP(A_i) -\sum_{i<j}P(A_i\cap A_j) +\sum_{i<j<k}P(A_i\cap A_j\cap A_k) -\cdots +(-1)^{n+1}P(A_1\cap\cdots\cap A_n).$$

Inclusion–exclusion is exact but can be computationally expensive. Its truncated versions give bounds, such as the union bound

$$P\Big(\bigcup_i A_i\Big)\le\sum_iP(A_i).$$

This inequality is often more important than the exact formula because it scales to large systems where full overlap data are unavailable.
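A small numerical check may help fix the formula. The sketch below assumes a hypothetical example that is not in the original text: a uniform draw from $\{1,\ldots,30\}$ with the events "divisible by 2," "divisible by 3," and "divisible by 5." It compares the exact union probability, the inclusion–exclusion sum, and the union bound.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical example: uniform law on {1, ..., 30}.
omega = set(range(1, 31))
events = [{w for w in omega if w % d == 0} for d in (2, 3, 5)]

def prob(event):
    """Uniform probability of a subset of omega."""
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events):
    """Alternating sum over all nonempty subfamilies of the events."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)
        for group in combinations(events, r):
            total += sign * prob(set.intersection(*group))
    return total

exact = prob(set.union(*events))            # direct count of the union
via_ie = inclusion_exclusion(events)        # same number, via the formula
union_bound = sum(prob(e) for e in events)  # first-order truncation

print(exact, via_ie, union_bound)
```

Here the union bound exceeds one, which shows that it is only informative when the individual probabilities are small; the loop over subfamilies also makes the exponential cost of the exact formula visible.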

2.7 Conditional probability

Conditional probability restricts attention to a known event $B$ with $P(B)>0$. It is defined by

$$P(A\mid B)=\frac{P(A\cap B)}{P(B)}.$$

This is not a new primitive in finite probability; it is a renormalized probability law on $B$. The conditional law satisfies $P(B\mid B)=1$ and assigns zero mass outside $B$.

Conditioning changes the carrier. The relevant sample space becomes $B$, and probabilities are rescaled. If a die roll is known to be even, the probability that it is greater than four is

$$P(\{5,6\}\mid \{2,4,6\})=\frac{P(\{6\})}{P(\{2,4,6\})}=\frac{1/6}{3/6}=\frac13.$$

The calculation is simple; the important point is that conditioning is event-space restriction plus normalization.
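The restriction-plus-normalization reading can be written out in a few lines. This is a minimal sketch under the assumption of a fair die; the function names `prob` and `cond_prob` are illustrative, not taken from the text.

```python
from fractions import Fraction

# Fair die: explicit mass function on the six outcomes (an assumption).
omega = {1, 2, 3, 4, 5, 6}
pmf = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(E) as a sum of point masses."""
    return sum((pmf[w] for w in event), Fraction(0))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    pb = prob(b)
    if pb == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / pb

even = {2, 4, 6}
greater_than_four = {5, 6}

p = cond_prob(greater_than_four, even)
print(p)
```

The explicit error for a zero-probability conditioning event mirrors the requirement $P(B)>0$ in the definition.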

2.8 Bayes’ theorem

Bayes’ theorem follows directly from the symmetry of intersection:

$$P(A\mid B)P(B)=P(A\cap B)=P(B\mid A)P(A).$$

Thus

$$P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}.$$

If $\{A_i\}$ is a partition of $\Omega$, then

$$P(A_i\mid B)= \frac{P(B\mid A_i)P(A_i)} {\sum_jP(B\mid A_j)P(A_j)}.$$

The theorem transports probability from causal or generative direction $P(B\mid A)$ into diagnostic direction $P(A\mid B)$. The numerator combines likelihood and prior; the denominator normalizes over all alternatives. The main error is to confuse $P(A\mid B)$ with $P(B\mid A)$. Bayes’ theorem states exactly how they differ.
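The partition form can be illustrated with a two-cell partition. The numbers below are invented for illustration only (a 1% prior, a 95% hit rate, a 5% false-positive rate) and do not come from the text; the function name `posterior` is likewise an assumption.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Bayes' theorem over a finite partition.

    prior[a]      = P(A_a)
    likelihood[a] = P(B | A_a)
    Returns the dictionary a -> P(A_a | B).
    """
    joint = {a: likelihood[a] * prior[a] for a in prior}
    evidence = sum(joint.values(), Fraction(0))  # P(B), the normalizer
    return {a: joint[a] / evidence for a in joint}

# Hypothetical diagnostic-test numbers, chosen only for illustration.
prior = {"disease": Fraction(1, 100), "healthy": Fraction(99, 100)}
likelihood = {"disease": Fraction(95, 100), "healthy": Fraction(5, 100)}

post = posterior(prior, likelihood)
print(post["disease"], float(post["disease"]))
```

With these assumed inputs, $P(B\mid A)=0.95$ while $P(A\mid B)$ is about $0.16$, which makes the gap between the two conditional directions concrete.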

2.9 Independence of events

Events $A$ and $B$ are independent if

$$P(A\cap B)=P(A)P(B).$$

If $P(B)>0$, this is equivalent to

$$P(A\mid B)=P(A).$$

Independence means learning $B$ does not change the probability of $A$. It does not mean the events are disjoint. In fact, disjoint positive-probability events are negatively dependent, since $P(A\cap B)=0$ while $P(A)P(B)>0$.

Independence is a structural certificate, not a feeling of unrelatedness. It must be derived from the model. In product spaces, events depending on different coordinates are independent when the probability law factors. Without such factorization, independence is an assertion requiring proof.

2.10 Pairwise versus mutual independence

A family $A_1,\ldots,A_n$ is pairwise independent if every pair $i\ne j$ satisfies

$$P(A_i\cap A_j)=P(A_i)P(A_j).$$

It is mutually independent if for every subfamily $I\subseteq\{1,\ldots,n\}$,

$$P\Big(\bigcap_{i\in I}A_i\Big)=\prod_{i\in I}P(A_i).$$

Mutual independence is stronger.

The distinction is not technical decoration. Three events can be pairwise independent but not mutually independent. For two fair coin tosses, let $A$ be “first toss heads,” $B$ be “second toss heads,” and $C$ be “the tosses agree.” Each pair is independent, but $A\cap B\cap C=A\cap B$, so $P(A\cap B\cap C)=1/4$ while $P(A)P(B)P(C)=1/8$: the triple intersection does not factor as the product of the three probabilities. Pairwise checks do not certify full product structure.
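The coin-toss counterexample is small enough to verify exhaustively. The sketch below encodes events as predicates on outcomes; the helper names `prob` and `both` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Two fair coin tosses: four equally likely ordered outcomes.
omega = list(product("HT", repeat=2))

def prob(event):
    """Uniform probability of the set of outcomes satisfying a predicate."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"    # first toss heads
B = lambda w: w[1] == "H"    # second toss heads
C = lambda w: w[0] == w[1]   # the tosses agree

def both(e, f):
    """Intersection of two events given as predicates."""
    return lambda w: e(w) and f(w)

pairwise_ok = all(
    prob(both(e, f)) == prob(e) * prob(f)
    for e, f in [(A, B), (A, C), (B, C)]
)
triple = prob(lambda w: A(w) and B(w) and C(w))
product_of_three = prob(A) * prob(B) * prob(C)

print(pairwise_ok, triple, product_of_three)
```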

2.11 Finite random variables

A finite random variable is a function $X:\Omega\to S$, often with $S\subseteq\mathbb R$. Its distribution is

$$P_X(s)=P(X=s)=\sum_{\omega:X(\omega)=s}p(\omega).$$

The variable compresses outcomes into values. Multiple outcomes may produce the same value, so the law of $X$ is generally coarser than the law on $\Omega$.

A random variable is not random because the function changes; $X$ is fixed. Randomness enters through the random choice of $\omega$. Once $\omega$ is realized, $X(\omega)$ is deterministic. This distinction is essential when defining functions of random variables: if $Y=g(X)$, then $Y(\omega)=g(X(\omega))$, and its law is obtained by pushing $P_X$ through $g$.

2.12 Expectation as weighted average

For a real-valued finite random variable $X$, expectation is

$$\mathbb E[X]=\sum_{\omega\in\Omega}X(\omega)p(\omega).$$

Equivalently, summing over values,

$$\mathbb E[X]=\sum_x x\,P(X=x).$$

Expectation is the center of mass of the distribution, not necessarily a typical or possible value. A fair die has expectation $3.5$, though no roll equals $3.5$.

Expectation is linear:

$$\mathbb E[aX+bY]=a\mathbb E[X]+b\mathbb E[Y],$$

whether or not $X$ and $Y$ are independent. This is one of the most powerful facts in elementary probability because it lets global averages be computed without full distributional knowledge.

2.13 Variance, covariance, correlation

Variance measures squared deviation from the mean:

$$\operatorname{Var}(X)=\mathbb E[(X-\mathbb E X)^2].$$

The computational identity is

$$\operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E X)^2.$$

Variance is nonnegative and equals zero exactly when $X$ is almost surely constant.

Covariance is

$$\operatorname{Cov}(X,Y)=\mathbb E[(X-\mathbb E X)(Y-\mathbb E Y)] =\mathbb E[XY]-\mathbb E[X]\mathbb E[Y].$$

Correlation normalizes covariance:

$$\rho(X,Y)=\frac{\operatorname{Cov}(X,Y)} {\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.$$

Independence implies zero covariance when expectations exist, but zero covariance does not imply independence except in special families such as jointly Gaussian variables.
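A standard counterexample shows zero covariance without independence. The sketch below assumes $X$ uniform on $\{-1,0,1\}$ and $Y=X^2$; this example is not in the original text and is offered only as an illustration.

```python
from fractions import Fraction

# Assumed example: X uniform on {-1, 0, 1}, Y = X^2 (a function of X).
pmf = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

def expect(f):
    """E[f(X)] as a weighted sum over the values of X."""
    return sum((f(x) * p for x, p in pmf.items()), Fraction(0))

ex = expect(lambda x: x)        # E[X]
ey = expect(lambda x: x * x)    # E[Y] = E[X^2]
exy = expect(lambda x: x ** 3)  # E[XY] = E[X^3]
cov = exy - ex * ey

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_joint = pmf[0]   # {X=0} and {Y=0} are the same event
p_x0 = pmf[0]
p_y0 = pmf[0]
independent = (p_joint == p_x0 * p_y0)

print(cov, independent)
```

Covariance only detects linear association, so a symmetric nonlinear dependence such as $Y=X^2$ is invisible to it.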

2.14 Indicator variables

The indicator of an event $A$ is

$$1_A(\omega)= \begin{cases} 1,&\omega\in A,\\ 0,&\omega\notin A. \end{cases}$$

Its expectation is

$$\mathbb E[1_A]=P(A).$$

This turns probability questions into expectation questions and counting questions into sums of indicators.

If $N$ counts how many events $A_i$ occur, then

$$N=\sum_i1_{A_i}, \qquad \mathbb E[N]=\sum_iP(A_i).$$

No independence is required. This method is the backbone of probabilistic combinatorics, occupancy estimates, random graph thresholds, and existence arguments.

2.15 Linearity of expectation

Linearity states that for finite random variables,

$$\mathbb E\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbb E[X_i].$$

It does not require independence. If $X_i$ are indicators, the formula says that the expected count of successful events equals the sum of their individual probabilities.

The power of linearity is that it avoids dependence. To compute the expected number of fixed points in a uniformly random permutation $\pi\in S_n$, define $I_i=1_{\{\pi(i)=i\}}$. Then $P(I_i=1)=1/n$, so

$$\mathbb E\sum_{i=1}^n I_i=\sum_{i=1}^n\frac1n=1.$$

The indicators are dependent, but expectation ignores that dependence for first-moment purposes.
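The fixed-point computation can be confirmed by brute force for small $n$. This is a minimal sketch that averages the fixed-point count over all of $S_n$; the function name `expected_fixed_points` is an illustrative assumption, and the enumeration is only feasible for small $n$.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of fixed points over all n! permutations of range(n)."""
    perms = list(permutations(range(n)))
    total = sum(
        sum(1 for i, image in enumerate(p) if i == image)
        for p in perms
    )
    return Fraction(total, len(perms))

# Exhaustive check for n = 1, ..., 6 (at most 720 permutations).
results = {n: expected_fixed_points(n) for n in range(1, 7)}
print(results)
```

The average is the same for every $n$ checked, even though the distribution of the count itself changes with $n$.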

2.16 First probabilistic existence arguments

The first moment method uses expectation to prove existence. If $X\ge0$ and $\mathbb E[X]<a$, then there exists an outcome with $X<a$; otherwise $X\ge a$ everywhere would force $\mathbb E[X]\ge a$. Similarly, if $\mathbb E[X]>0$ for a nonnegative integer-valued $X$, then there is an outcome with $X>0$.

This converts random construction into deterministic existence. One defines a random object, counts desired or bad features by a random variable, computes its expectation, and concludes that some object has at least or at most the expected performance. The method gives existence without necessarily giving an efficient construction.


Chapter 3 — Counting and Discrete Models

3.1 Permutations and combinations

A permutation is an ordering. The number of permutations of $n$ distinct objects is

$$n!=n(n-1)\cdots2\cdot1.$$

The number of ordered selections of $k$ distinct objects from $n$ is

$$(n)_k=n(n-1)\cdots(n-k+1)=\frac{n!}{(n-k)!}.$$

A combination is an unordered selection. The number of $k$-element subsets of an $n$-element set is

$$\binom nk=\frac{n!}{k!(n-k)!}.$$

The division by $k!$ removes the ordering of the selected elements. These formulas underlie finite uniform probability, since many events are counted by counting favorable selections divided by total selections.

3.2 Multinomial coefficients

The multinomial coefficient counts ways to split $n$ labeled objects into $r$ labeled boxes of sizes $k_1,\ldots,k_r$, where $\sum_i k_i=n$:

$$\binom{n}{k_1,\ldots,k_r}=\frac{n!}{k_1!\cdots k_r!}.$$

It appears in the expansion

$$(x_1+\cdots+x_r)^n = \sum_{k_1+\cdots+k_r=n} \binom{n}{k_1,\ldots,k_r}x_1^{k_1}\cdots x_r^{k_r}.$$

In probability, multinomial coefficients describe counts from repeated categorical trials. If each of $n$ independent trials produces category $i$ with probability $p_i$, then the probability of counts $N_i=k_i$ is

$$P(N_1=k_1,\ldots,N_r=k_r) = \frac{n!}{k_1!\cdots k_r!}p_1^{k_1}\cdots p_r^{k_r}.$$

3.3 Occupancy problems

Occupancy problems distribute balls into boxes. If $m$ balls are placed independently and uniformly into $n$ boxes, the sample space has size $n^m$. The occupancy number $N_j$ of box $j$ has binomial distribution:

$$N_j\sim \operatorname{Binomial}(m,1/n).$$

The joint distribution of occupancies is multinomial:

$$P(N_1=k_1,\ldots,N_n=k_n) = \frac{m!}{k_1!\cdots k_n!}\left(\frac1n\right)^m.$$

Occupancy models encode hashing, allocation, collisions, load balancing, coupon collection, and random mappings. They are simple enough for exact counting but rich enough to display threshold behavior. For example, the expected number of empty boxes is

$$n\left(1-\frac1n\right)^m\approx ne^{-m/n}.$$
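The empty-box formula can be compared against exhaustive enumeration for small parameters. The sketch below is illustrative only; the function names are assumptions, and the enumeration visits all $n^m$ placements, so it is limited to small $m$ and $n$.

```python
import math
from fractions import Fraction
from itertools import product

def expected_empty_exact(m, n):
    """Average number of empty boxes over all n**m equally likely placements."""
    total = 0
    for placement in product(range(n), repeat=m):
        total += n - len(set(placement))  # boxes never hit by any ball
    return Fraction(total, n ** m)

def expected_empty_formula(m, n):
    """n * (1 - 1/n)**m, from summing P(box j is empty) over the n boxes."""
    return n * (1 - Fraction(1, n)) ** m

exact_small = expected_empty_exact(5, 4)
formula_small = expected_empty_formula(5, 4)

# Exponential approximation at a larger, arbitrarily chosen scale.
formula_large = float(expected_empty_formula(2000, 1000))
approx_large = 1000 * math.exp(-2000 / 1000)

print(exact_small, formula_small, formula_large, approx_large)
```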

3.4 Balls into bins

The balls-into-bins model studies the load profile after random allocation. With $m$ balls and $n$ bins, each bin load has mean $m/n$. Indicator methods compute many structural statistics. Let $I_j$ indicate that bin $j$ is empty. Then

$$\mathbb E\sum_{j=1}^n I_j = n\left(1-\frac1n\right)^m.$$

The maximum load is subtler because it depends on tail probabilities and dependence. When $m=n$, the maximum load is of order

$$\frac{\log n}{\log\log n}$$

with high probability. This is one of the first places where expectation alone is insufficient; concentration and tail bounds are needed to control extremes.

3.5 Coupon collector

The coupon collector problem asks how many independent uniform samples from $n$ coupon types are needed to see all types. Let $T$ be the completion time. Decompose

$$T=T_1+\cdots+T_n,$$

where $T_k$ is the waiting time to collect a new coupon after $k-1$ distinct coupons have already appeared. Then $T_k$ is geometric with success probability $(n-k+1)/n$, so

$$\mathbb E[T] = \sum_{k=1}^n \frac{n}{n-k+1} = n\sum_{j=1}^n\frac1j = nH_n = n\log n+\gamma n+O(1),$$

where $H_n$ is the $n$-th harmonic number and $\gamma$ is the Euler–Mascheroni constant.

The problem shows the difference between mean scale and high-probability completion. For fixed $c$, the probability of completion by time $n\log n+cn$ converges to the nontrivial limit $e^{-e^{-c}}$ as $n\to\infty$. The final coupons dominate the waiting time.
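The harmonic-sum formula and its logarithmic approximation can be compared numerically. This is a minimal sketch; the function name `expected_collect_time` and the choice $n=100$ are illustrative assumptions, and the constant is the standard decimal value of the Euler–Mascheroni constant.

```python
import math
from fractions import Fraction

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_collect_time(n):
    """E[T] as the sum of the n geometric means n / (n - k + 1)."""
    return sum((Fraction(n, n - k + 1) for k in range(1, n + 1)), Fraction(0))

n = 100  # arbitrary illustrative size
exact = expected_collect_time(n)
approx = n * math.log(n) + EULER_GAMMA * n

print(float(exact), approx)
```

The last term of the sum, the wait for the final coupon, contributes $n$ on its own, which is the sense in which the final coupons dominate.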

3.6 Birthday problem

The birthday problem asks for the probability of at least one collision among $k$ independent uniform samples from $n$ possible birthdays. The no-collision probability is

$$\frac{n(n-1)\cdots(n-k+1)}{n^k} = \prod_{j=0}^{k-1}\left(1-\frac jn\right).$$

For $k\ll n$, use $\log(1-x)\approx-x$:

$$P(\text{no collision}) \approx \exp\left(-\frac{k(k-1)}{2n}\right).$$

The threshold occurs when $k^2/n$ is order one, so $k\sim\sqrt n$. This square-root threshold appears in hashing, cryptography, random graphs, and collision search. The lesson is that rare pairwise coincidences become likely when the number of pairs is large.
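The exact product and the exponential approximation can be placed side by side. The sketch below uses the classical values $k=23$, $n=365$ as an illustrative assumption; the function names are hypothetical.

```python
import math

def no_collision_exact(k, n):
    """Product of (1 - j/n) for j = 0, ..., k-1."""
    p = 1.0
    for j in range(k):
        p *= 1 - j / n
    return p

def no_collision_approx(k, n):
    """exp(-k(k-1)/(2n)), from log(1 - x) ~ -x applied to each factor."""
    return math.exp(-k * (k - 1) / (2 * n))

k, n = 23, 365
exact = no_collision_exact(k, n)
approx = no_collision_approx(k, n)

print(exact, approx)
```

Both values sit close to one half at $k=23$, consistent with a threshold at $k$ of order $\sqrt n$.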

3.7 Hypergeometric model

The hypergeometric distribution describes sampling without replacement. If a population has $N$ objects, $K$ successes, and $N-K$ failures, then drawing $n$ objects without replacement gives

$$P(X=x) = \frac{\binom Kx\binom{N-K}{n-x}}{\binom Nn}.$$

The expectation is

$$\mathbb E[X]=n\frac KN.$$

Unlike the binomial model, draws are dependent. Observing a success slightly reduces the chance of another success. Nevertheless, the expectation matches the binomial expectation with $p=K/N$. When $N$ is large compared with $n$, the hypergeometric distribution is close to binomial because sampling without replacement approximates sampling with replacement.
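The mass function and its mean can be checked exactly. The sketch below assumes an illustrative population of $N=52$ objects with $K=13$ successes and a sample of $n=5$; these numbers and the function name `hypergeom_pmf` are not from the text.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(X = x) when drawing n objects without replacement."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

# Illustrative parameters (an assumption): 52 objects, 13 successes, 5 draws.
N, K, n = 52, 13, 5

# Support: x successes need x <= K and n - x <= N - K failures.
support = range(max(0, n - (N - K)), min(n, K) + 1)

total = sum((hypergeom_pmf(x, N, K, n) for x in support), Fraction(0))
mean = sum((x * hypergeom_pmf(x, N, K, n) for x in support), Fraction(0))

print(total, mean)
```

The mean agrees with $nK/N$ even though the draws are dependent, which is linearity of expectation applied to the indicator of success on each draw.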

3.8 Binomial model

The binomial distribution counts successes in $n$ independent Bernoulli trials with success probability $p$:

$$P(X=k)=\binom nkp^k(1-p)^{n-k}, \qquad k=0,\ldots,n.$$

Its mean and variance are

$$\mathbb E[X]=np,\qquad \operatorname{Var}(X)=np(1-p).$$

The binomial model is the canonical finite independent-sum model. Its generating function is

$$\mathbb E[z^X]=(1-p+pz)^n.$$

For large $n$, it has different asymptotic regimes: normal approximation when $np(1-p)$ is large, Poisson approximation when $n\to\infty$, $p\to0$, and $np\to\lambda$, and large-deviation behavior far from the mean.

3.9 Geometric and negative binomial models

The geometric distribution models the waiting time $T$ until the first success in independent Bernoulli trials. With success probability $p$,

$$P(T=k)=(1-p)^{k-1}p,\qquad k\ge1,$$

and

$$\mathbb E[T]=\frac1p.$$

It has the memoryless property: for nonnegative integers $s$ and $t$,

$$P(T>s+t\mid T>s)=P(T>t).$$

The negative binomial distribution models the number of trials needed to obtain $r$ successes. If $T_r$ is the trial on which the $r$-th success occurs,

$$P(T_r=k)=\binom{k-1}{r-1}p^r(1-p)^{k-r}, \qquad k\ge r.$$

It is the law of a sum of $r$ independent geometric waiting times.

3.10 Poisson approximation in finite settings

The Poisson distribution with parameter $\lambda>0$ is

$$P(X=k)=e^{-\lambda}\frac{\lambda^k}{k!},\qquad k=0,1,2,\ldots$$

It approximates sums of many rare, weakly dependent indicators. The simplest limit is binomial:

$$\operatorname{Binomial}(n,\lambda/n)\Rightarrow \operatorname{Poisson}(\lambda).$$

Indeed, for each fixed $k$, as $n\to\infty$,

$$\binom nk\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k} \to e^{-\lambda}\frac{\lambda^k}{k!}.$$

Finite Poisson approximation is often justified by Chen–Stein methods or by bounding dependence neighborhoods. The heuristic is that if events are individually rare and do not cluster too strongly, their count behaves approximately Poisson. The failure mode is hidden dependence that creates clumping.
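The binomial-to-Poisson limit can be observed numerically in the independent case. The sketch below tracks the largest pointwise gap between the two mass functions as $n$ grows with $\lambda=2$ held fixed; the value of $\lambda$, the cutoff `kmax`, and the function names are illustrative assumptions.

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) mass at k; math.comb returns 0 when k > n."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson(lam) mass at k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def max_gap(n, lam, kmax=20):
    """Largest pointwise difference between the two pmfs for k <= kmax."""
    return max(
        abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam))
        for k in range(kmax + 1)
    )

gaps = [max_gap(n, 2.0) for n in (10, 100, 1000, 10000)]
print(gaps)
```

This only illustrates the independent binomial case; it says nothing about dependent indicators, where the clumping failure mode described above has to be ruled out separately.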

3.11 Random graphs: first finite examples

The Erdős–Rényi graph $G(n,p)$ has vertex set $\{1,\ldots,n\}$, with each possible edge included independently with probability $p$. The number of edges is

$$E\sim\operatorname{Binomial}\left(\binom n2,p\right),$$

so

$$\mathbb E[E]=p\binom n2.$$

Subgraph counts are sums of indicators. For triangles,

$$T=\sum_{\{i,j,k\}}1_{\{ij,jk,ik\text{ present}\}}, \qquad \mathbb E[T]=\binom n3p^3.$$

Random graphs demonstrate thresholds. A property may be unlikely below a scale of $p$ and likely above it. For example, isolated vertices disappear around $p\sim(\log n)/n$, and graph connectivity emerges at the same scale. Random graphs turn combinatorial existence into probabilistic mass.
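The triangle expectation can be verified exactly on very small graphs by summing over every edge configuration with its probability weight. This is a brute-force sketch under the assumption $p=1/3$ and $n\le5$; the function name is hypothetical and the method does not scale, since it visits $2^{\binom n2}$ graphs.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def expected_triangles_exact(n, p):
    """E[T] in G(n, p) by weighting every graph on n labeled vertices."""
    edges = list(combinations(range(n), 2))    # pairs (i, j) with i < j
    triples = list(combinations(range(n), 3))  # triples (i, j, k), i < j < k
    total = Fraction(0)
    for mask in product((0, 1), repeat=len(edges)):
        present = {e for e, bit in zip(edges, mask) if bit}
        weight = p ** len(present) * (1 - p) ** (len(edges) - len(present))
        count = sum(
            1 for i, j, k in triples
            if {(i, j), (j, k), (i, k)} <= present
        )
        total += weight * count
    return total

p = Fraction(1, 3)  # arbitrary illustrative edge probability
exact4 = expected_triangles_exact(4, p)
exact5 = expected_triangles_exact(5, p)

print(exact4, comb(4, 3) * p ** 3, exact5, comb(5, 3) * p ** 3)
```

Triangles sharing an edge are dependent, yet the indicator sum still gives the exact mean, as linearity of expectation predicts.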

3.12 Counting as probability and probability as counting

Counting and probability are dual in finite uniform spaces:

$$P(E)=\frac{|E|}{|\Omega|}, \qquad |E|=P(E)|\Omega|.$$

This means a probability estimate can imply a counting estimate and a counting estimate can imply a probability estimate. Many combinatorial arguments choose a random object uniformly, estimate the probability it has a property, and then multiply by the total number of objects.

The duality weakens in nonuniform spaces but survives through weighted counting. A probability law is a weighted counting scheme. The conceptual shift is from cardinality to measure: finite probability is counting with weights, while measure-theoretic probability is infinite weighted event geometry.


Chapter 4 — Why Measure Theory Enters Probability

4.1 Failure of purely discrete models

Purely discrete models assign probability by summing atomic masses:

$$P(E)=\sum_{\omega\in E}p(\omega).$$

This works when the sample space is finite or countable and the total mass is distributed over atoms. It fails for distributions such as uniform measure on $[0,1]$, where each singleton should have mass zero but the whole interval has mass one. A countable sum of zero masses remains zero, so nonatomic probability cannot be represented by pointwise mass summation.

The issue is not merely continuous variables. Infinite sequences also strain discrete intuition. A sequence of independent fair coin flips has sample space $\{0,1\}^{\mathbb N}$, uncountable in size. Each individual infinite sequence has probability zero, yet the full space has probability one. Events such as “infinitely many heads occur” require countable intersections and unions. Measure theory is the carrier that handles such events.

4.2 Continuous variables and zero-probability points

For a continuous random variable with density $f$, probabilities are assigned by integration:

$$P(a\le X\le b)=\int_a^b f(x)\,dx.$$

For a point,

$$P(X=x)=\int_x^x f(t)\,dt=0.$$

Thus individual outcomes can be impossible in the measure sense while one of them must occur. This is not a contradiction; probability zero does not mean logical impossibility.

The event $\{X=x\}$ has zero mass for each fixed $x$, but

$$\{X\in[0,1]\}=\bigcup_{x\in[0,1]}\{X=x\}$$

is an uncountable union. Countable additivity does not license summing over uncountably many null events. This is one of the fundamental reasons the σ-algebra and countability boundary matter.

4.3 Countable additivity versus finite additivity

Finite additivity states that for disjoint $A,B$,

$$P(A\cup B)=P(A)+P(B).$$

Countable additivity extends this to countably many disjoint events:

$$P\Big(\bigcup_{n=1}^{\infty}A_n\Big)=\sum_{n=1}^{\infty}P(A_n).$$

This property is essential for limits. If $A_n\uparrow A$, then countable additivity implies continuity from below:

$$P(A)=\lim_{n\to\infty}P(A_n).$$

If $A_n\downarrow A$, then

$$P(A)=\lim_{n\to\infty}P(A_n)$$

provided $P(A_1)<\infty$, which is automatically true in probability spaces.

Finite additivity can support some decision-theoretic frameworks, but it does not give the same convergence machinery. Limit theorems, Borel–Cantelli, product processes, Lebesgue integration, and conditional expectation all depend on countable additivity.

4.4 Legal events and illegal events

In finite spaces, every subset is legal. In continuous spaces, not every subset can be assigned probability while preserving desirable properties such as countable additivity and translation invariance. Nonmeasurable sets exist under standard set-theoretic assumptions. Therefore probability is defined only on a σ-algebra $\mathcal F\subseteq2^\Omega$.

A legal event is an element of $\mathcal F$. An illegal event is a subset of $\Omega$ outside $\mathcal F$. The statement $P(E)$ is meaningful only if $E\in\mathcal F$. This restriction is not a defect but a consistency condition. It prevents the probability calculus from being forced to assign incompatible masses to pathological sets.

4.5 The σ-algebra as event grammar

A σ-algebra $\mathcal F$ on $\Omega$ is a family of subsets satisfying:

$$\Omega\in\mathcal F,\qquad E\in\mathcal F\Rightarrow E^c\in\mathcal F,\qquad E_1,E_2,\ldots\in\mathcal F\Rightarrow\bigcup_{n=1}^{\infty}E_n\in\mathcal F.$$

It follows from De Morgan's laws that $\mathcal F$ is also closed under countable intersections. Thus σ-algebras support countable logical operations.

The σ-algebra is the grammar of probabilistic propositions. It determines which statements about the outcome are admissible. If $X$ is a random variable, then events such as $\{X\le t\}$, $\{X\in B\}$, and $\{\lim X_n \text{ exists}\}$ must belong to $\mathcal F$ before their probabilities can be discussed.

4.6 The probability measure as normalized measure

A probability measure is a measure with total mass one. Formally,

$$P:\mathcal F\to[0,1]$$

satisfies $P(\varnothing)=0$, $P(\Omega)=1$, and countable additivity over disjoint events. The normalization $P(\Omega)=1$ distinguishes probability from general measure theory, where total mass may be finite, infinite, or σ-finite.

This normalization makes probability a calculus of relative mass. Conditional probability, expectation, variance, and distribution all depend on the measure. The probability measure is not merely a list of event weights; it is the structural object that supports integration, pushforward, product construction, and convergence.

4.7 Null sets and almost sure reasoning

A null set is an event $N$ with $P(N)=0$. A property holds almost surely if it fails only on a null set. Thus $X=Y$ almost surely means

$$P(\{\omega:X(\omega)=Y(\omega)\})=1.$$

Almost sure equality is not pointwise equality. It is equality in the quotient space where null differences are ignored.

This quotient is essential in analysis. $L^p$ spaces identify random variables that differ only on null sets. Conditional expectation is unique only almost surely. Sample-path properties of stochastic processes often hold almost surely but not for every $\omega$. The null-set firewall prevents exporting measure-level statements as deterministic statements.

4.8 Probability theory as measure theory plus independence

Measure theory gives events, measures, functions, integration, products, and convergence. Probability theory adds independence, conditioning, stochastic processes, and asymptotic laws. Independence is not a primitive notion of general measure theory, but it is formalized through product structure and factorization. For events,

$$P(A\cap B)=P(A)P(B),$$

or for random variables,

$$\mu_{(X,Y)}=\mu_X\otimes\mu_Y.$$

The slogan “probability is measure theory plus independence” is accurate as a structural compression. Measure theory supplies the carrier; independence supplies probabilistic product structure; conditioning supplies information-relative projection; limit theorems supply asymptotic transport. Without measure theory, the infinite and continuous parts collapse into ad hoc rules.


Chapter 5 — Measurable Spaces and Events

5.1 Sets, algebras, σ-algebras

An algebra of sets is closed under finite unions and complements. A σ-algebra is closed under countable unions and complements. Every σ-algebra is an algebra, but not every algebra is a σ-algebra. Finite probability can often use algebras because only finite operations are needed. Modern probability needs σ-algebras because limiting events are naturally countable.

For example, if $A_n$ is the event that a process satisfies a condition at time $n$, then “the condition occurs infinitely often” is

$$\limsup_{n\to\infty}A_n = \bigcap_{m=1}^{\infty}\bigcup_{n\ge m}A_n.$$

This expression requires countable unions and intersections. Closure under finite operations alone is not enough.

5.2 Generated σ-algebras

Given a collection of subsets $\mathcal C\subseteq 2^\Omega$, the σ-algebra generated by $\mathcal C$, denoted $\sigma(\mathcal C)$, is the smallest σ-algebra containing $\mathcal C$. It is the intersection of all σ-algebras that contain $\mathcal C$:

$$\sigma(\mathcal C)=\bigcap\{\mathcal F:\mathcal F\text{ is a σ-algebra and }\mathcal C\subseteq\mathcal F\}.$$

Generated σ-algebras let probability specify elementary observable events and then close them under legal countable operations. For a real random variable $X$, the information it generates is

$$\sigma(X)=\{X^{-1}(B):B\in\mathcal B(\mathbb R)\}.$$

This is the σ-algebra of events determined by observing $X$.
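On a finite outcome space the generated σ-algebra can be computed by brute force, which makes the "close under legal operations" step concrete. The following is a minimal sketch, assuming a six-outcome die and two generating events chosen purely for illustration; the function name `generate_sigma_algebra` is hypothetical. On a finite space, closure under complements and pairwise unions already gives closure under all countable operations.

```python
from itertools import combinations

def generate_sigma_algebra(omega, generators):
    """Close `generators` under complement and union inside the finite set `omega`."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        # Add every missing complement.
        for a in current:
            comp = omega - a
            if comp not in family:
                family.add(comp)
                changed = True
        # Add every missing pairwise union.
        for a, b in combinations(current, 2):
            u = a | b
            if u not in family:
                family.add(u)
                changed = True
    return family

# Illustrative generators: "the die is even" and "the die shows 1 or 2".
omega = {1, 2, 3, 4, 5, 6}
sigma = generate_sigma_algebra(omega, [{2, 4, 6}, {1, 2}])
```

Under these assumptions the generators split the outcomes into the four blocks {2}, {4, 6}, {1}, {3, 5}, so the generated family consists of all unions of those blocks. The event {4} is not in it: an observer who only learns the two generating events cannot separate 4 from 6.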

5.3 Borel σ-algebra on topological spaces

For a topological space $S$, the Borel σ-algebra $\mathcal B(S)$ is generated by the open sets:

$$\mathcal B(S)=\sigma(\{U\subseteq S:U\text{ open}\}).$$

On $\mathbb R$, it is also generated by open intervals, closed intervals, half-lines $(-\infty,t]$, or intervals with rational endpoints. This flexibility is useful for proving measurability.

The Borel σ-algebra is the standard event space for random variables taking values in metric or topological spaces. A real-valued random variable $X$ is measurable if $X^{-1}(B)\in\mathcal F$ for every Borel set $B$. Because the half-lines $(-\infty,t]$ generate $\mathcal B(\mathbb R)$, it suffices to check

$$\{X\le t\}\in\mathcal F$$

for all $t\in\mathbb R$.

5.4 Product σ-algebras

If $(S,\mathcal S)$ and $(T,\mathcal T)$ are measurable spaces, the product σ-algebra is

$$\mathcal S\otimes\mathcal T = \sigma(\{A\times B:A\in\mathcal S,\ B\in\mathcal T\}).$$

It is the smallest σ-algebra making coordinate projections measurable.

Product σ-algebras are required for joint random variables. If $X:\Omega\to S$ and $Y:\Omega\to T$ are measurable, then $(X,Y):\Omega\to S\times T$ is measurable with respect to $\mathcal S\otimes\mathcal T$. The joint law of $(X,Y)$ is a probability measure on this product space. Independence is then expressed by factorization of that joint law.

5.5 Completion of a measure space

A measure space is complete if every subset of every null set is measurable. Given $(\Omega,\mathcal F,P)$, its completion adds all subsets of null sets to $\mathcal F$. The completed σ-algebra is

$$\overline{\mathcal F} = \{E\cup N:E\in\mathcal F,\ N\subseteq Z\text{ for some }Z\in\mathcal F,\ P(Z)=0\}.$$

The measure extends by $\overline P(E\cup N)=P(E)$.

Completion is natural because null subsets cannot affect probabilities. However, completion can interact subtly with product spaces and regular conditional probabilities. One must track whether the model uses Borel σ-algebras, Lebesgue completions, or completed filtrations.

5.6 Measurable subsets

A measurable subset is an event belonging to the chosen σ-algebra. In $\mathbb R^n$, Borel sets include open sets, closed sets, countable intersections of open sets, countable unions of closed sets, and many more. The Lebesgue measurable sets form the completion of the Borel sets: each is a Borel set modified by a subset of a Lebesgue-null set.

Measurability is a legality condition, not a size condition. A set can be dense, fractal, uncountable, or topologically complicated and still be measurable. Conversely, nonmeasurable sets are not necessarily visually exotic; their pathology lies in incompatibility with the desired measure properties. In probability, no event probability exists until measurability is established.

5.7 Countable operations on events

If $A_n\in\mathcal F$, then

$$\bigcup_{n=1}^{\infty}A_n,\qquad \bigcap_{n=1}^{\infty}A_n$$

belong to $\mathcal F$. This allows one to define events such as “at least one $A_n$ occurs,” “all $A_n$ occur,” “infinitely many $A_n$ occur,” and “eventually all $A_n$ occur.”

The two standard limiting events are

$$\limsup A_n=\bigcap_{m=1}^{\infty}\bigcup_{n\ge m}A_n$$

and

$$\liminf A_n=\bigcup_{m=1}^{\infty}\bigcap_{n\ge m}A_n.$$

The first means infinitely many $A_n$ occur; the second means all but finitely many $A_n$ occur.

5.8 Uncountable operation traps

A σ-algebra is not generally closed under uncountable unions or intersections. If $\{A_t:t\in T\}\subseteq\mathcal F$ and $T$ is uncountable, it does not follow that

$$\bigcup_{t\in T}A_t$$

is measurable. Sometimes it is measurable for structural reasons, but not automatically.

This matters in continuous probability. For each $t$, the event $\{X=t\}$ may be measurable and null, but the union over all $t\in\mathbb R$ is $\Omega$ if $X$ is real-valued, and $P(\Omega)=1$. Countable additivity does not apply to that union. Any argument that sums probabilities over uncountably many events is invalid unless replaced by integration, separability, or a countable reduction.

5.9 Tail events

For a sequence of random variables $X_1,X_2,\ldots$, the tail σ-algebra is

$$\mathcal T=\bigcap_{n=1}^{\infty}\sigma(X_n,X_{n+1},\ldots).$$

It contains events unaffected by changing finitely many initial coordinates. Examples include convergence of averages, infinitely many occurrences, and limiting frequencies.

For independent sequences, Kolmogorov’s zero-one law states that every tail event has probability $0$ or $1$. This is a structural theorem: events depending only on the infinite tail cannot have intermediate probability under full independence. Tail σ-algebras encode asymptotic information stripped of finite initial noise.

5.10 Germ σ-algebras

A germ σ-algebra records information arbitrarily close to a point, time, or boundary. For a stochastic process, the germ at time $t$ may be represented as an intersection of σ-algebras over shrinking neighborhoods:

$$\mathcal G_t=\bigcap_{\varepsilon>0}\sigma(X_s:|s-t|<\varepsilon).$$

It captures infinitesimal local information rather than global trajectory information.

Germ σ-algebras arise in Markov processes, Brownian motion, stochastic calculus, and local field behavior. Their analysis often requires right-continuity, completion, separability, or regularity assumptions. The danger is to assume that infinitesimal information is trivial or maximal without proving the relevant zero-one or regularity law.

5.11 Event equivalence modulo null sets

Events $A$ and $B$ are equivalent modulo null sets if

$$P(A\triangle B)=0,$$

where $A\triangle B=(A\setminus B)\cup(B\setminus A)$. In probability, such events are often indistinguishable because they have the same probability and differ only on a null set.

This quotient viewpoint is essential for $L^p$ spaces and conditional expectation. However, modulo-null equivalence must not be exported as literal equality when pointwise structure matters. In stochastic processes, two versions may agree at each fixed time almost surely yet fail to be indistinguishable, that is, equal at all times simultaneously almost surely, unless stronger conditions are imposed.


Chapter 6 — Probability Spaces

6.1 Probability space (Ω, 𝓕, P)

A probability space consists of a sample space $\Omega$, a σ-algebra $\mathcal F$, and a probability measure $P$. It is the formal carrier for all events and random variables in a model. The axioms are:

$$P(\Omega)=1,\qquad P(E)\ge0,\qquad P\Big(\bigcup_{n=1}^{\infty}E_n\Big)=\sum_{n=1}^{\infty}P(E_n)$$

for pairwise disjoint $E_n$.

Every probability expression must be interpretable in this carrier or in a declared extension/quotient. If $X$ and $Y$ are random variables on different spaces, then $X+Y$ is undefined until a joint space or coupling is specified. The probability space is not background decoration; it is the domain of legal probabilistic syntax.

6.2 Atoms and nonatomic spaces

An atom is a measurable set $A$ with $P(A)>0$ such that every measurable $B\subseteq A$ has $P(B)=0$ or $P(B)=P(A)$. Discrete probability spaces are atomic; the atoms are often singleton outcomes with positive mass. A nonatomic space has no atoms. Lebesgue probability on $[0,1]$ is nonatomic.

Atomic and nonatomic spaces behave differently. In atomic spaces, probabilities decompose into point masses. In nonatomic spaces, mass can be split continuously: by a theorem of Sierpiński, in a nonatomic probability space every $t\in[0,1]$ is the probability of some event. Continuous randomization, uniform variables, and many coupling constructions rely on nonatomic structure.

6.3 Discrete probability measures

A discrete probability measure is concentrated on a finite or countable set. If $S=\{x_i\}$, then

$$\mu=\sum_i p_i\delta_{x_i}, \qquad p_i\ge0,\quad \sum_i p_i=1.$$

For an event $A$,

$$\mu(A)=\sum_{i:x_i\in A}p_i.$$

Discrete measures are computationally transparent. Expectation becomes summation:

$$\mathbb E[f(X)]=\sum_i f(x_i)p_i.$$

However, discrete methods fail when no atoms carry mass, as with continuous distributions. Many probability models combine discrete and continuous components, so one must not assume all laws have densities or all laws have mass functions.
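The two summation formulas for a discrete law translate directly into code. The sketch below assumes a fair six-sided die as the example law and uses exact rational arithmetic; the helper names `measure` and `expectation` are illustrative, not part of any library.

```python
from fractions import Fraction

# Illustrative law: a fair die, mu = sum_i p_i * delta_{x_i} with p_i = 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def measure(pmf, event):
    """mu(A): sum the masses p_i of the atoms x_i that satisfy the predicate `event`."""
    return sum((p for x, p in pmf.items() if event(x)), Fraction(0))

def expectation(pmf, f):
    """E[f(X)]: sum of f(x_i) * p_i over the atoms."""
    return sum((f(x) * p for x, p in pmf.items()), Fraction(0))

p_even = measure(pmf, lambda x: x % 2 == 0)        # probability of {2, 4, 6}
mean = expectation(pmf, lambda x: x)               # E[X]
second_moment = expectation(pmf, lambda x: x * x)  # E[X^2]
```

For this die the three values are 1/2, 7/2, and 91/6. Nothing here carries over to a law without atoms, which is the limitation the paragraph above describes.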

6.4 Continuous probability measures

A continuous probability measure has no atoms. In a narrower usage, the term means that the measure admits a density $f$ with respect to Lebesgue measure:

$$\mu(A)=\int_A f(x)\,dx.$$

The density must satisfy $f\ge0$ and $\int f=1$. If $X$ has density $f$, then

$$P(a\le X\le b)=\int_a^b f(x)\,dx.$$

Not every continuous measure has a density. The Cantor distribution is nonatomic but singular with respect to Lebesgue measure. Thus “continuous” and “has a density” are different properties. The correct carrier distinction is among discrete (purely atomic), absolutely continuous, and singular continuous laws, and mixtures thereof.

6.5 Singular measures

A probability measure $\mu$ is singular with respect to another measure $\nu$, written $\mu\perp\nu$, if there exists a measurable set $A$ such that $\mu(A)=1$ and $\nu(A)=0$. The Cantor measure is singular with respect to Lebesgue measure: it is concentrated on the Cantor set, which has Lebesgue measure zero, while having no atoms.

Singular measures show why densities are not universal. A random variable may have a continuous distribution function but no density. Any argument that differentiates a CDF or writes probabilities as $\int_A f\,dx$ must verify absolute continuity. Otherwise it silently changes carrier.

6.6 Mixed distributions

A mixed distribution has both discrete and continuous components. For example,

$$\mu=p\delta_0+(1-p)\lambda_{[0,1]},$$

where $\lambda_{[0,1]}$ is uniform measure on $[0,1]$. Then $P(X=0)=p$, while conditional on the continuous component, $X$ spreads over $[0,1]$.

The general Lebesgue decomposition separates a measure into absolutely continuous, singular continuous, and atomic components relative to a reference measure. Mixed distributions appear in survival models, censored data, spike-and-slab priors, default models, and random variables with boundary masses. Treating them as purely discrete or purely continuous loses mass.
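The example mixture of a point mass at 0 and a uniform law on [0,1] can be checked through its CDF, where the atom shows up as a jump. This is a minimal sketch, assuming the illustrative weight p = 1/4; the function names are hypothetical.

```python
from fractions import Fraction

def mixed_cdf(t, p):
    """F(t) = P(X <= t) for mu = p*delta_0 + (1-p)*Uniform[0,1]."""
    if t < 0:
        return Fraction(0)
    if t >= 1:
        return Fraction(1)
    return p + (1 - p) * t

def mixed_cdf_left(t, p):
    """Left limit F(t-) = P(X < t) for the same law."""
    if t <= 0:
        return Fraction(0)
    if t > 1:
        return Fraction(1)
    return p + (1 - p) * t

def point_mass(t, p):
    """P(X = t), read off as the jump F(t) - F(t-)."""
    return mixed_cdf(t, p) - mixed_cdf_left(t, p)

p = Fraction(1, 4)                             # assumed weight of the atom at 0
atom_at_zero = point_mass(Fraction(0), p)      # the jump at 0
atom_at_half = point_mass(Fraction(1, 2), p)   # no jump inside (0, 1)
mean = p * 0 + (1 - p) * Fraction(1, 2)        # atom part plus uniform part
```

Treating this law as purely continuous would miss the mass p at 0; treating it as purely discrete would miss the remaining mass 1 - p spread over [0,1].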

6.7 Pushforward measures

If $X:\Omega\to S$ is measurable and $P$ is a probability measure on $\Omega$, the pushforward law $X_*P$ on $S$ is defined by

$$X_*P(B)=P(X^{-1}(B))=P(X\in B).$$

This is the distribution of $X$. It allows one to study $X$ without retaining the entire original sample space.

Pushforward is the correct abstraction behind transformation of variables. If $Y=g(X)$, then

$$\mu_Y=g_*\mu_X.$$

When densities exist and $g$ is smooth and invertible with a smooth inverse, this yields the familiar Jacobian formula. But the pushforward definition works more generally, including discrete, singular, and mixed laws.

6.8 Pullback of events

Given a measurable map $X:\Omega\to S$, an event $B\subseteq S$ pulls back to

$$X^{-1}(B)=\{\omega:X(\omega)\in B\}.$$

Measurability of $X$ means Borel or measurable events in the target pull back to legal events in $\Omega$. Thus statements about $X$ become events in the original probability space.

Pullback and pushforward are dual. Pullback turns target propositions into source events; pushforward transports source probability to target laws. The equation

$$X_*P(B)=P(X^{-1}(B))$$

is the bridge. Probability of random-variable statements is always computed by pulling back the statement to the sample space or by using the pushed-forward law.

6.9 Model extension and sample-space enlargement

A probability model may need enlargement to include new randomness. If $(\Omega,\mathcal F,P)$ models $X$, then adding an independent $Y$ with law $\nu$ on a measurable space $(S,\mathcal S)$ may require

$$(\Omega',\mathcal F',P')=(\Omega\times S,\mathcal F\otimes\mathcal S,P\otimes\nu).$$

Old events lift by projection:

$$E\mapsto E\times S.$$

Their probabilities are preserved:

$$P'(E\times S)=P(E).$$

Model extension shows that sample spaces are representations, not the probabilistic objects themselves. The same event may have different set-theoretic representatives in different carriers while preserving probability. What must be invariant are the laws and joint relations explicitly required by the claim.

6.10 Product probability spaces

Given probability spaces $(\Omega_1,\mathcal F_1,P_1)$ and $(\Omega_2,\mathcal F_2,P_2)$, the product space is

$$(\Omega_1\times\Omega_2,\mathcal F_1\otimes\mathcal F_2,P_1\otimes P_2),$$

where

$$(P_1\otimes P_2)(A\times B)=P_1(A)P_2(B).$$

Product measure extends this rectangle rule to the product σ-algebra.

Product spaces provide the canonical carrier for independent random objects. Coordinate variables $X(\omega_1,\omega_2)=\omega_1$ and $Y(\omega_1,\omega_2)=\omega_2$ are independent because their joint law factors. Dependence requires a non-product joint law.
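For finite spaces the rectangle rule determines the whole product measure, so the independence of the coordinate variables can be verified by direct enumeration. The sketch below assumes two small illustrative laws (a biased coin and a three-valued variable); all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Two illustrative finite probability spaces (assumed for this example).
P1 = {"H": Fraction(1, 3), "T": Fraction(2, 3)}
P2 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}

# Product measure on the product space: mass of each pair is the product of masses.
joint = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}

def prob(joint, event):
    """Probability of the event described by a predicate on pairs (w1, w2)."""
    return sum((p for w, p in joint.items() if event(w)), Fraction(0))

A = {"H"}    # event about the first coordinate X
B = {2, 3}   # event about the second coordinate Y

lhs = prob(joint, lambda w: w[0] in A and w[1] in B)                    # P(X in A, Y in B)
rhs = prob(joint, lambda w: w[0] in A) * prob(joint, lambda w: w[1] in B)
total = prob(joint, lambda w: True)
```

Here both sides equal 1/6, as the rectangle rule requires. Replacing `joint` by any other table with the same marginals would generally break the equality, which is what dependence means at the level of the joint law.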

6.11 Infinite product spaces

For countably many spaces $(S_n,\mathcal S_n,\mu_n)$, the infinite product space is

$$\prod_{n=1}^{\infty}S_n$$

with product σ-algebra generated by cylinder sets. A cylinder event depends on finitely many coordinates. The product measure is determined by

$$P(X_1\in A_1,\ldots,X_k\in A_k)=\prod_{i=1}^k\mu_i(A_i).$$

Infinite products model independent sequences. Events such as convergence, limiting frequencies, and infinitely many occurrences are not cylinder events, but they belong to the σ-algebra generated by cylinders through countable operations. Infinite product spaces are the foundation for laws of large numbers, coin-flip sequences, and many stochastic processes.

6.12 Kolmogorov extension theorem

The Kolmogorov extension theorem constructs a probability measure on an infinite product or path space from consistent finite-dimensional distributions. If for every finite index set $I$ there is a law $\mu_I$ on $S^I$, and these laws are compatible under marginalization, then under suitable state-space hypotheses there exists a process $(X_t)$ with those finite-dimensional laws.

The theorem is the bridge from finite data to process-level probability. It lets one define stochastic processes by specifying all finite joint distributions. However, it does not automatically provide regular sample paths. Continuity, càdlàg paths, and other path properties require additional arguments such as Kolmogorov continuity criteria.

6.13 Standard Borel spaces

A standard Borel space is a measurable space isomorphic to the Borel space of a complete separable metric space. Examples include $\mathbb R^n$, Polish spaces with their Borel σ-algebras, countable discrete spaces, and many function spaces.

Standard Borel spaces are the safe operating environment for regular conditional probabilities, disintegration, measurable selection, and extension theorems. Many pathologies disappear in this category. When probability theory states a theorem requiring “regularity of the state space,” standard Borel or Polish assumptions are often the hidden carrier condition.

6.14 Regularity of probability measures

On well-behaved topological spaces, probability measures can be approximated from outside by open sets and from inside by compact sets. Every Borel probability measure $\mu$ on a metric space is outer regular,

$$\mu(A)=\inf\{\mu(U):A\subseteq U,\ U\text{ open}\},$$

and on a Polish space it is also inner regular with respect to compact sets,

$$\mu(A)=\sup\{\mu(K):K\subseteq A,\ K\text{ compact}\},$$

for Borel $A$.

Regularity lets measure theory interact with topology. It supports weak convergence, tightness, approximation by continuous functions, and compactness arguments. Without regularity, topological probability loses much of its analytic machinery.


Chapter 7 — Random Variables

7.1 Random variables as measurable maps

A random variable is a measurable map from a probability space into a measurable state space:

$$X:(\Omega,\mathcal F)\to(S,\mathcal S).$$

Measurability means

$$X^{-1}(B)\in\mathcal F$$

for every $B\in\mathcal S$. This ensures that every legal target event has a probability.

The terminology “variable” can mislead. $X$ is a fixed function. Randomness comes from the random input $\omega$. The law $X_*P$ describes the distribution of values. The carrier $\Omega$ may be changed or enlarged without changing the law of $X$, provided the pushforward measure is preserved.

7.2 Real-valued random variables

A real-valued random variable is a measurable function $X:\Omega\to\mathbb R$, where $\mathbb R$ has its Borel σ-algebra. It suffices to check that

$$\{X\le t\}\in\mathcal F$$

for every $t\in\mathbb R$. The distribution function is

$$F_X(t)=P(X\le t).$$

Real-valued variables are central because they support ordering, integration, moments, quantiles, and convergence modes. Many non-real random objects are studied through real-valued probes. For a random vector $Z\in\mathbb R^d$, the laws of the linear functionals $\langle u,Z\rangle$, $u\in\mathbb R^d$, together determine the law of $Z$ (the Cramér–Wold device).

7.3 Vector-valued random variables

A vector-valued random variable is a measurable map $X:\Omega\to\mathbb R^d$. Its law is a probability measure on $\mathbb R^d$. The coordinate variables $X_i$ are real-valued random variables, and $X$ is measurable iff all coordinates are measurable.

Joint distributions are naturally vector-valued laws. Covariance matrices, multivariate normal distributions, concentration inequalities, and random matrix theory all use vector-valued random variables. The key object is not merely the list of marginal laws but the joint law, which encodes dependence.

7.4 Random elements in measurable spaces

A random element generalizes random variables to arbitrary measurable spaces:

$$X:\Omega\to S.$$

Here $S$ may be a function space, graph space, space of measures, manifold, metric space, or distribution space. The law $X_*P$ is a probability measure on $S$.

Random elements allow probability to model stochastic processes as single random objects taking values in path spaces. For example, Brownian motion can be treated as a random element of $C([0,\infty),\mathbb R)$ once continuity is established. This viewpoint shifts attention from coordinate distributions to process-level laws.

7.5 Simple random variables

A simple random variable takes finitely many values:

$$X=\sum_{i=1}^n a_i1_{A_i},$$

where the sets $A_i\in\mathcal F$ are measurable. Its expectation is

$$\mathbb E[X]=\sum_{i=1}^n a_iP(A_i).$$

Simple variables are the building blocks of Lebesgue integration.

Every nonnegative measurable function can be approximated increasingly by simple functions. This is the construction route from finite weighted averages to general expectation. Simple variables therefore form the bridge between elementary probability and measure-theoretic probability.
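The increasing approximation can be made explicit. One standard choice, used here as an assumption since the text does not fix a construction, rounds the value down to a multiple of $2^{-n}$ and caps it at $n$. The sketch applies this to a single value $X(\omega)$; the name `dyadic_approx` and the sample value 22/7 are illustrative.

```python
from fractions import Fraction
import math

def dyadic_approx(x, n):
    """n-th dyadic approximation of a value x >= 0:
    round x down to a multiple of 2**-n, then cap the result at n."""
    return min(Fraction(math.floor(x * 2**n), 2**n), Fraction(n))

x = Fraction(22, 7)  # hypothetical value X(omega) at one sample point
approximations = [dyadic_approx(x, n) for n in range(1, 13)]
```

For each fixed $n$ the approximation takes finitely many values, so applying it pointwise to a measurable $X\ge0$ gives a simple random variable. The sequence is nondecreasing in $n$ and, once $n$ exceeds the value, lies within $2^{-n}$ of it.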

7.6 Positive random variables

A positive or nonnegative random variable satisfies $X\ge0$. Its expectation is always defined in the extended sense:

$$\mathbb E[X]\in[0,\infty].$$

It may be infinite. This avoids undefined expressions from subtracting infinities.

For nonnegative variables, Markov’s inequality holds:

$$P(X\ge a)\le\frac{\mathbb E[X]}{a},\qquad a>0.$$

This simple inequality is a fundamental bridge from expectation to tail probability. It is often the first step in concentration, moment methods, and existence proofs.
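Markov’s inequality is easy to check numerically on a small discrete law. The sketch below assumes an illustrative three-point distribution and compares the exact tail probability with the bound for a few thresholds; the names are hypothetical.

```python
from fractions import Fraction

# Illustrative nonnegative law (assumed for this example).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

mean = sum((x * p for x, p in pmf.items()), Fraction(0))

def tail_prob(a):
    """Exact P(X >= a)."""
    return sum((p for x, p in pmf.items() if x >= a), Fraction(0))

# For each threshold a > 0: (exact tail probability, Markov bound E[X]/a).
checks = {a: (tail_prob(a), mean / a) for a in (1, 2, 4, 8)}
```

In this example the bound is never tight, which is typical: Markov’s inequality uses only the mean and is sharp only for laws concentrated on $\{0,a\}$.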

7.7 Extended real-valued random variables

An extended real-valued random variable takes values in $[-\infty,\infty]$. Such variables appear naturally as limits, suprema, hitting times, logarithms of zero, or infima of random sets. Measurability is defined using the Borel σ-algebra on the extended real line.

Expectation of an extended real-valued $X$ is handled by positive and negative parts:

$$X^+=\max(X,0),\qquad X^-=\max(-X,0), \qquad X=X^+-X^-.$$

The expectation is defined if $\mathbb E[X^+]$ and $\mathbb E[X^-]$ are not both infinite. Otherwise $\infty-\infty$ is undefined.

7.8 Distribution of a random variable

The distribution or law of $X$ is

$$\mu_X=X_*P.$$

For real $X$,

$$\mu_X(B)=P(X\in B).$$

If $X$ is discrete, $\mu_X$ is determined by masses $P(X=x)$. If $X$ has density $f$, then $\mu_X(B)=\int_B f(x)\,dx$.

The law forgets the original sample space and retains only the probabilities of events determined by $X$. This quotient is useful but loses coupling information. Knowing $\mu_X$ and $\mu_Y$ separately does not determine the joint law $\mu_{(X,Y)}$.

7.9 Cumulative distribution functions

For real-valued $X$, the cumulative distribution function is

$$F_X(t)=P(X\le t).$$

It is nondecreasing, right-continuous, and satisfies

$$\lim_{t\to-\infty}F_X(t)=0,\qquad \lim_{t\to\infty}F_X(t)=1.$$

Conversely, every function with these properties is the CDF of a probability measure on $\mathbb R$.

The CDF determines the law. Point masses appear as jumps:

$$P(X=t)=F_X(t)-F_X(t^-).$$

If $F$ is absolutely continuous, then it is differentiable almost everywhere and its derivative $f$ is a density. Differentiability almost everywhere alone is not enough: the Cantor function is differentiable almost everywhere with derivative zero, yet it is the CDF of a law with no density. Absolute continuity is the correct condition.

7.10 Quantile functions

For a CDF $F$, a quantile function may be defined by

$$Q(u)=\inf\{x:F(x)\ge u\},\qquad u\in(0,1).$$

If $U\sim\operatorname{Uniform}(0,1)$, then $Q(U)$ has CDF $F$. This is inverse-transform sampling.

Quantiles are law-level objects. They support simulation, stochastic ordering, coupling, and distributional construction. In general, $F(Q(u))$ need not equal $u$ when $F$ has jumps, and $Q(F(x))$ need not equal $x$ when $F$ has flat regions, but the pushforward law is still correct. The quantile construction gives a canonical coupling of many distributions using a shared uniform variable.
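A discrete law shows both points at once: the generalized inverse does not invert $F$ exactly, yet it still transports the uniform law to the right distribution. The sketch below assumes an illustrative three-point law and replaces the uniform variable by an evenly spaced grid of 999 values, so the counts approximate Lebesgue measure rather than sample from it; all names are hypothetical.

```python
from fractions import Fraction

# Illustrative discrete law on {0, 1, 2} (assumed for this example).
support = [0, 1, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

def cdf(x):
    """F(x) = P(X <= x)."""
    return sum((p for s, p in zip(support, probs) if s <= x), Fraction(0))

def quantile(u):
    """Q(u) = inf{x : F(x) >= u} for 0 < u < 1."""
    running = Fraction(0)
    for s, p in zip(support, probs):
        running += p
        if running >= u:
            return s
    return support[-1]

# Deterministic stand-in for U ~ Uniform(0,1): the grid k/1000, k = 1..999.
grid = [Fraction(k, 1000) for k in range(1, 1000)]
counts = {s: sum(1 for u in grid if quantile(u) == s) for s in support}
```

With this law, `quantile(Fraction(3, 10))` is 1 while `cdf(1)` is 3/4, so $F(Q(u))\ne u$ at that point. The grid counts are nonetheless in proportion 1/4, 1/2, 1/4 up to the missing endpoint, matching the target masses.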

7.11 Equality almost surely

Random variables $X$ and $Y$ on the same probability space are equal almost surely if

$$P(X=Y)=1.$$

They may differ on a null set. In measure-theoretic probability, many objects are defined only up to almost sure equality. For example, elements of $L^p$ are equivalence classes of random variables modulo a.s. equality.

Almost sure equality requires a common sample space. If $X$ and $Y$ live on different spaces, the expression $P(X=Y)$ is meaningless until a coupling is supplied. Equality almost surely is therefore stronger than equality in distribution and more carrier-dependent.

7.12 Equality in distribution

Random variables $X$ and $Y$, possibly on different probability spaces, are equal in distribution if

$$\mu_X=\mu_Y.$$

For real variables, this is equivalent to

$$F_X(t)=F_Y(t)$$

for every $t$. By right-continuity, it suffices to check equality on a dense set of $t$, such as the common continuity points.

Equality in distribution permits comparison without a joint carrier. It is central to weak convergence and limit theorems. But it says nothing about pointwise equality, dependence, correlation, or joint relations. Exporting $X\stackrel d=Y$ as $X=Y$ is a category error unless a coupling with equality is constructed.

7.13 Joint distributions

For random variables $X:\Omega\to S$ and $Y:\Omega\to T$, the joint distribution is the law of $(X,Y)$:

$$\mu_{(X,Y)}(A\times B)=P(X\in A,\ Y\in B).$$

It determines the marginals:

$$\mu_X(A)=\mu_{(X,Y)}(A\times T),\qquad \mu_Y(B)=\mu_{(X,Y)}(S\times B).$$

The marginals do not determine the joint distribution. Dependence lives in the gap between marginals and joint law. Coupling theory studies all possible joint laws with specified marginals. Independence is the special joint law $\mu_X\otimes\mu_Y$.

7.14 Marginals

Marginals are projections of joint laws. If $\gamma$ is a probability measure on $S\times T$, its marginals are

$$\gamma_S(A)=\gamma(A\times T),\qquad \gamma_T(B)=\gamma(S\times B).$$

For random variables, these are the laws of each coordinate.

Marginalization loses dependence information. Many different couplings share the same marginals: independent coupling, perfectly correlated coupling, antimonotone coupling, optimal transport coupling, and others. Any inference from marginal laws to joint behavior requires an additional coupling certificate.
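A two-coin example makes the loss of information visible. The sketch below builds three joint laws on $\{0,1\}^2$ with identical fair-coin marginals and compares the probability that the coordinates agree; the tables and function names are illustrative assumptions.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Three couplings of two fair coins (illustrative joint laws on {0,1} x {0,1}).
couplings = {
    "independent": {(x, y): quarter for x in (0, 1) for y in (0, 1)},
    "comonotone": {(0, 0): half, (1, 1): half},      # Y = X
    "antimonotone": {(0, 1): half, (1, 0): half},    # Y = 1 - X
}

def marginals(joint):
    """Project a joint law onto each coordinate."""
    mx, my = {}, {}
    for (x, y), p in joint.items():
        mx[x] = mx.get(x, Fraction(0)) + p
        my[y] = my.get(y, Fraction(0)) + p
    return mx, my

def prob_equal(joint):
    """P(X = Y) under the given joint law."""
    return sum((p for (x, y), p in joint.items() if x == y), Fraction(0))

agreement = {name: prob_equal(joint) for name, joint in couplings.items()}
```

All three tables project to the same pair of fair-coin marginals, yet $P(X=Y)$ is 1/2, 1, and 0 respectively. No statement about agreement of the coordinates can be read off the marginals alone.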

7.15 Transformations of random variables

If $Y=g(X)$, then the law of $Y$ is the pushforward

$$\mu_Y=g_*\mu_X.$$

For discrete $X$,

$$P(Y=y)=\sum_{x:g(x)=y}P(X=x).$$

For continuous $X$ with density $f_X$ and smooth invertible $g$ whose inverse is also smooth,

$$f_Y(y)=f_X(g^{-1}(y))\left|\frac{d}{dy}g^{-1}(y)\right|.$$

In higher dimensions, the Jacobian determinant appears. But the pushforward formula is more fundamental than density formulas. It works even when no density exists. The density formula is a special coordinate representation of measure transport.
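The discrete pushforward formula is a one-line aggregation: every atom of $X$ sends its mass to its image under $g$. The sketch below assumes $X$ uniform on five integers and the non-injective map $g(x)=x^2$, chosen so that masses actually merge; the function name `pushforward` is illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(pmf, g):
    """Law of g(X): P(Y = y) is the total mass of {x : g(x) = y}."""
    out = defaultdict(Fraction)   # Fraction() is zero, so masses accumulate from 0
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)

# Illustrative law: X uniform on {-2, -1, 0, 1, 2}.
law_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
law_y = pushforward(law_x, lambda x: x * x)
```

The result places mass 1/5 on 0 and 2/5 on each of 1 and 4. No invertibility or Jacobian is involved, which is why the pushforward description applies where the density formula does not.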

7.16 Measurability traps

Common measurability traps include defining $X$ by a supremum over an uncountable family, projecting a measurable subset of a product space, or assuming every subset of a continuous space is measurable. Supremum over countably many measurable functions is measurable; supremum over uncountably many requires additional structure such as separability or joint measurability.

Another trap is confusing pointwise-defined modifications with measurable versions. A function equal almost everywhere to a measurable function need not be measurable unless the measure space is complete or the modification is controlled. In stochastic processes, path properties often require choosing versions with measurable or regular sample paths. Measurability is the legality gate for probability.


Chapter 8 — Expectation as Lebesgue Integration

8.1 Simple-function integration

For a nonnegative simple random variable

$$X=\sum_{i=1}^n a_i1_{A_i}, \qquad a_i\ge0,$$

define

$$\int X\,dP=\sum_{i=1}^n a_iP(A_i).$$

If the representation is refined, the value remains the same. This is the finite weighted-average formula expressed in measure language.

Simple-function integration is the primitive construction of the Lebesgue integral. General nonnegative measurable functions are approximated from below by simple functions. Thus expectation is not introduced by density or Riemann area; it is built by monotone approximation from event probabilities.

8.2 Nonnegative random variables

For $X\ge0$, define

$$\mathbb E[X]=\int X\,dP = \sup\left\{\int s\,dP:0\le s\le X,\ s\text{ simple}\right\}.$$

The value may be $+\infty$. This definition is stable under monotone limits and does not require cancellation.

A useful identity for nonnegative $X$ is the tail integral formula:

$$\mathbb E[X]=\int_0^\infty P(X>t)\,dt.$$

For integer-valued nonnegative $X$,

$$\mathbb E[X]=\sum_{k=1}^{\infty}P(X\ge k).$$

These formulas convert expectation into tail probabilities.
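The integer-valued tail-sum identity can be confirmed on a small example. The sketch below assumes the Binomial(3, 1/2) mass function as an illustrative law and computes the mean both ways in exact arithmetic; the names are hypothetical.

```python
from fractions import Fraction

# Illustrative law: Binomial(3, 1/2) on {0, 1, 2, 3}.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

# Direct definition: E[X] = sum_k k * P(X = k).
mean_direct = sum((k * p for k, p in pmf.items()), Fraction(0))

def tail(k):
    """P(X >= k)."""
    return sum((p for j, p in pmf.items() if j >= k), Fraction(0))

# Tail-sum formula: E[X] = sum_{k >= 1} P(X >= k); terms vanish beyond max(pmf).
mean_from_tails = sum((tail(k) for k in range(1, max(pmf) + 1)), Fraction(0))
```

Both computations give 3/2. The tail form counts each unit of $X$ once: an outcome with $X=k$ contributes to exactly the $k$ events $\{X\ge1\},\ldots,\{X\ge k\}$.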

8.3 Integrable random variables

A real-valued random variable $X$ is integrable if

$$\mathbb E[|X|]<\infty.$$

Then

$$\mathbb E[X]=\mathbb E[X^+]-\mathbb E[X^-],$$

where both terms are finite. Integrability prevents the undefined expression $\infty-\infty$.

Integrability is the gate for many operations. Linearity of expectation, conditional expectation in $L^1$, convergence of averages, and martingale theory all require appropriate integrability. A random variable may be finite almost surely but not integrable; heavy-tailed distributions provide standard examples.

8.4 Positive and negative parts

Every real $X$ decomposes as

$$X=X^+-X^-, \qquad |X|=X^++X^-,$$

where

$$X^+=\max(X,0),\qquad X^-=\max(-X,0).$$

Both $X^+$ and $X^-$ are nonnegative measurable functions when $X$ is measurable.

This decomposition is not cosmetic. Lebesgue integration handles nonnegative functions first; signed integration is defined by subtracting two nonnegative integrals only when the subtraction is meaningful. If both $\mathbb E[X^+]$ and $\mathbb E[X^-]$ are infinite, $\mathbb E[X]$ is undefined.

8.5 Expectation as integral

Expectation is Lebesgue integration against probability:

$$\mathbb E[X]=\int_\Omega X(\omega)\,P(d\omega).$$

If $X$ has law $\mu_X$, then

$$\mathbb E[g(X)]=\int_{\mathbb R}g(x)\,\mu_X(dx)$$

whenever the integral is defined. If $\mu_X$ has density $f$,

$$\mathbb E[g(X)]=\int_{\mathbb R}g(x)f(x)\,dx.$$

The law-level formula shows expectation depends only on the distribution of $X$, not on the particular sample-space representation. But expectations of functions involving multiple variables depend on the joint law, not only the marginals.

8.6 Linearity of expectation

If $X,Y$ are integrable and $a,b\in\mathbb R$, then

$$\mathbb E[aX+bY]=a\mathbb E[X]+b\mathbb E[Y].$$

If $X,Y\ge0$, linearity also holds in the extended sense:

$$\mathbb E[X+Y]=\mathbb E[X]+\mathbb E[Y],$$

allowing $+\infty$.

Linearity does not require independence. This remains one of probability’s most effective tools. Independence, or at least uncorrelatedness, is needed for multiplicative identities such as

$$\mathbb E[XY]=\mathbb E[X]\mathbb E[Y],$$

not for additive identities.
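As a small check of the contrast between additive and multiplicative identities, the sketch below uses a fair six-sided die with $X$ the face value and $Y=X^2$. The example and the helper name `expect` are illustrative assumptions, not part of the text above; the arithmetic is exact because it uses rational numbers.

```python
from fractions import Fraction

# Fair six-sided die: outcomes 1..6, each with probability 1/6 (illustrative model).
outcomes = range(1, 7)
p = Fraction(1, 6)

def expect(f):
    """Exact expectation of f(omega) under the uniform law on the die."""
    return sum(p * f(w) for w in outcomes)

X = lambda w: w        # face value
Y = lambda w: w * w    # square of the face value, a function of X, so not independent of it

lhs_sum = expect(lambda w: X(w) + Y(w))
rhs_sum = expect(X) + expect(Y)
lhs_prod = expect(lambda w: X(w) * Y(w))
rhs_prod = expect(X) * expect(Y)

print(lhs_sum == rhs_sum)    # additivity holds despite the dependence
print(lhs_prod == rhs_prod)  # the product identity fails for this dependent pair
```

Here $\mathbb E[X+Y]=56/3$ on both sides, while $\mathbb E[XY]=147/2$ differs from $\mathbb E[X]\mathbb E[Y]=637/12$.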

8.7 Change of variables / pushforward formula

If $X:\Omega\to S$ has law $\mu=X_*P$, then for measurable $g:S\to\mathbb R$,

$$\int_\Omega g(X(\omega))\,P(d\omega) = \int_S g(x)\,\mu(dx),$$

in the sense that if either integral is defined, so is the other, and the two agree.

This is the abstract change-of-variables formula. It states that integrating a function of $X$ over the source space equals integrating that function over the distribution of $X$.

Classical density transformations are special cases. If $Y=g(X)$, then

$$\mathbb E[h(Y)] = \int h(g(x))\,\mu_X(dx) = \int h(y)\,\mu_Y(dy).$$

The pushforward law $\mu_Y=g_*\mu_X$ contains the transformed probabilities.
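The pushforward identity can be checked on a finite model. The sketch below assumes two fair dice as the source space, $X$ the sum of the faces, and $g(x)=(x-7)^2$; these choices and the variable names are illustrative only.

```python
from fractions import Fraction
from collections import defaultdict
from itertools import product

# Source space: ordered pairs of die faces, uniform probability (illustrative model).
omega = list(product(range(1, 7), repeat=2))
P = Fraction(1, len(omega))

X = lambda w: w[0] + w[1]   # random variable: sum of the two faces
g = lambda x: (x - 7) ** 2  # function applied to X

# Left side: integrate g(X(omega)) over the source space.
lhs = sum(P * g(X(w)) for w in omega)

# Build the pushforward law mu of X on {2, ..., 12}.
mu = defaultdict(Fraction)
for w in omega:
    mu[X(w)] += P

# Right side: integrate g against the law of X.
rhs = sum(g(x) * m for x, m in mu.items())

print(lhs, rhs)
```

Both sides equal $35/6$, the variance of the sum of two dice. The second computation never refers to the 36-point sample space, only to the 11-point law.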

8.8 Expectation under distribution law

For a real random variable $X$ with distribution function $F$, expectation can be written as a Lebesgue–Stieltjes integral:

$$\mathbb E[X]=\int_{\mathbb R}x\,dF(x),$$

when $X$ is integrable. If $X$ is discrete,

$$\mathbb E[X]=\sum_x x\,P(X=x).$$

If $X$ has density $f$,

$$\mathbb E[X]=\int_{\mathbb R}xf(x)\,dx.$$

These are not different concepts of expectation; they are different representations of the same law-level integral. The correct representation depends on the measure type. For mixed or singular distributions, forcing a density or a mass function loses information.

8.9 Infinite expectations

A nonnegative random variable may have infinite expectation:

$$\mathbb E[X]=\infty.$$

For example, a Pareto-type tail with $P(X>t)\sim c/t$ as $t\to\infty$ has divergent expectation since

$$\mathbb E[X]=\int_0^\infty P(X>t)\,dt$$

diverges logarithmically. Infinite expectation is a legitimate value for nonnegative $X$.

For signed variables, infinite positive and negative parts cannot be subtracted. A Cauchy random variable has no expectation in the Lebesgue sense, even though symmetric principal value calculations may give zero. Principal value is not expectation; it is a different limiting operation.

8.10 Integrability conditions

Integrability is often certified by tail bounds, domination, or moment estimates. If $|X|\le Y$ and $Y$ is integrable, then $X$ is integrable. If $P(|X|>t)\le Ct^{-\alpha}$ for large $t$, then

$$\mathbb E[|X|]<\infty$$

when $\alpha>1$, by the tail integral formula.

For $p>0$,

$$\mathbb E[|X|^p] = p\int_0^\infty t^{p-1}P(|X|>t)\,dt.$$

This formula gives moment criteria from tail decay. Moment assumptions in limit theorems are therefore tail assumptions in disguised integral form.
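A numerical sanity check of the tail formula is sketched below for an assumed example, $X\sim\text{Exponential}(1)$, where $P(X>t)=e^{-t}$ and $\mathbb E[X^p]=\Gamma(p+1)$. The function name `tail_moment`, the truncation point, and the step count are illustrative choices; the midpoint rule is only an approximation.

```python
import math

def tail_moment(p, survival, upper=60.0, steps=100_000):
    """Midpoint-rule approximation of p * integral_0^upper t^(p-1) * survival(t) dt."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += t ** (p - 1) * survival(t)
    return p * total * h

# Survival function P(X > t) of an Exponential(1) variable (assumed example).
survival = lambda t: math.exp(-t)

for p in (1, 2, 3):
    # Tail-formula value next to the known moment Gamma(p + 1).
    print(p, tail_moment(p, survival), math.gamma(p + 1))
```

The truncation at `upper=60` is harmless here because the exponential tail is negligible beyond that point; for a heavy tail such as $P(X>t)\sim c/t$ no finite truncation would stabilize, which is the divergence described in 8.9.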

8.11 Expectation versus typical value

Expectation is an average, not necessarily a typical value. Heavy-tailed variables may have means dominated by rare extreme events. A variable may have expectation far from its median or mode. In skewed distributions, $\mathbb E[X]$, median, and most likely value can be very different.

This distinction matters in risk, algorithms, and probabilistic method arguments. Expected runtime may be finite while typical runtime is much smaller, or median performance may be good while expectation is ruined by rare catastrophes. Concentration inequalities are needed when one wants typical behavior, not just average behavior.

8.12 Expectation under model extension

If a probability space $(\Omega,\mathcal F,P)$ is extended to $(\Omega',\mathcal F',P')$ by adding auxiliary randomness, old random variables lift by composition with projection. If $\pi:\Omega'\to\Omega$ is measurable and preserves probability, meaning $P'(\pi^{-1}(A))=P(A)$ for every event $A\in\mathcal F$, and $X' = X\circ\pi$, then

$$\mathbb E_{\Omega'}[X']=\mathbb E_\Omega[X].$$

Thus expectation is invariant under probability-preserving model extension.

This invariance justifies changing sample spaces for convenience. One may add independent uniforms, construct couplings, or realize random variables on canonical spaces. The invariant object is the law and the relevant joint structure, not the raw sample-space representation.


Chapter 9 — Core Convergence Theorems

9.1 Monotone convergence theorem

If $0\le X_n\uparrow X$ pointwise, then

$$\mathbb E[X_n]\uparrow\mathbb E[X].$$

This is the monotone convergence theorem. It is one of the foundational results of Lebesgue integration and depends on countable additivity.

The theorem licenses passing limits through expectations for increasing nonnegative sequences without domination. It is used to define integrals, prove Tonelli’s theorem, derive tail integral formulas, and handle stopping times by approximation. The nonnegative monotone structure is the certificate; without monotonicity, the conclusion may fail.

9.2 Fatou’s lemma

For nonnegative random variables $X_n$,

$$\mathbb E[\liminf_{n\to\infty}X_n] \le \liminf_{n\to\infty}\mathbb E[X_n].$$

Fatou’s lemma gives a lower-semicontinuity principle for expectations. It is weaker than full convergence exchange but requires minimal hypotheses.

Fatou is often the correct tool when limits are available but domination is absent. It prevents mass from appearing in the limit without being accounted for, but it allows mass to escape. In probability, this “escape” corresponds to lack of uniform integrability or tightness.

9.3 Dominated convergence theorem

If $X_n\to X$ almost surely and $|X_n|\le Y$ for an integrable $Y$, then

$$\mathbb E[X_n]\to\mathbb E[X].$$

The integrable dominating variable $Y$ prevents mass from escaping into rare large spikes.

Dominated convergence is one of the main liftback theorems from pointwise convergence to expectation convergence. The domination hypothesis is load-bearing. Pointwise convergence alone does not imply convergence of expectations. A common counterexample is $X_n=n1_{(0,1/n)}$ on $[0,1]$ with Lebesgue measure: $X_n\to0$ almost everywhere, but $\mathbb E[X_n]=1$ for every $n$.

9.4 Bounded convergence theorem

If $X_n\to X$ almost surely and $|X_n|\le M$ for a constant $M<\infty$, then

$$\mathbb E[X_n]\to\mathbb E[X].$$

This is a special case of dominated convergence with $Y=M$.

Bounded convergence is often used with probabilities because indicators are bounded. If $1_{A_n}\to1_A$ almost surely, then

$$P(A_n)=\mathbb E[1_{A_n}]\to\mathbb E[1_A]=P(A).$$

However, indicator convergence must be verified; set convergence can mean different things depending on limsup and liminf behavior.

9.5 Uniform integrability

A family $\mathcal X\subset L^1$ is uniformly integrable if

$$\lim_{K\to\infty}\sup_{X\in\mathcal X} \mathbb E[|X|1_{\{|X|>K\}}]=0.$$

Uniform integrability prevents mass from escaping to infinity uniformly over the family.

It is the correct bridge from convergence in probability to convergence in $L^1$. If $X_n\to X$ in probability and $\{X_n\}$ is uniformly integrable, then $X$ is integrable and

$$\mathbb E|X_n-X|\to0.$$

Without uniform integrability, convergence in probability does not imply convergence of expectations.

9.6 Vitali convergence theorem

Vitali’s theorem states that, for integrable $X_n$, $X_n\to X$ in $L^1$ if and only if $X_n\to X$ in probability and the family $\{X_n\}$ is uniformly integrable; the limit $X$ is then automatically integrable. More generally, $L^p$-versions use uniform integrability of $|X_n|^p$.

This theorem identifies the missing payload in many false expectation arguments. Convergence in probability controls typical deviations; uniform integrability controls rare large deviations. Both are needed for convergence of means. The theorem is therefore a precise decomposition of convergence into typical behavior plus tail control.

9.7 Interchanging limits and expectations

The formal question is when

$$\lim_{n\to\infty}\mathbb E[X_n]=\mathbb E[\lim_{n\to\infty}X_n].$$

Monotone convergence, dominated convergence, bounded convergence, and Vitali’s theorem are sufficient frameworks. Fatou gives one-sided control for nonnegative sequences.

The error pattern is to treat expectation as a finite sum after limits have entered. Infinite probability spaces allow mass to move, concentrate, vanish, or escape. Every interchange of limit and expectation requires a certificate: monotonicity, domination, boundedness, uniform integrability, or another convergence theorem.

9.8 Failure modes for limit-expectation exchange

A standard failure is rare spikes. On $[0,1]$ with Lebesgue measure, define

$$X_n=n1_{(0,1/n)}.$$

Then $X_n\to0$ almost everywhere, but $\mathbb E[X_n]=1$ for every $n$. The pointwise limit misses mass that escapes into narrower and taller regions.

Another failure is lack of integrability in the limit. Variables may converge pointwise to a nonintegrable random variable while expectations fail to converge finitely. Oscillation can also break convergence if positive and negative parts are not controlled. The general counterkernel is missing tail control.
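The rare-spike failure can be traced at a single sample point. The sketch below assumes the sample point $\omega=1/100$ and a handful of indices $n$; the function names `X` and `mean` are illustrative. The expectation is computed as height times interval length, which is exact for this step function under Lebesgue measure.

```python
from fractions import Fraction

def X(n, w):
    """X_n(w) = n on the open interval (0, 1/n), and 0 elsewhere on [0, 1]."""
    return n if 0 < w < Fraction(1, n) else 0

def mean(n):
    """E[X_n] under Lebesgue measure on [0, 1]: height n times interval length 1/n."""
    return n * Fraction(1, n)

w = Fraction(1, 100)  # an arbitrary fixed sample point (assumption for illustration)
path = [X(n, w) for n in (1, 10, 99, 100, 101, 1000)]

print(path)                               # the path at w is eventually 0
print([mean(n) for n in (1, 10, 1000)])   # the means stay at 1
```

At $\omega=1/100$ the sequence takes the values $1, 10, 99$ and then vanishes once $1/n\le\omega$, while every $\mathbb E[X_n]$ equals 1. No integrable $Y$ can dominate all $X_n$, so dominated convergence does not apply.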

9.9 Tightness versus integrability

Tightness controls where probability mass lies. A family of probability measures $\{\mu_i\}$ on a metric space is tight if for every $\varepsilon>0$, there exists a compact set $K$ such that

$$\sup_i\mu_i(K^c)<\varepsilon.$$

Uniform integrability controls the magnitude of random variables:

$$\sup_i\mathbb E[|X_i|1_{\{|X_i|>M\}}]\to0 \quad\text{as } M\to\infty.$$

Tightness is about mass not escaping spatially; uniform integrability is about weighted mass not escaping in expectation. A family can be tight without uniformly integrable first moments. For convergence of laws, tightness is central; for convergence of expectations, uniform integrability is the stronger bridge.

9.10 Expectation convergence counterexamples

Counterexamples are not peripheral; they define the gates. Let $X_n=n$ with probability $1/n$ and $0$ otherwise. Then $X_n\to0$ in probability, but $\mathbb E[X_n]=1$ for every $n$. Thus convergence in probability does not imply expectation convergence.

Let $X_n=n^2$ with probability $1/n$ and $0$ otherwise. Then $X_n\to0$ in probability, but $\mathbb E[X_n]=n\to\infty$. This shows that even a vanishing probability of large deviations is insufficient; the magnitude of rare deviations matters. Uniform integrability is the exact missing condition.
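To see concretely why the first counterexample is not uniformly integrable, the sketch below evaluates the tail mass $\mathbb E[X_n1_{\{X_n>K\}}]$ for $X_n=n$ with probability $1/n$. The function names and the finite search range `n_max` are illustrative assumptions; a true supremum ranges over all $n$, and the finite range only gives a lower bound for it.

```python
from fractions import Fraction

def tail_mass(n, K):
    """E[X_n 1{X_n > K}] for X_n = n with probability 1/n, else 0."""
    # The only nonzero value is n, so the tail mass is n * (1/n) when n > K, else 0.
    return n * Fraction(1, n) if n > K else Fraction(0)

def sup_tail(K, n_max):
    """Maximum of the tail mass over n = 1..n_max (a lower bound for the supremum)."""
    return max(tail_mass(n, K) for n in range(1, n_max + 1))

for K in (10, 100, 1000):
    print(K, sup_tail(K, 10 * K))  # stays at 1 no matter how large K is
```

For every threshold $K$ some index $n>K$ carries tail mass exactly 1, so the supremum never tends to zero as $K\to\infty$.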


Chapter 10 — Moments and Inequalities

10.1 Moments

The $k$-th raw moment of $X$ is

$$\mathbb E[X^k],$$

when it exists. The $k$-th absolute moment is

$$\mathbb E[|X|^k].$$

Moments summarize distributional information. The first moment is the mean; the second raw moment contributes to variance; higher moments measure tail weight and shape.

Moment existence is not automatic. Heavy-tailed distributions may have some finite moments and some infinite moments. If $\mathbb E[|X|^q]<\infty$, then $\mathbb E[|X|^p]<\infty$ for $0<p<q$ on probability spaces. Higher finite moments imply lower finite moments, but not conversely.

10.2 Absolute moments

Absolute moments avoid cancellation. The condition

$$\mathbb E[|X|^p]<\infty$$

defines $X\in L^p$. For signed variables, $\mathbb E[X^p]$ may exist by cancellation in some improper sense while $\mathbb E[|X|^p]$ fails to be finite. Probability theory generally uses absolute integrability to certify legal operations.

The tail formula is

$$\mathbb E[|X|^p] = p\int_0^\infty t^{p-1}P(|X|>t)\,dt.$$

This makes clear that $p$-th moments are tail decay conditions. They are not just algebraic averages; they control rare large values.

10.3 Variance and standard deviation

Variance is the second central moment:

$$\operatorname{Var}(X)=\mathbb E[(X-\mu)^2], \qquad \mu=\mathbb E[X].$$

The standard deviation is

$$\sigma=\sqrt{\operatorname{Var}(X)}.$$

Variance exists when $X\in L^2$. It measures quadratic spread around the mean.

The identity

$$\operatorname{Var}(X)=\mathbb E[X^2]-\mu^2$$

is computationally useful. For sums,

$$\operatorname{Var}\Big(\sum_iX_i\Big) = \sum_i\operatorname{Var}(X_i) + 2\sum_{i<j}\operatorname{Cov}(X_i,X_j).$$

If the variables are pairwise uncorrelated, the covariance terms vanish.

10.4 Covariance and correlation

Covariance is

$$\operatorname{Cov}(X,Y)=\mathbb E[(X-\mathbb E X)(Y-\mathbb E Y)].$$

It measures linear co-fluctuation. When both variances are positive, correlation normalizes covariance:

$$\rho(X,Y)= \frac{\operatorname{Cov}(X,Y)} {\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.$$

By Cauchy–Schwarz, $|\rho|\le1$.

Covariance is not a general dependence measure. Variables can be dependent with zero covariance. For example, if $X$ is symmetric about zero with finite third moment and $Y=X^2$, then $\operatorname{Cov}(X,Y)=\mathbb E[X^3]-\mathbb E[X]\mathbb E[X^2]=0$ under symmetry, but $Y$ is determined by $X$. Covariance detects linear dependence, not arbitrary dependence.
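The symmetric example can be made fully explicit. The sketch below assumes $X$ uniform on $\{-1,0,1\}$ and $Y=X^2$; the helper `E` and the variable names are illustrative.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1} (assumed example), Y = X^2.
support = [-1, 0, 1]
p = Fraction(1, 3)

def E(f):
    """Exact expectation of f(X) under the uniform law on the support."""
    return sum(p * f(x) for x in support)

# Cov(X, Y) = E[X * X^2] - E[X] * E[X^2].
cov = E(lambda x: x * x**2) - E(lambda x: x) * E(lambda x: x**2)

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = E(lambda x: 1 if (x == 0 and x**2 == 0) else 0)
p_x0 = E(lambda x: 1 if x == 0 else 0)
p_y0 = E(lambda x: 1 if x**2 == 0 else 0)

print(cov)                  # zero covariance
print(p_joint, p_x0 * p_y0) # joint probability differs from the product
```

The covariance is exactly 0, yet $P(X=0,Y=0)=1/3$ while $P(X=0)P(Y=0)=1/9$, so the pair is not independent.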

10.5 Jensen’s inequality

If $\varphi$ is convex and $X$ is integrable with $\varphi(X)$ integrable or nonnegative, then

$$\varphi(\mathbb E[X])\le\mathbb E[\varphi(X)].$$

For concave $\varphi$, the inequality reverses. Jensen’s inequality says expectation commutes with convex functions only in an inequality direction.

Examples include

$$(\mathbb E[X])^2\le\mathbb E[X^2],$$

and for positive $X$,

$$\log\mathbb E[X]\ge\mathbb E[\log X]$$

because $\log$ is concave. Jensen is a convexity certificate inside probability: averaging before applying a convex function gives a value no larger than applying the convex function before averaging.

10.6 Markov’s inequality

For $X\ge0$ and $a>0$,

$$P(X\ge a)\le\frac{\mathbb E[X]}{a}.$$

Proof: $X\ge a1_{\{X\ge a\}}$, so taking expectations gives $\mathbb E[X]\ge aP(X\ge a)$.

Markov’s inequality is crude but universal. Applied to $|X|^p$ with $p>0$, it gives

$$P(|X|\ge a)\le\frac{\mathbb E[|X|^p]}{a^p}.$$

This converts moment bounds into tail bounds. It is often the first inequality in probabilistic estimates and the base of many concentration arguments.

10.7 Chebyshev’s inequality

If $X$ has finite variance, then for $t>0$,

$$P(|X-\mathbb E X|\ge t) \le \frac{\operatorname{Var}(X)}{t^2}.$$

This is Markov’s inequality applied to $(X-\mathbb E X)^2$.

Chebyshev gives a general concentration bound using only variance. For averages of independent variables with common mean $\mu$ and common variance $\sigma^2$,

$$\operatorname{Var}\left(\frac1n\sum_{i=1}^nX_i\right)=\frac{\sigma^2}{n},$$

so

$$P\left(\left|\frac1n\sum_iX_i-\mu\right|\ge\varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2}.$$

This proves a weak law of large numbers under finite variance.
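To get a feel for how loose the Chebyshev bound is, the sketch below compares it with the exact deviation probability for the average of $n$ fair coin flips, where $\mu=1/2$ and $\sigma^2=1/4$. The coin-flip model, the choice $\varepsilon=1/10$, and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def exact_tail(n, eps):
    """P(|S_n/n - 1/2| >= eps) for S_n ~ Binomial(n, 1/2), computed exactly."""
    total = sum(comb(n, k) for k in range(n + 1)
                if abs(Fraction(k, n) - Fraction(1, 2)) >= eps)
    return Fraction(total, 2 ** n)

def chebyshev_bound(n, eps):
    """sigma^2 / (n * eps^2) with sigma^2 = 1/4 for a fair coin."""
    return Fraction(1, 4) / (n * eps ** 2)

eps = Fraction(1, 10)
for n in (50, 200, 800):
    # Exact probability next to the Chebyshev upper bound 25/n.
    print(n, float(exact_tail(n, eps)), float(chebyshev_bound(n, eps)))
```

Both columns tend to zero, which is the weak law; the bound decays like $1/n$, whereas the exact probability decays much faster for bounded summands.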

10.8 Hölder’s inequality

If $p,q>1$ with $1/p+1/q=1$, then

$$\mathbb E[|XY|]\le \left(\mathbb E[|X|^p]\right)^{1/p} \left(\mathbb E[|Y|^q]\right)^{1/q}.$$

More generally, products of several variables are bounded by corresponding $L^p$ norms whose reciprocal exponents sum to one.

Hölder is the core multiplicative inequality of $L^p$ spaces. It proves duality bounds, integrability of products, and moment interpolation. Cauchy–Schwarz is the case $p=q=2$:

$$|\mathbb E[XY]|\le\sqrt{\mathbb E[X^2]\mathbb E[Y^2]}.$$

10.9 Minkowski’s inequality

For $p\ge1$,

$$\|X+Y\|_p\le\|X\|_p+\|Y\|_p,$$

where

$$\|X\|_p=(\mathbb E[|X|^p])^{1/p}.$$

This is the triangle inequality in $L^p$.

Minkowski turns $L^p$ spaces into normed spaces. It allows probability to use functional analysis: completeness, projections, duality, and compactness methods. For $0<p<1$, $\|\cdot\|_p$ is not a norm and Minkowski fails; this changes the geometry of the space.

10.10 Lyapunov inequality

On a probability space, if $0<p<q$, then

$$(\mathbb E[|X|^p])^{1/p} \le (\mathbb E[|X|^q])^{1/q}.$$

Thus $L^q\subseteq L^p$. Higher moments control lower moments.

Lyapunov’s inequality is a monotonicity principle for moment norms. It follows from Jensen or Hölder. It is frequently used to downgrade assumptions: if a theorem gives a fourth-moment bound, then second and first moments are automatically finite and bounded.

10.11 Paley–Zygmund inequality

For $X\ge0$ with finite, nonzero second moment and $0<\theta<1$,

$$P(X\ge \theta\mathbb E[X]) \ge (1-\theta)^2\frac{(\mathbb E[X])^2}{\mathbb E[X^2]}.$$

This gives a lower bound on the probability that $X$ is not too small.

Paley–Zygmund is a second-moment existence tool. If $(\mathbb E[X])^2$ is comparable to $\mathbb E[X^2]$, then $X$ is positive with nontrivial probability. It is central in probabilistic combinatorics, random graphs, and branching processes, where first moment alone may not prove existence with positive probability.
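The inequality can be checked exactly on a small law. The sketch below assumes a three-point distribution for $X$ chosen only for illustration; `law` and `pz_check` are hypothetical names.

```python
from fractions import Fraction

# Assumed example law: P(X=0)=1/2, P(X=1)=1/4, P(X=4)=1/4.
law = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

m1 = sum(x * p for x, p in law.items())      # E[X]
m2 = sum(x * x * p for x, p in law.items())  # E[X^2]

def pz_check(theta):
    """Return (P(X >= theta * E[X]), Paley-Zygmund lower bound) for the law above."""
    lhs = sum(p for x, p in law.items() if x >= theta * m1)
    rhs = (1 - theta) ** 2 * m1 ** 2 / m2
    return lhs, rhs

for theta in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    print(theta, pz_check(theta))
```

With $\mathbb E[X]=5/4$ and $\mathbb E[X^2]=17/4$, the choice $\theta=1/2$ gives probability $1/2$ against a guaranteed lower bound of $25/272$. The bound is far from tight here, but it is positive using only the first two moments.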

10.12 Moment generating functions

The moment generating function of $X$ is

$$M_X(t)=\mathbb E[e^{tX}],$$

where finite. Derivatives at zero, when justified (for instance when $M_X$ is finite in a neighborhood of zero), give moments:

$$M_X^{(k)}(0)=\mathbb E[X^k].$$

MGFs transform sums of independent variables into products:

$$M_{X+Y}(t)=M_X(t)M_Y(t)$$

when $X,Y$ are independent.

MGFs support Chernoff bounds:

$$P(X\ge a)\le e^{-ta}M_X(t),\qquad t>0.$$

Optimizing over $t$ gives exponential tail estimates. The existence domain of $M_X$ is load-bearing; heavy-tailed variables may have infinite MGF for all $t>0$.
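As an illustration of optimizing the Chernoff bound over $t$, the sketch below assumes $S\sim\text{Binomial}(n,1/2)$, whose MGF is $M_S(t)=((1+e^t)/2)^n$, and minimizes over a coarse grid of $t$ values rather than solving for the optimum in closed form. The grid, the function names, and the chosen thresholds are illustrative assumptions; any $t>0$ on the grid yields a valid upper bound.

```python
import math
from math import comb

def exact_upper_tail(n, a):
    """P(S >= a) for S ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(a, n + 1)) / 2 ** n

def chernoff_bound(n, a, grid=None):
    """Minimum over a grid of t > 0 of exp(-t*a) * M_S(t), M_S(t) = ((1 + e^t)/2)^n."""
    grid = grid or [i / 100 for i in range(1, 301)]  # t = 0.01, 0.02, ..., 3.00
    return min(math.exp(-t * a) * ((1 + math.exp(t)) / 2) ** n for t in grid)

n = 100
for a in (60, 70, 80):
    # Exact tail probability next to the grid-optimized Chernoff bound.
    print(a, exact_upper_tail(n, a), chernoff_bound(n, a))
```

The bound decays exponentially in the threshold, unlike the polynomial decay from Markov or Chebyshev, but it relies on $M_S(t)$ being finite for some $t>0$.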

10.13 Characteristic functions

The characteristic function of $X$ is

$$\phi_X(t)=\mathbb E[e^{itX}].$$

It always exists because $|e^{itX}|=1$. Characteristic functions determine laws and convert convolution into multiplication:

$$\phi_{X+Y}(t)=\phi_X(t)\phi_Y(t)$$

for independent $X,Y$.

Characteristic functions are central to the central limit theorem. If

$$\phi_{X_n}(t)\to\phi(t)$$

pointwise and $\phi$ is continuous at zero, then $\phi$ is the characteristic function of a probability law and $X_n$ converges in distribution to that law. This is Lévy’s continuity theorem.

10.14 Cumulants

The cumulant generating function is

$$K_X(t)=\log M_X(t),$$

when the MGF exists near zero. The $n$-th cumulant is

$$\kappa_n=K_X^{(n)}(0).$$

The first cumulant is the mean; the second is variance; higher cumulants encode skewness, kurtosis, and non-Gaussian structure.

Cumulants add under independence:

$$K_{X+Y}(t)=K_X(t)+K_Y(t),$$

so

$$\kappa_n(X+Y)=\kappa_n(X)+\kappa_n(Y).$$

Gaussian variables have cumulants of order $n\ge3$ equal to zero. This makes cumulants useful in normal approximation, Edgeworth expansions, statistical mechanics, and random matrix theory.
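Additivity of cumulants can be verified exactly for small discrete laws. The sketch below assumes two finite laws chosen for illustration and uses the standard expressions of the first three cumulants in terms of raw moments, $\kappa_1=m_1$, $\kappa_2=m_2-m_1^2$, $\kappa_3=m_3-3m_1m_2+2m_1^3$; the function names are hypothetical.

```python
from fractions import Fraction
from collections import defaultdict

def raw_moment(law, k):
    """k-th raw moment of a finite law given as {value: probability}."""
    return sum(p * x ** k for x, p in law.items())

def cumulants3(law):
    """First three cumulants computed from the raw moments m1, m2, m3."""
    m1, m2, m3 = (raw_moment(law, k) for k in (1, 2, 3))
    return (m1, m2 - m1 ** 2, m3 - 3 * m1 * m2 + 2 * m1 ** 3)

def convolve(law_x, law_y):
    """Law of X + Y for independent X, Y with the given finite laws."""
    out = defaultdict(Fraction)
    for x, p in law_x.items():
        for y, q in law_y.items():
            out[x + y] += p * q
    return dict(out)

# Assumed example laws: a Bernoulli(1/3) variable and a three-point variable.
law_x = {0: Fraction(2, 3), 1: Fraction(1, 3)}
law_y = {0: Fraction(1, 2), 2: Fraction(1, 4), 5: Fraction(1, 4)}

kx, ky = cumulants3(law_x), cumulants3(law_y)
kxy = cumulants3(convolve(law_x, law_y))

print(kx, ky, kxy)
print(kxy == tuple(a + b for a, b in zip(kx, ky)))  # cumulants add under independence
```

Raw moments of order three do not add in this way, which is one reason cumulants are the more convenient coordinates for sums of independent variables.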

10.15 Moment determinacy and indeterminacy

A distribution is moment-determinate if its sequence of moments uniquely determines the law. Compactly supported distributions are moment-determinate. A sufficient condition on $\mathbb R$ is Carleman’s condition:

$$\sum_{n=1}^{\infty}m_{2n}^{-1/(2n)}=\infty,$$

where $m_{2n}=\mathbb E[X^{2n}]$.

Moment indeterminacy means two different distributions can share all moments. The lognormal distribution is a standard example of a law not determined by its moments. Therefore “all moments match” is not always a distribution certificate unless determinacy is proved. Characteristic functions avoid this issue because they always determine the law.
