Probability Theory — Chapters 11–20

Chapter 11 — `Lᵖ` Spaces in Probability

11.1 `L⁰`, `L¹`, `L²`, `Lᵖ`, `L∞`

Given a probability space $(Ω, 𝐹, 𝑃)$ , the space $𝐿^{0}$ consists of measurable random variables modulo almost sure equality. It has no natural norm in general, but it carries convergence in probability. For $0 < 𝑝 < \infty$ , the space $𝐿^{𝑝}$ consists of random variables satisfying

𝐸 ∣ 𝑋 ∣^{𝑝} < \infty .

For $𝑝 \geq 1$ , define

∥ 𝑋 ∥_{𝑝} = (𝐸 ∣ 𝑋 ∣^{𝑝})^{1 / 𝑝} .

For $𝑝 = \infty$ ,

∥ 𝑋 ∥_{\infty} = ess sup ∣ 𝑋 ∣ .

The word “essential” means null sets are ignored. Thus $𝐿^{𝑝}$ is not a space of literal functions but of equivalence classes under $𝑋 = 𝑌$ almost surely.

On a probability space, higher $𝐿^{𝑝}$ integrability implies lower $𝐿^{𝑞}$ integrability:

𝑞 < 𝑝 \Rightarrow 𝐿^{𝑝} \subseteq 𝐿^{𝑞}, ∥ 𝑋 ∥_{𝑞} \leq ∥ 𝑋 ∥_{𝑝} .

This inclusion uses $𝑃 (Ω) = 1$ . On infinite-measure spaces, the inclusion can fail. Probability spaces therefore have a special integrability hierarchy: stronger moments control weaker moments.
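
As a quick numerical check of the norm ordering, the following sketch computes $∥ 𝑋 ∥_{𝑝}$ for several exponents on a small finite probability space. The weights, the values of $𝑋$ , and the helper name `lp_norm` are illustrative assumptions and do not come from the text.

```python
import math

# Hypothetical finite probability space: four outcomes with weights summing to 1.
probs = [0.1, 0.2, 0.3, 0.4]
values = [5.0, -1.0, 0.5, 2.0]   # X(omega) for each outcome


def lp_norm(values, probs, p):
    """Return (E|X|^p)^(1/p) for finite p, or the essential sup for p = inf."""
    if math.isinf(p):
        # Outcomes of probability zero are ignored by the essential supremum.
        return max(abs(v) for v, w in zip(values, probs) if w > 0)
    return sum(w * abs(v) ** p for v, w in zip(values, probs)) ** (1.0 / p)


exponents = [1, 2, 4, 8, math.inf]
norms = [lp_norm(values, probs, p) for p in exponents]
for p, n in zip(exponents, norms):
    print(f"p = {p}: norm = {n:.4f}")
```

The printed norms should be nondecreasing in $𝑝$ . If the weights are rescaled so that the total mass exceeds 1, each norm is multiplied by a factor depending on $𝑝$ and the ordering need not survive, which is the role of $𝑃 (Ω) = 1$ .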

11.2 Norms and seminorms modulo null sets

The expression

∥ 𝑋 ∥_{𝑝} = (𝐸 ∣ 𝑋 ∣^{𝑝})^{1 / 𝑝}

is a true norm only after identifying random variables that agree almost surely. Before quotienting, it is merely a seminorm, since $∥ 𝑋 ∥_{𝑝} = 0$ implies $𝑋 = 0$ almost surely, not necessarily pointwise. The quotient operation is not cosmetic; it is forced by measure theory.

The $𝐿^{𝑝}$ carrier therefore has this form:

𝐿^{𝑝} (Ω, 𝐹, 𝑃) = {𝑋 : 𝐸 ∣ 𝑋 ∣^{𝑝} < \infty} / {𝑋 = 𝑌 a.s.} .

Any theorem stated in $𝐿^{𝑝}$ is automatically modulo null sets. A conditional expectation, an $𝐿^{2}$ projection, or a martingale representative is not uniquely defined pointwise unless a version is selected. The null-set quotient is the hidden grammar of modern probability.

11.3 Completeness

For $𝑝 \geq 1$ , $𝐿^{𝑝}$ is a Banach space: every Cauchy sequence in $∥ \cdot ∥_{𝑝}$ converges in $𝐿^{𝑝}$ to an element of $𝐿^{𝑝}$ . Completeness is what allows limiting constructions. If $𝑋_{𝑛}$ is Cauchy in $𝐿^{𝑝}$ , then there exists $𝑋 \in 𝐿^{𝑝}$ such that

∥ 𝑋_{𝑛} - 𝑋 ∥_{𝑝} \to 0.

The proof uses subsequences and almost sure convergence: from an $𝐿^{𝑝}$ -Cauchy sequence one extracts a rapidly Cauchy subsequence, controls the sum of increments, obtains almost sure convergence, and then returns to $𝐿^{𝑝}$ . This pattern is common in probability: norm control produces subsequential pathwise control, then integration reconstructs the full convergence.

11.4 Hilbert-space structure of `L²`

The space $𝐿^{2}$ is a Hilbert space with inner product

⟨ 𝑋, 𝑌 ⟩ = 𝐸 [𝑋 𝑌]

in the real case, or $𝐸 [𝑋 \overline{𝑌}]$ in the complex case. The norm satisfies

∥ 𝑋 ∥_{2}^{2} = 𝐸 [𝑋^{2}] .

Orthogonality means $𝐸 [𝑋 𝑌] = 0$ . Centered variables $𝑋 - 𝐸 𝑋$ and $𝑌 - 𝐸 𝑌$ are orthogonal exactly when their covariance is zero.

This Hilbert structure powers conditional expectation, martingale theory, Gaussian processes, orthogonal decompositions, and least-squares prediction. In $𝐿^{2}$ , probabilistic estimation becomes geometry: the best approximation is an orthogonal projection onto a closed subspace.

11.5 Orthogonality and projection

If $𝐻 \subseteq 𝐿^{2}$ is a closed linear subspace and $𝑋 \in 𝐿^{2}$ , there is a unique $𝑌 \in 𝐻$ minimizing

∥ 𝑋 - 𝑌 ∥_{2} .

The error $𝑋 - 𝑌$ is orthogonal to every element of $𝐻$ :

𝐸 [(𝑋 - 𝑌) 𝑍] = 0, 𝑍 \in 𝐻 .

This is the projection theorem.

In probability, $𝐻$ often consists of all square-integrable random variables measurable with respect to a sub-σ-algebra $𝐺$ . The projection of $𝑋$ onto $𝐻$ is $𝐸 [𝑋 ∣ 𝐺]$ . Thus conditional expectation is not merely an average; in $𝐿^{2}$ , it is the best prediction of $𝑋$ using information $𝐺$ .

11.6 Conditional expectation as projection

For $𝑋 \in 𝐿^{1}$ , conditional expectation $𝐸 [𝑋 ∣ 𝐺]$ is the $𝐺$ -measurable random variable satisfying

\int_{𝐴} 𝐸 [𝑋 ∣ 𝐺] 𝑑 𝑃 = \int_{𝐴} 𝑋 𝑑 𝑃

for every event $𝐴 \in 𝐺$ . When $𝑋 \in 𝐿^{2}$ , this object is the orthogonal projection of $𝑋$ onto $𝐿^{2} (𝐺)$ , the closed subspace of $𝐺$ -measurable elements of $𝐿^{2}$ .

This projection interpretation explains the tower property:

𝐸 [𝐸 [𝑋 ∣ 𝐺] ∣ 𝐻] = 𝐸 [𝑋 ∣ 𝐻]

when $𝐻 \subseteq 𝐺$ . Projecting first onto a larger information space and then onto a smaller one gives the same result as projecting directly onto the smaller one.
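
On a finite sample space the projection picture can be verified by direct computation. The sketch below assumes six equally likely outcomes and two nested partitions generating $𝐺$ and $𝐻$ ; the values of $𝑋$ and the helper names are hypothetical. Conditional expectation is computed as a cell average, then the orthogonality relation and the tower property are checked exactly.

```python
from fractions import Fraction

# Hypothetical finite model: six equally likely outcomes.
omega = range(6)
prob = {w: Fraction(1, 6) for w in omega}
X = {0: 3, 1: 1, 2: 4, 3: 1, 4: 5, 5: 9}

# Partitions generating G (finer) and H (coarser), so that H is contained in G.
G_cells = [{0, 1}, {2, 3}, {4, 5}]
H_cells = [{0, 1, 2, 3}, {4, 5}]


def expectation(f):
    return sum(prob[w] * f[w] for w in omega)


def cond_exp(f, cells):
    """E[f | sigma(cells)]: replace f by its probability-weighted cell average."""
    out = {}
    for cell in cells:
        mass = sum(prob[w] for w in cell)
        avg = sum(prob[w] * f[w] for w in cell) / mass
        for w in cell:
            out[w] = avg
    return out


EXG = cond_exp(X, G_cells)
EXH = cond_exp(X, H_cells)

# Orthogonality: the error X - E[X|G] is orthogonal to indicators of G-cells,
# hence to every G-measurable Z (a linear combination of those indicators).
residual = {w: X[w] - EXG[w] for w in omega}
inner_products = [
    expectation({w: residual[w] * (1 if w in cell else 0) for w in omega})
    for cell in G_cells
]

# Tower property: projecting onto G and then onto H equals projecting onto H.
tower = cond_exp(EXG, H_cells)
print(inner_products, tower == EXH)
```

Exact rational arithmetic is used so that the equalities are literal rather than approximate.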

11.7 Uniform integrability in `L¹`

A family $\mathcal{X} \subset 𝐿^{1}$ is uniformly integrable if

\lim_{𝐾 \to \infty} \sup_{𝑋 \in \mathcal{X}} 𝐸 [∣ 𝑋 ∣ 1_{{∣ 𝑋 ∣ > 𝐾}}] = 0.

This condition prevents expectation from hiding in rare but large values. Boundedness in $𝐿^{𝑝}$ for some $𝑝 > 1$ implies uniform integrability by Hölder or Markov estimates.

Uniform integrability is the correct compactness substitute in $𝐿^{1}$ . It is also the missing bridge between convergence in probability and convergence of expectations. If $𝑋_{𝑛} \to 𝑋$ in probability and ${𝑋_{𝑛}}$ is uniformly integrable, then $𝑋 \in 𝐿^{1}$ and

𝐸 ∣ 𝑋_{𝑛} - 𝑋 ∣ \to 0.

Without uniform integrability, expectations can fail to converge even when random variables converge in probability.
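
A standard illustration of this failure, sketched here under the assumption that $𝑋_{𝑛}$ takes the value $𝑛$ with probability $1 / 𝑛$ and $0$ otherwise: the sequence converges to $0$ in probability, yet every $𝑋_{𝑛}$ has mean $1$ , and the tail expectations never become uniformly small. The function names are illustrative.

```python
from fractions import Fraction


def mean(n):
    """E[X_n] for X_n = n with probability 1/n and 0 otherwise."""
    return n * Fraction(1, n)


def deviation_prob(n):
    """P(|X_n - 0| > eps) for any 0 < eps < 1 (valid for n >= 1)."""
    return Fraction(1, n)


def tail_mean(n, K):
    """E[|X_n| 1{|X_n| > K}]: all of the mean sits at the value n."""
    return mean(n) if n > K else Fraction(0)


for K in (10, 100, 1000):
    # Any n > K attains the supremum, so a finite scan past K suffices here.
    sup_tail = max(tail_mean(n, K) for n in range(1, 2 * K + 2))
    print(K, sup_tail)
```

For each threshold $𝐾$ the supremum of the tail expectation stays at $1$ , so the family is not uniformly integrable, and $𝐸 [𝑋_{𝑛}] = 1$ does not converge to $𝐸 [0] = 0$ .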

11.8 `Lᵖ` convergence versus probability convergence

Convergence in $𝐿^{𝑝}$ means

𝐸 ∣ 𝑋_{𝑛} - 𝑋 ∣^{𝑝} \to 0.

For $𝑝 > 0$ , $𝐿^{𝑝}$ convergence implies convergence in probability, because Markov’s inequality gives

𝑃 (∣ 𝑋_{𝑛} - 𝑋 ∣ > 𝜀) \leq \frac{𝐸 ∣ 𝑋_{𝑛} - 𝑋 ∣^{𝑝}}{𝜀^{𝑝}} .

The converse is false without additional integrability control.

Different $𝐿^{𝑝}$ modes have different strengths. On a probability space, if $𝑝 > 𝑞 > 0$ , then $𝐿^{𝑝}$ convergence implies $𝐿^{𝑞}$ convergence:

∥ 𝑋_{𝑛} - 𝑋 ∥_{𝑞} \leq ∥ 𝑋_{𝑛} - 𝑋 ∥_{𝑝} .

But convergence in probability alone is only a typical-value statement; it does not control tails. That is why $𝐿^{𝑝}$ spaces encode moment-sensitive convergence, not just distributional behavior.

11.9 Hypercontractivity preview

Hypercontractivity refers to operators that improve integrability: an operator $𝑇$ may map $𝐿^{𝑝}$ into $𝐿^{𝑞}$ with $𝑞 > 𝑝$ , satisfying

∥ 𝑇 𝑓 ∥_{𝑞} \leq 𝐶 ∥ 𝑓 ∥_{𝑝} .

In probability, this phenomenon appears for noise operators, Gaussian semigroups, Boolean functions, Markov semigroups, and log-Sobolev inequalities.

The conceptual value is that smoothing or randomization can upgrade weak moment control into stronger moment control. Hypercontractivity becomes a certificate engine for concentration, tail bounds, invariance principles, and analysis of high-dimensional random structures. It converts functional inequalities into probabilistic regularity.
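
One concrete instance, stated here as background rather than taken from the text, is the Bonami–Beckner inequality on the Boolean cube $\{-1, 1\}^{𝑛}$ with uniform measure: the noise operator $𝑇_{𝜌}$ satisfies $∥ 𝑇_{𝜌} 𝑓 ∥_{𝑞} \leq ∥ 𝑓 ∥_{𝑝}$ whenever $1 < 𝑝 \leq 𝑞$ and $𝜌 \leq \sqrt{(𝑝 - 1) / (𝑞 - 1)}$ . The sketch below checks the case $𝑝 = 2$ , $𝑞 = 4$ numerically for $𝑛 = 3$ ; the test function and all helper names are assumptions chosen for illustration.

```python
import itertools
import math

n = 3
cube = list(itertools.product([-1, 1], repeat=n))   # uniform measure on {-1,1}^n
subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]


def chi(S, x):
    """Character chi_S(x) = product of x_i over i in S (empty product is 1)."""
    return math.prod(x[i] for i in S)


def norm(f, p):
    """L^p norm under the uniform probability measure on the cube."""
    return (sum(abs(f[x]) ** p for x in cube) / len(cube)) ** (1.0 / p)


def noise(f, rho):
    """T_rho f = sum_S rho^{|S|} fhat(S) chi_S, with fhat(S) = E[f chi_S]."""
    fhat = {S: sum(f[x] * chi(S, x) for x in cube) / len(cube) for S in subsets}
    return {x: sum(rho ** len(S) * fhat[S] * chi(S, x) for S in subsets) for x in cube}


# Hypothetical test function: indicator of the all-ones point (a "spiky" f).
f = {x: 1.0 if all(c == 1 for c in x) else 0.0 for x in cube}

rho = 1 / math.sqrt(3)          # sqrt((p - 1) / (q - 1)) for p = 2, q = 4
Tf = noise(f, rho)
lhs, rhs = norm(Tf, 4), norm(f, 2)
print(lhs, rhs, norm(f, 4))
```

For this spiky $𝑓$ the raw $𝐿^{4}$ norm exceeds the $𝐿^{2}$ norm, but after smoothing by $𝑇_{𝜌}$ the $𝐿^{4}$ norm falls below the original $𝐿^{2}$ norm. A single function is only a sanity check, not a proof of the inequality.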

11.10 Functional-analytic probability carrier

The functional-analytic view treats random variables as elements of normed, Banach, or Hilbert spaces. Expectation becomes a linear functional, conditional expectation becomes projection, martingales become adapted sequences in $𝐿^{𝑝}$ , and convergence theorems become compactness or boundedness principles.

This carrier is powerful because it exposes the geometry behind probability. $𝐿^{1}$ controls mass, $𝐿^{2}$ controls energy and orthogonality, $𝐿^{\infty}$ controls uniform bounds, and $𝐿^{𝑝}$ norms interpolate between tail and moment behavior. The main firewall is that functional-analytic equality is usually equality modulo null sets; pointwise claims require additional version control.

Chapter 12 — Independence

12.1 Independence of events

Events $𝐴$ and $𝐵$ are independent if

𝑃 (𝐴 \cap 𝐵) = 𝑃 (𝐴) 𝑃 (𝐵) .

Equivalently, if $𝑃 (𝐵) > 0$ ,

𝑃 (𝐴 ∣ 𝐵) = 𝑃 (𝐴) .

The event $𝐵$ gives no probabilistic information about $𝐴$ . Independence is not disjointness; disjoint positive-probability events are maximally incompatible, not independent.

For a family ${𝐴_{𝑖}}_{𝑖 \in 𝐼}$ , mutual independence means every finite subfamily factors:

𝑃 (⋂_{𝑗 = 1}^{𝑘} 𝐴_{𝑖_{𝑗}}) = \prod_{𝑗 = 1}^{𝑘} 𝑃 (𝐴_{𝑖_{𝑗}}) .

The finite-subfamily condition is crucial even for infinite families. Independence is always certified by finite joint factorization, not by vague unrelatedness.

12.2 Independence of σ-algebras

Sub-σ-algebras $𝐺$ and $𝐻$ are independent if

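𝑃 (𝐴 \cap 𝐵) = 𝑃 (𝐴) 𝑃 (𝐵)

for all $𝐴 \in 𝐺$ , $𝐵 \in 𝐻$ . For a family ${𝐺_{𝑖}}$ , mutual independence means that for every finite selection of distinct indices and events $𝐴_{𝑖} \in 𝐺_{𝑖}$ , the probability of the intersection equals the product of the individual probabilities.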

This is the cleanest information-theoretic form of independence. A σ-algebra represents information. Independence of σ-algebras means information in one σ-algebra does not change probabilities of events in the other. Random-variable independence is defined through the σ-algebras generated by those variables.

12.3 Independence of random variables

Random variables $𝑋_{𝑖} : Ω \to 𝑆_{𝑖}$ are independent if the σ-algebras $𝜎 (𝑋_{𝑖})$ are independent. Equivalently, for measurable sets $𝐵_{𝑖} \subseteq 𝑆_{𝑖}$ ,

𝑃 (𝑋_{1} \in 𝐵_{1}, \dots, 𝑋_{𝑛} \in 𝐵_{𝑛}) = \prod_{𝑖 = 1}^{𝑛} 𝑃 (𝑋_{𝑖} \in 𝐵_{𝑖}) .

For real-valued variables, it is enough to check rectangles of the form $(- \infty, 𝑡_{𝑖}]$ .

In law language, independence means the joint law factors:

𝜇_{(𝑋_{1}, \dots, 𝑋_{𝑛})} = 𝜇_{𝑋_{1}} \otimes \dots \otimes 𝜇_{𝑋_{𝑛}} .

This is the most compact certificate. Marginal laws alone do not imply independence; the joint law must be product.

12.4 Factorization of joint laws

If $𝑋$ and $𝑌$ have joint law $𝛾$ on $𝑆 \times 𝑇$ , then they are independent iff

𝛾 = 𝜇_{𝑋} \otimes 𝜇_{𝑌} .

For densities, this becomes

𝑓_{𝑋, 𝑌} (𝑥, 𝑦) = 𝑓_{𝑋} (𝑥) 𝑓_{𝑌} (𝑦)

almost everywhere. For discrete variables, it becomes

𝑃 (𝑋 = 𝑥, 𝑌 = 𝑦) = 𝑃 (𝑋 = 𝑥) 𝑃 (𝑌 = 𝑦)

for all values $𝑥, 𝑦$ .

The phrase “almost everywhere” matters. Density factorizations are measure-level claims. Changing a density on a null set changes the pointwise formula but not the law. The true carrier is the measure factorization, not a chosen version of the density.
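
For discrete variables the factorization test is mechanical. The following sketch assumes a joint pmf stored as a dictionary keyed by value pairs; the function names and the two example laws are hypothetical. Both laws have the same marginals, but only one is a product.

```python
from fractions import Fraction
from itertools import product


def marginals(joint):
    """Return the two marginal pmfs of a joint pmf given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py


def is_independent(joint):
    """Check P(X=x, Y=y) == P(X=x) P(Y=y) for every pair of values."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x, y in product(px, py))


half, quarter = Fraction(1, 2), Fraction(1, 4)

# Two hypothetical joint laws with identical uniform marginals on {0, 1}.
product_law = {(x, y): quarter for x in (0, 1) for y in (0, 1)}
diagonal_law = {(0, 0): half, (1, 1): half}          # X = Y almost surely

print(is_independent(product_law), is_independent(diagonal_law))
```

The check iterates over all value pairs, including pairs of joint probability zero, since a missing pair with positive marginals is exactly how factorization fails for the diagonal law.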

12.5 Product measures

Given probability measures $𝜇$ on $𝑆$ and $𝜈$ on $𝑇$ , their product measure $𝜇 \otimes 𝜈$ is characterized by

(𝜇 \otimes 𝜈) (𝐴 \times 𝐵) = 𝜇 (𝐴) 𝜈 (𝐵) .

It extends uniquely under standard measure-theoretic hypotheses from rectangles to the product σ-algebra.

Product measure constructs independent joint randomness from marginal laws. If $𝑋 (𝑠, 𝑡) = 𝑠$ and $𝑌 (𝑠, 𝑡) = 𝑡$ on $𝑆 \times 𝑇$ with law $𝜇 \otimes 𝜈$ , then $𝑋 \sim 𝜇$ , $𝑌 \sim 𝜈$ , and $𝑋, 𝑌$ are independent. Dependence is exactly deviation from this product carrier.

12.6 Infinite independent families

An infinite family ${𝑋_{𝑖}}_{𝑖 \in 𝐼}$ is independent if every finite subfamily is independent. For countable products, one constructs the law on $\prod_{𝑖} 𝑆_{𝑖}$ using finite-dimensional cylinder probabilities:

𝑃 (𝑋_{𝑖_{1}} \in 𝐴_{1}, \dots, 𝑋_{𝑖_{𝑘}} \in 𝐴_{𝑘}) = \prod_{𝑗 = 1}^{𝑘} 𝜇_{𝑖_{𝑗}} (𝐴_{𝑗}) .

Infinite independence is the foundation of coin-flip sequences, iid samples, random walks, and product processes. It produces tail events, zero-one laws, and limit theorems. The finite-dimensional definition is not a weakness; countable probability itself is built by extension from finite-dimensional consistency.

12.7 Pairwise versus mutual independence

Pairwise independence requires every pair to factor. Mutual independence requires every finite collection to factor. Pairwise independence does not control higher-order interactions. For example, let $𝑋, 𝑌$ be independent fair bits and set $𝑍 = 𝑋 \oplus 𝑌$ . Then $𝑋, 𝑌, 𝑍$ are pairwise independent, but not mutually independent because $𝑍$ is determined by $𝑋, 𝑌$ .

This distinction is load-bearing in variance computations, random constructions, hashing, and pseudorandomness. Pairwise independence may be sufficient for second-moment estimates, but not for Chernoff bounds, product distributions, or full limit theorems. The independence strength must match the theorem.
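
The XOR example can be verified by enumerating the four equally likely outcomes. This is a minimal sketch; the helper names `prob` and `factors` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: (x, y) uniform on {0,1}^2, with z = x XOR y.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
weight = Fraction(1, len(outcomes))


def prob(event):
    """Probability of the set of outcomes satisfying the predicate `event`."""
    return sum(weight for o in outcomes if event(o))


def factors(indices, values):
    """Does the joint probability equal the product of coordinate probabilities?"""
    joint = prob(lambda o: all(o[i] == v for i, v in zip(indices, values)))
    prod_marg = Fraction(1)
    for i, v in zip(indices, values):
        prod_marg *= prob(lambda o, i=i, v=v: o[i] == v)
    return joint == prod_marg


pairwise = all(
    factors(pair, vals)
    for pair in [(0, 1), (0, 2), (1, 2)]
    for vals in product((0, 1), repeat=2)
)
mutual = all(factors((0, 1, 2), vals) for vals in product((0, 1), repeat=3))
print(pairwise, mutual)
```

Every pair factors, but the triple does not: for instance the outcome $(0, 0, 1)$ has probability $0$ while the product of the three marginal probabilities is $1 / 8$ .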

12.8 Conditional independence

Events or random variables $𝑋$ and $𝑌$ are conditionally independent given $𝐺$ if

𝑃 (𝑋 \in 𝐴, 𝑌 \in 𝐵 ∣ 𝐺) = 𝑃 (𝑋 \in 𝐴 ∣ 𝐺) 𝑃 (𝑌 \in 𝐵 ∣ 𝐺)

almost surely, for all measurable $𝐴, 𝐵$ . Conditional independence says that after the information $𝐺$ is known, no residual dependence remains.

Conditional independence is not the same as unconditional independence. Two variables can be dependent marginally but conditionally independent given a latent variable; this is common in Bayesian networks and mixture models. Conversely, conditioning can create dependence through selection effects. Conditional independence is a σ-algebra-relative factorization claim.
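
A small mixture model makes the first half of this concrete. The sketch assumes a latent parameter $Θ$ taking two values with equal probability and two Bernoulli variables that are conditionally iid given $Θ$ ; the specific numbers and names are hypothetical.

```python
from fractions import Fraction

# Hypothetical latent parameter: Theta is 1/10 or 9/10 with equal probability.
thetas = {Fraction(1, 10): Fraction(1, 2), Fraction(9, 10): Fraction(1, 2)}


def bern(theta, x):
    """P(X = x | Theta = theta) for a Bernoulli(theta) variable."""
    return theta if x == 1 else 1 - theta


def joint(x, y):
    """Marginal joint law of (X, Y): average the conditional product over Theta."""
    return sum(w * bern(t, x) * bern(t, y) for t, w in thetas.items())


def marginal(x):
    return sum(w * bern(t, x) for t, w in thetas.items())


# Conditionally on Theta the pair factors by construction; marginally it does not.
p11 = joint(1, 1)
p1 = marginal(1)
covariance = p11 - p1 * p1
print(p11, p1 * p1, covariance)
```

The positive covariance comes entirely from the shared latent variable: observing $𝑋 = 1$ makes the larger value of $Θ$ more plausible, which raises the probability that $𝑌 = 1$ .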

12.9 Exchangeability

A sequence $𝑋_{1}, 𝑋_{2}, \dots$ is exchangeable if its finite-dimensional laws are invariant under finite permutations:

(𝑋_{1}, \dots, 𝑋_{𝑛}) \overset{d}{=} (𝑋_{𝜋 (1)}, \dots, 𝑋_{𝜋 (𝑛)})

for every $𝑛$ and every permutation $𝜋$ of ${1, \dots, 𝑛}$ . Iid sequences are exchangeable, but exchangeability is weaker: it does not imply independence.

Exchangeability encodes symmetry without independence. It says order carries no information, but dependence may remain. A typical example is drawing from a random parameter: conditional on $Θ$ , the variables are iid, but marginally they are dependent. This leads to de Finetti-type representation.

12.10 De Finetti’s theorem preview

De Finetti’s theorem states, in one classical form, that an infinite exchangeable sequence of Bernoulli random variables is conditionally iid given a random parameter $Θ \in [0, 1]$ . That is,

𝑃 (𝑋_{1} = 𝑥_{1}, \dots, 𝑋_{𝑛} = 𝑥_{𝑛}) = \int_{0}^{1} 𝜃^{\sum 𝑥_{𝑖}} (1 - 𝜃)^{𝑛 - \sum 𝑥_{𝑖}} 𝜈 (𝑑 𝜃)

for some mixing measure $𝜈$ .

The theorem converts exchangeability into mixture-of-iid structure. This is a major carrier transformation: symmetry under permutations becomes conditional independence over a latent law. It is central to Bayesian statistics, representation theory of random sequences, and probabilistic symmetry principles.
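
The mixture formula can be evaluated in closed form when the mixing measure is assumed to be uniform on $[0, 1]$ , since $\int_{0}^{1} 𝜃^{𝑘} (1 - 𝜃)^{𝑛 - 𝑘} 𝑑 𝜃 = 𝑘 ! (𝑛 - 𝑘) ! / (𝑛 + 1) !$ . The sketch below uses that assumed choice of $𝜈$ to check that the resulting sequence law is exchangeable but not independent. The function name `seq_prob` is illustrative.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def seq_prob(xs):
    """P(X_1 = x_1, ..., X_n = x_n) when the mixing measure nu is uniform on [0, 1].

    Uses the Beta integral: int_0^1 t^k (1 - t)^(n - k) dt = k! (n - k)! / (n + 1)!.
    """
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))


n = 4
total = sum(seq_prob(xs) for xs in product((0, 1), repeat=n))

# Exchangeability: the probability depends only on the number of ones.
same_count = seq_prob((1, 1, 0, 0)) == seq_prob((0, 1, 0, 1))

# Not independent: P(X1 = 1, X2 = 1) differs from P(X1 = 1) * P(X2 = 1).
p1 = seq_prob((1,))
p11 = seq_prob((1, 1))
print(total, same_count, p1, p11)
```

Here $𝑃 (𝑋_{1} = 1) = 1 / 2$ while $𝑃 (𝑋_{1} = 1, 𝑋_{2} = 1) = 1 / 3$ , so the coordinates are positively dependent even though their order carries no information.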

12.11 Independence as carrier certificate

Independence is a certificate about the joint carrier. It asserts that the joint law is product, or that relevant σ-algebras factor. Without this certificate, products of probabilities, multiplication of MGFs, convolution formulas, Chernoff bounds, and many limit theorems are not licensed.

The carrier view prevents a common error: variables with unrelated names are not automatically independent. Independence must come from construction, assumption, product measure, randomized mechanism, or proven factorization. In advanced probability, most work is spent either exploiting independence or replacing it with weaker dependence-control structures.

12.12 False independence and coupling errors

False independence arises when marginal information is mistaken for joint information. Knowing $𝑋 \sim 𝜇$ and $𝑌 \sim 𝜈$ does not determine $𝑃 (𝑋 \leq 𝑌)$ , $𝐸 [𝑋 𝑌]$ , or $𝑃 (𝑋 = 𝑌)$ . Those require a coupling. If the coupling is product, the variables are independent; if not, dependence must be analyzed.

Another error is assuming pairwise independence gives mutual independence, or assuming conditional independence survives marginalization. A third is using independent copies without constructing them on a product extension. The general rule is:

joint expression requires joint carrier .

No joint carrier, no joint probability.
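
To see how much the coupling matters, the following sketch fixes the marginals (both uniform on $\{1, 2, 3\}$ , an assumed example) and evaluates $𝑃 (𝑋 \leq 𝑌)$ , $𝐸 [𝑋 𝑌]$ , and $𝑃 (𝑋 = 𝑌)$ under three different joint laws. The coupling names are hypothetical labels.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Three hypothetical couplings of X ~ Uniform{1,2,3} and Y ~ Uniform{1,2,3}.
couplings = {
    "independent": {(x, y): third * third for x in (1, 2, 3) for y in (1, 2, 3)},
    "identical": {(x, x): third for x in (1, 2, 3)},          # Y = X
    "reversed": {(x, 4 - x): third for x in (1, 2, 3)},       # Y = 4 - X
}


def summarize(joint):
    """Return (P(X <= Y), E[XY], P(X = Y)) under the given joint pmf."""
    p_le = sum(p for (x, y), p in joint.items() if x <= y)
    e_xy = sum(p * x * y for (x, y), p in joint.items())
    p_eq = sum(p for (x, y), p in joint.items() if x == y)
    return p_le, e_xy, p_eq


results = {name: summarize(joint) for name, joint in couplings.items()}
for name, stats in results.items():
    print(name, stats)
```

All three joint laws have the same marginals, yet $𝐸 [𝑋 𝑌]$ takes three different values, so none of these quantities can be computed from $𝜇$ and $𝜈$ alone.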

Chapter 13 — Product Measures and Fubini Theory

13.1 Product σ-algebras

For measurable spaces $(𝑆, \mathcal{S})$ and $(𝑇, \mathcal{T})$ , the product σ-algebra is

\mathcal{S} \otimes \mathcal{T} = 𝜎 ({𝐴 \times 𝐵 : 𝐴 \in \mathcal{S}, 𝐵 \in \mathcal{T}}) .

It is the event space generated by measurable rectangles. Coordinate projections are measurable by construction.

Product σ-algebras are the natural carrier for joint random variables. If $𝑋 : Ω \to 𝑆$ and $𝑌 : Ω \to 𝑇$ are measurable, then $(𝑋, 𝑌) : Ω \to 𝑆 \times 𝑇$ is measurable with respect to $\mathcal{S} \otimes \mathcal{T}$ . Joint laws live on this product event structure.

13.2 Product measures

Given σ-finite measures $𝜇$ on $(𝑆, \mathcal{S})$ and $𝜈$ on $(𝑇, \mathcal{T})$ , the product measure $𝜇 \otimes 𝜈$ is the unique measure on $\mathcal{S} \otimes \mathcal{T}$ satisfying

(𝜇 \otimes 𝜈) (𝐴 \times 𝐵) = 𝜇 (𝐴) 𝜈 (𝐵) .

For probability measures, it defines the law of independent coordinates.

Product measure is not just a convenience. It is the mathematical object behind independent sampling. If one says “sample $𝑋 \sim 𝜇$ and independently sample $𝑌 \sim 𝜈$ ,” the joint law is $𝜇 \otimes 𝜈$ . Any other joint law with the same marginals is a coupling but not independent.

13.3 Tonelli’s theorem

Tonelli’s theorem states that if $𝜇$ and $𝜈$ are σ-finite and $𝑓 : 𝑆 \times 𝑇 \to [0, \infty]$ is measurable with respect to the product σ-algebra, then

\int_{𝑆 \times 𝑇} 𝑓 𝑑 (𝜇 \otimes 𝜈) = \int_{𝑆} (\int_{𝑇} 𝑓 (𝑠, 𝑡) 𝜈 (𝑑 𝑡)) 𝜇 (𝑑 𝑠) = \int_{𝑇} (\int_{𝑆} 𝑓 (𝑠, 𝑡) 𝜇 (𝑑 𝑠)) 𝜈 (𝑑 𝑡) .

No integrability assumption is needed beyond nonnegativity; the value may be infinite.

Tonelli is the theorem of safe rearrangement for nonnegative quantities. It justifies summing or integrating in either order when no cancellation is possible. Many expectation identities for counts, occupation times, and nonnegative random fields are Tonelli arguments.
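
A typical Tonelli argument is the tail-sum formula $𝐸 [𝑋] = \sum_{𝑘 \geq 1} 𝑃 (𝑋 \geq 𝑘)$ for a nonnegative integer-valued $𝑋$ , obtained by summing the nonnegative array $𝑃 (𝑋 = 𝑥) 1_{{𝑘 \leq 𝑥}}$ in both orders. The sketch below checks it on an assumed small pmf; the pmf and the names are hypothetical.

```python
from fractions import Fraction

# Hypothetical pmf of a nonnegative integer-valued random variable X.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(1, 4), 5: Fraction(1, 4)}


def f(x, k):
    """Nonnegative double array f(x, k) = P(X = x) * 1{k <= x}, for k >= 1."""
    return pmf[x] if k <= x else Fraction(0)


ks = range(1, max(pmf) + 1)

# Sum over k first, then x: gives sum_x x * P(X = x) = E[X].
x_outer = sum(sum(f(x, k) for k in ks) for x in pmf)

# Sum over x first, then k: gives sum_k P(X >= k).
k_outer = sum(sum(f(x, k) for x in pmf) for k in ks)

mean = sum(x * p for x, p in pmf.items())
print(x_outer, k_outer, mean)
```

Because every term is nonnegative, no integrability check is needed before exchanging the two sums.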

13.4 Fubini’s theorem

Fubini’s theorem extends Tonelli to signed or complex functions under integrability:

\int ∣ 𝑓 ∣ 𝑑 (𝜇 \otimes 𝜈) < \infty .

Then iterated integrals exist almost everywhere, are integrable, and

\int 𝑓 𝑑 (𝜇 \otimes 𝜈) = \int \int 𝑓 (𝑠, 𝑡) 𝜈 (𝑑 𝑡) 𝜇 (𝑑 𝑠) = \int \int 𝑓 (𝑠, 𝑡) 𝜇 (𝑑 𝑠) 𝜈 (𝑑 𝑡) .

The integrability hypothesis is load-bearing. Without absolute integrability, changing order can change the value or produce undefined expressions. In probability, Fubini licenses identities such as $𝐸 [𝐸 [𝑋 ∣ 𝑌]] = 𝐸 [𝑋]$ and many conditioning calculations, but only when the relevant integrability conditions hold.
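
A classical discrete counterexample, sketched here with counting measure on the positive integers, shows the order of summation changing the answer when absolute summability fails. The array `a` is the assumed example: $+1$ on the diagonal and $-1$ immediately to its right.

```python
def a(m, n):
    """Signed array: +1 on the diagonal, -1 just right of it, 0 elsewhere."""
    if n == m:
        return 1
    if n == m + 1:
        return -1
    return 0


N = 50  # number of outer terms examined; indices start at 1

# Each row and each column has at most two nonzero entries, so summing the
# inner index up to N + 1 captures the full inner sum for outer index <= N.
row_sums = [sum(a(m, n) for n in range(1, N + 2)) for m in range(1, N + 1)]
col_sums = [sum(a(m, n) for m in range(1, N + 2)) for n in range(1, N + 1)]

rows_then_total = sum(row_sums)   # sum over m of (sum over n)
cols_then_total = sum(col_sums)   # sum over n of (sum over m)
print(rows_then_total, cols_then_total)
```

Every row sums to $0$ , while the first column sums to $1$ and all later columns to $0$ , so the two iterated sums are $0$ and $1$ . The sum of absolute values is infinite, which is exactly the hypothesis Fubini requires and this array violates.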

13.5 Iterated expectation

If $𝑋$ is integrable on a product probability space $(𝑆 \times 𝑇, 𝜇 \otimes 𝜈)$ , then

𝐸 [𝑋] = \int_{𝑆} 𝐸 [𝑋 (𝑠, \cdot)] 𝜇 (𝑑 𝑠) = \int_{𝑇} 𝐸 [𝑋 (\cdot, 𝑡)] 𝜈 (𝑑 𝑡) .

This is expectation computed one coordinate at a time.

Iterated expectation generalizes to conditional expectation:

𝐸 [𝑋] = 𝐸 [𝐸 [𝑋 ∣ 𝐺]] .

The inner expectation averages over unresolved randomness; the outer expectation averages over the conditioning information. This is the measure-theoretic form of “average the conditional averages.”

13.6 Independent product construction

To construct independent random variables with given laws $𝜇_{𝑖}$ , take the product space

Ω = \prod_{𝑖} 𝑆_{𝑖}

with product measure $⨂_{𝑖} 𝜇_{𝑖}$ , and define coordinate maps

𝑋_{𝑖} (𝜔) = 𝜔_{𝑖} .

Then the $𝑋_{𝑖}$ are independent and $𝑋_{𝑖} \sim 𝜇_{𝑖}$ .

This construction proves that independent copies exist under standard measurable-space hypotheses. It also clarifies that an “independent copy of $𝑋$ ” is not created inside the original sample space automatically. One may need to extend the probability space to carry the copy.

13.7 Infinite products

Infinite product measures are built from finite-dimensional cylinder probabilities. A cylinder set constrains finitely many coordinates:

{𝜔 : 𝜔_{𝑖_{1}} \in 𝐴_{1}, \dots, 𝜔_{𝑖_{𝑘}} \in 𝐴_{𝑘}} .

Its probability under product measure is

\prod_{𝑗 = 1}^{𝑘} 𝜇_{𝑖_{𝑗}} (𝐴_{𝑗}) .

The infinite product σ-algebra is generated by these finite-coordinate events. It contains many asymptotic events obtained by countable operations, such as convergence of averages and infinitely many successes. Infinite product construction is the backbone of iid sequences and independent stochastic inputs.

13.8 Random sequences

A random sequence is a measurable map

𝑋 : Ω \to 𝑆^{\mathbb{N}},

or equivalently a sequence of coordinate random variables $𝑋_{𝑛} : Ω \to 𝑆$ . Its law is a probability measure on sequence space. If the coordinates are independent with common law $𝜇$ , the law is $𝜇^{\otimes \mathbb{N}}$ .

Sequence-space thinking is cleaner than treating each coordinate separately. Tail events, empirical measures, stopping times, and path properties are events on $𝑆^{\mathbb{N}}$ . Limit theorems then become statements about the measure of subsets of sequence space.

13.9 Coordinate random variables

On a product space $Ω = \prod_{𝑖} 𝑆_{𝑖}$ , the coordinate random variable is

𝑋_{𝑖} (𝜔) = 𝜔_{𝑖} .

The product σ-algebra is the smallest σ-algebra making all coordinate maps measurable. Finite-dimensional distributions are pushforwards of the product law under finite coordinate projections.

Coordinate variables are canonical, but not every sequence of random variables originally appears on a product space. The product representation is a model realization. What matters is whether the joint law of the given sequence matches the coordinate law on the canonical product carrier.

13.10 Kolmogorov consistency

A family of finite-dimensional distributions ${𝜇_{𝐼}}$ is consistent if marginalizing from a larger finite set $𝐽$ to a smaller finite set $𝐼 \subseteq 𝐽$ gives $𝜇_{𝐼}$ . Formally,

(𝜋_{𝐽 \to 𝐼})_{*} 𝜇_{𝐽} = 𝜇_{𝐼} .

Consistency is necessary because any genuine process must have compatible finite projections.

Kolmogorov’s extension theorem says that, under suitable state-space hypotheses, consistency is sufficient to produce a probability measure on the infinite product. The theorem turns local finite-dimensional specifications into a global stochastic process carrier.
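
The consistency condition can be checked directly for the simplest family, the finite-dimensional laws of iid coordinates. The sketch below assumes a two-point state space and a hypothetical one-coordinate law `mu`; `fdd` and `marginalize` are illustrative names for the finite-dimensional law and the pushforward under a coordinate projection.

```python
from fractions import Fraction
from itertools import product

# Hypothetical one-coordinate law on the state space {0, 1}.
mu = {0: Fraction(1, 3), 1: Fraction(2, 3)}


def fdd(indices):
    """Finite-dimensional law of iid coordinates: {tuple of values: probability}."""
    law = {}
    for vals in product(mu, repeat=len(indices)):
        p = Fraction(1)
        for v in vals:
            p *= mu[v]
        law[vals] = p
    return law


def marginalize(law_J, J, I):
    """Push the law on coordinates J forward under the projection onto I."""
    keep = [J.index(i) for i in I]
    out = {}
    for vals, p in law_J.items():
        key = tuple(vals[k] for k in keep)
        out[key] = out.get(key, 0) + p
    return out


J, I = (1, 4, 7), (1, 7)
consistent = marginalize(fdd(J), J, I) == fdd(I)
print(consistent)
```

Only one pair of index sets is tested here; the consistency condition itself quantifies over all finite $𝐼 \subseteq 𝐽$ .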

13.11 Product-space pathologies

Product spaces can behave badly when measurability, completion, or topology is mishandled. The product of completed σ-algebras need not equal the completion of the product σ-algebra. Sections of measurable sets may be measurable under good hypotheses, but projections of measurable sets may require analytic-set machinery.

Another pathology is assuming pointwise path regularity from finite-dimensional distributions. A process law on $𝑅^{[0, \infty)}$ may exist without having continuous or càdlàg paths. Regularity requires separate certificates such as continuity criteria, separability, or modification theorems. Product construction gives existence of coordinates; it does not automatically give nice sample paths.

Chapter 14 — Conditional Probability and Conditioning

14.1 Conditioning on positive-probability events

If $𝐵 \in 𝐹$ with $𝑃 (𝐵) > 0$ , define

𝑃 (𝐴 ∣ 𝐵) = \frac{𝑃 (𝐴 \cap 𝐵)}{𝑃 (𝐵)} .

This creates a new probability law on $Ω$ , concentrated on $𝐵$ , or equivalently a normalized restriction of $𝑃$ to $𝐵$ .

The condition $𝑃 (𝐵) > 0$ is essential. If $𝑃 (𝐵) = 0$ , the ratio is undefined. Many paradoxes in continuous probability arise from informal conditioning on null events. Conditioning on $𝑋 = 𝑥$ for a continuous variable requires regular conditional distributions or limiting procedures, not the elementary ratio.

14.2 Conditioning on partitions

If ${𝐵_{𝑖}}$ is a countable partition of $Ω$ with $𝑃 (𝐵_{𝑖}) > 0$ , conditioning on the partition means replacing a random quantity by its average on the cell containing the outcome. For an event $𝐴$ ,

𝑃 (𝐴 ∣ 𝐵_{𝑖}) = \frac{𝑃 (𝐴 \cap 𝐵_{𝑖})}{𝑃 (𝐵_{𝑖})} .

The law of total probability is

𝑃 (𝐴) = \sum_{𝑖} 𝑃 (𝐴 ∣ 𝐵_{𝑖}) 𝑃 (𝐵_{𝑖}) .

For an integrable $𝑋$ , the conditional expectation given the partition is

𝐸 [𝑋 ∣ 𝜎 (𝐵_{1}, 𝐵_{2}, \dots)] = \sum_{𝑖} \frac{𝐸 [𝑋 1_{𝐵_{𝑖}}]}{𝑃 (𝐵_{𝑖})} 1_{𝐵_{𝑖}} .

Thus conditioning on a finite or countable partition is averaging over information cells.

14.3 Conditioning on σ-algebras

Conditioning on a σ-algebra $𝐺 \subseteq 𝐹$ means conditioning on available information. The conditional expectation $𝐸 [𝑋 ∣ 𝐺]$ is $𝐺$ -measurable and satisfies

\int_{𝐴} 𝐸 [𝑋 ∣ 𝐺] 𝑑 𝑃 = \int_{𝐴} 𝑋 𝑑 𝑃

for all $𝐴 \in 𝐺$ .

The σ-algebra formulation subsumes conditioning on events, partitions, random variables, and filtrations. Conditioning on a random variable $𝑌$ means conditioning on $𝜎 (𝑌)$ . The result is a $𝜎 (𝑌)$ -measurable object, often representable as a function $𝑔 (𝑌)$ under standard conditions.

14.4 Conditional expectation

For $𝑋 \in 𝐿^{1}$ , conditional expectation is the unique almost sure equivalence class $𝑍 = 𝐸 [𝑋 ∣ 𝐺]$ such that $𝑍$ is $𝐺$ -measurable and

𝐸 [𝑍 1_{𝐴}] = 𝐸 [𝑋 1_{𝐴}]

for every $𝐴 \in 𝐺$ . It preserves all integrals over events visible to $𝐺$ .

Key properties include linearity, positivity, monotonicity, Jensen’s inequality in conditional form, and the tower property:

𝐸 [𝐸 [𝑋 ∣ 𝐺] ∣ 𝐻] = 𝐸 [𝑋 ∣ 𝐻] if 𝐻 \subseteq 𝐺 .

Conditional expectation is the projection of $𝑋$ onto the information carrier $𝐺$ .

14.5 Conditional distributions

A conditional distribution of $𝑋$ given $𝑌$ is a family of probability measures

𝐾 (𝑦, 𝐴) = 𝑃 (𝑋 \in 𝐴 ∣ 𝑌 = 𝑦)

such that $𝑦 \mapsto 𝐾 (𝑦, 𝐴)$ is measurable and

𝑃 (𝑋 \in 𝐴, 𝑌 \in 𝐵) = \int_{𝐵} 𝐾 (𝑦, 𝐴) 𝜇_{𝑌} (𝑑 𝑦) .

This is a Markov kernel from the state space of $𝑌$ to the state space of $𝑋$ .

Conditional distributions refine conditional expectation. If $𝑔$ is measurable and $𝑔 (𝑋)$ is integrable,

𝐸 [𝑔 (𝑋) ∣ 𝑌] = \int 𝑔 (𝑥) 𝐾 (𝑌, 𝑑 𝑥) .

Existence of regular conditional distributions requires suitable measurable-space hypotheses, usually standard Borel spaces.

14.6 Regular conditional probability

A regular conditional probability is a conditional distribution that behaves as an actual probability measure in the conditioned variable and as a measurable function in the conditioning value. For each $𝑦$ , $𝐾 (𝑦, \cdot)$ is a probability measure; for each event $𝐴$ , $𝐾 (\cdot, 𝐴)$ is measurable.

Regular conditional probabilities justify expressions such as $𝑃 (𝑋 \in 𝐴 ∣ 𝑌 = 𝑦)$ , even when $𝑃 (𝑌 = 𝑦) = 0$ . But they are versions, defined only up to $𝜇_{𝑌}$ -null sets in $𝑦$ . Treating a chosen version as canonical at every point can produce errors, especially on null conditioning values.

14.7 Disintegration

Disintegration decomposes a joint measure into conditional measures over a marginal:

𝛾 (𝑑 𝑥, 𝑑 𝑦) = 𝐾 (𝑦, 𝑑 𝑥) 𝜇_{𝑌} (𝑑 𝑦) .

It says a joint law can be represented by first drawing $𝑌 \sim 𝜇_{𝑌}$ , then drawing $𝑋$ according to $𝐾 (𝑌, \cdot)$ , under appropriate space regularity.

This is the measure-theoretic form of conditional modeling. It generalizes density factorization

𝑓_{𝑋, 𝑌} (𝑥, 𝑦) = 𝑓_{𝑋 ∣ 𝑌} (𝑥 ∣ 𝑦) 𝑓_{𝑌} (𝑦) .

The density formula is only a special representation; disintegration is the invariant carrier.

14.8 Bayes theorem in measure form

Bayes’ theorem becomes a change of disintegration. Suppose a prior $𝜋 (𝑑 𝜃)$ and likelihood kernel $𝐿 (𝜃, 𝑑 𝑥)$ define a joint law

𝛾 (𝑑 𝜃, 𝑑 𝑥) = 𝐿 (𝜃, 𝑑 𝑥) 𝜋 (𝑑 𝜃) .

The posterior is a conditional distribution of $𝜃$ given $𝑥$ :

𝜋 (𝑑 𝜃 ∣ 𝑥) = \frac{𝐿 (𝜃, 𝑑 𝑥) 𝜋 (𝑑 𝜃)}{\int 𝐿 (𝜃^{'}, 𝑑 𝑥) 𝜋 (𝑑 𝜃^{'})}

in density notation.

The invariant statement is not the formula with densities; it is the existence and identification of the reverse conditional kernel. Bayes’ theorem is therefore disintegration reversal: it transports from generative factorization to inferential factorization.
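
In the discrete case the reversal can be carried out explicitly. The sketch below assumes a two-point parameter space and a Bernoulli likelihood (the labels and numbers are hypothetical): it builds the joint law from prior and likelihood, computes the reverse kernel, and checks that the inferential factorization rebuilds the same joint law.

```python
from fractions import Fraction

# Hypothetical prior on a two-point parameter space and Bernoulli likelihood.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
heads_prob = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}


def likelihood(theta, x):
    """L(theta, {x}): probability of observing x (1 = heads, 0 = tails)."""
    return heads_prob[theta] if x == 1 else 1 - heads_prob[theta]


# Generative factorization: gamma(theta, x) = L(theta, x) * prior(theta).
joint = {(t, x): likelihood(t, x) * prior[t] for t in prior for x in (0, 1)}

# Marginal of the observation, then the reverse kernel (the posterior).
evidence = {x: sum(joint[(t, x)] for t in prior) for x in (0, 1)}
posterior = {x: {t: joint[(t, x)] / evidence[x] for t in prior} for x in (0, 1)}

# Inferential factorization must rebuild the same joint law.
rebuilt = {(t, x): posterior[x][t] * evidence[x] for t in prior for x in (0, 1)}
print(posterior[1], rebuilt == joint)
```

Division by `evidence[x]` is legitimate here because both observations have positive probability; in the continuous case this step is what regular conditional distributions replace.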

14.9 Versions of conditional expectation

Conditional expectation is unique only almost surely. If $𝑍$ and $𝑍^{'}$ both satisfy the defining property, then

𝑃 (𝑍 = 𝑍^{'}) = 1.

They may differ on a null set. Thus $𝐸 [𝑋 ∣ 𝐺]$ is an equivalence class unless a version is selected.

Version issues become serious when evaluating conditional expectations at specific points, especially points of probability zero. A statement true almost surely in the conditioning variable may fail at an exceptional value. Any pointwise use of conditional objects requires a version certificate or regularity theorem.

14.10 Conditioning on null events

The elementary formula $𝑃 (𝐴 ∣ 𝐵) = 𝑃 (𝐴 \cap 𝐵) / 𝑃 (𝐵)$ fails when $𝑃 (𝐵) = 0$ . Continuous conditioning, such as $𝑃 (𝑋 \in 𝐴 ∣ 𝑌 = 𝑦)$ , requires regular conditional probability, limiting conditioning, or geometric disintegration.

Different limiting procedures can give different answers when conditioning on null events. This is the source of Borel-type paradoxes. The conditioning event must be replaced by a specified σ-algebra, kernel, limiting scheme, or coordinate-invariant disintegration. Null conditioning is not illegal, but it is not handled by the finite event-ratio formula.

14.11 Conditional independence

Random variables $𝑋$ and $𝑌$ are conditionally independent given $𝐺$ if for bounded measurable $𝑓, 𝑔$ ,

𝐸 [𝑓 (𝑋) 𝑔 (𝑌) ∣ 𝐺] = 𝐸 [𝑓 (𝑋) ∣ 𝐺] 𝐸 [𝑔 (𝑌) ∣ 𝐺] .

This is the conditional factorization of joint information.

Conditional independence is the language of graphical models, Markov properties, hidden-variable models, and filtering. It is stronger and more precise than saying dependence is “explained by” $𝐺$ . Once $𝐺$ is known, the remaining randomness in $𝑋$ and $𝑌$ factors.

14.12 Filtrations and information

A filtration is an increasing family of σ-algebras:

𝐹_{𝑠} \subseteq 𝐹_{𝑡}, 𝑠 \leq 𝑡 .

It represents information accumulated over time. A process $𝑋_{𝑡}$ is adapted if $𝑋_{𝑡}$ is $𝐹_{𝑡}$ -measurable for every $𝑡$ .

Filtrations are the carrier for martingales, stopping times, stochastic integration, and dynamic conditioning. The phrase “known at time $𝑡$ ” means measurable with respect to $𝐹_{𝑡}$ . Without filtration, temporal probability statements are under-specified.

14.13 Updating as transport between σ-algebras

Updating is movement from a coarser information σ-algebra to a finer one. If $𝐺 \subseteq 𝐻$ , then $𝐸 [𝑋 ∣ 𝐺]$ is the best estimate with less information, while $𝐸 [𝑋 ∣ 𝐻]$ is the updated estimate with more information. The tower property guarantees coherence:

𝐸 [𝐸 [𝑋 ∣ 𝐻] ∣ 𝐺] = 𝐸 [𝑋 ∣ 𝐺] .

This is the formal version of rational updating. New information refines the event grammar. Probabilities change not because the underlying law is incoherent, but because the conditioning σ-algebra has changed. The update is a projection onto a different information carrier.

Chapter 15 — Probability Convergence Grammar

15.1 Almost sure convergence

A sequence $𝑋_{𝑛}$ converges almost surely to $𝑋$ if

𝑃 ({𝜔 : 𝑋_{𝑛} (𝜔) \to 𝑋 (𝜔)}) = 1.

This is pointwise convergence outside a null set. It requires all $𝑋_{𝑛}$ and $𝑋$ to live on a common probability space.

Almost sure convergence is strong because it controls entire sample paths eventually for almost every outcome. It is the natural mode for strong laws, martingale convergence, and pathwise stochastic analysis. But it is still not pointwise convergence everywhere; null exceptional sets remain.

15.2 Convergence in probability

$𝑋_{𝑛} \to 𝑋$ in probability if for every $𝜀 > 0$ ,

𝑃 (∣ 𝑋_{𝑛} - 𝑋 ∣ > 𝜀) \to 0.

This says large deviations from $𝑋$ become unlikely. It also requires a common probability space because the expression $𝑋_{𝑛} - 𝑋$ must be meaningful.

Almost sure convergence implies convergence in probability, but not conversely. Convergence in probability is often the correct mode for weak laws of large numbers, estimator consistency, and randomized approximation. It controls typical behavior at each large $𝑛$ , but not necessarily eventual behavior along almost every sample path.

15.3 `Lᵖ` convergence

$𝑋_{𝑛} \to 𝑋$ in $𝐿^{𝑝}$ if

𝐸 ∣ 𝑋_{𝑛} - 𝑋 ∣^{𝑝} \to 0.

For $𝑝 > 0$ , this implies convergence in probability. For $𝑝 \geq 1$ , it is norm convergence in the Banach space $𝐿^{𝑝}$ .

$𝐿^{𝑝}$ convergence controls both probability of deviation and magnitude of deviation. $𝐿^{1}$ convergence controls expected absolute error; $𝐿^{2}$ convergence controls mean-square error; higher $𝑝$ controls stronger tail behavior. It is more quantitative than convergence in probability but less pathwise than almost sure convergence.

15.4 Convergence in distribution

$𝑋_{𝑛}$ converges in distribution to $𝑋$ , written

𝑋_{𝑛} \Rightarrow 𝑋,

if the laws $𝜇_{𝑋_{𝑛}}$ converge weakly to $𝜇_{𝑋}$ . For real variables, this is equivalent to

𝐹_{𝑋_{𝑛}} (𝑡) \to 𝐹_{𝑋} (𝑡)

at every continuity point $𝑡$ of $𝐹_{𝑋}$ .

Convergence in distribution is law-level. The variables do not need to live on a common probability space. It is the natural mode for central limit theorems and many asymptotic approximations. But it does not by itself imply convergence in probability or almost surely. The arrow is weaker because it forgets coupling.
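
A two-point example shows what "forgetting coupling" means. Assume $𝑋$ is a fair bit and set $𝑋_{𝑛} = 1 - 𝑋$ for every $𝑛$ ; this setup and the helper names are illustrative. Each $𝑋_{𝑛}$ has the same law as $𝑋$ , so convergence in distribution is trivial, yet $𝑋_{𝑛}$ is never close to $𝑋$ .

```python
from fractions import Fraction

# Hypothetical setup: X is a fair bit and X_n = 1 - X for every n.
outcomes = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # underlying fair-bit measure


def X(w):
    return w


def X_n(w):
    return 1 - w


def law(f):
    """Pushforward law of f under the fair-bit measure."""
    out = {}
    for w, p in outcomes.items():
        out[f(w)] = out.get(f(w), 0) + p
    return out


same_law = law(X_n) == law(X)                        # so X_n => X trivially
gap_prob = sum(p for w, p in outcomes.items() if abs(X_n(w) - X(w)) > Fraction(1, 2))
print(same_law, gap_prob)
```

The probability that $∣ 𝑋_{𝑛} - 𝑋 ∣$ exceeds $1 / 2$ is $1$ for every $𝑛$ , so there is no convergence in probability on this coupling.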

15.5 Total variation convergence

Probability measures $𝜇_{𝑛}$ converge to $𝜇$ in total variation if

∥ 𝜇_{𝑛} - 𝜇 ∥_{T V} = \sup_{𝐴} ∣ 𝜇_{𝑛} (𝐴) - 𝜇 (𝐴) ∣ \to 0.

Equivalently, when densities exist with respect to a common dominating measure,

∥ 𝜇_{𝑛} - 𝜇 ∥_{T V} = \frac{1}{2} \int ∣ 𝑓_{𝑛} - 𝑓 ∣ 𝑑 𝜆 .

Total variation is stronger than weak convergence. It controls probabilities of all measurable events uniformly. It is central in Markov chain mixing, coupling, statistical distance, and approximation theory. A coupling characterization says total variation is the minimal mismatch probability over all couplings:

∥ 𝜇 - 𝜈 ∥_{T V} = \inf 𝑃 (𝑋 \neq 𝑌) .
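
The three descriptions of total variation can be compared by hand on a small finite space. The sketch below is illustrative only: the distributions `mu` and `nu` and all helper names are invented for the example, and counting measure plays the role of the dominating measure. It evaluates the supremum over events by brute force, the half-$𝐿^{1}$ formula, and the mismatch probability of the coupling that keeps mass $\min (𝜇, 𝜈)$ on the diagonal.

```python
from itertools import combinations

def tv_half_l1(mu, nu):
    """Total variation as half the L1 distance between mass functions."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)

def tv_sup_events(mu, nu):
    """Total variation as a supremum over all events (brute force, tiny spaces only)."""
    support = sorted(set(mu) | set(nu))
    best = 0.0
    for r in range(len(support) + 1):
        for event in combinations(support, r):
            gap = abs(sum(mu.get(x, 0.0) for x in event)
                      - sum(nu.get(x, 0.0) for x in event))
            best = max(best, gap)
    return best

def maximal_coupling_mismatch(mu, nu):
    """P(X != Y) for the coupling that puts mass min(mu, nu) on the diagonal."""
    support = set(mu) | set(nu)
    return 1.0 - sum(min(mu.get(x, 0.0), nu.get(x, 0.0)) for x in support)

# Hypothetical example distributions on {0, 1, 2}.
mu = {0: 0.5, 1: 0.3, 2: 0.2}
nu = {0: 0.2, 1: 0.3, 2: 0.5}
print(tv_half_l1(mu, nu), tv_sup_events(mu, nu), maximal_coupling_mismatch(mu, nu))
```

The three numbers agree up to rounding. The brute-force supremum enumerates all subsets, so it is only usable on very small supports.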

15.6 Weak convergence

Weak convergence of probability measures on a metric space means

\int 𝑓 𝑑 𝜇_{𝑛} \to \int 𝑓 𝑑 𝜇

for every bounded continuous $𝑓$ . This is convergence tested by continuous bounded probes, not by all measurable events.

Weak convergence is topology-sensitive. It sees the geometry of the state space. It is weaker than total variation and does not generally imply convergence of expectations for unbounded functions. To pass expectations of unbounded functions, one needs uniform integrability, moment bounds, or stronger Wasserstein-type convergence.

15.7 Vague convergence

Vague convergence is used mainly for locally compact spaces and measures that may not be probability measures. Measures $𝜇_{𝑛}$ converge vaguely to $𝜇$ if

\int 𝑓 𝑑 𝜇_{𝑛} \to \int 𝑓 𝑑 𝜇

for every continuous compactly supported $𝑓$ .

Vague convergence is weaker than weak convergence because compactly supported tests may not see mass escaping to infinity. It is appropriate for point processes, extreme-value theory, and infinite measures. For probability measures, vague convergence plus tightness can often recover weak convergence.

15.8 Relations between convergence modes

The basic implication chain is:

𝐿^{𝑝} convergence \Rightarrow convergence in probability \Rightarrow convergence in distribution .

Almost sure convergence also implies convergence in probability. If $𝑋_{𝑛} \Rightarrow 𝑐$ where $𝑐$ is constant, then $𝑋_{𝑛} \to 𝑐$ in probability.

No reverse implication holds generally without extra hypotheses. Distributional convergence is law-level and can occur without a shared sample space. Probability convergence requires coupling. Almost sure convergence requires pathwise eventual control. $𝐿^{𝑝}$ convergence requires moment control. Each arrow has a different carrier.

15.9 Counterexamples separating convergence modes

Let $𝑋_{𝑛}$ be independent Bernoulli variables with $𝑃 (𝑋_{𝑛} = 1) = 1 / 𝑛$. Then $𝑋_{𝑛} \to 0$ in probability, since $𝑃 (∣ 𝑋_{𝑛} ∣ > 𝜀) = 1 / 𝑛 \to 0$ for $0 < 𝜀 < 1$. But $𝑋_{𝑛}$ does not converge almost surely to zero: the events $𝑋_{𝑛} = 1$ are independent and $\sum_{𝑛} 1 / 𝑛 = \infty$, so by the second Borel–Cantelli lemma (Section 15.11) $𝑋_{𝑛} = 1$ infinitely often with probability one.

Let $𝑋_{𝑛} = 𝑛$ with probability $1 / 𝑛$, else $0$. Then $𝑋_{𝑛} \to 0$ in probability, but $𝐸 [𝑋_{𝑛}] = 1$ for every $𝑛$, so $𝑋_{𝑛} \not\to 0$ in $𝐿^{1}$. Let $𝑋_{𝑛} \sim 𝑁 (0, 1)$ be independent of $𝑋 \sim 𝑁 (0, 1)$; then $𝑋_{𝑛} \Rightarrow 𝑋$ trivially, since all the laws coincide, but $𝑋_{𝑛} - 𝑋 \sim 𝑁 (0, 2)$ for every $𝑛$, so $𝑋_{𝑛}$ does not converge to $𝑋$ in probability. These examples show that the convergence modes are not interchangeable.
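
For the Gaussian example the deviation probability has a closed form, which makes the failure of convergence in probability concrete. The sketch below assumes $𝑋_{𝑛}$ and $𝑋$ are independent standard normals, so their difference is $𝑁 (0, 2)$; the function name is invented for the illustration.

```python
import math

def deviation_probability(eps):
    """P(|X_n - X| > eps) when X_n and X are independent N(0, 1).

    The difference is N(0, 2), so the probability equals erfc(eps / 2)
    and does not depend on n.
    """
    return math.erfc(eps / 2.0)

for eps in (0.1, 0.5, 1.0):
    print(eps, deviation_probability(eps))
```

Since the value is the same for every $𝑛$ and strictly positive, it cannot tend to zero, even though every $𝑋_{𝑛}$ has exactly the law of $𝑋$.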

15.10 Skorokhod representation

The Skorokhod representation theorem states that, under suitable conditions such as Polish state spaces, if $𝑋_{𝑛} \Rightarrow 𝑋$ , then one can construct random variables $𝑌_{𝑛}, 𝑌$ on a new probability space such that

𝑌_{𝑛} \overset{d}{=} 𝑋_{𝑛}, 𝑌 \overset{d}{=} 𝑋, 𝑌_{𝑛} \to 𝑌 almost surely .

This converts weak convergence into almost sure convergence after changing the carrier.

The theorem is powerful but dangerous if misread. It does not say the original $𝑋_{𝑛}$ converge almost surely. It says there exists a coupling with almost sure convergence. Therefore it is a law-level-to-coupling liftback theorem, not a statement about the original sample space.

15.11 Borel–Cantelli lemmas

For events $𝐴_{𝑛}$ , define

𝐴_{𝑛} i.o. = lim sup 𝐴_{𝑛} = ⋂_{𝑚 = 1}^{\infty} ⋃_{𝑛 \geq 𝑚} 𝐴_{𝑛} .

The first Borel–Cantelli lemma states that if

\sum_{𝑛} 𝑃 (𝐴_{𝑛}) < \infty,

then

𝑃 (𝐴_{𝑛} i.o.) = 0.

No independence is required.

The second Borel–Cantelli lemma states that if the $𝐴_{𝑛}$ are independent and

\sum_{𝑛} 𝑃 (𝐴_{𝑛}) = \infty,

then

𝑃 (𝐴_{𝑛} i.o.) = 1.

Thus summability controls eventual occurrence. The second direction requires independence or sufficient dependence control.
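
A small numerical sketch of the dichotomy, under the assumption that the events are independent: then the probability that none of $𝐴_{𝑚}, \dots, 𝐴_{𝑀}$ occurs is the product of the complementary probabilities. The function name and the two probability sequences are chosen for illustration.

```python
def prob_none_occur(p, m, M):
    """P(no A_n occurs for m <= n <= M) for independent events with P(A_n) = p(n)."""
    prob = 1.0
    for n in range(m, M + 1):
        prob *= 1.0 - p(n)
    return prob

def harmonic(n):
    return 1.0 / n        # sum of p_n diverges

def inverse_square(n):
    return 1.0 / n**2     # sum of p_n converges

m = 10
for M in (100, 10_000, 100_000):
    print(M, prob_none_occur(harmonic, m, M), prob_none_occur(inverse_square, m, M))
```

With $𝑝_{𝑛} = 1 / 𝑛$ the product telescopes to $(𝑚 - 1) / 𝑀$, which tends to zero, so some later event always occurs. With $𝑝_{𝑛} = 1 / 𝑛^{2}$ it tends to $(𝑚 - 1) / 𝑚$, which approaches one as $𝑚$ grows, matching the first lemma.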

15.12 Cauchy criteria in probability

A sequence $𝑋_{𝑛}$ is Cauchy in probability if for every $𝜀 > 0$ ,

𝑃 (∣ 𝑋_{𝑛} - 𝑋_{𝑚} ∣ > 𝜀) \to 0

as $𝑚, 𝑛 \to \infty$. For real-valued random variables, and more generally for values in a complete separable metric space, Cauchy in probability implies convergence in probability to some random variable.

For $𝐿^{𝑝}$ , the Cauchy criterion is norm-based:

∥ 𝑋_{𝑛} - 𝑋_{𝑚} ∥_{𝑝} \to 0.

Completeness of $𝐿^{𝑝}$ then supplies an $𝐿^{𝑝}$ limit. Cauchy criteria are useful when the limit is not explicitly known, as in stochastic integration, martingale convergence, and construction of processes.

15.13 Convergence of expectations

Convergence of random variables does not automatically imply convergence of expectations. Sufficient conditions include dominated convergence, bounded convergence, monotone convergence, $𝐿^{1}$ convergence, or convergence in probability plus uniform integrability.

For weak convergence, bounded continuous test functions are safe:

𝑋_{𝑛} \Rightarrow 𝑋 \Rightarrow 𝐸 [𝑓 (𝑋_{𝑛})] \to 𝐸 [𝑓 (𝑋)]

for bounded continuous $𝑓$ . For unbounded $𝑓$ , additional conditions are required. The missing bridge is usually uniform integrability of $𝑓 (𝑋_{𝑛})$ .

15.14 Uniform integrability as missing bridge

Uniform integrability converts weak or probability convergence into expectation convergence. If $𝑋_{𝑛} \to 𝑋$ in probability and ${𝑋_{𝑛}}$ is uniformly integrable, then

𝐸 [𝑋_{𝑛}] \to 𝐸 [𝑋] .

If $𝑋_{𝑛} \Rightarrow 𝑋$ and ${𝑋_{𝑛}}$ is uniformly integrable, then $𝑋$ is integrable and $𝐸 [𝑋_{𝑛}] \to 𝐸 [𝑋]$ as well. A convenient sufficient condition is a uniform moment bound $\sup_{𝑛} 𝐸 ∣ 𝑋_{𝑛} ∣^{1 + 𝛿} < \infty$ for some $𝛿 > 0$.

This is the recurring audit rule: convergence controls where most mass goes; uniform integrability controls what rare large mass can do. Without it, expectations can remain fixed, diverge, or oscillate despite convergence in probability or distribution.
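
The family from Section 15.9, $𝑋_{𝑛} = 𝑛$ with probability $1 / 𝑛$ and $0$ otherwise, shows the rule in exact arithmetic. The sketch below uses `fractions.Fraction` to avoid rounding; the function names are invented for the illustration, and the supremum is taken over a finite range `n_max` only.

```python
from fractions import Fraction

def deviation_prob(n):
    """P(|X_n| > eps) for 0 < eps < n: only the atom at n exceeds eps."""
    return Fraction(1, n)

def mean(n):
    """E[X_n] = n * (1/n)."""
    return n * Fraction(1, n)

def tail_expectation(n, K):
    """E[|X_n| 1{|X_n| > K}]: the atom at n counts only when n > K."""
    return n * Fraction(1, n) if n > K else Fraction(0)

def sup_tail_expectation(K, n_max):
    """max over 1 <= n <= n_max of the tail expectation at level K."""
    return max(tail_expectation(n, K) for n in range(1, n_max + 1))

for K in (1, 10, 100):
    print(K, sup_tail_expectation(K, 1000))
```

The deviation probability goes to zero, the mean stays at one, and the tail expectation at any level $𝐾$ equals one as soon as $𝑛 > 𝐾$. Uniform integrability would require that supremum to vanish as $𝐾 \to \infty$, so the family is not uniformly integrable.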

Chapter 16 — Laws and Weak Convergence

16.1 Probability measures on metric spaces

A probability law on a metric space $𝑆$ is a probability measure on its Borel σ-algebra $𝐵 (𝑆)$ . Metric structure allows one to define weak convergence, tightness, continuity sets, compactness, and convergence-determining classes.

The move from real-valued variables to metric-space-valued random elements is essential for stochastic processes, empirical measures, random graphs, and random functions. A law on $𝐶 [0, 1]$ , $𝐷 [0, 1]$ , or the space of probability measures is still a probability measure; only the state carrier changes.

16.2 Bounded continuous test functions

Weak convergence $𝜇_{𝑛} \Rightarrow 𝜇$ is defined by

\int 𝑓 𝑑 𝜇_{𝑛} \to \int 𝑓 𝑑 𝜇

for every bounded continuous $𝑓 : 𝑆 \to 𝑅$ . These functions probe the law without being sensitive to null boundary artifacts.

Boundedness prevents tail mass from distorting the test integral; continuity prevents tests from seeing abrupt boundary behavior not controlled by weak convergence. Indicator functions are generally not continuous, so event probabilities require continuity-set conditions.

16.3 Portmanteau theorem

The Portmanteau theorem gives equivalent formulations of weak convergence. On metric spaces, $𝜇_{𝑛} \Rightarrow 𝜇$ iff

\limsup_{𝑛 \to \infty} 𝜇_{𝑛} (𝐹) \leq 𝜇 (𝐹)

for every closed set $𝐹$ , equivalently

\liminf_{𝑛 \to \infty} 𝜇_{𝑛} (𝐺) \geq 𝜇 (𝐺)

for every open set $𝐺$ , and equivalently

𝜇_{𝑛} (𝐴) \to 𝜇 (𝐴)

for every $𝜇$ -continuity set $𝐴$ , meaning $𝜇 (\partial 𝐴) = 0$ .

The theorem explains why convergence of CDFs is required only at continuity points. Discontinuities are atoms or boundary masses where indicator tests are not continuous probes. Weak convergence controls events whose boundaries carry no limiting mass.
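
The simplest instance is $𝜇_{𝑛} = 𝛿_{1 / 𝑛} \Rightarrow 𝜇 = 𝛿_{0}$. The following sketch, with sets and names chosen only for illustration, evaluates the masses of a closed set, an open set, a set whose boundary carries limiting mass, and a continuity set.

```python
def dirac_mass(point, indicator):
    """Mass that the Dirac measure at `point` assigns to the set {x : indicator(x)}."""
    return 1.0 if indicator(point) else 0.0

def in_closed_F(x):      # F = {0}, a closed set
    return x == 0.0

def in_open_G(x):        # G = (0, infinity), an open set
    return x > 0.0

def in_A(x):             # A = (-infinity, 0]; its boundary {0} carries all of mu
    return x <= 0.0

def in_B(x):             # B = (-infinity, 1/2]; its boundary {1/2} is mu-null
    return x <= 0.5

def masses(point):
    """Masses of F, G, A, B under the Dirac measure at `point`."""
    return tuple(dirac_mass(point, s) for s in (in_closed_F, in_open_G, in_A, in_B))

for n in (1, 10, 1000):
    print(n, masses(1.0 / n))   # masses under mu_n = delta_{1/n}
print("limit", masses(0.0))     # masses under mu = delta_0
```

The closed set gains mass in the limit, the open set loses it, the set $𝐴$ with $𝜇 (\partial 𝐴) = 1$ fails to converge, and the continuity set $𝐵$ converges. The set $𝐴$ is exactly the CDF evaluated at the atom $0$.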

16.4 Tightness

A family of probability measures ${𝜇_{𝑖}}$ on a metric space is tight if for every $𝜀 > 0$ , there exists compact $𝐾$ such that

\sup_{𝑖} 𝜇_{𝑖} (𝐾^{𝑐}) < 𝜀 .

Tightness says mass does not escape to infinity or into noncompact regions.

On $𝑅^{𝑑}$, tightness follows from a uniform moment bound of any order $𝑝 > 0$, by Markov's inequality:

\sup_{𝑖} \int ∣ 𝑥 ∣^{𝑝} 𝜇_{𝑖} (𝑑 𝑥) < \infty \Rightarrow {𝜇_{𝑖}} tight .

Tightness is the compactness gate for weak convergence. It gives subsequential limits, but it does not identify the limit; identification requires convergence of test functions, characteristic functions, finite-dimensional distributions, or other determining data.
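
The moment criterion is constructive: Markov's inequality turns a bound $𝑀$ on the $𝑝$-th moments into an explicit radius. A minimal sketch, where the function name and the choice of the safety factor 2 are assumptions made for the example:

```python
def tight_radius(moment_bound, p, eps):
    """Radius R such that every law with p-th absolute moment <= moment_bound
    puts mass at most eps / 2 < eps outside the closed ball of radius R.

    Markov: mu(|x| > R) <= moment_bound / R**p, so R = (2 M / eps)**(1/p) works.
    """
    return (2.0 * moment_bound / eps) ** (1.0 / p)

M, p, eps = 4.0, 2, 0.02
R = tight_radius(M, p, eps)
print(R, M / R**p)   # radius and the resulting Markov bound on the outside mass
```

The closed ball of radius `R` is the compact set $𝐾$ in the definition, and the same `R` serves every member of the family.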

16.5 Prokhorov theorem

Prokhorov’s theorem states that, on Polish spaces, a family of probability measures is relatively compact in the weak topology iff it is tight. Thus tightness is not merely sufficient but exactly the compactness criterion in good spaces.

This theorem is central in process convergence. To prove $𝑋_{𝑛} \Rightarrow 𝑋$ in a function space, one typically proves tightness of the laws and then identifies all subsequential limits by finite-dimensional distributions or martingale problems. Tightness gives existence of limit candidates; identification eliminates ambiguity.

16.6 Weak convergence on `ℝᵈ`

For laws on $𝑅^{𝑑}$ , weak convergence can be tested by bounded continuous functions, by convergence of distribution functions at continuity points, or by characteristic functions:

𝜙_{𝜇_{𝑛}} (𝑡) = \int 𝑒^{𝑖 ⟨ 𝑡, 𝑥 ⟩} 𝜇_{𝑛} (𝑑 𝑥) .

Lévy’s continuity theorem states that pointwise convergence of characteristic functions to a function continuous at zero gives weak convergence to the corresponding law.

In $𝑅^{𝑑}$ , Cramér–Wold also applies: $𝑋_{𝑛} \Rightarrow 𝑋$ iff

⟨ 𝑢, 𝑋_{𝑛} ⟩ \Rightarrow ⟨ 𝑢, 𝑋 ⟩

for every $𝑢 \in 𝑅^{𝑑}$ . This reduces multivariate convergence to one-dimensional projections.

16.7 Weak convergence on function spaces

Weak convergence on spaces such as $𝐶 [0, 1]$ or $𝐷 [0, 1]$ requires more than convergence of finite-dimensional distributions. One must also prove tightness in the function-space topology. For $𝐶 [0, 1]$ , tightness is often certified by modulus-of-continuity estimates. For $𝐷 [0, 1]$ , Skorokhod topologies handle jumps.

The process-level law contains path regularity. Finite-dimensional distributions only describe coordinates at finitely many times; they do not control oscillation between times. Thus process convergence requires both finite-dimensional convergence and tightness. This is the standard two-gate structure.

16.8 Empirical measures

Given samples $𝑋_{1}, \dots, 𝑋_{𝑛}$ , the empirical measure is

𝜇_{𝑛} = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝛿_{𝑋_{𝑖}} .

It is a random probability measure. For iid samples with law $𝜇$ , the empirical measure converges weakly to $𝜇$ almost surely under broad conditions:

𝜇_{𝑛} \Rightarrow 𝜇 .

Empirical measures convert samples into law-level random objects. The Glivenko–Cantelli theorem strengthens this on $𝑅$ to uniform convergence of CDFs:

\sup_{𝑥} ∣ 𝐹_{𝑛} (𝑥) - 𝐹 (𝑥) ∣ \to 0

almost surely. Empirical process theory studies fluctuations around this convergence.
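
The supremum in the Glivenko–Cantelli statement can be computed exactly from a sorted sample, because the empirical CDF only changes at sample points. The sketch below assumes a continuous target CDF and uses uniform samples on $[0, 1]$ with a fixed seed; the helper names are invented for the illustration.

```python
import random

def sup_cdf_distance(sample, cdf):
    """sup_x |F_n(x) - F(x)| for the empirical CDF of `sample` and a continuous CDF F."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        # F_n jumps from i/n to (i+1)/n at x; check both sides of the jump.
        worst = max(worst, abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
    return worst

def uniform_cdf(x):
    """CDF of the uniform law on [0, 1]."""
    return min(1.0, max(0.0, x))

rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, sup_cdf_distance(sample, uniform_cdf))
```

One run shows the distance shrinking with $𝑛$. It illustrates the theorem but does not prove it; the rate of decay is the subject of empirical process theory.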

16.9 Wasserstein convergence

The $𝑝$ -Wasserstein distance between probability measures on a metric space is

𝑊_{𝑝} (𝜇, 𝜈) = {(\inf_{𝛾 \in Π (𝜇, 𝜈)} \int 𝑑 (𝑥, 𝑦)^{𝑝} 𝛾 (𝑑 𝑥, 𝑑 𝑦))}^{1 / 𝑝},

where $Π (𝜇, 𝜈)$ is the set of couplings. Wasserstein convergence combines weak convergence with moment control.

On $𝑅^{𝑑}$, for laws with finite $𝑝$-th moments, $𝑊_{𝑝} (𝜇_{𝑛}, 𝜇) \to 0$ iff $𝜇_{𝑛} \Rightarrow 𝜇$ and $\int ∣ 𝑥 ∣^{𝑝} 𝑑 𝜇_{𝑛} \to \int ∣ 𝑥 ∣^{𝑝} 𝑑 𝜇$. This makes Wasserstein distance a liftback metric: it remembers geometry and tail magnitude, not just weak law convergence.
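
A standard way to see the extra moment requirement is the escaping-mass family $𝜇_{𝑛} = (1 - 1 / 𝑛) 𝛿_{0} + (1 / 𝑛) 𝛿_{𝑛}$, which converges weakly to $𝛿_{0}$. The sketch below is illustrative: the family and the function names are assumptions of the example, and it uses the fact that the only coupling with a Dirac mass is the product coupling, so $𝑊_{𝑝}^{𝑝}$ is just the $𝑝$-th absolute moment.

```python
def wasserstein_to_dirac0(atoms, p):
    """W_p between a finite discrete law and the Dirac mass at 0.

    `atoms` maps points to probabilities. Coupling with a Dirac mass is
    unique, so W_p**p equals the p-th absolute moment of the law.
    """
    return sum(w * abs(x) ** p for x, w in atoms.items()) ** (1.0 / p)

def escaping_law(n):
    """mu_n = (1 - 1/n) delta_0 + (1/n) delta_n."""
    return {0.0: 1.0 - 1.0 / n, float(n): 1.0 / n}

def bounded_test_integral(atoms):
    """Integral of the bounded continuous probe f(x) = min(|x|, 1)."""
    return sum(w * min(abs(x), 1.0) for x, w in atoms.items())

for n in (10, 100, 1000):
    law = escaping_law(n)
    print(n, bounded_test_integral(law),
          wasserstein_to_dirac0(law, 1), wasserstein_to_dirac0(law, 2))
```

The bounded probe tends to its limit value $0$, while $𝑊_{1}$ stays at $1$ and $𝑊_{2}$ grows like $\sqrt{𝑛}$. The small mass at distance $𝑛$ is invisible to bounded tests but not to the transport cost.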

16.10 Lévy metric

The Lévy metric metrizes weak convergence of probability measures on $𝑅$ . For distribution functions $𝐹, 𝐺$ , it measures the smallest $𝜀$ such that

𝐹 (𝑥 - 𝜀) - 𝜀 \leq 𝐺 (𝑥) \leq 𝐹 (𝑥 + 𝜀) + 𝜀

for all $𝑥$ . It permits small horizontal and vertical errors.

The metric is useful because weak convergence of real laws is exactly convergence in this metric. It is less commonly used in computations than characteristic functions or bounded-Lipschitz metrics, but it formalizes the CDF geometry of weak convergence.

16.11 Convergence-determining classes

A class $𝐶$ of test functions or sets is convergence-determining if convergence on $𝐶$ implies weak convergence. On $𝑅$ , intervals $(- \infty, 𝑡]$ at continuity points determine convergence. On $𝑅^{𝑑}$ , characteristic functions or bounded Lipschitz functions determine convergence.

Convergence-determining classes reduce verification. Instead of testing every bounded continuous function, one tests a smaller structured family. The class must be rich enough to identify the law. Insufficient tests can miss mass or dependence.

16.12 Mapping theorem

If $𝑋_{𝑛} \Rightarrow 𝑋$ and $𝑔 : 𝑆 \to 𝑇$ is continuous, then

𝑔 (𝑋_{𝑛}) \Rightarrow 𝑔 (𝑋) .

More generally, it is enough that the discontinuity set of $𝑔$ has $𝑃_{𝑋}$ -measure zero. This is the mapping theorem.

The theorem transports weak convergence through functions. It is widely used for statistics: once an estimator converges in distribution, continuous transformations of it converge by mapping. Discontinuous transformations require boundary audits; atoms at discontinuities can break the conclusion.

16.13 Continuous mapping theorem

The continuous mapping theorem is the random-variable version of the mapping theorem. If

𝑋_{𝑛} \Rightarrow 𝑋

and $𝑔$ is continuous at $𝑋$ almost surely, then

𝑔 (𝑋_{𝑛}) \Rightarrow 𝑔 (𝑋) .

For vector-valued variables, this handles sums, products, norms, maxima, and smooth transformations when continuous.

The theorem is law-level. It does not claim pointwise convergence of $𝑔 (𝑋_{𝑛})$ . It says the laws transport through continuous maps. For discontinuous $𝑔$ , one must verify that the limit avoids discontinuity points almost surely.

16.14 Slutsky’s theorem

Slutsky’s theorem states that if

𝑋_{𝑛} \Rightarrow 𝑋, 𝑌_{𝑛} \to 𝑐

in probability, where $𝑐$ is constant, then

𝑋_{𝑛} + 𝑌_{𝑛} \Rightarrow 𝑋 + 𝑐, 𝑋_{𝑛} 𝑌_{𝑛} \Rightarrow 𝑐 𝑋,

and if $𝑐 \neq 0$ ,

𝑋_{𝑛} / 𝑌_{𝑛} \Rightarrow 𝑋 / 𝑐 .

Slutsky’s theorem is the standard tool for replacing unknown normalizing constants or nuisance estimators by consistent estimates. It combines weak convergence with probability convergence. The constant limit is important; if $𝑌_{𝑛} \Rightarrow 𝑌$ nonconstant, joint convergence is required to conclude anything about $𝑋_{𝑛} + 𝑌_{𝑛}$ .

Chapter 17 — Laws of Large Numbers

17.1 Weak law of large numbers

The weak law states that for iid integrable random variables $𝑋_{𝑖}$ with mean $𝜇$ ,

\frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖} \to 𝜇

in probability. With finite variance $𝜎^{2}$, the proof is immediate from Chebyshev's inequality, since independence gives

Var (\frac{1}{𝑛} \sum_{𝑖} 𝑋_{𝑖}) = \frac{𝜎^{2}}{𝑛} .

Thus

𝑃 (∣ \frac{1}{𝑛} \sum_{𝑖} 𝑋_{𝑖} - 𝜇 ∣ > 𝜀) \leq \frac{𝜎^{2}}{𝑛 𝜀^{2}} .

The weak law says empirical averages are close to expectation with high probability for large $𝑛$ . It is a typical-sample theorem, not a pathwise theorem. It does not say every infinite sequence has average $𝜇$ , nor that convergence occurs almost surely.
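
For Bernoulli summands the deviation probability can be computed exactly and compared with the Chebyshev bound. A minimal sketch under that assumption; the function names are invented, and the exact sum uses `math.comb`, which limits it to moderate $𝑛$ (up to about 1000 in floating point).

```python
import math

def chebyshev_bound(variance, n, eps):
    """sigma^2 / (n eps^2), capped at 1."""
    return min(1.0, variance / (n * eps**2))

def exact_deviation_prob_bernoulli(p, n, eps):
    """Exact P(|S_n / n - p| > eps) for S_n ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) > eps:
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

p, eps = 0.5, 0.1
for n in (25, 100, 400):
    print(n, exact_deviation_prob_bernoulli(p, n, eps),
          chebyshev_bound(p * (1 - p), n, eps))
```

The exact probabilities sit well below the bound. Chebyshev gives the right qualitative statement, convergence to zero, but is far from sharp.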

17.2 Strong law of large numbers

The strong law upgrades convergence to almost sure:

\frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖} \to 𝜇 a.s.

For iid $𝑋_{𝑖}$ , the sharp classical integrability condition is $𝐸 ∣ 𝑋_{1} ∣ < \infty$ . Under finite variance, easier proofs use maximal inequalities or subsequence arguments.

The strong law is pathwise. It says that with probability one, a realized infinite sample sequence has empirical average converging to the mean. This is the formal theorem behind long-run frequency stabilization. It still allows exceptional null sequences where convergence fails.

17.3 Kolmogorov inequality

For independent mean-zero random variables $𝑋_{𝑖}$ with finite variances, Kolmogorov’s maximal inequality states

𝑃 (\max_{1 \leq 𝑘 \leq 𝑛} ∣ \sum_{𝑖 = 1}^{𝑘} 𝑋_{𝑖} ∣ \geq 𝜆) \leq \frac{Var (\sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖})}{𝜆^{2}} .

By independence, the variance on the right is additive:

Var (\sum_{𝑖} 𝑋_{𝑖}) = \sum_{𝑖} Var (𝑋_{𝑖}) .

This inequality controls the maximum partial sum, not just the final sum. It is therefore suited to almost sure convergence proofs. Strong laws require control over all sufficiently large partial sums, and maximal inequalities provide that pathwise bridge.
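
The inequality can be verified exhaustively for a short simple random walk. The sketch below assumes fair $\pm 1$ steps, so each step has mean zero and variance one, and enumerates all $2^{𝑛}$ paths; names are chosen for the illustration.

```python
from itertools import product

def max_partial_sum_prob(n, lam):
    """Exact P(max_{k <= n} |S_k| >= lam) for n independent fair +-1 steps."""
    hits = 0
    for steps in product((-1, 1), repeat=n):
        s, running_max = 0, 0
        for step in steps:
            s += step
            running_max = max(running_max, abs(s))
        if running_max >= lam:
            hits += 1
    return hits / 2**n

def kolmogorov_bound(n, lam):
    """Var(S_n) / lam^2 with unit-variance steps."""
    return n / lam**2

n = 12
for lam in (3, 4, 5, 6):
    print(lam, max_partial_sum_prob(n, lam), kolmogorov_bound(n, lam))
```

The left column is a probability about the whole path up to time $𝑛$, yet it is controlled by the variance of the endpoint alone.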

17.4 Three-series theorem

Kolmogorov’s three-series theorem characterizes almost sure convergence of sums of independent random variables. For independent $𝑋_{𝑛}$ , the series $\sum 𝑋_{𝑛}$ converges almost surely iff, for some truncation level $𝑐 > 0$ , three series involving large jumps, truncated means, and truncated variances converge:

\sum 𝑃 (∣ 𝑋_{𝑛} ∣ > 𝑐) < \infty,

\sum 𝐸 [𝑋_{𝑛} 1_{{∣ 𝑋_{𝑛} ∣ \leq 𝑐}}] converges,

\sum Var (𝑋_{𝑛} 1_{{∣ 𝑋_{𝑛} ∣ \leq 𝑐}}) < \infty .

The theorem decomposes convergence into jump control, drift control, and fluctuation control. It is the precise independent-sum audit. Large rare terms, accumulated bias, and residual variance are the three possible obstructions.

17.5 Etemadi’s proof

Etemadi gave a clean proof of the strong law for pairwise independent identically distributed integrable random variables. The proof avoids some heavier machinery and shows that full mutual independence is not always necessary for averaging.

The conceptual point is that strong laws require enough independence to control deviations of partial sums. Pairwise independence can suffice when combined with identical distribution and truncation. This is an independence-strength lesson: different limit theorems require different factorization payloads.

17.6 Truncation methods

Truncation replaces $𝑋_{𝑖}$ by bounded variables such as

𝑋_{𝑖}^{(𝑛)} = 𝑋_{𝑖} 1_{{∣ 𝑋_{𝑖} ∣ \leq 𝑛}} .

The goal is to separate ordinary fluctuations from rare large jumps. For integrable $𝑋$ ,

𝐸 [∣ 𝑋 ∣ 1_{{∣ 𝑋 ∣ > 𝑛}}] \to 0.

This makes tail contributions negligible after normalization.

Truncation is a core probability technique because many theorems are easy for bounded variables and hard for unbounded ones. The proof strategy is: prove the theorem for truncated variables, show the discarded tails do not matter, and then lift back to the original variables. The tail estimate is the decisive debt payment.

17.7 Identically distributed versus independent

Identically distributed means all variables have the same law. Independent means the joint law factors. Neither implies the other. A constant sequence $𝑋_{𝑛} = 𝑋_{1}$ is identically distributed but maximally dependent. Independent variables may have different distributions.

The law of large numbers typically requires both a stable average law and weak enough dependence. Identical distribution supplies a common mean; independence supplies variance or fluctuation control. General versions replace identical distribution with average moment conditions and replace independence with mixing, martingale differences, or ergodicity.

17.8 Weak dependence versions

Weak dependence versions of LLN allow correlations but require them to decay or average out. If $𝑋_{𝑖}$ are centered and

Var (\frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖}) \to 0,

then the average converges to zero in $𝐿^{2}$ and hence in probability. This criterion can hold under covariance summability or mixing conditions.
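
As a concrete case, suppose the variables have unit variance and geometrically decaying covariances $Cov (𝑋_{𝑖}, 𝑋_{𝑗}) = 𝜌^{∣ 𝑖 - 𝑗 ∣}$. This covariance model is an assumption made for the example, not part of the text above. The variance of the average can then be summed lag by lag:

```python
def variance_of_mean(n, rho):
    """Var((1/n) sum X_i) when Var(X_i) = 1 and Cov(X_i, X_j) = rho**|i - j|."""
    total = float(n)                      # n diagonal terms
    for lag in range(1, n):
        total += 2 * (n - lag) * rho**lag  # 2 (n - lag) pairs at each lag
    return total / n**2

for n in (10, 100, 1000):
    print(n, variance_of_mean(n, 0.0), variance_of_mean(n, 0.5), variance_of_mean(n, 1.0))
```

For $𝜌 = 0$ the variance is $1 / 𝑛$, for $𝜌 = 0.5$ it behaves like $3 / 𝑛$, and for $𝜌 = 1$ it stays at $1$. Summable covariances only change the constant; perfect correlation removes averaging altogether.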

For stationary sequences, ergodic theorems replace independence. The average converges to a conditional expectation on the invariant σ-algebra. If the system is ergodic, that conditional expectation is constant. Thus independence is one route to averaging, but not the only route.

17.9 Ergodic theorem preview

Birkhoff’s ergodic theorem states that for a measure-preserving transformation $𝑇$ and $𝑓 \in 𝐿^{1}$ ,

\frac{1}{𝑛} \sum_{𝑘 = 0}^{𝑛 - 1} 𝑓 (𝑇^{𝑘} 𝜔) \to 𝐸 [𝑓 ∣ 𝐼] (𝜔) a.s.,

where $𝐼$ is the invariant σ-algebra. If the system is ergodic, $𝐼$ is trivial and the limit is $𝐸 𝑓$ .

This generalizes the strong law from independent samples to deterministic or dependent dynamical sampling. The limit is not automatically the ensemble mean; it is the invariant-information conditional mean. Ergodicity is the gate that collapses time average to ensemble average.
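
Circle rotations give a deterministic illustration. The sketch below assumes the map $𝑇 𝑥 = 𝑥 + 𝛼 \bmod 1$ with Lebesgue measure, which is ergodic when $𝛼$ is irrational and not ergodic when $𝛼$ is rational. The observable, the starting points, and the function names are chosen for the example, and the floating-point angle only approximates an irrational number.

```python
import math

def rotation_time_average(f, x0, alpha, n):
    """(1/n) * sum_{k < n} f(T^k x0) for the rotation T x = x + alpha mod 1."""
    return sum(f((x0 + k * alpha) % 1.0) for k in range(n)) / n

def observable(x):
    """f(x) = cos^2(2 pi x); its integral over [0, 1) is 1/2."""
    return math.cos(2 * math.pi * x) ** 2

golden = (math.sqrt(5) - 1) / 2   # irrational angle: ergodic rotation
half = 0.5                        # rational angle: every orbit has period 2

for n in (100, 10_000):
    print(n, rotation_time_average(observable, 0.3, golden, n),
          rotation_time_average(observable, 0.0, half, n))
```

With the irrational angle the time average approaches the space average $1 / 2$. With $𝛼 = 1 / 2$ and $𝑥_{0} = 0$ the orbit only visits $0$ and $1 / 2$, and the time average is $1$: the limit depends on the starting point, as the conditional expectation on the invariant σ-algebra allows.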

17.10 Empirical averages and model liftback

The LLN connects formal probability to empirical averaging:

\bar{𝑋}_{𝑛} = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖} \approx 𝐸 𝑋 .

But the theorem lives inside a model. To export it to data, one must justify that observations are modeled as iid, weakly dependent, stationary ergodic, or otherwise governed by the theorem’s assumptions.

The liftback error is to treat LLN as saying “averages always stabilize.” They stabilize under specific carrier conditions. Heavy tails, nonstationarity, dependence, selection bias, and adversarial sampling can all break the empirical interpretation. LLN is a mathematical certificate after hypotheses are paid, not a universal law of data.

Chapter 18 — Central Limit Theory

18.1 Normal distribution

The normal distribution $𝑁 (𝜇, 𝜎^{2})$ has density

𝑓 (𝑥) = \frac{1}{\sqrt{2 𝜋 𝜎^{2}}} \exp (- \frac{(𝑥 - 𝜇)^{2}}{2 𝜎^{2}}) .

The standard normal is $𝑁 (0, 1)$ . Its characteristic function is

𝜙 (𝑡) = 𝑒^{- 𝑡^{2} / 2} .

The normal law is stable under independent summation: sums of independent normal variables are normal. It also emerges as the universal finite-variance fluctuation limit for many independent sums. The CLT explains Gaussian appearance not as a primitive assumption but as a consequence of aggregation under appropriate normalization.

18.2 Characteristic functions

For a real random variable $𝑋$ ,

𝜙_{𝑋} (𝑡) = 𝐸 [𝑒^{𝑖 𝑡 𝑋}] .

If $𝑋, 𝑌$ are independent,

𝜙_{𝑋 + 𝑌} (𝑡) = 𝜙_{𝑋} (𝑡) 𝜙_{𝑌} (𝑡) .

This multiplicative property makes characteristic functions ideal for sums.

If $𝑋$ has mean $0$ and variance $𝜎^{2}$ , then near zero,

𝜙_{𝑋} (𝑡) = 1 - \frac{𝜎^{2} 𝑡^{2}}{2} + 𝑜 (𝑡^{2}),

under finite second moment. In CLT proofs, this local expansion is raised to the $𝑛$ -th power after scaling $𝑡 / \sqrt{𝑛}$ , producing the Gaussian exponential.
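
The scaling step can be watched numerically for fair $\pm 1$ variables, whose characteristic function is $\cos 𝑡$ and whose variance is $1$. This choice of distribution and the function names are assumptions of the sketch.

```python
import math

def scaled_cf_power(t, n):
    """phi(t / sqrt(n))**n for fair +-1 steps, where phi(t) = cos(t)."""
    return math.cos(t / math.sqrt(n)) ** n

def gaussian_cf(t):
    """Characteristic function of N(0, 1)."""
    return math.exp(-t * t / 2.0)

for n in (10, 100, 10_000):
    print(n, [round(scaled_cf_power(t, n) - gaussian_cf(t), 6) for t in (0.5, 1.0, 2.0)])
```

The differences shrink as $𝑛$ grows: the quadratic term of the expansion survives the $𝑛$-th power, while the $𝑜 (𝑡^{2})$ remainder is scaled away.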

18.3 Lévy continuity theorem

Lévy’s continuity theorem states that if characteristic functions $𝜙_{𝑛} (𝑡)$ converge pointwise to a function $𝜙 (𝑡)$ continuous at $0$ , then $𝜙$ is the characteristic function of some probability law $𝜇$ , and

𝜇_{𝑛} \Rightarrow 𝜇 .

Conversely, weak convergence implies pointwise convergence of characteristic functions.

This theorem is the main transport gate from Fourier analysis to weak convergence. It allows one to prove distributional limits by proving analytic convergence of characteristic functions. Continuity at zero prevents mass loss.

18.4 Classical CLT

Let $𝑋_{1}, 𝑋_{2}, \dots$ be iid with mean $𝜇$ and variance $0 < 𝜎^{2} < \infty$ . Then

\frac{\sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖} - 𝑛 𝜇}{𝜎 \sqrt{𝑛}} \Rightarrow 𝑁 (0, 1) .

This is the classical central limit theorem.

The normalization is essential. The sum has mean $𝑛 𝜇$ and variance $𝑛 𝜎^{2}$ , so subtracting $𝑛 𝜇$ centers it and dividing by $𝜎 \sqrt{𝑛}$ gives unit variance. The theorem describes fluctuations around the law-of-large-numbers scale. It does not describe rare large deviations far into the tails.

18.5 Lindeberg–Feller CLT

For triangular arrays $𝑋_{𝑛, 𝑘}$ , the Lindeberg–Feller theorem gives conditions under which normalized sums converge to normal. Let independent centered variables have variances $𝜎_{𝑛, 𝑘}^{2}$ and total variance $𝑠_{𝑛}^{2} = \sum_{𝑘} 𝜎_{𝑛, 𝑘}^{2}$ . The Lindeberg condition is

\frac{1}{𝑠_{𝑛}^{2}} \sum_{𝑘} 𝐸 [𝑋_{𝑛, 𝑘}^{2} 1_{{∣ 𝑋_{𝑛, 𝑘} ∣ > 𝜀 𝑠_{𝑛}}}] \to 0

for every $𝜀 > 0$ .

This condition says no single summand contributes a macroscopic part of the variance. It is a tail-negligibility certificate. The theorem generalizes iid CLT to nonidentically distributed arrays and identifies the obstruction: large individual jumps.

18.6 Lyapunov CLT

The Lyapunov condition is a stronger, easier-to-check sufficient condition. If independent centered variables have total variance $𝑠_{𝑛}^{2}$ , and for some $𝛿 > 0$ ,

\frac{1}{𝑠_{𝑛}^{2 + 𝛿}} \sum_{𝑘} 𝐸 ∣ 𝑋_{𝑛, 𝑘} ∣^{2 + 𝛿} \to 0,

then the normalized sum converges to $𝑁 (0, 1)$ .

Lyapunov implies Lindeberg by Markov-type estimates. It pays the large-jump debt using a higher moment. The cost is stronger hypotheses. In applications, Lyapunov is often simpler; Lindeberg is sharper.

18.7 Triangular arrays

A triangular array is a collection $𝑋_{𝑛, 𝑘}$ for $1 \leq 𝑘 \leq 𝑘_{𝑛}$ , where the $𝑛$ -th row is summed and normalized. Arrays model changing distributions, infinitesimal summands, and approximations where no fixed iid sequence exists.

Triangular arrays are the natural carrier for general CLT theory. They separate row-level normalization from variable-level assumptions. The main audit questions are: are variables independent within rows, are they centered, what is the row variance, and do large terms vanish relative to total scale?

18.8 Berry–Esseen theorem

For iid variables with mean $0$ , variance $𝜎^{2} > 0$ , and finite third absolute moment $𝜌 = 𝐸 ∣ 𝑋 ∣^{3}$ , the Berry–Esseen theorem gives

\sup_{𝑥} ∣ 𝑃 (\frac{\sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖}}{𝜎 \sqrt{𝑛}} \leq 𝑥) - Φ (𝑥) ∣ \leq 𝐶 \frac{𝜌}{𝜎^{3} \sqrt{𝑛}} .

Here $𝐶$ is an absolute constant, independent of $𝑛$ and of the law of the summands. The bound quantifies the CLT rate.

The theorem turns asymptotic convergence into finite- $𝑛$ error control. The third absolute moment is the rate debt. Without quantitative error, a CLT only says convergence eventually occurs; Berry–Esseen says how fast in Kolmogorov distance.
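
For Bernoulli summands both sides of the inequality are computable. The sketch below evaluates the exact Kolmogorov distance of a standardized binomial from the normal law and compares it with the bound. The value `C = 0.56` is used as an assumed admissible constant, not a sharp one, and the function names are invented for the example.

```python
import math

def std_normal_cdf(x):
    """Phi(x), computed from the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def kolmogorov_distance_binomial(n, p):
    """sup_x |P((S_n - n p) / (sigma sqrt(n)) <= x) - Phi(x)|, S_n ~ Binomial(n, p)."""
    sigma = math.sqrt(p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))
        phi = std_normal_cdf(x)
        worst = max(worst, abs(cdf - phi))   # left limit at the atom k
        cdf += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        worst = max(worst, abs(cdf - phi))   # value at the atom k
    return worst

def berry_esseen_bound(n, p, C=0.56):
    """C * rho / (sigma^3 sqrt(n)) for centered Bernoulli(p) summands.

    C = 0.56 is an assumption of this sketch (an admissible, non-sharp constant).
    """
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * (p**2 + (1 - p) ** 2)   # E|X - p|^3
    return C * rho / (sigma**3 * math.sqrt(n))

for n in (25, 100, 400):
    print(n, kolmogorov_distance_binomial(n, 0.5), berry_esseen_bound(n, 0.5))
```

Both columns decay like $1 / \sqrt{𝑛}$. For a lattice law the distance cannot decay faster, because the CDF has jumps of that order.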

18.9 Multivariate CLT

For iid random vectors $𝑋_{𝑖} \in 𝑅^{𝑑}$ with mean $𝑚$ and covariance matrix $Σ$ ,

\frac{1}{\sqrt{𝑛}} \sum_{𝑖 = 1}^{𝑛} (𝑋_{𝑖} - 𝑚) \Rightarrow 𝑁 (0, Σ) .

The limiting Gaussian is characterized by

𝐸 𝑒^{𝑖 ⟨ 𝑡, 𝑍 ⟩} = 𝑒^{- \frac{1}{2} 𝑡^{⊤} Σ 𝑡} .

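The Cramér–Wold device reduces the proof to one-dimensional CLTs: for each fixed $𝑡$, the scalars $⟨ 𝑡, 𝑋_{𝑖} - 𝑚 ⟩$ are iid with variance $𝑡^{⊤} Σ 𝑡$, so the classical CLT handles every projection of the normalized sum, and convergence of all projections implies multivariate convergence. The covariance matrix encodes all second-order limiting geometry.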

18.10 Delta method

If

\sqrt{𝑛} ({\hat{𝜃}}_{𝑛} - 𝜃) \Rightarrow 𝑍

and $𝑔$ is differentiable at $𝜃$ , then

\sqrt{𝑛} (𝑔 ({\hat{𝜃}}_{𝑛}) - 𝑔 (𝜃)) \Rightarrow 𝐷 𝑔 (𝜃) 𝑍 .

In one dimension,

\sqrt{𝑛} (𝑔 ({\hat{𝜃}}_{𝑛}) - 𝑔 (𝜃)) \Rightarrow 𝑔^{'} (𝜃) 𝑍 .

The delta method transports asymptotic normality through smooth transformations. If the first derivative vanishes, higher-order delta methods are needed with different scaling. The differentiability and nondegeneracy assumptions are the liftback gates.

18.11 Stable convergence

Stable convergence strengthens convergence in distribution by preserving joint convergence with bounded variables measurable with respect to an underlying σ-algebra. One writes

𝑋_{𝑛} \Rightarrow_{s t} 𝑋

when $𝑋$ may live on an extension and joint limits with background randomness are retained.

This mode appears in martingale CLTs, random environments, and asymptotic statistics with mixed normal limits. Ordinary weak convergence forgets coupling to the original information; stable convergence remembers enough joint structure to support conditional limits.

18.12 Failure of CLT under heavy tails

If $𝑋_{𝑖}$ have infinite variance, the $\sqrt{𝑛}$ Gaussian CLT may fail. Heavy-tailed variables with

𝑃 (∣ 𝑋 ∣ > 𝑡) \sim 𝐶 𝑡^{- 𝛼}, 0 < 𝛼 < 2,

can converge, after normalization $𝑛^{1 / 𝛼}$ , to stable laws rather than Gaussians.

The obstruction is that large jumps do not become negligible. Variance is infinite, and no single quadratic scale controls fluctuations. The correct limit carrier becomes stable law theory, not Gaussian theory. Thus “sum of many variables is normal” is false without tail hypotheses.

18.13 CLT versus tail probabilities

The CLT describes probabilities at fluctuation scale:

𝑆_{𝑛} - 𝑛 𝜇 = 𝑂_{𝑃} (\sqrt{𝑛}) .

It does not accurately describe rare deviations such as

𝑆_{𝑛} - 𝑛 𝜇 \geq 𝑐 𝑛 .

Those belong to large deviation theory, where probabilities decay exponentially and rate functions, not Gaussian densities, govern behavior.

Using CLT for far-tail estimates is a common error. Moderate deviations interpolate between CLT and large deviations, but require their own hypotheses. The scale must be declared before a limit theorem is applied.
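
The scale dependence is easy to see for a skewed binomial, where exact tails are computable. The sketch below assumes $𝑆_{𝑛} \sim Binomial (200, 0.1)$ and compares the exact upper tail with the plain Gaussian approximation, without continuity correction, at a fluctuation-scale threshold and at a linear-scale threshold. The parameters and function names are chosen for illustration.

```python
import math

def exact_upper_tail(n, p, k_min):
    """Exact P(S_n >= k_min) for S_n ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

def gaussian_upper_tail(n, p, k_min):
    """CLT approximation 1 - Phi((k_min - n p) / sqrt(n p (1 - p)))."""
    z = (k_min - n * p) / math.sqrt(n * p * (1 - p))
    return 0.5 * math.erfc(z / math.sqrt(2.0))

n, p = 200, 0.1   # mean 20, standard deviation about 4.24
for k_min in (24, 40, 60):
    exact = exact_upper_tail(n, p, k_min)
    approx = gaussian_upper_tail(n, p, k_min)
    print(k_min, exact, approx, exact / approx)
```

About one standard deviation above the mean the two numbers agree to within tens of percent. At $𝑘 = 60 = 0.3 𝑛$ the Gaussian value is too small by several orders of magnitude, which is the regime where a rate function is needed.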

18.14 Gaussian approximation residue

A Gaussian approximation has residue: centering error, variance estimation error, dependence error, tail error, lattice correction, finite-sample rate, and test-function class. A bare CLT only gives weak convergence:

𝐸 𝑓 (𝑍_{𝑛}) \to 𝐸 𝑓 (𝑍)

for bounded continuous $𝑓$ . It does not automatically give density approximation, tail approximation, local probability approximation, or uniform finite- $𝑛$ accuracy.

The correct terminal depends on the claim. If the claim is asymptotic distribution, CLT may suffice. If the claim is finite probability, confidence interval accuracy, rare-event estimate, or density approximation, additional quantitative certificates are required.

Chapter 19 — Poisson and Rare-Event Limits

19.1 Law of small numbers

The law of small numbers says that the sum of many rare, approximately independent indicators tends to a Poisson distribution. If $𝑋_{𝑛} \sim Binomial (𝑛, 𝜆 / 𝑛)$ , then

𝑋_{𝑛} \Rightarrow Poisson (𝜆) .

The heuristic is: many opportunities, each with small probability, with total expected count near $𝜆$ .

The Poisson law is therefore the rare-event analogue of the Gaussian law. Gaussian limits arise from many small additive fluctuations with finite variance; Poisson limits arise from sparse counts of rare events. The scaling regime determines the limit carrier.

19.2 Poisson approximation

For a sum of indicators

𝑊 = \sum_{𝑖 \in 𝐼} 𝐼_{𝑖}, 𝑝_{𝑖} = 𝑃 (𝐼_{𝑖} = 1),

the natural Poisson parameter is

𝜆 = 𝐸 𝑊 = \sum_{𝑖} 𝑝_{𝑖} .

If the indicators are rare and weakly dependent, then $𝑊$ is close to $Poisson (𝜆)$ .

Quantifying the approximation requires an error metric such as total variation:

𝑑_{T V} (𝐿 (𝑊), Poisson (𝜆)) .

The quality depends on probabilities of individual events and dependence neighborhoods. Rare-event approximation is not just matching the mean; dependence can create clusters and destroy Poisson behavior.

19.3 Binomial-to-Poisson limit

Let $𝑋_{𝑛} \sim Binomial (𝑛, 𝑝_{𝑛})$ with $𝑛 𝑝_{𝑛} \to 𝜆$ . Then for fixed $𝑘$ ,

𝑃 (𝑋_{𝑛} = 𝑘) = \binom{𝑛}{𝑘} 𝑝_{𝑛}^{𝑘} (1 - 𝑝_{𝑛})^{𝑛 - 𝑘} \to 𝑒^{- 𝜆} \frac{𝜆^{𝑘}}{𝑘!} .

Thus $𝑋_{𝑛} \Rightarrow Poisson (𝜆)$ .

The proof shows the three components: $\binom{𝑛}{𝑘} \sim 𝑛^{𝑘} / 𝑘!$, $𝑝_{𝑛}^{𝑘} \sim (𝜆 / 𝑛)^{𝑘}$, and $(1 - 𝑝_{𝑛})^{𝑛 - 𝑘} \to 𝑒^{- 𝜆}$ for fixed $𝑘$. This is the canonical sparse independent limit.
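
The pointwise limit can be upgraded to a numerical total variation check. The sketch below takes $𝑝_{𝑛} = 𝜆 / 𝑛$ exactly and compares the result with Le Cam's bound $𝑛 𝑝_{𝑛}^{2} = 𝜆^{2} / 𝑛$. The function names are invented, and the mass functions are evaluated through `math.lgamma` to avoid overflow for larger $𝑛$.

```python
import math

def binomial_pmf(n, p, k):
    """Binomial(n, p) mass at k, computed on the log scale (requires 0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def poisson_pmf(lam, k):
    """Poisson(lam) mass at k, computed on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def tv_binomial_poisson(n, lam):
    """Total variation distance between Binomial(n, lam / n) and Poisson(lam)."""
    p = lam / n
    l1 = sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(n + 1))
    # Poisson mass above n has no binomial counterpart.
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1)))
    return 0.5 * (l1 + poisson_tail)

lam = 2.0
for n in (10, 100, 1000):
    print(n, tv_binomial_poisson(n, lam), lam**2 / n)   # distance and Le Cam bound
```

The distance decays at rate $1 / 𝑛$ and stays below the bound, so the limit is not only pointwise in $𝑘$ but uniform over all events.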

19.4 Chen–Stein method

The Chen–Stein method gives quantitative Poisson approximation for dependent indicators. It characterizes the Poisson distribution by an operator equation and bounds the distance between $𝑊$ and Poisson through local dependence terms.

A typical bound involves dependency neighborhoods $𝐵_{𝑖}$ , with errors built from sums such as

\sum_{𝑖} \sum_{𝑗 \in 𝐵_{𝑖}} 𝑝_{𝑖} 𝑝_{𝑗}

and joint probabilities

\sum_{𝑖} \sum_{𝑗 \in 𝐵_{𝑖}, 𝑗 \neq 𝑖} 𝑃 (𝐼_{𝑖} = 1, 𝐼_{𝑗} = 1) .

The method pays the dependence debt explicitly. It is widely used in random graphs, occupancy problems, pattern matching, and rare-event counts.

19.5 Rare indicators

Rare indicators are variables $𝐼_{𝑖} = 1_{𝐴_{𝑖}}$ with small $𝑃 (𝐴_{𝑖})$ . Their sum counts rare events. The Poisson regime requires not only that each $𝑃 (𝐴_{𝑖})$ is small, but that the total mean remains finite:

\sum_{𝑖} 𝑃 (𝐴_{𝑖}) \to 𝜆 .

If rare events occur in clusters, the limit may be compound Poisson rather than Poisson. If dependence is too strong, no Poisson limit may hold. The audit is: individual rarity, total intensity, and clustering control.

19.6 Dependency neighborhoods

A dependency neighborhood $𝐵_{𝑖}$ for $𝐼_{𝑖}$ is a set of indices containing variables that may significantly depend on $𝐼_{𝑖}$ . Outside $𝐵_{𝑖}$ , approximate independence is assumed or proved. The smaller and weaker these neighborhoods are, the closer the count is to Poisson.

In random graphs, the indicator for a subgraph copy depends on indicators for overlapping copies. Nonoverlapping copies may be independent. Thus dependency neighborhoods are combinatorial overlap structures. Poisson approximation reduces to counting overlaps and showing their contribution vanishes.

19.7 Point processes preview

A point process is a random counting measure

𝑁 = \sum_{𝑖} 𝛿_{𝑋_{𝑖}}

on a space $𝑆$ . A Poisson point process with intensity measure $𝜇$ satisfies:

𝑁 (𝐴) \sim Poisson (𝜇 (𝐴))

for measurable $𝐴$ , and counts on disjoint sets are independent.

Poisson point processes are the spatial version of rare-event limits. Instead of only counting how many rare events occur, they record where they occur. Many rare-event limits are better stated as convergence of point processes, with ordinary Poisson count convergence as a projection.

19.8 Poissonization and depoissonization

Poissonization replaces a fixed sample size $𝑛$ by a Poisson random sample size $𝑁 \sim Poisson (𝑛)$ . This often makes counts independent. For example, in occupancy problems, Poissonizing the number of balls makes bin occupancies independent Poisson variables.

Depoissonization transfers estimates back to fixed $𝑛$ . This transfer requires error control showing that randomizing the sample size did not change the target quantity too much. Poissonization is a carrier simplification; depoissonization is the liftback.
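
The independence produced by Poissonization is an exact algebraic identity, and it can be verified directly in a small case. The sketch below uses two equally likely bins (a hypothetical choice made to keep the formulas short) and compares the joint law of the bin counts under a Poisson$(𝑛)$ number of balls with the product of two Poisson$(𝑛 / 2)$ laws. It also shows that with a fixed number of balls the counts are dependent.

```python
import math


def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poissonized_joint(n, a, b):
    """P(bin 1 = a, bin 2 = b) when N ~ Poisson(n) balls fall uniformly into 2 bins."""
    total = a + b
    return poisson_pmf(n, total) * math.comb(total, a) * 0.5 ** total


def fixed_joint(n, a, b):
    """P(bin 1 = a, bin 2 = b) when exactly n balls are thrown."""
    return math.comb(n, a) * 0.5 ** n if a + b == n else 0.0


n = 6
max_gap = max(
    abs(poissonized_joint(n, a, b) - poisson_pmf(n / 2, a) * poisson_pmf(n / 2, b))
    for a in range(15)
    for b in range(15)
)
print(f"largest gap between joint law and product law: {max_gap:.2e}")
print(f"fixed n: P(bin 1 = 3, bin 2 = 2) = {fixed_joint(n, 3, 2)}")
```

With fixed $𝑛$ the two counts must sum to $𝑛$ , so the joint probability of the pair (3, 2) is zero even though each marginal value has positive probability.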

19.9 Occupancy and collision limits

In occupancy with $𝑚$ balls and $𝑛$ boxes, the number of boxes with exactly $𝑟$ balls is a rare-event count in many regimes. Collision counts are sums over pairs:

𝐶 = \sum_{𝑖 < 𝑗} 1_{{𝑋_{𝑖} = 𝑋_{𝑗}}} .

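For uniformly distributed $𝑋_{𝑖}$ the expected count is $𝐸 𝐶 = 𝑚 (𝑚 - 1) / (2 𝑛)$ , so when $𝑚^{2} / 𝑛 \to 𝜆$ the collision count has a Poisson limit with mean $𝜆 / 2$ .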

The exact limit depends on scaling. If $𝑚 ≪ \sqrt{𝑛}$ , the expected number of collisions tends to zero and collisions vanish in probability. If $𝑚 \sim 𝑐 \sqrt{𝑛}$ , the collision count has a nontrivial Poisson limit with mean $𝑐^{2} / 2$ . If $𝑚 ≫ \sqrt{𝑛}$ , collisions become abundant. Occupancy models therefore display rare-event thresholds through elementary counting.
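
The three regimes can be seen in the probability of no collision. The sketch below assumes uniform allocation and compares the exact value with the Poisson heuristic $\exp (- 𝐸 𝐶)$ , where $𝐸 𝐶 = 𝑚 (𝑚 - 1) / (2 𝑛)$ . The function names and the parameter choices are illustrative.

```python
import math


def prob_no_collision(m, n):
    """Exact P(C = 0) for m balls thrown uniformly into n boxes."""
    prob = 1.0
    for i in range(m):
        prob *= 1.0 - i / n
    return prob


def poisson_no_collision(m, n):
    """exp(-E[C]) with E[C] = m(m-1)/(2n), the Poisson heuristic for P(C = 0)."""
    return math.exp(-m * (m - 1) / (2 * n))


for m, n in [(23, 365), (10, 10 ** 6), (1000, 10 ** 6), (5000, 10 ** 6)]:
    exact = prob_no_collision(m, n)
    approx = poisson_no_collision(m, n)
    print(f"m = {m:5d}, n = {n:8d}: exact = {exact:.6f}, Poisson = {approx:.6f}")
```

The first row is the classical birthday problem. The remaining rows fix $𝑛$ and move $𝑚$ from well below $\sqrt{𝑛}$ , to the order of $\sqrt{𝑛}$ , to well above it.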

Chapter 20 — Local Limit Theorems

20.1 Global versus local convergence

A global limit theorem, such as the CLT, says distribution functions converge:

𝑃 (\frac{𝑆_{𝑛} - 𝑎_{𝑛}}{𝑏_{𝑛}} \leq 𝑥) \to Φ (𝑥) .

A local limit theorem estimates point probabilities or small-window probabilities:

𝑃 (𝑆_{𝑛} = 𝑘) \approx \frac{1}{𝑏_{𝑛}} 𝑔 (\frac{𝑘 - 𝑎_{𝑛}}{𝑏_{𝑛}})

in lattice cases, with $𝑔$ the density of the limit law, or density-level approximations in continuous cases.

Local theorems are stronger statements. Weak convergence only tests probabilities of fixed intervals whose endpoints are continuity points of the limit; local convergence probes individual atoms or shrinking windows. A local theorem therefore requires finer Fourier control and must account for lattice structure.

20.2 Lattice distributions

A distribution is lattice if it is supported on

𝑎 + ℎ 𝑍

for some span $ℎ > 0$ . Sums of lattice variables remain on a lattice. For integer-valued iid variables with span 1, mean $𝜇$ , and finite variance $𝜎^{2} > 0$ , a typical local CLT has the form

𝑃 (𝑆_{𝑛} = 𝑘) \sim \frac{1}{𝜎 \sqrt{𝑛}} 𝜑 (\frac{𝑘 - 𝑛 𝜇}{𝜎 \sqrt{𝑛}}),

uniformly over lattice points $𝑘$ with $𝑘 - 𝑛 𝜇 = 𝑂 (\sqrt{𝑛})$ , with $𝜑$ the standard normal density.

The lattice span matters. If the variable only takes even values, odd target points have probability zero. A local theorem that ignores lattice support is false. Lattice audit is therefore mandatory.

20.3 Nonlattice distributions

A nonlattice distribution is not concentrated on any shifted arithmetic progression $𝑎 + ℎ 𝑍$ . For sums of nonlattice variables, local results are usually stated in terms of densities, small intervals, or smoothed probabilities rather than point masses, since $𝑃 (𝑆_{𝑛} = 𝑥) = 0$ in continuous cases.

If $𝑆_{𝑛}$ has density $𝑓_{𝑛}$ , a density local CLT may state

𝜎 \sqrt{𝑛} 𝑓_{𝑛} (𝑛 𝜇 + 𝜎 \sqrt{𝑛} 𝑥) \to 𝜑 (𝑥)

uniformly in $𝑥$ , under regularity conditions. Nonlattice assumptions prevent periodic Fourier obstructions that appear in lattice cases.

20.4 Fourier inversion

Local limit theorems rely heavily on Fourier inversion. For integer-valued $𝑆_{𝑛}$ ,

𝑃 (𝑆_{𝑛} = 𝑘) = \frac{1}{2 𝜋} \int_{- 𝜋}^{𝜋} 𝑒^{- 𝑖 𝑡 𝑘} 𝜙_{𝑆_{𝑛}} (𝑡) 𝑑 𝑡 .

For continuous densities,

𝑓_{𝑆_{𝑛}} (𝑥) = \frac{1}{2 𝜋} \int_{𝑅} 𝑒^{- 𝑖 𝑡 𝑥} 𝜙_{𝑆_{𝑛}} (𝑡) 𝑑 𝑡

when inversion is justified.

The proof strategy splits the Fourier domain. Near $𝑡 = 0$ , the characteristic function of the normalized sum is close to the Gaussian $𝑒^{- 𝑡^{2} / 2}$ . Away from zero, one needs decay bounds to show that the contributions are negligible. Local results therefore require stronger global Fourier control than the ordinary CLT.
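
The lattice inversion formula can be checked numerically when the characteristic function is known in closed form. The sketch below takes $𝑆_{𝑛}$ to be Binomial$(𝑛, 𝑝)$ , an example chosen here for convenience, and replaces the integral over $[- 𝜋, 𝜋]$ by a uniform Riemann sum. Because the integrand is a trigonometric polynomial of degree at most $𝑛$ , a grid of $𝑛 + 1$ points reproduces the integral up to rounding error.

```python
import cmath
import math


def char_fn_binomial(t, n, p):
    """Characteristic function of Binomial(n, p)."""
    return (1 - p + p * cmath.exp(1j * t)) ** n


def point_prob_by_inversion(k, n, p):
    """(1 / 2 pi) * integral over [-pi, pi] of exp(-itk) phi(t), as a Riemann sum."""
    grid = n + 1
    total = 0j
    for j in range(grid):
        t = -math.pi + 2 * math.pi * j / grid
        total += cmath.exp(-1j * t * k) * char_fn_binomial(t, n, p)
    return (total / grid).real


n, p, k = 30, 0.3, 9
inverted = point_prob_by_inversion(k, n, p)
direct = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
print(f"inversion: {inverted:.12f}")
print(f"direct:    {direct:.12f}")
```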

20.5 Aperiodicity

For integer-valued variables, aperiodicity means the support is not contained in a proper sublattice $𝑎 + ℎ 𝑍$ with $ℎ > 1$ . Equivalently, the characteristic function satisfies

∣ 𝜙 (𝑡) ∣ < 1

for $𝑡 \in [- 𝜋, 𝜋] ∖ {0}$ in the span-one case.

Aperiodicity rules out nonzero frequencies at which $∣ 𝜙 (𝑡) ∣ = 1$ , and with them inaccessible residue classes. Without it, the local limit must be restricted to reachable lattice points and corrected by the span. Aperiodicity is the discrete analogue of nonlattice regularity.
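
The characteristic-function criterion is easy to test on small examples. The sketch below evaluates $∣ 𝜙 (𝑡) ∣$ on a grid that excludes a neighborhood of zero, for one aperiodic law and one law supported on even integers. Both laws, the cutoff `eps`, and the grid size are illustrative choices.

```python
import cmath
import math


def char_fn(pmf, t):
    """phi(t) = sum_k pmf[k] * exp(i t k) for a dict {value: probability}."""
    return sum(prob * cmath.exp(1j * t * k) for k, prob in pmf.items())


def max_modulus_away_from_zero(pmf, eps=0.5, grid=2000):
    """Largest |phi(t)| over grid points of [-pi, pi] with |t| >= eps."""
    ts = (-math.pi + 2 * math.pi * j / grid for j in range(grid + 1))
    return max(abs(char_fn(pmf, t)) for t in ts if abs(t) >= eps)


aperiodic = {0: 0.5, 1: 0.5}   # span 1
even_only = {0: 0.5, 2: 0.5}   # supported on 2Z, span 2

print(f"span 1: max |phi| = {max_modulus_away_from_zero(aperiodic):.6f}")
print(f"span 2: max |phi| = {max_modulus_away_from_zero(even_only):.6f}")
```

For the even-valued law the modulus returns to 1 at $𝑡 = \pm 𝜋$ , which is exactly the frequency that detects parity.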

20.6 Gaussian local limit theorem

For iid integer-valued aperiodic variables with mean $𝜇$ and finite variance $𝜎^{2} > 0$ , the Gaussian local limit theorem states

\sup_{𝑘} ∣ 𝜎 \sqrt{𝑛} 𝑃 (𝑆_{𝑛} = 𝑘) - 𝜑 (\frac{𝑘 - 𝑛 𝜇}{𝜎 \sqrt{𝑛}}) ∣ \to 0.

This is stronger than the CLT because it approximates individual probabilities.

The theorem shows that the probability mass at $𝑘$ is approximately the Gaussian density at the standardized point $(𝑘 - 𝑛 𝜇) / (𝜎 \sqrt{𝑛})$ times the lattice cell width $1 / (𝜎 \sqrt{𝑛})$ on the standardized scale. It is the correct bridge from continuous Gaussian shape to discrete point probabilities.
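
The uniform statement can be observed numerically for Bernoulli summands, where $𝑆_{𝑛}$ is Binomial$(𝑛, 𝑝)$ . The sketch below computes the supremum over a finite window of integers that contains the whole support, as a proxy for the supremum over all $𝑘$ . The value $𝑝 = 0.3$ and the function names are assumptions made for illustration.

```python
import math


def binomial_pmf(n, p, k):
    if k < 0 or k > n:
        return 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def local_clt_sup_error(n, p):
    """max over k of |sigma sqrt(n) P(S_n = k) - phi((k - n mu) / (sigma sqrt(n)))|."""
    mu, sigma = p, math.sqrt(p * (1.0 - p))
    scale = sigma * math.sqrt(n)
    # Scan a window wider than the support so off-support integers are included.
    return max(
        abs(scale * binomial_pmf(n, p, k) - std_normal_pdf((k - n * mu) / scale))
        for k in range(-n, 2 * n + 1)
    )


errors = {n: local_clt_sup_error(n, 0.3) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"n = {n:5d}  sup error = {err:.5f}")
```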

20.7 Edgeworth expansions

Edgeworth expansions refine normal approximation by adding correction terms involving cumulants. Writing $𝑓_{𝑛}$ here for the density of the standardized sum $(𝑆_{𝑛} - 𝑛 𝜇) / (𝜎 \sqrt{𝑛})$ , a typical density-level expansion has the form

𝑓_{𝑛} (𝑥) = 𝜑 (𝑥) [1 + \frac{𝑃_{1} (𝑥)}{\sqrt{𝑛}} + \frac{𝑃_{2} (𝑥)}{𝑛} + \dots + \frac{𝑃_{𝑚} (𝑥)}{𝑛^{𝑚 / 2}}] + 𝑜 (𝑛^{- 𝑚 / 2}),

where the $𝑃_{𝑖}$ are polynomials determined by cumulants.

These expansions give higher-order asymptotics beyond the CLT. They require stronger moment and smoothness conditions and may fail in the far tails. Edgeworth expansions are asymptotic series, not automatically positive densities or uniform global approximations.
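
A one-term correction already shows the effect. The sketch below uses the standard form of the first polynomial, $𝑃_{1} (𝑥) = 𝛾 (𝑥^{3} - 3 𝑥) / 6$ with $𝛾$ the skewness of a single summand (a fact supplied here, not stated above), and applies it to sums of iid Exp(1) variables. That example is chosen because the exact density of the standardized sum is available from the Gamma law. All function names are illustrative.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def standardized_gamma_density(x, n):
    """Exact density at x of (S_n - n) / sqrt(n), S_n a sum of n iid Exp(1) variables."""
    s = n + math.sqrt(n) * x
    if s <= 0:
        return 0.0
    log_pdf = (n - 1) * math.log(s) - s - math.lgamma(n)
    return math.sqrt(n) * math.exp(log_pdf)


def edgeworth_one_term(x, n, skewness):
    """phi(x) * (1 + P_1(x) / sqrt(n)) with P_1(x) = skewness * (x^3 - 3x) / 6."""
    correction = skewness * (x ** 3 - 3 * x) / (6.0 * math.sqrt(n))
    return std_normal_pdf(x) * (1.0 + correction)


n, skewness = 20, 2.0  # Exp(1) has mean 1, variance 1, skewness 2
for x in (-1.0, 0.0, 1.0, 2.0):
    exact = standardized_gamma_density(x, n)
    normal = std_normal_pdf(x)
    edge = edgeworth_one_term(x, n, skewness)
    print(f"x = {x:+.1f}  exact = {exact:.5f}  normal = {normal:.5f}  Edgeworth = {edge:.5f}")
```

At $𝑥 = 0$ the polynomial $𝑥^{3} - 3 𝑥$ vanishes, so the one-term correction changes nothing there. The improvement appears at points where the skewness term is active.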

20.8 Saddle-point methods preview

Saddle-point methods estimate probabilities using complex-analytic or exponential-tilting techniques. They are especially useful for local probabilities and tail probabilities where Gaussian approximation is insufficient. The method locates the dominant contribution to an integral representation, typically at a saddle point where the derivative of the exponent vanishes.

In probability, saddle-point methods appear in sums, combinatorial enumeration, branching processes, and large deviations. They refine exponential-scale estimates by adding prefactors. The carrier is analytic: moment generating functions, cumulant generating functions, and contour integrals.
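
As a small preview of the prefactor refinement, the sketch below applies the standard first-order saddle-point density approximation, $\exp (𝐾 (𝑠) - 𝑠 𝑥) / \sqrt{2 𝜋 𝐾'' (𝑠)}$ with $𝑠$ solving $𝐾' (𝑠) = 𝑥$ , to a sum of $𝑛$ iid Exp(1) variables. Both the formula and the example are supplied here as assumptions for illustration. For this law the cumulant generating function is $𝐾 (𝑠) = - 𝑛 \log (1 - 𝑠)$ for $𝑠 < 1$ , and the saddle-point equation has the explicit solution $𝑠 = 1 - 𝑛 / 𝑥$ .

```python
import math


def saddlepoint_density_exp_sum(x, n):
    """First-order saddle-point approximation to the density of a sum of n Exp(1)."""
    s_hat = 1.0 - n / x                      # solves K'(s) = n / (1 - s) = x
    cgf = -n * math.log(1.0 - s_hat)         # K(s_hat)
    cgf_second = n / (1.0 - s_hat) ** 2      # K''(s_hat)
    return math.exp(cgf - s_hat * x) / math.sqrt(2.0 * math.pi * cgf_second)


def exact_density_exp_sum(x, n):
    """Gamma(n, 1) density at x > 0."""
    return math.exp((n - 1) * math.log(x) - x - math.lgamma(n))


n = 10
for x in (5.0, 10.0, 30.0):
    ratio = saddlepoint_density_exp_sum(x, n) / exact_density_exp_sum(x, n)
    print(f"x = {x:5.1f}  saddle-point / exact = {ratio:.6f}")
```

In this example the ratio does not depend on $𝑥$ : the approximation differs from the exact density only by the Stirling error in $Γ (𝑛)$ , so the relative accuracy in the far tail is the same as at the mean.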

20.9 Local probability estimates

Local estimates determine probabilities of specific values or small windows:

𝑃 (𝑆_{𝑛} = 𝑘), 𝑃 (𝑆_{𝑛} \in [𝑥, 𝑥 + Δ]) .

They are required in random walks, renewal theory, combinatorics, statistical mechanics, and number-theoretic probability. Global convergence may be too coarse because it cannot resolve individual mass points.

A local estimate usually needs variance scale, lattice/nonlattice classification, smoothness or aperiodicity, and Fourier decay. The output is often uniform over central ranges and weaker in tails. Tail-local estimates may require large deviation or saddle-point machinery.

20.10 Lattice obstruction audit

Before applying a local limit theorem, check the support. If $𝑋$ is supported on $𝑎 + ℎ 𝑍$ , then $𝑆_{𝑛}$ is supported on

𝑛 𝑎 + ℎ 𝑍 .

Any approximation that assigns positive probability to points outside this set is false there, because the true probability is exactly zero.

The audit also includes span, periodicity, residue classes, and whether the target variable has density or atoms. The CLT can ignore these details because intervals blur them; local limits cannot. Local probability is where carrier arithmetic reappears.
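
The audit can be written as a guard in front of the Gaussian estimate. The sketch below assumes that $ℎ$ is the maximal span of the step distribution and uses the span-corrected form $(ℎ / (𝜎 \sqrt{𝑛})) 𝜑 (\cdot)$ , which is the standard correction for span $ℎ$ but is not written out above. The function names and the test law (uniform on {0, 2}) are illustrative.

```python
import math


def in_support_lattice(k, n, a, h, tol=1e-9):
    """True if k lies on n*a + h*Z, the lattice carrying S_n."""
    steps = (k - n * a) / h
    return abs(steps - round(steps)) < tol


def local_clt_estimate(k, n, mu, sigma, a, h):
    """Span-corrected Gaussian estimate of P(S_n = k); exactly 0 off the lattice."""
    if not in_support_lattice(k, n, a, h):
        return 0.0
    scale = sigma * math.sqrt(n)
    x = (k - n * mu) / scale
    return (h / scale) * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# X uniform on {0, 2}: a = 0, h = 2, mean 1, standard deviation 1.
n = 100
on_lattice = local_clt_estimate(100, n, mu=1.0, sigma=1.0, a=0, h=2)
off_lattice = local_clt_estimate(101, n, mu=1.0, sigma=1.0, a=0, h=2)
exact = math.comb(100, 50) / 2 ** 100   # P(S_n = 100) = P(Binomial(100, 1/2) = 50)
print(f"estimate at 100: {on_lattice:.6f}, exact: {exact:.6f}")
print(f"estimate at 101: {off_lattice}")
```

Dropping the factor $ℎ$ would halve the estimate at reachable points, and dropping the support check would assign positive mass to odd totals that can never occur.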