Probability Theory — Chapters 30–40

Chapter 30 — Stochastic Differential Equations

30.1 Itô SDEs

An Itô stochastic differential equation has the form

𝑑 𝑋_{𝑡} = 𝑏 (𝑡, 𝑋_{𝑡}) 𝑑 𝑡 + 𝜎 (𝑡, 𝑋_{𝑡}) 𝑑 𝐵_{𝑡}, 𝑋_{0} = 𝜉,

where $𝐵_{𝑡}$ is Brownian motion, $𝑏$ is the drift coefficient, and $𝜎$ is the diffusion coefficient. The rigorous meaning is the integral equation

𝑋_{𝑡} = 𝜉 + \int_{0}^{𝑡} 𝑏 (𝑠, 𝑋_{𝑠}) 𝑑 𝑠 + \int_{0}^{𝑡} 𝜎 (𝑠, 𝑋_{𝑠}) 𝑑 𝐵_{𝑠} .

The first integral is an ordinary Lebesgue integral along the sample path; the second is an Itô stochastic integral.

In $𝑑$ dimensions with an $𝑚$-dimensional Brownian driver,

𝑑 𝑋_{𝑡} = 𝑏 (𝑡, 𝑋_{𝑡}) 𝑑 𝑡 + 𝜎 (𝑡, 𝑋_{𝑡}) 𝑑 𝐵_{𝑡},

where $𝑏 : 𝑅_{+} \times 𝑅^{𝑑} \to 𝑅^{𝑑}$ and

𝜎 : 𝑅_{+} \times 𝑅^{𝑑} \to 𝑅^{𝑑 \times 𝑚} .

The instantaneous covariance matrix is

𝑎 (𝑡, 𝑥) = 𝜎 (𝑡, 𝑥) 𝜎 (𝑡, 𝑥)^{⊤} .

Different $𝜎$ can induce the same covariance matrix $𝑎$, so the law of the diffusion typically depends on $𝜎$ only through $𝑎$, not on a particular factorization.

An SDE should be understood as a carrier consisting of probability space, filtration, Brownian motion, coefficient functions, initial law, and solution concept. The differential notation alone does not specify whether the solution is strong or weak, whether explosion is allowed, what happens at boundaries, or whether uniqueness is pathwise or merely in law.

30.2 Strong and weak solutions

A strong solution is constructed on a prescribed filtered probability space carrying a prescribed Brownian motion. The solution $𝑋_{𝑡}$ is adapted to that filtration and satisfies the integral equation. Informally, once the Brownian path and initial condition are supplied, $𝑋$ is determined as a measurable functional of those inputs when pathwise uniqueness holds.

A weak solution allows the probability space, filtration, Brownian motion, and process $𝑋$ to be constructed together. The target is the existence of a joint law satisfying the SDE. Weak solutions therefore belong to a law-level carrier.

The two uniqueness notions differ. Pathwise uniqueness means that if $𝑋$ and $𝑌$ solve the SDE on the same probability space with the same Brownian motion and initial value, then

𝑃 (𝑋_{𝑡} = 𝑌_{𝑡} for all 𝑡) = 1.

Uniqueness in law means all weak solutions have the same distribution on path space.

The distinction is fundamental:

strong solution = pathwise construction, weak solution = law construction .

Equality in law is not pathwise identity. A theorem proving weak uniqueness cannot automatically be used to compare two solutions driven by the same noise.

30.3 Existence and uniqueness

A standard sufficient condition for strong existence and pathwise uniqueness is global Lipschitz continuity:

∣ 𝑏 (𝑡, 𝑥) - 𝑏 (𝑡, 𝑦) ∣ + ∥ 𝜎 (𝑡, 𝑥) - 𝜎 (𝑡, 𝑦) ∥ \leq 𝐿 ∣ 𝑥 - 𝑦 ∣,

together with linear growth:

∣ 𝑏 (𝑡, 𝑥) ∣ + ∥ 𝜎 (𝑡, 𝑥) ∥ \leq 𝐶 (1 + ∣ 𝑥 ∣) .

Under these assumptions, Picard iteration produces a unique global strong solution.

Starting from $𝑋_{𝑡}^{(0)} = 𝜉$, define

𝑋_{𝑡}^{(𝑛 + 1)} = 𝜉 + \int_{0}^{𝑡} 𝑏 (𝑠, 𝑋_{𝑠}^{(𝑛)}) 𝑑 𝑠 + \int_{0}^{𝑡} 𝜎 (𝑠, 𝑋_{𝑠}^{(𝑛)}) 𝑑 𝐵_{𝑠} .

Itô isometry, BDG inequalities, and Grönwall's lemma show convergence in suitable process norms.

Local Lipschitz conditions usually produce uniqueness only up to explosion time. Irregular coefficients may still admit weak or even strong solutions, but the proof route changes. Possible tools include martingale problems, monotonicity methods, one-dimensional comparison, Zvonkin transformations, Dirichlet forms, or compactness methods.

The gate structure is:

formal SDE \to existence \to solution concept \to uniqueness class \to globality .

These are separate results.

30.4 Explosion times

A local solution may cease to remain finite after a random time. Define

𝜏_{𝑛} = \inf {𝑡 \geq 0 : ∣ 𝑋_{𝑡} ∣ \geq 𝑛}, 𝜏_{\infty} = \lim_{𝑛 \to \infty} 𝜏_{𝑛} .

The process explodes if

𝑃 (𝜏_{\infty} < \infty) > 0.

Global Lipschitz and linear-growth conditions prevent explosion. More flexible nonexplosion criteria use Lyapunov functions. Suppose $𝑉 : 𝑅^{𝑑} \to [0, \infty)$ satisfies

𝑉 (𝑥) \to \infty as ∣ 𝑥 ∣ \to \infty

and

𝐴 𝑉 (𝑥) \leq 𝐶 (1 + 𝑉 (𝑥)),

where $𝐴$ is the generator. Itô's formula and Grönwall estimates can then bound

𝐸 [𝑉 (𝑋_{𝑡 \land 𝜏_{𝑛}})]

uniformly in $𝑛$, implying nonexplosion under suitable hypotheses.

Explosion is a local-global gate. Proving the SDE has a solution for every bounded stopping region does not prove the process exists globally. The passage

local solution \to global process

requires a separate nonexplosion certificate.
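A deterministic special case makes the local-global gate concrete. The worked example below is an illustration chosen here, not taken from the text: it assumes zero noise and the drift $b(x) = x^{2}$, which is locally Lipschitz but violates linear growth.

```latex
% Illustrative example (assumptions: \sigma \equiv 0, b(x) = x^2, x_0 > 0).
\begin{aligned}
dX_t &= X_t^{2}\,dt, \qquad X_0 = x_0 > 0, \\
X_t &= \frac{x_0}{1 - x_0 t}, \qquad 0 \le t < \frac{1}{x_0}, \\
\tau_n &= \frac{1}{x_0} - \frac{1}{n} \quad (n \ge x_0), \qquad
\tau_\infty = \lim_{n \to \infty} \tau_n = \frac{1}{x_0} < \infty .
\end{aligned}
```

A solution exists up to every $\tau_{n}$, yet no global solution exists, so local existence alone cannot certify globality.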

30.5 Diffusion generators

For

𝑑 𝑋_{𝑡} = 𝑏 (𝑋_{𝑡}) 𝑑 𝑡 + 𝜎 (𝑋_{𝑡}) 𝑑 𝐵_{𝑡},

define

𝑎 (𝑥) = 𝜎 (𝑥) 𝜎 (𝑥)^{⊤} .

The generator acts formally on smooth $𝑓$ as

𝐴 𝑓 (𝑥) = 𝑏 (𝑥) \cdot \nabla 𝑓 (𝑥) + \frac{1}{2} \sum_{𝑖, 𝑗} 𝑎_{𝑖 𝑗} (𝑥) \partial_{𝑖 𝑗} 𝑓 (𝑥) .

Itô's formula gives

𝑓 (𝑋_{𝑡}) - 𝑓 (𝑋_{0}) - \int_{0}^{𝑡} 𝐴 𝑓 (𝑋_{𝑠}) 𝑑 𝑠

as a local martingale, and a true martingale under suitable integrability. This identity connects the SDE carrier to the martingale-problem carrier.

The generator contains infinitesimal transition information:

𝐴 𝑓 (𝑥) = \lim_{𝑡 ↓ 0} \frac{𝐸_{𝑥} [𝑓 (𝑋_{𝑡})] - 𝑓 (𝑥)}{𝑡} .

It links stochastic dynamics to PDE, potential theory, semigroups, invariant measures, and spectral analysis.

The differential expression is not the complete generator. Its domain matters. Boundary conditions and function-space closure can produce different Markov processes from similar formal differential operators.

30.6 Fokker–Planck equations

If $𝑋_{𝑡}$ has density $𝑝 (𝑡, 𝑥)$, then formally its law evolves according to the forward Kolmogorov or Fokker–Planck equation

\partial_{𝑡} 𝑝 = 𝐴^{*} 𝑝 .

For a diffusion,

\partial_{𝑡} 𝑝 = - \sum_{𝑖} \partial_{𝑖} (𝑏_{𝑖} 𝑝) + \frac{1}{2} \sum_{𝑖, 𝑗} \partial_{𝑖 𝑗} (𝑎_{𝑖 𝑗} 𝑝) .

The SDE and Fokker–Planck equation describe different carriers. The SDE describes individual random paths. The Fokker–Planck equation describes transport of probability mass. One may have a well-defined process without a smooth density, so the PDE representation requires regularity beyond process existence.

Stationary densities satisfy

𝐴^{*} 𝑝_{\infty} = 0

together with normalization and boundary conditions. Solving this equation may identify an invariant law, but uniqueness of the invariant law and convergence toward it require additional recurrence or ergodicity arguments.
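As a hedged illustration, the stationary equation can be solved in closed form for the one-dimensional Ornstein–Uhlenbeck process. The process and the parameters $\theta$ and $\sigma$ are assumptions introduced for this example only.

```latex
% Illustrative example (assumptions: Ornstein--Uhlenbeck, \theta > 0, \sigma > 0).
\begin{aligned}
dX_t &= -\theta X_t\,dt + \sigma\,dB_t, \\
A^{*} p_\infty &= \partial_x\!\bigl(\theta x\,p_\infty\bigr)
  + \tfrac{1}{2}\sigma^{2}\,\partial_{xx} p_\infty = 0, \\
\theta x\,p_\infty + \tfrac{1}{2}\sigma^{2}\,\partial_x p_\infty &= 0
  \quad\text{(zero flux at infinity)}, \\
p_\infty(x) &= \sqrt{\frac{\theta}{\pi\sigma^{2}}}\,
  \exp\!\Bigl(-\frac{\theta x^{2}}{\sigma^{2}}\Bigr),
  \qquad\text{i.e. } N\!\bigl(0, \sigma^{2}/(2\theta)\bigr).
\end{aligned}
```

The zero-flux and normalization steps are exactly the boundary and normalization conditions mentioned above; without them the second-order equation has further, non-integrable solutions.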

30.7 Kolmogorov backward equation

For

𝑢 (𝑡, 𝑥) = 𝐸_{𝑥} [𝑓 (𝑋_{𝑡})],

the semigroup relation gives

𝑢 (𝑡, 𝑥) = 𝑃_{𝑡} 𝑓 (𝑥) .

Under regularity conditions,

\partial_{𝑡} 𝑢 = 𝐴 𝑢, 𝑢 (0, 𝑥) = 𝑓 (𝑥) .

For a terminal-value problem,

𝑢 (𝑡, 𝑥) = 𝐸_{𝑡, 𝑥} [𝑔 (𝑋_{𝑇})],

one obtains

\partial_{𝑡} 𝑢 + 𝐴 𝑢 = 0, 𝑢 (𝑇, 𝑥) = 𝑔 (𝑥) .

The forward equation acts on probability distributions; the backward equation acts on observables or value functions:

forward : law \to law, backward : observable \to expected observable .

This distinction is central in filtering, stochastic control, potential theory, and mathematical finance. Confusing the adjoint roles of $𝐴$ and $𝐴^{*}$ produces incorrect PDE transport.

30.8 Change of measure

Suppose

𝑑 𝑋_{𝑡} = 𝑏_{𝑡} 𝑑 𝑡 + 𝜎_{𝑡} 𝑑 𝐵_{𝑡} .

Under an equivalent probability measure defined through a suitable exponential martingale,

𝑍_{𝑇} = \exp (- \int_{0}^{𝑇} 𝜃_{𝑠} 𝑑 𝐵_{𝑠} - \frac{1}{2} \int_{0}^{𝑇} ∣ 𝜃_{𝑠} ∣^{2} 𝑑 𝑠),

one can define

\frac{𝑑 𝑄}{𝑑 𝑃} = 𝑍_{𝑇} .

Then

{\tilde{𝐵}}_{𝑡} = 𝐵_{𝑡} + \int_{0}^{𝑡} 𝜃_{𝑠} 𝑑 𝑠

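is Brownian motion under $𝑄$, and the drift representation of $𝑋$ changes accordingly: substituting $𝑑 𝐵_{𝑡} = 𝑑 {\tilde{𝐵}}_{𝑡} - 𝜃_{𝑡} 𝑑 𝑡$ gives $𝑑 𝑋_{𝑡} = (𝑏_{𝑡} - 𝜎_{𝑡} 𝜃_{𝑡}) 𝑑 𝑡 + 𝜎_{𝑡} 𝑑 {\tilde{𝐵}}_{𝑡}$.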

This is a probability-law transport, not a path transformation. The underlying trajectories can remain the same while their weights change.

The density process must be a true martingale with expectation one. A local martingale is insufficient for defining a probability measure with total mass one. Conditions such as Novikov's criterion are sufficient:

𝐸 \exp (\frac{1}{2} \int_{0}^{𝑇} ∣ 𝜃_{𝑠} ∣^{2} 𝑑 𝑠) < \infty .
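The reweighting idea can be sketched numerically as importance sampling. This is a minimal sketch under stated assumptions, not a general implementation: it uses a constant drift (so $𝜃 = -$`drift` in the sign convention above, and Novikov's criterion holds trivially), only the terminal value $𝐵_{𝑇}$, and a hypothetical function name `tail_prob_importance_sampling`.

```python
import math
import random


def tail_prob_importance_sampling(a, T, drift, n, rng):
    """Estimate P(B_T > a) for standard Brownian motion by sampling B_T with an
    added constant drift (measure Q) and reweighting with dP/dQ."""
    total = 0.0
    for _ in range(n):
        # Under Q, the terminal value is N(drift * T, T).
        x = drift * T + math.sqrt(T) * rng.gauss(0.0, 1.0)
        if x > a:
            # Likelihood ratio dP/dQ evaluated at the terminal value.
            total += math.exp(-drift * x + 0.5 * drift * drift * T)
    return total / n


rng = random.Random(1)
estimate = tail_prob_importance_sampling(a=3.0, T=1.0, drift=3.0, n=50000, rng=rng)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # P(N(0, 1) > 3)
print(estimate, exact)
```

The sampled values are ordinary real numbers; only their weights differ between the two measures, which is the sense in which the change of measure transports laws rather than paths.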

30.9 Stochastic flows

Under suitable regularity, the family of solutions

𝑋_{𝑡}^{𝑥}

indexed by initial state $𝑥$ forms a stochastic flow. Instead of studying one random trajectory, one studies the random map

Φ_{𝑠, 𝑡} (𝜔, 𝑥) = 𝑋_{𝑡}^{𝑠, 𝑥} (𝜔) .

The flow property is

Φ_{𝑠, 𝑡} = Φ_{𝑢, 𝑡} \circ Φ_{𝑠, 𝑢}, 𝑠 \leq 𝑢 \leq 𝑡,

with appropriate noise-shift interpretation.

When coefficients are sufficiently smooth, $𝑥 \mapsto 𝑋_{𝑡}^{𝑥}$ may be differentiable. The Jacobian flow satisfies a linearized SDE. For example,

𝑑 𝐽_{𝑡} = 𝐷 𝑏 (𝑋_{𝑡}) 𝐽_{𝑡} 𝑑 𝑡 + 𝐷 𝜎 (𝑋_{𝑡}) 𝐽_{𝑡} 𝑑 𝐵_{𝑡} .

Stochastic flows are used in random dynamical systems, stochastic geometry, fluid models, filtering, and sensitivity analysis. Flow regularity is stronger than existence of separate solutions for each initial point. Simultaneous control over uncountably many initial conditions is a separate gate.

30.10 Boundary behavior

An SDE inside a domain $𝐷$ does not determine what happens at $\partial 𝐷$ . Possible behaviors include absorption, reflection, killing, stickiness, entrance, exit, and natural inaccessibility.

Absorbing behavior stops the process or sends it to a cemetery state at

𝜏_{\partial 𝐷} = \inf {𝑡 : 𝑋_{𝑡} \in \partial 𝐷} .

Reflection adds a bounded-variation correction:

𝑑 𝑋_{𝑡} = 𝑏 (𝑋_{𝑡}) 𝑑 𝑡 + 𝜎 (𝑋_{𝑡}) 𝑑 𝐵_{𝑡} + 𝑛 (𝑋_{𝑡}) 𝑑 𝐿_{𝑡},

where $𝐿_{𝑡}$ is boundary local time and $𝑛$ is an inward normal field.

In one dimension, scale functions and speed measures classify boundaries. In higher dimensions, boundary behavior connects to PDE boundary conditions: absorbing corresponds to Dirichlet-type conditions; reflection corresponds to Neumann-type conditions.

The same interior generator can support distinct processes under different boundary conditions. The boundary rule is therefore part of the carrier.

30.11 Numerical schemes

For time step $Δ 𝑡$, the Euler–Maruyama scheme approximates the solution of

𝑑 𝑋_{𝑡} = 𝑏 (𝑋_{𝑡}) 𝑑 𝑡 + 𝜎 (𝑋_{𝑡}) 𝑑 𝐵_{𝑡}

by the recursion

𝑋_{𝑘 + 1} = 𝑋_{𝑘} + 𝑏 (𝑋_{𝑘}) Δ 𝑡 + 𝜎 (𝑋_{𝑘}) Δ 𝐵_{𝑘},

where

Δ 𝐵_{𝑘} \sim 𝑁 (0, Δ 𝑡) .
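The recursion translates directly into code. The sketch below is a minimal scalar version using only the standard library; the function name `euler_maruyama` and the geometric Brownian motion test case (with parameters `mu`, `s`) are assumptions chosen for illustration because the exact mean $x_{0} e^{\mu T}$ is known.

```python
import math
import random


def euler_maruyama(b, sigma, x0, T, n_steps, rng):
    """Simulate one Euler-Maruyama path of dX = b(X) dt + sigma(X) dB on [0, T]."""
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    x = x0
    path = [x]
    for _ in range(n_steps):
        dB = rng.gauss(0.0, sqrt_dt)  # Brownian increment with law N(0, dt)
        x = x + b(x) * dt + sigma(x) * dB
        path.append(x)
    return path


# Illustrative test case (an assumption, not from the text): geometric
# Brownian motion dX = mu X dt + s X dB, for which E[X_T] = x0 * exp(mu T).
mu, s, x0, T = 0.05, 0.2, 1.0, 1.0
rng = random.Random(0)
n_paths = 4000
terminal = [
    euler_maruyama(lambda x: mu * x, lambda x: s * x, x0, T, 100, rng)[-1]
    for _ in range(n_paths)
]
mc_mean = sum(terminal) / n_paths
exact_mean = x0 * math.exp(mu * T)
print(mc_mean, exact_mean)
```

Comparing `mc_mean` with `exact_mean` checks an expectation only, so it probes the law of the scheme rather than pathwise accuracy.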

Strong error concerns coupled trajectories:

(𝐸 [\max_{𝑘 \leq 𝑛} ∣ 𝑋_{𝑡_{𝑘}} - 𝑋_{𝑘} ∣^{𝑝}])^{1 / 𝑝} .

Weak error concerns observables:

∣ 𝐸 [𝑓 (𝑋_{𝑇})] - 𝐸 [𝑓 (𝑋_{𝑛})] ∣ .

A method can have better weak order than strong order because matching distributions is easier than matching paths. Under standard smoothness assumptions, Euler–Maruyama has strong order $1 / 2$ and weak order $1$.

Higher-order methods include Milstein schemes, which add terms involving derivatives of $𝜎$ and iterated stochastic integrals. Numerical validity also requires stability, preservation of positivity or constraints, control of invariant measures, and treatment of stiffness.

30.12 SDE model validity

An SDE is a mathematical model, not a proof that a real system contains Brownian noise. To justify the model, one must identify state variables, time scale, source of randomness, Markov approximation, coefficient calibration, boundary conditions, and the regime in which diffusion scaling is valid.

Diffusion approximations often arise from limits of discrete systems:

small jumps + high frequency + centering + variance scaling \to diffusion .

If jumps are heavy-tailed, a Lévy process may be the correct limit instead. If memory persists, a Markov SDE may be invalid. If noise is multiplicative, interpretation of stochastic integration can matter.

The full liftback chain is

mechanism/data \to stochastic scaling \to SDE carrier \to mathematical solution \to empirical validation .

Part IX — Concentration, High-Dimensional Probability, and Random Structures

Chapter 31 — Concentration Inequalities

31.1 Hoeffding inequality

Let $𝑋_{1}, \dots, 𝑋_{𝑛}$ be independent random variables satisfying

𝑎_{𝑖} \leq 𝑋_{𝑖} \leq 𝑏_{𝑖}

almost surely. Then

𝑃 (\sum_{𝑖 = 1}^{𝑛} (𝑋_{𝑖} - 𝐸 𝑋_{𝑖}) \geq 𝑡) \leq \exp (- \frac{2 𝑡^{2}}{\sum_{𝑖 = 1}^{𝑛} (𝑏_{𝑖} - 𝑎_{𝑖})^{2}}) .

A corresponding two-sided inequality follows by applying the estimate to both tails.

Hoeffding's inequality is obtained by exponential tilting and a bound on the moment generating function of bounded centered variables. It is distribution-free within the bounded range. The price of universality is that it ignores actual variance and can be conservative when variance is small relative to range.

For iid Bernoulli or bounded averages,

{\overset{ˉ}{𝑋}}_{𝑛} = \frac{1}{𝑛} \sum_{𝑖} 𝑋_{𝑖},

the bound takes the exponential form

𝑃 (∣ {\overset{ˉ}{𝑋}}_{𝑛} - 𝐸 𝑋_{1} ∣ \geq 𝜀) \leq 2 𝑒^{- 𝑐 𝑛 𝜀^{2}} .

This improves on Chebyshev's inequality, which gives only a polynomial bound of order $1 / (𝑛 𝜀^{2})$.

31.2 Bernstein inequality

Bernstein's inequality incorporates both variance and maximum magnitude. For independent centered $𝑋_{𝑖}$ with

∣ 𝑋_{𝑖} ∣ \leq 𝑀, 𝑉 = \sum_{𝑖} 𝐸 [𝑋_{𝑖}^{2}],

a standard form is

𝑃 (\sum_{𝑖} 𝑋_{𝑖} \geq 𝑡) \leq \exp (- \frac{𝑡^{2}}{2 (𝑉 + 𝑀 𝑡 / 3)}) .

Two regimes appear. For $𝑡 ≪ 𝑉 / 𝑀$,

𝑃 (𝑆 \geq 𝑡) \approx 𝑒^{- 𝑡^{2} / (2 𝑉)},

which is Gaussian. For $𝑡 ≫ 𝑉 / 𝑀$,

𝑃 (𝑆 \geq 𝑡) ≲ 𝑒^{- 𝑐 𝑡 / 𝑀},

which is exponential.

Bernstein is therefore sensitive to actual fluctuation energy $𝑉$ and worst-case jump scale $𝑀$ . It is often superior to Hoeffding when the variables are bounded but sparse or low variance.

31.3 Bennett inequality

Bennett's inequality is a sharper variance-sensitive concentration bound for bounded independent summands. Under assumptions similar to Bernstein,

𝑃 (𝑆 \geq 𝑡) \leq \exp [- \frac{𝑉}{𝑀^{2}} ℎ (\frac{𝑀 𝑡}{𝑉})],

where

ℎ (𝑢) = (1 + 𝑢) \log (1 + 𝑢) - 𝑢 .

Bernstein's inequality can be derived by lower-bounding $ℎ (𝑢)$ with a simpler rational expression. Bennett therefore retains more exact convex rate information.
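The ordering of the three bounds can be checked numerically. The sketch below evaluates the Hoeffding, Bernstein, and Bennett upper-tail bounds as stated in this chapter; the function names and the sparse Bernoulli example (values of `n`, `p`, `t`) are hypothetical choices made for illustration.

```python
import math


def hoeffding_bound(t, ranges):
    """Upper-tail bound exp(-2 t^2 / sum (b_i - a_i)^2)."""
    return math.exp(-2.0 * t * t / sum(r * r for r in ranges))


def bernstein_bound(t, V, M):
    """Upper-tail bound exp(-t^2 / (2 (V + M t / 3)))."""
    return math.exp(-t * t / (2.0 * (V + M * t / 3.0)))


def bennett_bound(t, V, M):
    """Upper-tail bound exp(-(V / M^2) h(M t / V)), h(u) = (1 + u) log(1 + u) - u."""
    u = M * t / V
    h = (1.0 + u) * math.log1p(u) - u
    return math.exp(-(V / (M * M)) * h)


# Hypothetical sparse example: n centered Bernoulli(p) summands.
n, p, t = 1000, 0.01, 20.0
V = n * p * (1.0 - p)  # sum of variances
M = 1.0                # |X_i - p| <= max(p, 1 - p) <= 1
bounds = {
    "hoeffding": hoeffding_bound(t, [1.0] * n),
    "bernstein": bernstein_bound(t, V, M),
    "bennett": bennett_bound(t, V, M),
}
print(bounds)
```

In this low-variance regime the range-only Hoeffding bound is weaker by several orders of magnitude, while Bennett is never worse than Bernstein because of the lower bound on $ℎ$.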

The function $ℎ$ reflects the Legendre transform of an MGF bound. This exposes the relation between concentration inequalities and large deviations: concentration produces a finite-$𝑛$ upper certificate; the convex rate function records the exponential cost geometry.

31.4 Chernoff method

The Chernoff method begins from Markov's inequality:

𝑃 (𝑋 \geq 𝑎) = 𝑃 (𝑒^{𝜆 𝑋} \geq 𝑒^{𝜆 𝑎}) \leq 𝑒^{- 𝜆 𝑎} 𝐸 [𝑒^{𝜆 𝑋}] .

Optimizing over $𝜆 > 0$ gives

𝑃 (𝑋 \geq 𝑎) \leq \exp (- \sup_{𝜆 > 0} {𝜆 𝑎 - \log 𝐸 [𝑒^{𝜆 𝑋}]}) .

For independent sums,

\log 𝐸 [𝑒^{𝜆 \sum_{𝑖} 𝑋_{𝑖}}] = \sum_{𝑖} \log 𝐸 [𝑒^{𝜆 𝑋_{𝑖}}] .

Thus an MGF estimate for individual summands becomes a tail estimate for the sum.

Chernoff is a method rather than one inequality. Hoeffding, Bernstein, Bennett, and many large deviation upper bounds are obtained by choosing different MGF estimates. The method fails if exponential moments do not exist in the required direction.
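The optimization over $𝜆$ can be carried out explicitly in the simplest case. The following worked example assumes a centered Gaussian variable, chosen here for illustration because its log-MGF is exactly quadratic.

```latex
% Illustrative example (assumptions: X \sim N(0, \sigma^2), a > 0).
\begin{aligned}
\log E[e^{\lambda X}] &= \tfrac{1}{2}\sigma^{2}\lambda^{2}, \\
\sup_{\lambda > 0}\Bigl\{\lambda a - \tfrac{1}{2}\sigma^{2}\lambda^{2}\Bigr\}
  &= \frac{a^{2}}{2\sigma^{2}}
  \quad\text{at } \lambda = a/\sigma^{2}, \\
P(X \ge a) &\le \exp\!\Bigl(-\frac{a^{2}}{2\sigma^{2}}\Bigr).
\end{aligned}
```

Replacing the exact log-MGF by an upper bound of the same quadratic form gives the subgaussian tail bound by the identical computation.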

31.5 Subgaussian random variables

A centered random variable $𝑋$ is subgaussian if there exists $𝐾 > 0$ such that

𝐸 [𝑒^{𝜆 𝑋}] \leq 𝑒^{𝐾^{2} 𝜆^{2} / 2}

for all $𝜆 \in 𝑅$ , or equivalently up to constants,

𝑃 (∣ 𝑋 ∣ \geq 𝑡) \leq 2 𝑒^{- 𝑐 𝑡^{2} / 𝐾^{2}} .

The subgaussian norm is often written

∥ 𝑋 ∥_{𝜓_{2}} = \inf {𝐾 > 0 : 𝐸 𝑒^{𝑋^{2} / 𝐾^{2}} \leq 2} .

Gaussian variables and bounded centered variables are subgaussian.

Independent subgaussian sums remain subgaussian:

{∥ \sum_{𝑖} 𝑎_{𝑖} 𝑋_{𝑖} ∥}_{𝜓_{2}} ≲ {(\sum_{𝑖} 𝑎_{𝑖}^{2} ∥ 𝑋_{𝑖} ∥_{𝜓_{2}}^{2})}^{1 / 2} .

This creates a reusable tail calculus for high-dimensional probability.

31.6 Subexponential random variables

A random variable is subexponential if its tail satisfies roughly

𝑃 (∣ 𝑋 ∣ \geq 𝑡) \leq 2 𝑒^{- 𝑐 𝑡 / 𝐾}

for sufficiently large $𝑡$ , or equivalently its Orlicz norm

∥ 𝑋 ∥_{𝜓_{1}} = \inf {𝐾 > 0 : 𝐸 𝑒^{∣ 𝑋 ∣ / 𝐾} \leq 2}

is finite.

Products of subgaussian variables are often subexponential. Sums of independent centered subexponential variables obey Bernstein-type bounds:

𝑃 (∣ \sum_{𝑖} 𝑋_{𝑖} ∣ \geq 𝑡) \leq 2 \exp [- 𝑐 \min (\frac{𝑡^{2}}{\sum_{𝑖} 𝐾_{𝑖}^{2}}, \frac{𝑡}{\max_{𝑖} 𝐾_{𝑖}})] .

The two-regime structure reflects Gaussian accumulation at moderate scale and single-large-term behavior farther into the tail.

31.7 Bounded differences

Suppose $𝑋_{1}, \dots, 𝑋_{𝑛}$ are independent and

𝑍 = 𝑓 (𝑋_{1}, \dots, 𝑋_{𝑛}) .

Assume changing coordinate $𝑖$ alone can change $𝑓$ by at most $𝑐_{𝑖}$ :

∣ 𝑓 (𝑥) - 𝑓 (𝑥^{'}) ∣ \leq 𝑐_{𝑖}

whenever $𝑥, 𝑥^{'}$ differ only in coordinate $𝑖$ .

Then $𝑍$ is concentrated around its mean. The standard route constructs the Doob exposure martingale

𝑀_{𝑖} = 𝐸 [𝑍 ∣ 𝑋_{1}, \dots, 𝑋_{𝑖}] .

The martingale increments are controlled by $𝑐_{𝑖}$ , so Azuma–Hoeffding applies.

The method is valuable because it applies to nonlinear functions of many independent inputs. The essential primitive is sensitivity, not additivity.

31.8 McDiarmid inequality

Under the bounded differences assumptions,

𝑃 (𝑍 - 𝐸 𝑍 \geq 𝑡) \leq \exp (- \frac{2 𝑡^{2}}{\sum_{𝑖} 𝑐_{𝑖}^{2}}),

and similarly for the lower tail.

Applications include graph statistics, randomized algorithms, occupancy statistics, empirical risks, permutation statistics, and combinatorial optimization. One reveals inputs sequentially and bounds how much a single revelation can alter the final quantity.
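An occupancy statistic gives a small concrete instance. The sketch below assumes $n$ balls thrown uniformly into $m$ bins and takes $𝑍$ to be the number of empty bins; moving one ball changes $𝑍$ by at most one, so every $𝑐_{𝑖} = 1$. The function name `empty_bins` and all parameter values are illustrative assumptions.

```python
import math
import random


def empty_bins(n_balls, n_bins, rng):
    """Number of empty bins after throwing n_balls uniformly into n_bins."""
    occupied = {rng.randrange(n_bins) for _ in range(n_balls)}
    return n_bins - len(occupied)


n_balls = n_bins = 100
t = 15.0
rng = random.Random(2)
samples = [empty_bins(n_balls, n_bins, rng) for _ in range(5000)]

# Exact mean of the number of empty bins: m (1 - 1/m)^n.
mean_z = n_bins * (1.0 - 1.0 / n_bins) ** n_balls
empirical_tail = sum(abs(z - mean_z) >= t for z in samples) / len(samples)

# Two-sided McDiarmid bound with c_i = 1 for each of the n_balls inputs.
mcdiarmid = 2.0 * math.exp(-2.0 * t * t / n_balls)
print(empirical_tail, mcdiarmid)
```

The empirical tail frequency is typically far below the bound, which illustrates that the guarantee is valid but driven by worst-case sensitivity.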

The weakness is worst-case sensitivity. If rare inputs can cause large changes but typical changes are small, McDiarmid may be too crude. Variance-sensitive or self-bounding inequalities then provide sharper alternatives.

31.9 Talagrand inequalities

Talagrand-type concentration inequalities give stronger control for product spaces by combining sensitivity with geometric or convex structure. Several inequivalent inequalities carry Talagrand's name. A representative phenomenon is that a Lipschitz convex function of independent bounded variables has subgaussian concentration around its median or mean.

Another version uses convex distance on product spaces:

𝑃 (𝐴) 𝑃 (𝐴_{𝑡}^{𝑐}) \leq 𝑒^{- 𝑐 𝑡^{2}} .

The distance is designed to measure how many coordinates must change, with optimized weights.

Talagrand inequalities often improve bounded-difference estimates because they incorporate structure beyond maximum coordinate influence. They are central in empirical processes, random graphs, random combinatorics, and high-dimensional geometry.

31.10 Gaussian concentration

If $𝐺 \sim 𝑁 (0, 𝐼_{𝑛})$ and $𝑓 : 𝑅^{𝑛} \to 𝑅$ is $𝐿$ -Lipschitz, then

𝑃 (∣ 𝑓 (𝐺) - 𝐸 𝑓 (𝐺) ∣ \geq 𝑡) \leq 2 𝑒^{- 𝑡^{2} / (2 𝐿^{2})}

up to precise constants depending on formulation.

The dimension $𝑛$ does not appear explicitly. High-dimensional Gaussian measure concentrates because Lipschitz observables cannot vary rapidly enough to overcome the geometry of Gaussian mass.

Gaussian concentration can be proved through isoperimetry, log-Sobolev inequalities, Herbst's argument, or transportation inequalities. It is a prototype for dimension-free concentration.

31.11 Isoperimetric concentration

Isoperimetric inequalities relate boundary size to volume. In probability, they imply concentration because a set with probability at least $1 / 2$ rapidly captures almost all mass when enlarged by a small metric radius.

For Gaussian measure $𝛾_{𝑛}$ , half-spaces minimize boundary measure among sets of fixed Gaussian measure. This produces Gaussian concentration:

𝛾_{𝑛} (𝐴_{𝑟}^{𝑐}) \leq 𝑒^{- 𝑟^{2} / 2}

up to constants when $𝛾_{𝑛} (𝐴) \geq 1 / 2$ .

The principle is:

isoperimetry \to set concentration \to Lipschitz-function concentration .

Geometry of the carrier controls fluctuation of observables.

31.12 Log-Sobolev inequalities

A log-Sobolev inequality controls entropy by an energy form. For Gaussian measure, a standard form is

Ent (𝑓^{2}) \leq 2 \int ∣ \nabla 𝑓 ∣^{2} 𝑑 𝛾 .

Here

Ent (𝑔) = 𝐸 [𝑔 \log 𝑔] - 𝐸 [𝑔] \log 𝐸 [𝑔] .

The Herbst argument converts a log-Sobolev inequality into exponential moment bounds and then into concentration inequalities. Thus functional inequalities become tail certificates.

Log-Sobolev inequalities are stronger than ordinary Poincaré inequalities. They imply hypercontractivity and strong convergence-to-equilibrium results for Markov semigroups.

31.13 Transport inequalities

A transportation-cost inequality compares a distance between probability measures to relative entropy. A basic example is

𝑊_{2} (𝜈, 𝜇)^{2} \leq 2 𝐶 𝐻 (𝜈 ∣ 𝜇) .

Such inequalities imply concentration of Lipschitz functions.

The interpretation is that moving probability mass far from its reference distribution requires entropy cost. This unifies optimal transport, information theory, and concentration.

Transport inequalities can tensorize across independent products, making them powerful in high-dimensional systems. The carrier geometry is encoded by Wasserstein distance; probabilistic deviation is encoded by entropy.

31.14 Concentration versus asymptotics

Concentration inequalities and asymptotic limit theorems answer different questions. A CLT identifies the shape of typical normalized fluctuations:

\frac{𝑆_{𝑛} - 𝑛 𝜇}{\sqrt{𝑛}} \Rightarrow 𝑁 (0, 𝜎^{2}) .

Concentration gives finite-$𝑛$ tail bounds:

𝑃 (∣ 𝑆_{𝑛} - 𝐸 𝑆_{𝑛} ∣ \geq 𝑡) \leq explicit bound .

A concentration bound may be nonsharp in its constants or exact rate function. A CLT by itself gives no finite-$𝑛$ error bound. A large deviation principle may give the correct exponential rate but omit useful finite-sample prefactors.

The correct hierarchy is:

shape \neq finite bound \neq exponential rate .

Chapter 32 — Empirical Processes

32.1 Empirical distribution functions

Given iid real observations $𝑋_{1}, \dots, 𝑋_{𝑛}$ with CDF $𝐹$ , the empirical distribution function is

𝐹_{𝑛} (𝑥) = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 1_{{𝑋_{𝑖} \leq 𝑥}} .

For each fixed $𝑥$ , the strong law gives

𝐹_{𝑛} (𝑥) \to 𝐹 (𝑥)

almost surely.

The empirical CDF is itself a random function. Fixed- $𝑥$ convergence does not imply uniform convergence over all $𝑥$ ; that requires a uniform law of large numbers. Empirical process theory studies random fluctuations indexed by classes of functions or sets.

32.2 Glivenko–Cantelli theorem

The Glivenko–Cantelli theorem states

\sup_{𝑥 \in 𝑅} ∣ 𝐹_{𝑛} (𝑥) - 𝐹 (𝑥) ∣ \to 0

almost surely.

This is stronger than pointwise LLN because the supremum ranges over an uncountable index set. The special monotonicity structure of CDFs permits uniform control.
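The uniform distance can be computed exactly from the order statistics. The sketch below assumes Uniform(0, 1) data, so that $𝐹 (𝑥) = 𝑥$ on $[0, 1]$; the function name `sup_distance_uniform` and the sample sizes are illustrative assumptions.

```python
import random


def sup_distance_uniform(sample):
    """sup_x |F_n(x) - F(x)| for F the Uniform(0, 1) CDF, F(x) = x on [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained at a jump of F_n: compare F with the value of
    # F_n just after (i + 1) / n and just before (i / n) each order statistic.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


rng = random.Random(3)
distances = {}
for n in (100, 1000, 10000):
    distances[n] = sup_distance_uniform([rng.random() for _ in range(n)])
print(distances)
```

Monotonicity of both $𝐹_{𝑛}$ and $𝐹$ is what reduces the supremum over uncountably many $𝑥$ to a maximum over $2 𝑛$ candidate values.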

A general class $𝐹$ is called Glivenko–Cantelli for a law $𝑃$ if

\sup_{𝑓 \in 𝐹} ∣ 𝑃_{𝑛} 𝑓 - 𝑃 𝑓 ∣ \to 0,

where

𝑃_{𝑛} 𝑓 = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑓 (𝑋_{𝑖}), 𝑃 𝑓 = 𝐸 [𝑓 (𝑋)] .

Uniform laws of large numbers are therefore empirical-process statements.

32.3 Donsker theorem

Define the empirical process

𝛼_{𝑛} (𝑥) = \sqrt{𝑛} (𝐹_{𝑛} (𝑥) - 𝐹 (𝑥)) .

Donsker's theorem states that, after appropriate indexing and under standard conditions,

𝛼_{𝑛} \Rightarrow 𝐵 \circ 𝐹,

where $𝐵$ is a Brownian bridge.

For $𝐹$ continuous, the covariance is

Cov (𝐵 (𝐹 (𝑥)), 𝐵 (𝐹 (𝑦))) = 𝐹 (𝑥 \land 𝑦) - 𝐹 (𝑥) 𝐹 (𝑦) .

This is a functional CLT. Instead of one statistic converging to one Gaussian variable, an entire indexed process converges to a Gaussian process. Finite-dimensional convergence is insufficient; tightness in function space is required.

32.4 VC dimension

The Vapnik–Chervonenkis dimension of a class of sets $\mathcal{A}$ measures its combinatorial capacity. A finite set ${𝑥_{1}, \dots, 𝑥_{𝑚}}$ is shattered by $\mathcal{A}$ if

{𝐴 \cap {𝑥_{1}, \dots, 𝑥_{𝑚}} : 𝐴 \in \mathcal{A}}

contains all $2^{𝑚}$ subsets. The VC dimension is the largest size of a shattered set, and is infinite if arbitrarily large sets are shattered.

Finite VC dimension controls uniform deviations:

\sup_{𝐴 \in \mathcal{A}} ∣ 𝑃_{𝑛} (𝐴) - 𝑃 (𝐴) ∣ .

The reason is combinatorial: a low-VC class cannot realize too many distinct label patterns on a finite sample.
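Shattering can be checked by brute force for a simple class. The sketch below assumes the class of closed intervals on the real line, chosen here as an illustration; the helper names are hypothetical and the points are assumed distinct.

```python
from itertools import combinations


def interval_patterns(points):
    """All subsets of `points` that can be cut out by a closed interval [a, b]."""
    pts = sorted(points)
    patterns = {frozenset()}  # an interval that misses every point
    # An interval picks out a contiguous run of the sorted points.
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            patterns.add(frozenset(pts[i:j + 1]))
    return patterns


def is_shattered(points):
    """True if intervals realize all 2^m subsets of the m given points."""
    return len(interval_patterns(points)) == 2 ** len(points)


def vc_dimension_on(ground, max_size):
    """Largest size of a shattered subset of `ground`, found by brute force."""
    best = 0
    for m in range(1, max_size + 1):
        if any(is_shattered(c) for c in combinations(ground, m)):
            best = m
    return best


print(is_shattered([0.2, 0.7]))
print(is_shattered([0.1, 0.5, 0.9]))
print(vc_dimension_on([0.1, 0.3, 0.5, 0.7, 0.9], 4))
```

Three points cannot be shattered because no interval contains the two outer points while excluding the middle one, so intervals realize only quadratically many patterns on a sample rather than exponentially many.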

VC theory links probability concentration to statistical learning. Complexity of the function class determines sample size needed for uniform generalization.

32.5 Covering numbers

For a metric or pseudometric space $(𝐹, 𝑑)$ , the covering number

𝑁 (𝜀, 𝐹, 𝑑)

is the minimum number of $𝑑$ -balls of radius $𝜀$ required to cover $𝐹$ .

In empirical processes, common metrics include

𝑑_{𝑃} (𝑓, 𝑔) = {(𝐸 [(𝑓 - 𝑔)^{2}])}^{1 / 2}

or empirical versions. Covering numbers quantify effective size at scale $𝜀$ .

For a finite class, cardinality is a crude measure of complexity. An infinite class requires multiscale complexity. Covering numbers are the bridge:

infinite function class \to finite approximations at each scale .

32.6 Entropy integrals

Metric entropy is

\log 𝑁 (𝜀, 𝐹, 𝑑) .

Dudley-type entropy bounds control expected suprema of Gaussian or subgaussian processes:

𝐸 \sup_{𝑓 \in 𝐹} 𝑋_{𝑓} ≲ \int_{0}^{diam (𝐹)} \sqrt{\log 𝑁 (𝜀, 𝐹, 𝑑)} 𝑑 𝜀 .

Entropy integrals convert geometric complexity into stochastic supremum bounds. The square root of log covering number reflects Gaussian fluctuation scale.

A class may be infinite and highly structured while still having a finite entropy integral. Conversely, excessive metric entropy can obstruct uniform convergence or tightness.

32.7 Symmetrization

Let

𝑃_{𝑛} 𝑓 = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑓 (𝑋_{𝑖}) .

To bound

𝐸 \sup_{𝑓 \in 𝐹} ∣ 𝑃_{𝑛} 𝑓 - 𝑃 𝑓 ∣,

introduce an independent ghost sample $𝑋_{𝑖}^{'}$ or Rademacher signs $𝜀_{𝑖}$ . Symmetrization gives bounds of the form

𝐸 \sup_{𝑓 \in 𝐹} ∣ 𝑃_{𝑛} 𝑓 - 𝑃 𝑓 ∣ \leq 2 𝐸 \sup_{𝑓 \in 𝐹} ∣ \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝜀_{𝑖} 𝑓 (𝑋_{𝑖}) ∣ .

This replaces an unknown population mean by a centered random-sign process conditional on the data. The resulting process is easier to analyze with concentration, contraction inequalities, and chaining.

Symmetrization is a carrier conversion:

empirical-minus-population \to random signed empirical process .

32.8 Rademacher complexity

The empirical Rademacher complexity of a function class $𝐹$ is

{\hat{𝑅}}_{𝑛} (𝐹) = 𝐸_{𝜀} [\sup_{𝑓 \in 𝐹} \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝜀_{𝑖} 𝑓 (𝑋_{𝑖}) | 𝑋_{1}, \dots, 𝑋_{𝑛}] .

Its expectation over the sample gives population Rademacher complexity.

This quantity measures how well the class can align with random noise. A highly expressive class can fit random signs and has large complexity. Uniform generalization bounds often take the form

\sup_{𝑓 \in 𝐹} ∣ 𝑃_{𝑛} 𝑓 - 𝑃 𝑓 ∣ ≲ 𝑅_{𝑛} (𝐹) + concentration term .

Rademacher complexity adapts to actual function values and can be sharper than purely combinatorial VC bounds.
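For a finite class the empirical Rademacher complexity can be estimated by Monte Carlo over the signs. The sketch below is illustrative only: the class of threshold indicators, the grid of thresholds, and the function name `empirical_rademacher` are assumptions, and Massart's finite-class lemma is quoted as a standard comparison bound.

```python
import math
import random


def empirical_rademacher(values, n_draws, rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of a finite
    class. values[j][i] is f_j(X_i) on the fixed sample X_1, ..., X_n."""
    n = len(values[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(e * v for e, v in zip(eps, row)) for row in values) / n
    return total / n_draws


rng = random.Random(4)
n = 50
sample = [rng.random() for _ in range(n)]
# Hypothetical class: threshold indicators f_c(x) = 1{x <= c} on a grid of c.
thresholds = [k / 10 for k in range(11)]
values = [[1.0 if x <= c else 0.0 for x in sample] for c in thresholds]

estimate = empirical_rademacher(values, 2000, rng)
# Massart's finite-class lemma for functions with values in [-1, 1].
massart = math.sqrt(2.0 * math.log(len(values)) / n)
print(estimate, massart)
```

The estimate depends on the actual function values on the sample, not only on the number of functions, which is the data-adaptive feature noted above.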

32.9 Chaining

Chaining controls the supremum of a stochastic process by approximating the index set at progressively finer scales. Instead of using one crude $𝜀$ -net, construct nested nets

𝑇_{0} \subseteq 𝑇_{1} \subseteq \dots

and decompose

𝑋_{𝑡} - 𝑋_{𝑡_{0}} = \sum_{𝑘 \geq 1} (𝑋_{𝜋_{𝑘} (𝑡)} - 𝑋_{𝜋_{𝑘 - 1} (𝑡)}) .

Each scale contributes a small increment but has more candidate points. Balancing increment size against combinatorial multiplicity produces entropy integrals or generic chaining bounds.

The method is one of the deepest general tools for suprema of Gaussian and subgaussian processes. It shows that one-scale union bounds are often structurally wasteful.

32.10 Uniform laws of large numbers

A uniform law of large numbers states

\sup_{𝑓 \in 𝐹} ∣ 𝑃_{𝑛} 𝑓 - 𝑃 𝑓 ∣ \to 0

in probability or almost surely. Unlike ordinary LLN, the index $𝑓$ may be selected after observing data, so pointwise convergence is insufficient.

Sufficient conditions use VC dimension, covering numbers, bracketing entropy, Rademacher complexity, or compactness. An envelope function, written here as $𝐹^{*}$ to distinguish it from the class $𝐹$, controls integrability:

∣ 𝑓 (𝑥) ∣ \leq 𝐹^{*} (𝑥)

for all $𝑓 \in 𝐹$ and all $𝑥$.

Uniform LLNs are the probability foundation of empirical risk minimization. They justify replacing population optimization

\inf_{𝑓} 𝑃 𝑓

with sample optimization

\inf_{𝑓} 𝑃_{𝑛} 𝑓

only when the function class is not too complex.

32.11 Statistical learning links

Statistical learning theory studies the gap between empirical and population performance:

𝑅 (𝑓) - {\hat{𝑅}}_{𝑛} (𝑓) .

Uniform concentration permits data-dependent choice $\hat{𝑓}$ :

𝑅 (\hat{𝑓}) \leq {\hat{𝑅}}_{𝑛} (\hat{𝑓}) + \sup_{𝑓 \in 𝐹} ∣ 𝑅 (𝑓) - {\hat{𝑅}}_{𝑛} (𝑓) ∣ .

Complexity controls generalization. Too small a class creates approximation error; too large a class creates estimation error. This produces the bias-complexity tradeoff.

Probability theory contributes concentration, symmetrization, empirical-process bounds, and complexity measures. The learning problem adds optimization and model-selection structure. Training error alone is not a generalization certificate.

Chapter 33 — Random Graphs and Probabilistic Combinatorics

33.1 Erdős–Rényi models

The two classical Erdős–Rényi models are $𝐺 (𝑛, 𝑝)$, where each of the

\binom{𝑛}{2}

possible edges appears independently with probability $𝑝$, and $𝐺 (𝑛, 𝑚)$, where the graph is chosen uniformly from all graphs on $𝑛$ labeled vertices with exactly $𝑚$ edges.

In $𝐺 (𝑛, 𝑝)$ , the edge count is

𝐸 \sim Binomial (\binom{𝑛}{2}, 𝑝) .

The expected degree is

(𝑛 - 1) 𝑝 .

Random graph theory studies global properties generated by local random edge decisions: connectivity, components, cycles, clique number, chromatic number, expansion, spectra, and subgraph counts.

The model changes regime as $𝑝 = 𝑝 (𝑛)$ varies. Sparse and dense random graphs are not perturbations of one another; they have different structural carriers.

33.2 Thresholds

A monotone increasing graph property $\mathcal{P}$ has a threshold scale $𝑝_{𝑐} (𝑛)$ if

𝑃 (𝐺 (𝑛, 𝑝) \in \mathcal{P}) \to 0

when $𝑝 ≪ 𝑝_{𝑐}$, and the probability tends to one when $𝑝 ≫ 𝑝_{𝑐}$.

Thresholds arise when expected obstruction counts change scale. For isolated vertices,

𝐸 𝐼 = 𝑛 (1 - 𝑝)^{𝑛 - 1} \approx 𝑛 𝑒^{- 𝑝 𝑛} .

This becomes order one around

𝑝 \sim \frac{\log 𝑛}{𝑛},

the connectivity threshold scale.

Expectation suggests a threshold but does not prove sharp transition. Second moments, Poisson approximation, coupling, or sharp-threshold theorems are often required.
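The change of scale in the expected number of isolated vertices can be seen numerically. The sketch below assumes $𝑝 = 𝑐 \log 𝑛 / 𝑛$ for two illustrative values of the constant `c` and compares the exact mean with a small simulation; the function names are hypothetical.

```python
import math
import random


def expected_isolated(n, p):
    """Exact mean number of isolated vertices in G(n, p): n (1 - p)^(n - 1)."""
    return n * (1.0 - p) ** (n - 1)


def count_isolated(n, p, rng):
    """Sample G(n, p) and count vertices of degree zero."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return sum(d == 0 for d in degree)


n = 200
rng = random.Random(5)
results = {}
for c in (0.5, 2.0):
    p = c * math.log(n) / n
    sims = [count_isolated(n, p, rng) for _ in range(50)]
    results[c] = (expected_isolated(n, p), sum(sims) / len(sims))
print(results)
```

Each entry pairs the exact mean with the simulated average. The mean behaves like $𝑛^{1 - 𝑐}$, so it grows for $𝑐 < 1$ and vanishes for $𝑐 > 1$; the simulation only checks the first moment and says nothing about sharpness of the transition.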

33.3 First moment method

For a nonnegative integer-valued random variable $𝑋$ ,

𝑃 (𝑋 > 0) \leq 𝐸 [𝑋] .

Therefore, if

𝐸 [𝑋] \to 0,

then $𝑋 = 0$ with high probability.

The same averaging argument proves deterministic existence. If the number $𝐵$ of bad features of a random object is a nonnegative integer with

𝐸 [𝐵] < 1,

then some realization has $𝐵 = 0$, since otherwise $𝐵 \geq 1$ always and $𝐸 [𝐵] \geq 1$.

The first moment method is asymmetric: it easily proves nonexistence of random structures when expected count vanishes and deterministic existence through averaging, but it generally cannot prove $𝑋 > 0$ with high probability from $𝐸 𝑋 \to \infty$ . Variance or dependence may dominate.

33.4 Second moment method

For $𝑋 \geq 0$ ,

𝑃 (𝑋 > 0) \geq \frac{(𝐸 𝑋)^{2}}{𝐸 [𝑋^{2}]}

by the Paley–Zygmund inequality at threshold $0$, or directly by Cauchy–Schwarz. Since $𝐸 [𝑋^{2}] = Var (𝑋) + (𝐸 𝑋)^{2}$, if

\frac{Var (𝑋)}{(𝐸 𝑋)^{2}} \to 0,

then $𝑋 > 0$ with high probability.

The method analyzes overlap among candidate structures. If

𝑋 = \sum_{𝛼} 𝐼_{𝛼},

then

𝐸 [𝑋^{2}] = \sum_{𝛼, 𝛽} 𝑃 (𝐼_{𝛼} = 𝐼_{𝛽} = 1) .

Pairs $(𝛼, 𝛽)$ are classified by intersection pattern. The overlap geometry is the dependence ledger.

33.5 Lovász local lemma

The Lovász local lemma proves that a collection of individually unlikely bad events can simultaneously be avoided when dependencies are sufficiently sparse. A symmetric form states: if each event $𝐴_{𝑖}$ satisfies

𝑃 (𝐴_{𝑖}) \leq 𝑝

and is mutually independent of all but at most $𝑑$ of the other events, then

𝑒 𝑝 (𝑑 + 1) \leq 1

implies

𝑃 (⋂_{𝑖} 𝐴_{𝑖}^{𝑐}) > 0.

The union bound requires

\sum_{𝑖} 𝑃 (𝐴_{𝑖}) < 1,

which can fail badly when there are many events. The local lemma exploits sparse dependence rather than total event count.

It is an existence theorem. Algorithmic versions such as Moser–Tardos provide constructive randomized procedures under suitable variable-event structure.
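A standard application, given here as a hedged illustration rather than as part of the text, is satisfiability of sparse $k$-CNF formulas. The setup (clauses with $k$ distinct variables, each sharing a variable with at most $d$ other clauses) is an assumption of the example.

```latex
% Illustrative example (assumptions: k-CNF formula, each clause has k distinct
% variables and shares a variable with at most d other clauses).
\begin{aligned}
A_i &= \{\text{clause } i \text{ is violated by a uniform random assignment}\},
  \qquad P(A_i) = 2^{-k}, \\
e \cdot 2^{-k}(d + 1) \le 1
  &\;\Longrightarrow\; P\Bigl(\bigcap_i A_i^{c}\Bigr) > 0
  \;\Longrightarrow\; \text{the formula is satisfiable.}
\end{aligned}
```

The number of clauses never enters the condition, whereas a union bound would require fewer than $2^{k}$ clauses in total.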

33.6 Janson inequalities

Janson inequalities estimate lower tails for counts of dependent rare structures built from independent underlying variables. Suppose

𝑋 = \sum_{𝛼} 𝐼_{𝛼}

counts substructures, with mean $𝜇$, and define an overlap-dependence parameter, summed over ordered pairs of distinct indicators $𝛼 \sim 𝛽$ that share primitive variables,

Δ = \sum_{𝛼 \sim 𝛽} 𝐸 [𝐼_{𝛼} 𝐼_{𝛽}] .

Then estimates of the form

𝑃 (𝑋 = 0) \leq \exp (- 𝜇 + \frac{Δ}{2})

hold in standard settings.

Janson's method is suited to subgraph counts, patterns, and set systems. The key object is not independence of indicators but dependence generated by shared primitive variables.

33.7 Random regular graphs

A random $𝑑$ -regular graph is sampled uniformly from graphs where every vertex has degree $𝑑$ . Edges are not independent, so $𝐺 (𝑛, 𝑝)$ methods do not transfer automatically.

The configuration model constructs a multigraph by assigning $𝑑$ half-edges to each vertex and pairing all half-edges uniformly. Conditional on simplicity, the resulting graph is uniform over simple $𝑑$ -regular graphs.

The configuration model converts a constrained graph problem into random matching. One must then audit loops, multiple edges, conditioning on simplicity, and whether the probability of simplicity remains non-negligible in the parameter regime.
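A minimal sketch of the configuration model with rejection on simplicity follows. The function names and the retry cap `max_tries` are illustrative assumptions, not a reference implementation; the sketch is only practical when the probability of simplicity is not too small.

```python
import random
from collections import Counter


def configuration_model(n, d, rng):
    """Pair all n*d half-edges uniformly at random; return a multigraph edge list."""
    if (n * d) % 2 != 0:
        raise ValueError("n * d must be even")
    half_edges = [v for v in range(n) for _ in range(d)]
    rng.shuffle(half_edges)  # a uniform shuffle induces a uniform perfect matching
    return [(half_edges[i], half_edges[i + 1]) for i in range(0, n * d, 2)]


def is_simple(edges):
    """True if the edge list has no loops and no repeated edges."""
    seen = set()
    for u, v in edges:
        key = (min(u, v), max(u, v))
        if u == v or key in seen:
            return False
        seen.add(key)
    return True


def degree_sequence(edges):
    """Count how many edge endpoints land on each vertex."""
    return Counter(v for edge in edges for v in edge)


def random_regular_graph(n, d, rng, max_tries=10_000):
    """Rejection sampling: resample the pairing until the multigraph is simple."""
    for tries in range(1, max_tries + 1):
        edges = configuration_model(n, d, rng)
        if is_simple(edges):
            return edges, tries
    raise RuntimeError("no simple graph found; simplicity may be too rare here")
```

The returned `tries` count gives a rough empirical reading of how costly the conditioning on simplicity is in a given parameter regime.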

33.8 Percolation preview

Percolation can be viewed as a random graph on a fixed infinite underlying graph. Each edge or vertex is independently retained with probability $𝑝$ . The principal question is emergence of an infinite connected component.

Random graph giant-component transitions and lattice percolation share branching-process heuristics but differ geometrically. Locally tree-like random graphs are often approximated by branching processes. Lattices contain strong short-cycle and spatial structure.

Thus branching approximation is a carrier conversion whose validity depends on local geometry.

33.9 Phase transitions

In $𝐺 (𝑛, 𝑝)$ , the scale

𝑝 \sim \frac{1}{𝑛}

marks the giant-component phase transition. If $𝑝 = 𝑐 / 𝑛$ with $𝑐 < 1$ , components are typically small. If $𝑐 > 1$ , a component of order $𝑛$ emerges.

The critical window is much narrower and has distinct scaling behavior. Near

𝑝 = \frac{1}{𝑛} + 𝜆 𝑛^{- 4 / 3},

component sizes occur on the $𝑛^{2 / 3}$ scale.

A phase transition is therefore not one number but a layered structure:

subcritical regime \to critical window \to supercritical regime .

33.10 Branching-process approximations

Exploring a sparse random graph component from a vertex initially resembles a branching process. Each discovered vertex produces approximately

Binomial (𝑛, 𝑝)

new neighbors, approaching $Poisson (𝑐)$ when $𝑝 = 𝑐 / 𝑛$ .

The mean offspring $𝑐$ predicts the phase transition at $𝑐 = 1$ . But the approximation is local and breaks when many explored vertices create collisions or depletion. Branching processes certify early exploration, not the entire global graph.

The liftback requires controlling the scale at which tree-likeness persists.
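The branching heuristic can be made quantitative. For Poisson $(𝑐)$ offspring, the extinction probability $𝑞$ is the smallest root of $𝑞 = 𝑒^{𝑐 (𝑞 - 1)}$ , and $1 - 𝑞$ is the heuristic prediction for the giant-component fraction. The sketch below solves this by fixed-point iteration; the function name and iteration count are illustrative assumptions.

```python
import math


def poisson_survival_probability(c, iterations=10_000):
    """Survival probability of a Poisson(c) branching process.

    Iterating q <- exp(c * (q - 1)) from q = 0 increases monotonically to the
    smallest fixed point, which is the extinction probability.
    """
    q = 0.0
    for _ in range(iterations):
        q = math.exp(c * (q - 1.0))
    return 1.0 - q


for c in (0.5, 1.5, 2.0):
    print(c, poisson_survival_probability(c))
```

For $𝑐 \leq 1$ the iteration converges to $𝑞 = 1$ (no survival), slowly at $𝑐 = 1$ ; for $𝑐 > 1$ it returns a strictly positive survival probability.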

33.11 Contiguity

Two sequences of probability measures $𝑃_{𝑛}, 𝑄_{𝑛}$ are mutually contiguous if, for every sequence of events $𝐴_{𝑛}$ ,

𝑃_{𝑛} (𝐴_{𝑛}) \to 0 ⟺ 𝑄_{𝑛} (𝐴_{𝑛}) \to 0 .

One-sided contiguity requires only one of the two implications.

Contiguity permits transfer of high-probability properties between random models even when their laws are not close in total variation. It is widely used between configuration models and uniform regular graphs, planted and unplanted models, and alternative random graph constructions.

Contiguity transports negligible events, not exact probabilities or expectations. It is weaker than total variation equivalence but stronger than mere heuristic similarity.

33.12 Random construction as existence certificate

The probabilistic method proves deterministic objects exist by showing a random construction has positive probability of satisfying required properties:

𝑃 (good object) > 0 \Rightarrow \exists good object .

The probability model is temporary. The terminal conclusion is deterministic existence. Methods include first moment, alteration, second moment, local lemma, random greedy algorithms, entropy compression, and concentration.

The central firewall is:

positive probability existence \neq efficient algorithm .

An existence certificate may provide no practical construction unless a constructive liftback is supplied.

Chapter 34 — Random Matrices

34.1 Wigner matrices

A Wigner matrix is a symmetric or Hermitian random matrix whose upper-triangular entries are independent up to symmetry, typically centered and variance-scaled:

𝐻_{𝑖 𝑗} \sim 𝑛^{- 1 / 2} 𝜉_{𝑖 𝑗} .

The $𝑛^{- 1 / 2}$ normalization keeps eigenvalues on an $𝑂 (1)$ scale.

Random matrix theory asks about eigenvalue distributions, extreme eigenvalues, eigenvector delocalization, spacing statistics, and universality.

The entries may be independent, but eigenvalues are strongly dependent nonlinear functions of all entries. The spectral carrier is therefore not a collection of independent scalar variables.

34.2 Sample covariance matrices

Given a data matrix

𝑋 \in 𝑅^{𝑛 \times 𝑝},

the sample covariance matrix is

𝑆 = \frac{1}{𝑛} 𝑋^{⊤} 𝑋 .

When entries are random, one studies the spectrum as $𝑛, 𝑝 \to \infty$ , often with

\frac{𝑝}{𝑛} \to 𝛾 \in (0, \infty) .

In fixed dimension, sample covariance converges to population covariance. In high dimension with $𝑝 / 𝑛$ non-negligible, spectral distortion remains order one. Classical low-dimensional intuition fails.

Sample covariance matrices connect random matrix theory to multivariate statistics, principal component analysis, signal detection, and high-dimensional inference.

34.3 Spectral norm bounds

The spectral norm of a symmetric matrix is

∥ 𝐻 ∥_{o p} = \max_{𝑖} ∣ 𝜆_{𝑖} (𝐻) ∣ .

For random matrices, one seeks bounds such as

∥ 𝐻 ∥_{o p} ≲ \sqrt{𝑛}

for unnormalized iid-entry matrices, or $𝑂 (1)$ under Wigner scaling.

Methods include matrix concentration inequalities, $𝜀$ -nets, trace moments, noncommutative Bernstein inequalities, and resolvent estimates.

Entrywise control is not enough. A matrix may have small entries but large operator norm through coherent alignment. Spectral norm is a global geometric observable.

34.4 Semicircle law

For a normalized Wigner matrix $𝐻_{𝑛}$ , the empirical spectral distribution

𝜇_{𝐻_{𝑛}} = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝛿_{𝜆_{𝑖}}

converges to the semicircle distribution with density

𝜌_{s c} (𝑥) = \frac{1}{2 𝜋} \sqrt{4 - 𝑥^{2}} 1_{[- 2, 2]} (𝑥) .

This is a global spectral law. It describes the bulk distribution of eigenvalues after normalization. It does not by itself determine largest-eigenvalue fluctuations or microscopic spacing.

Proofs use moments or resolvents. The moment method identifies limiting moments with Catalan-number counts of noncrossing pairings.
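The moment identification can be checked numerically: the even moments of the semicircle density should equal the Catalan numbers $1, 1, 2, 5, 14, \dots$ and the odd moments should vanish. The following sketch uses a plain midpoint rule; the step count is an arbitrary illustrative choice.

```python
import math


def semicircle_density(x):
    """Density sqrt(4 - x^2) / (2 pi) on [-2, 2], zero elsewhere."""
    return math.sqrt(max(4.0 - x * x, 0.0)) / (2.0 * math.pi)


def semicircle_moment(k, steps=100_000):
    """Midpoint-rule approximation of the k-th moment of the semicircle law."""
    h = 4.0 / steps
    total = 0.0
    for i in range(steps):
        x = -2.0 + (i + 0.5) * h
        total += x ** k * semicircle_density(x)
    return total * h


def catalan(m):
    """m-th Catalan number, the count of noncrossing pairings of 2m points."""
    return math.comb(2 * m, m) // (m + 1)


for m in range(5):
    print(2 * m, semicircle_moment(2 * m), catalan(m))
```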

34.5 Marchenko–Pastur law

For sample covariance matrices with $𝑝 / 𝑛 \to 𝛾$ , the empirical spectral law converges under standard assumptions to the Marchenko–Pastur distribution. Its absolutely continuous part has density

𝜌_{𝛾} (𝑥) = \frac{\sqrt{(𝑏 - 𝑥) (𝑥 - 𝑎)}}{2 𝜋 𝛾 𝑥} 1_{[𝑎, 𝑏]} (𝑥),

where

𝑎 = (1 - \sqrt{𝛾})^{2}, 𝑏 = (1 + \sqrt{𝛾})^{2},

with an additional atom at zero of mass $1 - 1 / 𝛾$ when $𝛾 > 1$ .

This law describes systematic high-dimensional eigenvalue spread even when the population covariance is identity. Random matrix effects are therefore not negligible sampling noise; they alter the whole spectral carrier.
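As a sanity check on the density above, one can integrate it numerically. Under the stated normalization (identity population covariance), the absolutely continuous part should have total mass $\min (1, 1 / 𝛾)$ and mean $1$ . The sketch below assumes $𝛾 \neq 1$ so the lower edge stays away from zero; the helper names are hypothetical.

```python
import math


def mp_edges(gamma):
    """Support edges a = (1 - sqrt(gamma))^2 and b = (1 + sqrt(gamma))^2."""
    root = math.sqrt(gamma)
    return (1.0 - root) ** 2, (1.0 + root) ** 2


def mp_density(x, gamma):
    """Absolutely continuous part of the Marchenko-Pastur law."""
    a, b = mp_edges(gamma)
    if x <= a or x >= b:
        return 0.0
    return math.sqrt((b - x) * (x - a)) / (2.0 * math.pi * gamma * x)


def mp_integral(gamma, weight=lambda x: 1.0, steps=100_000):
    """Midpoint-rule integral of weight(x) * density(x) over [a, b]."""
    a, b = mp_edges(gamma)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += weight(x) * mp_density(x, gamma)
    return total * h


for gamma in (0.5, 2.0):
    mass = mp_integral(gamma)
    mean = mp_integral(gamma, weight=lambda x: x)
    print(gamma, mass, mean, "atom at zero:", max(0.0, 1.0 - mass))
```

For $𝛾 = 2$ the continuous part carries only half the mass; the remainder is the atom at zero, reflecting the rank deficiency of $𝑆$ when $𝑝 > 𝑛$ .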

34.6 Concentration of measure

Many random matrix observables concentrate because they are Lipschitz functions of high-dimensional random input. Examples include spectral norm, empirical spectral statistics, and singular values.

For Hermitian matrices,

∣ 𝜆_{𝑘} (𝐴) - 𝜆_{𝑘} (𝐵) ∣ \leq ∥ 𝐴 - 𝐵 ∥_{o p}

and Hoffman–Wielandt-type inequalities control aggregate eigenvalue displacement by Frobenius norm.

Concentration converts entry-level independence or log-concavity into spectral stability. But local eigenvalue statistics require much finer tools than global Lipschitz concentration.

34.7 Resolvent method

For a matrix $𝐻$ , define the resolvent

𝐺 (𝑧) = (𝐻 - 𝑧 𝐼)^{- 1}, Im 𝑧 > 0.

The normalized trace

𝑚_{𝑛} (𝑧) = \frac{1}{𝑛} Tr 𝐺 (𝑧)

is the Stieltjes transform of the empirical spectral measure:

𝑚_{𝑛} (𝑧) = \int \frac{1}{𝑥 - 𝑧} 𝜇_{𝐻_{𝑛}} (𝑑 𝑥) .

The resolvent converts spectral questions into analytic matrix identities. Self-consistent equations for $𝑚_{𝑛} (𝑧)$ identify limiting laws. Fine control as $Im 𝑧$ shrinks leads to local semicircle laws.

The resolvent is a carrier transformation:

eigenvalue point process \to analytic function in upper half-plane .
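For the semicircle law, the limiting self-consistent equation is $𝑚 (𝑧)^{2} + 𝑧 𝑚 (𝑧) + 1 = 0$ with $Im 𝑚 (𝑧) > 0$ , under the sign convention $1 / (𝑥 - 𝑧)$ used above. The sketch below compares the algebraic root with a direct numerical integral of the semicircle density; function names and the quadrature step count are illustrative assumptions.

```python
import cmath
import math


def semicircle_stieltjes(z):
    """Root of m^2 + z m + 1 = 0 with positive imaginary part (requires Im z > 0)."""
    root = cmath.sqrt(z * z - 4.0)
    m1 = (-z + root) / 2.0
    m2 = (-z - root) / 2.0
    # The two roots multiply to 1, so exactly one lies in the upper half-plane.
    return m1 if m1.imag > 0 else m2


def stieltjes_by_quadrature(z, steps=100_000):
    """Midpoint-rule value of the integral of rho_sc(x) / (x - z) over [-2, 2]."""
    h = 4.0 / steps
    total = 0j
    for i in range(steps):
        x = -2.0 + (i + 0.5) * h
        rho = math.sqrt(4.0 - x * x) / (2.0 * math.pi)
        total += rho / (x - z)
    return total * h


z = 0.3 + 0.5j
print(semicircle_stieltjes(z), stieltjes_by_quadrature(z))
```

The quadrature becomes less reliable as $Im 𝑧$ shrinks toward the real axis, which mirrors the analytic difficulty of local laws.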

34.8 Trace method

The trace method studies moments

\frac{1}{𝑛} 𝐸 Tr (𝐻^{𝑘}) = \frac{1}{𝑛} \sum_{𝑖_{1}, \dots, 𝑖_{𝑘}} 𝐸 [𝐻_{𝑖_{1} 𝑖_{2}} \dots 𝐻_{𝑖_{𝑘} 𝑖_{1}}] .

Independence and centering eliminate most index patterns. Surviving walks correspond to combinatorial structures such as pairings and trees.

For fixed $𝑘$ , moments identify global spectral laws. For growing $𝑘$ , the method can bound extreme eigenvalues. However, high moments create complicated combinatorics and require careful control.

34.9 Universality preview

Universality means that many spectral statistics depend only weakly on the detailed entry distribution. After normalization, local eigenvalue spacing or edge fluctuations can converge to the same limit for broad classes of matrix ensembles.

Global universality is comparatively coarse: semicircle or Marchenko–Pastur laws require limited moment assumptions. Local universality is much sharper and needs stronger control, though modern methods often permit broad distributions.

The correct statement must specify scale:

global density \neq mesoscopic statistics \neq local spacing \neq edge fluctuations .

34.10 Free probability preview

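Large independent random matrices often become asymptotically free under suitable invariance or independence conditions. This means that normalized traces of alternating products vanish asymptotically whenever each factor $𝑃_{𝑖} (𝐴_{𝑛})$ , $𝑄_{𝑖} (𝐵_{𝑛})$ is centered, that is, has normalized trace tending to zero: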

\frac{1}{𝑛} Tr (𝑃_{1} (𝐴_{𝑛}) 𝑄_{1} (𝐵_{𝑛}) \dots 𝑃_{𝑘} (𝐴_{𝑛}) 𝑄_{𝑘} (𝐵_{𝑛})) \to 0.

Free probability then predicts spectral laws of sums and products through free convolution. It is an algebraic transport from matrix asymptotics to noncommutative probability.

The word “free” does not mean classically independent entries. It describes a limiting noncommutative factorization law.

34.11 Random matrices in statistics and physics

In statistics, random matrices govern high-dimensional covariance estimation, PCA, canonical correlation, spiked models, and noise spectra. The classical assumption $𝑝$ fixed while $𝑛 \to \infty$ can fail when $𝑝 / 𝑛$ remains positive.

In physics, random matrices model complex spectra, quantum chaotic systems, disordered media, transport, and interacting systems. The reason is not that every Hamiltonian is literally random, but that spectral statistics may be universal under limited structural information.

Applications require a model-liftback audit. A random-matrix ensemble can capture spectral statistics without reproducing all microscopic mechanisms.

Part X — Ergodic, Information, and Statistical Probability

Chapter 35 — Ergodic Theory and Probability

35.1 Measure-preserving systems

A measure-preserving dynamical system is

(Ω, 𝐹, 𝑃, 𝑇),

where $𝑇 : Ω \to Ω$ is measurable and

𝑃 (𝑇^{- 1} 𝐴) = 𝑃 (𝐴)

for every $𝐴 \in 𝐹$ .

The orbit of $𝜔$ is

𝜔, 𝑇 𝜔, 𝑇^{2} 𝜔, \dots

and an observable $𝑓$ generates a stationary sequence

𝑋_{𝑛} (𝜔) = 𝑓 (𝑇^{𝑛} 𝜔) .

Measure preservation says the ensemble law remains invariant under evolution. It does not imply individual trajectories explore the whole state space, nor that time averages equal ensemble averages. Those require stronger gates.

35.2 Stationarity

A process $(𝑋_{𝑛})$ is strictly stationary if

(𝑋_{𝑛_{1}}, \dots, 𝑋_{𝑛_{𝑘}}) \overset{d}{=} (𝑋_{𝑛_{1} + ℎ}, \dots, 𝑋_{𝑛_{𝑘} + ℎ})

for all finite index selections and allowed shifts $ℎ$ .

A measure-preserving system produces stationary processes by observation:

𝑋_{𝑛} = 𝑓 \circ 𝑇^{𝑛} .

Stationarity is distributional time-translation symmetry. It permits long-run analysis but does not imply independence, mixing, or ergodicity. A mixture of two stationary ergodic processes can be stationary but nonergodic.

35.3 Ergodicity

A measure-preserving transformation $𝑇$ is ergodic if every invariant event

𝑇^{- 1} 𝐴 = 𝐴

modulo null sets has probability $0$ or $1$ .

Equivalent formulations include absence of nonconstant invariant $𝐿^{2}$ functions:

𝑓 \circ 𝑇 = 𝑓 \Rightarrow 𝑓 constant a.s.

Ergodicity means the system cannot be decomposed into nontrivial invariant probabilistic components. It is not randomness and does not require mixing. Deterministic systems can be ergodic.

35.4 Birkhoff ergodic theorem

For a measure-preserving transformation $𝑇$ and $𝑓 \in 𝐿^{1}$ ,

\frac{1}{𝑛} \sum_{𝑘 = 0}^{𝑛 - 1} 𝑓 (𝑇^{𝑘} 𝜔) \to 𝐸 [𝑓 ∣ 𝐼] (𝜔)

almost surely, where $𝐼$ is the invariant σ-algebra.

If $𝑇$ is ergodic, then

𝐸 [𝑓 ∣ 𝐼] = 𝐸 𝑓,

\frac{1}{𝑛} \sum_{𝑘 = 0}^{𝑛 - 1} 𝑓 (𝑇^{𝑘} 𝜔) \to 𝐸 𝑓

almost surely.

The theorem does not say every trajectory samples every state uniformly. It says time averages of integrable observables converge to invariant-component averages.
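A deterministic illustration: the irrational rotation $𝑇 𝑥 = 𝑥 + 𝛼 \mod 1$ preserves Lebesgue measure on $[0, 1)$ and is ergodic, so time averages of an integrable observable converge to its integral. The sketch below uses the golden-ratio rotation and the indicator of $[0, 1 / 4)$ ; both choices are illustrative assumptions.

```python
import math

ALPHA = (math.sqrt(5.0) - 1.0) / 2.0  # irrational rotation angle (assumed example)


def rotation_time_average(f, x0, n, alpha=ALPHA):
    """Average f along the orbit x0, x0 + alpha, ..., x0 + (n - 1) alpha (mod 1)."""
    total = 0.0
    for k in range(n):
        # Compute each orbit point directly to avoid accumulating rounding drift.
        total += f((x0 + k * alpha) % 1.0)
    return total / n


def indicator_quarter(x):
    """Indicator of [0, 1/4); its ensemble average under Lebesgue measure is 1/4."""
    return 1.0 if x < 0.25 else 0.0


for n in (100, 10_000, 100_000):
    print(n, rotation_time_average(indicator_quarter, 0.1, n))
```

Nothing in this orbit is random, yet the time average matches the ensemble average, consistent with the remark that deterministic systems can be ergodic.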

35.5 Von Neumann mean ergodic theorem

Let $𝑈$ be the Koopman operator

𝑈 𝑓 = 𝑓 \circ 𝑇

on $𝐿^{2}$ . Since $𝑇$ is measure-preserving, $𝑈$ is an isometry. The mean ergodic theorem states

\frac{1}{𝑛} \sum_{𝑘 = 0}^{𝑛 - 1} 𝑈^{𝑘} 𝑓 \to 𝑃_{I n v} 𝑓

in $𝐿^{2}$ , where $𝑃_{I n v}$ is orthogonal projection onto the invariant subspace.

This is a Hilbert-space theorem. It gives mean convergence, whereas Birkhoff gives pointwise almost sure convergence under $𝐿^{1}$ assumptions.

The theorem exposes the operator-theoretic carrier:

dynamics \to unitary/isometric operator \to projection onto invariant modes .

35.6 Mixing conditions

Mixing is stronger than ergodicity. Strong mixing can be expressed as

𝑃 (𝐴 \cap 𝑇^{- 𝑛} 𝐵) \to 𝑃 (𝐴) 𝑃 (𝐵) .

This says distant-time events become asymptotically independent.

Other notions include weak mixing, $𝛼$ -mixing, $𝜙$ -mixing, $𝜌$ -mixing, and absolute regularity. They differ in which event/function classes are controlled and at what rate.

Mixing rates influence CLTs, concentration, empirical-process convergence, and statistical estimation. Saying “the process mixes” is incomplete without declaring the mixing notion and rate.

35.7 Ergodic decomposition

A stationary probability measure can often be decomposed into ergodic components:

𝑃 = \int 𝑃_{𝜃} 𝜈 (𝑑 𝜃),

where each $𝑃_{𝜃}$ is ergodic.

The invariant σ-algebra identifies the latent component. Time averages converge to component-specific expectations:

\frac{1}{𝑛} \sum_{𝑘 = 0}^{𝑛 - 1} 𝑓 (𝑇^{𝑘} 𝜔) \to \int 𝑓 𝑑 𝑃_{Θ (𝜔)} .

Thus nonergodicity does not necessarily mean time averages fail to converge. They may converge to random limits determined by the invariant component. What fails is not convergence itself but collapse of the limit to a single universal ensemble mean.
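A small simulation makes the point concrete. Assume a hypothetical two-component mixture: a latent parameter $𝜃$ is drawn once from $\{0.2, 0.8\}$ with equal probability, and then the path is iid Bernoulli $(𝜃)$ . The process is stationary with ensemble mean $0.5$ , but each path's time average converges to its own $𝜃$ .

```python
import random


def sample_mixture_path(n, rng, thetas=(0.2, 0.8)):
    """Draw the latent component once, then an iid Bernoulli(theta) path."""
    theta = rng.choice(thetas)
    path = [1 if rng.random() < theta else 0 for _ in range(n)]
    return theta, path


def time_average(path):
    """Empirical mean along one trajectory."""
    return sum(path) / len(path)


rng = random.Random(1)
for _ in range(4):
    theta, path = sample_mixture_path(20_000, rng)
    print("component", theta, "time average", time_average(path))
```

Each printed time average sits near $0.2$ or $0.8$ , never near the ensemble mean $0.5$ .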

35.8 Time average versus ensemble average

The time average of an observable along one trajectory is

{\overline{𝑓}}_{𝑛} (𝜔) = \frac{1}{𝑛} \sum_{𝑘 = 0}^{𝑛 - 1} 𝑓 (𝑇^{𝑘} 𝜔) .

The ensemble average is

𝐸 𝑓 = \int 𝑓 𝑑 𝑃 .

Birkhoff gives

{\overline{𝑓}}_{𝑛} \to 𝐸 [𝑓 ∣ 𝐼] .

Only under ergodicity does this become $𝐸 𝑓$ .

Thus the equation

time average = ensemble average

is not a definition of probability and not an automatic consequence of stationarity. It is a theorem requiring invariant-component triviality for the chosen carrier.

35.9 Ergodicity breaking

Ergodicity breaking means the invariant σ-algebra is nontrivial or the system decomposes into dynamically distinct components. Different trajectories can have different long-run averages:

\lim_{𝑛 \to \infty} {\overline{𝑓}}_{𝑛} (𝜔) = 𝑔 (𝜔),

with nonconstant invariant $𝑔$ .

In some applied literature, “ergodicity breaking” is used more broadly for aging, nonstationarity, metastability, or finite-time trapping. These are distinct mechanisms. A rigorous analysis must specify whether failure is due to nonergodic invariant decomposition, insufficient observation time, nonstationarity, or nonexistence of relevant averages.

35.10 Probabilistic interpretation errors

A common error is interpreting stationarity as ergodicity. Another is interpreting ergodicity as independence. A third is treating ensemble averages as automatically observable from one trajectory.

The correct hierarchy is

stationary \nRightarrow ergodic,

ergodic \nRightarrow mixing,

and

mixing \nRightarrow independence at finite lag .

The carrier determines the theorem. Time sampling, ensemble sampling, and repeated independent experiments are different experimental structures.

Chapter 36 — Information-Theoretic Probability

36.1 Entropy

For a discrete random variable $𝑋$ with mass function $𝑝 (𝑥)$ , Shannon entropy is

𝐻 (𝑋) = - \sum_{𝑥} 𝑝 (𝑥) \log 𝑝 (𝑥) .

It measures uncertainty or coding cost in units set by the logarithm base: bits for base 2, nats for base $𝑒$ .

For a pair $(𝑋, 𝑌)$ ,

𝐻 (𝑋, 𝑌) = - \sum_{𝑥, 𝑦} 𝑝 (𝑥, 𝑦) \log 𝑝 (𝑥, 𝑦) .

Conditional entropy is

𝐻 (𝑋 ∣ 𝑌) = 𝐻 (𝑋, 𝑌) - 𝐻 (𝑌) .

Entropy is law-level. It does not measure one realized sample's surprise directly; the pointwise quantity is

- \log 𝑝 (𝑋),

whose expectation is entropy.

Differential entropy for continuous variables,

ℎ (𝑋) = - \int 𝑓 (𝑥) \log 𝑓 (𝑥) 𝑑 𝑥,

behaves differently and is not invariant under coordinate change.

36.2 Relative entropy

The relative entropy of $𝑃$ with respect to $𝑄$ is

𝐷 (𝑃 ∥ 𝑄) = \int \log (\frac{𝑑 𝑃}{𝑑 𝑄}) 𝑑 𝑃

when $𝑃 ≪ 𝑄$ , and $+ \infty$ otherwise.

For discrete laws,

𝐷 (𝑃 ∥ 𝑄) = \sum_{𝑥} 𝑝 (𝑥) \log \frac{𝑝 (𝑥)}{𝑞 (𝑥)} .

Relative entropy is nonnegative:

𝐷 (𝑃 ∥ 𝑄) \geq 0,

with equality iff $𝑃 = 𝑄$ under standard identification. It is not symmetric and is not a metric.

Relative entropy measures change-of-measure cost, coding inefficiency, statistical distinguishability, and large-deviation cost. It is the rate function in Sanov's theorem.

36.3 Mutual information

Mutual information is

𝐼 (𝑋; 𝑌) = 𝐷 (𝑃_{𝑋, 𝑌} ∥ 𝑃_{𝑋} \otimes 𝑃_{𝑌}) .

Thus

𝐼 (𝑋; 𝑌) \geq 0,

with equality iff $𝑋$ and $𝑌$ are independent.

For discrete variables,

𝐼 (𝑋; 𝑌) = 𝐻 (𝑋) + 𝐻 (𝑌) - 𝐻 (𝑋, 𝑌) = 𝐻 (𝑋) - 𝐻 (𝑋 ∣ 𝑌) .

Mutual information measures dependence at the law level. Unlike covariance, it detects nonlinear dependence in general. It is invariant under invertible, bimeasurable transformations applied separately to $𝑋$ and $𝑌$ .
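The difference from covariance can be seen in a three-point example, chosen here for illustration: let $𝑋$ be uniform on $\{- 1, 0, 1\}$ and $𝑌 = 𝑋^{2}$ . The covariance is zero, yet $𝑌$ is a function of $𝑋$ , so the mutual information equals $𝐻 (𝑌) > 0$ . The sketch below works in nats and represents laws as dictionaries.

```python
import math


def entropy(pmf):
    """Shannon entropy in nats of a pmf given as {outcome: probability}."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def marginals(joint):
    """Marginal pmfs of a joint pmf given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def mutual_information(joint):
    """I(X; Y) as the relative entropy of the joint law to the product of marginals."""
    px, py = marginals(joint)
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)


def covariance(joint):
    """Cov(X, Y) for numeric outcomes."""
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    return exy - ex * ey


joint = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}  # Y = X^2
print(covariance(joint), mutual_information(joint))
```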

36.4 Pinsker inequality

Pinsker's inequality relates total variation distance to relative entropy:

∥ 𝑃 - 𝑄 ∥_{T V} \leq \sqrt{\frac{1}{2} 𝐷 (𝑃 ∥ 𝑄)}

under natural logarithm convention.

Thus small relative entropy guarantees small discrepancy of event probabilities:

∣ 𝑃 (𝐴) - 𝑄 (𝐴) ∣ \leq \sqrt{\frac{1}{2} 𝐷 (𝑃 ∥ 𝑄)} .

The converse is false without further restrictions. Two laws can be close in total variation while relative entropy is large due to behavior on small-probability regions.

Pinsker is a transport from information divergence to probabilistic approximation.
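A short numerical check of Pinsker's inequality, of the asymmetry of relative entropy, and of the failure of the converse follows. The example laws are arbitrary illustrative choices; total variation is computed as half the $ℓ^{1}$ distance, which equals the supremum over events for discrete laws.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in nats for pmfs given as dicts; +inf if P is not dominated by Q."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def total_variation(p, q):
    """sup_A |P(A) - Q(A)|, computed as half the l1 distance."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


P = {0: 0.5, 1: 0.5}
Q = {0: 0.9, 1: 0.1}
print(total_variation(P, Q), math.sqrt(0.5 * kl_divergence(P, Q)))
print(kl_divergence(P, Q), kl_divergence(Q, P))  # not symmetric

# Converse fails: tiny total variation, infinite relative entropy.
P_small = {0: 0.99, 1: 0.01}
Q_point = {0: 1.0}
print(total_variation(P_small, Q_point), kl_divergence(P_small, Q_point))
```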

36.5 Data processing inequality

If

𝑋 \to 𝑌 \to 𝑍

forms a Markov chain, then

𝐼 (𝑋; 𝑍) \leq 𝐼 (𝑋; 𝑌) .

Processing data cannot increase information about the original source.

More generally, for a measurable map or Markov kernel $𝐾$ ,

𝐷 (𝑃 𝐾 ∥ 𝑄 𝐾) \leq 𝐷 (𝑃 ∥ 𝑄) .

This is contraction of information divergence under stochastic transport.

The theorem formalizes loss under coarse-graining. A statistic cannot contain more information about a parameter than the full data from which it was computed.
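The divergence form can be checked on a finite example. The sketch below pushes two input laws through a binary symmetric channel with flip probability $0.2$ (an assumed example kernel) and compares relative entropies before and after.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in nats for pmfs given as dicts; +inf if P is not dominated by Q."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def push_forward(p, kernel):
    """Apply a Markov kernel {x: {y: K(x, y)}} to a pmf p, returning the law PK."""
    out = {}
    for x, px in p.items():
        for y, kxy in kernel[x].items():
            out[y] = out.get(y, 0.0) + px * kxy
    return out


# Hypothetical kernel: binary symmetric channel with flip probability 0.2.
BSC = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
P = {0: 0.5, 1: 0.5}
Q = {0: 0.9, 1: 0.1}
print(kl_divergence(P, Q), kl_divergence(push_forward(P, BSC), push_forward(Q, BSC)))
```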

36.6 Typical sets

For iid $𝑋_{1}, \dots, 𝑋_{𝑛}$ with entropy $𝐻 (𝑋)$ , the asymptotic equipartition property says

- \frac{1}{𝑛} \log 𝑃 (𝑋_{1}, \dots, 𝑋_{𝑛}) \to 𝐻 (𝑋)

almost surely under standard discrete assumptions.

Thus typical sequences have probabilities approximately

𝑒^{- 𝑛 𝐻 (𝑋)}

and the number of typical sequences is approximately

𝑒^{𝑛 𝐻 (𝑋)} .

Typical sets reconcile probability concentration with combinatorial counting. Almost all mass lies in an exponentially large set whose members have approximately equal exponential probability.
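The equipartition property can be observed in simulation. The sketch below draws an iid Bernoulli $(0.3)$ sequence (an arbitrary illustrative source), evaluates $- \frac{1}{𝑛} \log 𝑃 (𝑋_{1}, \dots, 𝑋_{𝑛})$ in nats, and compares it with the entropy.

```python
import math
import random


def bernoulli_entropy(p):
    """Entropy of Bernoulli(p) in nats."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def normalized_log_likelihood(sample, p):
    """-(1/n) log P(x_1, ..., x_n) for an iid Bernoulli(p) model."""
    ones = sum(sample)
    zeros = len(sample) - ones
    return -(ones * math.log(p) + zeros * math.log(1 - p)) / len(sample)


rng = random.Random(0)
p = 0.3
for n in (100, 10_000):
    sample = [1 if rng.random() < p else 0 for _ in range(n)]
    print(n, normalized_log_likelihood(sample, p), bernoulli_entropy(p))
```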

36.7 Shannon source coding theorem

For a discrete memoryless source with entropy $𝐻 (𝑋)$ , lossless coding below rate $𝐻 (𝑋)$ is asymptotically impossible, while rates above $𝐻 (𝑋)$ are achievable with vanishing error.

The theorem is built from typical sets. Since there are approximately

2^{𝑛 𝐻 (𝑋)}

typical length- $𝑛$ sequences when entropy is measured in bits, one needs roughly $𝑛 𝐻 (𝑋)$ bits to index them.

Entropy is therefore operational: it is the asymptotic compression threshold. The result is not that every sequence can be compressed to entropy length; atypical sequences exist. The theorem is probabilistic and asymptotic.

36.8 Channel coding preview

A communication channel is a Markov kernel

𝑃_{𝑌 ∣ 𝑋} .

For an input law $𝑃_{𝑋}$ , the mutual information is

𝐼 (𝑋; 𝑌) .

The channel capacity is

𝐶 = \sup_{𝑃_{𝑋}} 𝐼 (𝑋; 𝑌) .

The channel coding theorem states that communication rates below $𝐶$ are achievable with vanishing error under the channel model, while rates above $𝐶$ are impossible.

The probability carrier includes codebook, message distribution, channel law, decoder, and block length. Capacity is an asymptotic information-transport limit, not a claim about zero-error communication at finite block length.

36.9 Entropy method in probability

The entropy method derives concentration inequalities by bounding entropy of exponential transforms. For a random variable $𝑍$ ,

Ent (𝑒^{𝜆 𝑍}) = 𝐸 [𝑒^{𝜆 𝑍} 𝜆 𝑍] - 𝐸 [𝑒^{𝜆 𝑍}] \log 𝐸 [𝑒^{𝜆 𝑍}] .

Tensorization allows entropy of functions of independent variables to be decomposed or bounded by coordinatewise contributions. Differential inequalities for

\log 𝐸 [𝑒^{𝜆 (𝑍 - 𝐸 𝑍)}]

then yield concentration through Chernoff optimization.

This method connects log-Sobolev inequalities, bounded differences, self-bounding functions, and Gaussian concentration.

36.10 KL divergence as transport cost

Relative entropy can be interpreted as the cost of changing one law into another. In exponential tilting,

\frac{𝑑 𝑄_{𝜃}}{𝑑 𝑃} = 𝑒^{𝜃 𝑋 - Λ (𝜃)},

and

𝐷 (𝑄_{𝜃} ∥ 𝑃) = 𝜃 𝐸_{𝑄_{𝜃}} [𝑋] - Λ (𝜃) .

The large-deviation rate function can be written variationally as the minimum entropy cost among laws forcing a constraint:

𝐼 (𝑎) = \inf_{𝜈 : \int 𝑥 𝑑 𝜈 = 𝑎} 𝐷 (𝜈 ∥ 𝜇) .

This makes rare events optimization problems over alternative probability laws. The least expensive law under which the rare event becomes typical controls the exponential probability.
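For a Bernoulli $(𝑝)$ base law the tilting identities can be verified in closed form: the tilted law is Bernoulli $(𝑞)$ with $𝑞 = 𝑝 𝑒^{𝜃} / (1 - 𝑝 + 𝑝 𝑒^{𝜃})$ , and $𝜃 𝑞 - Λ (𝜃)$ should equal the Bernoulli relative entropy of $𝑞$ from $𝑝$ . The sketch below checks this; the parameter values are arbitrary illustrative choices.

```python
import math


def log_mgf(theta, p):
    """Lambda(theta) = log E[exp(theta X)] for X ~ Bernoulli(p)."""
    return math.log(1.0 - p + p * math.exp(theta))


def tilted_mean(theta, p):
    """Mean of X under the exponentially tilted law Q_theta."""
    return p * math.exp(theta) / (1.0 - p + p * math.exp(theta))


def tilt_cost(theta, p):
    """theta * E_Q[X] - Lambda(theta), the claimed value of D(Q_theta || P)."""
    return theta * tilted_mean(theta, p) - log_mgf(theta, p)


def kl_bernoulli(q, p):
    """D(Bernoulli(q) || Bernoulli(p)) in nats, for 0 < q, p < 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))


p, theta = 0.3, 1.2
q = tilted_mean(theta, p)
print(q, tilt_cost(theta, p), kl_bernoulli(q, p))
```

The common value is also the Cramér rate $𝐼 (𝑞)$ for the sample mean of Bernoulli $(𝑝)$ variables, since the tilt with mean $𝑞$ is the cheapest law under which $𝑞$ becomes typical.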

36.11 Large deviations and entropy

Sanov's theorem states

𝑃 (𝐿_{𝑛} \approx 𝜈) ≍ 𝑒^{- 𝑛 𝐷 (𝜈 ∥ 𝜇)},

where $𝐿_{𝑛}$ is the empirical measure and $𝜇$ the true iid law.

Cramér's theorem follows by contraction:

𝐼 (𝑎) = \inf_{𝜈 : \int 𝑥 𝑑 𝜈 = 𝑎} 𝐷 (𝜈 ∥ 𝜇) .

Thus entropy is not only a coding quantity. It is a geometric action governing empirical-law deviations. Information theory and large deviation theory share the same convex-duality carrier.

Chapter 37 — Statistical Inference as Probability Liftback

37.1 Statistical models

A statistical model is a family of probability laws

𝑃 = {𝑃_{𝜃} : 𝜃 \in Θ} .

The observed data $𝑋$ are modeled as sampled from one unknown law $𝑃_{𝜃}$ .

The parameter $𝜃$ may be finite-dimensional, infinite-dimensional, structural, or merely an index. A model specifies which distributions are considered possible; inference attempts to recover features of the generating law from data.

Model specification precedes estimation. If the true data law lies outside $𝑃$ , the problem becomes misspecified inference, not ordinary parameter estimation.

37.2 Sampling distributions

A statistic is a measurable function

𝑇 = 𝑇 (𝑋_{1}, \dots, 𝑋_{𝑛}) .

Its sampling distribution under parameter $𝜃$ is

𝐿_{𝜃} (𝑇) .

Inference procedures are calibrated through these distributions. Standard errors, confidence intervals, test critical values, and asymptotic approximations all depend on the sampling law.

The observed statistic is one realized value. Its sampling distribution describes hypothetical repetition under the model. Confusing realized data with the random statistic is the same variable/value category error encountered in elementary probability.

37.3 Likelihood

Given an observed value $𝑥$ , the likelihood is

𝐿 (𝜃; 𝑥) \propto 𝑝_{𝜃} (𝑥),

viewed as a function of $𝜃$ . For dominated models,

𝑝_{𝜃} = \frac{𝑑 𝑃_{𝜃}}{𝑑 𝜇}

relative to a common dominating measure $𝜇$ .

Likelihood is not a probability distribution over $𝜃$ unless a prior and normalization are added. Multiplying $𝐿 (𝜃; 𝑥)$ by a factor depending only on $𝑥$ does not change likelihood comparisons.

Maximum likelihood estimation chooses

\hat{𝜃} \in \arg \max_{𝜃 \in Θ} 𝐿 (𝜃; 𝑥) .

Existence and uniqueness of the maximizer are separate gates.

37.4 Sufficiency

A statistic $𝑇 (𝑋)$ is sufficient for $𝜃$ if the conditional distribution of the full data given $𝑇$ does not depend on $𝜃$ . Intuitively, $𝑇$ retains all model-relevant information about $𝜃$ .

The factorization theorem gives a practical criterion in dominated models:

𝑝_{𝜃} (𝑥) = 𝑔_{𝜃} (𝑇 (𝑥)) ℎ (𝑥) .

Sufficiency is a carrier compression:

𝑋 \to 𝑇 (𝑋)

that preserves inferential information for the parameter family. It does not mean $𝑇$ preserves every property of the sample; only parameter-relevant likelihood information.

37.5 Exponential families

A regular exponential family has density

𝑝_{𝜃} (𝑥) = ℎ (𝑥) \exp (𝜂 (𝜃)^{⊤} 𝑇 (𝑥) - 𝐴 (𝜃)) .

Here $𝑇 (𝑥)$ is the sufficient statistic, $𝜂 (𝜃)$ the natural parameter, and $𝐴 (𝜃)$ the log-partition function.

In the natural parametrization, where $𝐴$ is viewed as a function of $𝜂$ , derivatives of $𝐴$ encode moments:

\nabla 𝐴 (𝜂) = 𝐸_{𝜂} [𝑇 (𝑋)],

\nabla^{2} 𝐴 (𝜂) = {Cov}_{𝜂} (𝑇 (𝑋)) .

Exponential families connect convex analysis, entropy, maximum likelihood, Bayesian conjugacy, and information geometry.
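The moment identities can be checked on the Bernoulli family, written with natural parameter $𝜂 = \log (𝑝 / (1 - 𝑝))$ , sufficient statistic $𝑇 (𝑥) = 𝑥$ , and log-partition function $𝐴 (𝜂) = \log (1 + 𝑒^{𝜂})$ . The sketch below compares finite-difference derivatives of $𝐴$ with the exact mean and variance; the step size `h` is an illustrative assumption.

```python
import math


def log_partition(eta):
    """A(eta) = log(1 + exp(eta)) for the Bernoulli family."""
    return math.log1p(math.exp(eta))


def bernoulli_mean_and_variance(eta):
    """Exact mean p and variance p (1 - p), with p = 1 / (1 + exp(-eta))."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p, p * (1.0 - p)


def numerical_derivatives(f, x, h=1e-4):
    """Central finite differences for the first and second derivative of f at x."""
    first = (f(x + h) - f(x - h)) / (2.0 * h)
    second = (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)
    return first, second


eta = 0.7
print(numerical_derivatives(log_partition, eta))
print(bernoulli_mean_and_variance(eta))
```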

37.6 Estimators

An estimator is a measurable function of data:

{\hat{𝜃}}_{𝑛} = 𝑇_{𝑛} (𝑋_{1}, \dots, 𝑋_{𝑛}) .

Its quality may be evaluated by bias, variance, mean squared error, risk, consistency, asymptotic distribution, robustness, or minimax criteria.

An estimator is itself a random variable before data realization. Different estimators trade bias against variance and local efficiency against robustness.

The inference problem is not simply to produce a number. It is to certify behavior of the random mapping from samples to estimates under a declared model class.

37.7 Bias and variance

Bias is

{Bias}_{𝜃} (\hat{𝜃}) = 𝐸_{𝜃} [\hat{𝜃}] - 𝜃 .

Variance is

{Var}_{𝜃} (\hat{𝜃}) .

Mean squared error decomposes as

𝐸_{𝜃} [(\hat{𝜃} - 𝜃)^{2}] = {Var}_{𝜃} (\hat{𝜃}) + {Bias}_{𝜃} (\hat{𝜃})^{2} .

Unbiasedness is not universally optimal. A slightly biased estimator can have much smaller MSE. The relevant criterion depends on loss function and parameter domain.

In high-dimensional inference, regularization deliberately introduces bias to control variance and instability.
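A classical instance, stated here under the assumption of iid Gaussian data: for estimators of $𝜎^{2}$ of the form $\frac{1}{𝑐} \sum_{𝑖} (𝑋_{𝑖} - \overline{𝑋})^{2}$ , the sum of squares divided by $𝜎^{2}$ is chi-squared with $𝑛 - 1$ degrees of freedom, so bias and variance are available in closed form. The sketch below evaluates the decomposition for the divisors $𝑛 - 1$ , $𝑛$ , and $𝑛 + 1$ .

```python
def scaled_variance_estimator_mse(n, c, sigma2=1.0):
    """Exact (bias, variance, MSE) of SS / c for iid Gaussian data.

    SS is the centered sum of squares; SS / sigma2 is chi-squared with n - 1
    degrees of freedom, so E[SS] = (n - 1) sigma2 and Var(SS) = 2 (n - 1) sigma2^2.
    """
    bias = ((n - 1) / c - 1.0) * sigma2
    variance = 2.0 * (n - 1) * sigma2 ** 2 / c ** 2
    return bias, variance, variance + bias ** 2


n = 10
for c in (n - 1, n, n + 1):
    print(c, scaled_variance_estimator_mse(n, c))
```

The divisor $𝑛 - 1$ is the only unbiased choice among the three, yet in this Gaussian setting it has the largest mean squared error of them.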

37.8 Consistency

An estimator is consistent if

{\hat{𝜃}}_{𝑛} \to 𝜃

in probability under $𝑃_{𝜃}$ . Strong consistency uses almost sure convergence.

Consistency is a large-sample property. It does not quantify finite-sample error or convergence rate. An estimator can be consistent but practically poor at relevant sample sizes.

Consistency proofs often use uniform LLNs and identifiability:

𝑀_{𝑛} (𝜃) \to 𝑀 (𝜃)

uniformly, with $𝑀$ uniquely optimized at the true parameter. Then argmax or argmin continuity transports criterion convergence to parameter convergence.

37.9 Asymptotic normality

Many regular estimators satisfy

\sqrt{𝑛} ({\hat{𝜃}}_{𝑛} - 𝜃_{0}) \Rightarrow 𝑁 (0, 𝑉) .

For M-estimators, a typical expansion is

0 = Ψ_{𝑛} ({\hat{𝜃}}_{𝑛}) \approx Ψ_{𝑛} (𝜃_{0}) + 𝐴 ({\hat{𝜃}}_{𝑛} - 𝜃_{0}),

leading to

\sqrt{𝑛} ({\hat{𝜃}}_{𝑛} - 𝜃_{0}) \approx - 𝐴^{- 1} \sqrt{𝑛} Ψ_{𝑛} (𝜃_{0}) .

A CLT for the score or estimating equation then yields asymptotic normality.

This route requires differentiability, identifiability, nonsingular derivative matrix, and stochastic remainder control. Writing a Taylor expansion is not enough; the remainder must be negligible at $𝑛^{- 1 / 2}$ scale.

37.10 Fisher information

For a smooth parametric model with score

𝑆_{𝜃} (𝑋) = \frac{\partial}{\partial 𝜃} \log 𝑝_{𝜃} (𝑋),

Fisher information is

𝐼 (𝜃) = 𝐸_{𝜃} [𝑆_{𝜃} (𝑋)^{2}]

in one dimension, or

𝐼 (𝜃) = 𝐸_{𝜃} [𝑆_{𝜃} 𝑆_{𝜃}^{⊤}]

in multiple dimensions.

Under regularity,

𝐼 (𝜃) = - 𝐸_{𝜃} [\frac{\partial^{2}}{\partial 𝜃^{2}} \log 𝑝_{𝜃} (𝑋)] .

Fisher information measures local curvature of likelihood and local distinguishability of nearby parameter laws.

37.11 Cramér–Rao bound

For an unbiased estimator $𝑇$ of a scalar parameter $𝜃$ , under regularity conditions,

{Var}_{𝜃} (𝑇) \geq \frac{1}{𝐼_{𝑛} (𝜃)},

where for iid samples

𝐼_{𝑛} (𝜃) = 𝑛 𝐼_{1} (𝜃) .

The bound follows from covariance between estimator and score plus Cauchy–Schwarz. It is a local lower bound under a specific model and regularity class.

It does not imply every estimator must obey the same finite-sample bound under bias, irregular models, boundary parameters, or different loss structures. The hypothesis ledger is essential.
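For the Bernoulli $(𝑝)$ model everything can be computed by exact summation over $𝑥 \in \{0, 1\}$ . The sketch below checks that the two expressions for Fisher information agree, and that the sample mean, which is unbiased, attains the bound $1 / (𝑛 𝐼_{1} (𝑝))$ . Function names are illustrative.

```python
def bernoulli_fisher_information(p):
    """Return (E[score^2], -E[second derivative of log-likelihood]) for Bernoulli(p)."""
    score_sq = 0.0
    neg_hessian = 0.0
    for x, prob in ((1, p), (0, 1.0 - p)):
        score = x / p - (1 - x) / (1.0 - p)            # d/dp log p_theta(x)
        second = -x / p ** 2 - (1 - x) / (1.0 - p) ** 2  # d^2/dp^2 log p_theta(x)
        score_sq += prob * score ** 2
        neg_hessian -= prob * second
    return score_sq, neg_hessian


def cramer_rao_bound(p, n):
    """Lower bound 1 / (n I_1(p)) for unbiased estimators from n iid samples."""
    info, _ = bernoulli_fisher_information(p)
    return 1.0 / (n * info)


def sample_mean_variance(p, n):
    """Exact variance of the sample mean of n iid Bernoulli(p) variables."""
    return p * (1.0 - p) / n


p, n = 0.3, 50
print(bernoulli_fisher_information(p))
print(cramer_rao_bound(p, n), sample_mean_variance(p, n))
```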

37.12 Hypothesis testing

A test is a measurable decision rule

𝜑 (𝑋) \in [0, 1]

or ${0, 1}$ . For null $𝐻_{0}$ and alternative $𝐻_{1}$ , type I error is

𝑃_{𝐻_{0}} (reject),

and type II error is

𝑃_{𝐻_{1}} (fail to reject) .

The Neyman–Pearson lemma identifies the most powerful test for simple-versus-simple hypotheses using likelihood ratios:

\frac{𝑝_{1} (𝑋)}{𝑝_{0} (𝑋)} .

Composite hypotheses require additional structure: generalized likelihood ratio, score tests, Wald tests, minimax testing, or Bayesian decision rules.

A $𝑝$ -value is not the probability that the null hypothesis is true. It is a tail probability computed under the null model.

37.13 Confidence intervals

A confidence procedure produces a random set $𝐶 (𝑋)$ satisfying

𝑃_{𝜃} (𝜃 \in 𝐶 (𝑋)) \geq 1 - 𝛼

for all $𝜃$ in the intended parameter class.

Coverage is a property of the random procedure under repeated sampling, not a posterior probability for a fixed realized interval unless a Bayesian model is added.

Asymptotic confidence intervals use approximations such as

{\hat{𝜃}}_{𝑛} \pm 𝑧_{1 - 𝛼 / 2} \hat{s e}, \hat{s e} = \sqrt{\hat{𝑉} / 𝑛},

where $\hat{𝑉}$ is a consistent estimate of the asymptotic variance $𝑉$ .

Validity requires asymptotic normality, consistent variance estimation, and sufficient sample size. Finite-sample coverage can differ substantially.
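The gap between nominal and actual coverage can be computed exactly for a binomial proportion. The sketch below uses the standard Wald interval with plug-in standard error $\sqrt{\hat{𝑝} (1 - \hat{𝑝}) / 𝑛}$ and sums binomial probabilities over outcomes whose interval contains the true $𝑝$ . The parameter values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def wald_interval(k, n, alpha=0.05):
    """Asymptotic normal interval for a binomial proportion from k successes in n."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    p_hat = k / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def exact_coverage(p, n, alpha=0.05):
    """P_p(p in C(X)) computed by summing the Binomial(n, p) pmf."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = wald_interval(k, n, alpha)
        if lo <= p <= hi:
            total += math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
    return total


for p, n in ((0.1, 20), (0.1, 200), (0.5, 1000)):
    print(p, n, exact_coverage(p, n))
```

With $𝑝 = 0.1$ and $𝑛 = 20$ the interval degenerates to a point when no successes are observed, which is one reason the actual coverage falls well short of the nominal $95\%$ .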

37.14 Bayesian posterior calculus

A prior $𝜋 (𝑑 𝜃)$ and likelihood $𝑃_{𝜃} (𝑑 𝑥)$ define a joint law

𝜋 (𝑑 𝜃) 𝑃_{𝜃} (𝑑 𝑥) .

The posterior is the conditional distribution

𝜋 (𝑑 𝜃 ∣ 𝑥) .

In density form,

𝜋 (𝜃 ∣ 𝑥) \propto 𝑝_{𝜃} (𝑥) 𝜋 (𝜃) .

Bayesian inference is ordinary conditional probability on an expanded parameter-data carrier. The prior is part of the model. Posterior validity relative to the model does not by itself validate the prior or likelihood empirically.

Posterior consistency and asymptotic normality require separate theorems; Bayesian updating alone does not guarantee learning under misspecification or nonidentifiability.

37.15 Model misspecification

A model is misspecified when the true data law $𝑃_{0}$ is not in

{𝑃_{𝜃} : 𝜃 \in Θ} .

Estimators may then converge to pseudo-true parameters minimizing a discrepancy such as

𝜃^{*} = \arg \min_{𝜃} 𝐷 (𝑃_{0} ∥ 𝑃_{𝜃}) .

Standard errors derived under correct specification may become wrong. Sandwich covariance formulas often appear:

𝐴^{- 1} 𝐵 𝐴^{- 1},

where curvature $𝐴$ and score variance $𝐵$ no longer coincide.

Misspecification is not automatically fatal, but the interpretation changes. One estimates the best approximation within a model class, not necessarily a true generative parameter.

37.16 Empirical versus formal probability

Formal probability says:

given 𝑃_{𝜃}, these consequences follow .

Empirical inference asks:

does the observed system warrant 𝑃_{𝜃} ?

The gap is filled by design, calibration, diagnostics, robustness, replication, and mechanism. Mathematical correctness of an estimator under iid Gaussian data does not prove the actual data are iid Gaussian.

The full inference chain is

world \to measurement \to data-generating assumptions \to probability model \to statistical procedure \to uncertainty certificate \to domain liftback .

Part XI — Advanced Carriers and Extensions

Chapter 38 — Coupling Theory

38.1 Couplings of probability measures

Given probability measures $𝜇$ on $𝑆$ and $𝜈$ on $𝑇$ , a coupling is a probability measure

𝛾 \in 𝑃 (𝑆 \times 𝑇)

with marginals $𝜇, 𝜈$ :

𝛾 (𝐴 \times 𝑇) = 𝜇 (𝐴), 𝛾 (𝑆 \times 𝐵) = 𝜈 (𝐵) .

Equivalently, a coupling is a pair $(𝑋, 𝑌)$ on one probability space satisfying

𝑋 \sim 𝜇, 𝑌 \sim 𝜈 .

The set of all couplings is

Π (𝜇, 𝜈) .

Marginals do not determine joint behavior. Coupling theory studies how to choose a joint carrier to optimize equality, order, distance, coalescence, or another relation.

38.2 Maximal coupling

A maximal coupling maximizes

𝑃 (𝑋 = 𝑌)

among all couplings of $𝜇, 𝜈$ . For countable or suitably dominated laws,

\sup_{𝛾 \in Π (𝜇, 𝜈)} 𝑃 (𝑋 = 𝑌) = 1 - ∥ 𝜇 - 𝜈 ∥_{T V} .

Equivalently,

∥ 𝜇 - 𝜈 ∥_{T V} = \inf_{𝛾 \in Π (𝜇, 𝜈)} 𝑃 (𝑋 \neq 𝑌) .

This turns total variation distance into an optimal mismatch probability. The abstract distance between laws becomes a concrete joint-event probability.
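
For finite laws the maximal coupling can be written down explicitly. The sketch below assumes laws given as dictionaries from states to exact fractions; the function name `maximal_coupling` and the example laws are hypothetical. It places the common mass min(𝜇, 𝜈) on the diagonal and couples the leftover masses independently.

```python
from fractions import Fraction as F

def maximal_coupling(mu, nu):
    """Return (tv, gamma) for finite laws given as dicts state -> probability."""
    states = sorted(set(mu) | set(nu))
    p = {s: mu.get(s, F(0)) for s in states}
    q = {s: nu.get(s, F(0)) for s in states}
    overlap = {s: min(p[s], q[s]) for s in states}
    tv = 1 - sum(overlap.values())           # equals (1/2) * sum |p - q|
    gamma = {(s, s): overlap[s] for s in states if overlap[s] > 0}
    if tv > 0:
        # Leftover masses have disjoint supports, so this adds nothing to the diagonal.
        for s in states:
            for t in states:
                rest = (p[s] - overlap[s]) * (q[t] - overlap[t]) / tv
                if rest > 0:
                    gamma[(s, t)] = gamma.get((s, t), F(0)) + rest
    return tv, gamma

mu = {0: F(1, 2), 1: F(1, 2)}                # illustrative laws
nu = {0: F(1, 4), 1: F(1, 4), 2: F(1, 2)}
tv, gamma = maximal_coupling(mu, nu)
p_equal = sum(w for (s, t), w in gamma.items() if s == t)
print(tv, p_equal)
```

In this example the diagonal mass equals one minus the total variation distance, which is the identity stated above.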

38.3 Monotone coupling

For probability measures on an ordered space such as $𝑅$ , stochastic domination

𝜇 ⪯_{s t} 𝜈

means

𝜇 ((𝑡, \infty)) \leq 𝜈 ((𝑡, \infty))

for all $𝑡$ ; equivalently, the CDFs satisfy $𝐹_{𝜇} (𝑡) \geq 𝐹_{𝜈} (𝑡)$ for all $𝑡$ .

A monotone coupling constructs

𝑋 \leq 𝑌 a.s.

with $𝑋 \sim 𝜇$ , $𝑌 \sim 𝜈$ . On $𝑅$ , quantile coupling does this:

𝑋 = 𝐹_{𝜇}^{- 1} (𝑈), 𝑌 = 𝐹_{𝜈}^{- 1} (𝑈),

with common $𝑈 \sim Uniform (0, 1)$ .

Monotone coupling converts distributional order into pathwise order.
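
A small sketch of quantile coupling, assuming two exponential laws with rates 2 and 1 so that the first is stochastically dominated by the second. The helper `exp_quantile` and the seed are illustrative choices. Each pair is driven by one shared uniform.

```python
import math
import random

def exp_quantile(u, rate):
    """Inverse CDF of the exponential law with the given rate."""
    return -math.log(1.0 - u) / rate

rng = random.Random(0)
pairs = []
for _ in range(1000):
    u = rng.random()                 # one shared uniform per pair
    x = exp_quantile(u, rate=2.0)    # X ~ Exp(2), the dominated law
    y = exp_quantile(u, rate=1.0)    # Y ~ Exp(1), the dominating law
    pairs.append((x, y))

print(all(x <= y for x, y in pairs))
```

The ordering holds for every pair, not just on average, because both quantile functions are evaluated at the same $𝑈$ .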

38.4 Strassen theorem

Strassen-type theorems characterize when probability measures admit couplings supported on a specified relation. A typical order version states that

𝜇 ⪯_{s t} 𝜈

under suitable conditions iff there exists a coupling $(𝑋, 𝑌)$ such that

𝑋 \leq 𝑌 a.s.

More generally, if $𝑅 \subseteq 𝑆 \times 𝑇$ is closed, one seeks

𝛾 (𝑅) = 1.

Marginal inequalities over appropriate sets characterize existence of such a coupling.

Strassen's theorem is a law-to-relation liftback theorem.

38.5 Coupling inequalities

Any coupling gives probability-distance bounds. For a coupling $(𝑋, 𝑌)$ ,

∥ 𝐿 (𝑋) - 𝐿 (𝑌) ∥_{T V} \leq 𝑃 (𝑋 \neq 𝑌) .

For Wasserstein distance,

𝑊_{𝑝} (𝜇, 𝜈)^{𝑝} \leq 𝐸 [𝑑 (𝑋, 𝑌)^{𝑝}]

for any coupling $(𝑋, 𝑌)$ of $𝜇$ and $𝜈$ , with equality when the coupling is optimal.

Coupling inequalities convert joint-path control into law-distance control. The quality of the estimate depends entirely on coupling design.

38.6 Coupling from the past

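Coupling from the past is an exact sampling method for certain Markov chains. One uses shared randomness to run all possible initial states from a sufficiently remote past toward time zero. If all trajectories coalesce by time zero, their common value has exactly the stationary distribution. When the starting time must be pushed further back, the randomness already assigned to later time steps is reused, not redrawn; redrawing it would bias the output.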

The method avoids burn-in approximation. It converts coalescence into exact stationarity.

Feasibility depends on monotonicity, finite state space, or other structure allowing simultaneous control of all initial states. Coalescence must be certified, not assumed.
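
The following is a sketch of monotone coupling from the past under stated assumptions: a walk on {0, ..., 𝐾} that steps up with probability 𝑝 and down otherwise, clamped at the ends. The update rule and the names `update` and `cftp_sample` are hypothetical. Because the update is monotone in the state, tracking the bottom and top chains certifies coalescence of all starting states.

```python
import random

def update(x, u, p, K):
    """Monotone update: step up if u < p, otherwise down, clamped to [0, K]."""
    return min(x + 1, K) if u < p else max(x - 1, 0)

def cftp_sample(p, K, rng):
    """Monotone coupling from the past for the clamped walk on {0, ..., K}."""
    us = []          # us[k] drives the step from time -(k+1) to time -k
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())      # new randomness only for the deeper past
        lo, hi = 0, K                    # bottom and top chains started at time -T
        for k in range(T - 1, -1, -1):   # oldest randomness first
            lo = update(lo, us[k], p, K)
            hi = update(hi, us[k], p, K)
        if lo == hi:                     # coalescence certified by sandwiching
            return lo
        T *= 2

rng = random.Random(1)
draws = [cftp_sample(0.5, 3, rng) for _ in range(4000)]
freqs = [draws.count(s) / len(draws) for s in range(4)]
print(freqs)
```

With $𝑝 = 1 / 2$ the transition matrix of this walk is symmetric, so the stationary law is uniform on the four states and the printed frequencies should each be close to 0.25.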

38.7 Wasserstein coupling

The $𝑝$ -Wasserstein distance is

𝑊_{𝑝} (𝜇, 𝜈) = {(\inf_{𝛾 \in Π (𝜇, 𝜈)} \int 𝑑 (𝑥, 𝑦)^{𝑝} 𝑑 𝛾 (𝑥, 𝑦))}^{1 / 𝑝} .

An optimal coupling minimizes expected transport cost.

Unlike total variation, Wasserstein distance sees geometry. Two point masses at nearby locations have total variation distance one but small Wasserstein distance:

∥ 𝛿_{𝑥} - 𝛿_{𝑦} ∥_{T V} = 1

for $𝑥 \neq 𝑦$ , while

𝑊_{𝑝} (𝛿_{𝑥}, 𝛿_{𝑦}) = 𝑑 (𝑥, 𝑦) .

This makes Wasserstein metrics appropriate for approximation where small spatial displacement should count as small error.
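
The contrast can be checked numerically. This sketch assumes empirical laws on the real line with equally many, equally weighted atoms, where pairing sorted samples is optimal for the cost ∣𝑥 - 𝑦∣^𝑝 with $𝑝 \geq 1$ . The function names are illustrative.

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between two empirical laws on R with equally many, equally weighted atoms."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # Sorted pairing is the monotone coupling, optimal in one dimension.
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (cost / len(xs)) ** (1.0 / p)

def tv_point_masses(x, y):
    """Total variation distance between the point masses at x and y."""
    return 0.0 if x == y else 1.0

print(tv_point_masses(0.0, 0.01), wasserstein_1d([0.0], [0.01], p=2))
```

Moving the second atom closer shrinks the Wasserstein value continuously while the total variation value stays at one until the atoms coincide.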

38.8 Coupling for Markov chains

To compare Markov chains started from $𝑥$ and $𝑦$ , construct

(𝑋_{𝑛}, 𝑌_{𝑛})

with correct marginal transition laws. If the coupling is coalescent, meaning that for all $𝑘 \geq 0$

𝑋_{𝑛} = 𝑌_{𝑛} \Rightarrow 𝑋_{𝑛 + 𝑘} = 𝑌_{𝑛 + 𝑘},

then the coupling time

𝑇 = \inf {𝑛 : 𝑋_{𝑛} = 𝑌_{𝑛}}

controls convergence:

∥ 𝑃^{𝑛} (𝑥, \cdot) - 𝑃^{𝑛} (𝑦, \cdot) ∥_{T V} \leq 𝑃 (𝑇 > 𝑛) .

Couplings may be synchronous, reflection, maximal, monotone, or problem-specific. Good coupling design exploits geometry of the transition mechanism.
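
As a hedged illustration, take a two-state chain with made-up transition probabilities and couple the two copies through one shared uniform per step. The names `step`, `tv`, and `stay_apart` are assumptions of this sketch. The exact total variation distance is compared with the probability that the coupled chains have not yet met.

```python
def step(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv(p, q):
    """Total variation distance between two finite laws."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

a, b = 0.3, 0.2                       # illustrative transition probabilities
P = [[1 - a, a], [b, 1 - b]]
# Shared-uniform coupling: each chain jumps to state 1 iff U < P[state][1].
# From different states the chains stay apart iff U lies between a and 1 - b.
stay_apart = abs((1 - b) - a)

dx, dy = [1.0, 0.0], [0.0, 1.0]       # chains started at states 0 and 1
rows = []
for n in range(1, 6):
    dx, dy = step(dx, P), step(dy, P)
    rows.append((n, tv(dx, dy), stay_apart ** n))   # (n, TV, P(T > n))
print(rows)
```

For this particular chain the bound is attained at every step, which shows that a well-chosen coupling can be sharp.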

38.9 Coupling as liftback from law-level claims

A law-level statement may be too weak for a pathwise claim. Coupling constructs a common carrier where stronger comparisons become meaningful.

Examples:

𝜇_{𝑛} \Rightarrow 𝜇 \to Skorokhod coupling with a.s. convergence,

𝜇 ⪯_{s t} 𝜈 \to 𝑋 \leq 𝑌 a.s. under monotone coupling,

𝑊_{𝑝} (𝜇, 𝜈) small \to 𝐸 [𝑑 (𝑋, 𝑌)^{𝑝}] small under near-optimal coupling .

The firewall is that the coupling relation is newly constructed. It is not automatically true for the original variables.

Chapter 39 — Optimal Transport

39.1 Transport plans

Let $𝜇$ and $𝜈$ be probability measures on spaces $𝑋, 𝑌$ . A transport plan is a coupling

𝛾 \in Π (𝜇, 𝜈) .

Given cost $𝑐 (𝑥, 𝑦)$ , the Kantorovich transport problem is

\inf_{𝛾 \in Π (𝜇, 𝜈)} \int 𝑐 (𝑥, 𝑦) 𝛾 (𝑑 𝑥, 𝑑 𝑦) .

The plan describes how mass at $𝑥$ is probabilistically distributed to destinations $𝑦$ . Unlike a deterministic map, one source point can split mass across many destinations.

Optimal transport is therefore coupling theory plus optimization over geometric cost.

39.2 Monge problem

The original Monge problem seeks a deterministic map

𝑇 : 𝑋 \to 𝑌

such that

𝑇_{*} 𝜇 = 𝜈

and minimizes

\int 𝑐 (𝑥, 𝑇 (𝑥)) 𝜇 (𝑑 𝑥) .

The problem is nonlinear and may have no solution because a map cannot split mass. For example, a point mass cannot be mapped deterministically into a diffuse distribution.

This is the primitive failure that motivates Kantorovich relaxation:

transport map \to transport plan .

39.3 Kantorovich relaxation

The Kantorovich problem replaces deterministic maps by arbitrary couplings:

\inf_{𝛾 \in Π (𝜇, 𝜈)} \int 𝑐 𝑑 𝛾 .

The feasible set $Π (𝜇, 𝜈)$ is convex, and the objective is linear in $𝛾$ .

This convexification creates existence and duality machinery. Under lower semicontinuity and tightness conditions, minimizers often exist.

However, a Kantorovich optimizer need not be induced by a map. Recovering Monge structure requires additional assumptions. Convex relaxation terminalizes the relaxed problem, not automatically the original deterministic one.
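
A two-point example makes both the linearity and the failure of map structure concrete. The sketch assumes marginals (1/2, 1/2) and (1/3, 2/3) on {0, 1} and a cost equal to one when the states differ; all names and numbers are illustrative. Couplings of two-point laws form a one-parameter family, so the linear objective is minimized at an endpoint.

```python
from fractions import Fraction as F

def plan(t, m, n):
    """Coupling of marginals (m, 1 - m) and (n, 1 - n) with gamma[0][0] = t."""
    return [[t, m - t], [n - t, 1 - m - n + t]]

def cost(gamma, c):
    """Linear objective: sum of cost times coupled mass."""
    return sum(gamma[i][j] * c[i][j] for i in range(2) for j in range(2))

m, n = F(1, 2), F(1, 3)               # illustrative marginals
c = [[0, 1], [1, 0]]                  # illustrative cost: 1 when states differ
t_min, t_max = max(F(0), m + n - 1), min(m, n)
# The objective is linear in t, so the minimum sits at an endpoint of [t_min, t_max].
best_t = min((t_min, t_max), key=lambda t: cost(plan(t, m, n), c))
best = plan(best_t, m, n)
print(best_t, cost(best, c), best)
```

In the optimal plan the first source state sends mass to both destinations, so no deterministic map induces it.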

39.4 Duality

Kantorovich duality gives

\inf_{𝛾 \in Π (𝜇, 𝜈)} \int 𝑐 (𝑥, 𝑦) 𝑑 𝛾 = \sup_{𝜑, 𝜓} {\int 𝜑 𝑑 𝜇 + \int 𝜓 𝑑 𝜈 : 𝜑 (𝑥) + 𝜓 (𝑦) \leq 𝑐 (𝑥, 𝑦)} .

The dual potentials certify optimality. If a feasible plan $𝛾$ and a feasible pair $(𝜑, 𝜓)$ satisfy

𝜑 (𝑥) + 𝜓 (𝑦) = 𝑐 (𝑥, 𝑦)

on the support of $𝛾$ , then the primal and dual values coincide and both are optimal.

For cost $𝑐 (𝑥, 𝑦) = 𝑑 (𝑥, 𝑦)$ , the dual simplifies to the Kantorovich–Rubinstein formula:

𝑊_{1} (𝜇, 𝜈) = \sup_{∥ 𝑓 ∥_{Lip} \leq 1} ∣ \int 𝑓 𝑑 𝜇 - \int 𝑓 𝑑 𝜈 ∣ .
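
The duality can be checked by hand on a small example. This sketch assumes laws on the integers {0, 1, 2}, where the primal value equals the sum of absolute CDF differences; the function names and the two laws are illustrative. Any 1-Lipschitz test function gives a lower bound, and here the identity function attains the primal value.

```python
from fractions import Fraction as F
from itertools import accumulate

def w1_on_integers(mu, nu):
    """W_1 for laws on {0, ..., K} via the one-dimensional CDF formula."""
    cdf_mu, cdf_nu = accumulate(mu), accumulate(nu)
    return sum(abs(a - b) for a, b in zip(cdf_mu, cdf_nu))

def dual_value(f, mu, nu):
    """Dual objective |int f dmu - int f dnu| for a test function f on {0, ..., K}."""
    return abs(sum(f(k) * (p - q) for k, (p, q) in enumerate(zip(mu, nu))))

mu = [F(1, 2), F(1, 2), F(0)]         # illustrative laws on {0, 1, 2}
nu = [F(1, 4), F(1, 4), F(1, 2)]
print(w1_on_integers(mu, nu), dual_value(lambda k: k, mu, nu))
```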

39.5 Wasserstein distances

For $𝑝 \geq 1$ ,

𝑊_{𝑝} (𝜇, 𝜈) = {(\inf_{𝛾 \in Π (𝜇, 𝜈)} \int 𝑑 (𝑥, 𝑦)^{𝑝} 𝑑 𝛾)}^{1 / 𝑝} .

The space

𝑃_{𝑝} (𝑋)

consists of probability measures with finite $𝑝$ -th moment.

Wasserstein convergence is stronger than weak convergence:

𝑊_{𝑝} (𝜇_{𝑛}, 𝜇) \to 0

implies weak convergence together with convergence of $𝑝$ -th moments; on Polish spaces the converse also holds.

Wasserstein geometry turns probability distributions into points of a metric space whose metric reflects spatial displacement.

39.6 Brenier theorem

For quadratic cost

𝑐 (𝑥, 𝑦) = ∣ 𝑥 - 𝑦 ∣^{2}

on $𝑅^{𝑑}$ , if the source measure $𝜇$ is absolutely continuous, Brenier's theorem gives an optimal transport map

𝑇 = \nabla 𝜑

for a convex function $𝜑$ , under suitable finite-moment conditions.

Thus the optimal Kantorovich plan is concentrated on the graph of a deterministic map:

𝛾 = (id, 𝑇)_{*} 𝜇 .

Convexity enters as the certificate for optimal map structure. The theorem is a relaxation liftback:

optimal plan \to gradient of convex potential .

39.7 Displacement convexity

Given an optimal transport map $𝑇$ from $𝜇_{0}$ to $𝜇_{1}$ for quadratic cost, define for $𝑡 \in [0, 1]$ the interpolation

𝑇_{𝑡} = (1 - 𝑡) id + 𝑡 𝑇,

and

𝜇_{𝑡} = (𝑇_{𝑡})_{*} 𝜇_{0} .

This is a Wasserstein geodesic.

A functional $𝐹$ is displacement convex if

𝐹 (𝜇_{𝑡}) \leq (1 - 𝑡) 𝐹 (𝜇_{0}) + 𝑡 𝐹 (𝜇_{1})

along Wasserstein geodesics.

This differs from ordinary convexity under mixture:

(1 - 𝑡) 𝜇_{0} + 𝑡 𝜇_{1} .

The carrier geometry has changed; the relevant straight lines are transport geodesics.
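
The difference between the two interpolations is visible already for one-dimensional Gaussians. The sketch assumes $𝜇_{0} = 𝑁 (𝑚_{0}, 𝑠_{0}^{2})$ and $𝜇_{1} = 𝑁 (𝑚_{1}, 𝑠_{1}^{2})$ , for which the optimal map is the increasing affine map matching means and standard deviations. The function names are illustrative.

```python
def displacement_gaussian(t, m0, s0, m1, s1):
    """Mean and variance of mu_t along the W_2 geodesic between two 1D Gaussians."""
    mean = (1 - t) * m0 + t * m1
    std = (1 - t) * s0 + t * s1       # T_t is affine, so mu_t stays Gaussian
    return mean, std ** 2

def mixture_moments(t, m0, s0, m1, s1):
    """Mean and variance of the mixture (1 - t) mu_0 + t mu_1."""
    mean = (1 - t) * m0 + t * m1
    var = (1 - t) * s0 ** 2 + t * s1 ** 2 + t * (1 - t) * (m0 - m1) ** 2
    return mean, var

print(displacement_gaussian(0.5, 0.0, 1.0, 4.0, 1.0))
print(mixture_moments(0.5, 0.0, 1.0, 4.0, 1.0))
```

Halfway along the geodesic the law is a unit-variance Gaussian centered between the endpoints, while the mixture at the same $𝑡$ is bimodal with a much larger variance.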

39.8 Transport inequalities

A transport inequality bounds Wasserstein distance by relative entropy: for a constant $𝐶 > 0$ and all probability measures $𝜈$ ,

𝑊_{2} (𝜈, 𝜇)^{2} \leq 2 𝐶 𝐷 (𝜈 ∥ 𝜇) .

This says moving law $𝜇$ to $𝜈$ by a large geometric amount requires large information cost.

Such inequalities imply concentration. Tilting $𝜇$ toward an event, or toward a large deviation of a Lipschitz function, creates a law $𝜈$ , and entropy-transport control bounds how far the tilt can move typical mass.

This connects:

entropy \leftrightarrow geometry \leftrightarrow concentration .
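
A quick numerical check under stated assumptions: take $𝜇$ to be the standard Gaussian on the real line, for which Talagrand's inequality gives the constant $𝐶 = 1$ , and let $𝜈$ range over Gaussians 𝑁(𝑚, 𝑠^2). Both sides then have closed forms. The grid and function names are illustrative.

```python
import math

def w2_squared(m, s):
    """W_2^2 between N(m, s^2) and N(0, 1) on the real line."""
    return m ** 2 + (s - 1.0) ** 2

def kl_to_standard_normal(m, s):
    """Relative entropy D(N(m, s^2) || N(0, 1))."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))

# Grid of means in [-3, 3] and standard deviations in (0, 3].
grid = [(m / 2.0, s / 4.0) for m in range(-6, 7) for s in range(1, 13)]
gaps = [2.0 * kl_to_standard_normal(m, s) - w2_squared(m, s) for m, s in grid]
print(min(gaps))
```

The gap equals 2(𝑠 - 1 - log 𝑠), so it is nonnegative and vanishes exactly for pure translations of $𝜇$ .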

39.9 Gradient flows in probability space

Many dissipative PDEs can be interpreted as gradient flows in Wasserstein space. The Fokker–Planck equation

\partial_{𝑡} 𝜌 = \nabla \cdot (𝜌 \nabla 𝑉) + Δ 𝜌

can be viewed as gradient descent of the free-energy functional

𝐹 (𝜌) = \int 𝑉 𝑑 𝜌 + \int 𝜌 \log 𝜌 𝑑 𝑥

under the $𝑊_{2}$ metric.

This transforms PDE evolution into geometry on probability measures:

state = 𝜌_{𝑡}, energy = 𝐹, metric = 𝑊_{2} .

The interpretation is not metaphorical when the variational structure is rigorously established through minimizing-movement schemes and metric-gradient-flow theory.
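
Energy dissipation can be observed in the simplest case. The sketch assumes $𝑉 (𝑥) = 𝑥^{2} / 2$ in one dimension and a Gaussian initial law, so the solution stays Gaussian with mean 𝑚 and variance 𝑣 obeying 𝑚' = -𝑚 and 𝑣' = 2 - 2𝑣. The explicit Euler step, step size, and initial values are illustrative choices, not a scheme from the text.

```python
import math

def free_energy(m, v):
    """F(rho) for rho = N(m, v) and V(x) = x^2 / 2."""
    potential = 0.5 * (m ** 2 + v)                              # integral of V d rho
    neg_entropy = -0.5 * math.log(2.0 * math.pi * math.e * v)   # integral of rho log rho
    return potential + neg_entropy

m, v, dt = 2.0, 0.2, 0.01             # illustrative initial law and step size
energies = [free_energy(m, v)]
for _ in range(500):
    # Moment equations of the Fokker-Planck flow for Gaussian data.
    m, v = m + dt * (-m), v + dt * (2.0 - 2.0 * v)
    energies.append(free_energy(m, v))
print(energies[0], energies[-1], v)
```

The free energy decreases at every step and the variance approaches one, the variance of the stationary density proportional to $𝑒^{- 𝑉}$ .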

39.10 Probability geometry

Optimal transport reveals probability laws as geometric objects. Means, interpolations, barycenters, curvature, geodesics, gradient flows, and convexity can all be studied in measure space.

A Wasserstein barycenter minimizes

𝜈 \mapsto \sum_{𝑖} 𝑤_{𝑖} 𝑊_{2} (𝜈, 𝜇_{𝑖})^{2} .

This is an analogue of Euclidean least squares for distributions.

The carrier warning is that different probability metrics induce different geometries. Total variation, KL divergence, Hellinger distance, and Wasserstein distance are not interchangeable. Each detects different deformations.

Chapter 40 — Point Processes

40.1 Counting measures

A counting measure has the form

𝑁 = \sum_{𝑖} 𝛿_{𝑋_{𝑖}},

possibly with multiplicities. For a measurable set $𝐴$ ,

𝑁 (𝐴)

counts points lying in $𝐴$ .

A point process is a random counting measure. Thus the state space is a space of locally finite measures, not merely a random vector of points. This formulation handles random numbers of points and unbounded domains uniformly.

Measurability and topology on configuration space matter for convergence. The vague topology is commonly used: it is generated by the maps below for continuous $𝑓$ with compact support, which suits locally finite configurations:

𝑁 \mapsto \int 𝑓 𝑑 𝑁 .

40.2 Poisson point processes

A Poisson point process with intensity measure $𝜇$ satisfies:

for measurable $𝐴$ with $𝜇 (𝐴) < \infty$ ,

𝑁 (𝐴) \sim Poisson (𝜇 (𝐴));

counts on disjoint sets are independent.

Its Laplace functional is

𝐸 [𝑒^{- \int 𝑓 𝑑 𝑁}] = \exp (- \int (1 - 𝑒^{- 𝑓 (𝑥)}) 𝜇 (𝑑 𝑥))

for nonnegative measurable $𝑓$ .

The Laplace functional characterizes the law. It plays the role of a transform for random measures.
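
For an indicator test function the formula reduces to a statement about one Poisson count, which can be verified directly. The sketch assumes 𝑓 = 𝑐 times the indicator of a set 𝐴 with 𝜇(𝐴) = `mass`; the function names and the values 0.7 and 3.0 are illustrative.

```python
import math

def laplace_indicator_series(c, mass, terms=200):
    """E[exp(-c N(A))] for N(A) ~ Poisson(mass), summed from the Poisson pmf."""
    total, pmf = 0.0, math.exp(-mass)     # pmf starts at P(N(A) = 0)
    for k in range(terms):
        total += math.exp(-c * k) * pmf
        pmf *= mass / (k + 1)             # P(N = k + 1) from P(N = k)
    return total

def laplace_indicator_formula(c, mass):
    """Closed form exp(-int (1 - e^{-f}) dmu) for f = c * 1_A with mu(A) = mass."""
    return math.exp(-mass * (1.0 - math.exp(-c)))

print(laplace_indicator_series(0.7, 3.0), laplace_indicator_formula(0.7, 3.0))
```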

40.3 Intensity measures

The intensity measure of a point process is

Λ (𝐴) = 𝐸 [𝑁 (𝐴)] .

For a Poisson point process, the intensity measure completely determines the law. For a general point process, it does not; higher-order dependence requires factorial moment measures or correlation functions.

If the intensity measure has a density with respect to Lebesgue measure,

Λ (𝑑 𝑥) = 𝜆 (𝑥) 𝑑 𝑥,

then $𝜆 (𝑥)$ is an intensity density. It describes expected local point density, not necessarily actual conditional occurrence rate.

Two point processes can have identical intensity and radically different clustering or repulsion.

40.4 Laplace functionals

For a random measure $𝑁$ , define

𝐿_{𝑁} (𝑓) = 𝐸 \exp (- \int 𝑓 𝑑 𝑁), 𝑓 \geq 0.

Under suitable conditions, the Laplace functional determines the law of $𝑁$ .

For a Poisson point process with intensity measure $𝜇$ ,

𝐿_{𝑁} (𝑓) = \exp (- \int (1 - 𝑒^{- 𝑓}) 𝑑 𝜇) .

Laplace functionals transform superposition of independent point processes into multiplication. They are the random-measure analogue of characteristic or probability generating functions.

40.5 Palm distributions

Palm distributions describe a point process viewed from or conditioned relative to a typical point. Informally, they answer:

what does the configuration look like given a point at 𝑥 ?

For stationary point processes, Palm theory distinguishes a typical location from a typical point. These are not the same sampling procedures. A typical point is biased toward denser regions or larger structures depending on context.

Campbell formulas relate sums over process points to integrals against intensity and Palm distributions:

𝐸 \sum_{𝑥 \in 𝑁} 𝑓 (𝑥, 𝑁) = \int 𝐸^{𝑥} [𝑓 (𝑥, 𝑁)] Λ (𝑑 𝑥) .

Palm calculus is central in queueing, networks, stochastic geometry, and spatial statistics.

40.6 Cox processes

A Cox process, or doubly stochastic Poisson process, is a Poisson process conditional on a random intensity measure $Λ$ . Given $Λ$ ,

𝑁 ∣ Λ \sim PPP (Λ) .

Marginally, the process is not Poisson because randomness in intensity creates clustering and dependence between counts in disjoint regions:

Cov (𝑁 (𝐴), 𝑁 (𝐵))

can be positive through shared intensity fluctuations.

Cox processes model environmental heterogeneity, spatial clustering, insurance events, communication networks, and random media.
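
The covariance between disjoint counts can be computed exactly in a toy case. The sketch assumes a spatially constant but random rate taking finitely many values, with counts on disjoint sets 𝐴 and 𝐵 that are independent Poisson given the rate. The function name and the two-level rate are hypothetical.

```python
from fractions import Fraction as F

def cox_count_moments(levels, area_a, area_b):
    """Moments of counts on disjoint sets A, B for a random constant intensity.

    levels: list of (probability, rate) pairs describing the random rate.
    """
    mean_rate = sum(p * r for p, r in levels)
    var_rate = sum(p * (r - mean_rate) ** 2 for p, r in levels)
    mean_a = area_a * mean_rate
    # Law of total variance: E[Var(N(A) | rate)] + Var(E[N(A) | rate]).
    var_a = area_a * mean_rate + area_a ** 2 * var_rate
    # Given the rate the counts are independent, so only the shared rate contributes.
    cov_ab = area_a * area_b * var_rate
    return mean_a, var_a, cov_ab

levels = [(F(1, 2), F(1)), (F(1, 2), F(5))]   # illustrative: rate 1 or 5, equally likely
print(cox_count_moments(levels, F(2), F(3)))
```

The count on 𝐴 has variance larger than its mean, and the covariance with the count on 𝐵 is strictly positive, neither of which can happen for a Poisson process.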

40.7 Renewal processes

A renewal process is generated by iid positive interarrival times

𝑇_{1}, 𝑇_{2}, \dots .

Arrival epochs are

𝑆_{𝑛} = 𝑇_{1} + \dots + 𝑇_{𝑛},

and, with the convention $𝑆_{0} = 0$ , the counting process is

𝑁 (𝑡) = \max {𝑛 \geq 0 : 𝑆_{𝑛} \leq 𝑡} .

The elementary renewal theorem gives

\frac{𝐸 [𝑁 (𝑡)]}{𝑡} \to \frac{1}{𝐸 [𝑇_{1}]}

when $𝐸 [𝑇_{1}] < \infty$ . The strong law of large numbers gives

\frac{𝑁 (𝑡)}{𝑡} \to \frac{1}{𝐸 [𝑇_{1}]}

almost surely.

The Poisson process is the special renewal process with exponential interarrival times. Memorylessness makes it Markovian; general renewal processes retain age dependence.
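
A short simulation illustrates the strong law for renewal counts. The sketch assumes interarrival times uniform on (0.5, 1.5), so the mean is one; the helper `renewal_count`, the horizon, and the seed are illustrative.

```python
import random

def renewal_count(t, draw_gap):
    """N(t) = max{n : S_n <= t} for iid positive gaps drawn by draw_gap()."""
    n, s = 0, draw_gap()              # s holds the next arrival epoch S_{n+1}
    while s <= t:
        n += 1
        s += draw_gap()
    return n

rng = random.Random(2)
horizon = 10_000.0
count = renewal_count(horizon, lambda: rng.uniform(0.5, 1.5))   # E[T_1] = 1
print(count / horizon)
```

The printed ratio should be close to one, the reciprocal of the mean interarrival time.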

40.8 Spatial point processes

Spatial point processes model random configurations in $𝑅^{𝑑}$ or another geometric space. Major classes include Poisson, Cox, determinantal, Gibbs, cluster, and hard-core processes.

Clustering and repulsion are diagnosed through pair-correlation functions, Ripley's $𝐾$ -function, factorial moments, or conditional intensities.

The carrier includes geometry. Rotation invariance, translation stationarity, boundary effects, observation windows, and metric choice influence inference. Spatial point processes are not ordinary iid samples because point count and locations may be jointly random and dependent.

40.9 Random measures

A random measure is a measurable map

𝜔 \mapsto 𝑀_{𝜔}

into a space of measures. Point processes are integer-valued random measures, but random measures also include diffuse objects such as Gaussian random measures, completely random measures, occupation measures, and empirical measures.

Integration against a random measure gives random variables:

\int 𝑓 𝑑 𝑀 .

The law of $𝑀$ can be studied through Laplace or characteristic functionals.

Random-measure language unifies point processes, Lévy noise, Bayesian nonparametric priors, branching limits, and stochastic integration.

40.10 Applications to geometry and networks

Point processes model wireless transmitters, stars, trees, cells, particles, earthquakes, traffic events, and network nodes. Geometry enters through nearest-neighbor distance, Voronoi tessellations, coverage, connectivity, interference, and percolation.

For a homogeneous Poisson process of intensity $𝜆$ in $𝑅^{𝑑}$ , the void probability of a bounded Borel set $𝐵$ with Lebesgue measure $∣ 𝐵 ∣$ is

𝑃 (𝑁 (𝐵) = 0) = 𝑒^{- 𝜆 ∣ 𝐵 ∣} .

This immediately gives nearest-neighbor distributions by taking $𝐵$ as a ball.
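
As a sketch of that step, the distance 𝑅 from a fixed location to the nearest point satisfies 𝑃(𝑅 > 𝑟) = exp(-𝜆 ∣ball of radius 𝑟∣). The code below integrates this survival function numerically to get the mean distance; the function names, the intensity 4, and the integration settings are illustrative assumptions.

```python
import math

def nn_survival(r, lam, d=2):
    """P(no point within distance r of a fixed location) in dimension d."""
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)   # volume of the unit ball
    return math.exp(-lam * unit_ball * r ** d)

def mean_nn_distance(lam, d=2, r_max=10.0, steps=100_000):
    """E[R] as the integral of the survival function (midpoint rule)."""
    h = r_max / steps
    return sum(nn_survival((i + 0.5) * h, lam, d) for i in range(steps)) * h

print(mean_nn_distance(4.0), 1.0 / (2.0 * math.sqrt(4.0)))
```

In the plane the integral is Gaussian and evaluates to 1/(2√𝜆), which the numerical value reproduces.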

Network models add edges according to distance, random connection rules, or marks. This creates random geometric graphs and continuum percolation. The liftback from point process to network property requires geometric dependence analysis; node independence does not imply edge-event independence.