Probability Theory — Chapters 30–40
- Get link
- X
- Other Apps
Probability Theory — Chapters 30–40
Chapter 30 — Stochastic Differential Equations
30.1 Itô SDEs
An Itô stochastic differential equation has the form
dXt=b(t,Xt)dt+σ(t,Xt)dBt,X0=ξ,where Bt is Brownian motion, b is the drift coefficient, and σ is the diffusion coefficient. The rigorous meaning is the integral equation
Xt=ξ+∫0tb(s,Xs)ds+∫0tσ(s,Xs)dBs.The first integral is an ordinary Lebesgue integral along the sample path; the second is an Itô stochastic integral.
In d dimensions with an m-dimensional Brownian driver,
dXt=b(t,Xt)dt+σ(t,Xt)dBt,where b:R+×Rd→Rd and
σ:R+×Rd→Rd×m.The instantaneous covariance matrix is
a(t,x)=σ(t,x)σ(t,x)⊤.Different σ can induce the same covariance matrix a, so the diffusion law may depend on a rather than on a unique factorization.
An SDE should be understood as a carrier consisting of probability space, filtration, Brownian motion, coefficient functions, initial law, and solution concept. The differential notation alone does not specify whether the solution is strong or weak, whether explosion is allowed, what happens at boundaries, or whether uniqueness is pathwise or merely in law.
30.2 Strong and weak solutions
A strong solution is constructed on a prescribed filtered probability space carrying a prescribed Brownian motion. The solution Xt is adapted to that filtration and satisfies the integral equation. Informally, once the Brownian path and initial condition are supplied, X is determined as a measurable functional of those inputs when pathwise uniqueness holds.
A weak solution allows the probability space, filtration, Brownian motion, and process X to be constructed together. The target is the existence of a joint law satisfying the SDE. Weak solutions therefore belong to a law-level carrier.
The two uniqueness notions differ. Pathwise uniqueness means that if X and Y solve the SDE on the same probability space with the same Brownian motion and initial value, then
P(Xt=Yt for all t)=1.Uniqueness in law means all weak solutions have the same distribution on path space.
The distinction is fundamental:
strong solution=pathwise construction,weak solution=law construction.Equality in law is not pathwise identity. A theorem proving weak uniqueness cannot automatically be used to compare two solutions driven by the same noise.
30.3 Existence and uniqueness
A standard sufficient condition for strong existence and pathwise uniqueness is global Lipschitz continuity:
∣b(t,x)−b(t,y)∣+∥σ(t,x)−σ(t,y)∥≤L∣x−y∣,together with linear growth:
∣b(t,x)∣+∥σ(t,x)∥≤C(1+∣x∣).Under these assumptions, Picard iteration produces a unique global strong solution.
Starting from Xt(0)=ξ, define
Xt(n+1)=ξ+∫0tb(s,Xs(n))ds+∫0tσ(s,Xs(n))dBs.Itô isometry, BDG inequalities, and Grönwall's lemma show convergence in suitable process norms.
Local Lipschitz conditions usually produce uniqueness only up to explosion time. Irregular coefficients may still admit weak or even strong solutions, but the proof route changes. Possible tools include martingale problems, monotonicity methods, one-dimensional comparison, Zvonkin transformations, Dirichlet forms, or compactness methods.
The gate structure is:
formal SDE→existence→solution concept→uniqueness class→globality.These are separate results.
30.4 Explosion times
A local solution may cease to remain finite after a random time. Define
τn=inf{t≥0:∣Xt∣≥n},τ∞=n→∞limτn.The process explodes if
P(τ∞<∞)>0.Global Lipschitz and linear-growth conditions prevent explosion. More flexible nonexplosion criteria use Lyapunov functions. Suppose V:Rd→[0,∞) satisfies
V(x)→∞as ∣x∣→∞and
AV(x)≤C(1+V(x)),where A is the generator. Itô's formula and Grönwall estimates can then bound
E[V(Xt∧τn)]uniformly in n, implying nonexplosion under suitable hypotheses.
Explosion is a local-global gate. Proving the SDE has a solution for every bounded stopping region does not prove the process exists globally. The passage
local solution→global processrequires a separate nonexplosion certificate.
30.5 Diffusion generators
For
dXt=b(Xt)dt+σ(Xt)dBt,define
a(x)=σ(x)σ(x)⊤.The generator acts formally on smooth f as
Af(x)=b(x)⋅∇f(x)+21i,j∑aij(x)∂ijf(x).Itô's formula gives
f(Xt)−f(X0)−∫0tAf(Xs)dsas a local martingale, and a true martingale under suitable integrability. This identity connects the SDE carrier to the martingale-problem carrier.
The generator contains infinitesimal transition information:
Af(x)=t↓0limtEx[f(Xt)]−f(x).It links stochastic dynamics to PDE, potential theory, semigroups, invariant measures, and spectral analysis.
The differential expression is not the complete generator. Its domain matters. Boundary conditions and function-space closure can produce different Markov processes from similar formal differential operators.
30.6 Fokker–Planck equations
If Xt has density p(t,x), then formally its law evolves according to the forward Kolmogorov or Fokker–Planck equation
∂tp=A∗p.For a diffusion,
∂tp=−i∑∂i(bip)+21i,j∑∂ij(aijp).The SDE and Fokker–Planck equation describe different carriers. The SDE describes individual random paths. The Fokker–Planck equation describes transport of probability mass. One may have a well-defined process without a smooth density, so the PDE representation requires regularity beyond process existence.
Stationary densities satisfy
A∗p∞=0together with normalization and boundary conditions. Solving this equation may identify an invariant law, but uniqueness of the invariant law and convergence toward it require additional recurrence or ergodicity arguments.
30.7 Kolmogorov backward equation
For
u(t,x)=Ex[f(Xt)],the semigroup relation gives
u(t,x)=Ptf(x).Under regularity conditions,
∂tu=Au,u(0,x)=f(x).For a terminal-value problem,
u(t,x)=Et,x[g(XT)],one obtains
∂tu+Au=0,u(T,x)=g(x).The forward equation acts on probability distributions; the backward equation acts on observables or value functions:
forward:law→law,backward:observable→expected observable.This distinction is central in filtering, stochastic control, potential theory, and mathematical finance. Confusing the adjoint roles of A and A∗ produces incorrect PDE transport.
30.8 Change of measure
Suppose
dXt=btdt+σtdBt.Under an equivalent probability measure defined through a suitable exponential martingale,
ZT=exp(−∫0TθsdBs−21∫0T∣θs∣2ds),one can define
dPdQ=ZT.Then
Bt=Bt+∫0tθsdsis Brownian motion under Q, and the drift representation of X changes accordingly.
This is a probability-law transport, not a path transformation. The underlying trajectories can remain the same while their weights change.
The density process must be a true martingale with expectation one. A local martingale is insufficient for defining a probability measure with total mass one. Conditions such as Novikov's criterion are sufficient:
Eexp(21∫0T∣θs∣2ds)<∞.30.9 Stochastic flows
Under suitable regularity, the family of solutions
Xtxindexed by initial state x forms a stochastic flow. Instead of studying one random trajectory, one studies the random map
Φs,t(ω,x)=Xts,x(ω).The flow property is
Φs,t=Φu,t∘Φs,u,s≤u≤t,with appropriate noise-shift interpretation.
When coefficients are sufficiently smooth, x↦Xtx may be differentiable. The Jacobian flow satisfies a linearized SDE. For example,
dJt=Db(Xt)Jtdt+Dσ(Xt)JtdBt.Stochastic flows are used in random dynamical systems, stochastic geometry, fluid models, filtering, and sensitivity analysis. Flow regularity is stronger than existence of separate solutions for each initial point. Simultaneous control over uncountably many initial conditions is a separate gate.
30.10 Boundary behavior
An SDE inside a domain D does not determine what happens at ∂D. Possible behaviors include absorption, reflection, killing, stickiness, entrance, exit, and natural inaccessibility.
Absorbing behavior stops the process or sends it to a cemetery state at
τ∂D=inf{t:Xt∈∂D}.Reflection adds a bounded-variation correction:
dXt=b(Xt)dt+σ(Xt)dBt+n(Xt)dLt,where Lt is boundary local time and n is an inward normal field.
In one dimension, scale functions and speed measures classify boundaries. In higher dimensions, boundary behavior connects to PDE boundary conditions: absorbing corresponds to Dirichlet-type conditions; reflection corresponds to Neumann-type conditions.
The same interior generator can support distinct processes under different boundary conditions. The boundary rule is therefore part of the carrier.
30.11 Numerical schemes
For time step Δt, Euler–Maruyama approximates
dXt=b(Xt)dt+σ(Xt)dBtby
Xk+1=Xk+b(Xk)Δt+σ(Xk)ΔBk,where
ΔBk∼N(0,Δt).Strong error concerns coupled trajectories:
E[k≤nmax∣Xtk−Xk∣p]1/p.Weak error concerns observables:
∣E[f(XT)]−E[f(Xn)]∣.A method can have better weak order than strong order because matching distributions is easier than matching paths.
Higher-order methods include Milstein schemes, which add terms involving derivatives of σ and iterated stochastic integrals. Numerical validity also requires stability, preservation of positivity or constraints, control of invariant measures, and treatment of stiffness.
30.12 SDE model validity
An SDE is a mathematical model, not a proof that a real system contains Brownian noise. To justify the model, one must identify state variables, time scale, source of randomness, Markov approximation, coefficient calibration, boundary conditions, and the regime in which diffusion scaling is valid.
Diffusion approximations often arise from limits of discrete systems:
small jumps+high frequency+centering+variance scaling→diffusion.If jumps are heavy-tailed, a Lévy process may be the correct limit instead. If memory persists, a Markov SDE may be invalid. If noise is multiplicative, interpretation of stochastic integration can matter.
The full liftback chain is
mechanism/data→stochastic scaling→SDE carrier→mathematical solution→empirical validation.Part IX — Concentration, High-Dimensional Probability, and Random Structures
Chapter 31 — Concentration Inequalities
31.1 Hoeffding inequality
Let X1,…,Xn be independent random variables satisfying
ai≤Xi≤bialmost surely. Then
P(i=1∑n(Xi−EXi)≥t)≤exp(−∑i=1n(bi−ai)22t2).A corresponding two-sided inequality follows by applying the estimate to both tails.
Hoeffding's inequality is obtained by exponential tilting and a bound on the moment generating function of bounded centered variables. It is distribution-free within the bounded range. The price of universality is that it ignores actual variance and can be conservative when variance is small relative to range.
For iid Bernoulli or bounded averages,
Xˉn=n1i∑Xi,the bound takes the exponential form
P(∣Xˉn−EX1∣≥ε)≤2e−cnε2.This improves the polynomial 1/n scale provided by Chebyshev.
31.2 Bernstein inequality
Bernstein's inequality incorporates both variance and maximum magnitude. For independent centered Xi with
∣Xi∣≤M,V=i∑E[Xi2],a standard form is
P(i∑Xi≥t)≤exp(−2(V+Mt/3)t2).Two regimes appear. For t≪V/M,
P(S≥t)≈e−t2/(2V),which is Gaussian. For t≫V/M,
P(S≥t)≲e−ct/M,which is exponential.
Bernstein is therefore sensitive to actual fluctuation energy V and worst-case jump scale M. It is often superior to Hoeffding when the variables are bounded but sparse or low variance.
31.3 Bennett inequality
Bennett's inequality is a sharper variance-sensitive concentration bound for bounded independent summands. Under assumptions similar to Bernstein,
P(S≥t)≤exp[−M2Vh(VMt)],where
h(u)=(1+u)log(1+u)−u.Bernstein's inequality can be derived by lower-bounding h(u) with a simpler rational expression. Bennett therefore retains more exact convex rate information.
The function h reflects the Legendre transform of an MGF bound. This exposes the relation between concentration inequalities and large deviations: concentration produces a finite-n upper certificate; the convex rate function records the exponential cost geometry.
31.4 Chernoff method
The Chernoff method begins from Markov's inequality:
P(X≥a)=P(eλX≥eλa)≤e−λaE[eλX].Optimizing over λ>0 gives
P(X≥a)≤exp(−λ>0sup{λa−logE[eλX]}).For independent sums,
logE[eλ∑iXi]=i∑logE[eλXi].Thus an MGF estimate for individual summands becomes a tail estimate for the sum.
Chernoff is a method rather than one inequality. Hoeffding, Bernstein, Bennett, and many large deviation upper bounds are obtained by choosing different MGF estimates. The method fails if exponential moments do not exist in the required direction.
31.5 Subgaussian random variables
A centered random variable X is subgaussian if there exists K>0 such that
E[eλX]≤eK2λ2/2for all λ∈R, or equivalently up to constants,
P(∣X∣≥t)≤2e−ct2/K2.The subgaussian norm is often written
∥X∥ψ2=inf{K>0:EeX2/K2≤2}.Gaussian variables and bounded centered variables are subgaussian.
Independent subgaussian sums remain subgaussian:
i∑aiXiψ2≲(i∑ai2∥Xi∥ψ22)1/2.This creates a reusable tail calculus for high-dimensional probability.
31.6 Subexponential random variables
A random variable is subexponential if its tail satisfies roughly
P(∣X∣≥t)≤2e−ct/Kfor sufficiently large t, or equivalently its Orlicz norm
∥X∥ψ1=inf{K>0:Ee∣X∣/K≤2}is finite.
Products of subgaussian variables are often subexponential. Sums of independent centered subexponential variables obey Bernstein-type bounds:
P(i∑Xi≥t)≤2exp[−cmin(∑iKi2t2,maxiKit)].The two-regime structure reflects Gaussian accumulation at moderate scale and single-large-term behavior farther into the tail.
31.7 Bounded differences
Suppose X1,…,Xn are independent and
Z=f(X1,…,Xn).Assume changing coordinate i alone can change f by at most ci:
∣f(x)−f(x′)∣≤ciwhenever x,x′ differ only in coordinate i.
Then Z is concentrated around its mean. The standard route constructs the Doob exposure martingale
Mi=E[Z∣X1,…,Xi].The martingale increments are controlled by ci, so Azuma–Hoeffding applies.
The method is valuable because it applies to nonlinear functions of many independent inputs. The essential primitive is sensitivity, not additivity.
31.8 McDiarmid inequality
Under the bounded differences assumptions,
P(Z−EZ≥t)≤exp(−∑ici22t2),and similarly for the lower tail.
Applications include graph statistics, randomized algorithms, occupancy statistics, empirical risks, permutation statistics, and combinatorial optimization. One reveals inputs sequentially and bounds how much a single revelation can alter the final quantity.
The weakness is worst-case sensitivity. If rare inputs can cause large changes but typical changes are small, McDiarmid may be too crude. Variance-sensitive or self-bounding inequalities then provide sharper alternatives.
31.9 Talagrand inequalities
Talagrand-type concentration inequalities give stronger control for product spaces by combining sensitivity with geometric or convex structure. Several inequivalent inequalities carry Talagrand's name. A representative phenomenon is that a Lipschitz convex function of independent bounded variables has subgaussian concentration around its median or mean.
Another version uses convex distance on product spaces:
P(A)P(Atc)≤e−ct2.The distance is designed to measure how many coordinates must change, with optimized weights.
Talagrand inequalities often improve bounded-difference estimates because they incorporate structure beyond maximum coordinate influence. They are central in empirical processes, random graphs, random combinatorics, and high-dimensional geometry.
31.10 Gaussian concentration
If G∼N(0,In) and f:Rn→R is L-Lipschitz, then
P(∣f(G)−Ef(G)∣≥t)≤2e−t2/(2L2)up to precise constants depending on formulation.
The dimension n does not appear explicitly. High-dimensional Gaussian measure concentrates because Lipschitz observables cannot vary rapidly enough to overcome the geometry of Gaussian mass.
Gaussian concentration can be proved through isoperimetry, log-Sobolev inequalities, Herbst's argument, or transportation inequalities. It is a prototype for dimension-free concentration.
31.11 Isoperimetric concentration
Isoperimetric inequalities relate boundary size to volume. In probability, they imply concentration because a set with probability at least 1/2 rapidly captures almost all mass when enlarged by a small metric radius.
For Gaussian measure γn, half-spaces minimize boundary measure among sets of fixed Gaussian measure. This produces Gaussian concentration:
γn(Arc)≤e−r2/2up to constants when γn(A)≥1/2.
The principle is:
isoperimetry→set concentration→Lipschitz-function concentration.Geometry of the carrier controls fluctuation of observables.
31.12 Log-Sobolev inequalities
A log-Sobolev inequality controls entropy by an energy form. For Gaussian measure, a standard form is
Ent(f2)≤2∫∣∇f∣2dγ.Here
Ent(g)=E[glogg]−E[g]logE[g].The Herbst argument converts a log-Sobolev inequality into exponential moment bounds and then into concentration inequalities. Thus functional inequalities become tail certificates.
Log-Sobolev inequalities are stronger than ordinary Poincaré inequalities. They imply hypercontractivity and strong convergence-to-equilibrium results for Markov semigroups.
31.13 Transport inequalities
A transportation-cost inequality compares a distance between probability measures to relative entropy. A basic example is
W2(ν,μ)2≤2CH(ν∣μ).Such inequalities imply concentration of Lipschitz functions.
The interpretation is that moving probability mass far from its reference distribution requires entropy cost. This unifies optimal transport, information theory, and concentration.
Transport inequalities can tensorize across independent products, making them powerful in high-dimensional systems. The carrier geometry is encoded by Wasserstein distance; probabilistic deviation is encoded by entropy.
31.14 Concentration versus asymptotics
Concentration inequalities and asymptotic limit theorems answer different questions. A CLT identifies the shape of typical normalized fluctuations:
nSn−nμ⇒N(0,σ2).Concentration gives finite-n tail bounds:
P(∣Sn−ESn∣≥t)≤explicit bound.A concentration bound may be nonsharp in its constants or exact rate function. A CLT may give no finite-n error. A large deviation principle may give the correct exponential rate but omit useful finite-sample prefactors.
The correct hierarchy is:
shape=finite bound=exponential rate.Chapter 32 — Empirical Processes
32.1 Empirical distribution functions
Given iid real observations X1,…,Xn with CDF F, the empirical distribution function is
Fn(x)=n1i=1∑n1{Xi≤x}.For each fixed x, the strong law gives
Fn(x)→F(x)almost surely.
The empirical CDF is itself a random function. Fixed-x convergence does not imply uniform convergence over all x; that requires a uniform law of large numbers. Empirical process theory studies random fluctuations indexed by classes of functions or sets.
32.2 Glivenko–Cantelli theorem
The Glivenko–Cantelli theorem states
x∈Rsup∣Fn(x)−F(x)∣→0almost surely.
This is stronger than pointwise LLN because the supremum ranges over an uncountable index set. The special monotonicity structure of CDFs permits uniform control.
A general class F is called Glivenko–Cantelli for a law P if
f∈Fsup∣Pnf−Pf∣→0,where
Pnf=n1i=1∑nf(Xi),Pf=E[f(X)].Uniform laws of large numbers are therefore empirical-process statements.
32.3 Donsker theorem
Define the empirical process
αn(x)=n(Fn(x)−F(x)).Donsker's theorem states that, after appropriate indexing and under standard conditions,
αn⇒B∘F,where B is a Brownian bridge.
For F continuous, the covariance is
Cov(B(F(x)),B(F(y)))=F(x∧y)−F(x)F(y).This is a functional CLT. Instead of one statistic converging to one Gaussian variable, an entire indexed process converges to a Gaussian process. Finite-dimensional convergence is insufficient; tightness in function space is required.
32.4 VC dimension
The Vapnik–Chervonenkis dimension of a class of sets A measures its combinatorial capacity. A finite set {x1,…,xm} is shattered if
{A∩{x1,…,xm}:A∈A}contains all 2m subsets. The VC dimension is the largest shattered-set size.
Finite VC dimension controls uniform deviations:
A∈Asup∣Pn(A)−P(A)∣.The reason is combinatorial: a low-VC class cannot realize too many distinct label patterns on a finite sample.
VC theory links probability concentration to statistical learning. Complexity of the function class determines sample size needed for uniform generalization.
32.5 Covering numbers
For a metric or pseudometric space (F,d), the covering number
N(ε,F,d)is the minimum number of d-balls of radius ε required to cover F.
In empirical processes, common metrics include
dP(f,g)=(E[(f−g)2])1/2or empirical versions. Covering numbers quantify effective size at scale ε.
A finite class has cardinality as crude complexity. An infinite class requires multiscale complexity. Covering numbers are the bridge:
infinite function class→finite approximations at each scale.32.6 Entropy integrals
Metric entropy is
logN(ε,F,d).Dudley-type entropy bounds control expected suprema of Gaussian or subgaussian processes:
Ef∈FsupXf≲∫0diam(F)logN(ε,F,d)dε.Entropy integrals convert geometric complexity into stochastic supremum bounds. The square root of log covering number reflects Gaussian fluctuation scale.
A class may be infinite and highly structured while still having a finite entropy integral. Conversely, excessive metric entropy can obstruct uniform convergence or tightness.
32.7 Symmetrization
Let
Pnf=n1i=1∑nf(Xi).To bound
Ef∈Fsup∣Pnf−Pf∣,introduce an independent ghost sample Xi′ or Rademacher signs εi. Symmetrization gives bounds of the form
Ef∈Fsup∣Pnf−Pf∣≤2Ef∈Fsupn1i=1∑nεif(Xi).This replaces an unknown population mean by a centered random-sign process conditional on the data. The resulting process is easier to analyze with concentration, contraction inequalities, and chaining.
Symmetrization is a carrier conversion:
empirical-minus-population→random signed empirical process.32.8 Rademacher complexity
The empirical Rademacher complexity of a function class F is
Rn(F)=Eε[f∈Fsupn1i=1∑nεif(Xi)X1,…,Xn].Its expectation over the sample gives population Rademacher complexity.
This quantity measures how well the class can align with random noise. A highly expressive class can fit random signs and has large complexity. Uniform generalization bounds often take the form
f∈Fsup∣Pnf−Pf∣≲Rn(F)+concentration term.Rademacher complexity adapts to actual function values and can be sharper than purely combinatorial VC bounds.
32.9 Chaining
Chaining controls the supremum of a stochastic process by approximating the index set at progressively finer scales. Instead of using one crude ε-net, construct nested nets
T0⊆T1⊆⋯and decompose
Xt−Xt0=k≥1∑(Xπk(t)−Xπk−1(t)).Each scale contributes a small increment but has more candidate points. Balancing increment size against combinatorial multiplicity produces entropy integrals or generic chaining bounds.
The method is one of the deepest general tools for suprema of Gaussian and subgaussian processes. It shows that one-scale union bounds are often structurally wasteful.
32.10 Uniform laws of large numbers
A uniform law of large numbers states
f∈Fsup∣Pnf−Pf∣→0in probability or almost surely. Unlike ordinary LLN, the index f may be selected after observing data, so pointwise convergence is insufficient.
Sufficient conditions use VC dimension, covering numbers, bracketing entropy, Rademacher complexity, or compactness. Envelope functions control integrability:
∣f(x)∣≤F(x)for all f∈F.
Uniform LLNs are the probability foundation of empirical risk minimization. They justify replacing population optimization
finfPfwith sample optimization
finfPnfonly when the function class is not too complex.
32.11 Statistical learning links
Statistical learning theory studies the gap between empirical and population performance:
R(f)−Rn(f).Uniform concentration permits data-dependent choice f^:
R(f^)≤Rn(f^)+f∈Fsup∣R(f)−Rn(f)∣.Complexity controls generalization. Too small a class creates approximation error; too large a class creates estimation error. This produces the bias-complexity tradeoff.
Probability theory contributes concentration, symmetrization, empirical-process bounds, and complexity measures. The learning problem adds optimization and model-selection structure. Training error alone is not a generalization certificate.
Chapter 33 — Random Graphs and Probabilistic Combinatorics
33.1 Erdős–Rényi models
The two classical Erdős–Rényi models are G(n,p), where each of the
(2n)possible edges appears independently with probability p, and G(n,m), where exactly m edges are chosen uniformly from all graphs with m edges.
In G(n,p), the edge count is
E∼Binomial((2n),p).The expected degree is
(n−1)p.Random graph theory studies global properties generated by local random edge decisions: connectivity, components, cycles, clique number, chromatic number, expansion, spectra, and subgraph counts.
The model changes regime as p=p(n) varies. Sparse and dense random graphs are not perturbations of one another; they have different structural carriers.
33.2 Thresholds
A graph property P has a threshold scale pc(n) if
P(G(n,p)∈P)→0when p≪pc, and tends to one when p≫pc, for monotone properties.
Thresholds arise when expected obstruction counts change scale. For isolated vertices,
EI=n(1−p)n−1≈ne−pn.This becomes order one around
p∼nlogn,the connectivity threshold scale.
Expectation suggests a threshold but does not prove sharp transition. Second moments, Poisson approximation, coupling, or sharp-threshold theorems are often required.
33.3 First moment method
For a nonnegative integer-valued random variable X,
P(X>0)≤E[X].Therefore, if
E[X]→0,then X=0 with high probability.
Conversely, to prove deterministic existence, if a random object has expected number of bad features less than one,
E[B]<1,then some realization has B=0.
The first moment method is asymmetric: it easily proves nonexistence of random structures when expected count vanishes and deterministic existence through averaging, but it generally cannot prove X>0 with high probability from EX→∞. Variance or dependence may dominate.
33.4 Second moment method
For X≥0,
P(X>0)≥E[X2](EX)2by Paley–Zygmund at threshold 0. Thus if
(EX)2Var(X)→0,then X>0 with high probability.
The method analyzes overlap among candidate structures. If
X=α∑Iα,then
E[X2]=α,β∑P(Iα=Iβ=1).Pairs (α,β) are classified by intersection pattern. The overlap geometry is the dependence ledger.
33.5 Lovász local lemma
The Lovász local lemma proves that a collection of individually unlikely bad events can simultaneously be avoided when dependencies are sufficiently sparse. A symmetric form states: if each event Ai satisfies
P(Ai)≤pand depends on at most d other events, then
ep(d+1)≤1implies
P(i⋂Aic)>0.The union bound requires
i∑P(Ai)<1,which can fail badly when there are many events. The local lemma exploits sparse dependence rather than total event count.
It is an existence theorem. Algorithmic versions such as Moser–Tardos provide constructive randomized procedures under suitable variable-event structure.
33.6 Janson inequalities
Janson inequalities estimate lower tails for counts of dependent rare structures built from independent underlying variables. Suppose
X=α∑Iαcounts substructures, with mean μ, and define an overlap-dependence parameter
Δ=α∼β∑E[IαIβ].Then estimates of the form
P(X=0)≤exp(−μ+2Δ)hold in standard settings.
Janson's method is suited to subgraph counts, patterns, and set systems. The key object is not independence of indicators but dependence generated by shared primitive variables.
33.7 Random regular graphs
A random d-regular graph is sampled uniformly from graphs where every vertex has degree d. Edges are not independent, so G(n,p) methods do not transfer automatically.
The configuration model constructs a multigraph by assigning d half-edges to each vertex and pairing all half-edges uniformly. Conditional on simplicity, the resulting graph is uniform over simple d-regular graphs.
The configuration model converts a constrained graph problem into random matching. One must then audit loops, multiple edges, conditioning on simplicity, and whether the probability of simplicity remains non-negligible in the parameter regime.
33.8 Percolation preview
Percolation can be viewed as a random graph on a fixed infinite underlying graph. Each edge or vertex is independently retained with probability p. The principal question is emergence of an infinite connected component.
Random graph giant-component transitions and lattice percolation share branching-process heuristics but differ geometrically. Locally tree-like random graphs are often approximated by branching processes. Lattices contain strong short-cycle and spatial structure.
Thus branching approximation is a carrier conversion whose validity depends on local geometry.
33.9 Phase transitions
In G(n,p), the scale
p∼n1marks the giant-component phase transition. If p=c/n with c<1, components are typically small. If c>1, a component of order n emerges.
The critical window is much narrower and has distinct scaling behavior. Near
p=n1+λn−4/3,component sizes occur on the n2/3 scale.
A phase transition is therefore not one number but a layered structure:
subcritical regime→critical window→supercritical regime.33.10 Branching-process approximations
Exploring a sparse random graph component from a vertex initially resembles a branching process. Each discovered vertex produces approximately
Binomial(n,p)new neighbors, approaching Poisson(c) when p=c/n.
The mean offspring c predicts the phase transition at c=1. But the approximation is local and breaks when many explored vertices create collisions or depletion. Branching processes certify early exploration, not the entire global graph.
The liftback requires controlling the scale at which tree-likeness persists.
33.11 Contiguity
Two sequences of probability measures Pn,Qn are contiguous if
Pn(An)→0⟺Qn(An)→0in mutual-contiguity form. One-sided contiguity uses only one implication.
Contiguity permits transfer of high-probability properties between random models even when their laws are not close in total variation. It is widely used between configuration models and uniform regular graphs, planted and unplanted models, and alternative random graph constructions.
Contiguity transports negligible events, not exact probabilities or expectations. It is weaker than total variation equivalence but stronger than mere heuristic similarity.
33.12 Random construction as existence certificate
The probabilistic method proves deterministic objects exist by showing a random construction has positive probability of satisfying required properties:
P(good object)>0⇒∃ good object.The probability model is temporary. The terminal conclusion is deterministic existence. Methods include first moment, alteration, second moment, local lemma, random greedy algorithms, entropy compression, and concentration.
The central firewall is:
positive probability existence=efficient algorithm.An existence certificate may provide no practical construction unless a constructive liftback is supplied.
Chapter 34 — Random Matrices
34.1 Wigner matrices
A Wigner matrix is a symmetric or Hermitian random matrix whose upper-triangular entries are independent up to symmetry, typically centered and variance-scaled:
Hij∼n−1/2ξij.The n−1/2 normalization keeps eigenvalues on an O(1) scale.
Random matrix theory asks about eigenvalue distributions, extreme eigenvalues, eigenvector delocalization, spacing statistics, and universality.
The entries may be independent, but eigenvalues are strongly dependent nonlinear functions of all entries. The spectral carrier is therefore not a collection of independent scalar variables.
34.2 Sample covariance matrices
Given a data matrix
X∈Rn×p,the sample covariance matrix is
S=n1X⊤X.When entries are random, one studies the spectrum as n,p→∞, often with
np→γ∈(0,∞).In fixed dimension, sample covariance converges to population covariance. In high dimension with p/n non-negligible, spectral distortion remains order one. Classical low-dimensional intuition fails.
Sample covariance matrices connect random matrix theory to multivariate statistics, principal component analysis, signal detection, and high-dimensional inference.
34.3 Spectral norm bounds
The spectral norm of a symmetric matrix is
∥H∥op=imax∣λi(H)∣.For random matrices, one seeks bounds such as
∥H∥op≲nfor unnormalized iid-entry matrices, or O(1) under Wigner scaling.
Methods include matrix concentration inequalities, ε-nets, trace moments, noncommutative Bernstein inequalities, and resolvent estimates.
Entrywise control is not enough. A matrix may have small entries but large operator norm through coherent alignment. Spectral norm is a global geometric observable.
34.4 Semicircle law
For a normalized Wigner matrix Hn, the empirical spectral distribution
μHn=n1i=1∑nδλiconverges to the semicircle distribution with density
ρsc(x)=2π14−x21[−2,2](x).This is a global spectral law. It describes the bulk distribution of eigenvalues after normalization. It does not by itself determine largest-eigenvalue fluctuations or microscopic spacing.
Proofs use moments or resolvents. The moment method identifies limiting moments with Catalan-number counts of noncrossing pairings.
34.5 Marchenko–Pastur law
For sample covariance matrices with p/n→γ, the empirical spectral law converges under standard assumptions to the Marchenko–Pastur distribution. Its absolutely continuous part has density
ργ(x)=2πγx(b−x)(x−a)1[a,b](x),where
a=(1−γ)2,b=(1+γ)2,with an additional atom at zero in the appropriate rectangular regime.
This law describes systematic high-dimensional eigenvalue spread even when the population covariance is identity. Random matrix effects are therefore not negligible sampling noise; they alter the whole spectral carrier.
34.6 Concentration of measure
Many random matrix observables concentrate because they are Lipschitz functions of high-dimensional random input. Examples include spectral norm, empirical spectral statistics, and singular values.
For Hermitian matrices,
∣λk(A)−λk(B)∣≤∥A−B∥opand Hoffman–Wielandt-type inequalities control aggregate eigenvalue displacement by Frobenius norm.
Concentration converts entry-level independence or log-concavity into spectral stability. But local eigenvalue statistics require much finer tools than global Lipschitz concentration.
34.7 Resolvent method
For a matrix H, define the resolvent
G(z)=(H−zI)−1,Imz>0.The normalized trace
mn(z)=n1TrG(z)is the Stieltjes transform of the empirical spectral measure:
mn(z)=∫x−z1μHn(dx).The resolvent converts spectral questions into analytic matrix identities. Self-consistent equations for mn(z) identify limiting laws. Fine control as Imz shrinks leads to local semicircle laws.
The resolvent is a carrier transformation:
eigenvalue point process→analytic function in upper half-plane.34.8 Trace method
The trace method studies moments
n1ETr(Hk)=n1i1,…,ik∑E[Hi1i2⋯Hiki1].Independence and centering eliminate most index patterns. Surviving walks correspond to combinatorial structures such as pairings and trees.
For fixed k, moments identify global spectral laws. For growing k, the method can bound extreme eigenvalues. However, high moments create complicated combinatorics and require careful control.
34.9 Universality preview
Universality means that many spectral statistics depend only weakly on the detailed entry distribution. After normalization, local eigenvalue spacing or edge fluctuations can converge to the same limit for broad classes of matrix ensembles.
Global universality is comparatively coarse: semicircle or Marchenko–Pastur laws require limited moment assumptions. Local universality is much sharper and needs stronger control, though modern methods often permit broad distributions.
The correct statement must specify scale:
global density=mesoscopic statistics=local spacing=edge fluctuations.34.10 Free probability preview
Large independent random matrices often become asymptotically free under suitable invariance or independence conditions. This means normalized traces of centered alternating products vanish asymptotically:
n1Tr(P1(An)Q1(Bn)⋯Pk(An)Qk(Bn))→0.Free probability then predicts spectral laws of sums and products through free convolution. It is an algebraic transport from matrix asymptotics to noncommutative probability.
The word “free” does not mean classically independent entries. It describes a limiting noncommutative factorization law.
34.11 Random matrices in statistics and physics
In statistics, random matrices govern high-dimensional covariance estimation, PCA, canonical correlation, spiked models, and noise spectra. The classical assumption p fixed while n→∞ can fail when p/n remains positive.
In physics, random matrices model complex spectra, quantum chaotic systems, disordered media, transport, and interacting systems. The reason is not that every Hamiltonian is literally random, but that spectral statistics may be universal under limited structural information.
Applications require a model-liftback audit. A random-matrix ensemble can capture spectral statistics without reproducing all microscopic mechanisms.
Part X — Ergodic, Information, and Statistical Probability
Chapter 35 — Ergodic Theory and Probability
35.1 Measure-preserving systems
A measure-preserving dynamical system is
(Ω,F,P,T),where T:Ω→Ω is measurable and
P(T−1A)=P(A)for every A∈F.
The orbit of ω is
ω,Tω,T2ω,…and an observable f generates a stationary sequence
Xn(ω)=f(Tnω).Measure preservation says the ensemble law remains invariant under evolution. It does not imply individual trajectories explore the whole state space, nor that time averages equal ensemble averages. Those require stronger gates.
35.2 Stationarity
A process (Xn) is strictly stationary if
(Xn1,…,Xnk)=d(Xn1+h,…,Xnk+h)for all finite index selections and allowed shifts h.
A measure-preserving system produces stationary processes by observation:
Xn=f∘Tn.Stationarity is distributional time-translation symmetry. It permits long-run analysis but does not imply independence, mixing, or ergodicity. A mixture of two stationary ergodic processes can be stationary but nonergodic.
35.3 Ergodicity
A measure-preserving transformation T is ergodic if every invariant event
T−1A=Amodulo null sets has probability 0 or 1.
Equivalent formulations include absence of nonconstant invariant L2 functions:
f∘T=f⇒f constant a.s.Ergodicity means the system cannot be decomposed into nontrivial invariant probabilistic components. It is not randomness and does not require mixing. Deterministic systems can be ergodic.
35.4 Birkhoff ergodic theorem
For a measure-preserving transformation T and f∈L1,
n1k=0∑n−1f(Tkω)→E[f∣I](ω)almost surely, where I is the invariant σ-algebra.
If T is ergodic, then
E[f∣I]=Ef,so
n1k=0∑n−1f(Tkω)→Efalmost surely.
The theorem does not say every trajectory samples every state uniformly. It says time averages of integrable observables converge to invariant-component averages.
35.5 Von Neumann mean ergodic theorem
Let U be the Koopman operator
Uf=f∘Ton L2. Since T is measure-preserving, U is an isometry. The mean ergodic theorem states
n1k=0∑n−1Ukf→PInvfin L2, where PInv is orthogonal projection onto the invariant subspace.
This is a Hilbert-space theorem. It gives mean convergence, whereas Birkhoff gives pointwise almost sure convergence under L1 assumptions.
The theorem exposes the operator-theoretic carrier:
dynamics→unitary/isometric operator→projection onto invariant modes.35.6 Mixing conditions
Mixing is stronger than ergodicity. Strong mixing can be expressed as
P(A∩T−nB)→P(A)P(B).This says distant-time events become asymptotically independent.
Other notions include weak mixing, α-mixing, ϕ-mixing, ρ-mixing, and absolute regularity. They differ in which event/function classes are controlled and at what rate.
Mixing rates influence CLTs, concentration, empirical-process convergence, and statistical estimation. Saying “the process mixes” is incomplete without declaring the mixing notion and rate.
35.7 Ergodic decomposition
A stationary probability measure can often be decomposed into ergodic components:
P=∫Pθν(dθ),where each Pθ is ergodic.
The invariant σ-algebra identifies the latent component. Time averages converge to component-specific expectations:
n1k=0∑n−1f(Tkω)→∫fdPΘ(ω).Thus nonergodicity does not necessarily mean time averages fail to converge. They may converge to random limits determined by the invariant component. The failure is not convergence but collapse to one universal ensemble mean.
35.8 Time average versus ensemble average
The time average of an observable along one trajectory is
fn(ω)=n1k=0∑n−1f(Tkω).The ensemble average is
Ef=∫fdP.Birkhoff gives
fn→E[f∣I].Only under ergodicity does this become Ef.
Thus the equation
time average=ensemble averageis not a definition of probability and not automatic stationarity. It is a theorem requiring invariant-component triviality for the chosen carrier.
35.9 Ergodicity breaking
Ergodicity breaking means the invariant σ-algebra is nontrivial or the system decomposes into dynamically distinct components. Different trajectories can have different long-run averages:
n→∞limfn(ω)=g(ω),with nonconstant invariant g.
In some applied literature, “ergodicity breaking” is used more broadly for aging, nonstationarity, metastability, or finite-time trapping. These are distinct mechanisms. A rigorous analysis must specify whether failure is due to nonergodic invariant decomposition, insufficient observation time, nonstationarity, or nonexistence of relevant averages.
35.10 Probabilistic interpretation errors
A common error is interpreting stationarity as ergodicity. Another is interpreting ergodicity as independence. A third is treating ensemble averages as automatically observable from one trajectory.
The correct hierarchy is
stationary⇒ergodic,ergodic⇒mixing,and
mixing⇒independence at finite lag.The carrier determines the theorem. Time sampling, ensemble sampling, and repeated independent experiments are different experimental structures.
Chapter 36 — Information-Theoretic Probability
36.1 Entropy
For a discrete random variable X with mass function p(x), Shannon entropy is
H(X)=−x∑p(x)logp(x).It measures uncertainty or coding cost relative to the logarithm base.
For a pair (X,Y),
H(X,Y)=−x,y∑p(x,y)logp(x,y).Conditional entropy is
H(X∣Y)=H(X,Y)−H(Y).Entropy is law-level. It does not measure one realized sample's surprise directly; the pointwise quantity is
−logp(X),whose expectation is entropy.
Differential entropy for continuous variables,
h(X)=−∫f(x)logf(x)dx,behaves differently and is not invariant under coordinate change.
36.2 Relative entropy
The relative entropy of P with respect to Q is
D(P∥Q)=∫log(dQdP)dPwhen P≪Q, and +∞ otherwise.
For discrete laws,
D(P∥Q)=x∑p(x)logq(x)p(x).Relative entropy is nonnegative:
D(P∥Q)≥0,with equality iff P=Q under standard identification. It is not symmetric and is not a metric.
Relative entropy measures change-of-measure cost, coding inefficiency, statistical distinguishability, and large-deviation cost. It is the rate function in Sanov's theorem.
36.3 Mutual information
Mutual information is
I(X;Y)=D(PX,Y∥PX⊗PY).Thus
I(X;Y)≥0,with equality iff X and Y are independent.
For discrete variables,
I(X;Y)=H(X)+H(Y)−H(X,Y)=H(X)−H(X∣Y).Mutual information measures dependence at the law level. Unlike covariance, it detects nonlinear dependence in general. It remains invariant under invertible transformations of variables under appropriate measurable formulations.
36.4 Pinsker inequality
Pinsker's inequality relates total variation distance to relative entropy:
∥P−Q∥TV≤21D(P∥Q)under natural logarithm convention.
Thus small relative entropy guarantees small discrepancy of event probabilities:
∣P(A)−Q(A)∣≤21D(P∥Q).The converse is false without further restrictions. Two laws can be close in total variation while relative entropy is large due to behavior on small-probability regions.
Pinsker is a transport from information divergence to probabilistic approximation.
36.5 Data processing inequality
If
X→Y→Zforms a Markov chain, then
I(X;Z)≤I(X;Y).Processing data cannot increase information about the original source.
More generally, for a measurable map or Markov kernel K,
D(PK∥QK)≤D(P∥Q).This is contraction of information divergence under stochastic transport.
The theorem formalizes loss under coarse-graining. A statistic cannot contain more information about a parameter than the full data from which it was computed.
36.6 Typical sets
For iid X1,…,Xn with entropy H(X), the asymptotic equipartition property says
−n1logP(X1,…,Xn)→H(X)almost surely under standard discrete assumptions.
Thus typical sequences have probabilities approximately
e−nH(X)and the number of typical sequences is approximately
enH(X).Typical sets reconcile probability concentration with combinatorial counting. Almost all mass lies in an exponentially large set whose members have approximately equal exponential probability.
36.7 Shannon source coding theorem
For a discrete memoryless source with entropy H(X), lossless coding below rate H(X) is asymptotically impossible, while rates above H(X) are achievable with vanishing error.
The theorem is built from typical sets. Since there are approximately
2nH(X)typical length-n sequences, one needs roughly nH(X) bits to index them.
Entropy is therefore operational: it is the asymptotic compression threshold. The result is not that every sequence can be compressed to entropy length; atypical sequences exist. The theorem is probabilistic and asymptotic.
36.8 Channel coding preview
A communication channel is a Markov kernel
PY∣X.For an input law PX, the mutual information is
I(X;Y).The channel capacity is
C=PXsupI(X;Y).The channel coding theorem states that communication rates below C are achievable with vanishing error under the channel model, while rates above C are impossible.
The probability carrier includes codebook, message distribution, channel law, decoder, and block length. Capacity is an asymptotic information-transport limit, not a claim about zero-error communication at finite block length.
36.9 Entropy method in probability
The entropy method derives concentration inequalities by bounding entropy of exponential transforms. For a random variable Z,
Ent(eλZ)=E[eλZλZ]−E[eλZ]logE[eλZ].Tensorization allows entropy of functions of independent variables to be decomposed or bounded by coordinatewise contributions. Differential inequalities for
logE[eλ(Z−EZ)]then yield concentration through Chernoff optimization.
This method connects log-Sobolev inequalities, bounded differences, self-bounding functions, and Gaussian concentration.
36.10 KL divergence as transport cost
Relative entropy can be interpreted as the cost of changing one law into another. In exponential tilting,
dPdQθ=eθX−Λ(θ),and
D(Qθ∥P)=θEQθ[X]−Λ(θ).The large-deviation rate function can be written variationally as the minimum entropy cost among laws forcing a constraint:
I(a)=ν:∫xdν=ainfD(ν∥μ).This makes rare events optimization problems over alternative probability laws. The least expensive law under which the rare event becomes typical controls the exponential probability.
36.11 Large deviations and entropy
Sanov's theorem states
P(Ln≈ν)≍e−nD(ν∥μ),where Ln is the empirical measure and μ the true iid law.
Cramér's theorem follows by contraction:
I(a)=ν:∫xdν=ainfD(ν∥μ).Thus entropy is not only a coding quantity. It is a geometric action governing empirical-law deviations. Information theory and large deviation theory share the same convex-duality carrier.
Chapter 37 — Statistical Inference as Probability Liftback
37.1 Statistical models
A statistical model is a family of probability laws
P={Pθ:θ∈Θ}.The observed data X are modeled as sampled from one unknown law Pθ.
The parameter θ may be finite-dimensional, infinite-dimensional, structural, or merely an index. A model specifies which distributions are considered possible; inference attempts to recover features of the generating law from data.
Model specification precedes estimation. If the true data law lies outside P, the problem becomes misspecified inference, not ordinary parameter estimation.
37.2 Sampling distributions
A statistic is a measurable function
T=T(X1,…,Xn).Its sampling distribution under parameter θ is
Lθ(T).Inference procedures are calibrated through these distributions. Standard errors, confidence intervals, test critical values, and asymptotic approximations all depend on the sampling law.
The observed statistic is one realized value. Its sampling distribution describes hypothetical repetition under the model. Confusing realized data with the random statistic is the same variable/value category error encountered in elementary probability.
37.3 Likelihood
Given an observed value x, the likelihood is
L(θ;x)∝pθ(x),viewed as a function of θ. For dominated models,
pθ=dμdPθrelative to a common dominating measure μ.
Likelihood is not a probability distribution over θ unless a prior and normalization are added. Multiplying L(θ;x) by a factor depending only on x does not change likelihood comparisons.
Maximum likelihood estimation chooses
θ^∈argθ∈ΘmaxL(θ;x).Existence and uniqueness of the maximizer are separate gates.
37.4 Sufficiency
A statistic T(X) is sufficient for θ if the conditional distribution of the full data given T does not depend on θ. Intuitively, T retains all model-relevant information about θ.
The factorization theorem gives a practical criterion in dominated models:
pθ(x)=gθ(T(x))h(x).Sufficiency is a carrier compression:
X→T(X)that preserves inferential information for the parameter family. It does not mean T preserves every property of the sample; only parameter-relevant likelihood information.
37.5 Exponential families
A regular exponential family has density
pθ(x)=h(x)exp(η(θ)⊤T(x)−A(θ)).Here T(x) is the sufficient statistic, η(θ) the natural parameter, and A(θ) the log-partition function.
Derivatives of A encode moments:
∇A(η)=Eη[T(X)],∇2A(η)=Covη(T(X)).Exponential families connect convex analysis, entropy, maximum likelihood, Bayesian conjugacy, and information geometry.
37.6 Estimators
An estimator is a measurable function of data:
θ^n=Tn(X1,…,Xn).Its quality may be evaluated by bias, variance, mean squared error, risk, consistency, asymptotic distribution, robustness, or minimax criteria.
An estimator is itself a random variable before data realization. Different estimators trade bias against variance and local efficiency against robustness.
The inference problem is not simply to produce a number. It is to certify behavior of the random mapping from samples to estimates under a declared model class.
37.7 Bias and variance
Bias is
Biasθ(θ^)=Eθ[θ^]−θ.Variance is
Varθ(θ^).Mean squared error decomposes as
Eθ[(θ^−θ)2]=Varθ(θ^)+Biasθ(θ^)2.Unbiasedness is not universally optimal. A slightly biased estimator can have much smaller MSE. The relevant criterion depends on loss function and parameter domain.
In high-dimensional inference, regularization deliberately introduces bias to control variance and instability.
37.8 Consistency
An estimator is consistent if
θ^n→θin probability under Pθ. Strong consistency uses almost sure convergence.
Consistency is a large-sample property. It does not quantify finite-sample error or convergence rate. An estimator can be consistent but practically poor at relevant sample sizes.
Consistency proofs often use uniform LLNs and identifiability:
Mn(θ)→M(θ)uniformly, with M uniquely optimized at the true parameter. Then argmax or argmin continuity transports criterion convergence to parameter convergence.
37.9 Asymptotic normality
Many regular estimators satisfy
n(θ^n−θ0)⇒N(0,V).For M-estimators, a typical expansion is
0=Ψn(θ^n)≈Ψn(θ0)+A(θ^n−θ0),leading to
n(θ^n−θ0)≈−A−1nΨn(θ0).A CLT for the score or estimating equation then yields asymptotic normality.
This route requires differentiability, identifiability, nonsingular derivative matrix, and stochastic remainder control. Writing a Taylor expansion is not enough; the remainder must be negligible at n−1/2 scale.
37.10 Fisher information
For a smooth parametric model with score
Sθ(X)=∂θ∂logpθ(X),Fisher information is
I(θ)=Eθ[Sθ(X)2]in one dimension, or
I(θ)=Eθ[SθSθ⊤]in multiple dimensions.
Under regularity,
I(θ)=−Eθ[∂θ2∂2logpθ(X)].Fisher information measures local curvature of likelihood and local distinguishability of nearby parameter laws.
37.11 Cramér–Rao bound
For an unbiased estimator T of a scalar parameter θ, under regularity conditions,
Varθ(T)≥In(θ)1,where for iid samples
In(θ)=nI1(θ).The bound follows from covariance between estimator and score plus Cauchy–Schwarz. It is a local lower bound under a specific model and regularity class.
It does not imply every estimator must obey the same finite-sample bound under bias, irregular models, boundary parameters, or different loss structures. The hypothesis ledger is essential.
37.12 Hypothesis testing
A test is a measurable decision rule
φ(X)∈[0,1]or {0,1}. For null H0 and alternative H1, type I error is
PH0(reject),and type II error is
PH1(fail to reject).The Neyman–Pearson lemma identifies the most powerful test for simple-versus-simple hypotheses using likelihood ratios:
p0(X)p1(X).Composite hypotheses require additional structure: generalized likelihood ratio, score tests, Wald tests, minimax testing, or Bayesian decision rules.
A p-value is not the probability that the null hypothesis is true. It is a tail probability computed under the null model.
37.13 Confidence intervals
A confidence procedure produces a random set C(X) satisfying
Pθ(θ∈C(X))≥1−αfor all θ in the intended parameter class.
Coverage is a property of the random procedure under repeated sampling, not a posterior probability for a fixed realized interval unless a Bayesian model is added.
Asymptotic confidence intervals use approximations such as
θ^n±z1−α/2nse.Validity requires asymptotic normality, consistent variance estimation, and sufficient sample size. Finite-sample coverage can differ substantially.
37.14 Bayesian posterior calculus
A prior π(dθ) and likelihood Pθ(dx) define a joint law
π(dθ)Pθ(dx).The posterior is the conditional distribution
π(dθ∣x).In density form,
π(θ∣x)∝pθ(x)π(θ).Bayesian inference is ordinary conditional probability on an expanded parameter-data carrier. The prior is part of the model. Posterior validity relative to the model does not by itself validate the prior or likelihood empirically.
Posterior consistency and asymptotic normality require separate theorems; Bayesian updating alone does not guarantee learning under misspecification or nonidentifiability.
37.15 Model misspecification
A model is misspecified when the true data law P0 is not in
{Pθ:θ∈Θ}.Estimators may then converge to pseudo-true parameters minimizing a discrepancy such as
θ∗=argθminD(P0∥Pθ).Standard errors derived under correct specification may become wrong. Sandwich covariance formulas often appear:
A−1BA−1,where curvature A and score variance B no longer coincide.
Misspecification is not automatically fatal, but the interpretation changes. One estimates the best approximation within a model class, not necessarily a true generative parameter.
37.16 Empirical versus formal probability
Formal probability says:
given Pθ, these consequences follow.Empirical inference asks:
does the observed system warrant Pθ?The gap is filled by design, calibration, diagnostics, robustness, replication, and mechanism. Mathematical correctness of an estimator under iid Gaussian data does not prove the actual data are iid Gaussian.
The full inference chain is
world→measurement→data-generating assumptions→probability model→statistical procedure→uncertainty certificate→domain liftback.Part XI — Advanced Carriers and Extensions
Chapter 38 — Coupling Theory
38.1 Couplings of probability measures
Given probability measures μ on S and ν on T, a coupling is a probability measure
γ∈P(S×T)with marginals μ,ν:
γ(A×T)=μ(A),γ(S×B)=ν(B).Equivalently, a coupling is a pair (X,Y) on one probability space satisfying
X∼μ,Y∼ν.The set of all couplings is
Π(μ,ν).Marginals do not determine joint behavior. Coupling theory studies how to choose a joint carrier to optimize equality, order, distance, coalescence, or another relation.
38.2 Maximal coupling
A maximal coupling maximizes
P(X=Y)among all couplings of μ,ν. For countable or suitably dominated laws,
γ∈Π(μ,ν)supP(X=Y)=1−∥μ−ν∥TV.Equivalently,
∥μ−ν∥TV=γ∈Π(μ,ν)infP(X=Y).This turns total variation distance into an optimal mismatch probability. The abstract distance between laws becomes a concrete joint-event probability.
38.3 Monotone coupling
For probability measures on an ordered space such as R, stochastic domination
μ⪯stνmeans
μ((t,∞))≤ν((t,∞))for all t, equivalently their CDFs satisfy the reverse order.
A monotone coupling constructs
X≤Ya.s.with X∼μ, Y∼ν. On R, quantile coupling does this:
X=Fμ−1(U),Y=Fν−1(U),with common U∼Uniform(0,1).
Monotone coupling converts distributional order into pathwise order.
38.4 Strassen theorem
Strassen-type theorems characterize when probability measures admit couplings supported on a specified relation. A typical order version states that
μ⪯stνunder suitable conditions iff there exists a coupling (X,Y) such that
X≤Ya.s.More generally, if R⊆S×T is closed, one seeks
γ(R)=1.Marginal inequalities over appropriate sets characterize existence of such a coupling.
Strassen's theorem is a law-to-relation liftback theorem.
38.5 Coupling inequalities
Any coupling gives probability-distance bounds. For a coupling (X,Y),
∥L(X)−L(Y)∥TV≤P(X=Y).For Wasserstein distance,
Wp(μ,ν)p≤E[d(X,Y)p]for any coupling, with equality after optimizing.
Coupling inequalities convert joint-path control into law-distance control. The quality of the estimate depends entirely on coupling design.
38.6 Coupling from the past
Coupling from the past is an exact sampling method for certain Markov chains. One uses shared randomness to run all possible initial states from a sufficiently remote past toward time zero. If all trajectories coalesce by time zero, their common value has exactly the stationary distribution.
The method avoids burn-in approximation. It converts coalescence into exact stationarity.
Feasibility depends on monotonicity, finite state space, or other structure allowing simultaneous control of all initial states. Coalescence must be certified, not assumed.
38.7 Wasserstein coupling
The p-Wasserstein distance is
Wp(μ,ν)=(γ∈Π(μ,ν)inf∫d(x,y)pdγ(x,y))1/p.An optimal coupling minimizes expected transport cost.
Unlike total variation, Wasserstein distance sees geometry. Two point masses at nearby locations have total variation distance one but small Wasserstein distance:
∥δx−δy∥TV=1for x=y, while
Wp(δx,δy)=d(x,y).This makes Wasserstein metrics appropriate for approximation where small spatial displacement should count as small error.
38.8 Coupling for Markov chains
To compare Markov chains started from x and y, construct
(Xn,Yn)with correct marginal transition laws. If the coupling is coalescent,
Xn=Yn⇒Xn+k=Yn+k,then the coupling time
T=inf{n:Xn=Yn}controls convergence:
∥Pn(x,⋅)−Pn(y,⋅)∥TV≤P(T>n).Couplings may be synchronous, reflection, maximal, monotone, or problem-specific. Good coupling design exploits geometry of the transition mechanism.
38.9 Coupling as liftback from law-level claims
A law-level statement may be too weak for a pathwise claim. Coupling constructs a common carrier where stronger comparisons become meaningful.
Examples:
μn⇒μ→Skorokhod coupling with a.s. convergence,μ⪯stν→X≤Y a.s. under monotone coupling,Wp(μ,ν) small→E[d(X,Y)p] small under near-optimal coupling.The firewall is that the coupling relation is newly constructed. It is not automatically true for the original variables.
Chapter 39 — Optimal Transport
39.1 Transport plans
Let μ and ν be probability measures on spaces X,Y. A transport plan is a coupling
γ∈Π(μ,ν).Given cost c(x,y), the Kantorovich transport problem is
γ∈Π(μ,ν)inf∫c(x,y)γ(dx,dy).The plan describes how mass at x is probabilistically distributed to destinations y. Unlike a deterministic map, one source point can split mass across many destinations.
Optimal transport is therefore coupling theory plus optimization over geometric cost.
39.2 Monge problem
The original Monge problem seeks a deterministic map
T:X→Ysuch that
T∗μ=νand minimizes
∫c(x,T(x))μ(dx).The problem is nonlinear and may have no solution because a map cannot split mass. For example, a point mass cannot be mapped deterministically into a diffuse distribution.
This is the primitive failure that motivates Kantorovich relaxation:
transport map→transport plan.39.3 Kantorovich relaxation
The Kantorovich problem replaces deterministic maps by arbitrary couplings:
γ∈Π(μ,ν)inf∫cdγ.The feasible set Π(μ,ν) is convex, and the objective is linear in γ.
This convexification creates existence and duality machinery. Under lower semicontinuity and tightness conditions, minimizers often exist.
However, a Kantorovich optimizer need not be induced by a map. Recovering Monge structure requires additional assumptions. Convex relaxation terminalizes the relaxed problem, not automatically the original deterministic one.
39.4 Duality
Kantorovich duality gives
γ∈Π(μ,ν)inf∫c(x,y)dγ=φ,ψsup{∫φdμ+∫ψdν:φ(x)+ψ(y)≤c(x,y)}.The dual potentials certify optimality. If equality holds
φ(x)+ψ(y)=c(x,y)on the support of an optimal plan, primal and dual structures match.
For cost c(x,y)=d(x,y), the dual simplifies to the Kantorovich–Rubinstein formula:
W1(μ,ν)=∥f∥Lip≤1sup∫fdμ−∫fdν.39.5 Wasserstein distances
For p≥1,
Wp(μ,ν)=(γ∈Π(μ,ν)inf∫d(x,y)pdγ)1/p.The space
Pp(X)consists of probability measures with finite p-th moment.
Wasserstein convergence is stronger than weak convergence:
Wp(μn,μ)→0implies weak convergence plus p-moment convergence under standard settings.
Wasserstein geometry turns probability distributions into points of a metric space whose metric reflects spatial displacement.
39.6 Brenier theorem
For quadratic cost
c(x,y)=∣x−y∣2on Rd, if the source measure μ is absolutely continuous, Brenier's theorem gives an optimal transport map
T=∇φfor a convex function φ, under suitable finite-moment conditions.
Thus the optimal Kantorovich plan is concentrated on the graph of a deterministic map:
γ=(id,T)∗μ.Convexity enters as the certificate for optimal map structure. The theorem is a relaxation liftback:
optimal plan→gradient of convex potential.39.7 Displacement convexity
Given an optimal transport map T, define interpolation
Tt=(1−t)id+tT,and
μt=(Tt)∗μ0.This is a Wasserstein geodesic.
A functional F is displacement convex if
F(μt)≤(1−t)F(μ0)+tF(μ1)along Wasserstein geodesics.
This differs from ordinary convexity under mixture:
(1−t)μ0+tμ1.The carrier geometry has changed; the relevant straight lines are transport geodesics.
39.8 Transport inequalities
A transport inequality relates Wasserstein distance to entropy:
W2(ν,μ)2≤2CD(ν∥μ).This says moving law μ to ν by a large geometric amount requires large information cost.
Such inequalities imply concentration. Tilting μ toward an event or large-Lipschitz deviation creates a law ν, and entropy-transport control bounds how far the tilt can move typical mass.
This connects:
entropy↔geometry↔concentration.39.9 Gradient flows in probability space
Many dissipative PDEs can be interpreted as gradient flows in Wasserstein space. The Fokker–Planck equation
∂tρ=∇⋅(ρ∇V)+Δρcan be viewed as gradient descent of the free-energy functional
F(ρ)=∫Vdρ+∫ρlogρdxunder the W2 metric.
This transforms PDE evolution into geometry on probability measures:
state=ρt,energy=F,metric=W2.The interpretation is not metaphorical when the variational structure is rigorously established through minimizing-movement schemes and metric-gradient-flow theory.
39.10 Probability geometry
Optimal transport reveals probability laws as geometric objects. Means, interpolations, barycenters, curvature, geodesics, gradient flows, and convexity can all be studied in measure space.
A Wasserstein barycenter minimizes
ν↦i∑wiW2(ν,μi)2.This is an analogue of Euclidean least squares for distributions.
The carrier warning is that different probability metrics induce different geometries. Total variation, KL divergence, Hellinger distance, and Wasserstein distance are not interchangeable. Each detects different deformations.
Chapter 40 — Point Processes
40.1 Counting measures
A counting measure has the form
N=i∑δXi,possibly with multiplicities. For a measurable set A,
N(A)counts points lying in A.
A point process is a random counting measure. Thus the state space is a space of locally finite measures, not merely a random vector of points. This formulation handles random numbers of points and unbounded domains uniformly.
Measurability and topology on configuration space matter for convergence. Vague topology is commonly used because test functions have compact support:
N↦∫fdN.40.2 Poisson point processes
A Poisson point process with intensity measure μ satisfies:
- for measurable A with μ(A)<∞,
- counts on disjoint sets are independent.
Its Laplace functional is
E[e−∫fdN]=exp(−∫(1−e−f(x))μ(dx))for nonnegative measurable f.
The Laplace functional characterizes the law. It plays the role of a transform for random measures.
40.3 Intensity measures
The intensity measure of a point process is
Λ(A)=E[N(A)].For a Poisson point process, intensity measure completely determines the law. For a general point process, it does not; higher-order dependence requires factorial moment measures or correlation functions.
If
Λ(dx)=λ(x)dx,then λ(x) is an intensity density. It describes expected local point density, not necessarily actual conditional occurrence rate.
Two point processes can have identical intensity and radically different clustering or repulsion.
40.4 Laplace functionals
For a random measure N, define
LN(f)=Eexp(−∫fdN),f≥0.Under suitable conditions, the Laplace functional determines the law of N.
For Poisson process intensity μ,
LN(f)=exp(−∫(1−e−f)dμ).Laplace functionals transform superposition of independent point processes into multiplication. They are the random-measure analogue of characteristic or probability generating functions.
40.5 Palm distributions
Palm distributions describe a point process viewed from or conditioned relative to a typical point. Informally, they answer:
what does the configuration look like given a point at x?For stationary point processes, Palm theory distinguishes a typical location from a typical point. These are not the same sampling procedures. A typical point is biased toward denser regions or larger structures depending on context.
Campbell formulas relate sums over process points to integrals against intensity and Palm distributions:
Ex∈N∑f(x,N)=∫Ex[f(x,N)]Λ(dx).Palm calculus is central in queueing, networks, stochastic geometry, and spatial statistics.
40.6 Cox processes
A Cox process, or doubly stochastic Poisson process, is a Poisson process conditional on a random intensity measure Λ. Given Λ,
N∣Λ∼PPP(Λ).Marginally, the process is not Poisson because randomness in intensity creates clustering and dependence between counts in disjoint regions:
Cov(N(A),N(B))can be positive through shared intensity fluctuations.
Cox processes model environmental heterogeneity, spatial clustering, insurance events, communication networks, and random media.
40.7 Renewal processes
A renewal process is generated by iid positive interarrival times
T1,T2,….Arrival epochs are
Sn=T1+⋯+Tn,and the counting process is
N(t)=max{n:Sn≤t}.The elementary renewal theorem gives
tE[N(t)]→E[T1]1under finite mean. Strong laws give
tN(t)→E[T1]1almost surely.
The Poisson process is the special renewal process with exponential interarrival times. Memorylessness makes it Markovian; general renewal processes retain age dependence.
40.8 Spatial point processes
Spatial point processes model random configurations in Rd or another geometric space. Major classes include Poisson, Cox, determinantal, Gibbs, cluster, and hard-core processes.
Clustering and repulsion are diagnosed through pair-correlation functions, Ripley's K-function, factorial moments, or conditional intensities.
The carrier includes geometry. Rotation invariance, translation stationarity, boundary effects, observation windows, and metric choice influence inference. Spatial point processes are not ordinary iid samples because point count and locations may be jointly random and dependent.
40.9 Random measures
A random measure is a measurable map
ω↦Mωinto a space of measures. Point processes are integer-valued random measures, but random measures also include diffuse objects such as Gaussian random measures, completely random measures, occupation measures, and empirical measures.
Integration against a random measure gives random variables:
∫fdM.The law of M can be studied through Laplace or characteristic functionals.
Random-measure language unifies point processes, Lévy noise, Bayesian nonparametric priors, branching limits, and stochastic integration.
40.10 Applications to geometry and networks
Point processes model wireless transmitters, stars, trees, cells, particles, earthquakes, traffic events, and network nodes. Geometry enters through nearest-neighbor distance, Voronoi tessellations, coverage, connectivity, interference, and percolation.
For a homogeneous Poisson process of intensity λ in Rd, void probabilities satisfy
P(N(B)=0)=e−λ∣B∣.This immediately gives nearest-neighbor distributions by taking B as a ball.
Network models add edges according to distance, random connection rules, or marks. This creates random geometric graphs and continuum percolation. The liftback from point process to network property requires geometric dependence analysis; node independence does not imply edge-event independence.
- Get link
- X
- Other Apps
Comments
Post a Comment