Probability Theory — Chapters 11–20
Chapter 11 — Lᵖ Spaces in Probability
11.1 L⁰, L¹, L², Lᵖ, L∞
Given a probability space $(\Omega, \mathcal{F}, P)$, the space $L^0$ consists of measurable random variables modulo almost sure equality. It has no natural norm in general, but it carries convergence in probability. For $0 < p < \infty$, the space $L^p$ consists of random variables satisfying
$$E|X|^p < \infty.$$
For $p \ge 1$, define
$$\|X\|_p = (E|X|^p)^{1/p}.$$
For $p = \infty$,
$$\|X\|_\infty = \operatorname{ess\,sup} |X|.$$
The word “essential” means null sets are ignored. Thus $L^p$ is not a space of literal functions but of equivalence classes under $X = Y$ almost surely.
On a probability space, higher $L^p$ integrability implies lower $L^q$ integrability:
$$q < p \;\Rightarrow\; L^p \subseteq L^q, \qquad \|X\|_q \le \|X\|_p.$$
This inclusion uses $P(\Omega) = 1$. On infinite-measure spaces, the inclusion can fail. Probability spaces therefore have a special integrability hierarchy: stronger moments control weaker moments.
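The norm ordering can be checked by hand on a finite space. The sketch below is illustrative only: the three-point random variable and the two-point space of total mass 2 are hypothetical examples, and `lp_norm` is a helper introduced here, not a library function. It computes $(\sum_i w_i |x_i|^p)^{1/p}$ for several $p$.

```python
from fractions import Fraction

def lp_norm(values, weights, p):
    """Return (sum_i weights[i] * |values[i]|**p) ** (1/p) as a float."""
    moment = sum(w * abs(v) ** p for v, w in zip(values, weights))
    return float(moment) ** (1.0 / p)

# Hypothetical random variable: X = 0, 1, 4 with probabilities 1/2, 1/4, 1/4.
values = [0, 1, 4]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
prob_norms = [lp_norm(values, probs, p) for p in (1, 2, 3, 4)]

# Same formula on a two-point space of total mass 2 (not a probability space),
# applied to the constant function 1.
ones = [1, 1]
counting = [1, 1]
mass2_norms = [lp_norm(ones, counting, p) for p in (1, 2)]

if __name__ == "__main__":
    print("probability space:", [round(x, 4) for x in prob_norms])
    print("total mass 2:     ", [round(x, 4) for x in mass2_norms])
```

On the probability space the norms are nondecreasing in $p$ and bounded by the essential supremum 4. On the space of total mass 2 the $p = 1$ value exceeds the $p = 2$ value, which is the sense in which the ordering depends on $P(\Omega) = 1$.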
11.2 Norms and seminorms modulo null sets
The expression
$$\|X\|_p = (E|X|^p)^{1/p}$$
is a true norm only after identifying random variables that agree almost surely. Before quotienting, it is merely a seminorm, since $\|X\|_p = 0$ implies $X = 0$ almost surely, not necessarily pointwise. The quotient operation is not cosmetic; it is forced by measure theory.
The $L^p$ carrier therefore has this form:
$$L^p(\Omega, \mathcal{F}, P) = \{X : E|X|^p < \infty\} \,/\, \{X = Y \text{ a.s.}\}.$$
Any theorem stated in $L^p$ is automatically modulo null sets. A conditional expectation, an $L^2$ projection, or a martingale representative is not uniquely defined pointwise unless a version is selected. The null-set quotient is the hidden grammar of modern probability.
11.3 Completeness
For $p \ge 1$, $L^p$ is a Banach space: every Cauchy sequence in $\|\cdot\|_p$ converges in $L^p$ to an element of $L^p$. Completeness is what allows limiting constructions. If $X_n$ is Cauchy in $L^p$, then there exists $X \in L^p$ such that
$$\|X_n - X\|_p \to 0.$$
The proof uses subsequences and almost sure convergence: from an $L^p$-Cauchy sequence one extracts a rapidly Cauchy subsequence, controls the sum of increments, obtains almost sure convergence, and then returns to $L^p$. This pattern is common in probability: norm control produces subsequential pathwise control, then integration reconstructs the full convergence.
11.4 Hilbert-space structure of L²
The space $L^2$ is a Hilbert space with inner product
$$\langle X, Y \rangle = E[XY]$$
in the real case, or $E[X\overline{Y}]$ in the complex case. The norm satisfies
$$\|X\|_2^2 = E[|X|^2].$$
Orthogonality means $E[XY] = 0$. Centered variables $X - EX$ and $Y - EY$ are orthogonal exactly when their covariance is zero.
This Hilbert structure powers conditional expectation, martingale theory, Gaussian processes, orthogonal decompositions, and least-squares prediction. In $L^2$, probabilistic estimation becomes geometry: the best approximation is an orthogonal projection onto a closed subspace.
11.5 Orthogonality and projection
If $H \subseteq L^2$ is a closed linear subspace and $X \in L^2$, there is a unique $Y \in H$ minimizing
$$\|X - Y\|_2.$$
The error $X - Y$ is orthogonal to every element of $H$:
$$E[(X - Y)Z] = 0, \qquad Z \in H.$$
This is the projection theorem.
In probability, $H$ often consists of all square-integrable random variables measurable with respect to a sub-σ-algebra $\mathcal{G}$. The projection of $X$ onto $H$ is $E[X \mid \mathcal{G}]$. Thus conditional expectation is not merely an average; in $L^2$, it is the best prediction of $X$ using information $\mathcal{G}$.
11.6 Conditional expectation as projection
For $X \in L^1$, conditional expectation $E[X \mid \mathcal{G}]$ is the $\mathcal{G}$-measurable random variable satisfying
$$\int_G E[X \mid \mathcal{G}]\, dP = \int_G X\, dP$$
for every $G \in \mathcal{G}$. When $X \in L^2$, this object is the orthogonal projection of $X$ onto $L^2(\mathcal{G})$.
This projection interpretation explains the tower property:
$$E[E[X \mid \mathcal{G}] \mid \mathcal{H}] = E[X \mid \mathcal{H}]$$
when $\mathcal{H} \subseteq \mathcal{G}$. Projecting first onto a larger information space and then onto a smaller one gives the same result as projecting directly onto the smaller one.
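On a finite sample space, where every σ-algebra is generated by a partition, both the orthogonality of the residual and the tower property can be verified exactly. The following is a minimal sketch under that assumption; the six-point uniform space, the variable $X(\omega) = (\omega + 1)^2$, and the two partitions are hypothetical choices made for illustration.

```python
from fractions import Fraction

def cond_exp(x, probs, partition):
    """E[X | sigma(partition)] as a list indexed by outcome:
    the P-weighted average of x over the cell containing each outcome."""
    result = [None] * len(x)
    for cell in partition:
        mass = sum(probs[w] for w in cell)
        avg = sum(probs[w] * x[w] for w in cell) / mass
        for w in cell:
            result[w] = avg
    return result

def expect(x, probs):
    """E[X] on the finite space."""
    return sum(p * v for p, v in zip(probs, x))

probs = [Fraction(1, 6)] * 6
x = [Fraction((w + 1) ** 2) for w in range(6)]
fine = [[0, 1], [2, 3], [4, 5]]      # generates the larger sigma-algebra G
coarse = [[0, 1, 2, 3], [4, 5]]      # generates H, contained in G

y = cond_exp(x, probs, fine)
residual = [a - b for a, b in zip(x, y)]

# Orthogonality: E[(X - Y) 1_G] = 0 for every cell G of the fine partition.
inner_products = [
    expect([residual[w] if w in cell else Fraction(0) for w in range(6)], probs)
    for cell in fine
]

# Tower property: E[E[X | G] | H] = E[X | H].
tower_lhs = cond_exp(y, probs, coarse)
tower_rhs = cond_exp(x, probs, coarse)
```

Indicators of the fine cells span the $\mathcal{G}$-measurable functions on this space, so orthogonality against them is enough to identify `y` as the projection.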
11.7 Uniform integrability in L¹
A family $\mathcal{X} \subset L^1$ is uniformly integrable if
$$\lim_{K \to \infty} \sup_{X \in \mathcal{X}} E\big[|X| \mathbf{1}_{\{|X| > K\}}\big] = 0.$$
This condition prevents expectation from hiding in rare but large values. Boundedness in $L^p$ for some $p > 1$ implies uniform integrability by Hölder or Markov estimates.
Uniform integrability is the correct compactness substitute in $L^1$. It is also the missing bridge between convergence in probability and convergence of expectations. If $X_n \to X$ in probability and $\{X_n\}$ is uniformly integrable, then $X \in L^1$ and
$$E|X_n - X| \to 0.$$
Without uniform integrability, expectations can fail to converge even when random variables converge in probability.
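The tail condition can be made concrete with two-valued variables, for which $E[|X| \mathbf{1}_{\{|X| > K\}}]$ has a closed form. The sketch below compares two hypothetical families: $X_n = n$ with probability $1/n$ (bounded in $L^1$ only) and $Y_n = \sqrt{n}$ with probability $1/n$ (bounded in $L^2$). The supremum is truncated at `n_max`, an assumption of the sketch rather than part of the definition.

```python
import math

def tail_mass(value, prob, K):
    """E[|X| 1{|X| > K}] when X = value with probability prob and 0 otherwise."""
    return value * prob if value > K else 0.0

def spike(n):
    """X_n = n with probability 1/n, so E[X_n] = 1 for every n."""
    return float(n), 1.0 / n

def damped(n):
    """Y_n = sqrt(n) with probability 1/n, so E[Y_n^2] = 1 for every n."""
    return math.sqrt(n), 1.0 / n

def sup_tail(family, K, n_max=100_000):
    """Truncated supremum over n <= n_max of the tail mass above level K."""
    return max(tail_mass(*family(n), K) for n in range(1, n_max + 1))

if __name__ == "__main__":
    for K in (1, 10, 30):
        print(K, round(sup_tail(spike, K), 6), round(sup_tail(damped, K), 6))
```

For the first family the tail mass stays at 1 for every level $K$, so it is not uniformly integrable. For the second the tail mass is at most $1/K$, consistent with the Markov-type estimate from the $L^2$ bound.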
11.8 Lᵖ convergence versus probability convergence
Convergence in $L^p$ means
$$E|X_n - X|^p \to 0.$$
For $p > 0$, $L^p$ convergence implies convergence in probability, because Markov’s inequality gives
$$P(|X_n - X| > \varepsilon) \le \frac{E|X_n - X|^p}{\varepsilon^p}.$$
The converse is false without additional integrability control.
Different $L^p$ modes have different strengths. On a probability space, if $p > q > 0$, then $L^p$ convergence implies $L^q$ convergence:
$$\|X_n - X\|_q \le \|X_n - X\|_p.$$
But convergence in probability alone is only a typical-value statement; it does not control tails. That is why $L^p$ spaces encode moment-sensitive convergence, not just distributional behavior.
11.9 Hypercontractivity preview
Hypercontractivity refers to operators that improve integrability: an operator $T$ may map $L^p$ into $L^q$ with $q > p$, satisfying
$$\|Tf\|_q \le C\|f\|_p.$$
In probability, this phenomenon appears for noise operators, Gaussian semigroups, Boolean functions, Markov semigroups, and log-Sobolev inequalities.
The conceptual value is that smoothing or randomization can upgrade weak moment control into stronger moment control. Hypercontractivity becomes a certificate engine for concentration, tail bounds, invariance principles, and analysis of high-dimensional random structures. It converts functional inequalities into probabilistic regularity.
11.10 Functional-analytic probability carrier
The functional-analytic view treats random variables as elements of normed, Banach, or Hilbert spaces. Expectation becomes a linear functional, conditional expectation becomes projection, martingales become adapted sequences in $L^p$, and convergence theorems become compactness or boundedness principles.
This carrier is powerful because it exposes the geometry behind probability. $L^1$ controls mass, $L^2$ controls energy and orthogonality, $L^\infty$ controls uniform bounds, and $L^p$ norms interpolate between tail and moment behavior. The main firewall is that functional-analytic equality is usually equality modulo null sets; pointwise claims require additional version control.
Chapter 12 — Independence
12.1 Independence of events
Events $A$ and $B$ are independent if
$$P(A \cap B) = P(A)P(B).$$
Equivalently, if $P(B) > 0$,
$$P(A \mid B) = P(A).$$
The event $B$ gives no probabilistic information about $A$. Independence is not disjointness; disjoint positive-probability events are maximally incompatible, not independent.
For a family $\{A_i\}_{i \in I}$, mutual independence means every finite subfamily factors:
$$P\Big(\bigcap_{j=1}^{k} A_{i_j}\Big) = \prod_{j=1}^{k} P(A_{i_j}).$$
The finite-subfamily condition is crucial even for infinite families. Independence is always certified by finite joint factorization, not by vague unrelatedness.
12.2 Independence of σ-algebras
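Sub-σ-algebras $\mathcal{G}$ and $\mathcal{H}$ are independent if
$$P(G \cap H) = P(G)P(H)$$
for all $G \in \mathcal{G}$, $H \in \mathcal{H}$. For a family $\{\mathcal{G}_i\}$, mutual independence means every finite selection of events $G_i \in \mathcal{G}_i$ factors.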
This is the cleanest information-theoretic form of independence. A σ-algebra represents information. Independence of σ-algebras means information in one σ-algebra does not change probabilities of events in the other. Random-variable independence is defined through the σ-algebras generated by those variables.
12.3 Independence of random variables
Random variables $X_i : \Omega \to S_i$ are independent if the σ-algebras $\sigma(X_i)$ are independent. Equivalently, for measurable sets $B_i \subseteq S_i$,
$$P(X_1 \in B_1, \dots, X_n \in B_n) = \prod_{i=1}^{n} P(X_i \in B_i).$$
For real-valued variables, it is enough to check rectangles of the form $(-\infty, t_i]$.
In law language, independence means the joint law factors:
$$\mu_{(X_1, \dots, X_n)} = \mu_{X_1} \otimes \cdots \otimes \mu_{X_n}.$$
This is the most compact certificate. Marginal laws alone do not imply independence; the joint law must be product.
12.4 Factorization of joint laws
If $X$ and $Y$ have joint law $\gamma$ on $S \times T$, then they are independent iff
$$\gamma = \mu_X \otimes \mu_Y.$$
For densities, this becomes
$$f_{X,Y}(x, y) = f_X(x) f_Y(y)$$
almost everywhere. For discrete variables, it becomes
$$P(X = x, Y = y) = P(X = x)P(Y = y)$$
for all values $x, y$.
The phrase “almost everywhere” matters. Density factorizations are measure-level claims. Changing a density on a null set changes the pointwise formula but not the law. The true carrier is the measure factorization, not a chosen version of the density.
12.5 Product measures
Given probability measures $\mu$ on $S$ and $\nu$ on $T$, their product measure $\mu \otimes \nu$ is characterized by
$$(\mu \otimes \nu)(A \times B) = \mu(A)\nu(B).$$
It extends uniquely under standard measure-theoretic hypotheses from rectangles to the product σ-algebra.
Product measure constructs independent joint randomness from marginal laws. If X(s,t)=s and Y(s,t)=t on S×T with law μ⊗ν, then X∼μ, Y∼ν, and X,Y are independent. Dependence is exactly deviation from this product carrier.
12.6 Infinite independent families
An infinite family $\{X_i\}_{i \in I}$ is independent if every finite subfamily is independent. For countable products, one constructs the law on $\prod_i S_i$ using finite-dimensional cylinder probabilities:
$$P(X_{i_1} \in A_1, \dots, X_{i_k} \in A_k) = \prod_{j=1}^{k} \mu_{i_j}(A_j).$$
Infinite independence is the foundation of coin-flip sequences, iid samples, random walks, and product processes. It produces tail events, zero-one laws, and limit theorems. The finite-dimensional definition is not a weakness; countable probability itself is built by extension from finite-dimensional consistency.
12.7 Pairwise versus mutual independence
Pairwise independence requires every pair to factor. Mutual independence requires every finite collection to factor. Pairwise independence does not control higher-order interactions. For example, let X,Y be independent fair bits and set Z=X⊕Y. Then X,Y,Z are pairwise independent, but not mutually independent because Z is determined by X,Y.
This distinction is load-bearing in variance computations, random constructions, hashing, and pseudorandomness. Pairwise independence may be sufficient for second-moment estimates, but not for Chernoff bounds, product distributions, or full limit theorems. The independence strength must match the theorem.
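The XOR example is small enough to enumerate. The sketch below lists the four equally likely outcomes $(x, y, x \oplus y)$ and tests the factorization for every pair and for the triple; the helper names `p` and `factors` are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: (x, y) uniform on {0,1}^2, recorded together with z = x xor y.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
prob = Fraction(1, len(outcomes))

def p(event):
    """Probability of the set of outcomes satisfying the predicate `event`."""
    return sum((prob for w in outcomes if event(w)), Fraction(0))

def factors(indices, values):
    """True if P(all coordinates match) equals the product of the marginals."""
    joint = p(lambda w: all(w[i] == v for i, v in zip(indices, values)))
    marginal_product = Fraction(1)
    for i, v in zip(indices, values):
        marginal_product *= p(lambda w, i=i, v=v: w[i] == v)
    return joint == marginal_product

pairwise = all(
    factors(pair, vals)
    for pair in [(0, 1), (0, 2), (1, 2)]
    for vals in product((0, 1), repeat=2)
)
mutual = all(factors((0, 1, 2), vals) for vals in product((0, 1), repeat=3))

if __name__ == "__main__":
    print("pairwise independent:", pairwise)
    print("mutually independent:", mutual)
```

Every pair factors, but the triple does not: for instance $P(X = 0, Y = 0, Z = 1) = 0$ while the product of marginals is $1/8$.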
12.8 Conditional independence
Events or random variables $X$ and $Y$ are conditionally independent given $\mathcal{G}$ if
$$P(X \in A, Y \in B \mid \mathcal{G}) = P(X \in A \mid \mathcal{G})\, P(Y \in B \mid \mathcal{G})$$
almost surely, for all measurable $A, B$. Conditional independence says that after the information $\mathcal{G}$ is known, no residual dependence remains.
Conditional independence is not the same as unconditional independence. Two variables can be dependent marginally but conditionally independent given a latent variable; this is common in Bayesian networks and mixture models. Conversely, conditioning can create dependence through selection effects. Conditional independence is a σ-algebra-relative factorization claim.
12.9 Exchangeability
A sequence $X_1, X_2, \dots$ is exchangeable if its finite-dimensional laws are invariant under finite permutations:
$$(X_1, \dots, X_n) \overset{d}{=} (X_{\pi(1)}, \dots, X_{\pi(n)})$$
for every permutation $\pi$ of $\{1, \dots, n\}$. Iid sequences are exchangeable, but exchangeability is weaker.
Exchangeability encodes symmetry without independence. It says order carries no information, but dependence may remain. A typical example is drawing from a random parameter: conditional on Θ, the variables are iid, but marginally they are dependent. This leads to de Finetti-type representation.
12.10 De Finetti’s theorem preview
De Finetti’s theorem states, in one classical form, that an infinite exchangeable sequence of Bernoulli random variables is conditionally iid given a random parameter $\Theta \in [0, 1]$. That is,
$$P(X_1 = x_1, \dots, X_n = x_n) = \int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, \nu(d\theta)$$
for some mixing measure $\nu$.
The theorem converts exchangeability into mixture-of-iid structure. This is a major carrier transformation: symmetry under permutations becomes conditional independence over a latent law. It is central to Bayesian statistics, representation theory of random sequences, and probabilistic symmetry principles.
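A worked instance of the mixture formula, assuming the mixing measure $\nu$ is uniform on $[0, 1]$ (a choice made here for illustration): the integral is then a Beta integral, $\int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = k!\,(n-k)!/(n+1)!$ with $k = \sum x_i$. The sketch checks that the resulting law is a probability law, is exchangeable, and is not a product law.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial

def mixture_prob(bits):
    """P(X_1 = bits[0], ..., X_n = bits[-1]) under a uniform mixing measure,
    via the Beta integral k! (n-k)! / (n+1)! with k = sum(bits)."""
    n, k = len(bits), sum(bits)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

n = 4
total = sum(mixture_prob(b) for b in product((0, 1), repeat=n))

# Exchangeability: permuting coordinates leaves the probability unchanged.
exchangeable = all(
    mixture_prob(tuple(b[i] for i in perm)) == mixture_prob(b)
    for b in product((0, 1), repeat=n)
    for perm in permutations(range(n))
)

# Dependence: P(X1 = 1, X2 = 1) differs from P(X1 = 1) * P(X2 = 1).
p_first = mixture_prob((1,))
p_both = mixture_prob((1, 1))
```

Here `p_first` is $1/2$ and `p_both` is $1/3$, not $1/4$: the coordinates are exchangeable and conditionally iid given $\Theta$, but marginally dependent.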
12.11 Independence as carrier certificate
Independence is a certificate about the joint carrier. It asserts that the joint law is product, or that relevant σ-algebras factor. Without this certificate, products of probabilities, multiplication of MGFs, convolution formulas, Chernoff bounds, and many limit theorems are not licensed.
The carrier view prevents a common error: variables with unrelated names are not automatically independent. Independence must come from construction, assumption, product measure, randomized mechanism, or proven factorization. In advanced probability, most work is spent either exploiting independence or replacing it with weaker dependence-control structures.
12.12 False independence and coupling errors
False independence arises when marginal information is mistaken for joint information. Knowing X∼μ and Y∼ν does not determine P(X≤Y), E[XY], or P(X=Y). Those require a coupling. If the coupling is product, the variables are independent; if not, dependence must be analyzed.
Another error is assuming pairwise independence gives mutual independence, or assuming conditional independence survives marginalization. A third is using independent copies without constructing them on a product extension. The general rule is:
joint expression requires joint carrier.
No joint carrier, no joint probability.
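The dependence of joint quantities on the coupling can be seen with two joint laws that share the same marginals. In this sketch both $X$ and $Y$ are uniform on $\{0, 1, 2\}$ (a hypothetical choice); one coupling is the product law and the other puts all mass on the diagonal.

```python
from fractions import Fraction

points = (0, 1, 2)
third = Fraction(1, 3)

# Two joint laws on {0,1,2}^2 with the same uniform marginals.
product_coupling = {(x, y): third * third for x in points for y in points}
diagonal_coupling = {
    (x, y): (third if x == y else Fraction(0)) for x in points for y in points
}

def marginals(gamma):
    """Return the laws of X and Y under the joint law gamma."""
    mx = {x: sum(gamma[(x, y)] for y in points) for x in points}
    my = {y: sum(gamma[(x, y)] for x in points) for y in points}
    return mx, my

def summary(gamma):
    """Joint quantities that the marginals alone do not determine."""
    return {
        "P(X=Y)": sum(p for (x, y), p in gamma.items() if x == y),
        "P(X<=Y)": sum(p for (x, y), p in gamma.items() if x <= y),
        "E[XY]": sum(p * x * y for (x, y), p in gamma.items()),
    }

if __name__ == "__main__":
    for name, gamma in [("product", product_coupling), ("diagonal", diagonal_coupling)]:
        print(name, summary(gamma))
```

The marginals agree, yet $P(X = Y)$, $P(X \le Y)$, and $E[XY]$ all differ between the two couplings.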
Chapter 13 — Product Measures and Fubini Theory
13.1 Product σ-algebras
For measurable spaces $(S, \mathcal{S})$ and $(T, \mathcal{T})$, the product σ-algebra is
$$\mathcal{S} \otimes \mathcal{T} = \sigma(\{A \times B : A \in \mathcal{S},\ B \in \mathcal{T}\}).$$
It is the event space generated by measurable rectangles. Coordinate projections are measurable by construction.
Product σ-algebras are the natural carrier for joint random variables. If $X : \Omega \to S$ and $Y : \Omega \to T$, then $(X, Y) : \Omega \to S \times T$ is measurable with respect to $\mathcal{S} \otimes \mathcal{T}$. Joint laws live on this product event structure.
13.2 Product measures
Given σ-finite measures $\mu$ and $\nu$, the product measure $\mu \otimes \nu$ is the unique measure on $\mathcal{S} \otimes \mathcal{T}$ satisfying
$$(\mu \otimes \nu)(A \times B) = \mu(A)\nu(B).$$
For probability measures, it defines the law of independent coordinates.
Product measure is not just a convenience. It is the mathematical object behind independent sampling. If one says “sample X∼μ and independently sample Y∼ν,” the joint law is μ⊗ν. Any other joint law with the same marginals is a coupling but not independent.
13.3 Tonelli’s theorem
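Tonelli’s theorem states that if $\mu$ and $\nu$ are σ-finite and $f : S \times T \to [0, \infty]$ is measurable with respect to the product σ-algebra, then
$$\int_{S \times T} f\, d(\mu \otimes \nu) = \int_S \Big( \int_T f(s, t)\, \nu(dt) \Big) \mu(ds) = \int_T \Big( \int_S f(s, t)\, \mu(ds) \Big) \nu(dt).$$
No integrability assumption is needed beyond nonnegativity; the value may be infinite.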
Tonelli is the theorem of safe rearrangement for nonnegative quantities. It justifies summing or integrating in either order when no cancellation is possible. Many expectation identities for counts, occupation times, and nonnegative random fields are Tonelli arguments.
13.4 Fubini’s theorem
Fubini’s theorem extends Tonelli to signed or complex functions under the integrability condition
$$\int |f|\, d(\mu \otimes \nu) < \infty.$$
Then iterated integrals exist almost everywhere, are integrable, and
$$\int f\, d(\mu \otimes \nu) = \int\!\!\int f(s, t)\, \nu(dt)\, \mu(ds) = \int\!\!\int f(s, t)\, \mu(ds)\, \nu(dt).$$
The integrability hypothesis is load-bearing. Without absolute integrability, changing order can change the value or produce undefined expressions. In probability, Fubini licenses identities such as $E[E[X \mid Y]] = E[X]$ and many conditioning calculations, but only when the relevant integrability conditions hold.
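A standard way to see that the integrability hypothesis matters is a double array on $\mathbb{N} \times \mathbb{N}$ with counting measure. The example below is an illustration chosen here, not taken from the text: $a(m, n) = 1$ if $m = n$, $-1$ if $m = n + 1$, and $0$ otherwise. Each inner sum has finite support, so it is computed exactly; the outer sums are partial sums up to `N`, which turn out to be constant in `N`.

```python
def a(m, n):
    """Signed array on N x N: +1 on the diagonal, -1 just below it."""
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

def inner_over_m(n):
    """Sum over all m of a(m, n); the support is m in {n, n + 1}."""
    return sum(a(m, n) for m in range(n + 2))

def inner_over_n(m):
    """Sum over all n of a(m, n); the support is n in {m - 1, m}."""
    return sum(a(m, n) for n in range(m + 1))

N = 1000
sum_n_then_m = sum(inner_over_m(n) for n in range(N))   # columns summed first
sum_m_then_n = sum(inner_over_n(m) for m in range(N))   # rows summed first

# Total absolute mass over the first N columns grows without bound in N.
abs_mass = sum(abs(a(m, n)) for n in range(N) for m in range(n + 2))

if __name__ == "__main__":
    print(sum_n_then_m, sum_m_then_n, abs_mass)
```

The two iterated sums are 0 and 1, while the absolute mass is $2N$ and diverges. Tonelli does not apply because the array changes sign, and Fubini does not apply because $\sum |a| = \infty$.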
13.5 Iterated expectation
If $X$ is integrable on a product probability space $(S \times T, \mu \otimes \nu)$, then
$$E[X] = \int_S E[X(s, \cdot)]\, \mu(ds) = \int_T E[X(\cdot, t)]\, \nu(dt).$$
This is expectation computed one coordinate at a time.
Iterated expectation generalizes to conditional expectation:
$$E[X] = E[E[X \mid \mathcal{G}]].$$
The inner expectation averages over unresolved randomness; the outer expectation averages over the conditioning information. This is the measure-theoretic form of “average the conditional averages.”
13.6 Independent product construction
To construct independent random variables with given laws $\mu_i$, take the product space
$$\Omega = \prod_i S_i$$
with product measure $\bigotimes_i \mu_i$, and define coordinate maps
$$X_i(\omega) = \omega_i.$$
Then the $X_i$ are independent and $X_i \sim \mu_i$.
This construction proves that independent copies exist under standard measurable-space hypotheses. It also clarifies that an “independent copy of X” is not created inside the original sample space automatically. One may need to extend the probability space to carry the copy.
13.7 Infinite products
Infinite product measures are built from finite-dimensional cylinder probabilities. A cylinder set constrains finitely many coordinates:
$$\{\omega : \omega_{i_1} \in A_1, \dots, \omega_{i_k} \in A_k\}.$$
Its probability under product measure is
$$\prod_{j=1}^{k} \mu_{i_j}(A_j).$$
The infinite product σ-algebra is generated by these finite-coordinate events. It contains many asymptotic events obtained by countable operations, such as convergence of averages and infinitely many successes. Infinite product construction is the backbone of iid sequences and independent stochastic inputs.
13.8 Random sequences
A random sequence is a measurable map
$$X : \Omega \to S^{\mathbb{N}},$$
or equivalently a sequence of coordinate random variables $X_n : \Omega \to S$. Its law is a probability measure on sequence space. If the coordinates are independent with common law $\mu$, the law is $\mu^{\otimes \mathbb{N}}$.
Sequence-space thinking is cleaner than treating each coordinate separately. Tail events, empirical measures, stopping times, and path properties are events on $S^{\mathbb{N}}$. Limit theorems then become statements about the measure of subsets of sequence space.
13.9 Coordinate random variables
On a product space $\Omega = \prod_i S_i$, the coordinate random variable is
$$X_i(\omega) = \omega_i.$$
The product σ-algebra is the smallest σ-algebra making all coordinate maps measurable. Finite-dimensional distributions are pushforwards of the product law under finite coordinate projections.
Coordinate variables are canonical, but not every sequence of random variables originally appears on a product space. The product representation is a model realization. What matters is whether the joint law of the given sequence matches the coordinate law on the canonical product carrier.
13.10 Kolmogorov consistency
A family of finite-dimensional distributions $\{\mu_I\}$ is consistent if marginalizing from a larger finite set $J$ to a smaller finite set $I \subseteq J$ gives $\mu_I$. Formally,
$$(\pi_{J \to I})_* \mu_J = \mu_I.$$
Consistency is necessary because any genuine process must have compatible finite projections.
Kolmogorov’s extension theorem says that, under suitable state-space hypotheses, consistency is sufficient to produce a probability measure on the infinite product. The theorem turns local finite-dimensional specifications into a global stochastic process carrier.
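The consistency condition can be checked mechanically for a concrete family. The sketch below uses a hypothetical two-state Markov chain (the initial law `init` and transition table `trans` are made up for illustration) and verifies one special case of the projection identity: dropping the last coordinate of the law of $(X_0, \dots, X_{k-1})$ returns the law of $(X_0, \dots, X_{k-2})$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-state chain: initial law and transition probabilities.
init = {0: Fraction(1, 3), 1: Fraction(2, 3)}
trans = {
    0: {0: Fraction(1, 2), 1: Fraction(1, 2)},
    1: {0: Fraction(1, 4), 1: Fraction(3, 4)},
}

def fdd(k):
    """Law of (X_0, ..., X_{k-1}) as a dict mapping paths to probabilities (k >= 1)."""
    law = {}
    for path in product((0, 1), repeat=k):
        p = init[path[0]]
        for s, t in zip(path, path[1:]):
            p *= trans[s][t]
        law[path] = p
    return law

def drop_last(law):
    """Pushforward under the projection that forgets the last coordinate."""
    out = {}
    for path, p in law.items():
        out[path[:-1]] = out.get(path[:-1], Fraction(0)) + p
    return out

consistent = all(drop_last(fdd(k)) == fdd(k - 1) for k in range(2, 7))
```

This only exercises projections that forget the final time; a full consistency check would range over all pairs $I \subseteq J$ of finite index sets.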
13.11 Product-space pathologies
Product spaces can behave badly when measurability, completion, or topology is mishandled. The product of completed σ-algebras need not equal the completion of the product σ-algebra. Sections of measurable sets may be measurable under good hypotheses, but projections of measurable sets may require analytic-set machinery.
Another pathology is assuming pointwise path regularity from finite-dimensional distributions. A process law on $\mathbb{R}^{[0,\infty)}$ may exist without having continuous or càdlàg paths. Regularity requires separate certificates such as continuity criteria, separability, or modification theorems. Product construction gives existence of coordinates; it does not automatically give nice sample paths.
Chapter 14 — Conditional Probability and Conditioning
14.1 Conditioning on positive-probability events
If $B \in \mathcal{F}$ with $P(B) > 0$, define
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
This creates a new probability law on $\Omega$, concentrated on $B$, or equivalently a normalized restriction of $P$ to $B$.
The condition P(B)>0 is essential. If P(B)=0, the ratio is undefined. Many paradoxes in continuous probability arise from informal conditioning on null events. Conditioning on X=x for a continuous variable requires regular conditional distributions or limiting procedures, not the elementary ratio.
14.2 Conditioning on partitions
If $\{B_i\}$ is a countable partition of $\Omega$ with $P(B_i) > 0$, conditioning on the partition means replacing a random quantity by its average on the cell containing the outcome. For an event $A$,
$$P(A \mid B_i) = \frac{P(A \cap B_i)}{P(B_i)}.$$
The law of total probability is
$$P(A) = \sum_i P(A \mid B_i) P(B_i).$$
For an integrable $X$, the conditional expectation given the partition is
$$E[X \mid \sigma(\{B_i\})] = \sum_i \frac{E[X \mathbf{1}_{B_i}]}{P(B_i)} \mathbf{1}_{B_i}.$$
Thus conditioning on a finite or countable partition is averaging over information cells.
14.3 Conditioning on σ-algebras
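Conditioning on a σ-algebra $\mathcal{G} \subseteq \mathcal{F}$ means conditioning on available information. The conditional expectation $E[X \mid \mathcal{G}]$ is $\mathcal{G}$-measurable and satisfies
$$\int_G E[X \mid \mathcal{G}]\, dP = \int_G X\, dP$$
for all $G \in \mathcal{G}$.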
The σ-algebra formulation subsumes conditioning on events, partitions, random variables, and filtrations. Conditioning on a random variable Y means conditioning on σ(Y). The result is a σ(Y)-measurable object, often representable as a function g(Y) under standard conditions.
14.4 Conditional expectation
For $X \in L^1$, conditional expectation is the unique almost sure equivalence class $Z = E[X \mid \mathcal{G}]$ such that $Z$ is $\mathcal{G}$-measurable and
$$E[Z \mathbf{1}_G] = E[X \mathbf{1}_G]$$
for every $G \in \mathcal{G}$. It preserves all integrals over events visible to $\mathcal{G}$.
Key properties include linearity, positivity, monotonicity, Jensen’s inequality in conditional form, and the tower property:
$$E[E[X \mid \mathcal{G}] \mid \mathcal{H}] = E[X \mid \mathcal{H}] \quad \text{if } \mathcal{H} \subseteq \mathcal{G}.$$
Conditional expectation is the projection of $X$ onto the information carrier $\mathcal{G}$.
14.5 Conditional distributions
A conditional distribution of $X$ given $Y$ is a family of probability measures
$$K(y, A) = P(X \in A \mid Y = y)$$
such that $y \mapsto K(y, A)$ is measurable and
$$P(X \in A, Y \in B) = \int_B K(y, A)\, \mu_Y(dy).$$
This is a Markov kernel from the state space of $Y$ to the state space of $X$.
Conditional distributions refine conditional expectation. If $g(X)$ is integrable,
$$E[g(X) \mid Y] = \int g(x)\, K(Y, dx).$$
Existence of regular conditional distributions requires suitable measurable-space hypotheses, usually standard Borel spaces.
14.6 Regular conditional probability
A regular conditional probability is a conditional distribution that behaves as an actual probability measure in the conditioned variable and as a measurable function in the conditioning value. For each y, K(y,⋅) is a probability measure; for each event A, K(⋅,A) is measurable.
Regular conditional probabilities justify expressions such as P(X∈A∣Y=y), even when P(Y=y)=0. But they are versions, defined only up to μY-null sets in y. Treating a chosen version as canonical at every point can produce errors, especially on null conditioning values.
14.7 Disintegration
Disintegration decomposes a joint measure into conditional measures over a marginal:
$$\gamma(dx, dy) = K(y, dx)\, \mu_Y(dy).$$
It says a joint law can be represented by first drawing $Y \sim \mu_Y$, then drawing $X$ according to $K(Y, \cdot)$, under appropriate space regularity.
This is the measure-theoretic form of conditional modeling. It generalizes density factorization
$$f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y) f_Y(y).$$
The density formula is only a special representation; disintegration is the invariant carrier.
14.8 Bayes theorem in measure form
Bayes’ theorem becomes a change of disintegration. Suppose a prior $\pi(d\theta)$ and likelihood kernel $L(\theta, dx)$ define a joint law
$$\gamma(d\theta, dx) = L(\theta, dx)\, \pi(d\theta).$$
The posterior is a conditional distribution of $\theta$ given $x$:
$$\pi(d\theta \mid x) = \frac{L(\theta, dx)\, \pi(d\theta)}{\int L(\theta', dx)\, \pi(d\theta')}$$
in density notation.
The invariant statement is not the formula with densities; it is the existence and identification of the reverse conditional kernel. Bayes’ theorem is therefore disintegration reversal: it transports from generative factorization to inferential factorization.
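In the discrete case the disintegration reversal is a finite computation. The sketch below assumes a hypothetical three-point prior on a Bernoulli parameter and a single binary observation; the names `prior`, `likelihood`, `joint`, `marginal`, and `posterior` are introduced here for illustration. It builds the joint law from the generative factorization and then recovers it from the inferential one.

```python
from fractions import Fraction

# Hypothetical prior: theta is 1/5, 1/2, or 4/5 with equal weight.
thetas = [Fraction(1, 5), Fraction(1, 2), Fraction(4, 5)]
prior = {t: Fraction(1, 3) for t in thetas}

def likelihood(theta, x):
    """Bernoulli likelihood kernel L(theta, {x}) for x in {0, 1}."""
    return theta if x == 1 else 1 - theta

# Generative factorization: joint(theta, x) = L(theta, x) * prior(theta).
joint = {(t, x): likelihood(t, x) * prior[t] for t in thetas for x in (0, 1)}

# Marginal law of the observation and the reverse (posterior) kernel.
marginal = {x: sum(joint[(t, x)] for t in thetas) for x in (0, 1)}
posterior = {
    x: {t: joint[(t, x)] / marginal[x] for t in thetas} for x in (0, 1)
}

# Inferential factorization: joint(theta, x) = posterior(theta | x) * marginal(x).
reversed_joint = {
    (t, x): posterior[x][t] * marginal[x] for t in thetas for x in (0, 1)
}
```

Both factorizations describe the same joint law; only the direction of the kernel changes. In this example the posterior weight of $\theta = 4/5$ after observing $x = 1$ is $8/15$.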
14.9 Versions of conditional expectation
Conditional expectation is unique only almost surely. If $Z$ and $Z'$ both satisfy the defining property, then
$$P(Z = Z') = 1.$$
They may differ on a null set. Thus $E[X \mid \mathcal{G}]$ is an equivalence class unless a version is selected.
Version issues become serious when evaluating conditional expectations at specific points, especially points of probability zero. A statement true almost surely in the conditioning variable may fail at an exceptional value. Any pointwise use of conditional objects requires a version certificate or regularity theorem.
14.10 Conditioning on null events
The elementary formula P(A∣B)=P(A∩B)/P(B) fails when P(B)=0. Continuous conditioning, such as P(X∈A∣Y=y), requires regular conditional probability, limiting conditioning, or geometric disintegration.
Different limiting procedures can give different answers when conditioning on null events. This is the source of Borel-type paradoxes. The conditioning event must be replaced by a specified σ-algebra, kernel, limiting scheme, or coordinate-invariant disintegration. Null conditioning is not illegal, but it is not handled by the finite event-ratio formula.
14.11 Conditional independence
Random variables $X$ and $Y$ are conditionally independent given $\mathcal{G}$ if for bounded measurable $f, g$,
$$E[f(X) g(Y) \mid \mathcal{G}] = E[f(X) \mid \mathcal{G}]\, E[g(Y) \mid \mathcal{G}].$$
This is the conditional factorization of joint information.
Conditional independence is the language of graphical models, Markov properties, hidden-variable models, and filtering. It is stronger and more precise than saying dependence is “explained by” $\mathcal{G}$. Once $\mathcal{G}$ is known, the remaining randomness in $X$ and $Y$ factors.
14.12 Filtrations and information
A filtration is an increasing family of σ-algebras:
$$\mathcal{F}_s \subseteq \mathcal{F}_t, \qquad s \le t.$$
It represents information accumulated over time. A process $X_t$ is adapted if $X_t$ is $\mathcal{F}_t$-measurable for every $t$.
Filtrations are the carrier for martingales, stopping times, stochastic integration, and dynamic conditioning. The phrase “known at time $t$” means measurable with respect to $\mathcal{F}_t$. Without filtration, temporal probability statements are under-specified.
14.13 Updating as transport between σ-algebras
Updating is movement from a coarser information σ-algebra to a finer one. If $\mathcal{G} \subseteq \mathcal{H}$, then $E[X \mid \mathcal{G}]$ is the best estimate with less information, while $E[X \mid \mathcal{H}]$ is the updated estimate with more information. The tower property guarantees coherence:
$$E[E[X \mid \mathcal{H}] \mid \mathcal{G}] = E[X \mid \mathcal{G}].$$
This is the formal version of rational updating. New information refines the event grammar. Probabilities change not because the underlying law is incoherent, but because the conditioning σ-algebra has changed. The update is a projection onto a different information carrier.
Chapter 15 — Probability Convergence Grammar
15.1 Almost sure convergence
A sequence $X_n$ converges almost surely to $X$ if
$$P(\{\omega : X_n(\omega) \to X(\omega)\}) = 1.$$
This is pointwise convergence outside a null set. It requires all $X_n$ and $X$ to live on a common probability space.
Almost sure convergence is strong because it controls entire sample paths eventually for almost every outcome. It is the natural mode for strong laws, martingale convergence, and pathwise stochastic analysis. But it is still not pointwise convergence everywhere; null exceptional sets remain.
15.2 Convergence in probability
$X_n \to X$ in probability if for every $\varepsilon > 0$,
$$P(|X_n - X| > \varepsilon) \to 0.$$
This says large deviations from $X$ become unlikely. It also requires a common probability space because the expression $X_n - X$ must be meaningful.
Almost sure convergence implies convergence in probability, but not conversely. Convergence in probability is often the correct mode for weak laws of large numbers, estimator consistency, and randomized approximation. It controls typical behavior at each large n, but not necessarily eventual behavior along almost every sample path.
15.3 Lᵖ convergence
$X_n \to X$ in $L^p$ if
$$E|X_n - X|^p \to 0.$$
For $p > 0$, this implies convergence in probability. For $p \ge 1$, it is norm convergence in the Banach space $L^p$.
$L^p$ convergence controls both probability of deviation and magnitude of deviation. $L^1$ convergence controls expected absolute error; $L^2$ convergence controls mean-square error; higher $p$ controls stronger tail behavior. It is more quantitative than convergence in probability but less pathwise than almost sure convergence.
15.4 Convergence in distribution
$X_n$ converges in distribution to $X$, written
$$X_n \Rightarrow X,$$
if the laws $\mu_{X_n}$ converge weakly to $\mu_X$. For real variables, this is equivalent to
$$F_{X_n}(t) \to F_X(t)$$
at every continuity point $t$ of $F_X$.
Convergence in distribution is law-level. The variables do not need to live on a common probability space. It is the natural mode for central limit theorems and many asymptotic approximations. But it does not by itself imply convergence in probability or almost surely. The arrow is weaker because it forgets coupling.
15.5 Total variation convergence
Probability measures $\mu_n$ converge to $\mu$ in total variation if
$$\|\mu_n - \mu\|_{TV} = \sup_A |\mu_n(A) - \mu(A)| \to 0.$$
Equivalently, when densities $f_n, f$ exist with respect to a common dominating measure $\lambda$,
$$\|\mu_n - \mu\|_{TV} = \frac{1}{2} \int |f_n - f|\, d\lambda.$$
Total variation is stronger than weak convergence. It controls probabilities of all measurable events uniformly. It is central in Markov chain mixing, coupling, statistical distance, and approximation theory. A coupling characterization says total variation is the minimal mismatch probability over all couplings:
$$\|\mu - \nu\|_{TV} = \inf_{(X, Y)} P(X \ne Y),$$
where the infimum runs over couplings $(X, Y)$ with $X \sim \mu$ and $Y \sim \nu$.
15.6 Weak convergence
Weak convergence of probability measures on a metric space means
$$\int f\, d\mu_n \to \int f\, d\mu$$
for every bounded continuous $f$. This is convergence tested by continuous bounded probes, not by all measurable events.
Weak convergence is topology-sensitive. It sees the geometry of the state space. It is weaker than total variation and does not generally imply convergence of expectations for unbounded functions. To pass expectations of unbounded functions, one needs uniform integrability, moment bounds, or stronger Wasserstein-type convergence.
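The gap between weak convergence and total variation already shows up for point masses. In the sketch below, $\mu_n = \delta_{1/n}$ and $\mu = \delta_0$ (an example chosen here for illustration). The integrals of a few bounded continuous test functions converge, while the total variation distance, computed as half the sum of absolute mass differences, stays at 1. The three test functions are a hypothetical sample, so the code illustrates the definition rather than proving weak convergence.

```python
import math

def tv_discrete(mu, nu):
    """Total variation distance between finitely supported laws,
    given as dicts point -> mass: half the sum of |mu(x) - nu(x)|."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)

def integrate(f, mu):
    """Integral of f against a finitely supported law."""
    return sum(mass * f(x) for x, mass in mu.items())

# A few bounded continuous probes (illustrative, not a determining class).
test_functions = [math.cos, math.atan, lambda x: 1.0 / (1.0 + x * x)]
limit_law = {0.0: 1.0}

gaps, tv_values = [], []
for n in (1, 10, 100, 1000):
    mu_n = {1.0 / n: 1.0}
    gaps.append(
        max(abs(integrate(f, mu_n) - integrate(f, limit_law)) for f in test_functions)
    )
    tv_values.append(tv_discrete(mu_n, limit_law))

if __name__ == "__main__":
    print("largest integral gap:", gaps)
    print("total variation:     ", tv_values)
```

The event $A = \{0\}$ separates $\mu_n$ from $\mu$ for every $n$, which is why the total variation distance never decreases even though the mass moves arbitrarily close to 0.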
15.7 Vague convergence
Vague convergence is used mainly for locally compact spaces and measures that may not be probability measures. Measures $\mu_n$ converge vaguely to $\mu$ if
$$\int f\, d\mu_n \to \int f\, d\mu$$
for every continuous compactly supported $f$.
Vague convergence is weaker than weak convergence because compactly supported tests may not see mass escaping to infinity. It is appropriate for point processes, extreme-value theory, and infinite measures. For probability measures, vague convergence plus tightness can often recover weak convergence.
15.8 Relations between convergence modes
The basic implication chain is:
$$L^p \text{ convergence} \;\Rightarrow\; \text{convergence in probability} \;\Rightarrow\; \text{convergence in distribution}.$$
Almost sure convergence also implies convergence in probability. If $X_n \Rightarrow c$ where $c$ is constant, then $X_n \to c$ in probability.
No reverse implication holds generally without extra hypotheses. Distributional convergence is law-level and can occur without a shared sample space. Probability convergence requires coupling. Almost sure convergence requires pathwise eventual control. Lp convergence requires moment control. Each arrow has a different carrier.
15.9 Counterexamples separating convergence modes
Let $X_n$ be independent Bernoulli variables with $P(X_n = 1) = 1/n$. Then $X_n \to 0$ in probability, since $P(|X_n| > \varepsilon) = 1/n \to 0$ for $0 < \varepsilon < 1$. But $X_n$ does not converge almost surely to zero: the events $\{X_n = 1\}$ are independent and $\sum_n 1/n = \infty$, so by the second Borel–Cantelli lemma $X_n = 1$ infinitely often with probability one.
Let $X_n = n$ with probability $1/n$, else $0$. Then $X_n \to 0$ in probability, but $E[X_n] = 1$ for every $n$, so $X_n \not\to 0$ in $L^1$. Let $X_n \sim N(0, 1)$ independently of $X \sim N(0, 1)$; then $X_n \Rightarrow X$, but $X_n - X \sim N(0, 2)$ for every $n$, so $X_n$ does not converge to $X$ in probability. These examples prove the convergence modes are not interchangeable.
15.10 Skorokhod representation
The Skorokhod representation theorem states that, under suitable conditions such as Polish state spaces, if $X_n \Rightarrow X$, then one can construct random variables $Y_n, Y$ on a new probability space such that
$$Y_n \overset{d}{=} X_n, \qquad Y \overset{d}{=} X, \qquad Y_n \to Y \text{ almost surely}.$$
This converts weak convergence into almost sure convergence after changing the carrier.
The theorem is powerful but dangerous if misread. It does not say the original Xn converge almost surely. It says there exists a coupling with almost sure convergence. Therefore it is a law-level-to-coupling liftback theorem, not a statement about the original sample space.
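For real-valued variables, one way to build such a coupling is the quantile transform: put every $Y_n$ on $([0, 1), \text{Lebesgue})$ by setting $Y_n(u) = F_n^{-1}(u)$. The sketch below covers only this real-line case, with a hypothetical sequence $X_n \sim \text{Exponential}(1 + 1/n)$ converging in distribution to $\text{Exponential}(1)$; the general Polish-space statement needs a different argument.

```python
import math

def quantile_exp(u, rate):
    """Quantile function of the Exponential(rate) law at u in [0, 1)."""
    return -math.log1p(-u) / rate

def coupled(u, n):
    """Y_n(u): a representative of X_n ~ Exponential(1 + 1/n) driven by u."""
    return quantile_exp(u, 1.0 + 1.0 / n)

def limit(u):
    """Y(u): a representative of X ~ Exponential(1) driven by the same u."""
    return quantile_exp(u, 1.0)

# Pointwise error on a small grid of sample points u (illustrative choice).
grid = [0.1, 0.5, 0.9, 0.999]
errors = {
    n: max(abs(coupled(u, n) - limit(u)) for u in grid)
    for n in (1, 10, 1000, 10**6)
}

if __name__ == "__main__":
    for n, err in errors.items():
        print(n, err)
```

Each $Y_n$ has the law of $X_n$ because the quantile function applied to a uniform variable reproduces the target law, and here $Y_n(u) \to Y(u)$ for every $u$. Nothing is claimed about the original $X_n$, which may live on unrelated spaces.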
15.11 Borel–Cantelli lemmas
For events $A_n$, define
$$\{A_n \text{ i.o.}\} = \limsup_n A_n = \bigcap_{m=1}^{\infty} \bigcup_{n \ge m} A_n.$$
The first Borel–Cantelli lemma states that if
$$\sum_n P(A_n) < \infty,$$
then
$$P(A_n \text{ i.o.}) = 0.$$
No independence is required.
The second Borel–Cantelli lemma states that if the $A_n$ are independent and
$$\sum_n P(A_n) = \infty,$$
then
$$P(A_n \text{ i.o.}) = 1.$$
Thus summability controls eventual occurrence. The second direction requires independence or sufficient dependence control.
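A finite simulation cannot verify an "infinitely often" statement, but it can illustrate the dichotomy. The sketch below draws independent events with $P(A_n) = 1/n^2$ (summable) and $P(A_n) = 1/n$ (divergent) up to a hypothetical horizon `N`, and also computes the expected number of occurrences, which is the partial sum $\sum_{n \le N} P(A_n)$. The seed and horizon are arbitrary choices.

```python
import random

def occurrences(prob_of, n_max, rng):
    """Indices n <= n_max at which the independent event A_n occurs,
    where P(A_n) = prob_of(n)."""
    return [n for n in range(1, n_max + 1) if rng.random() < prob_of(n)]

def expected_count(prob_of, n_max):
    """Expected number of occurrences up to n_max: the partial sum of P(A_n)."""
    return sum(prob_of(n) for n in range(1, n_max + 1))

def summable(n):
    return 1.0 / (n * n)

def divergent(n):
    return 1.0 / n

N = 100_000
rng = random.Random(0)  # fixed seed so the run is reproducible
hits_summable = occurrences(summable, N, rng)
hits_divergent = occurrences(divergent, N, rng)

if __name__ == "__main__":
    print("summable:  count", len(hits_summable), "last", hits_summable[-1])
    print("divergent: count", len(hits_divergent), "last", hits_divergent[-1])
    print("expected counts:", expected_count(summable, N), expected_count(divergent, N))
```

In the summable case the expected count stays below $\pi^2/6$ no matter how large `N` is, so occurrences die out. In the divergent case it grows like $\log N$, consistent with occurrences continuing forever under independence.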
15.12 Cauchy criteria in probability
A sequence $X_n$ is Cauchy in probability if for every $\varepsilon > 0$,
$$P(|X_n - X_m| > \varepsilon) \to 0$$
as $m, n \to \infty$. For real-valued random variables, and more generally for values in a complete separable metric space, Cauchy in probability implies convergence in probability to some random variable.
For $L^p$, the Cauchy criterion is norm-based:
$$\|X_n - X_m\|_p \to 0.$$
Completeness of $L^p$ then supplies an $L^p$ limit. Cauchy criteria are useful when the limit is not explicitly known, as in stochastic integration, martingale convergence, and construction of processes.
15.13 Convergence of expectations
Convergence of random variables does not automatically imply convergence of expectations. Sufficient conditions include dominated convergence, bounded convergence, monotone convergence, L1 convergence, or convergence in probability plus uniform integrability.
For weak convergence, bounded continuous test functions are safe:
$$X_n\Rightarrow X \implies E[f(X_n)]\to E[f(X)]$$
for bounded continuous $f$. For unbounded $f$, additional conditions are required. The missing bridge is usually uniform integrability of $f(X_n)$.
15.14 Uniform integrability as missing bridge
Uniform integrability converts weak or probability convergence into expectation convergence. If $X_n\to X$ in probability and $\{X_n\}$ is uniformly integrable, then
$$E[X_n]\to E[X].$$
If $X_n\Rightarrow X$ and $\{X_n\}$ is uniformly integrable, for instance because $\sup_n E|X_n|^{1+\delta}<\infty$ for some $\delta>0$, then $X$ is integrable and the same conclusion holds.
This is the recurring audit rule: convergence controls where most mass goes; uniform integrability controls what rare large mass can do. Without it, expectations can remain fixed, diverge, or oscillate despite convergence in probability or distribution.
Chapter 16 — Laws and Weak Convergence
16.1 Probability measures on metric spaces
A probability law on a metric space S is a probability measure on its Borel σ-algebra B(S). Metric structure allows one to define weak convergence, tightness, continuity sets, compactness, and convergence-determining classes.
The move from real-valued variables to metric-space-valued random elements is essential for stochastic processes, empirical measures, random graphs, and random functions. A law on C[0,1], D[0,1], or the space of probability measures is still a probability measure; only the state carrier changes.
16.2 Bounded continuous test functions
Weak convergence $\mu_n\Rightarrow\mu$ is defined by
$$\int f\,d\mu_n\to\int f\,d\mu$$
for every bounded continuous $f:S\to\mathbb R$. These functions probe the law without being sensitive to null boundary artifacts.
Boundedness prevents tail mass from distorting the test integral; continuity prevents tests from seeing abrupt boundary behavior not controlled by weak convergence. Indicator functions are generally not continuous, so event probabilities require continuity-set conditions.
16.3 Portmanteau theorem
The Portmanteau theorem gives equivalent formulations of weak convergence. On metric spaces, $\mu_n\Rightarrow\mu$ iff
$$\limsup_n \mu_n(F)\le\mu(F)$$
for every closed set $F$, equivalently
$$\liminf_n \mu_n(G)\ge\mu(G)$$
for every open set $G$, and equivalently
$$\mu_n(A)\to\mu(A)$$
for every $\mu$-continuity set $A$, meaning $\mu(\partial A)=0$.
The theorem explains why convergence of CDFs is required only at continuity points. Discontinuities are atoms or boundary masses where indicator tests are not continuous probes. Weak convergence controls events whose boundaries carry no limiting mass.
16.4 Tightness
A family of probability measures $\{\mu_i\}$ on a metric space is tight if for every $\varepsilon>0$, there exists a compact $K$ such that
$$\sup_i \mu_i(K^c)<\varepsilon.$$
Tightness says mass does not escape to infinity or into noncompact regions.
On $\mathbb R^d$, tightness often follows from moment bounds: for some $p>0$,
$$\sup_i\int |x|^p\,\mu_i(dx)<\infty \implies \{\mu_i\}\ \text{tight},$$
since Markov's inequality gives $\mu_i(|x|>R)\le R^{-p}\sup_j\int|x|^p\,d\mu_j$ uniformly in $i$, and closed balls in $\mathbb R^d$ are compact. Tightness is the compactness gate for weak convergence. It gives subsequential limits, but it does not identify the limit; identification requires convergence of test functions, characteristic functions, finite-dimensional distributions, or other determining data.
16.5 Prokhorov theorem
Prokhorov’s theorem states that, on Polish spaces, a family of probability measures is relatively compact in the weak topology iff it is tight. Thus tightness is not merely sufficient but exactly the compactness criterion in good spaces.
This theorem is central in process convergence. To prove Xn⇒X in a function space, one typically proves tightness of the laws and then identifies all subsequential limits by finite-dimensional distributions or martingale problems. Tightness gives existence of limit candidates; identification eliminates ambiguity.
16.6 Weak convergence on ℝᵈ
For laws on $\mathbb R^d$, weak convergence can be tested by bounded continuous functions, by convergence of distribution functions at continuity points, or by characteristic functions:
$$\phi_{\mu_n}(t)=\int e^{i\langle t,x\rangle}\,\mu_n(dx).$$
Lévy’s continuity theorem states that pointwise convergence of characteristic functions to a function continuous at zero gives weak convergence to the corresponding law.
In $\mathbb R^d$, Cramér–Wold also applies: $X_n\Rightarrow X$ iff
$$\langle u,X_n\rangle\Rightarrow\langle u,X\rangle$$
for every $u\in\mathbb R^d$. This reduces multivariate convergence to one-dimensional projections.
16.7 Weak convergence on function spaces
Weak convergence on spaces such as C[0,1] or D[0,1] requires more than convergence of finite-dimensional distributions. One must also prove tightness in the function-space topology. For C[0,1], tightness is often certified by modulus-of-continuity estimates. For D[0,1], Skorokhod topologies handle jumps.
The process-level law contains path regularity. Finite-dimensional distributions only describe coordinates at finitely many times; they do not control oscillation between times. Thus process convergence requires both finite-dimensional convergence and tightness. This is the standard two-gate structure.
16.8 Empirical measures
Given samples $X_1,\dots,X_n$, the empirical measure is
$$\mu_n=\frac1n\sum_{i=1}^n\delta_{X_i}.$$
It is a random probability measure. For iid samples with law $\mu$, the empirical measure converges weakly to $\mu$ almost surely under broad conditions:
$$\mu_n\Rightarrow\mu.$$
Empirical measures convert samples into law-level random objects. The Glivenko–Cantelli theorem strengthens this on $\mathbb R$ to uniform convergence of CDFs:
$$\sup_x|F_n(x)-F(x)|\to 0$$
almost surely. Empirical process theory studies fluctuations around this convergence.
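The uniform distance in the Glivenko–Cantelli theorem is easy to compute for a known continuous $F$. The sketch below assumes Uniform(0,1) samples, so $F(x)=x$ on $[0,1]$; the function name and the seed are illustrative.

```python
import random


def sup_cdf_distance_uniform(samples):
    """sup_x |F_n(x) - x| for the empirical CDF F_n against Uniform(0,1)."""
    xs = sorted(samples)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        # F_n jumps from i/n to (i+1)/n at x; the supremum is attained
        # on one side of some jump, so both sides are checked.
        worst = max(worst, abs(i / n - x), abs((i + 1) / n - x))
    return worst


rng = random.Random(0)
for n in (100, 1_000, 10_000):
    d = sup_cdf_distance_uniform([rng.random() for _ in range(n)])
    print(n, round(d, 4))
```

The printed distances should shrink as $n$ grows, roughly at the $n^{-1/2}$ scale that empirical process theory describes.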
16.9 Wasserstein convergence
The $p$-Wasserstein distance between probability measures on a metric space is
$$W_p(\mu,\nu)=\Bigl(\inf_{\gamma\in\Pi(\mu,\nu)}\int d(x,y)^p\,\gamma(dx,dy)\Bigr)^{1/p},$$
where $\Pi(\mu,\nu)$ is the set of couplings. Wasserstein convergence combines weak convergence with moment control.
On $\mathbb R^d$, for laws with finite $p$-th moments, $W_p(\mu_n,\mu)\to 0$ iff $\mu_n\Rightarrow\mu$ and $\int|x|^p\,d\mu_n\to\int|x|^p\,d\mu$. This makes Wasserstein distance a liftback metric: it remembers geometry and tail magnitude, not just weak law convergence.
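On the real line the infimum over couplings has a closed form for $p\ge 1$: the monotone coupling is optimal, so for two empirical measures with the same number of atoms it suffices to match sorted samples. The sketch below covers only that special case (equal sample sizes, real-valued data, $p\ge 1$); the function name is hypothetical.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """W_p between the empirical measures of xs and ys (equal length, real-valued).

    Uses the sorted (monotone) coupling, which is optimal on the line for p >= 1.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (cost / len(xs)) ** (1.0 / p)


# Shifting every atom by 1 costs exactly 1 in W_1.
print(wasserstein_1d([0, 1, 2], [3, 2, 1], p=1))
```

For general metric spaces or unequal sample sizes the infimum is a genuine optimal-transport problem and this shortcut does not apply.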
16.10 Lévy metric
The Lévy metric metrizes weak convergence of probability measures on $\mathbb R$. For distribution functions $F,G$, it is the infimum of the $\varepsilon>0$ such that
$$F(x-\varepsilon)-\varepsilon\le G(x)\le F(x+\varepsilon)+\varepsilon$$
for all $x$. It permits small horizontal and vertical errors.
The metric is useful because weak convergence of real laws is exactly convergence in this metric. It is less commonly used in computations than characteristic functions or bounded-Lipschitz metrics, but it formalizes the CDF geometry of weak convergence.
16.11 Convergence-determining classes
A class C of test functions or sets is convergence-determining if convergence on C implies weak convergence. On R, intervals (−∞,t] at continuity points determine convergence. On Rd, characteristic functions or bounded Lipschitz functions determine convergence.
Convergence-determining classes reduce verification. Instead of testing every bounded continuous function, one tests a smaller structured family. The class must be rich enough to identify the law. Insufficient tests can miss mass or dependence.
16.12 Mapping theorem
If $X_n\Rightarrow X$ and $g:S\to T$ is continuous, then
$$g(X_n)\Rightarrow g(X).$$
More generally, it is enough that the discontinuity set $D_g$ of $g$ satisfies $P(X\in D_g)=0$. This is the mapping theorem.
The theorem transports weak convergence through functions. It is widely used for statistics: once an estimator converges in distribution, continuous transformations of it converge by mapping. Discontinuous transformations require boundary audits; atoms at discontinuities can break the conclusion.
16.13 Continuous mapping theorem
The continuous mapping theorem is the random-variable version of the mapping theorem. If
$$X_n\Rightarrow X$$
and $g$ is continuous at $X$ almost surely, then
$$g(X_n)\Rightarrow g(X).$$
For vector-valued variables, this handles sums, products, norms, maxima, and smooth transformations when continuous.
The theorem is law-level. It does not claim pointwise convergence of g(Xn). It says the laws transport through continuous maps. For discontinuous g, one must verify that the limit avoids discontinuity points almost surely.
16.14 Slutsky’s theorem
Slutsky’s theorem states that if
$$X_n\Rightarrow X,\qquad Y_n\to c$$
in probability, where $c$ is constant, then
$$X_n+Y_n\Rightarrow X+c,\qquad X_nY_n\Rightarrow cX,$$
and if $c\neq 0$,
$$X_n/Y_n\Rightarrow X/c.$$
Slutsky’s theorem is the standard tool for replacing unknown normalizing constants or nuisance estimators by consistent estimates. It combines weak convergence with probability convergence. The constant limit is important; if $Y_n\Rightarrow Y$ with $Y$ nonconstant, joint convergence is required to conclude anything about $X_n+Y_n$.
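A worked instance of the replacement step is the studentized mean. The setup is an assumption for illustration, not part of the text above: iid $X_i$ with mean $\mu$ and variance $0<\sigma^2<\infty$, sample mean $\bar X_n$, and any estimator $\hat\sigma_n$ with $\hat\sigma_n\to\sigma$ in probability.

$$
\frac{\sqrt n\,(\bar X_n-\mu)}{\hat\sigma_n}
=\underbrace{\frac{\sqrt n\,(\bar X_n-\mu)}{\sigma}}_{\Rightarrow\, N(0,1)}
\cdot\underbrace{\frac{\sigma}{\hat\sigma_n}}_{\to\, 1\ \text{in probability}}
\;\Rightarrow\; N(0,1).
$$

The first factor converges by the classical CLT, the second by the continuous mapping theorem applied to $\hat\sigma_n$, and the product rule with constant limit $c=1$ gives the conclusion.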
Chapter 17 — Laws of Large Numbers
17.1 Weak law of large numbers
The weak law states that for iid integrable random variables $X_i$ with mean $\mu$,
$$\frac1n\sum_{i=1}^n X_i\to\mu$$
in probability. With finite variance $\sigma^2$, the proof is immediate from Chebyshev:
$$\operatorname{Var}\Bigl(\frac1n\sum_{i=1}^n X_i\Bigr)=\frac{\sigma^2}{n}.$$
Thus
$$P\Bigl(\Bigl|\frac1n\sum_{i=1}^n X_i-\mu\Bigr|>\varepsilon\Bigr)\le\frac{\sigma^2}{n\varepsilon^2}.$$
The weak law says empirical averages are close to expectation with high probability for large $n$. It is a typical-sample theorem, not a pathwise theorem. It does not say every infinite sequence has average $\mu$, nor that convergence occurs almost surely.
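The Chebyshev bound can be compared with simulated deviation frequencies. The sketch below assumes fair coin flips (mean $1/2$, variance $1/4$); the function names, trial counts, and seed are illustrative choices.

```python
import random


def chebyshev_bound(variance, n, eps):
    """Upper bound sigma^2 / (n eps^2) on P(|mean - mu| > eps)."""
    return variance / (n * eps**2)


def deviation_frequency(n, eps, trials, seed=0):
    """Fraction of simulated fair-coin sample means deviating from 1/2 by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            bad += 1
    return bad / trials


for n in (100, 400, 1600):
    print(n, deviation_frequency(n, 0.05, trials=200), chebyshev_bound(0.25, n, 0.05))
```

The simulated frequencies typically sit far below the bound. Chebyshev is enough to prove the weak law, but it is not a sharp estimate of the deviation probability.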
17.2 Strong law of large numbers
The strong law upgrades convergence to almost sure:
$$\frac1n\sum_{i=1}^n X_i\to\mu\quad\text{a.s.}$$
For iid $X_i$, the sharp classical integrability condition is $E|X_1|<\infty$. Under finite variance, easier proofs use maximal inequalities or subsequence arguments.
The strong law is pathwise. It says that with probability one, a realized infinite sample sequence has empirical average converging to the mean. This is the formal theorem behind long-run frequency stabilization. It still allows exceptional null sequences where convergence fails.
17.3 Kolmogorov inequality
For independent mean-zero random variables $X_i$ with finite variances, Kolmogorov’s maximal inequality states that for every $\lambda>0$,
$$P\Bigl(\max_{1\le k\le n}\Bigl|\sum_{i=1}^k X_i\Bigr|\ge\lambda\Bigr)\le\frac{\operatorname{Var}\bigl(\sum_{i=1}^n X_i\bigr)}{\lambda^2}.$$
By independence,
$$\operatorname{Var}\Bigl(\sum_{i=1}^n X_i\Bigr)=\sum_{i=1}^n\operatorname{Var}(X_i).$$
This inequality controls the maximum partial sum, not just the final sum. It is therefore suited to almost sure convergence proofs. Strong laws require control over all sufficiently large partial sums, and maximal inequalities provide that pathwise bridge.
17.4 Three-series theorem
Kolmogorov’s three-series theorem characterizes almost sure convergence of sums of independent random variables. For independent $X_n$, the series $\sum X_n$ converges almost surely iff, for some truncation level $c>0$, three series involving large jumps, truncated means, and truncated variances converge:
$$\sum_n P(|X_n|>c)<\infty,\qquad \sum_n E\bigl[X_n\mathbf 1\{|X_n|\le c\}\bigr]\ \text{converges},\qquad \sum_n\operatorname{Var}\bigl(X_n\mathbf 1\{|X_n|\le c\}\bigr)<\infty.$$
The theorem decomposes convergence into jump control, drift control, and fluctuation control. It is the precise independent-sum audit. Large rare terms, accumulated bias, and residual variance are the three possible obstructions.
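A short worked example shows the three checks in action. The setup is chosen for illustration: $X_n=\epsilon_n/n$ with iid signs $P(\epsilon_n=1)=P(\epsilon_n=-1)=1/2$, and truncation level $c=1$.

$$
\sum_n P(|X_n|>1)=0,\qquad
\sum_n E\bigl[X_n\mathbf 1\{|X_n|\le 1\}\bigr]=\sum_n 0=0,\qquad
\sum_n \operatorname{Var}\bigl(X_n\mathbf 1\{|X_n|\le 1\}\bigr)=\sum_n\frac{1}{n^2}=\frac{\pi^2}{6}<\infty.
$$

All three series converge, so $\sum_n \epsilon_n/n$ converges almost surely, even though the series of absolute values $\sum_n 1/n$ diverges. Random signs supply the cancellation that the deterministic harmonic series lacks.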
17.5 Etemadi’s proof
Etemadi gave a clean proof of the strong law for pairwise independent identically distributed integrable random variables. The proof avoids some heavier machinery and shows that full mutual independence is not always necessary for averaging.
The conceptual point is that strong laws require enough independence to control deviations of partial sums. Pairwise independence can suffice when combined with identical distribution and truncation. This is an independence-strength lesson: different limit theorems require different factorization payloads.
17.6 Truncation methods
Truncation replaces $X_i$ by bounded variables such as
$$X_i^{(n)}=X_i\mathbf 1\{|X_i|\le n\}.$$
The goal is to separate ordinary fluctuations from rare large jumps. For integrable $X$,
$$E\bigl[|X|\mathbf 1\{|X|>n\}\bigr]\to 0.$$
This makes tail contributions negligible after normalization.
Truncation is a core probability technique because many theorems are easy for bounded variables and hard for unbounded ones. The proof strategy is: prove the theorem for truncated variables, show the discarded tails do not matter, and then lift back to the original variables. The tail estimate is the decisive debt payment.
17.7 Identically distributed versus independent
Identically distributed means all variables have the same law. Independent means the joint law factors. Neither implies the other. A constant sequence Xn=X1 is identically distributed but maximally dependent. Independent variables may have different distributions.
The law of large numbers typically requires both a stable average law and weak enough dependence. Identical distribution supplies a common mean; independence supplies variance or fluctuation control. General versions replace identical distribution with average moment conditions and replace independence with mixing, martingale differences, or ergodicity.
17.8 Weak dependence versions
Weak dependence versions of LLN allow correlations but require them to decay or average out. If the $X_i$ are centered and
$$\operatorname{Var}\Bigl(\frac1n\sum_{i=1}^n X_i\Bigr)\to 0,$$
then the average converges to zero in $L^2$ and hence in probability. This criterion can hold under covariance summability or mixing conditions.
For stationary sequences, ergodic theorems replace independence. The average converges to a conditional expectation on the invariant σ-algebra. If the system is ergodic, that conditional expectation is constant. Thus independence is one route to averaging, but not the only route.
17.9 Ergodic theorem preview
Birkhoff’s ergodic theorem states that for a measure-preserving transformation $T$ and $f\in L^1$,
$$\frac1n\sum_{k=0}^{n-1}f(T^k\omega)\to E[f\mid\mathcal I](\omega)\quad\text{a.s.},$$
where $\mathcal I$ is the invariant σ-algebra. If the system is ergodic, $\mathcal I$ is trivial and the limit is $E f$.
This generalizes the strong law from independent samples to deterministic or dependent dynamical sampling. The limit is not automatically the ensemble mean; it is the invariant-information conditional mean. Ergodicity is the gate that collapses time average to ensemble average.
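The role of ergodicity can be seen with circle rotations $T(x)=x+\alpha \bmod 1$, which preserve Lebesgue measure on $[0,1)$. The sketch below is an illustration under chosen assumptions: $f(x)=x$ (space average $1/2$), an irrational angle $\alpha=\sqrt2-1$ (ergodic), and the rational angle $\alpha=1/2$ (not ergodic). Floating-point arithmetic only approximates the true orbit.

```python
import math


def time_average(f, x0, alpha, n):
    """Average of f along the orbit x0, T x0, ..., T^(n-1) x0 of T(x) = x + alpha mod 1."""
    total = 0.0
    x = x0
    for _ in range(n):
        total += f(x)
        x = (x + alpha) % 1.0
    return total / n


def identity(x):
    return x  # integral over [0, 1) is 1/2


# Irrational rotation: time average approaches the space average 1/2.
ergodic_avg = time_average(identity, 0.1, math.sqrt(2) - 1, 100_000)
# Rotation by 1/2: the orbit is {0.1, 0.6}, so the limit is 0.35 and depends on x0.
periodic_avg = time_average(identity, 0.1, 0.5, 100_000)
print(ergodic_avg, periodic_avg)
```

In the rational case the limit is the conditional expectation given the invariant σ-algebra, which here averages $f$ over the two-point orbit rather than over the whole circle.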
17.10 Empirical averages and model liftback
The LLN connects formal probability to empirical averaging:
$$\bar X_n=\frac1n\sum_{i=1}^n X_i\approx E X.$$
But the theorem lives inside a model. To export it to data, one must justify that observations are modeled as iid, weakly dependent, stationary ergodic, or otherwise governed by the theorem’s assumptions.
The liftback error is to treat LLN as saying “averages always stabilize.” They stabilize under specific carrier conditions. Heavy tails, nonstationarity, dependence, selection bias, and adversarial sampling can all break the empirical interpretation. LLN is a mathematical certificate after hypotheses are paid, not a universal law of data.
Chapter 18 — Central Limit Theory
18.1 Normal distribution
The normal distribution $N(\mu,\sigma^2)$ has density
$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Bigl(-\frac{(x-\mu)^2}{2\sigma^2}\Bigr).$$
The standard normal is $N(0,1)$. Its characteristic function is
$$\phi(t)=e^{-t^2/2}.$$
The normal law is stable under independent summation: sums of independent normal variables are normal. It also emerges as the universal finite-variance fluctuation limit for many independent sums. The CLT explains Gaussian appearance not as a primitive assumption but as a consequence of aggregation under appropriate normalization.
18.2 Characteristic functions
For a real random variable $X$,
$$\phi_X(t)=E[e^{itX}].$$
If $X,Y$ are independent,
$$\phi_{X+Y}(t)=\phi_X(t)\phi_Y(t).$$
This multiplicative property makes characteristic functions ideal for sums.
If $X$ has mean $0$ and variance $\sigma^2$, then near zero,
$$\phi_X(t)=1-\frac{\sigma^2t^2}{2}+o(t^2),$$
under finite second moment. In CLT proofs, this local expansion is evaluated at the scaled argument $t/\sqrt n$ and raised to the $n$-th power, producing the Gaussian exponential.
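The scaling step can be written out in one line. Assume iid $X_i$ with mean $0$ and variance $0<\sigma^2<\infty$, and set $Z_n=\frac{1}{\sigma\sqrt n}\sum_{i=1}^n X_i$ (notation introduced here for the sketch). For fixed $t$,

$$
\phi_{Z_n}(t)=\Bigl[\phi_X\Bigl(\frac{t}{\sigma\sqrt n}\Bigr)\Bigr]^n
=\Bigl[1-\frac{t^2}{2n}+o\Bigl(\frac1n\Bigr)\Bigr]^n
\;\longrightarrow\; e^{-t^2/2}.
$$

The first equality uses independence and identical distribution, the second uses the second-order expansion at the point $t/(\sigma\sqrt n)\to 0$, and the limit is the standard fact $(1+a_n/n)^n\to e^{a}$ when $a_n\to a$, which also holds for complex $a_n$.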
18.3 Lévy continuity theorem
Lévy’s continuity theorem states that if characteristic functions $\phi_n(t)$ of laws $\mu_n$ converge pointwise to a function $\phi(t)$ continuous at $0$, then $\phi$ is the characteristic function of some probability law $\mu$, and
$$\mu_n\Rightarrow\mu.$$
Conversely, weak convergence implies pointwise convergence of characteristic functions.
This theorem is the main transport gate from Fourier analysis to weak convergence. It allows one to prove distributional limits by proving analytic convergence of characteristic functions. Continuity at zero prevents mass loss.
18.4 Classical CLT
Let $X_1,X_2,\dots$ be iid with mean $\mu$ and variance $0<\sigma^2<\infty$. Then
$$\frac{\sum_{i=1}^n X_i-n\mu}{\sigma\sqrt n}\Rightarrow N(0,1).$$
This is the classical central limit theorem.
The normalization is essential. The sum has mean $n\mu$ and variance $n\sigma^2$, so subtracting $n\mu$ centers it and dividing by $\sigma\sqrt n$ gives unit variance. The theorem describes fluctuations around the law-of-large-numbers scale. It does not describe rare large deviations far into the tails.
18.5 Lindeberg–Feller CLT
For triangular arrays $X_{n,k}$, the Lindeberg–Feller theorem gives conditions under which normalized sums converge to normal. Let independent centered variables have variances $\sigma_{n,k}^2$ and total variance $s_n^2=\sum_k\sigma_{n,k}^2$. The Lindeberg condition is
$$\frac{1}{s_n^2}\sum_k E\bigl[X_{n,k}^2\mathbf 1\{|X_{n,k}|>\varepsilon s_n\}\bigr]\to 0$$
for every $\varepsilon>0$.
This condition says no single summand contributes a macroscopic part of the variance. It is a tail-negligibility certificate. The theorem generalizes iid CLT to nonidentically distributed arrays and identifies the obstruction: large individual jumps.
18.6 Lyapunov CLT
The Lyapunov condition is a stronger, easier-to-check sufficient condition. If independent centered variables have total variance $s_n^2$, and for some $\delta>0$,
$$\frac{1}{s_n^{2+\delta}}\sum_k E|X_{n,k}|^{2+\delta}\to 0,$$
then the normalized sum converges to $N(0,1)$.
Lyapunov implies Lindeberg by Markov-type estimates. It pays the large-jump debt using a higher moment. The cost is stronger hypotheses. In applications, Lyapunov is often simpler; Lindeberg is sharper.
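The Markov-type estimate behind this implication is short enough to write out. On the event $\{|X_{n,k}|>\varepsilon s_n\}$ one has $1\le(|X_{n,k}|/(\varepsilon s_n))^{\delta}$, hence

$$
\frac{1}{s_n^2}\sum_k E\bigl[X_{n,k}^2\mathbf 1\{|X_{n,k}|>\varepsilon s_n\}\bigr]
\le\frac{1}{s_n^2}\sum_k E\Bigl[\frac{|X_{n,k}|^{2+\delta}}{(\varepsilon s_n)^{\delta}}\Bigr]
=\frac{1}{\varepsilon^{\delta}}\cdot\frac{1}{s_n^{2+\delta}}\sum_k E|X_{n,k}|^{2+\delta}.
$$

For each fixed $\varepsilon>0$ the right-hand side tends to zero under the Lyapunov condition, which is exactly the Lindeberg condition.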
18.7 Triangular arrays
A triangular array is a collection Xn,k for 1≤k≤kn, where the n-th row is summed and normalized. Arrays model changing distributions, infinitesimal summands, and approximations where no fixed iid sequence exists.
Triangular arrays are the natural carrier for general CLT theory. They separate row-level normalization from variable-level assumptions. The main audit questions are: are variables independent within rows, are they centered, what is the row variance, and do large terms vanish relative to total scale?
18.8 Berry–Esseen theorem
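For iid variables with mean $0$, variance $\sigma^2>0$, and finite third absolute moment $\rho=E|X|^3$, the Berry–Esseen theorem gives
$$\sup_x\Bigl|P\Bigl(\frac{\sum_{i=1}^n X_i}{\sigma\sqrt n}\le x\Bigr)-\Phi(x)\Bigr|\le\frac{C\rho}{\sigma^3\sqrt n}$$
for an absolute constant $C$. It quantifies the CLT rate.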
The theorem turns asymptotic convergence into finite-n error control. The third absolute moment is the rate debt. Without quantitative error, a CLT only says convergence eventually occurs; Berry–Esseen says how fast in Kolmogorov distance.
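The $n^{-1/2}$ rate can be observed exactly for coin flips, where the distribution of the sum is binomial. The sketch below assumes fair Bernoulli summands; after centering, $\rho/\sigma^3=1$, so the bound reads $C/\sqrt n$. No particular value of $C$ is assumed; the code only checks that $\sqrt n$ times the Kolmogorov distance stays bounded. Function names are illustrative.

```python
import math


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kolmogorov_distance(n, p=0.5):
    """Exact sup_x |P(Z_n <= x) - Phi(x)| for the standardized Binomial(n, p) sum."""
    mean = n * p
    scale = math.sqrt(n * p * (1.0 - p))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        phi = std_normal_cdf((k - mean) / scale)
        worst = max(worst, abs(cdf - phi))  # left limit at the jump point
        cdf += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        worst = max(worst, abs(cdf - phi))  # value at the jump point
    return worst


for n in (25, 100, 400):
    d = kolmogorov_distance(n)
    print(n, round(d, 5), round(d * math.sqrt(n), 4))
```

The last column stays roughly constant, which is what an error of exact order $n^{-1/2}$ looks like. For lattice variables the rate cannot be improved, because the distribution function has jumps of that size.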
18.9 Multivariate CLT
For iid random vectors $X_i\in\mathbb R^d$ with mean $m$ and covariance matrix $\Sigma$,
$$\frac{1}{\sqrt n}\sum_{i=1}^n(X_i-m)\Rightarrow N(0,\Sigma).$$
The limiting Gaussian $Z$ is characterized by
$$E e^{i\langle t,Z\rangle}=e^{-\frac12 t^\top\Sigma t}.$$
The Cramér–Wold device reduces the proof to one-dimensional CLTs: convergence of all projections $\langle t,X_n\rangle$ implies multivariate convergence. The covariance matrix encodes all second-order limiting geometry.
18.10 Delta method
If
$$\sqrt n(\hat\theta_n-\theta)\Rightarrow Z$$
and $g$ is differentiable at $\theta$, then
$$\sqrt n\bigl(g(\hat\theta_n)-g(\theta)\bigr)\Rightarrow Dg(\theta)Z.$$
In one dimension,
$$\sqrt n\bigl(g(\hat\theta_n)-g(\theta)\bigr)\Rightarrow g'(\theta)Z.$$
The delta method transports asymptotic normality through smooth transformations. If the first derivative vanishes, higher-order delta methods are needed with different scaling. The differentiability and nondegeneracy assumptions are the liftback gates.
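A one-dimensional worked example, with the setup assumed for illustration: iid $X_i$ with mean $\mu$ and variance $0<\sigma^2<\infty$, estimator $\hat\theta_n=\bar X_n$, and $g(x)=x^2$. The CLT gives $\sqrt n(\bar X_n-\mu)\Rightarrow Z\sim N(0,\sigma^2)$, and $g'(\mu)=2\mu$, so

$$
\sqrt n\bigl(\bar X_n^2-\mu^2\bigr)\Rightarrow 2\mu Z\sim N(0,\,4\mu^2\sigma^2).
$$

If $\mu=0$ the derivative vanishes and this limit is degenerate. The correct scaling is then $n$ rather than $\sqrt n$: by the continuous mapping theorem, $n\bar X_n^2=(\sqrt n\,\bar X_n)^2\Rightarrow Z^2$, which is $\sigma^2$ times a chi-square variable with one degree of freedom.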
18.11 Stable convergence
Stable convergence strengthens convergence in distribution by preserving joint convergence with bounded variables measurable with respect to an underlying σ-algebra. One writes
$$X_n\overset{\mathrm{st}}{\Rightarrow}X$$
when $X$ may live on an extension and joint limits with background randomness are retained.
This mode appears in martingale CLTs, random environments, and asymptotic statistics with mixed normal limits. Ordinary weak convergence forgets coupling to the original information; stable convergence remembers enough joint structure to support conditional limits.
18.12 Failure of CLT under heavy tails
If the $X_i$ have infinite variance, the Gaussian CLT with $\sqrt n$ normalization may fail. Heavy-tailed variables with
$$P(|X|>t)\sim Ct^{-\alpha},\qquad 0<\alpha<2,$$
can converge, after normalization by $n^{1/\alpha}$ and suitable centering, to stable laws rather than Gaussians.
The obstruction is that large jumps do not become negligible. Variance is infinite, and no single quadratic scale controls fluctuations. The correct limit carrier becomes stable law theory, not Gaussian theory. Thus “sum of many variables is normal” is false without tail hypotheses.
18.13 CLT versus tail probabilities
The CLT describes probabilities at fluctuation scale:
$$S_n-n\mu=O_P(\sqrt n).$$
It does not accurately describe rare deviations such as
$$S_n-n\mu\ge cn,\qquad c>0.$$
Those belong to large deviation theory, where probabilities decay exponentially and rate functions, not Gaussian densities, govern behavior.
Using CLT for far-tail estimates is a common error. Moderate deviations interpolate between CLT and large deviations, but require their own hypotheses. The scale must be declared before a limit theorem is applied.
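The scale dependence can be checked exactly for fair coin flips. The sketch below compares the exact binomial upper tail with the plain normal estimate (no continuity correction) at a deviation of about one standard deviation and at a deviation of order $n$. The function names and the specific thresholds are illustrative choices.

```python
import math


def binomial_upper_tail(n, k_min):
    """Exact P(S_n >= k_min) for S_n ~ Binomial(n, 1/2), using integer arithmetic."""
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2**n


def normal_upper_tail(x):
    """P(Z >= x) for standard normal Z; erfc keeps accuracy far in the tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def clt_tail_estimate(n, k_min):
    """Normal estimate of P(S_n >= k_min) using mean n/2 and sd sqrt(n)/2."""
    return normal_upper_tail((k_min - n / 2) / (math.sqrt(n) / 2))


n = 1000
for k_min in (516, 750):  # about one sd above the mean, then a deviation of n/4
    exact = binomial_upper_tail(n, k_min)
    approx = clt_tail_estimate(n, k_min)
    print(k_min, exact, approx, exact / approx)
```

At the fluctuation scale the ratio is close to 1. At the deviation of order $n$ the two numbers differ by orders of magnitude, because the exact exponential decay rate differs from the Gaussian one.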
18.14 Gaussian approximation residue
A Gaussian approximation has residue: centering error, variance estimation error, dependence error, tail error, lattice correction, finite-sample rate, and test-function class. A bare CLT only gives weak convergence:
Ef(Zn)→Ef(Z)for bounded continuous f. It does not automatically give density approximation, tail approximation, local probability approximation, or uniform finite-n accuracy.
The correct terminal depends on the claim. If the claim is asymptotic distribution, CLT may suffice. If the claim is finite probability, confidence interval accuracy, rare-event estimate, or density approximation, additional quantitative certificates are required.
Chapter 19 — Poisson and Rare-Event Limits
19.1 Law of small numbers
The law of small numbers says that the sum of many rare, approximately independent indicators tends to a Poisson distribution. If $X_n\sim\mathrm{Binomial}(n,\lambda/n)$, then
$$X_n\Rightarrow\mathrm{Poisson}(\lambda).$$
The heuristic is: many opportunities, each with small probability, with total expected count near $\lambda$.
The Poisson law is therefore the rare-event analogue of the Gaussian law. Gaussian limits arise from many small additive fluctuations with finite variance; Poisson limits arise from sparse counts of rare events. The scaling regime determines the limit carrier.
19.2 Poisson approximation
For a sum of indicators
$$W=\sum_{i\in I}I_i,\qquad p_i=P(I_i=1),$$
the natural Poisson parameter is
$$\lambda=E W=\sum_i p_i.$$
If the indicators are rare and weakly dependent, then $W$ is close to $\mathrm{Poisson}(\lambda)$.
Exact approximation requires an error metric such as total variation:
$$d_{TV}\bigl(\mathcal L(W),\mathrm{Poisson}(\lambda)\bigr).$$
The quality depends on probabilities of individual events and dependence neighborhoods. Rare-event approximation is not just matching the mean; dependence can create clusters and destroy Poisson behavior.
19.3 Binomial-to-Poisson limit
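Let $X_n\sim\mathrm{Binomial}(n,p_n)$ with $np_n\to\lambda$. Then for fixed $k$,
$$P(X_n=k)=\binom{n}{k}p_n^k(1-p_n)^{n-k}\to e^{-\lambda}\frac{\lambda^k}{k!}.$$
Thus $X_n\Rightarrow\mathrm{Poisson}(\lambda)$.
The proof shows the three components: $\binom{n}{k}\sim n^k/k!$, $p_n^k\sim(\lambda/n)^k$, and $(1-p_n)^{n-k}\to e^{-\lambda}$. This is the canonical sparse independent limit.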
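The pointwise limit can be turned into a total variation check. The sketch below computes $d_{TV}$ between $\mathrm{Binomial}(n,\lambda/n)$ and $\mathrm{Poisson}(\lambda)$ by summing the two probability mass functions recursively, and compares it with Le Cam's bound $np^2=\lambda^2/n$, a standard inequality that is not stated in the text above. The function name is illustrative.

```python
import math


def tv_binomial_poisson(n, lam):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam)."""
    p = lam / n
    if not 0.0 < p < 1.0:
        raise ValueError("need 0 < lam/n < 1")
    b = (1.0 - p) ** n      # Binomial pmf at k = 0
    q = math.exp(-lam)      # Poisson pmf at k = 0
    abs_diff = 0.0
    poisson_mass = 0.0
    for k in range(n + 1):
        abs_diff += abs(b - q)
        poisson_mass += q
        # Recursions pmf(k+1) / pmf(k) avoid large factorials.
        b *= (n - k) / (k + 1) * p / (1.0 - p)
        q *= lam / (k + 1)
    # Poisson mass above n has no binomial counterpart.
    abs_diff += max(0.0, 1.0 - poisson_mass)
    return 0.5 * abs_diff


for n in (10, 100, 1000):
    print(n, tv_binomial_poisson(n, 2.0), 2.0**2 / n)
```

The computed distance decreases like $1/n$ and stays below the bound, which is the quantitative form of the sparse limit.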
19.4 Chen–Stein method
The Chen–Stein method gives quantitative Poisson approximation for dependent indicators. It characterizes the Poisson distribution by an operator equation and bounds the distance between W and Poisson through local dependence terms.
A typical bound involves dependency neighborhoods $B_i$, with errors built from sums such as
$$\sum_i\sum_{j\in B_i}p_ip_j$$
and joint probabilities
$$\sum_i\sum_{j\in B_i,\ j\neq i}P(I_i=1,I_j=1).$$
The method pays the dependence debt explicitly. It is widely used in random graphs, occupancy problems, pattern matching, and rare-event counts.
19.5 Rare indicators
Rare indicators are variables $I_i=\mathbf 1_{A_i}$ with small $P(A_i)$. Their sum counts rare events. The Poisson regime requires not only that each $P(A_i)$ is small, but that the total mean remains finite:
$$\sum_i P(A_i)\to\lambda.$$
If rare events occur in clusters, the limit may be compound Poisson rather than Poisson. If dependence is too strong, no Poisson limit may hold. The audit is: individual rarity, total intensity, and clustering control.
19.6 Dependency neighborhoods
A dependency neighborhood Bi for Ii is a set of indices containing variables that may significantly depend on Ii. Outside Bi, approximate independence is assumed or proved. The smaller and weaker these neighborhoods are, the closer the count is to Poisson.
In random graphs, the indicator for a subgraph copy depends on indicators for overlapping copies. Nonoverlapping copies may be independent. Thus dependency neighborhoods are combinatorial overlap structures. Poisson approximation reduces to counting overlaps and showing their contribution vanishes.
19.7 Point processes preview
A point process is a random counting measure
$$N=\sum_i\delta_{X_i}$$
on a space $S$. A Poisson point process with intensity measure $\mu$ satisfies:
$$N(A)\sim\mathrm{Poisson}(\mu(A))$$
for measurable $A$, and counts on disjoint sets are independent.
Poisson point processes are the spatial version of rare-event limits. Instead of only counting how many rare events occur, they record where they occur. Many rare-event limits are better stated as convergence of point processes, with ordinary Poisson count convergence as a projection.
19.8 Poissonization and depoissonization
Poissonization replaces a fixed sample size n by a Poisson random sample size N∼Poisson(n). This often makes counts independent. For example, in occupancy problems, Poissonizing the number of balls makes bin occupancies independent Poisson variables.
Depoissonization transfers estimates back to fixed n. This transfer requires error control showing that randomizing the sample size did not change the target quantity too much. Poissonization is a carrier simplification; depoissonization is the liftback.
19.9 Occupancy and collision limits
In occupancy with $m$ balls and $n$ boxes, the number of boxes with exactly $r$ balls is a rare-event count in many regimes. Collision counts are sums over pairs:
$$C=\sum_{i<j}\mathbf 1\{X_i=X_j\}.$$
Here $E[C]=\binom{m}{2}/n\approx m^2/(2n)$. When $m^2/n\to\lambda$, collisions can have a Poisson limit, with mean $\lambda/2$.
The exact limit depends on scaling. If $m\ll\sqrt n$, collisions vanish. If $m\sim c\sqrt n$, collisions have nontrivial Poisson behavior. If $m\gg\sqrt n$, collisions become abundant. Occupancy models therefore display rare-event thresholds through elementary counting.
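The birthday problem is the classical instance of the collision threshold. The sketch below assumes $m$ balls thrown independently and uniformly into $n$ boxes, and compares the exact probability of no collision with the Poisson heuristic $P(C=0)\approx e^{-E[C]}$, where $E[C]=\binom{m}{2}/n$. Function names are illustrative.

```python
import math


def expected_collisions(m, n):
    """E[C] = (number of pairs) * P(a given pair collides) = C(m, 2) / n."""
    return math.comb(m, 2) / n


def prob_no_collision(m, n):
    """Exact P(C = 0): ball i must avoid the i boxes already occupied."""
    prob = 1.0
    for i in range(m):
        prob *= 1.0 - i / n
    return prob


def poisson_no_collision(m, n):
    """Poisson heuristic exp(-E[C]) for P(C = 0)."""
    return math.exp(-expected_collisions(m, n))


for m in (5, 23, 60):
    print(m, prob_no_collision(m, 365), poisson_no_collision(m, 365))
```

With $n=365$ the two columns agree closely when $m$ is of order $\sqrt n$ or smaller, and the agreement degrades as $m$ grows, which matches the threshold picture.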
Chapter 20 — Local Limit Theorems
20.1 Global versus local convergence
A global limit theorem, such as the CLT, says distribution functions converge:
$$P\Bigl(\frac{S_n-a_n}{b_n}\le x\Bigr)\to\Phi(x).$$
A local limit theorem estimates point probabilities or small-window probabilities:
$$P(S_n=k)\approx\frac{1}{b_n}\,g\Bigl(\frac{k-a_n}{b_n}\Bigr)$$
in lattice cases, where $g$ is the limiting density, or density-level approximations in continuous cases.
Local theorems are stronger. Weak convergence tests large intervals with continuity boundaries; local convergence probes individual atoms or shrinking windows. It requires finer Fourier control and must account for lattice structure.
20.2 Lattice distributions
A distribution is lattice if it is supported on
$$a+h\mathbb Z$$
for some span $h>0$. Sums of lattice variables remain on a lattice. For integer-valued iid variables with mean $\mu$ and variance $\sigma^2$, a typical local CLT has the form
$$P(S_n=k)\sim\frac{1}{\sigma\sqrt n}\,\varphi\Bigl(\frac{k-n\mu}{\sigma\sqrt n}\Bigr),$$
for admissible lattice points $k$, with $\varphi$ the standard normal density.
The lattice span matters. If the variable only takes even values, odd target points have probability zero. A local theorem that ignores lattice support is false. Lattice audit is therefore mandatory.
20.3 Nonlattice distributions
A nonlattice distribution is not concentrated on any shifted arithmetic progression a+hZ. For sums of nonlattice variables, local results are usually stated in terms of densities, small intervals, or smoothed probabilities rather than point masses, since P(Sn=x)=0 in continuous cases.
If $S_n$ has density $f_n$, a density local CLT may state
$$\sigma\sqrt n\,f_n(n\mu+\sigma\sqrt n\,x)\to\varphi(x)$$
uniformly in $x$, under regularity conditions. Nonlattice assumptions prevent periodic Fourier obstructions that appear in lattice cases.
20.4 Fourier inversion
Local limit theorems rely heavily on Fourier inversion. For integer-valued $S_n$,
$$P(S_n=k)=\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-itk}\phi_{S_n}(t)\,dt.$$
For continuous densities,
$$f_{S_n}(x)=\frac{1}{2\pi}\int_{\mathbb R}e^{-itx}\phi_{S_n}(t)\,dt$$
when inversion is justified.
The proof strategy splits the Fourier domain. Near t=0, characteristic functions approximate the Gaussian exponential. Away from zero, one needs decay bounds to show contributions are negligible. Local results require stronger global Fourier control than ordinary CLT.
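The integer-valued inversion formula can be verified numerically. The sketch below assumes $S_n\sim\mathrm{Binomial}(n,p)$, whose characteristic function is $(1-p+pe^{it})^n$. Because the integrand is a trigonometric polynomial of degree at most $n$, a uniform Riemann sum over one period with $n+1$ points recovers the integral up to rounding. Function names are illustrative.

```python
import cmath
import math


def binomial_cf(t, n, p):
    """Characteristic function of Binomial(n, p): (1 - p + p e^{it})^n."""
    return (1.0 - p + p * cmath.exp(1j * t)) ** n


def pmf_by_inversion(k, n, p):
    """P(S_n = k) from (1 / 2 pi) * integral of e^{-itk} phi(t) over one period."""
    grid = n + 1  # enough points to make the Riemann sum exact for this integrand
    total = 0j
    for j in range(grid):
        t = 2.0 * math.pi * j / grid
        total += cmath.exp(-1j * t * k) * binomial_cf(t, n, p)
    return (total / grid).real


n, p = 20, 0.3
for k in (0, 6, 20):
    direct = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
    print(k, pmf_by_inversion(k, n, p), direct)
```

Local limit proofs use the same integral, but replace the exact characteristic function by its Gaussian approximation near $t=0$ and bound the remainder away from zero.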
20.5 Aperiodicity
For integer-valued variables, aperiodicity means the support is not contained in a proper sublattice $a+h\mathbb Z$ with $h>1$. Equivalently, the characteristic function satisfies
$$|\phi(t)|<1\quad\text{for } t\in[-\pi,\pi]\setminus\{0\}$$
in the span-one case.
Aperiodicity prevents periodic zeros and inaccessible residue classes. Without it, the local limit must be restricted to reachable lattice points and corrected by the span. Aperiodicity is the discrete analogue of nonlattice regularity.
20.6 Gaussian local limit theorem
For iid integer-valued aperiodic variables with mean $\mu$ and variance $0<\sigma^2<\infty$, the Gaussian local limit theorem states
$$\sup_k\Bigl|\sigma\sqrt n\,P(S_n=k)-\varphi\Bigl(\frac{k-n\mu}{\sigma\sqrt n}\Bigr)\Bigr|\to 0.$$
This is stronger than the CLT because it approximates individual probabilities.
The theorem shows that the distribution mass near $k$ is approximately the Gaussian density times the lattice cell width $1/(\sigma\sqrt n)$. It is the correct bridge from continuous Gaussian shape to discrete point probabilities.
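The uniform error in the local limit theorem can be computed exactly in a simple case. The sketch below assumes Bernoulli($p$) summands, which are integer-valued with span one, so $S_n$ is binomial. The function names are illustrative.

```python
import math


def normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def local_clt_error(n, p=0.5):
    """sup_k |sigma sqrt(n) P(S_n = k) - phi((k - n mu) / (sigma sqrt(n)))| for Binomial(n, p)."""
    mean = n * p
    scale = math.sqrt(n * p * (1.0 - p))  # sigma * sqrt(n)
    worst = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        worst = max(worst, abs(scale * pmf - normal_density((k - mean) / scale)))
    return worst


for n in (25, 100, 400):
    print(n, local_clt_error(n))
```

The printed supremum shrinks as $n$ grows, in line with the theorem. The supremum runs over all lattice points, including those far in the tails where both terms are tiny.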
20.7 Edgeworth expansions
Edgeworth expansions refine normal approximation by adding correction terms involving cumulants. A typical density-level expansion has the form
$$f_n(x)=\varphi(x)\Bigl[1+\frac{P_1(x)}{\sqrt n}+\frac{P_2(x)}{n}+\cdots\Bigr]+o(n^{-m/2}),$$
where the $P_i$ are polynomials determined by cumulants.
These expansions give higher-order asymptotics beyond CLT. They require stronger moment and smoothness conditions and may fail in far tails. Edgeworth expansions are asymptotic series, not automatically positive densities or uniform global approximations.
20.8 Saddle-point methods preview
Saddle-point methods estimate probabilities using complex analytic or exponential tilting techniques. They are especially useful for local probabilities and tail probabilities where Gaussian approximation is insufficient. The method locates the dominant contribution to an integral representation, often from a point where a phase derivative vanishes.
In probability, saddle-point methods appear in sums, combinatorial enumeration, branching processes, and large deviations. They refine exponential-scale estimates by adding prefactors. The carrier is analytic: moment generating functions, cumulant generating functions, and contour integrals.
20.9 Local probability estimates
Local estimates determine probabilities of specific values or small windows:
$$P(S_n=k),\qquad P(S_n\in[x,x+\Delta]).$$
They are required in random walks, renewal theory, combinatorics, statistical mechanics, and number-theoretic probability. Global convergence may be too coarse because it cannot resolve individual mass points.
A local estimate usually needs variance scale, lattice/nonlattice classification, smoothness or aperiodicity, and Fourier decay. The output is often uniform over central ranges and weaker in tails. Tail-local estimates may require large deviation or saddle-point machinery.
20.10 Lattice obstruction audit
Before applying a local limit theorem, check the support. If $X$ is supported on $a+h\mathbb Z$, then $S_n$ is supported on
$$na+h\mathbb Z.$$
Any claimed approximation at points outside this set is false because the probability is exactly zero.
The audit also includes span, periodicity, residue classes, and whether the target variable has density or atoms. The CLT can ignore these details because intervals blur them; local limits cannot. Local probability is where carrier arithmetic reappears.
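The simple random walk makes the lattice obstruction concrete. The sketch below assumes iid steps uniform on $\{-1,+1\}$, so $a=-1$, $h=2$, $\mu=0$, $\sigma=1$, and $S_n$ lives on the integers with the same parity as $n$. The span-corrected estimate $\frac{h}{\sigma\sqrt n}\varphi\bigl(\frac{k}{\sigma\sqrt n}\bigr)$ is the usual form of the local limit on a lattice of span $h$; the function names are illustrative.

```python
import math


def walk_pmf(n, k):
    """Exact P(S_n = k) for a sum of n iid steps uniform on {-1, +1}."""
    if abs(k) > n or (n + k) % 2 != 0:
        return 0.0  # k is out of range or off the lattice -n + 2Z
    return math.comb(n, (n + k) // 2) / 2**n


def span_corrected_estimate(n, k):
    """(h / sqrt(n)) * phi(k / sqrt(n)) with span h = 2 and sigma = 1."""
    x = k / math.sqrt(n)
    return 2.0 / math.sqrt(n) * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


n = 100
for k in (0, 1, 10):
    print(k, walk_pmf(n, k), span_corrected_estimate(n, k))
```

At $k=1$ the exact probability is zero while the Gaussian formula returns a positive number: the estimate is only meaningful on reachable lattice points, and the factor $h=2$ compensates for the skipped ones.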