Probability Theory — Appendices
Appendix A — Measure Theory Needed for Probability
A.1 Semirings and π-systems
A semiring of sets is a primitive domain on which one can define a premeasure before extending it to a σ-algebra. A typical semiring is the family of half-open intervals (a,b] in $\mathbb{R}$, or rectangles of the form
$$(a_1,b_1]\times\cdots\times(a_d,b_d]$$
in $\mathbb{R}^d$. Semirings are useful because complicated measurable sets are built from simpler geometric blocks, while measures are often first defined on those blocks.
A π-system is a collection $\mathcal{P}$ of sets closed under finite intersections:
$$A,B\in\mathcal{P}\ \Rightarrow\ A\cap B\in\mathcal{P}.$$
Rectangles form a π-system. Cylinder sets in product spaces form a π-system. Half-lines (−∞,t] form a π-system generating the Borel σ-algebra on $\mathbb{R}$.
The value of π-systems is uniqueness. If two probability measures agree on a π-system $\mathcal{P}$, and $\mathcal{P}$ generates $\mathcal{F}$, then the measures agree on all of $\mathcal{F}$:
$$\mu|_{\mathcal{P}}=\nu|_{\mathcal{P}}\ \Rightarrow\ \mu|_{\sigma(\mathcal{P})}=\nu|_{\sigma(\mathcal{P})}.$$
This is why finite-dimensional distributions, interval probabilities, and rectangle probabilities can determine full laws.
A.2 Dynkin systems
A Dynkin system, or λ-system, is a collection $\mathcal{D}\subseteq 2^{\Omega}$ such that $\Omega\in\mathcal{D}$; if $A,B\in\mathcal{D}$ and $A\subseteq B$, then $B\setminus A\in\mathcal{D}$; and if $A_1,A_2,\ldots\in\mathcal{D}$ are disjoint, then
$$\bigcup_{n=1}^{\infty}A_n\in\mathcal{D}.$$
Every σ-algebra is a Dynkin system, but a Dynkin system need not be closed under finite intersections.
The π-λ theorem says that if $\mathcal{P}$ is a π-system, then the smallest Dynkin system containing $\mathcal{P}$ is exactly $\sigma(\mathcal{P})$. This is a proof engine: define
$$\mathcal{D}=\{A:\text{the desired identity holds for }A\},$$
prove $\mathcal{D}$ is a Dynkin system, prove it contains a generating π-system, then conclude the identity holds on the whole σ-algebra.
This method proves uniqueness of product measures, independence extensions, equality of laws from CDFs, and many conditioning identities. It is a carrier-extension gate: local verification on generators becomes global verification on all measurable events.
A.3 Monotone class theorem
The monotone class theorem is another extension engine. A monotone class of sets is closed under increasing countable unions and decreasing countable intersections. If a monotone class contains an algebra A, then it contains σ(A).
The functional version is even more important in probability. If a vector space of bounded measurable functions contains the constants, is closed under bounded monotone pointwise limits, and contains the indicators of a generating π-system (in particular, of a generating algebra), then it contains all bounded measurable functions. This allows a standard proof pattern:
$$\text{indicators}\to\text{simple functions}\to\text{bounded measurable functions}\to\text{nonnegative/integrable functions}.$$
For example, to prove an identity such as
$$E[f(X)g(Y)]=E[f(X)]\,E[g(Y)]$$
under independence, one first proves it for indicators, extends to simple functions by linearity, and then to bounded measurable functions by monotone class. The theorem licenses moving from event-level probability to function-level expectation.
A.4 Carathéodory extension theorem
Carathéodory’s extension theorem takes a premeasure defined on an algebra or semiring and extends it to a measure on the generated σ-algebra. If $\mu_0$ is countably additive on the primitive class $\mathcal{A}$, there exists a measure μ on $\sigma(\mathcal{A})$ extending $\mu_0$; if $\mu_0$ is also σ-finite, the extension is unique.
This theorem is the construction engine behind Lebesgue measure, product measure, probability laws from distribution functions, and stochastic processes from finite-dimensional data. One usually defines mass on simple sets first:
$$\mu_0((a,b])=F(b)-F(a),$$
or
$$(\mu\otimes\nu)(A\times B)=\mu(A)\,\nu(B),$$
then extends to the full σ-algebra.
The hidden gate is consistency. A premeasure cannot assign contradictory values on overlapping decompositions. In stochastic processes, finite-dimensional distributions must be mutually consistent before Kolmogorov extension can build a process law. Extension theorems do not repair inconsistent local data; they only globalize coherent data.
A.5 Lebesgue–Stieltjes measures
A Lebesgue–Stieltjes measure is generated by a nondecreasing right-continuous function $F:\mathbb{R}\to\mathbb{R}$. For a probability distribution function F, define initially
$$\mu_F((a,b])=F(b)-F(a).$$
Carathéodory extension then gives a Borel probability measure satisfying
$$\mu_F((-\infty,t])=F(t).$$
This construction is why every valid CDF corresponds to a unique probability law. It also unifies discrete, continuous, singular, and mixed distributions. If F has jumps, those jumps are atoms:
$$P(X=t)=F(t)-F(t-).$$
If F is absolutely continuous, then
$$F(t)=\int_{-\infty}^{t}f(x)\,dx$$
for a density f. If F is continuous singular, such as the Cantor distribution, there are no atoms and no Lebesgue density.
Thus the CDF is the universal one-dimensional representation. A density is a special case. A mass function is a special case. The measure is the carrier.
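As a small illustration of interval masses and atoms, here is a minimal Python sketch. The mixed CDF `F` (an atom of mass 0.3 at 0 plus 0.7 times an Exponential(1) law) and the helper names `mu_F` and `atom` are hypothetical choices made for this example, and the left limit F(t−) is only approximated by a small offset.

```python
import math

def F(t: float) -> float:
    """Hypothetical mixed CDF: atom of mass 0.3 at 0 plus 0.7 * Exponential(1)."""
    if t < 0:
        return 0.0
    return 0.3 + 0.7 * (1.0 - math.exp(-t))

def mu_F(a: float, b: float) -> float:
    """Mass assigned to the half-open interval (a, b]: F(b) - F(a)."""
    return F(b) - F(a)

def atom(t: float, eps: float = 1e-9) -> float:
    """Approximate the jump F(t) - F(t-) with a small left offset eps."""
    return F(t) - F(t - eps)

print(mu_F(-1.0, 0.0))  # (-1, 0] contains the atom at 0
print(atom(0.0))        # jump of F at 0
print(atom(1.0))        # F is continuous at 1, so this is nearly zero
print(mu_F(0.0, 1.0))   # absolutely continuous part on (0, 1]
```

The same interval function handles the atom and the density part without distinguishing them, which is the sense in which the measure, not the density or mass function, is the primary object.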
A.6 Radon measures
A Radon measure is a measure compatible with topology, usually locally finite and inner regular. On a locally compact Hausdorff space, inner regularity means
$$\mu(A)=\sup\{\mu(K):K\subseteq A,\ K\text{ compact}\}$$
for measurable A, while outer regularity means
$$\mu(A)=\inf\{\mu(U):A\subseteq U,\ U\text{ open}\}.$$
Radon measures are essential because probability on topological spaces requires approximation. Weak convergence, tightness, compactness, support, and regular conditional laws all rely on topological regularity. In Polish spaces, Borel probability measures are Radon, which is one reason Polish and standard Borel spaces dominate modern probability.
Radon regularity is the bridge between measurable events and geometric/topological reasoning. Without it, one can have a measure but lack useful approximation by compact or open sets. Probability would then lose many convergence and compactness tools.
A.7 Radon–Nikodym theorem
If ν≪μ, meaning μ(A)=0⇒ν(A)=0, and the measures are σ-finite, then there exists a measurable function f such that
$$\nu(A)=\int_A f\,d\mu$$
for every measurable A. The function f is the Radon–Nikodym derivative:
$$f=\frac{d\nu}{d\mu}.$$
This theorem is the source of densities, likelihood ratios, conditional expectation, and change of measure. If Q≪P, then
$$E_Q[X]=E_P\!\left[X\,\frac{dQ}{dP}\right].$$
In Bayesian statistics, likelihoods are Radon–Nikodym derivatives. In stochastic calculus, Girsanov densities are Radon–Nikodym derivatives between path measures.
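On a finite sample space the change-of-measure identity reduces to exact arithmetic. The following Python sketch checks it with rational numbers; the three-point space and the particular values of `P`, `Q`, and `X` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical three-point sample space with two probability vectors.
P = [Fr(1, 2), Fr(1, 3), Fr(1, 6)]
Q = [Fr(1, 4), Fr(1, 4), Fr(1, 2)]
X = [Fr(1), Fr(4), Fr(-2)]

# Q << P holds because P charges every point, so dQ/dP is the pointwise ratio.
dQdP = [q / p for q, p in zip(Q, P)]

# Expectation under Q computed directly, and as a P-expectation of X * dQ/dP.
E_Q_direct = sum(x * q for x, q in zip(X, Q))
E_Q_via_P = sum(x * d * p for x, d, p in zip(X, dQdP, P))

print(E_Q_direct, E_Q_via_P)
```

Both quantities agree exactly, and the derivative itself has P-expectation one, which is the normalization every likelihood ratio must satisfy.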
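Conditional expectation can also be constructed through Radon–Nikodym. For $X\in L^1$ and a sub-σ-algebra $\mathcal{G}$, define a signed measure on $\mathcal{G}$:
$$\nu(G)=\int_G X\,dP,\qquad G\in\mathcal{G}.$$
Since $\nu\ll P|_{\mathcal{G}}$, Radon–Nikodym gives
$$\frac{d\nu}{dP|_{\mathcal{G}}}=E[X\mid\mathcal{G}].$$
A.8 Disintegration theorem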
Disintegration decomposes a measure into conditional measures along fibers. If γ is a probability measure on X×Y and $\mu_Y$ is its Y-marginal, then under standard Borel hypotheses there exists a Markov kernel K(y,dx) such that
$$\gamma(dx,dy)=K(y,dx)\,\mu_Y(dy).$$
Equivalently,
$$\gamma(A\times B)=\int_B K(y,A)\,\mu_Y(dy).$$
This is the rigorous version of conditional distribution:
$$K(y,A)=P(X\in A\mid Y=y).$$
It is essential because P(Y=y) may be zero. The elementary formula
$$P(A\mid B)=\frac{P(A\cap B)}{P(B)}$$
does not handle conditioning on a continuous value. Disintegration does.
Disintegration is also the invariant form of Bayes’ theorem. Given a generative law
$$\gamma(d\theta,dx)=L(\theta,dx)\,\pi(d\theta),$$
the posterior is the reverse disintegration:
$$\gamma(d\theta,dx)=\Pi(d\theta\mid x)\,\mu_X(dx).$$
Bayesian updating is therefore conditional-measure transport, not merely symbolic ratio manipulation.
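In the discrete case, where every value of Y has positive mass, disintegration is just normalization of the joint table along each fiber. The sketch below, in Python with exact rationals, builds the kernel from a hypothetical joint law `gamma` and verifies the rectangle identity; the table entries and function names are assumptions made for the example.

```python
from fractions import Fraction as Fr

# Hypothetical joint law gamma on pairs (x, y), x in {0, 1}, y in {0, 1, 2}.
gamma = {
    (0, 0): Fr(1, 10), (1, 0): Fr(2, 10),
    (0, 1): Fr(3, 10), (1, 1): Fr(1, 10),
    (0, 2): Fr(1, 10), (1, 2): Fr(2, 10),
}
xs, ys = [0, 1], [0, 1, 2]

# Y-marginal: mu_Y(y) = sum over x of gamma(x, y).
mu_Y = {y: sum(gamma[(x, y)] for x in xs) for y in ys}

def K(y, A):
    """Kernel K(y, A): conditional mass of A given the fiber Y = y."""
    return sum(gamma[(x, y)] for x in A) / mu_Y[y]

def gamma_rect(A, B):
    """Joint mass of the rectangle A x B computed from the table."""
    return sum(gamma[(x, y)] for x in A for y in B)

def reconstructed(A, B):
    """Same mass rebuilt as the sum over y in B of K(y, A) * mu_Y(y)."""
    return sum(K(y, A) * mu_Y[y] for y in B)

print(K(1, [0]), gamma_rect([0], [1, 2]), reconstructed([0], [1, 2]))
```

In the continuous case the division by `mu_Y[y]` is not available pointwise, which is exactly the gap the disintegration theorem fills.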
A.9 Product measures
Given σ-finite measures μ on $(S,\mathcal{S})$ and ν on $(T,\mathcal{T})$, the product measure μ⊗ν is uniquely determined by
$$(\mu\otimes\nu)(A\times B)=\mu(A)\,\nu(B).$$
The product σ-algebra is
$$\mathcal{S}\otimes\mathcal{T}=\sigma\{A\times B:A\in\mathcal{S},\ B\in\mathcal{T}\}.$$
Product measure is the carrier of independent joint randomness. If X∼μ, Y∼ν, and X,Y are independent, then
$$\mathcal{L}(X,Y)=\mu\otimes\nu.$$
Conversely, if the joint law factors as a product, the variables are independent.
Tonelli and Fubini are the integration laws of product measure:
$$\int f\,d(\mu\otimes\nu)=\int\!\!\int f(x,y)\,\nu(dy)\,\mu(dx)$$
under nonnegativity or integrability hypotheses. Thus product measure supplies not only independence but also iterated expectation and multidimensional integration.
A.10 Weak convergence of measures
Weak convergence of probability measures on a metric space S, written
$$\mu_n\Rightarrow\mu,$$
means
$$\int f\,d\mu_n\to\int f\,d\mu$$
for every bounded continuous $f:S\to\mathbb{R}$. This is convergence by continuous bounded probes, not by all measurable sets.
The portmanteau theorem gives equivalent gates. For closed F,
$$\limsup_{n}\mu_n(F)\le\mu(F),$$
for open G,
$$\liminf_{n}\mu_n(G)\ge\mu(G),$$
and for μ-continuity sets A, meaning μ(∂A)=0,
$$\mu_n(A)\to\mu(A).$$
Weak convergence is the carrier of convergence in distribution. It is weaker than total variation and does not generally control expectations of unbounded functions. Tightness is the compactness condition. On Polish spaces, Prokhorov’s theorem says tightness is equivalent to relative compactness in the weak topology. Weak convergence is therefore measure theory plus topology.
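A standard example separating weak convergence from setwise convergence is the sequence of point masses at 1/n, which converges weakly to the point mass at 0. The Python sketch below checks the test-function definition and the closed-set inequality on the set {0}; the test function `f` is an arbitrary hypothetical choice.

```python
import math

def f(x: float) -> float:
    """A bounded continuous test function (hypothetical choice)."""
    return math.cos(x) / (1.0 + x * x)

ns = [1, 10, 100, 1000, 10**6]

# The integral of f against the point mass at 1/n is just f(1/n).
integrals = [f(1.0 / n) for n in ns]
limit_integral = f(0.0)
gaps = [abs(v - limit_integral) for v in integrals]

# Closed set F = {0}: mu_n(F) = 0 for every n, while the limit has mu(F) = 1.
mass_of_F = [1.0 if (1.0 / n) == 0.0 else 0.0 for n in ns]
mu_F_limit = 1.0

print(gaps)
print(max(mass_of_F), "<=", mu_F_limit)
```

The integrals converge, and the closed-set inequality holds strictly, but the masses of {0} do not converge to the limiting mass. This is consistent with the portmanteau theorem because {0} is not a continuity set of the limit: its boundary carries all the mass.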
Appendix B — Functional Analysis Needed for Probability
B.1 Normed spaces
A normed vector space is a vector space V equipped with a function ∥⋅∥ satisfying positivity (with ∥x∥=0 only for x=0), homogeneity, and the triangle inequality:
$$\|x\|\ge 0,\qquad \|\lambda x\|=|\lambda|\,\|x\|,\qquad \|x+y\|\le\|x\|+\|y\|.$$
In probability, the central examples are $L^p$ spaces:
$$\|X\|_p=(E|X|^p)^{1/p},\qquad p\ge 1.$$
Norms measure error and convergence. $L^1$ controls mean absolute error, $L^2$ controls quadratic error, $L^\infty$ controls essential worst-case size, and $L^p$ interpolates between typical and tail behavior. Many probabilistic statements are norm statements:
$$X_n\to X\text{ in }L^p\iff\|X_n-X\|_p\to 0.$$
The norm is also a stability certificate. If a random approximation is close in $L^2$, it controls variance-scale error. If it is close in $L^\infty$, it controls all outcomes except null sets. The chosen norm determines what kind of probabilistic error is being certified.
B.2 Banach spaces
A Banach space is a complete normed space. Completeness means every Cauchy sequence converges to a point in the space. For p≥1,
$$L^p(\Omega,\mathcal{F},P)$$
is a Banach space after quotienting by almost sure equality.
Completeness is what makes approximation methods legitimate. One defines objects first for simple functions, step processes, bounded functions, or finite-dimensional approximations, then extends by completion. Itô integration is built this way: define the integral for simple predictable processes, prove an isometry, then complete in $L^2$.
Banach-space-valued probability also appears in empirical processes, random series, stochastic PDE, and concentration in function spaces. A random element may take values in a Banach space, and convergence may be norm convergence:
$$E\|X_n-X\|^p\to 0.$$
This is stronger than finite-dimensional convergence and requires actual control of the full object.
B.3 Hilbert spaces
A Hilbert space is a complete inner-product space. The core probability example is $L^2$, with
$$\langle X,Y\rangle=E[XY]$$
in the real case, and
$$\langle X,Y\rangle=E[X\overline{Y}]$$
in the complex case. The norm is
$$\|X\|_2^2=E[|X|^2].$$
Hilbert geometry turns probability into orthogonal projection. Conditional expectation is the projection of $X\in L^2$ onto the closed subspace $L^2(\mathcal{G})$:
$$E[X\mid\mathcal{G}]=\operatorname{Proj}_{L^2(\mathcal{G})}X.$$
The error is orthogonal to every $\mathcal{G}$-measurable square-integrable random variable Z:
$$E[(X-E[X\mid\mathcal{G}])Z]=0.$$
Gaussian processes are also Hilbert-space objects. Covariance kernels define inner products, and Gaussian Hilbert spaces encode linear Gaussian structure. Orthogonality and independence coincide for jointly Gaussian centered variables, making $L^2$ geometry unusually powerful in Gaussian probability.
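When the sub-σ-algebra is generated by a finite partition, the projection is a weighted block average and the orthogonality relation can be checked exactly. The Python sketch below does this on a hypothetical six-point space; the probabilities, the values of `X`, the partition `blocks`, and the test variable `Z` are all assumptions made for the example.

```python
from fractions import Fraction as Fr

# Hypothetical six-point probability space and random variable X.
prob = [Fr(1, 12), Fr(3, 12), Fr(2, 12), Fr(2, 12), Fr(2, 12), Fr(2, 12)]
X = [Fr(3), Fr(-1), Fr(0), Fr(6), Fr(3), Fr(5)]
# G is generated by this partition; G-measurable means constant on each block.
blocks = [[0, 1], [2, 3, 4], [5]]

def expect(values):
    """Expectation of a random variable given as a list of values."""
    return sum(v * p for v, p in zip(values, prob))

def cond_exp(values):
    """E[values | G]: probability-weighted average on each block."""
    out = [Fr(0)] * len(values)
    for block in blocks:
        mass = sum(prob[w] for w in block)
        avg = sum(values[w] * prob[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out

EX_G = cond_exp(X)
residual = [x - e for x, e in zip(X, EX_G)]

# An arbitrary G-measurable Z (constant on blocks).
Z = [Fr(2), Fr(2), Fr(-7), Fr(-7), Fr(-7), Fr(1)]
inner = expect([r * z for r, z in zip(residual, Z)])

# Pythagoras: E[(X - Z)^2] = E[(X - E[X|G])^2] + E[(E[X|G] - Z)^2].
lhs = expect([(x - z) ** 2 for x, z in zip(X, Z)])
rhs = expect([r ** 2 for r in residual]) + expect([(e - z) ** 2 for e, z in zip(EX_G, Z)])

print(EX_G, inner, lhs == rhs)
```

The Pythagorean identity is what makes the conditional expectation the best mean-square predictor among all variables measurable with respect to the partition.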
B.4 Duality
The dual space $V^*$ consists of continuous linear functionals on V. For 1<p<∞, the dual of $L^p$ is $L^q$, where
$$\frac{1}{p}+\frac{1}{q}=1,$$
via the pairing
$$\langle X,Y\rangle=E[XY].$$
Hölder’s inequality,
$$|E[XY]|\le\|X\|_p\,\|Y\|_q,$$
is the boundedness certificate for this pairing.
Duality is central to probability because laws are often identified by test functions:
$$\mu\mapsto\int f\,d\mu.$$
Weak convergence, total variation, Wasserstein duality, optimal transport, risk measures, hypothesis testing, and convex concentration all use dual formulations.
Duality also exposes what a convergence mode can see. If convergence is defined by a small class of test functionals, it may miss tail behavior, oscillation, or singular structure. The dual class is the sensor array. Too small a class gives weak information; too large a class may require stronger convergence.
B.5 Weak topology
The weak topology on a normed space is the coarsest topology making every continuous linear functional continuous. A sequence $x_n$ converges weakly to x if
$$\ell(x_n)\to\ell(x)$$
for every $\ell\in V^*$. Weak convergence is generally weaker than norm convergence.
For probability laws, weak convergence means convergence against bounded continuous functions:
$$\mu_n\Rightarrow\mu\iff\int f\,d\mu_n\to\int f\,d\mu\ \text{ for every bounded continuous }f.$$
This is not norm convergence of measures and not convergence on all measurable sets. It is topology-sensitive law convergence.
Weak topology is valuable because it supplies compactness. Bounded sequences in reflexive spaces have weakly convergent subsequences. Probability analogues include tightness and Prokhorov compactness. The price is weaker conclusions: weak convergence does not automatically control norms, tails, or expectations of unbounded functions.
B.6 Compactness criteria
Compactness criteria convert boundedness or regularity into subsequential convergence. In finite dimensions, closed bounded sets are compact. In infinite dimensions, this is false, so one needs additional structure.
For probability measures on Polish spaces, tightness is the main compactness criterion:
$$\forall\varepsilon>0,\ \exists K\text{ compact}:\ \sup_n\mu_n(K^c)<\varepsilon.$$
By Prokhorov’s theorem, a tight family is relatively compact in the weak topology.
In function spaces, Arzelà–Ascoli gives compactness from uniform boundedness and equicontinuity. In stochastic-process convergence, one proves tightness by controlling oscillations uniformly over the family of processes:
$$P(w_X(\delta)>\varepsilon)\to 0$$
as δ↓0, where $w_X$ is a modulus of continuity or Skorokhod oscillation functional.
The standard limit-proof architecture is:
$$\text{compactness/tightness}\to\text{subsequential limit}\to\text{identify every limit}\to\text{full convergence}.$$
B.7 Separability
A topological space is separable if it has a countable dense subset. Separability is a countability gate. It lets one replace uncountable operations by countable approximations in controlled settings.
For stochastic processes, separability is often what makes suprema measurable. If $X_t$ has continuous paths on [0,T], then
$$\sup_{t\in[0,T]}X_t=\sup_{t\in\mathbb{Q}\cap[0,T]}X_t.$$
The right side is a countable supremum of measurable variables, hence measurable. Without path regularity or separability, the uncountable supremum may not be measurable.
Standard Borel and Polish spaces owe much of their usefulness to separability. Regular conditional probabilities, disintegration, measurable selection, and weak convergence theory all behave better in separable carriers. Nonseparable spaces often generate version and measurability pathologies.
B.8 Riesz representation
Riesz representation theorems identify linear functionals with measures or inner products. For locally compact Hausdorff spaces, positive linear functionals on $C_c(X)$ correspond to Radon measures:
$$L(f)=\int f\,d\mu.$$
For Hilbert spaces, every continuous linear functional has the form
$$\ell(x)=\langle x,y\rangle$$
for a unique y.
In probability, Riesz representation explains why a law can be identified by its integrals against test functions. If
$$\int f\,d\mu=\int f\,d\nu$$
for a sufficiently rich class of f, then μ=ν. This underlies weak convergence, distribution identification, and dual formulations of distances between measures.
It also links Markov semigroups and kernels. A positive linear operator P acting on test functions can often be represented by a transition kernel:
$$Pf(x)=\int f(y)\,P(x,dy).$$
Thus operator action and probabilistic transition are dual descriptions.
B.9 Operators and semigroups
A semigroup of operators $(P_t)_{t\ge 0}$ satisfies
$$P_0=I,\qquad P_{t+s}=P_tP_s.$$
For a Markov process,
$$P_tf(x)=E_x[f(X_t)].$$
The semigroup is the expectation-evolution operator.
The generator is the infinitesimal operator
$$Af=\lim_{t\downarrow 0}\frac{P_tf-f}{t},$$
defined on a domain of functions for which the limit exists. For a diffusion with drift b and diffusion matrix a,
$$Af=b\cdot\nabla f+\tfrac{1}{2}\operatorname{Tr}(a\nabla^2 f).$$
The generator connects stochastic processes to PDE: for f in the domain, $u(t,x)=P_tf(x)$ solves
$$\partial_tu=Au.$$
Semigroup theory is the analytic carrier of Markov probability. Transition kernels, invariant measures, spectral gaps, ergodicity, diffusion equations, and martingale problems can all be expressed through operators. The domain of the generator is not optional; it determines which functions the infinitesimal formula legally acts on.
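The semigroup law and the generator limit can be checked numerically on the simplest continuous-time example, a two-state Markov chain, whose transition matrices have a closed form. In the Python sketch below the jump rates `a` and `b`, the test function `f`, and the helper names are hypothetical choices; the generator is only approximated by a finite difference with a small step `h`.

```python
import math

a, b = 2.0, 0.5          # hypothetical jump rates 0 -> 1 and 1 -> 0
lam = a + b
Q = [[-a, a], [b, -b]]   # generator matrix of the two-state chain

def P(t: float):
    """Closed-form transition matrix P_t of the two-state chain."""
    e = math.exp(-lam * t)
    return [[(b + a * e) / lam, (a - a * e) / lam],
            [(b - b * e) / lam, (a + b * e) / lam]]

def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply(M, f):
    """Action of a 2x2 matrix on a function f given as a list of two values."""
    return [sum(M[i][j] * f[j] for j in range(2)) for i in range(2)]

# Semigroup property: P_{t+s} = P_t P_s.
t, s = 0.3, 1.1
lhs, rhs = P(t + s), matmul(P(t), P(s))
semigroup_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))

# Generator: (P_h f - f) / h approaches Q f as h decreases to 0.
f = [1.0, 3.0]
h = 1e-6
finite_diff = [(u - v) / h for u, v in zip(apply(P(h), f), f)]
generator_gap = max(abs(u - v) for u, v in zip(finite_diff, apply(Q, f)))

print(semigroup_gap, generator_gap)
```

On a finite state space every function lies in the domain of the generator, so the domain issue stressed above only becomes visible in infinite-dimensional settings such as diffusions.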
Appendix C — Fourier and Transform Methods
C.1 Characteristic functions
The characteristic function of a real random variable X is
$$\phi_X(t)=E[e^{itX}],\qquad t\in\mathbb{R}.$$
It always exists because
$$|e^{itX}|=1.$$
For a random vector $X\in\mathbb{R}^d$,
$$\phi_X(t)=E[e^{i\langle t,X\rangle}],\qquad t\in\mathbb{R}^d.$$
Characteristic functions determine laws:
$$\phi_X=\phi_Y\ \Rightarrow\ X\overset{d}{=}Y.$$
They convert independent sums into products. If X,Y are independent,
$$\phi_{X+Y}(t)=\phi_X(t)\,\phi_Y(t).$$
For iid sums $S_n=X_1+\cdots+X_n$,
$$\phi_{S_n}(t)=\phi_X(t)^n.$$
Near zero, characteristic functions encode moments. If E[X]=μ and $E[X^2]<\infty$, then
$$\phi_X(t)=1+i\mu t-\tfrac{1}{2}E[X^2]\,t^2+o(t^2).$$
For centered variables with variance $\sigma^2$,
$$\phi_X(t)=1-\frac{\sigma^2t^2}{2}+o(t^2).$$
This local expansion is the analytic engine behind the CLT.
C.2 Moment generating functions
The moment generating function is
$$M_X(t)=E[e^{tX}],$$
where finite. Unlike the characteristic function, it may not be finite for all t, or even for any nonzero t. Heavy-tailed variables often have infinite $M_X(t)$ for t>0.
When $M_X$ is finite in a neighborhood of zero, it encodes moments:
$$M_X^{(k)}(0)=E[X^k].$$
For independent variables,
$$M_{X+Y}(t)=M_X(t)\,M_Y(t).$$
The log-moment generating function
$$\Lambda_X(t)=\log M_X(t)$$
is central in Chernoff bounds and large deviations. Chernoff gives
$$P(X\ge a)\le e^{-ta}M_X(t)$$
for t>0, hence
$$P(X\ge a)\le\exp\Big(-\sup_{t>0}\{ta-\Lambda_X(t)\}\Big).$$
The Legendre transform of Λ is the large-deviation rate function in Cramér-type theory.
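For a binomial sum the supremum in the Chernoff bound can be solved in closed form, which makes it easy to compare the optimized bound with the exact tail. The Python sketch below assumes a Binomial(n, p) variable with a threshold above the mean; the parameter values and function names are hypothetical, and the formula for `t_star` comes from setting the derivative of $ta-\Lambda(t)$ to zero.

```python
import math

def binom_tail(n: int, p: float, a: int) -> float:
    """Exact P(S >= a) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

def log_mgf(n: int, p: float, t: float) -> float:
    """Lambda(t) = log E[exp(t S)] = n * log(1 - p + p e^t)."""
    return n * math.log(1 - p + p * math.exp(t))

def chernoff(n: int, p: float, a: int, t: float) -> float:
    """Chernoff bound exp(-(t a - Lambda(t))) for a fixed t > 0."""
    return math.exp(-(t * a - log_mgf(n, p, t)))

def optimal_t(n: int, p: float, a: int) -> float:
    """Maximizer of t a - Lambda(t), valid when p < a / n < 1."""
    q = a / n
    return math.log(q * (1 - p) / (p * (1 - q)))

n, p, a = 100, 0.5, 60   # hypothetical parameters with a / n > p
t_star = optimal_t(n, p, a)
exact = binom_tail(n, p, a)
best_bound = chernoff(n, p, a, t_star)

print(exact, best_bound, t_star)
```

The optimized exponent equals n times the relative entropy between Bernoulli(a/n) and Bernoulli(p), which is the Cramér rate function for this model. The bound captures the exponential rate but not the polynomial prefactor, so the exact tail is noticeably smaller.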
C.3 Laplace transforms
For a nonnegative random variable X, the Laplace transform is
$$\mathcal{L}_X(\lambda)=E[e^{-\lambda X}],\qquad\lambda\ge 0.$$
It always exists for λ≥0 because $0\le e^{-\lambda X}\le 1$. For probability measures on [0,∞), the Laplace transform determines the law.
Laplace transforms are especially effective for waiting times, subordinators, branching processes, renewal theory, hitting times, and nonnegative random variables. If X,Y≥0 are independent, then
$$\mathcal{L}_{X+Y}(\lambda)=\mathcal{L}_X(\lambda)\,\mathcal{L}_Y(\lambda).$$
For a nonnegative variable,
$$E[X]=-\mathcal{L}_X'(0+)$$
when the derivative exists and the expectation is finite. More generally, derivatives at zero encode moments with alternating signs:
$$\mathcal{L}_X^{(k)}(0+)=(-1)^kE[X^k].$$
Laplace methods are one-sided transform methods suited to positive carriers.
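The alternating-sign moment formula can be illustrated with an exponential variable, whose Laplace transform is known in closed form. The Python sketch below uses one-sided finite differences at zero, so the results are approximations; the rate, the step `h`, and the function name are hypothetical choices.

```python
def laplace_exponential(lam: float, rate: float) -> float:
    """Laplace transform of Exponential(rate): rate / (rate + lam), for lam >= 0."""
    return rate / (rate + lam)

rate = 2.0
h = 1e-4

L0 = laplace_exponential(0.0, rate)
L1 = laplace_exponential(h, rate)
L2 = laplace_exponential(2 * h, rate)

# One-sided (forward) differences approximate derivatives at 0+.
first_derivative = (L1 - L0) / h
second_derivative = (L2 - 2 * L1 + L0) / h**2

mean_estimate = -first_derivative          # (-1)^1 * L'(0+)  approximates E[X]
second_moment_estimate = second_derivative  # (-1)^2 * L''(0+) approximates E[X^2]

print(mean_estimate, 1 / rate)
print(second_moment_estimate, 2 / rate**2)
```

Only nonnegative arguments of the transform are used, matching the one-sided nature of the method.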
C.4 Fourier inversion
Fourier inversion reconstructs a distribution from its characteristic function. If X has an integrable characteristic function, then X has a bounded continuous density
$$f_X(x)=\frac{1}{2\pi}\int_{\mathbb{R}}e^{-itx}\phi_X(t)\,dt.$$
For integer-valued (lattice) variables,
$$P(X=k)=\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-itk}\phi_X(t)\,dt.$$
Fourier inversion is the native engine for local limit theorems. Weak convergence may use pointwise convergence of characteristic functions, but local probabilities require integrating the characteristic function and controlling its behavior away from zero.
The standard local-limit proof splits the integral:
$$\int=\int_{|t|\le\delta}+\int_{|t|>\delta}.$$
Near zero, one uses Taylor expansion:
$$\phi(t/\sqrt{n})^n\to e^{-\sigma^2t^2/2}.$$
Away from zero, one needs decay or aperiodicity:
$$|\phi(t)|<1.$$
This second region is where lattice and smoothness obstructions live.
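The lattice inversion formula can be tested directly on a binomial variable, whose characteristic function is a trigonometric polynomial. The Python sketch below replaces the integral over [−π, π] by an equispaced Riemann sum, which is exact up to rounding for such polynomials once the number of grid points exceeds n. The parameters and function names are hypothetical.

```python
import cmath
import math

def phi_binomial(t: float, n: int, p: float) -> complex:
    """Characteristic function of Binomial(n, p): (1 - p + p e^{it})^n."""
    return (1 - p + p * cmath.exp(1j * t)) ** n

def invert_lattice(phi, k: int, num_points: int) -> float:
    """Equispaced Riemann sum for (1 / 2 pi) * integral over [-pi, pi] of e^{-itk} phi(t) dt."""
    total = 0j
    for j in range(num_points):
        t = -math.pi + 2 * math.pi * j / num_points
        total += cmath.exp(-1j * t * k) * phi(t)
    # (1 / 2 pi) * (2 pi / num_points) * sum = sum / num_points
    return (total / num_points).real

n, p = 10, 0.3   # hypothetical parameters
recovered = [invert_lattice(lambda t: phi_binomial(t, n, p), k, 64) for k in range(n + 1)]
exact = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

print(max(abs(r - e) for r, e in zip(recovered, exact)))
```

For general lattice laws the sum is only an approximation of the integral, and the accuracy depends on how fast the characteristic function varies on the grid.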
C.5 Lévy continuity theorem
Lévy’s continuity theorem states that if the characteristic functions $\phi_n(t)$ of probability laws $\mu_n$ converge pointwise to a function ϕ(t) that is continuous at 0, then ϕ is the characteristic function of a probability law μ, and
$$\mu_n\Rightarrow\mu.$$
Conversely, if $\mu_n\Rightarrow\mu$, then
$$\phi_n(t)\to\phi(t)$$
for every t.
The continuity-at-zero condition prevents loss of mass. A pointwise limit of characteristic functions need not correspond to a probability law if mass escapes or the limit is discontinuous at zero. Thus Lévy’s theorem is a law-convergence gate, not merely an analytic convenience.
The CLT proof uses it directly. For centered iid variables with variance $\sigma^2$,
$$\phi_{S_n/(\sigma\sqrt{n})}(t)=\left[\phi_X\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n\to e^{-t^2/2}.$$
Since $e^{-t^2/2}$ is continuous at zero and is the standard normal characteristic function, convergence in distribution follows.
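The convergence of the normalized characteristic functions is easy to observe for symmetric ±1 (Rademacher) steps, where the characteristic function is cos t and the variance is 1. The Python sketch below evaluates the n-th power at one hypothetical frequency `t` and compares it with the Gaussian limit; the choice of step distribution is an assumption made for the example.

```python
import math

def phi_normalized_sum(t: float, n: int) -> float:
    """Characteristic function of S_n / sqrt(n) for iid +/-1 steps: cos(t / sqrt(n))^n."""
    return math.cos(t / math.sqrt(n)) ** n

t = 1.5                          # hypothetical frequency
target = math.exp(-t * t / 2)    # standard normal characteristic function at t

ns = [10, 100, 1000, 10000]
errors = [abs(phi_normalized_sum(t, n) - target) for n in ns]

print(errors)
```

The error shrinks roughly like 1/n here because the third moment vanishes by symmetry. Pointwise convergence at each t, combined with continuity of the limit at zero, is exactly the input Lévy’s theorem needs.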
C.6 Smoothing inequalities
Smoothing inequalities bound distances between distribution functions using characteristic functions. A typical Berry–Esseen-style smoothing inequality has the form
$$\sup_x|F(x)-G(x)|\le C\int_{-T}^{T}\left|\frac{\phi_F(t)-\phi_G(t)}{t}\right|dt+\frac{C'}{T},$$
where G has sufficient regularity, for instance a bounded derivative.
The first term measures Fourier discrepancy on a bounded frequency range. The second term is the smoothing error from truncating the integral at frequency T. Optimizing T balances the analytic approximation against the truncated tail of the transform integral.
Smoothing inequalities are the bridge from characteristic-function convergence to quantitative distributional error. Lévy’s theorem gives convergence but not rate. Smoothing inequalities supply rate, making them central in Berry–Esseen estimates, local limit bounds, and approximation theory.
C.7 Saddle-point estimates
Saddle-point methods evaluate probabilities or coefficients through complex or exponential integral representations. If a probability can be written as
$$P(S_n=k)=\frac{1}{2\pi i}\int e^{n\Psi(z)}a(z)\,dz,$$
the dominant contribution often comes from a point $z_0$ satisfying
$$\Psi'(z_0)=0.$$
Expanding around $z_0$,
$$\Psi(z)\approx\Psi(z_0)+\tfrac{1}{2}\Psi''(z_0)(z-z_0)^2,$$
gives Gaussian-type prefactors multiplying the main exponential term.
In probability, saddle points appear in precise large deviations, local probabilities, random combinatorial structures, branching processes, occupancy models, and statistical mechanics. They refine exponential estimates by extracting the correct polynomial prefactor and local shape.
The method is powerful but gate-heavy. One needs analyticity, contour deformation legitimacy, nondegenerate saddle, control away from the saddle, and error bounds. Without those, saddle-point notation is only heuristic.
C.8 Transform methods in limit theory
Transform methods convert probabilistic operations into algebraic or analytic operations. Independent sums become products of transforms:
$$\phi_{X+Y}=\phi_X\,\phi_Y.$$
Scaling becomes argument rescaling:
$$\phi_{aX}(t)=\phi_X(at).$$
Centering becomes multiplication by a phase:
$$\phi_{X-\mu}(t)=e^{-it\mu}\phi_X(t).$$
Different transforms fit different carriers. Characteristic functions are universal for real laws. Moment generating functions are powerful when exponential moments exist. Laplace transforms fit nonnegative variables. Probability generating functions
$$G_X(s)=E[s^X]$$
fit nonnegative integer-valued variables. Cauchy/Stieltjes transforms fit spectral measures and random matrix theory:
$$G_\mu(z)=\int\frac{1}{z-x}\,\mu(dx).$$
Limit theory often follows the same pattern:
$$\text{normalize}\to\text{transform}\to\text{pointwise/analytic convergence}\to\text{inversion or continuity theorem}\to\text{law-level terminal}.$$
The counterkernel is using a transform outside its domain: MGF without exponential moments, density inversion without integrability, moment matching without determinacy, or pointwise transform convergence without continuity/tightness.