A.1 Semirings and π-systems

A semiring of sets is a primitive domain on which one can define a premeasure before extending it to a σ-algebra. A typical semiring is the family of half-open intervals $(𝑎, 𝑏]$ in $𝑅$ , or rectangles of the form

(𝑎_{1}, 𝑏_{1}] \times \dots \times (𝑎_{𝑑}, 𝑏_{𝑑}]

in $𝑅^{𝑑}$ . Semirings are useful because complicated measurable sets are built from simpler geometric blocks, while measures are often first defined on those blocks.

A π-system is a collection $𝑃$ of sets closed under finite intersections:

𝐴, 𝐵 \in 𝑃 \Rightarrow 𝐴 \cap 𝐵 \in 𝑃 .

Rectangles form a π-system. Cylinder sets in product spaces form a π-system. Half-lines $(- \infty, 𝑡]$ form a π-system generating the Borel σ-algebra on $𝑅$ .

The value of π-systems is uniqueness. If two probability measures agree on a π-system $𝑃$ and $𝑃$ generates $𝐹$, then the measures agree on all of $𝐹$ (for measures that are not finite, one additionally needs sets in $𝑃$ of finite measure increasing to the whole space):

𝜇 ∣_{𝑃} = 𝜈 ∣_{𝑃} \Rightarrow 𝜇 ∣_{𝜎 (𝑃)} = 𝜈 ∣_{𝜎 (𝑃)} .

This is why finite-dimensional distributions, interval probabilities, and rectangle probabilities can determine full laws.
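A small finite check, offered as a sketch rather than a proof, shows why closure under intersections matters. The sample space, the weight tables `mu` and `nu`, and the helper names are illustrative assumptions. The family {{1,2},{2,3}} generates every subset of {1,2,3,4} but is not a π-system, and two probability measures can agree on it while disagreeing on {2}.

```python
from fractions import Fraction

OMEGA = frozenset({1, 2, 3, 4})

def generated_sigma_algebra(generators):
    """Brute-force closure under complement and pairwise union (finite case only)."""
    family = {frozenset(), OMEGA} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        for a in list(family):
            for b in list(family):
                for c in (OMEGA - a, a | b):
                    if c not in family:
                        family.add(c)
                        changed = True
    return family

def is_pi_system(family):
    """True when the family is closed under pairwise intersection."""
    fam = {frozenset(s) for s in family}
    return all((a & b) in fam for a in fam for b in fam)

def mass(weights, event):
    """Measure of an event under point weights."""
    return sum((weights[w] for w in event), Fraction(0))

generators = [{1, 2}, {2, 3}]
mu = {w: Fraction(1, 4) for w in OMEGA}
nu = {1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(1, 2), 4: Fraction(0)}

sigma = generated_sigma_algebra(generators)
agree_on_generators = all(mass(mu, g) == mass(nu, g) for g in generators)
agree_on_sigma = all(mass(mu, a) == mass(nu, a) for a in sigma)

print(is_pi_system(generators))          # the intersection {2} is missing
print(len(sigma), agree_on_generators, agree_on_sigma)
```

Adding {2} to the generators makes the family a π-system, and agreement on that enlarged family already rules out the pair above.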

A.2 Dynkin systems

A Dynkin system, or λ-system, is a collection $𝐷 \subseteq 2^{Ω}$ such that $Ω \in 𝐷$ , if $𝐴, 𝐵 \in 𝐷$ and $𝐴 \subseteq 𝐵$ , then $𝐵 ∖ 𝐴 \in 𝐷$ , and if $𝐴_{1}, 𝐴_{2}, \dots \in 𝐷$ are disjoint, then

⋃_{𝑛 = 1}^{\infty} 𝐴_{𝑛} \in 𝐷 .

Every σ-algebra is a Dynkin system, but a Dynkin system need not be closed under finite intersections. A Dynkin system that is also a π-system is a σ-algebra.

The π-λ theorem says that if $𝑃$ is a π-system, then the smallest Dynkin system containing $𝑃$ is exactly $𝜎 (𝑃)$ . This is a proof engine: define

𝐷 = {𝐴 : desired identity holds for 𝐴},

prove $𝐷$ is a Dynkin system, prove it contains a generating π-system, then conclude the identity holds on the whole σ-algebra.

This method proves uniqueness of product measures, independence extensions, equality of laws from CDFs, and many conditioning identities. It is a carrier-extension gate: local verification on generators becomes global verification on all measurable events.

A.3 Monotone class theorem

The monotone class theorem is another extension engine. A monotone class of sets is closed under increasing countable unions and decreasing countable intersections. If a monotone class contains an algebra $𝐴$ , then it contains $𝜎 (𝐴)$ .

The functional version is even more important in probability. If a vector space of bounded functions contains the constants, is closed under bounded monotone pointwise limits, and contains the indicators of a generating π-system, then it contains all bounded functions measurable with respect to the generated σ-algebra. This allows a standard proof pattern:

indicators \to simple functions \to bounded measurable functions \to nonnegative/integrable functions .

For example, to prove an identity such as

𝐸 [𝑓 (𝑋) 𝑔 (𝑌)] = 𝐸 [𝑓 (𝑋)] 𝐸 [𝑔 (𝑌)]

under independence, one first proves it for indicators, extends to simple functions, and then to bounded measurable functions by monotone class. The theorem licenses moving from event-level probability to function-level expectation.

A.4 Carathéodory extension theorem

Carathéodory’s extension theorem takes a premeasure defined on an algebra or semiring $𝐴$ and extends it to a measure on the generated σ-algebra. If $𝜇_{0}$ is countably additive on the primitive class, there exists a measure $𝜇$ on $𝜎 (𝐴)$ extending $𝜇_{0}$. If $𝜇_{0}$ is also σ-finite, the extension is unique.

This theorem is the construction engine behind Lebesgue measure, product measure, probability laws from distribution functions, and stochastic processes from finite-dimensional data. One usually defines mass on simple sets first:

𝜇_{0} ((𝑎, 𝑏]) = 𝐹 (𝑏) - 𝐹 (𝑎),

(𝜇 \otimes 𝜈) (𝐴 \times 𝐵) = 𝜇 (𝐴) 𝜈 (𝐵),

then extends to the full σ-algebra.

The hidden gate is consistency. A premeasure cannot assign contradictory values on overlapping decompositions. In stochastic processes, finite-dimensional distributions must be mutually consistent before Kolmogorov extension can build a process law. Extension theorems do not repair inconsistent local data; they only globalize coherent data.

A.5 Lebesgue–Stieltjes measures

A Lebesgue–Stieltjes measure is generated by a nondecreasing right-continuous function $𝐹 : 𝑅 \to 𝑅$ . For a probability distribution function $𝐹$ , define initially

𝜇_{𝐹} ((𝑎, 𝑏]) = 𝐹 (𝑏) - 𝐹 (𝑎) .

Carathéodory extension then gives a Borel probability measure satisfying

𝜇_{𝐹} ((- \infty, 𝑡]) = 𝐹 (𝑡) .

This construction is why every valid CDF corresponds to a unique probability law. It also unifies discrete, continuous, singular, and mixed distributions. If $𝐹$ has jumps, those jumps are atoms:

𝑃 (𝑋 = 𝑡) = 𝐹 (𝑡) - 𝐹 (𝑡^{-}) .

If $𝐹$ is absolutely continuous, then

𝐹 (𝑡) = \int_{- \infty}^{𝑡} 𝑓 (𝑥) 𝑑 𝑥

for a density $𝑓$ . If $𝐹$ is continuous singular, such as the Cantor distribution, there are no atoms and no Lebesgue density.

Thus the CDF is the universal one-dimensional representation. A density is a special case. A mass function is a special case. The measure is the carrier.
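As a sketch of how one CDF carries both an atom and a density, consider a hypothetical mixed law: mass 1/2 at zero plus half of an Exp(1) distribution. The function names and the small step `eps`, which stands in numerically for the left limit $𝐹 (𝑡^{-})$, are assumptions made for illustration.

```python
import math

def F(t):
    """Mixed CDF (illustrative): atom of mass 1/2 at 0 plus half an Exp(1) law."""
    if t < 0:
        return 0.0
    return 0.5 + 0.5 * (1.0 - math.exp(-t))

def interval_mass(a, b):
    """mu_F((a, b]) = F(b) - F(a) for a <= b."""
    return F(b) - F(a)

def atom(t, eps=1e-9):
    """Approximate F(t) - F(t-) with a small left step eps."""
    return F(t) - F(t - eps)

print(atom(0.0))               # jump of F at 0
print(atom(1.0))               # essentially zero: F is continuous at 1
print(interval_mass(0.0, 1.0))
```

The jump at zero is read off from the CDF alone, and away from zero the same function behaves like an absolutely continuous law.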

A.6 Radon measures

A Radon measure is a measure compatible with topology, usually locally finite and inner regular. On a locally compact Hausdorff space, inner regularity means

𝜇 (𝐴) = \sup {𝜇 (𝐾) : 𝐾 \subseteq 𝐴, 𝐾 compact}

for measurable $𝐴$ , while outer regularity means

𝜇 (𝐴) = \inf {𝜇 (𝑈) : 𝐴 \subseteq 𝑈, 𝑈 open} .

Radon measures are essential because probability on topological spaces requires approximation. Weak convergence, tightness, compactness, support, and regular conditional laws all rely on topological regularity. In Polish spaces, Borel probability measures are Radon, which is one reason Polish and standard Borel spaces dominate modern probability.

Radon regularity is the bridge between measurable events and geometric/topological reasoning. Without it, one can have a measure but lack useful approximation by compact or open sets. Probability would then lose many convergence and compactness tools.

A.7 Radon–Nikodym theorem

If $𝜈 ≪ 𝜇$ , meaning $𝜇 (𝐴) = 0 \Rightarrow 𝜈 (𝐴) = 0$ , and the measures are σ-finite, then there exists a measurable function $𝑓$ such that

𝜈 (𝐴) = \int_{𝐴} 𝑓 𝑑 𝜇

for every measurable $𝐴$ . The function $𝑓$ is the Radon–Nikodym derivative:

𝑓 = \frac{𝑑 𝜈}{𝑑 𝜇} .

This theorem is the source of densities, likelihood ratios, conditional expectation, and change of measure. If $𝑄 ≪ 𝑃$ , then

𝐸_{𝑄} [𝑋] = 𝐸_{𝑃} [𝑋 \frac{𝑑 𝑄}{𝑑 𝑃}] .

In Bayesian statistics, likelihoods are Radon–Nikodym derivatives. In stochastic calculus, Girsanov densities are Radon–Nikodym derivatives between path measures.
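On a finite sample space the Radon–Nikodym derivative is just a ratio of point masses, so the change-of-measure identity can be checked exactly. The three-point space, the laws `P` and `Q`, and the variable `X` below are hypothetical values chosen for illustration.

```python
from fractions import Fraction

# Hypothetical three-point sample space with two laws, Q << P.
P = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
Q = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
X = {"a": 1, "b": 2, "c": 6}

# On a countable space dQ/dP is the ratio of point masses where P > 0.
dQ_dP = {w: Q[w] / P[w] for w in P if P[w] > 0}

def expect(law, f):
    """Expectation of f under a law given by point masses."""
    return sum(law[w] * f(w) for w in law)

lhs = expect(Q, lambda w: X[w])                 # E_Q[X]
rhs = expect(P, lambda w: X[w] * dQ_dP[w])      # E_P[X dQ/dP]
print(lhs, rhs)
```

The derivative itself integrates to one under `P`, which is the statement that `Q` is a probability measure.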

Conditional expectation can also be constructed through Radon–Nikodym. For $𝑋 \in 𝐿^{1}$ and sub-σ-algebra $𝐺$ , define a signed measure on $𝐺$ :

𝜈 (𝐴) = \int_{𝐴} 𝑋 𝑑 𝑃, 𝐴 \in 𝐺 .

Since $𝜈 ≪ 𝑃 ∣_{𝐺}$ , Radon–Nikodym gives

\frac{𝑑 𝜈}{𝑑 𝑃 ∣_{𝐺}} = 𝐸 [𝑋 ∣ 𝐺] .

A.8 Disintegration theorem

Disintegration decomposes a measure into conditional measures along fibers. If $𝛾$ is a probability measure on $𝑋 \times 𝑌$ and $𝜇_{𝑌}$ is its $𝑌$ -marginal, then under standard Borel hypotheses there exists a Markov kernel $𝐾 (𝑦, 𝑑 𝑥)$ such that

𝛾 (𝑑 𝑥, 𝑑 𝑦) = 𝐾 (𝑦, 𝑑 𝑥) 𝜇_{𝑌} (𝑑 𝑦) .

Equivalently,

𝛾 (𝐴 \times 𝐵) = \int_{𝐵} 𝐾 (𝑦, 𝐴) 𝜇_{𝑌} (𝑑 𝑦) .

This is the rigorous version of conditional distribution:

𝐾 (𝑦, 𝐴) = 𝑃 (𝑋 \in 𝐴 ∣ 𝑌 = 𝑦) .

It is essential because $𝑃 (𝑌 = 𝑦)$ may be zero. The elementary formula

𝑃 (𝐴 ∣ 𝐵) = \frac{𝑃 (𝐴 \cap 𝐵)}{𝑃 (𝐵)}

does not handle conditioning on a continuous value. Disintegration does.

Disintegration is also the invariant form of Bayes’ theorem. Given a generative law

𝛾 (𝑑 𝜃, 𝑑 𝑥) = 𝐿 (𝜃, 𝑑 𝑥) 𝜋 (𝑑 𝜃),

the posterior is the reverse disintegration:

𝛾 (𝑑 𝜃, 𝑑 𝑥) = Π (𝑑 𝜃 ∣ 𝑥) 𝜇_{𝑋} (𝑑 𝑥) .

Bayesian updating is therefore conditional-measure transport, not merely symbolic ratio manipulation.
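In the discrete case both disintegrations can be written out and compared term by term. The following sketch assumes a hypothetical two-coin model (the labels, prior weights, and likelihood table are illustrative): it builds the joint law from prior and likelihood, then rebuilds it from the posterior and the marginal of the data.

```python
from fractions import Fraction

prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}       # pi(d theta)
likelihood = {                                                    # L(theta, dx)
    "fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased": {"H": Fraction(9, 10), "T": Fraction(1, 10)},
}
outcomes = ("H", "T")

# Forward disintegration: gamma = L(theta, dx) pi(d theta).
joint = {(th, x): prior[th] * likelihood[th][x] for th in prior for x in outcomes}

# Marginal of the data and the reverse kernel (posterior).
marginal_x = {x: sum(joint[(th, x)] for th in prior) for x in outcomes}
posterior = {x: {th: joint[(th, x)] / marginal_x[x] for th in prior} for x in outcomes}

# Reverse disintegration: gamma = Pi(d theta | x) mu_X(dx).
rebuilt = {(th, x): posterior[x][th] * marginal_x[x] for th in prior for x in outcomes}

print(posterior["H"]["biased"])
print(rebuilt == joint)
```

Both factorizations describe the same joint measure; the posterior is the kernel that makes the second one hold.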

A.9 Product measures

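Given σ-finite measures $𝜇$ on $(𝑆, \mathcal{S})$ and $𝜈$ on $(𝑇, \mathcal{T})$, the product measure $𝜇 \otimes 𝜈$ is the unique measure on the product σ-algebra satisfying

(𝜇 \otimes 𝜈) (𝐴 \times 𝐵) = 𝜇 (𝐴) 𝜈 (𝐵), 𝐴 \in \mathcal{S}, 𝐵 \in \mathcal{T} .

The product σ-algebra is

\mathcal{S} \otimes \mathcal{T} = 𝜎 {𝐴 \times 𝐵 : 𝐴 \in \mathcal{S}, 𝐵 \in \mathcal{T}} .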

Product measure is the carrier of independent joint randomness. If $𝑋 \sim 𝜇$ , $𝑌 \sim 𝜈$ , and $𝑋, 𝑌$ are independent, then

𝐿 (𝑋, 𝑌) = 𝜇 \otimes 𝜈 .

Conversely, if the joint law factors as product, the variables are independent.

Tonelli and Fubini are the integration laws of product measure:

\int 𝑓 𝑑 (𝜇 \otimes 𝜈) = \int \int 𝑓 (𝑥, 𝑦) 𝜈 (𝑑 𝑦) 𝜇 (𝑑 𝑥)

under nonnegativity or integrability hypotheses. Thus product measure supplies not only independence but also iterated expectation and multidimensional integration.
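For finitely supported laws the product measure and the iterated integrals reduce to finite sums, which makes the interchange of order easy to verify. The laws `mu` and `nu` and the integrand `f` are illustrative assumptions. In this finite setting Fubini is automatic; the sketch shows only the bookkeeping.

```python
from fractions import Fraction

mu = {0: Fraction(1, 3), 1: Fraction(2, 3)}                         # law of X
nu = {-1: Fraction(1, 4), 0: Fraction(1, 4), 2: Fraction(1, 2)}     # law of Y

# Product measure on rectangles of singletons: (mu x nu)({x} x {y}) = mu({x}) nu({y}).
product = {(x, y): mu[x] * nu[y] for x in mu for y in nu}

def f(x, y):
    return (x + 1) * y * y

double_sum = sum(f(x, y) * m for (x, y), m in product.items())
x_outer = sum(mu[x] * sum(f(x, y) * nu[y] for y in nu) for x in mu)
y_outer = sum(nu[y] * sum(f(x, y) * mu[x] for x in mu) for y in nu)

print(double_sum, x_outer, y_outer)
```

Because this `f` factors as a function of `x` times a function of `y`, the common value is also the product of the two one-dimensional expectations, which is the independence identity in expectation form.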

A.10 Weak convergence of measures

Weak convergence of probability measures on a metric space $𝑆$, written

𝜇_{𝑛} \Rightarrow 𝜇,

means that

\int 𝑓 𝑑 𝜇_{𝑛} \to \int 𝑓 𝑑 𝜇

for every bounded continuous $𝑓 : 𝑆 \to 𝑅$. This is convergence by continuous bounded probes, not by all measurable sets.

The portmanteau theorem gives equivalent gates. Each of the following conditions is equivalent to $𝜇_{𝑛} \Rightarrow 𝜇$. For every closed $𝐹$,

\limsup_{𝑛 \to \infty} 𝜇_{𝑛} (𝐹) \leq 𝜇 (𝐹);

for every open $𝐺$,

\liminf_{𝑛 \to \infty} 𝜇_{𝑛} (𝐺) \geq 𝜇 (𝐺);

and for every $𝜇$-continuity set $𝐴$, meaning $𝜇 (\partial 𝐴) = 0$,

𝜇_{𝑛} (𝐴) \to 𝜇 (𝐴) .

Weak convergence is the carrier of convergence in distribution. It is weaker than total variation and does not generally control expectations of unbounded functions. Tightness is the compactness condition. On Polish spaces, Prokhorov’s theorem says tightness is equivalent to relative compactness in weak topology. Weak convergence is therefore measure theory plus topology.
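The gap between probing with continuous functions and probing with sets shows up already for point masses. In the sketch below, the sequence of point masses at 1/n converges weakly to the point mass at 0, yet the mass of the closed set {0} does not converge. The probe `f` and the helper names are illustrative choices.

```python
import math

def integrate_against_dirac(f, point):
    """Integral of f against the point mass at `point`."""
    return f(point)

def dirac_mass(point, event):
    """Mass the point mass at `point` gives to the set described by the predicate `event`."""
    return 1.0 if event(point) else 0.0

f = lambda x: math.cos(x) / (1.0 + x * x)    # a bounded continuous probe
is_zero = lambda x: x == 0.0                 # the closed set {0}

ns = [1, 10, 100, 1000, 10000]
probe_gaps = [
    abs(integrate_against_dirac(f, 1.0 / n) - integrate_against_dirac(f, 0.0)) for n in ns
]
set_masses = [dirac_mass(1.0 / n, is_zero) for n in ns]
limit_mass = dirac_mass(0.0, is_zero)

print(probe_gaps)                # shrinks toward 0
print(set_masses, limit_mass)    # all 0.0 along the sequence, 1.0 in the limit
```

This is consistent with the closed-set inequality, since the limsup of the masses is 0 and the limit measure gives {0} mass 1. The set {0} is not a continuity set of the limit because its boundary carries all the mass.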

Appendix B — Functional Analysis Needed for Probability

B.1 Normed spaces

A normed vector space is a vector space $𝑉$ equipped with a function $∥ \cdot ∥$ satisfying positive definiteness, homogeneity, and the triangle inequality:

∥ 𝑥 ∥ \geq 0 with ∥ 𝑥 ∥ = 0 \Leftrightarrow 𝑥 = 0, ∥ 𝜆 𝑥 ∥ = ∣ 𝜆 ∣ ∥ 𝑥 ∥, ∥ 𝑥 + 𝑦 ∥ \leq ∥ 𝑥 ∥ + ∥ 𝑦 ∥ .

In probability, the central examples are $𝐿^{𝑝}$ spaces:

∥ 𝑋 ∥_{𝑝} = (𝐸 ∣ 𝑋 ∣^{𝑝})^{1 / 𝑝}, 𝑝 \geq 1.

Norms measure error and convergence. $𝐿^{1}$ controls mean absolute error, $𝐿^{2}$ controls quadratic error, $𝐿^{\infty}$ controls essential worst-case size, and $𝐿^{𝑝}$ interpolates between typical and tail behavior. Many probabilistic statements are norm statements:

𝑋_{𝑛} \to 𝑋 in 𝐿^{𝑝} \Leftrightarrow ∥ 𝑋_{𝑛} - 𝑋 ∥_{𝑝} \to 0.

The norm is also a stability certificate. If a random approximation is close in $𝐿^{2}$ , it controls variance-scale error. If it is close in $𝐿^{\infty}$ , it controls all outcomes except null sets. The chosen norm determines what kind of probabilistic error is being certified.

B.2 Banach spaces

A Banach space is a complete normed space. Completeness means every Cauchy sequence converges to a point in the space. For $𝑝 \geq 1$ ,

𝐿^{𝑝} (Ω, 𝐹, 𝑃)

is a Banach space after quotienting by almost sure equality.

Completeness is what makes approximation methods legitimate. One defines objects first for simple functions, step processes, bounded functions, or finite-dimensional approximations, then extends by completion. Itô integration is built this way: define the integral for simple predictable processes, prove an isometry, then complete in $𝐿^{2}$ .

Banach-space-valued probability also appears in empirical processes, random series, stochastic PDE, and concentration in function spaces. A random element may take values in a Banach space, and convergence may be norm convergence:

𝐸 ∥ 𝑋_{𝑛} - 𝑋 ∥^{𝑝} \to 0.

This is stronger than finite-dimensional convergence and requires actual control of the full object.

B.3 Hilbert spaces

A Hilbert space is a complete inner-product space. The core probability example is $𝐿^{2}$ , with

⟨ 𝑋, 𝑌 ⟩ = 𝐸 [𝑋 𝑌]

in the real case, and

⟨ 𝑋, 𝑌 ⟩ = 𝐸 [𝑋 \overline{𝑌}]

in the complex case. The norm is

∥ 𝑋 ∥_{2}^{2} = 𝐸 [∣ 𝑋 ∣^{2}] .

Hilbert geometry turns probability into orthogonal projection. Conditional expectation is the projection of $𝑋 \in 𝐿^{2}$ onto the closed subspace $𝐿^{2} (𝐺)$ :

𝐸 [𝑋 ∣ 𝐺] = {Proj}_{𝐿^{2} (𝐺)} 𝑋 .

The error is orthogonal to every $𝐺$-measurable square-integrable random variable $𝑍$:

𝐸 [(𝑋 - 𝐸 [𝑋 ∣ 𝐺]) 𝑍] = 0.

Gaussian processes are also Hilbert-space objects. Covariance kernels define inner products, and Gaussian Hilbert spaces encode linear Gaussian structure. Orthogonality and independence coincide for jointly Gaussian centered variables, making $𝐿^{2}$ geometry unusually powerful in Gaussian probability.
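The projection description of conditional expectation given earlier in this section can be checked by hand when $𝐺$ is generated by a finite partition. In that case the conditional expectation is the block average. The sample space, partition, and variable `X` below are illustrative assumptions.

```python
from fractions import Fraction

omega = [0, 1, 2, 3, 4, 5]
prob = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}
partition = [[0, 1], [2, 3, 4], [5]]     # atoms generating the sub-sigma-algebra G

def conditional_expectation(X, partition, prob):
    """E[X | G] is constant on each atom: the probability-weighted block average."""
    out = {}
    for block in partition:
        block_mass = sum(prob[w] for w in block)
        block_mean = sum(prob[w] * X[w] for w in block) / block_mass
        for w in block:
            out[w] = block_mean
    return out

def inner(U, V):
    """L^2 inner product E[U V] on the finite space."""
    return sum(prob[w] * U[w] * V[w] for w in omega)

cond = conditional_expectation(X, partition, prob)
residual = {w: X[w] - cond[w] for w in omega}

# Indicators of the atoms span L^2(G), so orthogonality to them is enough.
indicators = [{w: int(w in block) for w in omega} for block in partition]
orthogonality = [inner(residual, ind) for ind in indicators]

print(cond)
print(orthogonality)
```

Orthogonality of the residual also gives the Pythagorean split of the second moment into an explained part and a residual part.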

B.4 Duality

The dual space $𝑉^{*}$ consists of continuous linear functionals on $𝑉$ . For $1 < 𝑝 < \infty$ , the dual of $𝐿^{𝑝}$ is $𝐿^{𝑞}$ , where

\frac{1}{𝑝} + \frac{1}{𝑞} = 1,

via the pairing

⟨ 𝑋, 𝑌 ⟩ = 𝐸 [𝑋 𝑌] .

Hölder’s inequality,

∣ 𝐸 [𝑋 𝑌] ∣ \leq ∥ 𝑋 ∥_{𝑝} ∥ 𝑌 ∥_{𝑞},

is the boundedness certificate for this pairing.

Duality is central to probability because laws are often identified by test functions:

𝜇 \mapsto \int 𝑓 𝑑 𝜇 .

Weak convergence, total variation, Wasserstein duality, optimal transport, risk measures, hypothesis testing, and convex concentration all use dual formulations.

Duality also exposes what a convergence mode can see. If convergence is defined by a small class of test functionals, it may miss tail behavior, oscillation, or singular structure. The dual class is the sensor array. Too small a class gives weak information; too large a class may require stronger convergence.

B.5 Weak topology

The weak topology on a normed space is the coarsest topology making every continuous linear functional continuous. A sequence $𝑥_{𝑛}$ converges weakly to $𝑥$ if

ℓ (𝑥_{𝑛}) \to ℓ (𝑥)

for every $ℓ \in 𝑉^{*}$ . Weak convergence is generally weaker than norm convergence.

In probability laws, weak convergence means convergence against bounded continuous functions:

𝜇_{𝑛} \Rightarrow 𝜇 \Leftrightarrow \int 𝑓 𝑑 𝜇_{𝑛} \to \int 𝑓 𝑑 𝜇 .

This is not norm convergence of measures and not convergence on all measurable sets. It is topology-sensitive law convergence.

Weak topology is valuable because it supplies compactness. In a reflexive Banach space, every bounded sequence has a weakly convergent subsequence. Probability analogues include tightness and Prokhorov compactness. The price is weaker conclusions: weak convergence does not automatically control norms, tails, or expectations of unbounded functions.

B.6 Compactness criteria

Compactness criteria convert boundedness or regularity into subsequential convergence. In finite dimensions, closed bounded sets are compact. In infinite dimensions, this is false, so one needs additional structure.

In probability measures on Polish spaces, tightness is the main compactness criterion:

\forall 𝜀 > 0, \exists 𝐾 compact : \sup_{𝑛} 𝜇_{𝑛} (𝐾^{𝑐}) < 𝜀 .

By Prokhorov’s theorem, tightness gives relatively compact families in weak topology.

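In function spaces, Arzelà–Ascoli gives compactness from uniform boundedness and equicontinuity. In stochastic-process convergence, one proves tightness of a sequence $𝑋^{(𝑛)}$ by controlling oscillations uniformly in $𝑛$ (together with a bound on size, such as tightness of the initial values): for every $𝜀 > 0$,

\lim_{𝛿 ↓ 0} \limsup_{𝑛} 𝑃 (𝑤_{𝑋^{(𝑛)}} (𝛿) > 𝜀) = 0,

where $𝑤_{𝑋}$ is a modulus of continuity or Skorokhod oscillation functional.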

The standard limit-proof architecture is:

compactness/tightness \to subsequential limit \to identify every limit \to full convergence .

B.7 Separability

A topological space is separable if it has a countable dense subset. Separability is a countability gate. It lets one replace uncountable operations by countable approximations in controlled settings.

For stochastic processes, separability is often what makes suprema measurable. If $𝑋_{𝑡}$ has continuous paths on $[0, 𝑇]$ , then

\sup_{𝑡 \in [0, 𝑇]} 𝑋_{𝑡} = \sup_{𝑡 \in 𝑄 \cap [0, 𝑇]} 𝑋_{𝑡} .

The right side is a countable supremum of measurable variables, hence measurable. Without path regularity or separability, the uncountable supremum may not be measurable.

Standard Borel and Polish spaces owe much of their usefulness to separability. Regular conditional probabilities, disintegration, measurable selection, and weak convergence theory all behave better in separable carriers. Nonseparable spaces often generate version and measurability pathologies.

B.8 Riesz representation

Riesz representation theorems identify linear functionals with measures or inner products. For locally compact Hausdorff spaces, positive linear functionals on $𝐶_{𝑐} (𝑋)$ correspond to Radon measures:

𝐿 (𝑓) = \int 𝑓 𝑑 𝜇 .

For Hilbert spaces, every continuous linear functional has the form

ℓ (𝑥) = ⟨ 𝑥, 𝑦 ⟩

for a unique $𝑦$ .

In probability, Riesz representation explains why a law can be identified by its integrals against test functions. If

\int 𝑓 𝑑 𝜇 = \int 𝑓 𝑑 𝜈

for a sufficiently rich class of $𝑓$ , then $𝜇 = 𝜈$ . This underlies weak convergence, distribution identification, and dual formulations of distances between measures.

It also links Markov semigroups and kernels. A positive linear operator $𝑃$ acting on test functions can often be represented by a transition kernel:

𝑃 𝑓 (𝑥) = \int 𝑓 (𝑦) 𝑃 (𝑥, 𝑑 𝑦) .

Thus operator action and probabilistic transition are dual descriptions.

B.9 Operators and semigroups

A semigroup of operators $(𝑃_{𝑡})_{𝑡 \geq 0}$ satisfies

𝑃_{0} = 𝐼, 𝑃_{𝑡 + 𝑠} = 𝑃_{𝑡} 𝑃_{𝑠} .

For a Markov process,

𝑃_{𝑡} 𝑓 (𝑥) = 𝐸_{𝑥} [𝑓 (𝑋_{𝑡})] .

The semigroup is the expectation-evolution operator.

The generator is the infinitesimal operator

𝐴 𝑓 = \lim_{𝑡 ↓ 0} \frac{𝑃_{𝑡} 𝑓 - 𝑓}{𝑡},

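defined on a domain of functions for which the limit exists. For a diffusion with drift $𝑏$ and diffusion matrix $𝑎$,

𝐴 𝑓 = 𝑏 \cdot \nabla 𝑓 + \frac{1}{2} Tr (𝑎 \nabla^{2} 𝑓) .

The generator connects stochastic processes to PDE. For $𝑓$ in the domain, $𝑢 (𝑡, \cdot) = 𝑃_{𝑡} 𝑓$ solves the backward equation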

\partial_{𝑡} 𝑢 = 𝐴 𝑢 .

Semigroup theory is the analytic carrier of Markov probability. Transition kernels, invariant measures, spectral gaps, ergodicity, diffusion equations, and martingale problems can all be expressed through operators. The domain of the generator is not optional; it determines which functions the infinitesimal formula legally acts on.
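A two-state continuous-time chain is small enough that the semigroup can be written in closed form and both defining properties checked numerically. The jump rates, the test function `f`, and the step `h` below are hypothetical values. The closed form relies on the identity Q² = −(a+b)Q for a two-state rate matrix.

```python
import math

A_RATE, B_RATE = 2.0, 1.0     # hypothetical jump rates 0 -> 1 and 1 -> 0
Q = [[-A_RATE, A_RATE], [B_RATE, -B_RATE]]

def transition(t):
    """P_t = I + (1 - exp(-(a+b) t)) / (a+b) * Q, valid because Q^2 = -(a+b) Q."""
    lam = A_RATE + B_RATE
    c = (1.0 - math.exp(-lam * t)) / lam
    return [[(1.0 if i == j else 0.0) + c * Q[i][j] for j in range(2)] for i in range(2)]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def act(M, f):
    """(M f)(x) = sum_y M[x][y] f(y): the operator acting on a test function."""
    return [sum(M[x][y] * f[y] for y in range(2)) for x in range(2)]

f = [3.0, -1.0]
t, s, h = 0.7, 1.3, 1e-6

# Semigroup property: P_{t+s} = P_t P_s.
semigroup_gap = max(
    abs(transition(t + s)[i][j] - matmul(transition(t), transition(s))[i][j])
    for i in range(2) for j in range(2)
)

# Generator as a difference quotient: (P_h f - f) / h should be close to Q f.
difference_quotient = [(pf - fx) / h for pf, fx in zip(act(transition(h), f), f)]
generator_on_f = act(Q, f)

print(semigroup_gap)
print(difference_quotient, generator_on_f)
```

On a finite state space every function lies in the domain of the generator, so the domain question raised above does not arise here. It becomes essential for diffusions.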

Appendix C — Fourier and Transform Methods

C.1 Characteristic functions

The characteristic function of a real random variable $𝑋$ is

𝜙_{𝑋} (𝑡) = 𝐸 [𝑒^{𝑖 𝑡 𝑋}], 𝑡 \in 𝑅 .

It always exists because

∣ 𝑒^{𝑖 𝑡 𝑋} ∣ = 1.

For a random vector $𝑋 \in 𝑅^{𝑑}$ ,

𝜙_{𝑋} (𝑡) = 𝐸 [𝑒^{𝑖 ⟨ 𝑡, 𝑋 ⟩}], 𝑡 \in 𝑅^{𝑑} .

Characteristic functions determine laws:

𝜙_{𝑋} = 𝜙_{𝑌} \Rightarrow 𝑋 \overset{d}{=} 𝑌 .

They convert independent sums into products. If $𝑋, 𝑌$ are independent,

𝜙_{𝑋 + 𝑌} (𝑡) = 𝜙_{𝑋} (𝑡) 𝜙_{𝑌} (𝑡) .

For iid sums $𝑆_{𝑛} = 𝑋_{1} + \dots + 𝑋_{𝑛}$ ,

𝜙_{𝑆_{𝑛}} (𝑡) = 𝜙_{𝑋} (𝑡)^{𝑛} .

Near zero, characteristic functions encode moments. If $𝐸 [𝑋] = 𝜇$ and $𝐸 [𝑋^{2}] < \infty$ , then

𝜙_{𝑋} (𝑡) = 1 + 𝑖 𝜇 𝑡 - \frac{1}{2} 𝐸 [𝑋^{2}] 𝑡^{2} + 𝑜 (𝑡^{2}) .

For centered variance- $𝜎^{2}$ variables,

𝜙_{𝑋} (𝑡) = 1 - \frac{𝜎^{2} 𝑡^{2}}{2} + 𝑜 (𝑡^{2}) .

This local expansion is the analytic engine behind the CLT.
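The product rule for independent sums can be checked directly for finitely supported laws by comparing the characteristic function of the convolved law with the product of the two factors. The laws `pmf_x` and `pmf_y` and the evaluation points are illustrative assumptions.

```python
import cmath

def char_fn(pmf, t):
    """phi(t) = E[exp(i t X)] for a finitely supported law given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """Law of X + Y when X and Y are independent."""
    out = {}
    for x, p in pmf_x.items():
        for y, q in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out

pmf_x = {0: 0.3, 1: 0.7}
pmf_y = {-1: 0.2, 0: 0.5, 2: 0.3}
pmf_sum = convolve(pmf_x, pmf_y)

ts = [-2.0, -0.5, 0.0, 0.3, 1.7]
product_gap = max(
    abs(char_fn(pmf_sum, t) - char_fn(pmf_x, t) * char_fn(pmf_y, t)) for t in ts
)
print(product_gap)    # floating-point rounding only
```

The same helper confirms the two universal facts used above: the value at zero is 1 and the modulus never exceeds 1.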

C.2 Moment generating functions

The moment generating function is

𝑀_{𝑋} (𝑡) = 𝐸 [𝑒^{𝑡 𝑋}],

where finite. Unlike the characteristic function, it may not exist for all $𝑡$ , or even for any nonzero $𝑡$ . Heavy-tailed variables often have infinite $𝑀_{𝑋} (𝑡)$ for $𝑡 > 0$ .

When $𝑀_{𝑋}$ exists near zero, it encodes moments:

𝑀_{𝑋}^{(𝑘)} (0) = 𝐸 [𝑋^{𝑘}] .

For independent variables,

𝑀_{𝑋 + 𝑌} (𝑡) = 𝑀_{𝑋} (𝑡) 𝑀_{𝑌} (𝑡) .

The log-moment generating function

Λ_{𝑋} (𝑡) = \log 𝑀_{𝑋} (𝑡)

is central in Chernoff bounds and large deviations. Chernoff gives

𝑃 (𝑋 \geq 𝑎) \leq 𝑒^{- 𝑡 𝑎} 𝑀_{𝑋} (𝑡)

for $𝑡 > 0$ , hence

𝑃 (𝑋 \geq 𝑎) \leq \exp (- \sup_{𝑡 > 0} {𝑡 𝑎 - Λ_{𝑋} (𝑡)}) .

The Legendre transform of $Λ$ is the large-deviation rate function in Cramér-type theory.
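For a binomial sum the optimized Chernoff bound has a closed form and can be compared with the exact tail. The parameters `n`, `p`, `a` are illustrative, and the maximizer `t_star` comes from setting the derivative of $𝑡 𝑎 - Λ (𝑡)$ to zero, which assumes $𝑛 𝑝 < 𝑎 < 𝑛$.

```python
import math

def log_mgf(t, n, p):
    """Lambda(t) = n * log(1 - p + p e^t) for S ~ Binomial(n, p)."""
    return n * math.log(1.0 - p + p * math.exp(t))

def chernoff_bound(a, n, p):
    """exp(-sup_{t>0} {t a - Lambda(t)}) for n p < a < n, using the closed-form maximizer."""
    q = a / n
    t_star = math.log(q * (1.0 - p) / (p * (1.0 - q)))
    return math.exp(-(t_star * a - log_mgf(t_star, n, p)))

def exact_upper_tail(a, n, p):
    """P(S >= a) by direct summation of the binomial mass function."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(a, n + 1))

n, p, a = 100, 0.5, 60
bound = chernoff_bound(a, n, p)
exact = exact_upper_tail(a, n, p)
print(exact, bound)
```

The bound captures the exponential rate but not the polynomial prefactor, so it sits above the exact tail by a visible factor.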

C.3 Laplace transforms

For a nonnegative random variable $𝑋$ , the Laplace transform is

𝐿_{𝑋} (𝜆) = 𝐸 [𝑒^{- 𝜆 𝑋}], 𝜆 \geq 0.

It always exists for $𝜆 \geq 0$ because $0 \leq 𝑒^{- 𝜆 𝑋} \leq 1$. For probability measures on $[0, \infty)$, the Laplace transform determines the law.

Laplace transforms are especially effective for waiting times, subordinators, branching processes, renewal theory, hitting times, and nonnegative random variables. If $𝑋, 𝑌 \geq 0$ are independent, then

𝐿_{𝑋 + 𝑌} (𝜆) = 𝐿_{𝑋} (𝜆) 𝐿_{𝑌} (𝜆) .

For a nonnegative variable,

𝐸 [𝑋] = - 𝐿_{𝑋}^{'} (0 +)

when the derivative exists and the expectation is finite. More generally, derivatives at zero encode moments with alternating signs:

𝐿_{𝑋}^{(𝑘)} (0 +) = (- 1)^{𝑘} 𝐸 [𝑋^{𝑘}] .

Laplace methods are one-sided transform methods suited to positive carriers.

C.4 Fourier inversion

Fourier inversion reconstructs a distribution from its characteristic function. If $𝑋$ has an integrable characteristic function, then $𝑋$ has a bounded continuous density

𝑓_{𝑋} (𝑥) = \frac{1}{2 𝜋} \int_{𝑅} 𝑒^{- 𝑖 𝑡 𝑥} 𝜙_{𝑋} (𝑡) 𝑑 𝑡 .

For lattice integer-valued variables,

𝑃 (𝑋 = 𝑘) = \frac{1}{2 𝜋} \int_{- 𝜋}^{𝜋} 𝑒^{- 𝑖 𝑡 𝑘} 𝜙_{𝑋} (𝑡) 𝑑 𝑡 .

Fourier inversion is the native engine for local limit theorems. Weak convergence may use pointwise convergence of characteristic functions, but local probabilities require integrating the characteristic function and controlling its behavior away from zero.

The standard local-limit proof splits the integral:

\int = \int_{∣ 𝑡 ∣ \leq 𝛿} + \int_{∣ 𝑡 ∣ > 𝛿} .

Near zero, one uses Taylor expansion:

𝜙 (𝑡 / \sqrt{𝑛})^{𝑛} \to 𝑒^{- 𝜎^{2} 𝑡^{2} / 2} .

Away from zero, one needs decay or aperiodicity, typically in the form

∣ 𝜙 (𝑡) ∣ < 1 for 𝑡 \neq 0 in the relevant frequency range .

This second region is where lattice and smoothness obstructions live.
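The lattice inversion formula can be tried numerically on a binomial law, whose characteristic function is known in closed form. The sketch replaces the integral over $[- 𝜋, 𝜋]$ by an equally spaced Riemann sum; the grid size and parameters are illustrative assumptions. For a finitely supported integer law the sum is exact up to rounding once the grid has more points than the width of the support.

```python
import cmath
import math

def binomial_cf(t, n, p):
    """phi(t) = (1 - p + p e^{it})^n for X ~ Binomial(n, p)."""
    return (1.0 - p + p * cmath.exp(1j * t)) ** n

def invert_lattice(cf, k, grid_size):
    """Riemann-sum version of (1 / 2 pi) * integral over [-pi, pi) of e^{-itk} cf(t) dt."""
    total = 0.0 + 0.0j
    for j in range(grid_size):
        t = -math.pi + 2.0 * math.pi * j / grid_size
        total += cmath.exp(-1j * t * k) * cf(t)
    return (total / grid_size).real

n, p = 12, 0.3
recovered = [
    invert_lattice(lambda t: binomial_cf(t, n, p), k, grid_size=64) for k in range(n + 1)
]
exact = [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]
max_error = max(abs(r - e) for r, e in zip(recovered, exact))
print(max_error)
```

Local limit theorems need more than this reconstruction: they require uniform control of the integrand as $𝑛$ grows, which is where the split into near-zero and away-from-zero regions enters.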

C.5 Lévy continuity theorem

Lévy’s continuity theorem states that if characteristic functions $𝜙_{𝑛} (𝑡)$ converge pointwise to a function $𝜙 (𝑡)$ that is continuous at $0$ , then $𝜙$ is the characteristic function of a probability law $𝜇$ , and

𝜇_{𝑛} \Rightarrow 𝜇 .

Conversely, if $𝜇_{𝑛} \Rightarrow 𝜇$ , then

𝜙_{𝑛} (𝑡) \to 𝜙 (𝑡)

for every $𝑡$ .

The continuity-at-zero condition prevents loss of mass. A pointwise limit of characteristic functions need not correspond to a probability law if mass escapes or the limit is discontinuous at zero. Thus Lévy’s theorem is a law-convergence gate, not merely an analytic convenience.

The CLT proof uses it directly. For centered iid variance- $𝜎^{2}$ variables,

𝜙_{𝑆_{𝑛} / (𝜎 \sqrt{𝑛})} (𝑡) = {[𝜙_{𝑋} (\frac{𝑡}{𝜎 \sqrt{𝑛}})]}^{𝑛} \to 𝑒^{- 𝑡^{2} / 2} .

Since $𝑒^{- 𝑡^{2} / 2}$ is continuous at zero and is the standard normal characteristic function, convergence in distribution follows.
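The convergence of the normalized characteristic functions can be watched numerically for fair ±1 steps, whose characteristic function is $\cos 𝑡$ and whose variance is 1. The evaluation points and sample sizes are illustrative choices.

```python
import math

def rademacher_cf(t):
    """Characteristic function of a fair +/-1 variable: E[e^{itX}] = cos(t)."""
    return math.cos(t)

def normalized_sum_cf(t, n):
    """Characteristic function of S_n / sqrt(n) for n iid fair +/-1 steps (sigma = 1)."""
    return rademacher_cf(t / math.sqrt(n)) ** n

def gaussian_cf(t):
    """Characteristic function of the standard normal law."""
    return math.exp(-t * t / 2.0)

ts = [0.5, 1.0, 2.0]
ns = [10, 100, 1000, 10000]
errors = [max(abs(normalized_sum_cf(t, n) - gaussian_cf(t)) for t in ts) for n in ns]
print(errors)    # shrinks as n grows
```

Pointwise convergence at each fixed $𝑡$ is all that Lévy’s theorem asks for. The check says nothing about rates in distribution, which is the job of smoothing inequalities.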

C.6 Smoothing inequalities

Smoothing inequalities bound distances between distribution functions using characteristic functions. A typical Berry–Esseen-style smoothing inequality has the form

\sup_{𝑥} ∣ 𝐹 (𝑥) - 𝐺 (𝑥) ∣ \leq 𝐶 \int_{- 𝑇}^{𝑇} ∣ \frac{𝜙_{𝐹} (𝑡) - 𝜙_{𝐺} (𝑡)}{𝑡} ∣ 𝑑 𝑡 + \frac{𝐶^{'}}{𝑇},

where $𝐺$ has a bounded derivative and $𝐶^{'}$ is proportional to $\sup_{𝑥} ∣ 𝐺^{'} (𝑥) ∣$.

The first term measures Fourier discrepancy on a bounded frequency range. The second term is smoothing error from truncating the integral. Choosing $𝑇$ balances the two: a larger $𝑇$ shrinks the truncation term but requires control of the transform difference over a wider range of frequencies.

Smoothing inequalities are the bridge from characteristic-function convergence to quantitative distributional error. Lévy’s theorem gives convergence but not rate. Smoothing inequalities supply rate, making them central in Berry–Esseen estimates, local limit bounds, and approximation theory.

C.7 Saddle-point estimates

Saddle-point methods evaluate probabilities or coefficients through complex or exponential integral representations. If a probability can be written as

𝑃 (𝑆_{𝑛} = 𝑘) = \frac{1}{2 𝜋 𝑖} \int 𝑒^{𝑛 Ψ (𝑧)} 𝑎 (𝑧) 𝑑 𝑧,

the dominant contribution often comes from a point $𝑧_{0}$ satisfying

Ψ^{'} (𝑧_{0}) = 0.

Expanding around $𝑧_{0}$ ,

Ψ (𝑧) \approx Ψ (𝑧_{0}) + \frac{1}{2} Ψ^{''} (𝑧_{0}) (𝑧 - 𝑧_{0})^{2},

gives Gaussian-type prefactors multiplying the main exponential term.

In probability, saddle points appear in precise large deviations, local probabilities, random combinatorial structures, branching processes, occupancy models, and statistical mechanics. They refine exponential estimates by extracting the correct polynomial prefactor and local shape.

The method is powerful but gate-heavy. One needs analyticity, contour deformation legitimacy, nondegenerate saddle, control away from the saddle, and error bounds. Without those, saddle-point notation is only heuristic.

C.8 Transform methods in limit theory

Transform methods convert probabilistic operations into algebraic or analytic operations. Independent sums become products of transforms:

𝜙_{𝑋 + 𝑌} = 𝜙_{𝑋} 𝜙_{𝑌} .

Scaling becomes argument rescaling:

𝜙_{𝑎 𝑋} (𝑡) = 𝜙_{𝑋} (𝑎 𝑡) .

Centering becomes multiplication by a phase:

𝜙_{𝑋 - 𝜇} (𝑡) = 𝑒^{- 𝑖 𝑡 𝜇} 𝜙_{𝑋} (𝑡) .

Different transforms fit different carriers. Characteristic functions are universal for real laws. Moment generating functions are powerful when exponential moments exist. Laplace transforms fit nonnegative variables. Probability generating functions

𝐺_{𝑋} (𝑠) = 𝐸 [𝑠^{𝑋}]

fit nonnegative integer-valued variables. Cauchy/Stieltjes transforms fit spectral measures and random matrix theory:

𝐺_{𝜇} (𝑧) = \int \frac{1}{𝑧 - 𝑥} 𝜇 (𝑑 𝑥) .

Limit theory often follows the same pattern:

normalize \to transform \to pointwise/analytic convergence \to inversion or continuity theorem \to law-level terminal .

The counterkernel is using a transform outside its domain: MGF without exponential moments, density inversion without integrability, moment matching without determinacy, or pointwise transform convergence without continuity/tightness.


Probability Theory — Appendices

Appendix A — Measure Theory Needed for Probability