Probability Theory — Chapters 11–20

 


Chapter 11 — Lᵖ Spaces in Probability

11.1 L⁰, L¹, L², Lᵖ, L∞

Given a probability space $(\Omega, \mathcal{F}, P)$, the space $L^0$ consists of measurable random variables modulo almost sure equality. It has no natural norm in general, but it carries convergence in probability. For $0 < p < \infty$, the space $L^p$ consists of random variables satisfying

$$E|X|^p < \infty.$$

For $p \ge 1$, define

$$\|X\|_p = (E|X|^p)^{1/p}.$$

For $p = \infty$,

$$\|X\|_\infty = \operatorname{ess\,sup} |X|.$$

The word “essential” means null sets are ignored. Thus $L^p$ is not a space of literal functions but of equivalence classes under $X = Y$ almost surely.

On a probability space, higher $L^p$ integrability implies lower $L^q$ integrability:

$$q < p \implies L^p \subset L^q, \qquad \|X\|_q \le \|X\|_p.$$

This inclusion uses $P(\Omega) = 1$. On infinite-measure spaces, the inclusion can fail. Probability spaces therefore have a special integrability hierarchy: stronger moments control weaker moments.
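
As a small numerical illustration of the norm ordering above, the sketch below evaluates $(E|X|^p)^{1/p}$ on a finite probability space for several $p$. The values, the probabilities, and the helper name `lp_norm` are hypothetical choices made only for this example.

```python
from fractions import Fraction

def lp_norm(values, probs, p):
    """Return (E|X|^p)^(1/p) for a random variable on a finite probability space."""
    moment = sum(float(pr) * abs(v) ** p for v, pr in zip(values, probs))
    return moment ** (1.0 / p)

# Hypothetical example: X takes four values with the listed probabilities.
values = [0.5, -1.0, 3.0, 10.0]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 5), Fraction(1, 20)]
assert sum(probs) == 1  # total mass one is what the inequality relies on

# Norms for increasing p, and the essential supremum (largest |value| with positive mass).
norms = [lp_norm(values, probs, p) for p in (1, 2, 4, 8)]
sup_norm = max(abs(v) for v, pr in zip(values, probs) if pr > 0)

for p, norm in zip((1, 2, 4, 8), norms):
    print(f"p={p}: norm={norm:.4f}")
print(f"p=inf: norm={sup_norm:.4f}")
```

The printed norms increase with $p$ and stay below the essential supremum. If the weights are rescaled so that the total mass exceeds one, the ordering is no longer guaranteed.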

11.2 Norms and seminorms modulo null sets

The expression

$$\|X\|_p = (E|X|^p)^{1/p}$$

is a true norm only after identifying random variables that agree almost surely. Before quotienting, it is merely a seminorm, since $\|X\|_p = 0$ implies $X = 0$ almost surely, not necessarily pointwise. The quotient operation is not cosmetic; it is forced by measure theory.

The $L^p$ carrier therefore has this form:

$$L^p(\Omega, \mathcal{F}, P) = \{X : E|X|^p < \infty\} \,/\, \{X = Y \text{ a.s.}\}.$$

Any theorem stated in $L^p$ is automatically modulo null sets. A conditional expectation, an $L^2$ projection, or a martingale representative is not uniquely defined pointwise unless a version is selected. The null-set quotient is the hidden grammar of modern probability.

11.3 Completeness

For $p \ge 1$, $L^p$ is a Banach space: every Cauchy sequence in $\|\cdot\|_p$ converges in $L^p$ to an element of $L^p$. Completeness is what allows limiting constructions. If $X_n$ is Cauchy in $L^p$, then there exists $X \in L^p$ such that

$$\|X_n - X\|_p \to 0.$$

The proof uses subsequences and almost sure convergence: from an $L^p$-Cauchy sequence one extracts a rapidly Cauchy subsequence, controls the sum of increments, obtains almost sure convergence, and then returns to $L^p$. This pattern is common in probability: norm control produces subsequential pathwise control, then integration reconstructs the full convergence.

11.4 Hilbert-space structure of L²

The space $L^2$ is a Hilbert space with inner product

$$\langle X, Y \rangle = E[XY]$$

in the real case, or $E[X\overline{Y}]$ in the complex case. The norm satisfies

$$\|X\|_2^2 = E[|X|^2].$$

Orthogonality means $E[XY] = 0$. The centered variables $X - EX$ and $Y - EY$ are orthogonal exactly when the covariance of $X$ and $Y$ is zero.

This Hilbert structure powers conditional expectation, martingale theory, Gaussian processes, orthogonal decompositions, and least-squares prediction. In $L^2$, probabilistic estimation becomes geometry: the best approximation is an orthogonal projection onto a closed subspace.

11.5 Orthogonality and projection

If $H \subset L^2$ is a closed linear subspace and $X \in L^2$, there is a unique $Y \in H$ minimizing

$$\|X - Y\|_2.$$

The error $X - Y$ is orthogonal to every element of $H$:

$$E[(X - Y)Z] = 0, \qquad Z \in H.$$

This is the projection theorem.

In probability, $H$ often consists of all square-integrable random variables measurable with respect to a sub-σ-algebra $\mathcal{G}$. The projection of $X$ onto $H$ is $E[X \mid \mathcal{G}]$. Thus conditional expectation is not merely an average; in $L^2$, it is the best mean-square prediction of $X$ using information $\mathcal{G}$.

11.6 Conditional expectation as projection

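For $X \in L^1$, conditional expectation $E[X \mid \mathcal{G}]$ is the $\mathcal{G}$-measurable random variable satisfying

$$\int_G E[X \mid \mathcal{G}] \, dP = \int_G X \, dP$$

for every $G \in \mathcal{G}$. When $X \in L^2$, this object is the orthogonal projection of $X$ onto $L^2(\mathcal{G})$.

This projection interpretation explains the tower property:

$$E[E[X \mid \mathcal{G}] \mid \mathcal{H}] = E[X \mid \mathcal{H}]$$

when $\mathcal{H} \subset \mathcal{G}$. Projecting first onto a larger information space and then onto a smaller one gives the same result as projecting directly onto the smaller one.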
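
On a finite sample space, where a σ-algebra is generated by a partition, both the orthogonality of the residual and the tower property can be checked by direct arithmetic. The following is a minimal sketch under that assumption; the six-point space, the partitions `fine` and `coarse`, and the helper names are all hypothetical.

```python
from fractions import Fraction

def cond_exp(x, probs, partition):
    """E[X | sigma(partition)] on a finite sample space, as a list indexed by outcome."""
    z = [None] * len(x)
    for cell in partition:
        mass = sum(probs[w] for w in cell)
        avg = sum(probs[w] * x[w] for w in cell) / mass
        for w in cell:
            z[w] = avg  # constant on each information cell
    return z

def expect(x, probs):
    """Plain expectation E[X]."""
    return sum(p * v for p, v in zip(probs, x))

# Hypothetical six-point sample space with uniform probabilities.
probs = [Fraction(1, 6)] * 6
x = [Fraction(v) for v in (1, 4, 2, 8, 5, 7)]

fine = [[0, 1], [2, 3], [4, 5]]   # generates the larger sigma-algebra G
coarse = [[0, 1, 2, 3], [4, 5]]   # generates H, a sub-sigma-algebra of G

e_g = cond_exp(x, probs, fine)
e_h = cond_exp(x, probs, coarse)
tower = cond_exp(e_g, probs, coarse)  # E[E[X|G] | H]

# Inner product of the residual X - E[X|G] with a G-measurable Z
# (Z is constant on each cell of the fine partition).
z = [Fraction(v) for v in (3, 3, -1, -1, 2, 2)]
residual_inner = expect([(a - b) * c for a, b, c in zip(x, e_g, z)], probs)

print(tower == e_h, residual_inner)
```

Exact rational arithmetic is used so that the equalities hold literally rather than up to rounding.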

11.7 Uniform integrability in L¹

A family $\mathcal{X} \subset L^1$ is uniformly integrable if

$$\lim_{K \to \infty} \sup_{X \in \mathcal{X}} E\big[|X| \, 1_{\{|X| > K\}}\big] = 0.$$

This condition prevents expectation from hiding in rare but large values. Boundedness in $L^p$ for some $p > 1$ implies uniform integrability by Hölder or Markov estimates.

Uniform integrability is the correct compactness substitute in $L^1$. It is also the missing bridge between convergence in probability and convergence of expectations. If $X_n \to X$ in probability and $\{X_n\}$ is uniformly integrable, then $X \in L^1$ and

$$E|X_n - X| \to 0.$$

Without uniform integrability, expectations can fail to converge even when random variables converge in probability.

11.8 Lᵖ convergence versus probability convergence

Convergence in $L^p$ means

$$E|X_n - X|^p \to 0.$$

For $p > 0$, $L^p$ convergence implies convergence in probability, because Markov’s inequality gives

$$P(|X_n - X| > \varepsilon) \le \frac{E|X_n - X|^p}{\varepsilon^p}.$$

The converse is false without additional integrability control.

Different $L^p$ modes have different strengths. On a probability space, if $p > q > 0$, then $L^p$ convergence implies $L^q$ convergence:

$$\|X_n - X\|_q \le \|X_n - X\|_p.$$

But convergence in probability alone is only a typical-value statement; it does not control tails. That is why $L^p$ spaces encode moment-sensitive convergence, not just distributional behavior.

11.9 Hypercontractivity preview

Hypercontractivity refers to operators that improve integrability: an operator $T$ may map $L^p$ into $L^q$ with $q > p$, satisfying

$$\|Tf\|_q \le C \|f\|_p.$$

In probability, this phenomenon appears for noise operators, Gaussian semigroups, Boolean functions, Markov semigroups, and log-Sobolev inequalities.

The conceptual value is that smoothing or randomization can upgrade weak moment control into stronger moment control. Hypercontractivity becomes a certificate engine for concentration, tail bounds, invariance principles, and analysis of high-dimensional random structures. It converts functional inequalities into probabilistic regularity.

11.10 Functional-analytic probability carrier

The functional-analytic view treats random variables as elements of normed, Banach, or Hilbert spaces. Expectation becomes a linear functional, conditional expectation becomes projection, martingales become adapted sequences in $L^p$, and convergence theorems become compactness or boundedness principles.

This carrier is powerful because it exposes the geometry behind probability. $L^1$ controls mass, $L^2$ controls energy and orthogonality, $L^\infty$ controls uniform bounds, and $L^p$ norms interpolate between tail and moment behavior. The main firewall is that functional-analytic equality is usually equality modulo null sets; pointwise claims require additional version control.


Chapter 12 — Independence

12.1 Independence of events

Events $A$ and $B$ are independent if

$$P(A \cap B) = P(A)P(B).$$

Equivalently, if $P(B) > 0$,

$$P(A \mid B) = P(A).$$

The event $B$ gives no probabilistic information about $A$. Independence is not disjointness; disjoint positive-probability events are maximally incompatible, not independent.

For a family $\{A_i\}_{i \in I}$, mutual independence means every finite subfamily factors:

$$P\Big(\bigcap_{j=1}^{k} A_{i_j}\Big) = \prod_{j=1}^{k} P(A_{i_j}).$$

The finite-subfamily condition is crucial even for infinite families. Independence is always certified by finite joint factorization, not by vague unrelatedness.

12.2 Independence of σ-algebras

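Sub-σ-algebras $\mathcal{G}$ and $\mathcal{H}$ are independent if

$$P(G \cap H) = P(G)P(H)$$

for all $G \in \mathcal{G}$, $H \in \mathcal{H}$. For a family $\{\mathcal{G}_i\}$, mutual independence means every finite selection of events $G_i \in \mathcal{G}_i$ factors.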

This is the cleanest information-theoretic form of independence. A σ-algebra represents information. Independence of σ-algebras means information in one σ-algebra does not change probabilities of events in the other. Random-variable independence is defined through the σ-algebras generated by those variables.

12.3 Independence of random variables

Random variables $X_i : \Omega \to S_i$ are independent if the σ-algebras $\sigma(X_i)$ are independent. Equivalently, for measurable sets $B_i \subset S_i$,

$$P(X_1 \in B_1, \dots, X_n \in B_n) = \prod_{i=1}^{n} P(X_i \in B_i).$$

For real-valued variables, it is enough to check sets $B_i$ of the form $(-\infty, t_i]$.

In law language, independence means the joint law factors:

$$\mu_{(X_1, \dots, X_n)} = \mu_{X_1} \otimes \cdots \otimes \mu_{X_n}.$$

This is the most compact certificate. Marginal laws alone do not imply independence; the joint law must be the product of the marginals.

12.4 Factorization of joint laws

If $X$ and $Y$ have joint law $\gamma$ on $S \times T$, then they are independent iff

$$\gamma = \mu_X \otimes \mu_Y.$$

For densities, this becomes

$$f_{X,Y}(x, y) = f_X(x) f_Y(y)$$

almost everywhere. For discrete variables, it becomes

$$P(X = x, Y = y) = P(X = x)P(Y = y)$$

for all values $x, y$.

The phrase “almost everywhere” matters. Density factorizations are measure-level claims. Changing a density on a null set changes the pointwise formula but not the law. The true carrier is the measure factorization, not a chosen version of the density.

12.5 Product measures

Given probability measures $\mu$ on $S$ and $\nu$ on $T$, their product measure $\mu \otimes \nu$ is characterized by

$$(\mu \otimes \nu)(A \times B) = \mu(A)\nu(B).$$

It extends uniquely under standard measure-theoretic hypotheses from rectangles to the product σ-algebra.

Product measure constructs independent joint randomness from marginal laws. If $X(s, t) = s$ and $Y(s, t) = t$ on $S \times T$ with law $\mu \otimes \nu$, then $X \sim \mu$, $Y \sim \nu$, and $X, Y$ are independent. Dependence is exactly deviation from this product carrier.

12.6 Infinite independent families

An infinite family $\{X_i\}_{i \in I}$ is independent if every finite subfamily is independent. For countable products, one constructs the law on $\prod_i S_i$ using finite-dimensional cylinder probabilities:

$$P(X_{i_1} \in A_1, \dots, X_{i_k} \in A_k) = \prod_{j=1}^{k} \mu_{i_j}(A_j).$$

Infinite independence is the foundation of coin-flip sequences, iid samples, random walks, and product processes. It produces tail events, zero-one laws, and limit theorems. The finite-dimensional definition is not a weakness; countable probability itself is built by extension from finite-dimensional consistency.

12.7 Pairwise versus mutual independence

Pairwise independence requires every pair to factor. Mutual independence requires every finite collection to factor. Pairwise independence does not control higher-order interactions. For example, let $X, Y$ be independent fair bits and set $Z = X \oplus Y$ (addition modulo 2). Then $X, Y, Z$ are pairwise independent, but not mutually independent because $Z$ is determined by $X, Y$.

This distinction is load-bearing in variance computations, random constructions, hashing, and pseudorandomness. Pairwise independence may be sufficient for second-moment estimates, but not for Chernoff bounds, product distributions, or full limit theorems. The independence strength must match the theorem.
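
The parity example can be verified by enumerating the four equally likely outcomes. The sketch below is only an illustration of that check; the helper names `prob` and `pairwise_independent` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Joint law of (X, Y, Z) with X, Y independent fair bits and Z = X XOR Y.
joint = {}
for x, y in product((0, 1), repeat=2):
    joint[(x, y, x ^ y)] = Fraction(1, 4)

def prob(event):
    """Probability of the set of outcomes (x, y, z) satisfying `event`."""
    return sum(p for outcome, p in joint.items() if event(outcome))

def pairwise_independent(i, j):
    """Check P(coord_i = a, coord_j = b) = P(coord_i = a) P(coord_j = b) for all bits a, b."""
    return all(
        prob(lambda o: o[i] == a and o[j] == b)
        == prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b)
        for a, b in product((0, 1), repeat=2)
    )

pairs_ok = all(pairwise_independent(i, j) for i, j in ((0, 1), (0, 2), (1, 2)))

# Mutual independence would need P(X=1, Y=1, Z=1) = 1/8,
# but Z = 0 whenever X = Y = 1, so the triple probability is 0.
triple = prob(lambda o: o == (1, 1, 1))
product_of_marginals = Fraction(1, 8)

print(pairs_ok, triple, product_of_marginals)
```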

12.8 Conditional independence

Events or random variables $X$ and $Y$ are conditionally independent given $\mathcal{G}$ if

$$P(X \in A, Y \in B \mid \mathcal{G}) = P(X \in A \mid \mathcal{G}) \, P(Y \in B \mid \mathcal{G})$$

almost surely, for all measurable $A, B$. Conditional independence says that after the information $\mathcal{G}$ is known, no residual dependence remains.

Conditional independence is not the same as unconditional independence. Two variables can be dependent marginally but conditionally independent given a latent variable; this is common in Bayesian networks and mixture models. Conversely, conditioning can create dependence through selection effects. Conditional independence is a σ-algebra-relative factorization claim.

12.9 Exchangeability

A sequence $X_1, X_2, \dots$ is exchangeable if its finite-dimensional laws are invariant under finite permutations:

$$(X_1, \dots, X_n) \overset{d}{=} (X_{\pi(1)}, \dots, X_{\pi(n)})$$

for every $n$ and every permutation $\pi$ of $\{1, \dots, n\}$. Iid sequences are exchangeable, but exchangeability is weaker.

Exchangeability encodes symmetry without independence. It says order carries no information, but dependence may remain. A typical example is drawing from a random parameter: conditional on Θ, the variables are iid, but marginally they are dependent. This leads to de Finetti-type representation.

12.10 De Finetti’s theorem preview

De Finetti’s theorem states, in one classical form, that an infinite exchangeable sequence of Bernoulli random variables is conditionally iid given a random parameter $\Theta \in [0, 1]$. That is,

$$P(X_1 = x_1, \dots, X_n = x_n) = \int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i} \, \nu(d\theta)$$

for some mixing measure $\nu$.

The theorem converts exchangeability into mixture-of-iid structure. This is a major carrier transformation: symmetry under permutations becomes conditional independence over a latent law. It is central to Bayesian statistics, representation theory of random sequences, and probabilistic symmetry principles.
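
To make the mixture formula concrete, the sketch below takes the mixing measure to be uniform on $[0,1]$ (an assumption chosen only for illustration), in which case the integral is a Beta integral with closed form $k!\,(n-k)!/(n+1)!$, where $k = \sum x_i$. The function name `seq_prob` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial

def seq_prob(xs):
    """P(X_1=x_1,...,X_n=x_n) when Theta is uniform on [0,1].

    Uses the Beta integral: int_0^1 t^k (1-t)^(n-k) dt = k! (n-k)! / (n+1)!.
    """
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

n = 4
sequences = list(product((0, 1), repeat=n))

# The probabilities of all binary sequences of length n sum to one.
total = sum(seq_prob(xs) for xs in sequences)

# Exchangeability: every reordering of a sequence has the same probability.
exchangeable = all(
    seq_prob(xs) == seq_prob(tuple(xs[i] for i in perm))
    for xs in sequences
    for perm in permutations(range(n))
)

# Marginal dependence: P(X_1=1, X_2=1) = 1/3, while P(X_1=1) P(X_2=1) = 1/4.
p11 = seq_prob((1, 1))
p1 = seq_prob((1,))

print(total, exchangeable, p11, p1 * p1)
```

The sequence is exchangeable but not independent: observing a success raises the conditional probability of the next success, because it is evidence about the latent parameter.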

12.11 Independence as carrier certificate

Independence is a certificate about the joint carrier. It asserts that the joint law is product, or that relevant σ-algebras factor. Without this certificate, products of probabilities, multiplication of MGFs, convolution formulas, Chernoff bounds, and many limit theorems are not licensed.

The carrier view prevents a common error: variables with unrelated names are not automatically independent. Independence must come from construction, assumption, product measure, randomized mechanism, or proven factorization. In advanced probability, most work is spent either exploiting independence or replacing it with weaker dependence-control structures.

12.12 False independence and coupling errors

False independence arises when marginal information is mistaken for joint information. Knowing $X \sim \mu$ and $Y \sim \nu$ does not determine $P(X \le Y)$, $E[XY]$, or $P(X = Y)$. Those require a coupling. If the coupling is product, the variables are independent; if not, dependence must be analyzed.

Another error is assuming pairwise independence gives mutual independence, or assuming conditional independence survives marginalization. A third is using independent copies without constructing them on a product extension. The general rule is:

joint expression requires joint carrier.

No joint carrier, no joint probability.
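
A short sketch of the rule: three joint laws on $\{0,1\}^2$ with identical fair-bit marginals give three different values of $P(X = Y)$ and $E[XY]$. The coupling names and helper functions are hypothetical labels for this example.

```python
from fractions import Fraction

half = Fraction(1, 2)
quarter = Fraction(1, 4)

# Three joint laws of (X, Y) on {0,1}^2, each with fair-bit marginals.
couplings = {
    "product": {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter},
    "identical": {(0, 0): half, (1, 1): half},
    "opposite": {(0, 1): half, (1, 0): half},
}

def marginals(joint):
    """Return the marginal mass functions of X and Y."""
    mx = {v: sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)}
    my = {v: sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)}
    return mx, my

def summary(joint):
    """Return (P(X = Y), E[XY]) under the given joint law."""
    p_equal = sum(p for (x, y), p in joint.items() if x == y)
    e_xy = sum(p * x * y for (x, y), p in joint.items())
    return p_equal, e_xy

results = {name: summary(j) for name, j in couplings.items()}
for name, (p_equal, e_xy) in results.items():
    print(name, p_equal, e_xy)
```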


Chapter 13 — Product Measures and Fubini Theory

13.1 Product σ-algebras

For measurable spaces $(S, \mathcal{S})$ and $(T, \mathcal{T})$, the product σ-algebra is

$$\mathcal{S} \otimes \mathcal{T} = \sigma(\{A \times B : A \in \mathcal{S},\ B \in \mathcal{T}\}).$$

It is the event space generated by measurable rectangles. Coordinate projections are measurable by construction.

Product σ-algebras are the natural carrier for joint random variables. If $X : \Omega \to S$ and $Y : \Omega \to T$, then $(X, Y) : \Omega \to S \times T$ is measurable with respect to $\mathcal{S} \otimes \mathcal{T}$. Joint laws live on this product event structure.

13.2 Product measures

Given σ-finite measures $\mu$ and $\nu$, the product measure $\mu \otimes \nu$ is the unique measure on $\mathcal{S} \otimes \mathcal{T}$ satisfying

$$(\mu \otimes \nu)(A \times B) = \mu(A)\nu(B).$$

For probability measures, it defines the law of independent coordinates.

Product measure is not just a convenience. It is the mathematical object behind independent sampling. If one says “sample $X \sim \mu$ and independently sample $Y \sim \nu$,” the joint law is $\mu \otimes \nu$. Any other joint law with the same marginals is a coupling but not independent.

13.3 Tonelli’s theorem

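Tonelli’s theorem states that if $\mu$ and $\nu$ are σ-finite and $f : S \times T \to [0, \infty]$ is measurable, then

$$\int_{S \times T} f \, d(\mu \otimes \nu) = \int_S \Big( \int_T f(s, t) \, \nu(dt) \Big) \mu(ds) = \int_T \Big( \int_S f(s, t) \, \mu(ds) \Big) \nu(dt).$$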

No integrability assumption is needed beyond nonnegativity; the value may be infinite.

Tonelli is the theorem of safe rearrangement for nonnegative quantities. It justifies summing or integrating in either order when no cancellation is possible. Many expectation identities for counts, occupation times, and nonnegative random fields are Tonelli arguments.

13.4 Fubini’s theorem

Fubini’s theorem extends Tonelli to signed or complex functions under integrability:

$$\int |f| \, d(\mu \otimes \nu) < \infty.$$

Then iterated integrals exist almost everywhere, are integrable, and

$$\int f \, d(\mu \otimes \nu) = \int \!\! \int f(s, t) \, \nu(dt) \, \mu(ds) = \int \!\! \int f(s, t) \, \mu(ds) \, \nu(dt).$$

The integrability hypothesis is load-bearing. Without absolute integrability, changing order can change the value or produce undefined expressions. In probability, Fubini licenses identities such as $E[E[X \mid Y]] = E[X]$ and many conditioning calculations, but only when the relevant integrability conditions hold.
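
A standard counterexample on $\mathbb{N} \times \mathbb{N}$ with counting measure shows what goes wrong without absolute integrability. The sketch below uses the array with $+1$ on the diagonal and $-1$ immediately to its right; the cutoff `N` is a hypothetical choice, and the inner sums are exact because every row and column has at most two nonzero entries.

```python
def a(m, n):
    """Signed array on N x N: +1 on the diagonal, -1 just right of it, 0 elsewhere."""
    if n == m:
        return 1
    if n == m + 1:
        return -1
    return 0

N = 50  # number of rows / columns inspected (a hypothetical cutoff)

# Each row and each column has at most two nonzero entries, so the inner
# sums below are exact once the inner range covers indices 0..N.
row_sums = [sum(a(m, n) for n in range(N + 2)) for m in range(N)]
col_sums = [sum(a(m, n) for m in range(N + 2)) for n in range(N)]

rows_then_cols = sum(row_sums)  # sum over m of (sum over n): every row sums to 0
cols_then_rows = sum(col_sums)  # sum over n of (sum over m): only column 0 contributes

# The absolute mass grows linearly with the cutoff, so the array is not integrable.
abs_mass = sum(abs(a(m, n)) for m in range(N) for n in range(N + 2))

print(rows_then_cols, cols_then_rows, abs_mass)
```

Every row sums to zero while column 0 sums to one, so the two iterated sums are 0 and 1 regardless of the cutoff. The absolute mass is unbounded as the cutoff grows, which is exactly the hypothesis Fubini needs and this array violates.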

13.5 Iterated expectation

If $X$ is integrable on a product probability space $(S \times T, \mu \otimes \nu)$, then

$$E[X] = \int_S E[X(s, \cdot)] \, \mu(ds) = \int_T E[X(\cdot, t)] \, \nu(dt).$$

This is expectation computed one coordinate at a time.

Iterated expectation generalizes to conditional expectation:

$$E[X] = E[E[X \mid \mathcal{G}]].$$

The inner expectation averages over unresolved randomness; the outer expectation averages over the conditioning information. This is the measure-theoretic form of “average the conditional averages.”

13.6 Independent product construction

To construct independent random variables with given laws $\mu_i$, take the product space

$$\Omega = \prod_i S_i$$

with product measure $\bigotimes_i \mu_i$, and define coordinate maps

$$X_i(\omega) = \omega_i.$$

Then the $X_i$ are independent and $X_i \sim \mu_i$.

This construction proves that independent copies exist under standard measurable-space hypotheses. It also clarifies that an “independent copy of 𝑋” is not created inside the original sample space automatically. One may need to extend the probability space to carry the copy.

13.7 Infinite products

Infinite product measures are built from finite-dimensional cylinder probabilities. A cylinder set constrains finitely many coordinates:

$$\{\omega : \omega_{i_1} \in A_1, \dots, \omega_{i_k} \in A_k\}.$$

Its probability under product measure is

$$\prod_{j=1}^{k} \mu_{i_j}(A_j).$$

The infinite product σ-algebra is generated by these finite-coordinate events. It contains many asymptotic events obtained by countable operations, such as convergence of averages and infinitely many successes. Infinite product construction is the backbone of iid sequences and independent stochastic inputs.

13.8 Random sequences

A random sequence is a measurable map

$$X : \Omega \to S^{\mathbb{N}},$$

or equivalently a sequence of coordinate random variables $X_n : \Omega \to S$. Its law is a probability measure on sequence space. If the coordinates are independent with common law $\mu$, the law is $\mu^{\otimes \mathbb{N}}$.

Sequence-space thinking is cleaner than treating each coordinate separately. Tail events, empirical measures, stopping times, and path properties are events on $S^{\mathbb{N}}$. Limit theorems then become statements about the measure of subsets of sequence space.

13.9 Coordinate random variables

On a product space $\Omega = \prod_i S_i$, the coordinate random variable is

$$X_i(\omega) = \omega_i.$$

The product σ-algebra is the smallest σ-algebra making all coordinate maps measurable. Finite-dimensional distributions are pushforwards of the product law under finite coordinate projections.

Coordinate variables are canonical, but not every sequence of random variables originally appears on a product space. The product representation is a model realization. What matters is whether the joint law of the given sequence matches the coordinate law on the canonical product carrier.

13.10 Kolmogorov consistency

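A family of finite-dimensional distributions $\{\mu_I\}$ is consistent if marginalizing from a larger finite set $J$ to a smaller finite set $I \subset J$ gives $\mu_I$. Formally,

$$(\pi_{J \to I})_{\#} \, \mu_J = \mu_I,$$

where $\pi_{J \to I}$ is the coordinate projection from the $J$-coordinates to the $I$-coordinates and $\#$ denotes pushforward.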

Consistency is necessary because any genuine process must have compatible finite projections.

Kolmogorov’s extension theorem says that, under suitable state-space hypotheses, consistency is sufficient to produce a probability measure on the infinite product. The theorem turns local finite-dimensional specifications into a global stochastic process carrier.

13.11 Product-space pathologies

Product spaces can behave badly when measurability, completion, or topology is mishandled. The product of completed σ-algebras need not equal the completion of the product σ-algebra. Sections of measurable sets may be measurable under good hypotheses, but projections of measurable sets may require analytic-set machinery.

Another pathology is assuming pointwise path regularity from finite-dimensional distributions. A process law on $\mathbb{R}^{[0, \infty)}$ may exist without having continuous or càdlàg paths. Regularity requires separate certificates such as continuity criteria, separability, or modification theorems. Product construction gives existence of coordinates; it does not automatically give nice sample paths.


Chapter 14 — Conditional Probability and Conditioning

14.1 Conditioning on positive-probability events

If $B \in \mathcal{F}$ with $P(B) > 0$, define

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

This creates a new probability law on $\Omega$, concentrated on $B$, or equivalently a normalized restriction of $P$ to $B$.

The condition $P(B) > 0$ is essential. If $P(B) = 0$, the ratio is undefined. Many paradoxes in continuous probability arise from informal conditioning on null events. Conditioning on $X = x$ for a continuous variable requires regular conditional distributions or limiting procedures, not the elementary ratio.

14.2 Conditioning on partitions

If $\{B_i\}$ is a countable partition of $\Omega$ with $P(B_i) > 0$, conditioning on the partition means replacing a random quantity by its average on the cell containing the outcome. For an event $A$,

$$P(A \mid B_i) = \frac{P(A \cap B_i)}{P(B_i)}.$$

The law of total probability is

$$P(A) = \sum_i P(A \mid B_i) P(B_i).$$

For an integrable $X$, the conditional expectation given the partition is

$$E[X \mid \sigma(\{B_i\})] = \sum_i \frac{E[X 1_{B_i}]}{P(B_i)} \, 1_{B_i}.$$

Thus conditioning on a finite or countable partition is averaging over information cells.

14.3 Conditioning on σ-algebras

Conditioning on a σ-algebra $\mathcal{G} \subset \mathcal{F}$ means conditioning on available information. The conditional expectation $E[X \mid \mathcal{G}]$ is $\mathcal{G}$-measurable and satisfies

$$\int_G E[X \mid \mathcal{G}] \, dP = \int_G X \, dP$$

for all $G \in \mathcal{G}$.

The σ-algebra formulation subsumes conditioning on events, partitions, random variables, and filtrations. Conditioning on a random variable $Y$ means conditioning on $\sigma(Y)$. The result is a $\sigma(Y)$-measurable object, often representable as a function $g(Y)$ under standard conditions.

14.4 Conditional expectation

For $X \in L^1$, conditional expectation is the unique almost sure equivalence class $Z = E[X \mid \mathcal{G}]$ such that $Z$ is $\mathcal{G}$-measurable and

$$E[Z 1_G] = E[X 1_G]$$

for every $G \in \mathcal{G}$. It preserves all integrals over events visible to $\mathcal{G}$.

Key properties include linearity, positivity, monotonicity, Jensen’s inequality in conditional form, and the tower property:

$$E[E[X \mid \mathcal{G}] \mid \mathcal{H}] = E[X \mid \mathcal{H}] \quad \text{if } \mathcal{H} \subset \mathcal{G}.$$

Conditional expectation is the projection of $X$ onto the information carrier $\mathcal{G}$.

14.5 Conditional distributions

A conditional distribution of $X$ given $Y$ is a family of probability measures

$$K(y, A) = P(X \in A \mid Y = y)$$

such that $y \mapsto K(y, A)$ is measurable and

$$P(X \in A, Y \in B) = \int_B K(y, A) \, \mu_Y(dy).$$

This is a Markov kernel from the state space of $Y$ to the state space of $X$.

Conditional distributions refine conditional expectation. If $g(X)$ is integrable,

$$E[g(X) \mid Y] = \int g(x) \, K(Y, dx).$$

Existence of regular conditional distributions requires suitable measurable-space hypotheses, usually standard Borel spaces.

14.6 Regular conditional probability

A regular conditional probability is a conditional distribution that behaves as an actual probability measure in the conditioned variable and as a measurable function in the conditioning value. For each $y$, $K(y, \cdot)$ is a probability measure; for each event $A$, $K(\cdot, A)$ is measurable.

Regular conditional probabilities justify expressions such as $P(X \in A \mid Y = y)$, even when $P(Y = y) = 0$. But they are versions, defined only up to $\mu_Y$-null sets in $y$. Treating a chosen version as canonical at every point can produce errors, especially on null conditioning values.

14.7 Disintegration

Disintegration decomposes a joint measure into conditional measures over a marginal:

$$\gamma(dx, dy) = K(y, dx) \, \mu_Y(dy).$$

It says a joint law can be represented by first drawing $Y \sim \mu_Y$, then drawing $X$ according to $K(Y, \cdot)$, under appropriate space regularity.

This is the measure-theoretic form of conditional modeling. It generalizes density factorization

$$f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y) \, f_Y(y).$$

The density formula is only a special representation; disintegration is the invariant carrier.

14.8 Bayes theorem in measure form

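Bayes’ theorem becomes a change of disintegration. Suppose a prior $\pi(d\theta)$ and likelihood kernel $L(\theta, dx)$ define a joint law

$$\gamma(d\theta, dx) = L(\theta, dx) \, \pi(d\theta).$$

The posterior is a conditional distribution of $\theta$ given $x$:

$$\pi(d\theta \mid x) = \frac{L(\theta, dx) \, \pi(d\theta)}{\int L(\theta', dx) \, \pi(d\theta')}$$

in density notation.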

The invariant statement is not the formula with densities; it is the existence and identification of the reverse conditional kernel. Bayes’ theorem is therefore disintegration reversal: it transports from generative factorization to inferential factorization.
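
In the discrete case the reversal can be carried out by hand. The sketch below assumes a hypothetical two-point prior on a coin bias and a Bernoulli likelihood; it builds the joint law from the generative factorization, computes the reverse kernel, and checks that the inferential factorization rebuilds the same joint law.

```python
from fractions import Fraction

# Hypothetical prior over a coin bias theta (two equally likely values).
prior = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}

def likelihood(theta, x):
    """Bernoulli likelihood L(theta, {x}) for an observation x in {0, 1}."""
    return theta if x == 1 else 1 - theta

# Generative factorization: gamma(theta, x) = L(theta, x) * prior(theta).
joint = {(th, x): likelihood(th, x) * w for th, w in prior.items() for x in (0, 1)}

# Marginal law of the observation, then the reverse kernel (posterior).
evidence = {x: sum(joint[(th, x)] for th in prior) for x in (0, 1)}
posterior = {x: {th: joint[(th, x)] / evidence[x] for th in prior} for x in (0, 1)}

# Inferential factorization: posterior(theta | x) * evidence(x) rebuilds the joint law.
rebuilt = {(th, x): posterior[x][th] * evidence[x] for th in prior for x in (0, 1)}

print(posterior[1])
print(rebuilt == joint)
```

The division by `evidence[x]` is the elementary ratio, which is legitimate here only because every observation has positive probability. In the continuous case the same role is played by the reverse kernel.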

14.9 Versions of conditional expectation

Conditional expectation is unique only almost surely. If $Z$ and $Z'$ both satisfy the defining property, then

$$P(Z = Z') = 1.$$

They may differ on a null set. Thus $E[X \mid \mathcal{G}]$ is an equivalence class unless a version is selected.

Version issues become serious when evaluating conditional expectations at specific points, especially points of probability zero. A statement true almost surely in the conditioning variable may fail at an exceptional value. Any pointwise use of conditional objects requires a version certificate or regularity theorem.

14.10 Conditioning on null events

The elementary formula $P(A \mid B) = P(A \cap B)/P(B)$ is undefined when $P(B) = 0$. Continuous conditioning, such as $P(X \in A \mid Y = y)$, requires regular conditional probability, limiting conditioning, or geometric disintegration.

Different limiting procedures can give different answers when conditioning on null events. This is the source of Borel-type paradoxes. The conditioning event must be replaced by a specified σ-algebra, kernel, limiting scheme, or coordinate-invariant disintegration. Null conditioning is not illegal, but it is not handled by the finite event-ratio formula.

14.11 Conditional independence

Random variables $X$ and $Y$ are conditionally independent given $\mathcal{G}$ if for bounded measurable $f, g$,

$$E[f(X) g(Y) \mid \mathcal{G}] = E[f(X) \mid \mathcal{G}] \, E[g(Y) \mid \mathcal{G}].$$

This is the conditional factorization of joint information.

Conditional independence is the language of graphical models, Markov properties, hidden-variable models, and filtering. It is stronger and more precise than saying dependence is “explained by” $\mathcal{G}$. Once $\mathcal{G}$ is known, the remaining randomness in $X$ and $Y$ factors.

14.12 Filtrations and information

A filtration is an increasing family of σ-algebras:

$$\mathcal{F}_s \subset \mathcal{F}_t, \qquad s \le t.$$

It represents information accumulated over time. A process $X_t$ is adapted if $X_t$ is $\mathcal{F}_t$-measurable for every $t$.

Filtrations are the carrier for martingales, stopping times, stochastic integration, and dynamic conditioning. The phrase “known at time $t$” means measurable with respect to $\mathcal{F}_t$. Without filtration, temporal probability statements are under-specified.

14.13 Updating as transport between σ-algebras

Updating is movement from a coarser information σ-algebra to a finer one. If $\mathcal{G} \subset \mathcal{H}$, then $E[X \mid \mathcal{G}]$ is the best estimate with less information, while $E[X \mid \mathcal{H}]$ is the updated estimate with more information. The tower property guarantees coherence:

$$E[E[X \mid \mathcal{H}] \mid \mathcal{G}] = E[X \mid \mathcal{G}].$$

This is the formal version of rational updating. New information refines the event grammar. Probabilities change not because the underlying law is incoherent, but because the conditioning σ-algebra has changed. The update is a projection onto a different information carrier.


Chapter 15 — Probability Convergence Grammar

15.1 Almost sure convergence

A sequence $X_n$ converges almost surely to $X$ if

$$P(\{\omega : X_n(\omega) \to X(\omega)\}) = 1.$$

This is pointwise convergence outside a null set. It requires all $X_n$ and $X$ to live on a common probability space.

Almost sure convergence is strong because it controls entire sample paths eventually for almost every outcome. It is the natural mode for strong laws, martingale convergence, and pathwise stochastic analysis. But it is still not pointwise convergence everywhere; null exceptional sets remain.

15.2 Convergence in probability

$X_n \to X$ in probability if for every $\varepsilon > 0$,

$$P(|X_n - X| > \varepsilon) \to 0.$$

This says large deviations from $X$ become unlikely. It also requires a common probability space because the expression $X_n - X$ must be meaningful.

Almost sure convergence implies convergence in probability, but not conversely. Convergence in probability is often the correct mode for weak laws of large numbers, estimator consistency, and randomized approximation. It controls typical behavior at each large 𝑛, but not necessarily eventual behavior along almost every sample path.

15.3 Lᵖ convergence

$X_n \to X$ in $L^p$ if

$$E|X_n - X|^p \to 0.$$

For $p > 0$, this implies convergence in probability. For $p \ge 1$, it is norm convergence in the Banach space $L^p$.

$L^p$ convergence controls both probability of deviation and magnitude of deviation. $L^1$ convergence controls expected absolute error; $L^2$ convergence controls mean-square error; higher $p$ controls stronger tail behavior. It is more quantitative than convergence in probability but less pathwise than almost sure convergence.

15.4 Convergence in distribution

$X_n$ converges in distribution to $X$, written

$$X_n \Rightarrow X,$$

if the laws $\mu_{X_n}$ converge weakly to $\mu_X$. For real variables, this is equivalent to

$$F_{X_n}(t) \to F_X(t)$$

at every continuity point $t$ of $F_X$.

Convergence in distribution is law-level. The variables do not need to live on a common probability space. It is the natural mode for central limit theorems and many asymptotic approximations. But it does not by itself imply convergence in probability or almost surely. The arrow is weaker because it forgets coupling.

15.5 Total variation convergence

Probability measures $\mu_n$ converge to $\mu$ in total variation if

$$\|\mu_n - \mu\|_{\mathrm{TV}} = \sup_A |\mu_n(A) - \mu(A)| \to 0.$$

Equivalently, when densities exist with respect to a common dominating measure,

$$\|\mu_n - \mu\|_{\mathrm{TV}} = \frac{1}{2} \int |f_n - f| \, d\lambda.$$

Total variation is stronger than weak convergence. It controls probabilities of all measurable events uniformly. It is central in Markov chain mixing, coupling, statistical distance, and approximation theory. A coupling characterization says total variation is the minimal mismatch probability over all couplings $(X, Y)$ with $X \sim \mu$ and $Y \sim \nu$:

$$\|\mu - \nu\|_{\mathrm{TV}} = \inf P(X \ne Y).$$
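
For discrete laws, the three descriptions of total variation (supremum over events, half the $L^1$ distance, minimal mismatch probability) can be compared directly. The sketch below uses two hypothetical laws on a three-point space and builds one standard maximal coupling; it assumes the two laws are different, since the construction divides by the total variation distance.

```python
from fractions import Fraction

# Two hypothetical laws on the same three-point space.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
nu = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}

# Half the L1 distance between the mass functions.
tv_half_l1 = sum(abs(mu[s] - nu[s]) for s in mu) / 2

# The supremum over events is attained at A = {s : mu(s) > nu(s)}.
best_event = [s for s in mu if mu[s] > nu[s]]
tv_sup = sum(mu[s] - nu[s] for s in best_event)

# Maximal coupling: put mass min(mu, nu) on the diagonal, then spread the
# leftover mass of mu and nu as a normalized product off the diagonal.
# (Assumes mu != nu, so that tv_half_l1 > 0.)
overlap = {s: min(mu[s], nu[s]) for s in mu}
coupling = {(s, s): overlap[s] for s in mu}
for s in mu:
    for t in nu:
        extra = (mu[s] - overlap[s]) * (nu[t] - overlap[t]) / tv_half_l1
        if extra:
            coupling[(s, t)] = coupling.get((s, t), 0) + extra

# Probability that the two coordinates of the coupling disagree.
mismatch = sum(p for (s, t), p in coupling.items() if s != t)

print(tv_half_l1, tv_sup, mismatch)
```

The coupling keeps the two coordinates equal on as much mass as the laws share, which is why its mismatch probability equals the total variation distance.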

15.6 Weak convergence

Weak convergence of probability measures on a metric space means

$$\int f \, d\mu_n \to \int f \, d\mu$$

for every bounded continuous $f$. This is convergence tested by continuous bounded probes, not by all measurable events.

Weak convergence is topology-sensitive. It sees the geometry of the state space. It is weaker than total variation and does not generally imply convergence of expectations for unbounded functions. To pass expectations of unbounded functions, one needs uniform integrability, moment bounds, or stronger Wasserstein-type convergence.

15.7 Vague convergence

Vague convergence is used mainly for locally compact spaces and measures that may not be probability measures. Measures $\mu_n$ converge vaguely to $\mu$ if

$$\int f \, d\mu_n \to \int f \, d\mu$$

for every continuous compactly supported $f$.

Vague convergence is weaker than weak convergence because compactly supported tests may not see mass escaping to infinity. It is appropriate for point processes, extreme-value theory, and infinite measures. For probability measures, vague convergence plus tightness can often recover weak convergence.

15.8 Relations between convergence modes

The basic implication chain is:

$$L^p \text{ convergence} \implies \text{convergence in probability} \implies \text{convergence in distribution}.$$

Almost sure convergence also implies convergence in probability. If $X_n \Rightarrow c$ where $c$ is constant, then $X_n \to c$ in probability.

No reverse implication holds generally without extra hypotheses. Distributional convergence is law-level and can occur without a shared sample space. Probability convergence requires coupling. Almost sure convergence requires pathwise eventual control. $L^p$ convergence requires moment control. Each arrow has a different carrier.

15.9 Counterexamples separating convergence modes

Let $X_n$ be independent Bernoulli variables with $P(X_n = 1) = 1/n$. Then $X_n \to 0$ in probability, since $P(|X_n| > \varepsilon) = 1/n \to 0$ for $0 < \varepsilon < 1$. But $X_n$ does not converge almost surely to zero: the events $\{X_n = 1\}$ are independent and $\sum_n 1/n = \infty$, so by the second Borel–Cantelli lemma $X_n = 1$ infinitely often almost surely.

Let $X_n = n$ with probability $1/n$, else $0$. Then $X_n \to 0$ in probability, but $E[X_n] = 1$, so $X_n \not\to 0$ in $L^1$. Let $X_n \sim N(0, 1)$ independently of $X \sim N(0, 1)$; then $X_n \Rightarrow X$, but $X_n - X \sim N(0, 2)$ for every $n$, so $X_n$ does not converge to $X$ in probability. These examples prove the convergence modes are not interchangeable.
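
The second counterexample can be checked with exact arithmetic, since each $X_n$ takes only two values. The sketch below is an illustration only; the helper names `law`, `tail_prob`, and `mean_abs` are hypothetical.

```python
from fractions import Fraction

def law(n):
    """Law of X_n: value n with probability 1/n, value 0 otherwise."""
    return {n: Fraction(1, n), 0: 1 - Fraction(1, n)}

def tail_prob(n, eps):
    """P(|X_n| > eps), computed from the two-point law."""
    return sum(p for v, p in law(n).items() if abs(v) > eps)

def mean_abs(n):
    """E|X_n - 0|, the L1 distance from the candidate limit 0."""
    return sum(p * abs(v) for v, p in law(n).items())

ns = [2, 10, 100, 1000]
tails = [tail_prob(n, Fraction(1, 2)) for n in ns]  # shrinks like 1/n
means = [mean_abs(n) for n in ns]                   # stays equal to 1

for n, t, m in zip(ns, tails, means):
    print(n, t, m)
```

The tail probability vanishes while the expected absolute value stays at one: the mass that escapes is rare but large, which is the failure of uniform integrability.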

15.10 Skorokhod representation

The Skorokhod representation theorem states that, under suitable conditions such as Polish state spaces, if $X_n \Rightarrow X$, then one can construct random variables $Y_n, Y$ on a new probability space such that

$$Y_n \overset{d}{=} X_n, \qquad Y \overset{d}{=} X, \qquad Y_n \to Y \text{ almost surely}.$$

This converts weak convergence into almost sure convergence after changing the carrier.

The theorem is powerful but dangerous if misread. It does not say the original 𝑋𝑛 converge almost surely. It says there exists a coupling with almost sure convergence. Therefore it is a law-level-to-coupling liftback theorem, not a statement about the original sample space.

15.11 Borel–Cantelli lemmas

For events $A_n$, define

$$\{A_n \text{ i.o.}\} = \limsup_n A_n = \bigcap_{m=1}^{\infty} \bigcup_{n \ge m} A_n.$$

The first Borel–Cantelli lemma states that if

$$\sum_n P(A_n) < \infty,$$

then

$$P(A_n \text{ i.o.}) = 0.$$

No independence is required.

The second Borel–Cantelli lemma states that if the $A_n$ are independent and

$$\sum_n P(A_n) = \infty,$$

then

$$P(A_n \text{ i.o.}) = 1.$$

Thus summability controls eventual occurrence. The second direction requires independence or sufficient dependence control.

15.12 Cauchy criteria in probability

A sequence 𝑋𝑛 is Cauchy in probability if for every 𝜀>0,

𝑃(𝑋𝑛𝑋𝑚>𝜀)0

as 𝑚,𝑛. In many standard settings, Cauchy in probability implies convergence in probability to some random variable.

For $L^p$, the Cauchy criterion is norm-based:

\[ \|X_n - X_m\|_p \to 0. \]

Completeness of 𝐿𝑝 then supplies an 𝐿𝑝 limit. Cauchy criteria are useful when the limit is not explicitly known, as in stochastic integration, martingale convergence, and construction of processes.

15.13 Convergence of expectations

Convergence of random variables does not automatically imply convergence of expectations. Sufficient conditions include dominated convergence, bounded convergence, monotone convergence, 𝐿1 convergence, or convergence in probability plus uniform integrability.

For weak convergence, bounded continuous test functions are safe:

\[ X_n \Rightarrow X \implies E[f(X_n)] \to E[f(X)] \]

for bounded continuous $f$. For unbounded $f$, additional conditions are required. The missing bridge is usually uniform integrability of $f(X_n)$.

15.14 Uniform integrability as missing bridge

Uniform integrability converts weak or probability convergence into expectation convergence. If $X_n \to X$ in probability and $\{X_n\}$ is uniformly integrable, then

\[ E[X_n] \to E[X]. \]

If $X_n \Rightarrow X$ and $\{X_n\}$ has sufficient uniform integrability under a coupling or moment condition, then expectations can often be transferred.

This is the recurring audit rule: convergence controls where most mass goes; uniform integrability controls what rare large mass can do. Without it, expectations can remain fixed, diverge, or oscillate despite convergence in probability or distribution.


Chapter 16 — Laws and Weak Convergence

16.1 Probability measures on metric spaces

A probability law on a metric space 𝑆 is a probability measure on its Borel σ-algebra 𝐵(𝑆). Metric structure allows one to define weak convergence, tightness, continuity sets, compactness, and convergence-determining classes.

The move from real-valued variables to metric-space-valued random elements is essential for stochastic processes, empirical measures, random graphs, and random functions. A law on 𝐶[0,1], 𝐷[0,1], or the space of probability measures is still a probability measure; only the state carrier changes.

16.2 Bounded continuous test functions

Weak convergence $\mu_n \Rightarrow \mu$ is defined by

\[ \int f \, d\mu_n \to \int f \, d\mu \]

for every bounded continuous $f : S \to \mathbb{R}$. These functions probe the law without being sensitive to null boundary artifacts.

Boundedness prevents tail mass from distorting the test integral; continuity prevents tests from seeing abrupt boundary behavior not controlled by weak convergence. Indicator functions are generally not continuous, so event probabilities require continuity-set conditions.

16.3 Portmanteau theorem

The Portmanteau theorem gives equivalent formulations of weak convergence. On metric spaces, $\mu_n \Rightarrow \mu$ iff

\[ \limsup_n \mu_n(F) \le \mu(F) \]

for every closed set $F$, equivalently

\[ \liminf_n \mu_n(G) \ge \mu(G) \]

for every open set $G$, and equivalently

\[ \mu_n(A) \to \mu(A) \]

for every $\mu$-continuity set $A$, meaning $\mu(\partial A) = 0$.

The theorem explains why convergence of CDFs is required only at continuity points. Discontinuities are atoms or boundary masses where indicator tests are not continuous probes. Weak convergence controls events whose boundaries carry no limiting mass.

16.4 Tightness

A family of probability measures $\{\mu_i\}$ on a metric space is tight if for every $\varepsilon > 0$, there exists compact $K$ such that

\[ \sup_i \mu_i(K^c) < \varepsilon. \]

Tightness says mass does not escape to infinity or into noncompact regions.

On $\mathbb{R}^d$, tightness often follows from moment bounds: for some $p > 0$,

\[ \sup_i \int |x|^p \, \mu_i(dx) < \infty \implies \{\mu_i\} \text{ tight.} \]

Tightness is the compactness gate for weak convergence. It gives subsequential limits, but it does not identify the limit; identification requires convergence of test functions, characteristic functions, finite-dimensional distributions, or other determining data.
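
The moment criterion on $\mathbb{R}^d$ is a one-line consequence of Markov's inequality. A short derivation, where $M$ is a symbol introduced here for the assumed uniform moment bound and $p > 0$ is the moment order:

\[ \mu_i(\{x : |x| > R\}) \le \frac{1}{R^p} \int |x|^p \, \mu_i(dx) \le \frac{M}{R^p}, \qquad M := \sup_i \int |x|^p \, \mu_i(dx) < \infty. \]

Choosing $R > (M/\varepsilon)^{1/p}$ makes the right-hand side smaller than $\varepsilon$ uniformly in $i$, and the closed ball $\{x : |x| \le R\}$ is compact in $\mathbb{R}^d$, so it serves as the set $K$ in the definition.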

16.5 Prokhorov theorem

Prokhorov’s theorem states that, on Polish spaces, a family of probability measures is relatively compact in the weak topology iff it is tight. Thus tightness is not merely sufficient but exactly the compactness criterion in good spaces.

This theorem is central in process convergence. To prove $X_n \Rightarrow X$ in a function space, one typically proves tightness of the laws and then identifies all subsequential limits by finite-dimensional distributions or martingale problems. Tightness gives existence of limit candidates; identification eliminates ambiguity.

16.6 Weak convergence on ℝᵈ

For laws on $\mathbb{R}^d$, weak convergence can be tested by bounded continuous functions, by convergence of distribution functions at continuity points, or by characteristic functions:

\[ \phi_{\mu_n}(t) = \int e^{i\langle t, x\rangle} \, \mu_n(dx). \]

Lévy’s continuity theorem states that pointwise convergence of characteristic functions to a function continuous at zero gives weak convergence to the corresponding law.

In $\mathbb{R}^d$, Cramér–Wold also applies: $X_n \Rightarrow X$ iff

\[ \langle u, X_n \rangle \Rightarrow \langle u, X \rangle \]

for every $u \in \mathbb{R}^d$. This reduces multivariate convergence to one-dimensional projections.

16.7 Weak convergence on function spaces

Weak convergence on spaces such as 𝐶[0,1] or 𝐷[0,1] requires more than convergence of finite-dimensional distributions. One must also prove tightness in the function-space topology. For 𝐶[0,1], tightness is often certified by modulus-of-continuity estimates. For 𝐷[0,1], Skorokhod topologies handle jumps.

The process-level law contains path regularity. Finite-dimensional distributions only describe coordinates at finitely many times; they do not control oscillation between times. Thus process convergence requires both finite-dimensional convergence and tightness. This is the standard two-gate structure.

16.8 Empirical measures

Given samples $X_1, \dots, X_n$, the empirical measure is

\[ \mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}. \]

It is a random probability measure. For iid samples with law $\mu$, the empirical measure converges weakly to $\mu$ almost surely under broad conditions (for example, on separable metric spaces):

\[ \mu_n \Rightarrow \mu. \]

Empirical measures convert samples into law-level random objects. The Glivenko–Cantelli theorem strengthens this on $\mathbb{R}$ to uniform convergence of CDFs:

\[ \sup_x |F_n(x) - F(x)| \to 0 \]

almost surely. Empirical process theory studies fluctuations around this convergence.
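
The uniform CDF distance in Glivenko–Cantelli can be computed exactly from a sample. The following is a minimal sketch, assuming the target CDF is continuous. The names `sup_cdf_distance` and `uniform_cdf` are illustrative, and the uniform law on $[0,1]$ is just a convenient test case.

```python
import random

def sup_cdf_distance(sample, cdf):
    """sup_x |F_n(x) - F(x)| for the empirical CDF F_n of `sample`.

    Assumes `cdf` is continuous. F_n is a step function, so the supremum
    is reached at a sample point, from the right (value (i+1)/n) or from
    the left (value i/n), with i the 0-based rank in the sorted sample.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

def uniform_cdf(x):
    """CDF of the uniform law on [0, 1]."""
    return min(1.0, max(0.0, x))

rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, sup_cdf_distance(sample, uniform_cdf))
```

The printed distance should shrink as the sample size grows. The scale of that shrinkage is the subject of empirical process theory.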

16.9 Wasserstein convergence

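For $p \ge 1$, the $p$-Wasserstein distance between probability measures on a metric space $(S, d)$ is

\[ W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int d(x, y)^p \, \gamma(dx, dy) \right)^{1/p}, \]

where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$, that is, joint laws with marginals $\mu$ and $\nu$. Wasserstein convergence combines weak convergence with moment control.

On $\mathbb{R}^d$, for laws with finite $p$-th moments, $W_p(\mu_n, \mu) \to 0$ iff $\mu_n \Rightarrow \mu$ and $\int |x|^p \, \mu_n(dx) \to \int |x|^p \, \mu(dx)$. This makes Wasserstein distance a liftback metric: it remembers geometry and tail magnitude, not just weak law convergence.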
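
On the real line the infimum over couplings has a closed form for empirical measures of equal size: pairing the sorted values is optimal for $p \ge 1$. The sketch below relies on that standard fact. The function name `wasserstein_1d` is introduced here for illustration.

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between the empirical measures of two equal-size real samples.

    Assumes len(xs) == len(ys) and p >= 1. On the real line the monotone
    coupling (sorted values paired in order) attains the infimum.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1 / p)

# Escaping mass: n - 1 atoms at 0 and a single atom at n, compared with n atoms at 0.
for n in (10, 100, 1000):
    escaping = [0.0] * (n - 1) + [float(n)]
    at_zero = [0.0] * n
    print(n, wasserstein_1d(escaping, at_zero, p=1))
```

The escaping-mass measures converge weakly to the point mass at zero, yet their $W_1$ distance to it does not shrink, because the first moment does not converge. This is the moment-control part of Wasserstein convergence at work.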

16.10 Lévy metric

The Lévy metric metrizes weak convergence of probability measures on $\mathbb{R}$. For distribution functions $F, G$, it is the infimum of all $\varepsilon > 0$ such that

\[ F(x - \varepsilon) - \varepsilon \le G(x) \le F(x + \varepsilon) + \varepsilon \]

for all $x$. It permits small horizontal and vertical errors.

The metric is useful because weak convergence of real laws is exactly convergence in this metric. It is less commonly used in computations than characteristic functions or bounded-Lipschitz metrics, but it formalizes the CDF geometry of weak convergence.

16.11 Convergence-determining classes

A class $C$ of test functions or sets is convergence-determining if convergence on $C$ implies weak convergence. On $\mathbb{R}$, the half-lines $(-\infty, t]$, with $t$ ranging over continuity points of the limiting CDF, determine convergence. On $\mathbb{R}^d$, characteristic functions or bounded Lipschitz functions determine convergence.

Convergence-determining classes reduce verification. Instead of testing every bounded continuous function, one tests a smaller structured family. The class must be rich enough to identify the law. Insufficient tests can miss mass or dependence.

16.12 Mapping theorem

If $X_n \Rightarrow X$ and $g : S \to T$ is continuous, then

\[ g(X_n) \Rightarrow g(X). \]

More generally, it is enough that the discontinuity set of $g$ has $P_X$-measure zero. This is the mapping theorem.

The theorem transports weak convergence through functions. It is widely used for statistics: once an estimator converges in distribution, continuous transformations of it converge by mapping. Discontinuous transformations require boundary audits; atoms at discontinuities can break the conclusion.

16.13 Continuous mapping theorem

The continuous mapping theorem is the random-variable version of the mapping theorem. If

\[ X_n \Rightarrow X \]

and $g$ is continuous at $X$ almost surely, then

\[ g(X_n) \Rightarrow g(X). \]

For vector-valued variables, this handles sums, products, norms, maxima, and smooth transformations when continuous.

The theorem is law-level. It does not claim pointwise convergence of 𝑔(𝑋𝑛). It says the laws transport through continuous maps. For discontinuous 𝑔, one must verify that the limit avoids discontinuity points almost surely.

16.14 Slutsky’s theorem

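Slutsky’s theorem states that if

\[ X_n \Rightarrow X, \qquad Y_n \to c \text{ in probability,} \]

where $c$ is constant, then

\[ X_n + Y_n \Rightarrow X + c, \qquad X_n Y_n \Rightarrow cX, \]

and if $c \ne 0$,

\[ X_n / Y_n \Rightarrow X / c. \]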

Slutsky’s theorem is the standard tool for replacing unknown normalizing constants or nuisance estimators by consistent estimates. It combines weak convergence with probability convergence. The constant limit is important; if $Y_n \Rightarrow Y$ with $Y$ nonconstant, joint convergence of $(X_n, Y_n)$ is required to conclude anything about $X_n + Y_n$.


Chapter 17 — Laws of Large Numbers

17.1 Weak law of large numbers

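The weak law states that for iid integrable random variables $X_i$ with mean $\mu$,

\[ \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu \]

in probability. With finite variance $\sigma^2$, the proof is immediate from Chebyshev:

\[ \operatorname{Var}\left( \frac{1}{n} \sum_i X_i \right) = \frac{\sigma^2}{n}. \]

Thus

\[ P\left( \left| \frac{1}{n} \sum_i X_i - \mu \right| > \varepsilon \right) \le \frac{\sigma^2}{n \varepsilon^2}. \]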

The weak law says empirical averages are close to expectation with high probability for large 𝑛. It is a typical-sample theorem, not a pathwise theorem. It does not say every infinite sequence has average 𝜇, nor that convergence occurs almost surely.
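
The Chebyshev bound also gives a crude sample-size rule. A minimal sketch, with hypothetical helper names and a tolerance parameter `delta` that is not in the text:

```python
import math

def chebyshev_bound(sigma2, n, eps):
    """Upper bound sigma^2 / (n eps^2) on P(|mean - mu| > eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))

def chebyshev_sample_size(sigma2, eps, delta):
    """Smallest n for which the Chebyshev bound is at most delta."""
    return math.ceil(sigma2 / (delta * eps ** 2))

# Unit variance, accuracy 0.5, failure probability at most 0.25.
n_needed = chebyshev_sample_size(1.0, 0.5, 0.25)
print(n_needed, chebyshev_bound(1.0, n_needed, 0.5))
```

The rule is conservative because Chebyshev uses only the variance. Sharper concentration inequalities need stronger hypotheses, such as boundedness or exponential moments.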

17.2 Strong law of large numbers

The strong law upgrades convergence to almost sure:

\[ \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu \quad \text{a.s.} \]

For iid $X_i$, the sharp classical integrability condition is $E|X_1| < \infty$. Under finite variance, easier proofs use maximal inequalities or subsequence arguments.

The strong law is pathwise. It says that with probability one, a realized infinite sample sequence has empirical average converging to the mean. This is the formal theorem behind long-run frequency stabilization. It still allows exceptional null sequences where convergence fails.

17.3 Kolmogorov inequality

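For independent mean-zero random variables $X_i$ with finite variances, Kolmogorov’s maximal inequality states that for every $\lambda > 0$,

\[ P\left( \max_{1 \le k \le n} \left| \sum_{i=1}^{k} X_i \right| \ge \lambda \right) \le \frac{\operatorname{Var}\left( \sum_{i=1}^{n} X_i \right)}{\lambda^2}. \]

By independence,

\[ \operatorname{Var}\left( \sum_i X_i \right) = \sum_i \operatorname{Var}(X_i). \]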

This inequality controls the maximum partial sum, not just the final sum. It is therefore suited to almost sure convergence proofs. Strong laws require control over all sufficiently large partial sums, and maximal inequalities provide that pathwise bridge.

17.4 Three-series theorem

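Kolmogorov’s three-series theorem characterizes almost sure convergence of sums of independent random variables. For independent $X_n$, the series $\sum_n X_n$ converges almost surely iff, for some (equivalently, every) truncation level $c > 0$, three series involving large jumps, truncated means, and truncated variances converge:

\[ \sum_n P(|X_n| > c) < \infty, \qquad \sum_n E\left[ X_n 1_{\{|X_n| \le c\}} \right] \text{ converges}, \qquad \sum_n \operatorname{Var}\left( X_n 1_{\{|X_n| \le c\}} \right) < \infty. \]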

The theorem decomposes convergence into jump control, drift control, and fluctuation control. It is the precise independent-sum audit. Large rare terms, accumulated bias, and residual variance are the three possible obstructions.

17.5 Etemadi’s proof

Etemadi gave a clean proof of the strong law for pairwise independent identically distributed integrable random variables. The proof avoids some heavier machinery and shows that full mutual independence is not always necessary for averaging.

The conceptual point is that strong laws require enough independence to control deviations of partial sums. Pairwise independence can suffice when combined with identical distribution and truncation. This is an independence-strength lesson: different limit theorems require different factorization payloads.

17.6 Truncation methods

Truncation replaces $X_i$ by bounded variables such as

\[ X_i^{(n)} = X_i 1_{\{|X_i| \le n\}}. \]

The goal is to separate ordinary fluctuations from rare large jumps. For integrable $X$,

\[ E\left[ |X| 1_{\{|X| > n\}} \right] \to 0. \]

This makes tail contributions negligible after normalization.

Truncation is a core probability technique because many theorems are easy for bounded variables and hard for unbounded ones. The proof strategy is: prove the theorem for truncated variables, show the discarded tails do not matter, and then lift back to the original variables. The tail estimate is the decisive debt payment.

17.7 Identically distributed versus independent

Identically distributed means all variables have the same law. Independent means the joint law factors. Neither implies the other. A constant sequence 𝑋𝑛=𝑋1 is identically distributed but maximally dependent. Independent variables may have different distributions.

The law of large numbers typically requires both a stable average law and weak enough dependence. Identical distribution supplies a common mean; independence supplies variance or fluctuation control. General versions replace identical distribution with average moment conditions and replace independence with mixing, martingale differences, or ergodicity.

17.8 Weak dependence versions

Weak dependence versions of LLN allow correlations but require them to decay or average out. If $X_i$ are centered and

\[ \operatorname{Var}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) \to 0, \]

then the average converges to zero in $L^2$ and hence in probability. This criterion can hold under covariance summability or mixing conditions.

For stationary sequences, ergodic theorems replace independence. The average converges to a conditional expectation on the invariant σ-algebra. If the system is ergodic, that conditional expectation is constant. Thus independence is one route to averaging, but not the only route.

17.9 Ergodic theorem preview

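Birkhoff’s ergodic theorem states that for a measure-preserving transformation $T$ and $f \in L^1$,

\[ \frac{1}{n} \sum_{k=0}^{n-1} f(T^k \omega) \to E[f \mid \mathcal{I}](\omega) \quad \text{a.s.,} \]

where $\mathcal{I}$ is the invariant σ-algebra. If the system is ergodic, $\mathcal{I}$ is trivial and the limit is $E f$.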

This generalizes the strong law from independent samples to deterministic or dependent dynamical sampling. The limit is not automatically the ensemble mean; it is the invariant-information conditional mean. Ergodicity is the gate that collapses time average to ensemble average.

17.10 Empirical averages and model liftback

The LLN connects formal probability to empirical averaging:

\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \to E X. \]

But the theorem lives inside a model. To export it to data, one must justify that observations are modeled as iid, weakly dependent, stationary ergodic, or otherwise governed by the theorem’s assumptions.

The liftback error is to treat LLN as saying “averages always stabilize.” They stabilize under specific carrier conditions. Heavy tails, nonstationarity, dependence, selection bias, and adversarial sampling can all break the empirical interpretation. LLN is a mathematical certificate after hypotheses are paid, not a universal law of data.


Chapter 18 — Central Limit Theory

18.1 Normal distribution

The normal distribution $N(\mu, \sigma^2)$ has density

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \]

The standard normal is $N(0,1)$. Its characteristic function is

\[ \phi(t) = e^{-t^2/2}. \]

The normal law is stable under independent summation: sums of independent normal variables are normal. It also emerges as the universal finite-variance fluctuation limit for many independent sums. The CLT explains Gaussian appearance not as a primitive assumption but as a consequence of aggregation under appropriate normalization.

18.2 Characteristic functions

For a real random variable $X$,

\[ \phi_X(t) = E[e^{itX}]. \]

If $X, Y$ are independent,

\[ \phi_{X+Y}(t) = \phi_X(t) \, \phi_Y(t). \]

This multiplicative property makes characteristic functions ideal for sums.

If $X$ has mean $0$ and variance $\sigma^2$, then near zero,

\[ \phi_X(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2), \]

under finite second moment. In CLT proofs, this local expansion is raised to the $n$-th power after scaling $t$ to $t/\sqrt{n}$, producing the Gaussian exponential.
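
The scaling step can be written out in one line. A short derivation for iid summands with mean $0$ and variance $\sigma^2$, where $S_n = X_1 + \dots + X_n$ is notation introduced here and $t$ is fixed:

\[ \phi_{S_n/(\sigma\sqrt{n})}(t) = \left[ \phi_X\!\left( \frac{t}{\sigma\sqrt{n}} \right) \right]^n = \left[ 1 - \frac{t^2}{2n} + o\!\left( \frac{1}{n} \right) \right]^n \longrightarrow e^{-t^2/2}. \]

The first equality uses independence, the second uses the second-order expansion at the point $t/(\sigma\sqrt{n})$, and the limit is the characteristic function of $N(0,1)$.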

18.3 Lévy continuity theorem

Lévy’s continuity theorem states that if characteristic functions $\phi_n(t)$ converge pointwise to a function $\phi(t)$ continuous at $0$, then $\phi$ is the characteristic function of some probability law $\mu$, and

\[ \mu_n \Rightarrow \mu. \]

Conversely, weak convergence implies pointwise convergence of characteristic functions.

This theorem is the main transport gate from Fourier analysis to weak convergence. It allows one to prove distributional limits by proving analytic convergence of characteristic functions. Continuity at zero prevents mass loss.

18.4 Classical CLT

Let $X_1, X_2, \dots$ be iid with mean $\mu$ and variance $0 < \sigma^2 < \infty$. Then

\[ \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \Rightarrow N(0,1). \]

This is the classical central limit theorem.

The normalization is essential. The sum has mean $n\mu$ and variance $n\sigma^2$, so subtracting $n\mu$ centers it and dividing by $\sigma\sqrt{n}$ gives unit variance. The theorem describes fluctuations around the law-of-large-numbers scale. It does not describe rare large deviations far into the tails.

18.5 Lindeberg–Feller CLT

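For triangular arrays $X_{n,k}$, the Lindeberg–Feller theorem gives conditions under which normalized sums converge to normal. Let the variables in each row be independent and centered, with variances $\sigma_{n,k}^2$ and total variance $s_n^2 = \sum_k \sigma_{n,k}^2$. The Lindeberg condition is

\[ \frac{1}{s_n^2} \sum_k E\left[ X_{n,k}^2 1_{\{|X_{n,k}| > \varepsilon s_n\}} \right] \to 0 \]

for every $\varepsilon > 0$.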

This condition says no single summand contributes a macroscopic part of the variance. It is a tail-negligibility certificate. The theorem generalizes iid CLT to nonidentically distributed arrays and identifies the obstruction: large individual jumps.

18.6 Lyapunov CLT

The Lyapunov condition is a stronger, easier-to-check sufficient condition. If independent centered variables have total variance $s_n^2$, and for some $\delta > 0$,

\[ \frac{1}{s_n^{2+\delta}} \sum_k E|X_{n,k}|^{2+\delta} \to 0, \]

then the normalized sum converges to $N(0,1)$.

Lyapunov implies Lindeberg by Markov-type estimates. It pays the large-jump debt using a higher moment. The cost is stronger hypotheses. In applications, Lyapunov is often simpler; Lindeberg is sharper.

18.7 Triangular arrays

A triangular array is a collection $X_{n,k}$ for $1 \le k \le k_n$, where the $n$-th row is summed and normalized. Arrays model changing distributions, infinitesimal summands, and approximations where no fixed iid sequence exists.

Triangular arrays are the natural carrier for general CLT theory. They separate row-level normalization from variable-level assumptions. The main audit questions are: are variables independent within rows, are they centered, what is the row variance, and do large terms vanish relative to total scale?

18.8 Berry–Esseen theorem

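For iid variables with mean $0$, variance $\sigma^2 > 0$, and finite third absolute moment $\rho = E|X|^3$, the Berry–Esseen theorem gives, for an absolute constant $C$,

\[ \sup_x \left| P\left( \frac{\sum_{i=1}^{n} X_i}{\sigma\sqrt{n}} \le x \right) - \Phi(x) \right| \le \frac{C\rho}{\sigma^3\sqrt{n}}. \]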

It quantifies the CLT rate.

The theorem turns asymptotic convergence into finite-𝑛 error control. The third absolute moment is the rate debt. Without quantitative error, a CLT only says convergence eventually occurs; Berry–Esseen says how fast in Kolmogorov distance.
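
For Bernoulli summands both sides of the Berry–Esseen inequality can be computed exactly, which makes the rate concrete. The sketch below makes several assumptions that are not in the text: the helper names, the choice of Bernoulli(1/2) as test case, and the value $C = 0.56$ used as an admissible constant. The direct binomial formula is only suitable for moderate $n$ (up to roughly a thousand) before the binomial coefficients exceed floating-point range.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance_binomial(n, p=0.5):
    """sup_x |P((S_n - np)/sd <= x) - Phi(x)| for S_n ~ Binomial(n, p).

    The standardized CDF is a step function, so the supremum is reached
    at a jump point, from the left or from the right.
    """
    sd = math.sqrt(n * p * (1 - p))
    cdf_left = 0.0  # P(S_n < k)
    worst = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        z = std_normal_cdf((k - n * p) / sd)
        cdf_right = cdf_left + pmf  # P(S_n <= k)
        worst = max(worst, abs(cdf_left - z), abs(cdf_right - z))
        cdf_left = cdf_right
    return worst

def berry_esseen_bound(n, p=0.5, C=0.56):
    """C * rho / (sigma^3 sqrt(n)) for centered Bernoulli(p) summands.

    C = 0.56 is an assumed admissible constant, not taken from the text.
    """
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)  # E|X - p|^3
    return C * rho / (sigma ** 3 * math.sqrt(n))

for n in (25, 100, 400):
    print(n, kolmogorov_distance_binomial(n), berry_esseen_bound(n))
```

Both columns should decay at the rate $n^{-1/2}$. For lattice summands the true distance cannot decay faster, since each jump of the step function has size of order $n^{-1/2}$.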

18.9 Multivariate CLT

For iid random vectors $X_i \in \mathbb{R}^d$ with mean $m$ and covariance matrix $\Sigma$,

\[ \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (X_i - m) \Rightarrow N(0, \Sigma). \]

The limiting Gaussian is characterized by

\[ E e^{i\langle t, Z\rangle} = e^{-\frac{1}{2} t^\top \Sigma t}. \]

The Cramér–Wold device reduces the proof to one-dimensional CLTs: convergence of all projections $\langle t, X_n \rangle$ implies multivariate convergence. The covariance matrix encodes all second-order limiting geometry.

18.10 Delta method

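If

\[ \sqrt{n}(\hat{\theta}_n - \theta) \Rightarrow Z \]

and $g$ is differentiable at $\theta$, then

\[ \sqrt{n}\left( g(\hat{\theta}_n) - g(\theta) \right) \Rightarrow Dg(\theta) Z. \]

In one dimension,

\[ \sqrt{n}\left( g(\hat{\theta}_n) - g(\theta) \right) \Rightarrow g'(\theta) Z. \]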

The delta method transports asymptotic normality through smooth transformations. If the first derivative vanishes, higher-order delta methods are needed with different scaling. The differentiability and nondegeneracy assumptions are the liftback gates.
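
A worked instance, chosen here for illustration: take $g(x) = x^2$ and the sample mean $\bar{X}_n$ of iid variables with mean $\mu$ and variance $\sigma^2$, so that $\sqrt{n}(\bar{X}_n - \mu) \Rightarrow N(0, \sigma^2)$. Since $g'(\mu) = 2\mu$,

\[ \sqrt{n}\left( \bar{X}_n^2 - \mu^2 \right) \Rightarrow N(0, 4\mu^2\sigma^2). \]

When $\mu = 0$ the derivative vanishes and this limit degenerates to a point mass. The correct scaling is then $n$ rather than $\sqrt{n}$:

\[ n \bar{X}_n^2 = \left( \sqrt{n}\, \bar{X}_n \right)^2 \Rightarrow \sigma^2 \chi^2_1. \]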

18.11 Stable convergence

Stable convergence strengthens convergence in distribution by preserving joint convergence with bounded variables measurable with respect to an underlying σ-algebra. One writes

\[ X_n \xrightarrow{\text{st}} X \]

when $X$ may live on an extension and joint limits with background randomness are retained.

This mode appears in martingale CLTs, random environments, and asymptotic statistics with mixed normal limits. Ordinary weak convergence forgets coupling to the original information; stable convergence remembers enough joint structure to support conditional limits.

18.12 Failure of CLT under heavy tails

If $X_i$ have infinite variance, the $\sqrt{n}$ Gaussian CLT may fail. Heavy-tailed variables with

\[ P(|X| > t) \sim C t^{-\alpha}, \qquad 0 < \alpha < 2, \]

can converge, after normalization by $n^{1/\alpha}$, to stable laws rather than Gaussians.

The obstruction is that large jumps do not become negligible. Variance is infinite, and no single quadratic scale controls fluctuations. The correct limit carrier becomes stable law theory, not Gaussian theory. Thus “sum of many variables is normal” is false without tail hypotheses.

18.13 CLT versus tail probabilities

The CLT describes probabilities at fluctuation scale:

\[ S_n - n\mu = O(\sqrt{n}). \]

It does not accurately describe rare deviations such as

\[ S_n - n\mu \ge cn. \]

Those belong to large deviation theory, where probabilities decay exponentially and rate functions, not Gaussian densities, govern behavior.

Using CLT for far-tail estimates is a common error. Moderate deviations interpolate between CLT and large deviations, but require their own hypotheses. The scale must be declared before a limit theorem is applied.

18.14 Gaussian approximation residue

A Gaussian approximation has residue: centering error, variance estimation error, dependence error, tail error, lattice correction, finite-sample rate, and test-function class. A bare CLT only gives weak convergence:

\[ E f(Z_n) \to E f(Z) \]

for bounded continuous 𝑓. It does not automatically give density approximation, tail approximation, local probability approximation, or uniform finite-𝑛 accuracy.

The correct terminal depends on the claim. If the claim is asymptotic distribution, CLT may suffice. If the claim is finite probability, confidence interval accuracy, rare-event estimate, or density approximation, additional quantitative certificates are required.


Chapter 19 — Poisson and Rare-Event Limits

19.1 Law of small numbers

The law of small numbers says that the sum of many rare, approximately independent indicators tends to a Poisson distribution. If $X_n \sim \text{Binomial}(n, \lambda/n)$, then

\[ X_n \Rightarrow \text{Poisson}(\lambda). \]

The heuristic is: many opportunities, each with small probability, with total expected count near 𝜆.

The Poisson law is therefore the rare-event analogue of the Gaussian law. Gaussian limits arise from many small additive fluctuations with finite variance; Poisson limits arise from sparse counts of rare events. The scaling regime determines the limit carrier.

19.2 Poisson approximation

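For a sum of indicators

\[ W = \sum_{i \in \mathcal{I}} I_i, \qquad p_i = P(I_i = 1), \]

the natural Poisson parameter is

\[ \lambda = E W = \sum_i p_i. \]

If the indicators are rare and weakly dependent, then $W$ is close to $\text{Poisson}(\lambda)$.

Quantitative approximation requires an error metric such as total variation:

\[ d_{\mathrm{TV}}\left( \mathcal{L}(W), \text{Poisson}(\lambda) \right). \]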

The quality depends on probabilities of individual events and dependence neighborhoods. Rare-event approximation is not just matching the mean; dependence can create clusters and destroy Poisson behavior.

19.3 Binomial-to-Poisson limit

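Let $X_n \sim \text{Binomial}(n, p_n)$ with $n p_n \to \lambda$. Then for fixed $k$,

\[ P(X_n = k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k} \to e^{-\lambda} \frac{\lambda^k}{k!}. \]

Thus $X_n \Rightarrow \text{Poisson}(\lambda)$.

The proof shows the three components: $\binom{n}{k} \sim n^k/k!$, $p_n^k \sim (\lambda/n)^k$, and $(1 - p_n)^{n} \to e^{-\lambda}$, while $(1 - p_n)^{-k} \to 1$ for fixed $k$. This is the canonical sparse independent limit.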
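
The limit can be checked numerically by computing the total variation distance between the two laws. The sketch below uses hypothetical helper names, works in log space to avoid overflow, and compares the distance with the bound $\lambda^2/n$ that Le Cam's inequality gives for Binomial$(n, \lambda/n)$. That inequality is a standard result, not one stated in the text.

```python
import math

def binomial_pmf(n, p, k):
    """P(Binomial(n, p) = k), computed in log space; assumes 0 < p < 1."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(lam, k):
    """P(Poisson(lam) = k), computed in log space; assumes lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_binomial_poisson(n, lam):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam).

    Assumes 0 < lam < n.
    """
    p = lam / n
    diff = 0.0
    poisson_mass = 0.0
    for k in range(n + 1):
        q = poisson_pmf(lam, k)
        diff += abs(binomial_pmf(n, p, k) - q)
        poisson_mass += q
    # Poisson mass above n, where the binomial puts no mass.
    diff += max(0.0, 1.0 - poisson_mass)
    return 0.5 * diff

lam = 2.0
for n in (10, 100, 1000):
    print(n, tv_binomial_poisson(n, lam), lam ** 2 / n)
```

The distance should fall roughly in proportion to $1/n$ and stay below the second column.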

19.4 Chen–Stein method

The Chen–Stein method gives quantitative Poisson approximation for dependent indicators. It characterizes the Poisson distribution by an operator equation and bounds the distance between 𝑊 and Poisson through local dependence terms.

A typical bound involves dependency neighborhoods $B_i$, with errors built from sums such as

\[ \sum_i \sum_{j \in B_i} p_i p_j \]

and joint probabilities

\[ \sum_i \sum_{j \in B_i,\, j \ne i} P(I_i = 1, I_j = 1). \]

The method pays the dependence debt explicitly. It is widely used in random graphs, occupancy problems, pattern matching, and rare-event counts.

19.5 Rare indicators

Rare indicators are variables $I_i = 1_{A_i}$ with small $P(A_i)$. Their sum counts rare events. The Poisson regime requires not only that each $P(A_i)$ is small, but that the total mean remains finite:

\[ \sum_i P(A_i) \to \lambda. \]

If rare events occur in clusters, the limit may be compound Poisson rather than Poisson. If dependence is too strong, no Poisson limit may hold. The audit is: individual rarity, total intensity, and clustering control.

19.6 Dependency neighborhoods

A dependency neighborhood 𝐵𝑖 for 𝐼𝑖 is a set of indices containing variables that may significantly depend on 𝐼𝑖. Outside 𝐵𝑖, approximate independence is assumed or proved. The smaller and weaker these neighborhoods are, the closer the count is to Poisson.

In random graphs, the indicator for a subgraph copy depends on indicators for overlapping copies. Nonoverlapping copies may be independent. Thus dependency neighborhoods are combinatorial overlap structures. Poisson approximation reduces to counting overlaps and showing their contribution vanishes.

19.7 Point processes preview

A point process is a random counting measure

\[ N = \sum_i \delta_{X_i} \]

on a space $S$. A Poisson point process with intensity measure $\mu$ satisfies:

\[ N(A) \sim \text{Poisson}(\mu(A)) \]

for measurable $A$, and counts on disjoint sets are independent.

Poisson point processes are the spatial version of rare-event limits. Instead of only counting how many rare events occur, they record where they occur. Many rare-event limits are better stated as convergence of point processes, with ordinary Poisson count convergence as a projection.

19.8 Poissonization and depoissonization

Poissonization replaces a fixed sample size $n$ by a Poisson random sample size $N \sim \text{Poisson}(n)$. This often makes counts independent. For example, in occupancy problems, Poissonizing the number of balls makes bin occupancies independent Poisson variables.

Depoissonization transfers estimates back to fixed 𝑛. This transfer requires error control showing that randomizing the sample size did not change the target quantity too much. Poissonization is a carrier simplification; depoissonization is the liftback.

19.9 Occupancy and collision limits

In occupancy with $m$ balls and $n$ boxes, the number of boxes with exactly $r$ balls is a rare-event count in many regimes. Collision counts are sums over pairs:

\[ C = \sum_{i < j} 1_{\{X_i = X_j\}}. \]

When $m^2/n \to \lambda$, collisions can have Poisson limits. For uniformly placed balls, $E[C] = \binom{m}{2}/n \to \lambda/2$, so the limiting Poisson mean is $\lambda/2$.

The exact limit depends on scaling. If $m \ll \sqrt{n}$, collisions vanish. If $m \sim c\sqrt{n}$, collisions have nontrivial Poisson behavior. If $m \gg \sqrt{n}$, collisions become abundant. Occupancy models therefore display rare-event thresholds through elementary counting.
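
The threshold can be seen in the classical birthday setting. The sketch below assumes balls are placed uniformly and independently. The function names are introduced here. It compares the exact probability of no collision with the Poisson prediction $e^{-E[C]}$.

```python
import math

def prob_no_collision(m, n):
    """Exact P(C = 0) for m balls thrown uniformly into n boxes."""
    prob = 1.0
    for i in range(m):
        prob *= (n - i) / n
    return prob

def poisson_no_collision(m, n):
    """Poisson approximation exp(-E[C]) with E[C] = m(m-1)/(2n)."""
    return math.exp(-m * (m - 1) / (2 * n))

n = 365
for m in (5, 23, 60):
    print(m, prob_no_collision(m, n), poisson_no_collision(m, n))
```

With 365 boxes, 23 balls sit near the critical scale, and the probability of no collision is close to one half in both columns. Far below that scale collisions are unlikely, and far above it they are nearly certain.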


Chapter 20 — Local Limit Theorems

20.1 Global versus local convergence

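A global limit theorem, such as the CLT, says distribution functions converge:

\[ P\left( \frac{S_n - a_n}{b_n} \le x \right) \to \Phi(x). \]

A local limit theorem estimates point probabilities or small-window probabilities:

\[ P(S_n = k) \approx \frac{1}{b_n} g\left( \frac{k - a_n}{b_n} \right) \]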

in lattice cases, or density-level approximations in continuous cases.

Local theorems are stronger. Weak convergence tests large intervals with continuity boundaries; local convergence probes individual atoms or shrinking windows. It requires finer Fourier control and must account for lattice structure.

20.2 Lattice distributions

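A distribution is lattice if it is supported on

\[ a + h\mathbb{Z} \]

for some span $h > 0$. Sums of lattice variables remain on a lattice. For integer-valued iid variables with mean $\mu$ and variance $\sigma^2$, a typical local CLT has the form

\[ P(S_n = k) \approx \frac{1}{\sigma\sqrt{n}} \varphi\left( \frac{k - n\mu}{\sigma\sqrt{n}} \right), \]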

for admissible lattice points 𝑘, with 𝜑 the standard normal density.

The lattice span matters. If the variable only takes even values, odd target points have probability zero. A local theorem that ignores lattice support is false. Lattice audit is therefore mandatory.
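
The simple random walk with steps $\pm 1$ is a convenient test case: the step law has span 2, so $S_n$ only visits integers with the same parity as $n$. The sketch below compares exact point probabilities with the Gaussian local approximation. The helper names are illustrative, and the prefactor uses the standard span correction, span divided by $\sigma\sqrt{n}$, which is not spelled out in the text.

```python
import math

def srw_point_prob(n, k):
    """Exact P(S_n = k) for a walk with steps +1 or -1, each with probability 1/2."""
    if abs(k) > n or (n + k) % 2 != 0:
        return 0.0  # k is out of range or off the lattice of reachable points
    return math.comb(n, (n + k) // 2) / 2 ** n

def srw_local_clt(n, k, span=2):
    """Approximation (span / sqrt(n)) * phi(k / sqrt(n)); here mu = 0, sigma = 1."""
    x = k / math.sqrt(n)
    return span / math.sqrt(n) * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

n = 100
for k in (0, 1, 2, 10):
    print(k, srw_point_prob(n, k), srw_local_clt(n, k))
```

At even $k$ the two columns agree closely. At odd $k$ the exact probability is zero while the Gaussian formula is not, so the approximation is only meaningful on reachable lattice points.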

20.3 Nonlattice distributions

A nonlattice distribution is not concentrated on any shifted arithmetic progression $a + h\mathbb{Z}$. For sums of nonlattice variables, local results are usually stated in terms of densities, small intervals, or smoothed probabilities rather than point masses, since $P(S_n = x) = 0$ in continuous cases.

If $S_n$ has density $f_n$, a density local CLT may state

\[ \sigma\sqrt{n}\, f_n\left( n\mu + \sigma\sqrt{n}\, x \right) \to \varphi(x) \]

uniformly in 𝑥, under regularity conditions. Nonlattice assumptions prevent periodic Fourier obstructions that appear in lattice cases.

20.4 Fourier inversion

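Local limit theorems rely heavily on Fourier inversion. For integer-valued $S_n$,

\[ P(S_n = k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-itk} \phi_{S_n}(t) \, dt. \]

For continuous densities,

\[ f_{S_n}(x) = \frac{1}{2\pi} \int_{\mathbb{R}} e^{-itx} \phi_{S_n}(t) \, dt \]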

when inversion is justified.

The proof strategy splits the Fourier domain. Near 𝑡=0, characteristic functions approximate the Gaussian exponential. Away from zero, one needs decay bounds to show contributions are negligible. Local results require stronger global Fourier control than ordinary CLT.

20.5 Aperiodicity

For integer-valued variables, aperiodicity means the support is not contained in a proper sublattice $a + h\mathbb{Z}$ with $h > 1$. Equivalently, the characteristic function satisfies

\[ |\phi(t)| < 1 \]

for $t \in [-\pi, \pi] \setminus \{0\}$ in the span-one case.

Aperiodicity prevents periodic zeros and inaccessible residue classes. Without it, the local limit must be restricted to reachable lattice points and corrected by the span. Aperiodicity is the discrete analogue of nonlattice regularity.

20.6 Gaussian local limit theorem

For iid integer-valued aperiodic variables with mean $\mu$ and finite variance $\sigma^2 > 0$, the Gaussian local limit theorem states

\[ \sup_k \left| \sigma\sqrt{n}\, P(S_n = k) - \varphi\left( \frac{k - n\mu}{\sigma\sqrt{n}} \right) \right| \to 0. \]

This is stronger than the CLT because it approximates individual probabilities.

The theorem shows that the distribution mass at $k$ is approximately the Gaussian density times the lattice cell width $1/(\sigma\sqrt{n})$. It is the correct bridge from continuous Gaussian shape to discrete point probabilities.

20.7 Edgeworth expansions

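Edgeworth expansions refine normal approximation by adding correction terms involving cumulants. For the density $f_n$ of the standardized sum, a typical density-level expansion has the form

\[ f_n(x) = \varphi(x) \left[ 1 + \frac{P_1(x)}{\sqrt{n}} + \frac{P_2(x)}{n} + \cdots + \frac{P_m(x)}{n^{m/2}} \right] + o(n^{-m/2}), \]

where $P_i$ are polynomials determined by cumulants.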

These expansions give higher-order asymptotics beyond CLT. They require stronger moment and smoothness conditions and may fail in far tails. Edgeworth expansions are asymptotic series, not automatically positive densities or uniform global approximations.

20.8 Saddle-point methods preview

Saddle-point methods estimate probabilities using complex analytic or exponential tilting techniques. They are especially useful for local probabilities and tail probabilities where Gaussian approximation is insufficient. The method locates the dominant contribution to an integral representation, often from a point where a phase derivative vanishes.

In probability, saddle-point methods appear in sums, combinatorial enumeration, branching processes, and large deviations. They refine exponential-scale estimates by adding prefactors. The carrier is analytic: moment generating functions, cumulant generating functions, and contour integrals.

20.9 Local probability estimates

Local estimates determine probabilities of specific values or small windows:

\[ P(S_n = k), \qquad P(S_n \in [x, x + \Delta]). \]

They are required in random walks, renewal theory, combinatorics, statistical mechanics, and number-theoretic probability. Global convergence may be too coarse because it cannot resolve individual mass points.

A local estimate usually needs variance scale, lattice/nonlattice classification, smoothness or aperiodicity, and Fourier decay. The output is often uniform over central ranges and weaker in tails. Tail-local estimates may require large deviation or saddle-point machinery.

20.10 Lattice obstruction audit

Before applying a local limit theorem, check the support. If $X$ is supported on $a + h\mathbb{Z}$, then $S_n$ is supported on

\[ na + h\mathbb{Z}. \]

Any claimed approximation at points outside this set is false because the probability is exactly zero.

The audit also includes span, periodicity, residue classes, and whether the target variable has density or atoms. The CLT can ignore these details because intervals blur them; local limits cannot. Local probability is where carrier arithmetic reappears.
