Probability Theory — Chapters 21–29

 Probability Theory — Chapters 21–29

Chapter 21 — Large Deviations

21.1 Exponential scales

Large deviation theory studies probabilities that decay exponentially in a scale parameter. Instead of asking whether 𝑋𝑛𝑋, it asks for asymptotics of the form

𝑃(𝑋𝑛𝐴)𝑒𝑛inf𝑥𝐴𝐼(𝑥).

The scale 𝑛 is not always the sample size; it may be 𝑡, 1/𝜀, 𝑛2, volume, inverse noise, or another problem-specific growth parameter. The object 𝐼 is the rate function. Events with small 𝐼 are less expensive; events with large 𝐼 are exponentially suppressed.

The exponential scale is invisible to the central limit theorem. A CLT sees fluctuations of size 𝑛, while large deviations often concern deviations of size 𝑛. For iid 𝑋𝑖 with mean 𝜇, the event

1𝑛𝑖=1𝑛𝑋𝑖𝑎𝜇

usually has probability 𝑒𝑛𝐼(𝑎), not Gaussian-scale probability. The correct question is not “does the probability go to zero?” but “what exponential cost does the deviation pay?”

21.2 Chernoff bounds

Chernoff bounds use exponential moments to estimate tail probabilities. For any 𝑡>0,

𝑃(𝑋𝑎)=𝑃(𝑒𝑡𝑋𝑒𝑡𝑎)𝑒𝑡𝑎𝐸[𝑒𝑡𝑋].

Optimizing over 𝑡 gives

𝑃(𝑋𝑎)inf𝑡>0𝑒𝑡𝑎𝐸[𝑒𝑡𝑋].

For sums of independent variables, moment generating functions factor:

𝐸𝑒𝑡𝑖𝑋𝑖=𝑖𝐸𝑒𝑡𝑋𝑖.

Chernoff bounds are the finite-𝑛 ancestor of large deviation theory. They convert an event constraint into an exponential tilting problem. The bound is often sharp at exponential scale because the optimizing 𝑡 defines a tilted probability measure under which the rare event becomes typical. The hidden debt is existence of the moment generating function in the needed range.

21.3 Cramér’s theorem

Cramér’s theorem is the fundamental large deviation theorem for iid sample means. Let 𝑋1,𝑋2, be iid real variables and define

𝑆𝑛=1𝑛𝑖=1𝑛𝑋𝑖.

Let

Λ(𝑡)=log𝐸[𝑒𝑡𝑋1]

be the log-moment generating function. Under standard hypotheses, 𝑆𝑛 satisfies a large deviation principle with rate function

𝐼(𝑥)=sup𝑡𝑅{𝑡𝑥Λ(𝑡)}.

Thus for suitable sets 𝐴,

𝑃(𝑆𝑛𝐴)𝑒𝑛inf𝑥𝐴𝐼(𝑥).

Cramér’s theorem says empirical means deviate from their expectation with exponential cost determined by the Legendre transform of Λ. The mean 𝜇=𝐸𝑋 is the zero-cost point: 𝐼(𝜇)=0. Away from 𝜇, 𝐼(𝑥)>0 when the law is nondegenerate. The theorem gives the correct rare-event regime for averages, not merely the CLT regime.

21.4 Rate functions

A rate function 𝐼:𝑆[0,] assigns exponential cost to states. It is usually required to be lower semicontinuous. A good rate function has compact level sets:

{𝑥:𝐼(𝑥)𝑐}

compact for every 𝑐<. Goodness is a compactness condition; it prevents minimizing sequences from escaping.

A large deviation principle has two inequalities. For closed 𝐹,

lim sup𝑛1𝑛log𝑃(𝑋𝑛𝐹)inf𝑥𝐹𝐼(𝑥),

and for open 𝐺,

lim inf𝑛1𝑛log𝑃(𝑋𝑛𝐺)inf𝑥𝐺𝐼(𝑥).

The closed-set bound controls upper tails; the open-set bound guarantees enough mass near feasible points. Together they define exponential asymptotics at the topological level.

21.5 Legendre transform

The Legendre transform converts a convex log-moment generating function into a rate function:

𝐼(𝑥)=Λ(𝑥)=sup𝑡{𝑡𝑥Λ(𝑡)}.

This operation is convex duality. The parameter 𝑡 is the exponential tilt; the value 𝑥 is the forced mean. At differentiability points, the optimizer satisfies

Λ(𝑡)=𝑥.

In large deviations, the Legendre transform is not formal decoration. It is the certificate that the cheapest way to force an atypical average is to tilt the underlying distribution. The original law 𝑃 is replaced by

𝑑𝑃𝑡𝑑𝑃(𝑥)=𝑒𝑡𝑥Λ(𝑡).

Under 𝑃𝑡, the rare mean 𝑥 becomes typical. The exponential cost of this tilt is the rate 𝐼(𝑥).

21.6 Sanov theorem

Sanov’s theorem is the large deviation principle for empirical measures. If 𝑋1,,𝑋𝑛 are iid with law 𝜇, the empirical measure

𝐿𝑛=1𝑛𝑖=1𝑛𝛿𝑋𝑖

satisfies an LDP with rate function

𝐻(𝜈𝜇)={log(𝑑𝜈𝑑𝜇)𝑑𝜈,𝜈𝜇,+,otherwise.

This is relative entropy, or Kullback–Leibler divergence.

Sanov says the probability that the empirical distribution resembles 𝜈 decays like

𝑒𝑛𝐻(𝜈𝜇).

The theorem lifts large deviations from sample means to empirical laws. Cramér’s theorem follows by contraction: the empirical mean is the image of the empirical measure under 𝜈𝑥𝑑𝜈.

21.7 Varadhan lemma

Varadhan’s lemma evaluates exponential integrals under an LDP. If 𝑋𝑛 satisfies an LDP with rate 𝐼, then under suitable continuity and integrability hypotheses,

lim𝑛1𝑛log𝐸[𝑒𝑛𝐹(𝑋𝑛)]=sup𝑥{𝐹(𝑥)𝐼(𝑥)}.

This is Laplace’s principle.

The lemma says exponential integrals are dominated by the best compromise between reward 𝐹(𝑥) and cost 𝐼(𝑥). It is the probabilistic analogue of saddle-point asymptotics. The supremum is not necessarily attained at the typical point; exponential reweighting selects a new dominant state.

21.8 Gärtner–Ellis theorem

The Gärtner–Ellis theorem derives an LDP from convergence of scaled log-moment generating functions. If

Λ𝑛(𝑡)=1𝑛log𝐸[𝑒𝑛𝑡𝑋𝑛]

converges to Λ(𝑡), and Λ satisfies suitable regularity conditions, then 𝑋𝑛 satisfies an LDP with rate

𝐼(𝑥)=Λ(𝑥)=sup𝑡{𝑡𝑥Λ(𝑡)}.

This theorem generalizes Cramér’s theorem beyond iid sums. It is useful for dependent systems, statistical mechanics, Markov processes, and random matrices. The differentiability or exposed-point conditions are not cosmetic; without them, only partial lower bounds may hold. The moment-generating carrier may see convexified costs rather than all pathwise obstructions.

21.9 Moderate deviations

Moderate deviations sit between the CLT and large deviations. For iid centered variables with variance 𝜎2, consider scales 𝑎𝑛 satisfying

𝑎𝑛,𝑎𝑛𝑛0.

Then

𝑃(𝑖=1𝑛𝑋𝑖𝜎𝑛𝑎𝑛𝑥)

often behaves like a Gaussian tail at exponential scale 𝑎𝑛2:

𝑒𝑎𝑛2𝑥2/2.

The regime is larger than CLT fluctuations but smaller than order-𝑛 deviations. Moderate deviations show where Gaussian behavior remains valid into the tail. They require more tail control than the CLT but less than full Cramér-type large deviations.

21.10 Concentration versus large deviations

Concentration inequalities give nonasymptotic probability bounds such as

𝑃(𝑋𝐸𝑋𝑡)2𝑒𝑐𝑡2.

Large deviation theory gives asymptotic exponential rates:

𝑃(𝑋𝑛𝐴)𝑒𝑛inf𝐴𝐼.

Both study rare events, but their outputs differ. Concentration prioritizes finite-sample bounds, often with constants. Large deviations prioritize sharp exponential asymptotics and variational structure.

The two methods overlap but should not be conflated. A concentration bound may be nonsharp at exponential scale. An LDP may not give usable finite-𝑛 estimates. Concentration is a certificate of suppression; large deviations are a certificate of exponential cost geometry.

21.11 Rare-event regimes outside CLT

The CLT cannot control probabilities far from the central scale. If

𝑆𝑛=𝑖=1𝑛𝑋𝑖,

then the CLT describes

𝑆𝑛𝑛𝜇=𝑂(𝑛).

Events like

𝑆𝑛𝑛𝜇𝑐𝑛

belong to large deviations. Events like

𝑆𝑛𝑛𝜇𝑎𝑛𝑛,𝑎𝑛

belong to moderate deviations. Events driven by one huge summand in heavy-tailed laws may follow subexponential rather than Cramér-type asymptotics.

The audit rule is scale first, theorem second. A Gaussian approximation at the wrong scale is a carrier error. Rare-event analysis must declare whether the deviation is collective tilting, one-big-jump behavior, boundary crossing, local lattice behavior, or path-level deviation.


Chapter 22 — Filtrations and Adapted Processes

22.1 Information over time

Probability over time requires an information structure. At time 𝑡, the model should know which events are observable by then. This is encoded by a σ-algebra 𝐹𝑡. If 𝑠𝑡, then information cannot decrease:

𝐹𝑠𝐹𝑡.

The family (𝐹𝑡) is a filtration.

Without a filtration, temporal statements like “known now,” “future,” “stopping time,” “martingale,” or “strategy” are undefined. A process alone gives values; a filtration tells which values and events are available at each time. Dynamic probability is probability plus information flow.

22.2 Filtrations

A filtration on (Ω,𝐹,𝑃) is an increasing family of sub-σ-algebras:

(𝐹𝑡)𝑡𝑇,𝐹𝑠𝐹𝑡𝐹for 𝑠𝑡.

The index set 𝑇 may be discrete, such as 𝑁, or continuous, such as [0,).

The filtration is the carrier for conditional expectation over time. For an integrable random variable 𝑋, 𝐸[𝑋𝐹𝑡] is the best estimate of 𝑋 using information available at time 𝑡. Increasing 𝑡 refines the information. The tower property expresses temporal consistency.

22.3 Adapted processes

A process 𝑋𝑡 is adapted to (𝐹𝑡) if

𝑋𝑡 is 𝐹𝑡-measurable

for every 𝑡. This means the value of 𝑋𝑡 is known at time 𝑡. Adaptedness is the minimum legality condition for a process to be observable in real time.

A process can be adapted to one filtration and not another. For example, if 𝐹𝑡=𝜎(𝑋𝑠:𝑠𝑡), then 𝑋 is adapted to its natural filtration. If the filtration also contains future values, adaptedness may still hold but the model permits anticipatory information. That matters for stochastic integration and trading strategies.

22.4 Predictable processes

In discrete time, a process 𝐻𝑛 is predictable if 𝐻𝑛 is 𝐹𝑛1-measurable. The value used at time 𝑛 must be chosen using information from before time 𝑛. In continuous time, predictability is defined through the predictable σ-algebra, generated by left-continuous adapted processes.

Predictability is stronger than adaptedness and is essential for stochastic integration. An integrand in an Itô integral must not see the future increment it is integrating against. Predictability is the no-anticipation certificate. Without it, one can manufacture false gains or invalid integrals by using future noise.

22.5 Stopping times

A random time 𝜏 is a stopping time if

{𝜏𝑡}𝐹𝑡

for every 𝑡. This means that by time 𝑡, one can determine whether 𝜏 has occurred. In discrete time, equivalently {𝜏=𝑛}𝐹𝑛.

Hitting times are standard examples:

𝜏=inf{𝑡:𝑋𝑡𝐴}.

Under suitable adaptedness and path regularity, such times are stopping times. Last exit times or times defined using future behavior are usually not stopping times. The stopping-time condition is the temporal measurability gate.

22.6 Optional times

The term “optional time” is often used synonymously with stopping time in many probability texts, though in finer continuous-time theory it connects to optional σ-algebras and optional processes. Optional processes are measurable with respect to the optional σ-algebra, generated by adapted càdlàg processes.

The distinction matters in stochastic calculus. Predictable objects are known just before time 𝑡; optional objects are observable at time 𝑡. Jump processes expose the difference. At a jump time, the jump itself may be optional but not predictable. This split is central to compensators, stochastic integration, and semimartingale theory.

22.7 Natural filtration

The natural filtration of a process 𝑋𝑡 is

𝐹𝑡𝑋=𝜎(𝑋𝑠:𝑠𝑡).

It is the smallest filtration making 𝑋 adapted. It represents exactly the information generated by observing the process up to time 𝑡.

Natural filtrations are minimal carriers. In applications, one often enlarges them to include external information, initial randomness, completed null sets, or independent noise. Enlargement changes what counts as a stopping time or predictable strategy. The process is not the whole model; the filtration determines the information regime.

22.8 Completed and right-continuous filtrations

A filtration is complete if 𝐹0 contains all 𝑃-null subsets of 𝐹. It is right-continuous if

𝐹𝑡=𝑠>𝑡𝐹𝑠.

The usual conditions in continuous-time probability require completeness and right-continuity.

These conditions are technical but load-bearing. They improve behavior of stopping times, optional projections, martingale modifications, and debut theorems. Completing adds null information; right-continuity prevents infinitesimal information gaps immediately after 𝑡. Many stochastic calculus theorems assume these conditions silently.

22.9 Information leakage errors

Information leakage occurs when a process, strategy, stopping rule, or integrand uses future data while being treated as nonanticipative. For example, defining 𝐻𝑛=sign(𝑋𝑛+1𝑋𝑛) and using it as a trading strategy at time 𝑛 is illegal because it depends on future movement.

The audit is always measurability against the correct σ-algebra. Adaptedness uses 𝐹𝑡; predictability often uses 𝐹𝑡; stopping uses {𝜏𝑡}𝐹𝑡. If the information gate fails, martingale fairness, optional stopping, and stochastic integration conclusions can collapse.


Chapter 23 — Martingales

23.1 Martingales, submartingales, supermartingales

A process (𝑀𝑡,𝐹𝑡) is a martingale if 𝑀𝑡 is adapted, integrable, and

𝐸[𝑀𝑡𝐹𝑠]=𝑀𝑠,𝑠𝑡.

It is a submartingale if

𝐸[𝑀𝑡𝐹𝑠]𝑀𝑠,

and a supermartingale if the inequality is reversed.

A martingale is a fair process relative to the information filtration. It does not mean the path is constant or predictable. It means conditional future expectation equals present value. Submartingales have nonnegative conditional drift; supermartingales have nonpositive conditional drift.

23.2 Examples from sums, games, likelihood ratios

If 𝑋𝑖 are integrable martingale differences with

𝐸[𝑋𝑖𝐹𝑖1]=0,

then

𝑀𝑛=𝑖=1𝑛𝑋𝑖

is a martingale. Fair gambling capital under fair bets is the canonical example. If the bets have nonpositive expected gain, the capital is a supermartingale.

Likelihood ratios also produce martingales. If 𝑃 and 𝑄 are probability measures and 𝐹𝑡 is increasing, the density process

𝐿𝑡=𝑑𝑄𝐹𝑡𝑑𝑃𝐹𝑡

is a 𝑃-martingale under suitable absolute continuity. This connects martingales to change of measure, statistics, and filtering.

23.3 Doob decomposition

In discrete time, any integrable adapted submartingale 𝑋𝑛 admits a decomposition

𝑋𝑛=𝑀𝑛+𝐴𝑛,

where 𝑀𝑛 is a martingale and 𝐴𝑛 is a predictable increasing process with 𝐴0=0. For a general adapted integrable process, one can decompose into martingale plus predictable drift increments.

The decomposition separates fair fluctuation from systematic drift. For a submartingale, the drift part is increasing. This is the prototype for compensators in continuous time, where counting processes are decomposed into martingale plus predictable intensity.

23.4 Martingale transforms

In discrete time, if 𝑀𝑛 is a martingale with increments Δ𝑀𝑛=𝑀𝑛𝑀𝑛1, and 𝐻𝑛 is predictable, then

(𝐻𝑀)𝑛=𝑘=1𝑛𝐻𝑘Δ𝑀𝑘

is a martingale under integrability conditions. This is the martingale transform.

The predictability of 𝐻𝑘 is essential. It means the multiplier is chosen before observing Δ𝑀𝑘. Martingale transforms model fair betting strategies and form the discrete ancestor of stochastic integrals. If 𝐻𝑘 can depend on future increments, the martingale property can fail.

23.5 Optional stopping theorem

The optional stopping theorem gives conditions under which

𝐸[𝑀𝜏]=𝐸[𝑀0]

for a martingale 𝑀 and stopping time 𝜏. Sufficient conditions include bounded 𝜏, uniformly integrable stopped martingales, or suitable integrability and bounded increment conditions.

The theorem is not the false slogan “you cannot beat a martingale by stopping.” Without hypotheses, optional stopping can fail. Stopping times with infinite expectation, unbounded gains, or non-uniform integrability can create invalid expectations. The theorem is a gated result, not a universal gambling principle.

23.6 Optional sampling theorem

Optional sampling generalizes optional stopping to two stopping times 𝜎𝜏:

𝐸[𝑀𝜏𝐹𝜎]=𝑀𝜎

under suitable hypotheses. For submartingales,

𝐸[𝑋𝜏𝐹𝜎]𝑋𝜎.

This theorem says martingale fairness persists between random observation times, provided those times are legitimate stopping times and integrability conditions hold. It is central for hitting probabilities, stopped processes, stochastic calculus, and Markov process arguments.

23.7 Maximal inequalities

Doob’s maximal inequality controls the running maximum of a submartingale. For a nonnegative submartingale 𝑋𝑘,

𝜆𝑃(max𝑘𝑛𝑋𝑘𝜆)𝐸[𝑋𝑛].

For martingales, applying this to 𝑀𝑘 or 𝑀𝑘𝑝 gives pathwise maximum bounds.

Maximal inequalities convert terminal integrability into pathwise control over all earlier times. This is the martingale analogue of Kolmogorov’s inequality. It is indispensable for convergence theorems, stopping arguments, and tightness of stochastic processes.

23.8 Doob’s Lᵖ inequalities

For 𝑝>1 and a martingale 𝑀𝑘,

max𝑘𝑛𝑀𝑘𝑝𝑝𝑝1𝑀𝑛𝑝.

Equivalently,

𝐸[max𝑘𝑛𝑀𝑘𝑝](𝑝𝑝1)𝑝𝐸𝑀𝑛𝑝.

This inequality gives 𝐿𝑝 control of the entire path from 𝐿𝑝 control of the endpoint. It is one of the main reasons martingales are analytically tractable. The condition 𝑝>1 matters; 𝑝=1 has weaker forms and requires different handling.

23.9 Upcrossing inequality

For a submartingale 𝑋𝑛, the upcrossing inequality bounds the number of completed crossings of an interval [𝑎,𝑏]. If 𝑈𝑛[𝑎,𝑏] is the number of upcrossings by time 𝑛, then

(𝑏𝑎)𝐸[𝑈𝑛[𝑎,𝑏]]𝐸[(𝑋𝑛𝑎)+]𝐸[(𝑋0𝑎)+].

Upcrossings measure oscillation. If infinitely many upcrossings occur between rational levels 𝑎<𝑏, the sequence cannot converge. The inequality is therefore a convergence engine: boundedness in 𝐿1 controls oscillation, forcing almost sure convergence of submartingales under suitable assumptions.

23.10 Martingale convergence theorem

If 𝑀𝑛 is a martingale bounded in 𝐿1 in the right way, such as a nonnegative supermartingale or an 𝐿𝑝-bounded martingale for 𝑝>1, then 𝑀𝑛 converges almost surely to a limit 𝑀. Under uniform integrability,

𝑀𝑛=𝐸[𝑀𝐹𝑛]

and convergence also holds in 𝐿1.

The theorem is a central pathwise compactness result. Martingales do not have deterministic monotonicity, but their conditional fairness plus integrability prevents endless profitable oscillation. Upcrossing control is the core mechanism.

23.11 Uniform integrability of martingales

A martingale (𝑀𝑛) is uniformly integrable if

lim𝐾sup𝑛𝐸[𝑀𝑛1{𝑀𝑛>𝐾}]=0.

Uniform integrability is exactly the condition that allows 𝐿1 terminal representation:

𝑀𝑛=𝐸[𝑀𝐹𝑛].

Boundedness in 𝐿𝑝 for 𝑝>1 implies uniform integrability. Without uniform integrability, a martingale can converge almost surely but lose mass in expectation. This is the martingale version of the general expectation convergence problem.

23.12 Reverse martingales

A reverse martingale is adapted to a decreasing sequence of σ-algebras 𝐺𝑛, with

𝐸[𝑋𝑛𝐺𝑛+1]=𝑋𝑛+1

when 𝐺𝑛+1𝐺𝑛. A standard example is

𝑋𝑛=𝐸[𝑋𝐺𝑛].

Reverse martingales converge almost surely and in 𝐿1 under integrability. They are used in exchangeability, de Finetti theory, ergodic theory, and tail σ-algebra arguments. They describe what remains when information is progressively forgotten rather than accumulated.

23.13 Martingale difference sequences

A martingale difference sequence satisfies

𝐸[𝐷𝑛𝐹𝑛1]=0.

Then

𝑀𝑛=𝑘=1𝑛𝐷𝑘

is a martingale. Martingale differences generalize independent centered variables; they allow dependence while preserving zero conditional drift.

Many limit theorems and concentration inequalities for independent sums extend to martingale differences with conditional variance and boundedness assumptions. The adapted conditional-mean-zero property is the replacement certificate for independence.


Chapter 24 — Martingale Limit and Concentration Tools

24.1 Azuma–Hoeffding inequality

If 𝑀𝑛 is a martingale with bounded increments

𝑀𝑘𝑀𝑘1𝑐𝑘,

then

𝑃(𝑀𝑛𝑀0𝑡)exp(𝑡22𝑘=1𝑛𝑐𝑘2).

A two-sided version gives

𝑃(𝑀𝑛𝑀0𝑡)2exp(𝑡22𝑐𝑘2).

Azuma–Hoeffding is a concentration theorem for bounded martingale differences. It converts conditional fairness plus bounded step size into subgaussian tails. It is especially useful when independence is absent but exposure martingales can be constructed.

24.2 Freedman inequality

Freedman’s inequality refines Azuma by using conditional variance. If 𝑀𝑛 has increments bounded above by 𝑏, and predictable quadratic variation

𝑉𝑛=𝑘=1𝑛𝐸[(Δ𝑀𝑘)2𝐹𝑘1],

then

𝑃(𝑀𝑛𝑡, 𝑉𝑛𝑣)exp(𝑡22(𝑣+𝑏𝑡/3)).

This inequality is stronger when actual variance is much smaller than the worst-case squared increment bound. It is the martingale analogue of Bernstein concentration. It separates two debts: fluctuation variance 𝑣 and maximum jump 𝑏.

24.3 Burkholder–Davis–Gundy inequalities

The BDG inequalities relate moments of a martingale’s maximum to moments of its quadratic variation. In continuous time, for 𝑝>0,

𝐸[sup𝑡𝑇𝑀𝑡𝑝]𝑝𝐸[[𝑀]𝑇𝑝/2]

for suitable martingales. Here [𝑀]𝑇 is quadratic variation.

BDG is a deep path-control theorem. It says the size of a martingale path is governed by accumulated quadratic variation. This is foundational for stochastic integration, SDE estimates, tightness, and regularity of martingale-driven processes.

24.4 Exponential martingales

Exponential martingales are built by exponentiating a martingale and subtracting compensating drift. For Brownian motion 𝐵𝑡,

exp(𝜃𝐵𝑡12𝜃2𝑡)

is a martingale. In discrete time, exponential supermartingales underlie Chernoff, Hoeffding, Bernstein, and Freedman bounds.

The compensation term is the log-moment correction. Exponential martingales turn tail events into optional stopping or maximal inequality problems. They are also the mechanism behind Girsanov change of measure: exponential martingales become Radon–Nikodym densities.

24.5 Stopping and localization

Localization handles processes that are well-behaved only up to random times. A process 𝑀𝑡 is a local martingale if there exists an increasing sequence of stopping times 𝜏𝑛 such that

𝑀𝑡𝜏𝑛

is a martingale for each 𝑛.

Localization permits stochastic calculus for unbounded or explosive processes by proving statements on bounded stopped intervals and then passing to the limit. The liftback requires care: local martingales need not be true martingales, and expectation identities may fail without integrability or uniform integrability.

24.6 Quadratic variation

Quadratic variation measures accumulated squared increments. For a continuous martingale 𝑀,

[𝑀]𝑡

is the increasing process such that

𝑀𝑡2[𝑀]𝑡

is a martingale. For Brownian motion,

[𝐵]𝑡=𝑡.

Quadratic variation is the second-order clock of martingale fluctuation. In stochastic calculus, it is why Itô’s formula has a second derivative term. Ordinary differentiable paths have zero quadratic variation; Brownian paths have finite nonzero quadratic variation. This is the analytic signature of stochastic motion.

24.7 Martingale CLT

A martingale central limit theorem gives Gaussian limits for martingale difference arrays. If 𝐷𝑛,𝑘 are martingale differences and the conditional variance sums converge:

𝑘𝐸[𝐷𝑛,𝑘2𝐹𝑛,𝑘1]𝜎2

in probability, and a conditional Lindeberg condition holds, then

𝑘𝐷𝑛,𝑘𝑁(0,𝜎2).

The theorem replaces independence with conditional mean-zero plus conditional variance stabilization. It is crucial in dependent asymptotics, stochastic approximation, statistics, Markov chains, and random environments. The conditional variance process is the carrier of limiting Gaussian scale.

24.8 Concentration via bounded differences

If a function 𝑓(𝑋1,,𝑋𝑛) changes by at most 𝑐𝑖 when only coordinate 𝑖 is changed, and the 𝑋𝑖 are independent, define the Doob martingale

𝑀𝑖=𝐸[𝑓(𝑋1,,𝑋𝑛)𝑋1,,𝑋𝑖].

Its increments are bounded by 𝑐𝑖, so Azuma–Hoeffding yields

𝑃(𝑓𝐸𝑓𝑡)exp(2𝑡2𝑖𝑐𝑖2)

up to convention-dependent constants.

This is McDiarmid’s bounded differences inequality. It converts low sensitivity to individual inputs into concentration. The martingale is an exposure process: reveal inputs one at a time and track the conditional expectation.

24.9 Algorithmic and combinatorial applications

Martingale concentration applies to random graphs, randomized algorithms, hashing, load balancing, random permutations, and combinatorial optimization. The typical method is to expose random choices sequentially, define a Doob martingale, bound the effect of each exposure, and apply Azuma, Freedman, or a refined inequality.

The key is not independence alone but controlled sensitivity under information revelation. If a single input can change the output drastically, bounded differences fail. If conditional variances are small, Freedman can improve bounds. The proof engine is filtration design plus increment audit.


Chapter 25 — Random Sequences and Processes

25.1 Process as random element in path space

A stochastic process (𝑋𝑡)𝑡𝑇 can be viewed as a random element

𝑋:Ω𝑆𝑇,𝑋(𝜔)=(𝑋𝑡(𝜔))𝑡𝑇.

The path space 𝑆𝑇 carries a σ-algebra, often generated by coordinate maps. Under stronger regularity, the law may live on 𝐶[0,𝑇], 𝐷[0,𝑇], or another structured path space.

This viewpoint treats the whole trajectory as one random object. Finite-dimensional distributions become projections of the path law. Path properties such as continuity, bounded variation, hitting times, and oscillation become events in path space, requiring measurability and often regularity certificates.

25.2 Finite-dimensional distributions

The finite-dimensional distributions of a process are the laws of

(𝑋𝑡1,,𝑋𝑡𝑘)

for finite choices 𝑡1,,𝑡𝑘. These distributions describe all finite coordinate observations. They are necessary data for the process law.

For many process constructions, one first specifies finite-dimensional distributions and then checks consistency. But finite-dimensional distributions alone do not determine path regularity unless the path-space carrier is fixed. A process with continuous paths and a process with wild paths may share finite-dimensional distributions unless modifications are controlled.

25.3 Consistency conditions

Finite-dimensional distributions 𝜇𝑡1,,𝑡𝑘 must be compatible under permutation and marginalization. If one forgets a coordinate, the larger distribution must project to the smaller one:

(𝜋)𝜇𝑡1,,𝑡𝑘=𝜇𝑠1,,𝑠𝑚.

They must also be invariant under relabeling of repeated or reordered indices.

Consistency is the finite-data certificate for a global process law. Without it, no process can have those finite-dimensional distributions. With it, extension theorems can build a measure on a product path space under suitable assumptions.

25.4 Kolmogorov extension theorem

Kolmogorov’s extension theorem says that a consistent family of finite-dimensional distributions on suitable state spaces defines a probability measure on the infinite product space 𝑆𝑇. The coordinate process on that space then has the prescribed finite-dimensional laws.

The theorem provides existence of a process as a product-space random element. It does not guarantee continuous or measurable sample paths beyond the product σ-algebra. Regularity is a separate payload. This distinction is essential for Brownian motion: extension gives a process with Gaussian finite-dimensional distributions; continuity requires additional work.

25.5 Sample-path regularity

Sample-path regularity concerns properties of 𝑡𝑋𝑡(𝜔): continuity, càdlàg behavior, bounded variation, Hölder regularity, differentiability, or jump structure. These are not determined merely by one-time marginal laws.

Kolmogorov’s continuity criterion is a standard regularity tool. If

𝐸𝑋𝑡𝑋𝑠𝛼𝐶𝑡𝑠1+𝛽

for suitable 𝛼,𝛽>0, then the process has a continuous modification with Hölder-type regularity below a threshold. Moment bounds on increments become path regularity certificates.

25.6 Separability

Separability reduces uncountable-time behavior to countable dense-time behavior. A process is separable, roughly, if its path properties can be determined by values on a countable dense subset, up to null sets. This prevents uncountable measurability traps.

In continuous-time probability, events like

sup𝑡[0,𝑇]𝑋𝑡>𝑎

may not be measurable without path regularity or separability. If paths are continuous or càdlàg, the supremum over an interval can be reduced to a supremum over rational times. This is why path-space regularity and separability are not optional technicalities.

25.7 Modification and indistinguishability

Two processes 𝑋𝑡 and 𝑌𝑡 are modifications of each other if for every fixed 𝑡,

𝑃(𝑋𝑡=𝑌𝑡)=1.

They are indistinguishable if

𝑃(𝑋𝑡=𝑌𝑡 for all 𝑡)=1.

Indistinguishability is stronger, especially when 𝑇 is uncountable.

A fixed-time null set may depend on 𝑡, and uncountably many null sets cannot be unioned safely. If processes have continuous paths, modification often implies indistinguishability under suitable conditions because equality on a countable dense set extends by continuity. Without regularity, the implication can fail.

25.8 Stationarity

A process is stationary if its finite-dimensional distributions are invariant under time shifts:

(𝑋𝑡1+,,𝑋𝑡𝑘+)=d(𝑋𝑡1,,𝑋𝑡𝑘)

whenever the shifted indices are valid. Strict stationarity is full finite-dimensional invariance. Weak stationarity usually means constant mean and covariance depending only on time lag.

Stationarity is symmetry in time. It does not imply independence or ergodicity. A process may have the same distribution at all times while retaining long-range dependence or latent random environment. Stationarity supplies invariant law; ergodicity determines whether time averages collapse to constants.

25.9 Ergodicity

A stationary process is ergodic if shift-invariant events have probability 0 or 1. Equivalently, time averages converge to deterministic expectations rather than random invariant components. For an ergodic stationary sequence,

1𝑛𝑘=1𝑛𝑓(𝑋𝑘)𝐸[𝑓(𝑋1)]

almost surely under integrability.

Without ergodicity, the limit is conditional on the invariant σ-algebra:

1𝑛𝑘=1𝑛𝑓(𝑋𝑘)𝐸[𝑓(𝑋1)𝐼].

Thus ergodicity is the gate that turns ensemble averages into time averages.

25.10 Mixing

Mixing is asymptotic independence between distant time regions. A strong mixing condition may take the form

𝛼(𝑛)=sup𝐴𝐹0, 𝐵𝐹𝑛𝑃(𝐴𝐵)𝑃(𝐴)𝑃(𝐵)0.

Different mixing notions measure different strengths of dependence decay.

Mixing usually implies ergodicity, but it is stronger. It supports CLTs, concentration, empirical process results, and statistical consistency for dependent data. The rate of mixing matters: summable mixing coefficients give stronger limit theorems than mere convergence to zero.

25.11 Process-level carrier debt

A stochastic process claim must declare its path carrier, filtration, finite-dimensional laws, regularity, modification class, and convergence topology. A statement about 𝑋𝑡 at each fixed 𝑡 is weaker than a statement about the whole path. A statement about finite-dimensional convergence is weaker than weak convergence in path space.

Common process-level errors include treating modifications as indistinguishable, assuming sample continuity from Gaussian marginals, using stopping times without a filtration, or applying continuous mapping arguments in the wrong path topology. Process probability is not coordinate probability repeated; it is probability on trajectories.


Chapter 26 — Markov Chains

26.1 Markov property

A discrete-time process 𝑋𝑛 is Markov if, conditional on the present, the future is independent of the past:

𝑃(𝑋𝑛+1𝐴𝑋0,,𝑋𝑛)=𝑃(𝑋𝑛+1𝐴𝑋𝑛).

For time-homogeneous chains, this conditional law depends only on 𝑋𝑛, not on 𝑛.

The Markov property is a conditional independence statement. It does not say the variables are independent; dependence passes through the current state. The present state is a sufficient statistic for future prediction. This is the carrier principle behind transition kernels and semigroups.

26.2 Transition kernels

A transition kernel 𝑃(𝑥,𝐴) gives the probability of moving from state 𝑥 to a measurable set 𝐴. For countable spaces, this is a matrix

𝑃𝑖𝑗=𝑃(𝑋𝑛+1=𝑗𝑋𝑛=𝑖).

The 𝑛-step transition kernel is obtained by composition:

𝑃𝑛(𝑥,𝐴)=𝑃(𝑋𝑛𝐴𝑋0=𝑥).

Kernels are probability-valued functions of the current state. They define the chain law once an initial distribution 𝜇 is specified. The finite-dimensional law is

𝜇(𝑑𝑥0)𝑃(𝑥0,𝑑𝑥1)𝑃(𝑥𝑛1,𝑑𝑥𝑛).

This factorization is the Markov certificate.

26.3 Finite-state chains

For a finite state space, the transition kernel is a stochastic matrix 𝑃 with rows summing to one. If the distribution at time 𝑛 is row vector 𝜇𝑛, then

𝜇𝑛+1=𝜇𝑛𝑃,𝜇𝑛=𝜇0𝑃𝑛.

Finite chains reduce many questions to matrix analysis. Stationary distributions solve

𝜋=𝜋𝑃.

Long-term behavior depends on irreducibility, periodicity, and spectral structure. Finite-state chains are the cleanest setting for mixing time and convergence to equilibrium.

26.4 Countable-state chains

Countable-state chains require more classification than finite chains because recurrence, transience, and invariant measures become subtle. Transition probabilities still form a stochastic matrix, but stationary distributions may fail to exist or may not be normalizable.

A simple random walk on 𝑍 is recurrent but has no probability stationary distribution. A biased random walk on 𝑍 is transient. Countable chains force separation between invariant measures, stationary probabilities, recurrence classes, and positive recurrence.

26.5 Communicating classes

State 𝑖 reaches state 𝑗, written 𝑖𝑗, if

𝑃𝑛(𝑖,𝑗)>0

for some 𝑛. States communicate if 𝑖𝑗 and 𝑗𝑖. Communication is an equivalence relation; equivalence classes are communicating classes.

A class is closed if the chain cannot leave it once entered. Irreducibility means all states communicate. Long-run behavior is analyzed class by class. Transient states may lead into closed recurrent classes; closed classes trap the chain.

26.6 Recurrence and transience

A state 𝑖 is recurrent if, starting from 𝑖, the chain returns to 𝑖 with probability one. It is transient if the return probability is less than one. Equivalently, recurrence can be characterized by

𝑛=0𝑃𝑛(𝑖,𝑖)=,

while transience corresponds to finiteness of this sum.

In irreducible chains, recurrence or transience is a class property. Positive recurrence means expected return time is finite; null recurrence means return occurs almost surely but with infinite expected return time. Stationary probability distributions exist for irreducible positive recurrent chains.

26.7 Stationary distributions

A stationary distribution 𝜋 satisfies

𝜋𝑃=𝜋.

If 𝑋0𝜋, then 𝑋𝑛𝜋 for all 𝑛. Stationarity is distributional invariance under the transition kernel.

For finite irreducible chains, a stationary distribution exists and is unique. For countable chains, existence of a probability stationary distribution corresponds to positive recurrence under irreducibility. Stationarity does not automatically mean convergence to stationarity; periodicity can obstruct convergence.

26.8 Reversibility

A chain with stationary distribution 𝜋 is reversible if

𝜋𝑖𝑃𝑖𝑗=𝜋𝑗𝑃𝑗𝑖

for all states 𝑖,𝑗. This detailed balance condition says the stationary flow from 𝑖 to 𝑗 equals the flow from 𝑗 to 𝑖.

Reversibility implies stationarity by summing over 𝑖:

𝑖𝜋𝑖𝑃𝑖𝑗=𝑖𝜋𝑗𝑃𝑗𝑖=𝜋𝑗.

Reversible chains are analytically tractable because the transition operator is self-adjoint in 𝐿2(𝜋). This connects Markov chains to spectral theory.

26.9 Detailed balance

Detailed balance is the local equation

𝜋(𝑥)𝑃(𝑥,𝑑𝑦)=𝜋(𝑦)𝑃(𝑦,𝑑𝑥)

in general-state notation. It is stronger than stationarity, which only requires global balance:

𝜋(𝑑𝑥)𝑃(𝑥,𝐴)=𝜋(𝐴).

Detailed balance is widely used to construct Markov chains with desired stationary distribution. In MCMC, Metropolis–Hastings transition rules are engineered to satisfy detailed balance. The cost is that reversible chains may be slower than nonreversible alternatives; reversibility is a certificate, not always an optimal design.

26.10 Hitting times

For a set 𝐴, the hitting time is

𝜏𝐴=inf{𝑛0:𝑋𝑛𝐴}.

Hitting probabilities solve harmonic equations. For (𝑖)=𝑃𝑖(𝜏𝐴<𝜏𝐵),

(𝑖)=𝑗𝑃𝑖𝑗(𝑗)

on states outside 𝐴𝐵, with boundary values =1 on 𝐴, =0 on 𝐵.

Expected hitting times solve Poisson equations. If 𝑔(𝑖)=𝐸𝑖[𝜏𝐴], then

𝑔(𝑖)=1+𝑗𝑃𝑖𝑗𝑔(𝑗)

outside 𝐴, with 𝑔=0 on 𝐴. Hitting times connect Markov chains to potential theory.

26.11 Absorbing chains

An absorbing state 𝑎 satisfies 𝑃𝑎𝑎=1. Once entered, it cannot be left. More generally, a closed class acts as an absorbing region. Absorbing chains are analyzed by separating transient and absorbing states.

For finite absorbing chains with transient submatrix 𝑄, the fundamental matrix is

𝑁=(𝐼𝑄)1=𝑛=0𝑄𝑛.

The entry 𝑁𝑖𝑗 gives the expected number of visits to transient state 𝑗 starting from 𝑖. Absorption probabilities and expected absorption times follow from 𝑁.

26.12 Mixing time

For an irreducible aperiodic finite chain with stationary distribution 𝜋, the mixing time measures how long until the law is close to 𝜋. In total variation,

𝑡mix(𝜀)=min{𝑡:sup𝑥𝑃𝑡(𝑥,)𝜋TV𝜀}.

Mixing time quantifies convergence to equilibrium. It depends on bottlenecks, spectral gap, conductance, coupling, and geometry of the state space. Stationarity alone does not give speed. Mixing analysis is about finite-time convergence, not merely limiting invariance.

26.13 Coupling methods

A coupling of two chains constructs (𝑋𝑛,𝑌𝑛) on one probability space with each marginal following the chain. If they meet and stay together after time 𝑇, then

𝑃𝑛(𝑥,)𝑃𝑛(𝑦,)TV𝑃(𝑇>𝑛).

Coupling bounds mixing by controlling coalescence time.

Good couplings exploit contraction, monotonicity, shared randomness, or geometry. Coupling is not automatic; it is a designed joint carrier. The success of the method depends on how quickly the coupled chains meet under the chosen construction.

26.14 Spectral gap

For a reversible finite chain, the spectral gap is

𝛾=1𝜆2

where 𝜆2 is the second-largest eigenvalue in appropriate ordering. A positive spectral gap implies exponential convergence to equilibrium. Roughly,

𝑃𝑛(𝑥,)𝜋TV

decays at a rate controlled by (1𝛾)𝑛, with constants depending on 𝜋.

The spectral gap measures relaxation speed in 𝐿2(𝜋). It is connected to Poincaré inequalities, conductance, and geometry. Small spectral gap indicates bottlenecks or slow modes. Spectral methods are powerful but usually require reversibility or operator control.

26.15 Markov chain Monte Carlo preview

MCMC constructs a Markov chain whose stationary distribution is a target law 𝜋. After running the chain, samples are used to estimate

𝑓𝑑𝜋.

Metropolis–Hastings, Gibbs sampling, and Langevin algorithms are standard examples.

The validity of MCMC requires more than stationarity. One needs convergence to stationarity, adequate mixing, diagnostic control, and error estimates. A chain that has 𝜋 as invariant law but mixes slowly may be computationally useless. The MCMC audit is invariant distribution plus convergence rate plus sampling error.


Chapter 27 — Markov Processes

27.1 Markov kernels on general spaces

On a measurable state space (𝑆,𝑆), a Markov kernel 𝑃(𝑥,𝐴) satisfies: for each 𝑥, 𝐴𝑃(𝑥,𝐴) is a probability measure, and for each 𝐴, 𝑥𝑃(𝑥,𝐴) is measurable. It defines transition probabilities from states to measurable sets.

General kernels replace matrices. The Markov property becomes

𝑃(𝑋𝑡+𝑠𝐴𝐹𝑡)=𝑃𝑠(𝑋𝑡,𝐴)

for a transition family 𝑃𝑠. The measurable-space structure is essential; without it, the transition probability as a function of state may not be well-defined.

27.2 Semigroups

A Markov semigroup (𝑃𝑡)𝑡0 acts on functions by

𝑃𝑡𝑓(𝑥)=𝐸𝑥[𝑓(𝑋𝑡)].

It satisfies

𝑃0=𝐼,𝑃𝑡+𝑠=𝑃𝑡𝑃𝑠.

The second identity is the Chapman–Kolmogorov equation and encodes the Markov property.

Semigroups are operator carriers for Markov processes. Instead of tracking paths directly, one studies how expectations evolve. Semigroup methods connect probability to PDE, functional analysis, spectral theory, and generators.

27.3 Strong Markov property

The strong Markov property says the Markov property holds at stopping times. If 𝜏 is a stopping time, then conditional on 𝐹𝜏, the post-𝜏 process behaves like a fresh process started from 𝑋𝜏:

𝐸[𝑓(𝑋𝜏+𝑡)𝐹𝜏]=𝑃𝑡𝑓(𝑋𝜏).

This is stronger than the deterministic-time Markov property. It is essential for hitting times, excursions, renewal arguments, boundary problems, and potential theory. The stopping-time condition is load-bearing; arbitrary random times do not generally permit Markov restart.

27.4 Feller processes

A Feller semigroup acts on 𝐶0(𝑆), the continuous functions vanishing at infinity, and satisfies

𝑃𝑡:𝐶0(𝑆)𝐶0(𝑆),lim𝑡0𝑃𝑡𝑓𝑓=0.

A Markov process with such a semigroup is called a Feller process.

Feller structure connects stochastic processes to analytic generators. It gives regular dependence on initial conditions and supports construction through semigroup theory. Many diffusions, Lévy processes, and jump processes are Feller under suitable assumptions.

27.5 Generators

The generator 𝐴 of a semigroup is

𝐴𝑓=lim𝑡0𝑃𝑡𝑓𝑓𝑡

for functions 𝑓 where the limit exists. It describes infinitesimal evolution. For a diffusion,

𝐴𝑓=𝑏(𝑥)𝑓(𝑥)+12Tr(𝑎(𝑥)2𝑓(𝑥)).

The generator is the local analytic fingerprint of a Markov process. It determines PDEs, martingale problems, transition behavior, and invariant measures. The domain of 𝐴 is part of the data; writing a differential expression without specifying domain can be incomplete.

27.6 Resolvents

The resolvent of a semigroup is

𝑅𝜆𝑓=0𝑒𝜆𝑡𝑃𝑡𝑓𝑑𝑡,𝜆>0.

It satisfies

𝑅𝜆=(𝜆𝐼𝐴)1

formally, when 𝐴 is the generator.

Resolvents encode potential behavior: expected discounted occupation times. They are useful for constructing processes, solving boundary value problems, and analyzing recurrence. In Markov process theory, semigroup, generator, and resolvent are three equivalent analytic views when the carrier is well-behaved.

27.7 Invariant measures

A probability measure 𝜋 is invariant if

𝑃𝑡(𝑥,𝐴)𝜋(𝑑𝑥)=𝜋(𝐴)

for all 𝑡 and measurable 𝐴. Equivalently,

𝑃𝑡𝑓𝑑𝜋=𝑓𝑑𝜋.

If 𝑋0𝜋, then 𝑋𝑡𝜋 for all 𝑡.

Invariant measures describe equilibrium laws. Existence, uniqueness, and convergence to equilibrium require additional recurrence, irreducibility, Lyapunov, or compactness conditions. In continuous spaces, invariant measures may exist but convergence may fail or depend strongly on topology.

27.8 Harris recurrence

Harris recurrence is a recurrence notion for general-state Markov processes or chains. Roughly, a process is Harris recurrent if it visits every set of positive measure infinitely often with probability one, relative to an irreducibility measure.

This theory replaces communicating classes from countable chains. It provides the correct framework for ergodic theorems, invariant probability measures, and convergence in total variation on general spaces. Petite sets, minorization, and Lyapunov functions often appear as certificates for Harris recurrence and geometric ergodicity.

27.9 Regeneration

Regeneration occurs when a process probabilistically restarts, forgetting its past. A regeneration time 𝜏 splits the path into independent or conditionally independent cycles. For regenerative processes, long-run averages reduce to renewal-reward ratios:

lim𝑡1𝑡0𝑡𝑓(𝑋𝑠)𝑑𝑠=𝐸0𝜏𝑓(𝑋𝑠)𝑑𝑠𝐸𝜏

under suitable conditions.

Regeneration is a powerful replacement for independence. It decomposes dependent paths into iid cycles. In Markov chains, regeneration can be constructed through atoms or minorization/splitting techniques.

27.10 Continuous-time Markov chains

A continuous-time Markov chain on a countable state space has transition rates 𝑞𝑖𝑗 for 𝑖𝑗, with

𝑞𝑖=𝑗𝑖𝑞𝑖𝑗.

The chain waits an exponential time with rate 𝑞𝑖 in state 𝑖, then jumps to 𝑗 with probability 𝑞𝑖𝑗/𝑞𝑖.

The generator matrix 𝑄 has entries

𝑄𝑖𝑗=𝑞𝑖𝑗,𝑖𝑗,𝑄𝑖𝑖=𝑞𝑖.

Transition probabilities satisfy Kolmogorov equations:

𝑑𝑑𝑡𝑃𝑡=𝑃𝑡𝑄

or 𝑄𝑃𝑡, depending on forward/backward formulation. Explosion control is required if infinitely many jumps can occur in finite time.

27.11 Jump processes

Jump processes evolve by sudden discontinuities rather than continuous paths. A pure jump Markov process is specified by jump rates or a jump kernel. More generally, jump-diffusions combine continuous diffusion with jumps.

The generator often has the form

𝐴𝑓(𝑥)=𝑏(𝑥)𝑓(𝑥)+(𝑓(𝑦)𝑓(𝑥))𝑞(𝑥,𝑑𝑦),

possibly with compensation for small jumps. Jump processes require càdlàg path spaces and careful handling of predictable compensators, intensities, and explosion.

27.12 Birth-death processes

A birth-death process is a continuous-time Markov chain on 𝑁 with transitions only from 𝑛 to 𝑛+1 or 𝑛1. Birth rates are 𝜆𝑛, death rates are 𝜇𝑛. The generator is

𝐴𝑓(𝑛)=𝜆𝑛[𝑓(𝑛+1)𝑓(𝑛)]+𝜇𝑛[𝑓(𝑛1)𝑓(𝑛)].

Birth-death processes model queues, populations, branching-like systems, and chemical reactions. Stationary distributions, when they exist, often satisfy detailed balance:

𝜋𝑛𝜆𝑛=𝜋𝑛+1𝜇𝑛+1.

Recurrence and explosion depend on the rate growth.

27.13 Martingale problems

Given an operator 𝐴, the martingale problem asks for a process 𝑋𝑡 such that for all test functions 𝑓 in the domain,

𝑓(𝑋𝑡)𝑓(𝑋0)0𝑡𝐴𝑓(𝑋𝑠)𝑑𝑠

is a martingale. Solving the martingale problem constructs a Markov process with generator 𝐴.

This formulation avoids directly solving SDEs or transition densities. It is robust under weak convergence and is central in diffusion theory. Well-posedness means existence and uniqueness in law. If the martingale problem is well-posed, the generator determines the process law.


Chapter 28 — Brownian Motion

28.1 Construction of Brownian motion

Brownian motion 𝐵𝑡 is a process with 𝐵0=0, independent increments, Gaussian increments

𝐵𝑡𝐵𝑠𝑁(0,𝑡𝑠),

and continuous paths. Kolmogorov extension constructs a process with the correct finite-dimensional Gaussian laws; continuity is then obtained through moment estimates and a continuity theorem.

The construction separates law specification from path regularity. Gaussian finite-dimensional distributions alone do not automatically give continuous paths. Brownian motion is the continuous modification of the Gaussian process with covariance

𝐸[𝐵𝑠𝐵𝑡]=𝑠𝑡.

28.2 Gaussian processes

A Gaussian process is a process whose every finite linear combination is Gaussian. It is determined by its mean function

𝑚(𝑡)=𝐸[𝑋𝑡]

and covariance function

𝐾(𝑠,𝑡)=Cov(𝑋𝑠,𝑋𝑡).

Brownian motion is centered with 𝐾(𝑠,𝑡)=𝑠𝑡.

Gaussianity reduces finite-dimensional laws to linear algebra. Positive semidefiniteness of the covariance kernel is the consistency condition. Path properties, however, depend on metric behavior induced by

𝑑(𝑠,𝑡)2=𝐸𝑋𝑠𝑋𝑡2.

Covariance determines law; regularity requires additional analysis of increments.

28.3 Independent increments

Brownian motion has independent increments: for

0𝑡0<𝑡1<<𝑡𝑛,

the variables

𝐵𝑡1𝐵𝑡0,,𝐵𝑡𝑛𝐵𝑡𝑛1

are independent. Each increment has variance equal to its time length.

Independent increments make Brownian motion a continuous-time analogue of random walk. The Markov property, martingale property, and Gaussian transition densities follow. Independent increments are stronger than uncorrelated increments, though for Gaussian processes uncorrelated implies independent.

28.4 Continuous paths

Brownian motion has continuous paths almost surely, but those paths are almost surely nowhere differentiable. Continuity follows from moment bounds such as

𝐸𝐵𝑡𝐵𝑠𝑝=𝐶𝑝𝑡𝑠𝑝/2

and Kolmogorov continuity criteria.

The paths are rough: for small , increments are typically of size , so difference quotients behave like 1/2. This roughness is why stochastic calculus is not ordinary calculus. Brownian paths have finite quadratic variation but infinite total variation on intervals.

28.5 Scaling

Brownian motion satisfies the scaling property

(𝐵𝑐𝑡)𝑡0=d(𝑐𝐵𝑡)𝑡0.

This follows from Gaussian increments: 𝐵𝑐𝑡 has variance 𝑐𝑡, matching 𝑐𝐵𝑡.

Scaling is a structural symmetry. It determines hitting-time distributions, fractal path behavior, local time scaling, and invariance principles. Brownian motion is self-similar with exponent 1/2. This exponent is the source of diffusive scaling.

28.6 Reflection principle

The reflection principle states, for Brownian motion and 𝑎>0,

𝑃(sup0𝑠𝑡𝐵𝑠𝑎)=2𝑃(𝐵𝑡𝑎).

Paths that cross level 𝑎 are reflected after their first hitting time, giving a bijective probability argument.

This principle yields distributions of maxima and hitting times. For 𝜏𝑎=inf{𝑡:𝐵𝑡=𝑎},

𝑃(𝜏𝑎𝑡)=2𝑃(𝐵𝑡𝑎).

Reflection uses continuity and symmetry of Brownian increments; it is not available for arbitrary martingales or jump processes without modification.

28.7 Hitting times

For Brownian motion, the hitting time of level 𝑎 is

𝜏𝑎=inf{𝑡0:𝐵𝑡=𝑎}.

Its distribution satisfies

𝑃(𝜏𝑎𝑡)=2(1Φ(𝑎𝑡))

for 𝑎>0. The density is

𝑓𝜏𝑎(𝑡)=𝑎2𝜋𝑡3𝑒𝑎2/(2𝑡).

Hitting times connect Brownian motion to boundary value problems. Probabilities of hitting one boundary before another solve harmonic equations. Expected hitting times solve Poisson equations when finite. In unbounded domains, expectations may be infinite even when hitting occurs almost surely.

28.8 Brownian bridge

A Brownian bridge from 0 to 0 over [0,1] is Brownian motion conditioned on 𝐵1=0. It can be represented as

𝛽𝑡=𝐵𝑡𝑡𝐵1.

It is a centered Gaussian process with covariance

𝐸[𝛽𝑠𝛽𝑡]=𝑠𝑡𝑠𝑡.

The Brownian bridge appears in empirical process limits, Kolmogorov–Smirnov statistics, conditioned diffusions, and Gaussian process theory. It is not Brownian motion because its increments are not independent; the endpoint constraint introduces global dependence.

28.9 Brownian motion in ℝᵈ

A 𝑑-dimensional Brownian motion is

𝐵𝑡=(𝐵𝑡(1),,𝐵𝑡(𝑑)),

where the coordinates are independent one-dimensional Brownian motions. Its transition density is

𝑝𝑡(𝑥,𝑦)=1(2𝜋𝑡)𝑑/2exp(𝑦𝑥22𝑡).

The generator is

𝐴=12Δ.

Thus Brownian motion is probabilistically tied to the heat equation:

𝑡𝑢=12Δ𝑢.

The dimension 𝑑 changes recurrence, hitting probabilities, potential theory, and path intersection behavior.

28.10 Recurrence and transience

Brownian motion is recurrent in dimensions 𝑑=1,2 in the sense that it returns arbitrarily close to points or neighborhoods, while it is transient in dimensions 𝑑3. In 𝑑=1, points are hit almost surely. In 𝑑=2, points are polar in a finer sense, but neighborhoods are visited recurrently. In 𝑑3, Brownian motion tends to infinity.

This dimension transition reflects potential theory. The Green function changes behavior by dimension. Recurrence means infinite occupation of neighborhoods; transience means finite expected occupation. The threshold at dimension two is fundamental.

28.11 Quadratic variation

Brownian motion has quadratic variation

[𝐵]𝑡=𝑡.

For partitions Π𝑛 of [0,𝑡] with mesh going to zero,

[𝑢,𝑣]Π𝑛(𝐵𝑣𝐵𝑢)2𝑡

in probability, and along suitable partitions almost surely.

This property distinguishes Brownian motion from differentiable functions, whose quadratic variation is zero. Quadratic variation is why Itô calculus has correction terms. It is also the intrinsic clock of Brownian martingales.

28.12 Lévy characterization

Lévy’s characterization states that a continuous adapted process 𝑀𝑡 with 𝑀0=0 is Brownian motion if 𝑀𝑡 is a martingale and

𝑀𝑡2𝑡

is also a martingale. Equivalently, 𝑀 is a continuous martingale with quadratic variation [𝑀]𝑡=𝑡.

This theorem characterizes Brownian motion by martingale and quadratic variation properties rather than independent Gaussian increments. It is powerful in proving that transformed processes are Brownian under new measures or filtrations.

28.13 Donsker invariance principle

Donsker’s theorem says that rescaled random walks converge to Brownian motion. If 𝑋𝑖 are iid mean zero, variance one, and

𝑆𝑛(𝑡)=1𝑛𝑖=1𝑛𝑡𝑋𝑖

with interpolation, then

𝑆𝑛𝐵

in a suitable path space.

This is the process-level CLT. It upgrades convergence of one-time sums to convergence of entire paths. The theorem explains Brownian motion as the universal scaling limit of many independent small-step systems. Tightness is the additional payload beyond finite-dimensional CLT.

28.14 Brownian motion as universal limit carrier

Brownian motion is universal because it is the canonical limit of centered finite-variance cumulative noise under diffusive scaling. It appears in random walks, empirical processes, martingales, queueing, statistical estimation, stochastic differential equations, and physics.

Universality has conditions. Heavy tails lead to stable Lévy processes; long-range dependence can lead to fractional Brownian motion; jumps lead to discontinuous Lévy limits; nonstandard dependence can change scaling. Brownian motion is the universal carrier only after finite-variance, weak-dependence, and diffusive-scaling debts are paid.


Chapter 29 — Stochastic Integration

29.1 Motivation

Ordinary integration fails for Brownian-driven equations because Brownian paths have infinite variation and are nowhere differentiable. Expressions like

0𝑡𝐻𝑠𝑑𝐵𝑠

cannot be defined as Riemann–Stieltjes integrals in the usual pathwise sense for general adapted 𝐻.

Stochastic integration builds an integral using probability, martingales, and 𝐿2 limits. The integrand must be nonanticipative, typically predictable. The integral is not just a pathwise area; it is a limit of sums where the integrand is evaluated using information before the Brownian increment.

29.2 Itô integral for simple processes

For a simple predictable process

𝐻𝑡=𝑘=0𝑛1𝐻𝑘1(𝑡𝑘,𝑡𝑘+1](𝑡),

where 𝐻𝑘 is 𝐹𝑡𝑘-measurable, define

0𝑇𝐻𝑡𝑑𝐵𝑡=𝑘=0𝑛1𝐻𝑘(𝐵𝑡𝑘+1𝐵𝑡𝑘).

Predictability ensures 𝐻𝑘 is known before the increment.

For such integrals,

𝐸[0𝑇𝐻𝑡𝑑𝐵𝑡]=0,

and the Itô isometry holds:

𝐸[(0𝑇𝐻𝑡𝑑𝐵𝑡)2]=𝐸0𝑇𝐻𝑡2𝑑𝑡.

29.3 Extension by isometry

The Itô isometry extends the integral from simple predictable processes to all predictable processes satisfying

𝐸0𝑇𝐻𝑡2𝑑𝑡<.

Choose simple 𝐻𝑛𝐻 in 𝐿2(Ω×[0,𝑇]), define

𝐻𝑑𝐵=𝐿2-lim𝑛𝐻𝑛𝑑𝐵.

The isometry guarantees the definition is independent of approximating sequence. This is the measure-theoretic construction of stochastic integration: build on simple predictable objects, prove an exact norm identity, then complete the space.

29.4 Predictable integrands

An Itô integrand must be predictable or at least progressively measurable with suitable integrability, depending on formulation. Predictability encodes nonanticipation: 𝐻𝑡 may depend on past and present information but not future Brownian increments.

This condition is structural. If one uses right-endpoint values or future-dependent integrands, different integrals arise, such as Stratonovich-type integrals or anticipative integrals requiring Malliavin calculus. Itô integration is tied to adapted prediction and martingale structure.

29.5 Local martingales

The stochastic integral

𝑀𝑡=0𝑡𝐻𝑠𝑑𝐵𝑠

is a martingale when 𝐻 is square-integrable on finite horizons. Under local square integrability, it is a local martingale. Its quadratic variation is

[𝑀]𝑡=0𝑡𝐻𝑠2𝑑𝑠.

Local martingales are the natural output of stochastic integration. They behave like martingales up to localization times but may fail to be true martingales globally. This distinction matters for expectation identities, change of measure, and financial modeling.

29.6 Quadratic variation

For Itô integrals,

[0𝐻𝑠𝑑𝐵𝑠]𝑡=0𝑡𝐻𝑠2𝑑𝑠.

For two integrals,

[𝐻𝑑𝐵,𝐾𝑑𝐵]𝑡=0𝑡𝐻𝑠𝐾𝑠𝑑𝑠.

Quadratic variation is the second-order bookkeeping of stochastic calculus. It replaces ordinary differential products. The symbolic rule is

(𝑑𝐵𝑡)2=𝑑𝑡,𝑑𝐵𝑡𝑑𝑡=0,(𝑑𝑡)2=0.

This rule is mnemonic for rigorous quadratic variation calculations.

29.7 Itô formula

If 𝑋𝑡 satisfies

𝑑𝑋𝑡=𝑏𝑡𝑑𝑡+𝜎𝑡𝑑𝐵𝑡

and 𝑓𝐶2, then

𝑑𝑓(𝑋𝑡)=𝑓(𝑋𝑡)𝑑𝑋𝑡+12𝑓(𝑋𝑡)𝜎𝑡2𝑑𝑡.

Equivalently,

𝑓(𝑋𝑡)=𝑓(𝑋0)+0𝑡𝑓(𝑋𝑠)𝑏𝑠𝑑𝑠+0𝑡𝑓(𝑋𝑠)𝜎𝑠𝑑𝐵𝑠+120𝑡𝑓(𝑋𝑠)𝜎𝑠2𝑑𝑠.

The second derivative term is the Itô correction caused by quadratic variation. Ordinary chain rule fails because Brownian increments have square size comparable to 𝑑𝑡. Itô formula is the fundamental calculus rule for diffusion processes.

29.8 Stochastic integration by parts

For semimartingales 𝑋,𝑌,

𝑑(𝑋𝑡𝑌𝑡)=𝑋𝑡𝑑𝑌𝑡+𝑌𝑡𝑑𝑋𝑡+𝑑[𝑋,𝑌]𝑡.

Integrated,

𝑋𝑡𝑌𝑡=𝑋0𝑌0+0𝑡𝑋𝑠𝑑𝑌𝑠+0𝑡𝑌𝑠𝑑𝑋𝑠+[𝑋,𝑌]𝑡

with predictable convention refinements depending on exact formulation.

The extra quadratic covariation term is the stochastic correction. If one process has finite variation, its quadratic covariation with a continuous martingale is zero. Integration by parts is essential for exponential martingales, product processes, and SDE transformations.

29.9 Exponential martingales

For Brownian motion,

𝐸𝑡(𝜃𝐵)=exp(𝜃𝐵𝑡12𝜃2𝑡)

is a martingale. More generally, for suitable 𝐻,

𝐸𝑡(𝐻𝑑𝐵)=exp(0𝑡𝐻𝑠𝑑𝐵𝑠120𝑡𝐻𝑠2𝑑𝑠)

is a local martingale and under conditions such as Novikov’s condition is a true martingale.

Exponential martingales are density processes for changing measure. They also yield concentration bounds and solve linear stochastic equations. The correction 12𝐻2 compensates quadratic variation.

29.10 Girsanov theorem

Girsanov’s theorem describes how Brownian drift changes under an equivalent measure. If

𝑍𝑡=exp(0𝑡𝜃𝑠𝑑𝐵𝑠120𝑡𝜃𝑠2𝑑𝑠)

is a martingale and defines 𝑑𝑄=𝑍𝑇𝑑𝑃, then

𝐵~𝑡=𝐵𝑡+0𝑡𝜃𝑠𝑑𝑠

is Brownian motion under 𝑄.

The theorem is a change-of-measure engine. It transforms drift into likelihood ratio. It is central in filtering, finance, diffusion theory, rare-event simulation, and stochastic control. The martingale condition on 𝑍 is the integrability gate; without it, the measure change may fail.

29.11 Martingale representation theorem

In a Brownian filtration, every square-integrable martingale 𝑀𝑡 can be represented as

𝑀𝑡=𝑀0+0𝑡𝐻𝑠𝑑𝐵𝑠

for some predictable 𝐻 with suitable square integrability. This says Brownian stochastic integrals span martingales in the Brownian information carrier.

The theorem is foundational for hedging in mathematical finance, filtering, and stochastic control. It is filtration-specific. In a larger filtration or with jump noise, representation may require additional martingale drivers. The theorem is not a universal statement about all martingales; it is a carrier theorem for Brownian filtrations.

Comments

Popular posts from this blog

Semiotics Rebooted

THE COLLAPSE ENGINE: AI, Capital, and the Terminal Logic of 2025

ORSI: The Telic Geometry of Meaning