Probability Theory — Chapters 21

Chapter 21 — Large Deviations

21.1 Exponential scales

Large deviation theory studies probabilities that decay exponentially in a scale parameter. Instead of asking whether $𝑋_{𝑛} \Rightarrow 𝑋$ , it asks for asymptotics of the form

𝑃 (𝑋_{𝑛} \in 𝐴) \approx 𝑒^{- 𝑛 \inf_{𝑥 \in 𝐴} 𝐼 (𝑥)} .

The scale $𝑛$ is not always the sample size; it may be $𝑡$ , $1 / 𝜀$ , $𝑛^{2}$ , volume, inverse noise, or another problem-specific growth parameter. The object $𝐼$ is the rate function. Events with small $𝐼$ are less expensive; events with large $𝐼$ are exponentially suppressed.

The exponential scale is invisible to the central limit theorem. A CLT sees fluctuations of size $\sqrt{𝑛}$ , while large deviations often concern deviations of size $𝑛$ . For iid $𝑋_{𝑖}$ with mean $𝜇$ , the event

\frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖} \approx 𝑎 \neq 𝜇

usually has probability $𝑒^{- 𝑛 𝐼 (𝑎)}$ , not Gaussian-scale probability. The correct question is not “does the probability go to zero?” but “what exponential cost does the deviation pay?”

21.2 Chernoff bounds

Chernoff bounds use exponential moments to estimate tail probabilities. For any $𝑡 > 0$ ,

𝑃 (𝑋 \geq 𝑎) = 𝑃 (𝑒^{𝑡 𝑋} \geq 𝑒^{𝑡 𝑎}) \leq 𝑒^{- 𝑡 𝑎} 𝐸 [𝑒^{𝑡 𝑋}] .

Optimizing over $𝑡$ gives

𝑃 (𝑋 \geq 𝑎) \leq \inf_{𝑡 > 0} 𝑒^{- 𝑡 𝑎} 𝐸 [𝑒^{𝑡 𝑋}] .

For sums of independent variables, moment generating functions factor:

𝐸 𝑒^{𝑡 \sum_{𝑖} 𝑋_{𝑖}} = \prod_{𝑖} 𝐸 𝑒^{𝑡 𝑋_{𝑖}} .

Chernoff bounds are the finite- $𝑛$ ancestor of large deviation theory. They convert an event constraint into an exponential tilting problem. The bound is often sharp at exponential scale because the optimizing $𝑡$ defines a tilted probability measure under which the rare event becomes typical. The hidden debt is existence of the moment generating function in the needed range.

21.3 Cramér’s theorem

Cramér’s theorem is the fundamental large deviation theorem for iid sample means. Let $𝑋_{1}, 𝑋_{2}, \dots$ be iid real variables and define

𝑆_{𝑛} = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖} .

Let

Λ (𝑡) = \log 𝐸 [𝑒^{𝑡 𝑋_{1}}]

be the log-moment generating function. Under standard hypotheses, $𝑆_{𝑛}$ satisfies a large deviation principle with rate function

𝐼 (𝑥) = \sup_{𝑡 \in 𝑅} {𝑡 𝑥 - Λ (𝑡)} .

Thus for suitable sets $𝐴$ ,

𝑃 (𝑆_{𝑛} \in 𝐴) \approx 𝑒^{- 𝑛 \inf_{𝑥 \in 𝐴} 𝐼 (𝑥)} .

Cramér’s theorem says empirical means deviate from their expectation with exponential cost determined by the Legendre transform of $Λ$ . The mean $𝜇 = 𝐸 𝑋$ is the zero-cost point: $𝐼 (𝜇) = 0$ . Away from $𝜇$ , $𝐼 (𝑥) > 0$ when the law is nondegenerate. The theorem gives the correct rare-event regime for averages, not merely the CLT regime.

21.4 Rate functions

A rate function $𝐼 : 𝑆 \to [0, \infty]$ assigns exponential cost to states. It is usually required to be lower semicontinuous. A good rate function has compact level sets:

{𝑥 : 𝐼 (𝑥) \leq 𝑐}

compact for every $𝑐 < \infty$ . Goodness is a compactness condition; it prevents minimizing sequences from escaping.

A large deviation principle has two inequalities. For closed $𝐹$ ,

\underset{𝑛 \to \infty}{lim sup} \frac{1}{𝑛} \log 𝑃 (𝑋_{𝑛} \in 𝐹) \leq - \inf_{𝑥 \in 𝐹} 𝐼 (𝑥),

and for open $𝐺$ ,

\underset{𝑛 \to \infty}{lim inf} \frac{1}{𝑛} \log 𝑃 (𝑋_{𝑛} \in 𝐺) \geq - \inf_{𝑥 \in 𝐺} 𝐼 (𝑥) .

The closed-set bound controls upper tails; the open-set bound guarantees enough mass near feasible points. Together they define exponential asymptotics at the topological level.

21.5 Legendre transform

The Legendre transform converts a convex log-moment generating function into a rate function:

𝐼 (𝑥) = Λ^{*} (𝑥) = \sup_{𝑡} {𝑡 𝑥 - Λ (𝑡)} .

This operation is convex duality. The parameter $𝑡$ is the exponential tilt; the value $𝑥$ is the forced mean. At differentiability points, the optimizer satisfies

Λ^{'} (𝑡) = 𝑥 .

In large deviations, the Legendre transform is not formal decoration. It is the certificate that the cheapest way to force an atypical average is to tilt the underlying distribution. The original law $𝑃$ is replaced by

\frac{𝑑 𝑃_{𝑡}}{𝑑 𝑃} (𝑥) = 𝑒^{𝑡 𝑥 - Λ (𝑡)} .

Under $𝑃_{𝑡}$ , the rare mean $𝑥$ becomes typical. The exponential cost of this tilt is the rate $𝐼 (𝑥)$ .

21.6 Sanov theorem

Sanov’s theorem is the large deviation principle for empirical measures. If $𝑋_{1}, \dots, 𝑋_{𝑛}$ are iid with law $𝜇$ , the empirical measure

𝐿_{𝑛} = \frac{1}{𝑛} \sum_{𝑖 = 1}^{𝑛} 𝛿_{𝑋_{𝑖}}

satisfies an LDP with rate function

𝐻 (𝜈 ∣ 𝜇) = {\begin{cases} \int \log (\frac{𝑑 𝜈}{𝑑 𝜇}) 𝑑 𝜈, & 𝜈 ≪ 𝜇, \\ + \infty, & otherwise . \end{cases}

This is relative entropy, or Kullback–Leibler divergence.

Sanov says the probability that the empirical distribution resembles $𝜈$ decays like

𝑒^{- 𝑛 𝐻 (𝜈 ∣ 𝜇)} .

The theorem lifts large deviations from sample means to empirical laws. Cramér’s theorem follows by contraction: the empirical mean is the image of the empirical measure under $𝜈 \mapsto \int 𝑥 𝑑 𝜈$ .

21.7 Varadhan lemma

Varadhan’s lemma evaluates exponential integrals under an LDP. If $𝑋_{𝑛}$ satisfies an LDP with rate $𝐼$ , then under suitable continuity and integrability hypotheses,

\lim_{𝑛 \to \infty} \frac{1}{𝑛} \log 𝐸 [𝑒^{𝑛 𝐹 (𝑋_{𝑛})}] = \sup_{𝑥} {𝐹 (𝑥) - 𝐼 (𝑥)} .

This is Laplace’s principle.

The lemma says exponential integrals are dominated by the best compromise between reward $𝐹 (𝑥)$ and cost $𝐼 (𝑥)$ . It is the probabilistic analogue of saddle-point asymptotics. The supremum is not necessarily attained at the typical point; exponential reweighting selects a new dominant state.

21.8 Gärtner–Ellis theorem

The Gärtner–Ellis theorem derives an LDP from convergence of scaled log-moment generating functions. If

Λ_{𝑛} (𝑡) = \frac{1}{𝑛} \log 𝐸 [𝑒^{𝑛 𝑡 𝑋_{𝑛}}]

converges to $Λ (𝑡)$ , and $Λ$ satisfies suitable regularity conditions, then $𝑋_{𝑛}$ satisfies an LDP with rate

𝐼 (𝑥) = Λ^{*} (𝑥) = \sup_{𝑡} {𝑡 𝑥 - Λ (𝑡)} .

This theorem generalizes Cramér’s theorem beyond iid sums. It is useful for dependent systems, statistical mechanics, Markov processes, and random matrices. The differentiability or exposed-point conditions are not cosmetic; without them, only partial lower bounds may hold. The moment-generating carrier may see convexified costs rather than all pathwise obstructions.

21.9 Moderate deviations

Moderate deviations sit between the CLT and large deviations. For iid centered variables with variance $𝜎^{2}$ , consider scales $𝑎_{𝑛}$ satisfying

𝑎_{𝑛} \to \infty, \frac{𝑎_{𝑛}}{\sqrt{𝑛}} \to 0.

Then

𝑃 (\frac{\sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖}}{𝜎 \sqrt{𝑛}} \geq 𝑎_{𝑛} 𝑥)

often behaves like a Gaussian tail at exponential scale $𝑎_{𝑛}^{2}$ :

\approx 𝑒^{- 𝑎_{𝑛}^{2} 𝑥^{2} / 2} .

The regime is larger than CLT fluctuations but smaller than order- $𝑛$ deviations. Moderate deviations show where Gaussian behavior remains valid into the tail. They require more tail control than the CLT but less than full Cramér-type large deviations.

21.10 Concentration versus large deviations

Concentration inequalities give nonasymptotic probability bounds such as

𝑃 (∣ 𝑋 - 𝐸 𝑋 ∣ \geq 𝑡) \leq 2 𝑒^{- 𝑐 𝑡^{2}} .

Large deviation theory gives asymptotic exponential rates:

𝑃 (𝑋_{𝑛} \in 𝐴) \approx 𝑒^{- 𝑛 \inf_{𝐴} 𝐼} .

Both study rare events, but their outputs differ. Concentration prioritizes finite-sample bounds, often with constants. Large deviations prioritize sharp exponential asymptotics and variational structure.

The two methods overlap but should not be conflated. A concentration bound may be nonsharp at exponential scale. An LDP may not give usable finite- $𝑛$ estimates. Concentration is a certificate of suppression; large deviations are a certificate of exponential cost geometry.

21.11 Rare-event regimes outside CLT

The CLT cannot control probabilities far from the central scale. If

𝑆_{𝑛} = \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖},

then the CLT describes

𝑆_{𝑛} - 𝑛 𝜇 = 𝑂 (\sqrt{𝑛}) .

Events like

𝑆_{𝑛} - 𝑛 𝜇 \geq 𝑐 𝑛

belong to large deviations. Events like

𝑆_{𝑛} - 𝑛 𝜇 \geq 𝑎_{𝑛} \sqrt{𝑛}, 𝑎_{𝑛} \to \infty

belong to moderate deviations. Events driven by one huge summand in heavy-tailed laws may follow subexponential rather than Cramér-type asymptotics.

The audit rule is scale first, theorem second. A Gaussian approximation at the wrong scale is a carrier error. Rare-event analysis must declare whether the deviation is collective tilting, one-big-jump behavior, boundary crossing, local lattice behavior, or path-level deviation.

Chapter 22 — Filtrations and Adapted Processes

22.1 Information over time

Probability over time requires an information structure. At time $𝑡$ , the model should know which events are observable by then. This is encoded by a σ-algebra $𝐹_{𝑡}$ . If $𝑠 \leq 𝑡$ , then information cannot decrease:

𝐹_{𝑠} \subseteq 𝐹_{𝑡} .

The family $(𝐹_{𝑡})$ is a filtration.

Without a filtration, temporal statements like “known now,” “future,” “stopping time,” “martingale,” or “strategy” are undefined. A process alone gives values; a filtration tells which values and events are available at each time. Dynamic probability is probability plus information flow.

22.2 Filtrations

A filtration on $(Ω, 𝐹, 𝑃)$ is an increasing family of sub-σ-algebras:

(𝐹_{𝑡})_{𝑡 \in 𝑇}, 𝐹_{𝑠} \subseteq 𝐹_{𝑡} \subseteq 𝐹 for 𝑠 \leq 𝑡 .

The index set $𝑇$ may be discrete, such as $𝑁$ , or continuous, such as $[0, \infty)$ .

The filtration is the carrier for conditional expectation over time. For an integrable random variable $𝑋$ , $𝐸 [𝑋 ∣ 𝐹_{𝑡}]$ is the best estimate of $𝑋$ using information available at time $𝑡$ . Increasing $𝑡$ refines the information. The tower property expresses temporal consistency.

22.3 Adapted processes

A process $𝑋_{𝑡}$ is adapted to $(𝐹_{𝑡})$ if

𝑋_{𝑡} is 𝐹_{𝑡} -measurable

for every $𝑡$ . This means the value of $𝑋_{𝑡}$ is known at time $𝑡$ . Adaptedness is the minimum legality condition for a process to be observable in real time.

A process can be adapted to one filtration and not another. For example, if $𝐹_{𝑡} = 𝜎 (𝑋_{𝑠} : 𝑠 \leq 𝑡)$ , then $𝑋$ is adapted to its natural filtration. If the filtration also contains future values, adaptedness may still hold but the model permits anticipatory information. That matters for stochastic integration and trading strategies.

22.4 Predictable processes

In discrete time, a process $𝐻_{𝑛}$ is predictable if $𝐻_{𝑛}$ is $𝐹_{𝑛 - 1}$ -measurable. The value used at time $𝑛$ must be chosen using information from before time $𝑛$ . In continuous time, predictability is defined through the predictable σ-algebra, generated by left-continuous adapted processes.

Predictability is stronger than adaptedness and is essential for stochastic integration. An integrand in an Itô integral must not see the future increment it is integrating against. Predictability is the no-anticipation certificate. Without it, one can manufacture false gains or invalid integrals by using future noise.

22.5 Stopping times

A random time $𝜏$ is a stopping time if

{𝜏 \leq 𝑡} \in 𝐹_{𝑡}

for every $𝑡$ . This means that by time $𝑡$ , one can determine whether $𝜏$ has occurred. In discrete time, equivalently ${𝜏 = 𝑛} \in 𝐹_{𝑛}$ .

Hitting times are standard examples:

𝜏 = \inf {𝑡 : 𝑋_{𝑡} \in 𝐴} .

Under suitable adaptedness and path regularity, such times are stopping times. Last exit times or times defined using future behavior are usually not stopping times. The stopping-time condition is the temporal measurability gate.

22.6 Optional times

The term “optional time” is often used synonymously with stopping time in many probability texts, though in finer continuous-time theory it connects to optional σ-algebras and optional processes. Optional processes are measurable with respect to the optional σ-algebra, generated by adapted càdlàg processes.

The distinction matters in stochastic calculus. Predictable objects are known just before time $𝑡$ ; optional objects are observable at time $𝑡$ . Jump processes expose the difference. At a jump time, the jump itself may be optional but not predictable. This split is central to compensators, stochastic integration, and semimartingale theory.

22.7 Natural filtration

The natural filtration of a process $𝑋_{𝑡}$ is

𝐹_{𝑡}^{𝑋} = 𝜎 (𝑋_{𝑠} : 𝑠 \leq 𝑡) .

It is the smallest filtration making $𝑋$ adapted. It represents exactly the information generated by observing the process up to time $𝑡$ .

Natural filtrations are minimal carriers. In applications, one often enlarges them to include external information, initial randomness, completed null sets, or independent noise. Enlargement changes what counts as a stopping time or predictable strategy. The process is not the whole model; the filtration determines the information regime.

22.8 Completed and right-continuous filtrations

A filtration is complete if $𝐹_{0}$ contains all $𝑃$ -null subsets of $𝐹$ . It is right-continuous if

𝐹_{𝑡} = ⋂_{𝑠 > 𝑡} 𝐹_{𝑠} .

The usual conditions in continuous-time probability require completeness and right-continuity.

These conditions are technical but load-bearing. They improve behavior of stopping times, optional projections, martingale modifications, and debut theorems. Completing adds null information; right-continuity prevents infinitesimal information gaps immediately after $𝑡$ . Many stochastic calculus theorems assume these conditions silently.

22.9 Information leakage errors

Information leakage occurs when a process, strategy, stopping rule, or integrand uses future data while being treated as nonanticipative. For example, defining $𝐻_{𝑛} = sign (𝑋_{𝑛 + 1} - 𝑋_{𝑛})$ and using it as a trading strategy at time $𝑛$ is illegal because it depends on future movement.

The audit is always measurability against the correct σ-algebra. Adaptedness uses $𝐹_{𝑡}$ ; predictability often uses $𝐹_{𝑡 -}$ ; stopping uses ${𝜏 \leq 𝑡} \in 𝐹_{𝑡}$ . If the information gate fails, martingale fairness, optional stopping, and stochastic integration conclusions can collapse.

Chapter 23 — Martingales

23.1 Martingales, submartingales, supermartingales

A process $(𝑀_{𝑡}, 𝐹_{𝑡})$ is a martingale if $𝑀_{𝑡}$ is adapted, integrable, and

𝐸 [𝑀_{𝑡} ∣ 𝐹_{𝑠}] = 𝑀_{𝑠}, 𝑠 \leq 𝑡 .

It is a submartingale if

𝐸 [𝑀_{𝑡} ∣ 𝐹_{𝑠}] \geq 𝑀_{𝑠},

and a supermartingale if the inequality is reversed.

A martingale is a fair process relative to the information filtration. It does not mean the path is constant or predictable. It means conditional future expectation equals present value. Submartingales have nonnegative conditional drift; supermartingales have nonpositive conditional drift.

23.2 Examples from sums, games, likelihood ratios

If $𝑋_{𝑖}$ are integrable martingale differences with

𝐸 [𝑋_{𝑖} ∣ 𝐹_{𝑖 - 1}] = 0,

then

𝑀_{𝑛} = \sum_{𝑖 = 1}^{𝑛} 𝑋_{𝑖}

is a martingale. Fair gambling capital under fair bets is the canonical example. If the bets have nonpositive expected gain, the capital is a supermartingale.

Likelihood ratios also produce martingales. If $𝑃$ and $𝑄$ are probability measures and $𝐹_{𝑡}$ is increasing, the density process

𝐿_{𝑡} = \frac{𝑑 𝑄 ∣_{𝐹_{𝑡}}}{𝑑 𝑃 ∣_{𝐹_{𝑡}}}

is a $𝑃$ -martingale under suitable absolute continuity. This connects martingales to change of measure, statistics, and filtering.

23.3 Doob decomposition

In discrete time, any integrable adapted submartingale $𝑋_{𝑛}$ admits a decomposition

𝑋_{𝑛} = 𝑀_{𝑛} + 𝐴_{𝑛},

where $𝑀_{𝑛}$ is a martingale and $𝐴_{𝑛}$ is a predictable increasing process with $𝐴_{0} = 0$ . For a general adapted integrable process, one can decompose into martingale plus predictable drift increments.

The decomposition separates fair fluctuation from systematic drift. For a submartingale, the drift part is increasing. This is the prototype for compensators in continuous time, where counting processes are decomposed into martingale plus predictable intensity.

23.4 Martingale transforms

In discrete time, if $𝑀_{𝑛}$ is a martingale with increments $Δ 𝑀_{𝑛} = 𝑀_{𝑛} - 𝑀_{𝑛 - 1}$ , and $𝐻_{𝑛}$ is predictable, then

(𝐻 \cdot 𝑀)_{𝑛} = \sum_{𝑘 = 1}^{𝑛} 𝐻_{𝑘} Δ 𝑀_{𝑘}

is a martingale under integrability conditions. This is the martingale transform.

The predictability of $𝐻_{𝑘}$ is essential. It means the multiplier is chosen before observing $Δ 𝑀_{𝑘}$ . Martingale transforms model fair betting strategies and form the discrete ancestor of stochastic integrals. If $𝐻_{𝑘}$ can depend on future increments, the martingale property can fail.

23.5 Optional stopping theorem

The optional stopping theorem gives conditions under which

𝐸 [𝑀_{𝜏}] = 𝐸 [𝑀_{0}]

for a martingale $𝑀$ and stopping time $𝜏$ . Sufficient conditions include bounded $𝜏$ , uniformly integrable stopped martingales, or suitable integrability and bounded increment conditions.

The theorem is not the false slogan “you cannot beat a martingale by stopping.” Without hypotheses, optional stopping can fail. Stopping times with infinite expectation, unbounded gains, or non-uniform integrability can create invalid expectations. The theorem is a gated result, not a universal gambling principle.

23.6 Optional sampling theorem

Optional sampling generalizes optional stopping to two stopping times $𝜎 \leq 𝜏$ :

𝐸 [𝑀_{𝜏} ∣ 𝐹_{𝜎}] = 𝑀_{𝜎}

under suitable hypotheses. For submartingales,

𝐸 [𝑋_{𝜏} ∣ 𝐹_{𝜎}] \geq 𝑋_{𝜎} .

This theorem says martingale fairness persists between random observation times, provided those times are legitimate stopping times and integrability conditions hold. It is central for hitting probabilities, stopped processes, stochastic calculus, and Markov process arguments.

23.7 Maximal inequalities

Doob’s maximal inequality controls the running maximum of a submartingale. For a nonnegative submartingale $𝑋_{𝑘}$ ,

𝜆 𝑃 (\max_{𝑘 \leq 𝑛} 𝑋_{𝑘} \geq 𝜆) \leq 𝐸 [𝑋_{𝑛}] .

For martingales, applying this to $∣ 𝑀_{𝑘} ∣$ or $∣ 𝑀_{𝑘} ∣^{𝑝}$ gives pathwise maximum bounds.

Maximal inequalities convert terminal integrability into pathwise control over all earlier times. This is the martingale analogue of Kolmogorov’s inequality. It is indispensable for convergence theorems, stopping arguments, and tightness of stochastic processes.

23.8 Doob’s `Lᵖ` inequalities

For $𝑝 > 1$ and a martingale $𝑀_{𝑘}$ ,

{∥ \max_{𝑘 \leq 𝑛} ∣ 𝑀_{𝑘} ∣ ∥}_{𝑝} \leq \frac{𝑝}{𝑝 - 1} ∥ 𝑀_{𝑛} ∥_{𝑝} .

Equivalently,

𝐸 [\max_{𝑘 \leq 𝑛} ∣ 𝑀_{𝑘} ∣^{𝑝}] \leq {(\frac{𝑝}{𝑝 - 1})}^{𝑝} 𝐸 ∣ 𝑀_{𝑛} ∣^{𝑝} .

This inequality gives $𝐿^{𝑝}$ control of the entire path from $𝐿^{𝑝}$ control of the endpoint. It is one of the main reasons martingales are analytically tractable. The condition $𝑝 > 1$ matters; $𝑝 = 1$ has weaker forms and requires different handling.

23.9 Upcrossing inequality

For a submartingale $𝑋_{𝑛}$ , the upcrossing inequality bounds the number of completed crossings of an interval $[𝑎, 𝑏]$ . If $𝑈_{𝑛} [𝑎, 𝑏]$ is the number of upcrossings by time $𝑛$ , then

(𝑏 - 𝑎) 𝐸 [𝑈_{𝑛} [𝑎, 𝑏]] \leq 𝐸 [(𝑋_{𝑛} - 𝑎)^{+}] - 𝐸 [(𝑋_{0} - 𝑎)^{+}] .

Upcrossings measure oscillation. If infinitely many upcrossings occur between rational levels $𝑎 < 𝑏$ , the sequence cannot converge. The inequality is therefore a convergence engine: boundedness in $𝐿^{1}$ controls oscillation, forcing almost sure convergence of submartingales under suitable assumptions.

23.10 Martingale convergence theorem

If $𝑀_{𝑛}$ is a martingale bounded in $𝐿^{1}$ in the right way, such as a nonnegative supermartingale or an $𝐿^{𝑝}$ -bounded martingale for $𝑝 > 1$ , then $𝑀_{𝑛}$ converges almost surely to a limit $𝑀_{\infty}$ . Under uniform integrability,

𝑀_{𝑛} = 𝐸 [𝑀_{\infty} ∣ 𝐹_{𝑛}]

and convergence also holds in $𝐿^{1}$ .

The theorem is a central pathwise compactness result. Martingales do not have deterministic monotonicity, but their conditional fairness plus integrability prevents endless profitable oscillation. Upcrossing control is the core mechanism.

23.11 Uniform integrability of martingales

A martingale $(𝑀_{𝑛})$ is uniformly integrable if

\lim_{𝐾 \to \infty} \sup_{𝑛} 𝐸 [∣ 𝑀_{𝑛} ∣ 1_{{∣ 𝑀_{𝑛} ∣ > 𝐾}}] = 0.

Uniform integrability is exactly the condition that allows $𝐿^{1}$ terminal representation:

𝑀_{𝑛} = 𝐸 [𝑀_{\infty} ∣ 𝐹_{𝑛}] .

Boundedness in $𝐿^{𝑝}$ for $𝑝 > 1$ implies uniform integrability. Without uniform integrability, a martingale can converge almost surely but lose mass in expectation. This is the martingale version of the general expectation convergence problem.

23.12 Reverse martingales

A reverse martingale is adapted to a decreasing sequence of σ-algebras $𝐺_{𝑛}$ , with

𝐸 [𝑋_{𝑛} ∣ 𝐺_{𝑛 + 1}] = 𝑋_{𝑛 + 1}

when $𝐺_{𝑛 + 1} \subseteq 𝐺_{𝑛}$ . A standard example is

𝑋_{𝑛} = 𝐸 [𝑋 ∣ 𝐺_{𝑛}] .

Reverse martingales converge almost surely and in $𝐿^{1}$ under integrability. They are used in exchangeability, de Finetti theory, ergodic theory, and tail σ-algebra arguments. They describe what remains when information is progressively forgotten rather than accumulated.

23.13 Martingale difference sequences

A martingale difference sequence satisfies

𝐸 [𝐷_{𝑛} ∣ 𝐹_{𝑛 - 1}] = 0.

Then

𝑀_{𝑛} = \sum_{𝑘 = 1}^{𝑛} 𝐷_{𝑘}

is a martingale. Martingale differences generalize independent centered variables; they allow dependence while preserving zero conditional drift.

Many limit theorems and concentration inequalities for independent sums extend to martingale differences with conditional variance and boundedness assumptions. The adapted conditional-mean-zero property is the replacement certificate for independence.

Chapter 24 — Martingale Limit and Concentration Tools

24.1 Azuma–Hoeffding inequality

If $𝑀_{𝑛}$ is a martingale with bounded increments

∣ 𝑀_{𝑘} - 𝑀_{𝑘 - 1} ∣ \leq 𝑐_{𝑘},

then

𝑃 (𝑀_{𝑛} - 𝑀_{0} \geq 𝑡) \leq \exp (- \frac{𝑡^{2}}{2 \sum_{𝑘 = 1}^{𝑛} 𝑐_{𝑘}^{2}}) .

A two-sided version gives

𝑃 (∣ 𝑀_{𝑛} - 𝑀_{0} ∣ \geq 𝑡) \leq 2 \exp (- \frac{𝑡^{2}}{2 \sum 𝑐_{𝑘}^{2}}) .

Azuma–Hoeffding is a concentration theorem for bounded martingale differences. It converts conditional fairness plus bounded step size into subgaussian tails. It is especially useful when independence is absent but exposure martingales can be constructed.

24.2 Freedman inequality

Freedman’s inequality refines Azuma by using conditional variance. If $𝑀_{𝑛}$ has increments bounded above by $𝑏$ , and predictable quadratic variation

𝑉_{𝑛} = \sum_{𝑘 = 1}^{𝑛} 𝐸 [(Δ 𝑀_{𝑘})^{2} ∣ 𝐹_{𝑘 - 1}],

then

𝑃 (𝑀_{𝑛} \geq 𝑡, 𝑉_{𝑛} \leq 𝑣) \leq \exp (- \frac{𝑡^{2}}{2 (𝑣 + 𝑏 𝑡 / 3)}) .

This inequality is stronger when actual variance is much smaller than the worst-case squared increment bound. It is the martingale analogue of Bernstein concentration. It separates two debts: fluctuation variance $𝑣$ and maximum jump $𝑏$ .

24.3 Burkholder–Davis–Gundy inequalities

The BDG inequalities relate moments of a martingale’s maximum to moments of its quadratic variation. In continuous time, for $𝑝 > 0$ ,

𝐸 [\sup_{𝑡 \leq 𝑇} ∣ 𝑀_{𝑡} ∣^{𝑝}] ≍_{𝑝} 𝐸 [[𝑀]_{𝑇}^{𝑝 / 2}]

for suitable martingales. Here $[𝑀]_{𝑇}$ is quadratic variation.

BDG is a deep path-control theorem. It says the size of a martingale path is governed by accumulated quadratic variation. This is foundational for stochastic integration, SDE estimates, tightness, and regularity of martingale-driven processes.

24.4 Exponential martingales

Exponential martingales are built by exponentiating a martingale and subtracting compensating drift. For Brownian motion $𝐵_{𝑡}$ ,

\exp (𝜃 𝐵_{𝑡} - \frac{1}{2} 𝜃^{2} 𝑡)

is a martingale. In discrete time, exponential supermartingales underlie Chernoff, Hoeffding, Bernstein, and Freedman bounds.

The compensation term is the log-moment correction. Exponential martingales turn tail events into optional stopping or maximal inequality problems. They are also the mechanism behind Girsanov change of measure: exponential martingales become Radon–Nikodym densities.

24.5 Stopping and localization

Localization handles processes that are well-behaved only up to random times. A process $𝑀_{𝑡}$ is a local martingale if there exists an increasing sequence of stopping times $𝜏_{𝑛} ↑ \infty$ such that

𝑀_{𝑡 \land 𝜏_{𝑛}}

is a martingale for each $𝑛$ .

Localization permits stochastic calculus for unbounded or explosive processes by proving statements on bounded stopped intervals and then passing to the limit. The liftback requires care: local martingales need not be true martingales, and expectation identities may fail without integrability or uniform integrability.

24.6 Quadratic variation

Quadratic variation measures accumulated squared increments. For a continuous martingale $𝑀$ ,

[𝑀]_{𝑡}

is the increasing process such that

𝑀_{𝑡}^{2} - [𝑀]_{𝑡}

is a martingale. For Brownian motion,

[𝐵]_{𝑡} = 𝑡 .

Quadratic variation is the second-order clock of martingale fluctuation. In stochastic calculus, it is why Itô’s formula has a second derivative term. Ordinary differentiable paths have zero quadratic variation; Brownian paths have finite nonzero quadratic variation. This is the analytic signature of stochastic motion.

24.7 Martingale CLT

A martingale central limit theorem gives Gaussian limits for martingale difference arrays. If $𝐷_{𝑛, 𝑘}$ are martingale differences and the conditional variance sums converge:

\sum_{𝑘} 𝐸 [𝐷_{𝑛, 𝑘}^{2} ∣ 𝐹_{𝑛, 𝑘 - 1}] \to 𝜎^{2}

in probability, and a conditional Lindeberg condition holds, then

\sum_{𝑘} 𝐷_{𝑛, 𝑘} \Rightarrow 𝑁 (0, 𝜎^{2}) .

The theorem replaces independence with conditional mean-zero plus conditional variance stabilization. It is crucial in dependent asymptotics, stochastic approximation, statistics, Markov chains, and random environments. The conditional variance process is the carrier of limiting Gaussian scale.

24.8 Concentration via bounded differences

If a function $𝑓 (𝑋_{1}, \dots, 𝑋_{𝑛})$ changes by at most $𝑐_{𝑖}$ when only coordinate $𝑖$ is changed, and the $𝑋_{𝑖}$ are independent, define the Doob martingale

𝑀_{𝑖} = 𝐸 [𝑓 (𝑋_{1}, \dots, 𝑋_{𝑛}) ∣ 𝑋_{1}, \dots, 𝑋_{𝑖}] .

Its increments are bounded by $𝑐_{𝑖}$ , so Azuma–Hoeffding yields

𝑃 (𝑓 - 𝐸 𝑓 \geq 𝑡) \leq \exp (- \frac{2 𝑡^{2}}{\sum_{𝑖} 𝑐_{𝑖}^{2}})

up to convention-dependent constants.

This is McDiarmid’s bounded differences inequality. It converts low sensitivity to individual inputs into concentration. The martingale is an exposure process: reveal inputs one at a time and track the conditional expectation.

24.9 Algorithmic and combinatorial applications

Martingale concentration applies to random graphs, randomized algorithms, hashing, load balancing, random permutations, and combinatorial optimization. The typical method is to expose random choices sequentially, define a Doob martingale, bound the effect of each exposure, and apply Azuma, Freedman, or a refined inequality.

The key is not independence alone but controlled sensitivity under information revelation. If a single input can change the output drastically, bounded differences fail. If conditional variances are small, Freedman can improve bounds. The proof engine is filtration design plus increment audit.

Chapter 25 — Random Sequences and Processes

25.1 Process as random element in path space

A stochastic process $(𝑋_{𝑡})_{𝑡 \in 𝑇}$ can be viewed as a random element

𝑋 : Ω \to 𝑆^{𝑇}, 𝑋 (𝜔) = (𝑋_{𝑡} (𝜔))_{𝑡 \in 𝑇} .

The path space $𝑆^{𝑇}$ carries a σ-algebra, often generated by coordinate maps. Under stronger regularity, the law may live on $𝐶 [0, 𝑇]$ , $𝐷 [0, 𝑇]$ , or another structured path space.

This viewpoint treats the whole trajectory as one random object. Finite-dimensional distributions become projections of the path law. Path properties such as continuity, bounded variation, hitting times, and oscillation become events in path space, requiring measurability and often regularity certificates.

25.2 Finite-dimensional distributions

The finite-dimensional distributions of a process are the laws of

(𝑋_{𝑡_{1}}, \dots, 𝑋_{𝑡_{𝑘}})

for finite choices $𝑡_{1}, \dots, 𝑡_{𝑘}$ . These distributions describe all finite coordinate observations. They are necessary data for the process law.

For many process constructions, one first specifies finite-dimensional distributions and then checks consistency. But finite-dimensional distributions alone do not determine path regularity unless the path-space carrier is fixed. A process with continuous paths and a process with wild paths may share finite-dimensional distributions unless modifications are controlled.

25.3 Consistency conditions

Finite-dimensional distributions $𝜇_{𝑡_{1}, \dots, 𝑡_{𝑘}}$ must be compatible under permutation and marginalization. If one forgets a coordinate, the larger distribution must project to the smaller one:

(𝜋)_{*} 𝜇_{𝑡_{1}, \dots, 𝑡_{𝑘}} = 𝜇_{𝑠_{1}, \dots, 𝑠_{𝑚}} .

They must also be invariant under relabeling of repeated or reordered indices.

Consistency is the finite-data certificate for a global process law. Without it, no process can have those finite-dimensional distributions. With it, extension theorems can build a measure on a product path space under suitable assumptions.

25.4 Kolmogorov extension theorem

Kolmogorov’s extension theorem says that a consistent family of finite-dimensional distributions on suitable state spaces defines a probability measure on the infinite product space $𝑆^{𝑇}$ . The coordinate process on that space then has the prescribed finite-dimensional laws.

The theorem provides existence of a process as a product-space random element. It does not guarantee continuous or measurable sample paths beyond the product σ-algebra. Regularity is a separate payload. This distinction is essential for Brownian motion: extension gives a process with Gaussian finite-dimensional distributions; continuity requires additional work.

25.5 Sample-path regularity

Sample-path regularity concerns properties of $𝑡 \mapsto 𝑋_{𝑡} (𝜔)$ : continuity, càdlàg behavior, bounded variation, Hölder regularity, differentiability, or jump structure. These are not determined merely by one-time marginal laws.

Kolmogorov’s continuity criterion is a standard regularity tool. If

𝐸 ∣ 𝑋_{𝑡} - 𝑋_{𝑠} ∣^{𝛼} \leq 𝐶 ∣ 𝑡 - 𝑠 ∣^{1 + 𝛽}

for suitable $𝛼, 𝛽 > 0$ , then the process has a continuous modification with Hölder-type regularity below a threshold. Moment bounds on increments become path regularity certificates.

25.6 Separability

Separability reduces uncountable-time behavior to countable dense-time behavior. A process is separable, roughly, if its path properties can be determined by values on a countable dense subset, up to null sets. This prevents uncountable measurability traps.

In continuous-time probability, events like

\sup_{𝑡 \in [0, 𝑇]} 𝑋_{𝑡} > 𝑎

may not be measurable without path regularity or separability. If paths are continuous or càdlàg, the supremum over an interval can be reduced to a supremum over rational times. This is why path-space regularity and separability are not optional technicalities.

25.7 Modification and indistinguishability

Two processes $𝑋_{𝑡}$ and $𝑌_{𝑡}$ are modifications of each other if for every fixed $𝑡$ ,

𝑃 (𝑋_{𝑡} = 𝑌_{𝑡}) = 1.

They are indistinguishable if

𝑃 (𝑋_{𝑡} = 𝑌_{𝑡} for all 𝑡) = 1.

Indistinguishability is stronger, especially when $𝑇$ is uncountable.

A fixed-time null set may depend on $𝑡$ , and uncountably many null sets cannot be unioned safely. If processes have continuous paths, modification often implies indistinguishability under suitable conditions because equality on a countable dense set extends by continuity. Without regularity, the implication can fail.

25.8 Stationarity

A process is stationary if its finite-dimensional distributions are invariant under time shifts:

(𝑋_{𝑡_{1} + ℎ}, \dots, 𝑋_{𝑡_{𝑘} + ℎ}) \overset{d}{=} (𝑋_{𝑡_{1}}, \dots, 𝑋_{𝑡_{𝑘}})

whenever the shifted indices are valid. Strict stationarity is full finite-dimensional invariance. Weak stationarity usually means constant mean and covariance depending only on time lag.

Stationarity is symmetry in time. It does not imply independence or ergodicity. A process may have the same distribution at all times while retaining long-range dependence or latent random environment. Stationarity supplies invariant law; ergodicity determines whether time averages collapse to constants.

25.9 Ergodicity

A stationary process is ergodic if shift-invariant events have probability $0$ or $1$ . Equivalently, time averages converge to deterministic expectations rather than random invariant components. For an ergodic stationary sequence,

\frac{1}{𝑛} \sum_{𝑘 = 1}^{𝑛} 𝑓 (𝑋_{𝑘}) \to 𝐸 [𝑓 (𝑋_{1})]

almost surely under integrability.

Without ergodicity, the limit is conditional on the invariant σ-algebra:

\frac{1}{𝑛} \sum_{𝑘 = 1}^{𝑛} 𝑓 (𝑋_{𝑘}) \to 𝐸 [𝑓 (𝑋_{1}) ∣ 𝐼] .

Thus ergodicity is the gate that turns ensemble averages into time averages.

25.10 Mixing

Mixing is asymptotic independence between distant time regions. A strong mixing condition may take the form

𝛼 (𝑛) = \sup_{𝐴 \in 𝐹_{- \infty}^{0}, 𝐵 \in 𝐹_{𝑛}^{\infty}} ∣ 𝑃 (𝐴 \cap 𝐵) - 𝑃 (𝐴) 𝑃 (𝐵) ∣ \to 0.

Different mixing notions measure different strengths of dependence decay.

Mixing usually implies ergodicity, but it is stronger. It supports CLTs, concentration, empirical process results, and statistical consistency for dependent data. The rate of mixing matters: summable mixing coefficients give stronger limit theorems than mere convergence to zero.

25.11 Process-level carrier debt

A stochastic process claim must declare its path carrier, filtration, finite-dimensional laws, regularity, modification class, and convergence topology. A statement about $𝑋_{𝑡}$ at each fixed $𝑡$ is weaker than a statement about the whole path. A statement about finite-dimensional convergence is weaker than weak convergence in path space.

Common process-level errors include treating modifications as indistinguishable, assuming sample continuity from Gaussian marginals, using stopping times without a filtration, or applying continuous mapping arguments in the wrong path topology. Process probability is not coordinate probability repeated; it is probability on trajectories.

Chapter 26 — Markov Chains

26.1 Markov property

A discrete-time process $𝑋_{𝑛}$ is Markov if, conditional on the present, the future is independent of the past:

𝑃 (𝑋_{𝑛 + 1} \in 𝐴 ∣ 𝑋_{0}, \dots, 𝑋_{𝑛}) = 𝑃 (𝑋_{𝑛 + 1} \in 𝐴 ∣ 𝑋_{𝑛}) .

For time-homogeneous chains, this conditional law depends only on $𝑋_{𝑛}$ , not on $𝑛$ .

The Markov property is a conditional independence statement. It does not say the variables are independent; dependence passes through the current state. The present state is a sufficient statistic for future prediction. This is the carrier principle behind transition kernels and semigroups.

26.2 Transition kernels

A transition kernel $𝑃 (𝑥, 𝐴)$ gives the probability of moving from state $𝑥$ to a measurable set $𝐴$ . For countable spaces, this is a matrix

𝑃_{𝑖 𝑗} = 𝑃 (𝑋_{𝑛 + 1} = 𝑗 ∣ 𝑋_{𝑛} = 𝑖) .

The $𝑛$ -step transition kernel is obtained by composition:

𝑃^{𝑛} (𝑥, 𝐴) = 𝑃 (𝑋_{𝑛} \in 𝐴 ∣ 𝑋_{0} = 𝑥) .

Kernels are probability-valued functions of the current state. They define the chain law once an initial distribution $𝜇$ is specified. The finite-dimensional law is

𝜇 (𝑑 𝑥_{0}) 𝑃 (𝑥_{0}, 𝑑 𝑥_{1}) \dots 𝑃 (𝑥_{𝑛 - 1}, 𝑑 𝑥_{𝑛}) .

This factorization is the Markov certificate.

26.3 Finite-state chains

For a finite state space, the transition kernel is a stochastic matrix $𝑃$ with rows summing to one. If the distribution at time $𝑛$ is row vector $𝜇_{𝑛}$ , then

𝜇_{𝑛 + 1} = 𝜇_{𝑛} 𝑃, 𝜇_{𝑛} = 𝜇_{0} 𝑃^{𝑛} .

Finite chains reduce many questions to matrix analysis. Stationary distributions solve

𝜋 = 𝜋 𝑃 .

Long-term behavior depends on irreducibility, periodicity, and spectral structure. Finite-state chains are the cleanest setting for mixing time and convergence to equilibrium.

26.4 Countable-state chains

Countable-state chains require more classification than finite chains because recurrence, transience, and invariant measures become subtle. Transition probabilities still form a stochastic matrix, but stationary distributions may fail to exist or may not be normalizable.

A simple random walk on $𝑍$ is recurrent but has no probability stationary distribution. A biased random walk on $𝑍$ is transient. Countable chains force separation between invariant measures, stationary probabilities, recurrence classes, and positive recurrence.

26.5 Communicating classes

State $𝑖$ reaches state $𝑗$ , written $𝑖 \to 𝑗$ , if

𝑃^{𝑛} (𝑖, 𝑗) > 0

for some $𝑛$ . States communicate if $𝑖 \to 𝑗$ and $𝑗 \to 𝑖$ . Communication is an equivalence relation; equivalence classes are communicating classes.

A class is closed if the chain cannot leave it once entered. Irreducibility means all states communicate. Long-run behavior is analyzed class by class. Transient states may lead into closed recurrent classes; closed classes trap the chain.

26.6 Recurrence and transience

A state $𝑖$ is recurrent if, starting from $𝑖$ , the chain returns to $𝑖$ with probability one. It is transient if the return probability is less than one. Equivalently, recurrence can be characterized by

\sum_{𝑛 = 0}^{\infty} 𝑃^{𝑛} (𝑖, 𝑖) = \infty,

while transience corresponds to finiteness of this sum.

In irreducible chains, recurrence or transience is a class property. Positive recurrence means expected return time is finite; null recurrence means return occurs almost surely but with infinite expected return time. Stationary probability distributions exist for irreducible positive recurrent chains.

26.7 Stationary distributions

A stationary distribution $𝜋$ satisfies

𝜋 𝑃 = 𝜋 .

If $𝑋_{0} \sim 𝜋$ , then $𝑋_{𝑛} \sim 𝜋$ for all $𝑛$ . Stationarity is distributional invariance under the transition kernel.

For finite irreducible chains, a stationary distribution exists and is unique. For countable chains, existence of a probability stationary distribution corresponds to positive recurrence under irreducibility. Stationarity does not automatically mean convergence to stationarity; periodicity can obstruct convergence.

26.8 Reversibility

A chain with stationary distribution $𝜋$ is reversible if

𝜋_{𝑖} 𝑃_{𝑖 𝑗} = 𝜋_{𝑗} 𝑃_{𝑗 𝑖}

for all states $𝑖, 𝑗$ . This detailed balance condition says the stationary flow from $𝑖$ to $𝑗$ equals the flow from $𝑗$ to $𝑖$ .

Reversibility implies stationarity by summing over $𝑖$ :

\sum_{𝑖} 𝜋_{𝑖} 𝑃_{𝑖 𝑗} = \sum_{𝑖} 𝜋_{𝑗} 𝑃_{𝑗 𝑖} = 𝜋_{𝑗} .

Reversible chains are analytically tractable because the transition operator is self-adjoint in $𝐿^{2} (𝜋)$ . This connects Markov chains to spectral theory.

26.9 Detailed balance

Detailed balance is the local equation

𝜋 (𝑥) 𝑃 (𝑥, 𝑑 𝑦) = 𝜋 (𝑦) 𝑃 (𝑦, 𝑑 𝑥)

in general-state notation. It is stronger than stationarity, which only requires global balance:

\int 𝜋 (𝑑 𝑥) 𝑃 (𝑥, 𝐴) = 𝜋 (𝐴) .

Detailed balance is widely used to construct Markov chains with desired stationary distribution. In MCMC, Metropolis–Hastings transition rules are engineered to satisfy detailed balance. The cost is that reversible chains may be slower than nonreversible alternatives; reversibility is a certificate, not always an optimal design.

26.10 Hitting times

For a set $𝐴$ , the hitting time is

𝜏_{𝐴} = \inf {𝑛 \geq 0 : 𝑋_{𝑛} \in 𝐴} .

Hitting probabilities solve harmonic equations. For $ℎ (𝑖) = 𝑃_{𝑖} (𝜏_{𝐴} < 𝜏_{𝐵})$ ,

ℎ (𝑖) = \sum_{𝑗} 𝑃_{𝑖 𝑗} ℎ (𝑗)

on states outside $𝐴 \cup 𝐵$ , with boundary values $ℎ = 1$ on $𝐴$ , $ℎ = 0$ on $𝐵$ .

Expected hitting times solve Poisson equations. If $𝑔 (𝑖) = 𝐸_{𝑖} [𝜏_{𝐴}]$ , then

𝑔 (𝑖) = 1 + \sum_{𝑗} 𝑃_{𝑖 𝑗} 𝑔 (𝑗)

outside $𝐴$ , with $𝑔 = 0$ on $𝐴$ . Hitting times connect Markov chains to potential theory.

26.11 Absorbing chains

An absorbing state $𝑎$ satisfies $𝑃_{𝑎 𝑎} = 1$ . Once entered, it cannot be left. More generally, a closed class acts as an absorbing region. Absorbing chains are analyzed by separating transient and absorbing states.

For finite absorbing chains with transient submatrix $𝑄$ , the fundamental matrix is

𝑁 = (𝐼 - 𝑄)^{- 1} = \sum_{𝑛 = 0}^{\infty} 𝑄^{𝑛} .

The entry $𝑁_{𝑖 𝑗}$ gives the expected number of visits to transient state $𝑗$ starting from $𝑖$ . Absorption probabilities and expected absorption times follow from $𝑁$ .

26.12 Mixing time

For an irreducible aperiodic finite chain with stationary distribution $𝜋$ , the mixing time measures how long until the law is close to $𝜋$ . In total variation,

𝑡_{m i x} (𝜀) = \min {𝑡 : \sup_{𝑥} ∥ 𝑃^{𝑡} (𝑥, \cdot) - 𝜋 ∥_{T V} \leq 𝜀} .

Mixing time quantifies convergence to equilibrium. It depends on bottlenecks, spectral gap, conductance, coupling, and geometry of the state space. Stationarity alone does not give speed. Mixing analysis is about finite-time convergence, not merely limiting invariance.

26.13 Coupling methods

A coupling of two chains constructs $(𝑋_{𝑛}, 𝑌_{𝑛})$ on one probability space with each marginal following the chain. If they meet and stay together after time $𝑇$ , then

∥ 𝑃^{𝑛} (𝑥, \cdot) - 𝑃^{𝑛} (𝑦, \cdot) ∥_{T V} \leq 𝑃 (𝑇 > 𝑛) .

Coupling bounds mixing by controlling coalescence time.

Good couplings exploit contraction, monotonicity, shared randomness, or geometry. Coupling is not automatic; it is a designed joint carrier. The success of the method depends on how quickly the coupled chains meet under the chosen construction.

26.14 Spectral gap

For a reversible finite chain, the spectral gap is

𝛾 = 1 - 𝜆_{2}

where $𝜆_{2}$ is the second-largest eigenvalue in appropriate ordering. A positive spectral gap implies exponential convergence to equilibrium. Roughly,

∥ 𝑃^{𝑛} (𝑥, \cdot) - 𝜋 ∥_{T V}

decays at a rate controlled by $(1 - 𝛾)^{𝑛}$ , with constants depending on $𝜋$ .

The spectral gap measures relaxation speed in $𝐿^{2} (𝜋)$ . It is connected to Poincaré inequalities, conductance, and geometry. Small spectral gap indicates bottlenecks or slow modes. Spectral methods are powerful but usually require reversibility or operator control.

26.15 Markov chain Monte Carlo preview

MCMC constructs a Markov chain whose stationary distribution is a target law $𝜋$ . After running the chain, samples are used to estimate

\int 𝑓 𝑑 𝜋 .

Metropolis–Hastings, Gibbs sampling, and Langevin algorithms are standard examples.

The validity of MCMC requires more than stationarity. One needs convergence to stationarity, adequate mixing, diagnostic control, and error estimates. A chain that has $𝜋$ as invariant law but mixes slowly may be computationally useless. The MCMC audit is invariant distribution plus convergence rate plus sampling error.

Chapter 27 — Markov Processes

27.1 Markov kernels on general spaces

On a measurable state space $(𝑆, 𝑆)$ , a Markov kernel $𝑃 (𝑥, 𝐴)$ satisfies: for each $𝑥$ , $𝐴 \mapsto 𝑃 (𝑥, 𝐴)$ is a probability measure, and for each $𝐴$ , $𝑥 \mapsto 𝑃 (𝑥, 𝐴)$ is measurable. It defines transition probabilities from states to measurable sets.

General kernels replace matrices. The Markov property becomes

𝑃 (𝑋_{𝑡 + 𝑠} \in 𝐴 ∣ 𝐹_{𝑡}) = 𝑃_{𝑠} (𝑋_{𝑡}, 𝐴)

for a transition family $𝑃_{𝑠}$ . The measurable-space structure is essential; without it, the transition probability as a function of state may not be well-defined.

27.2 Semigroups

A Markov semigroup $(𝑃_{𝑡})_{𝑡 \geq 0}$ acts on functions by

𝑃_{𝑡} 𝑓 (𝑥) = 𝐸_{𝑥} [𝑓 (𝑋_{𝑡})] .

It satisfies

𝑃_{0} = 𝐼, 𝑃_{𝑡 + 𝑠} = 𝑃_{𝑡} 𝑃_{𝑠} .

The second identity is the Chapman–Kolmogorov equation and encodes the Markov property.

Semigroups are operator carriers for Markov processes. Instead of tracking paths directly, one studies how expectations evolve. Semigroup methods connect probability to PDE, functional analysis, spectral theory, and generators.

27.3 Strong Markov property

The strong Markov property says the Markov property holds at stopping times. If $𝜏$ is a stopping time, then conditional on $𝐹_{𝜏}$ , the post- $𝜏$ process behaves like a fresh process started from $𝑋_{𝜏}$ :

𝐸 [𝑓 (𝑋_{𝜏 + 𝑡}) ∣ 𝐹_{𝜏}] = 𝑃_{𝑡} 𝑓 (𝑋_{𝜏}) .

This is stronger than the deterministic-time Markov property. It is essential for hitting times, excursions, renewal arguments, boundary problems, and potential theory. The stopping-time condition is load-bearing; arbitrary random times do not generally permit Markov restart.

27.4 Feller processes

A Feller semigroup acts on $𝐶_{0} (𝑆)$ , the continuous functions vanishing at infinity, and satisfies

𝑃_{𝑡} : 𝐶_{0} (𝑆) \to 𝐶_{0} (𝑆), \lim_{𝑡 ↓ 0} ∥ 𝑃_{𝑡} 𝑓 - 𝑓 ∥_{\infty} = 0.

A Markov process with such a semigroup is called a Feller process.

Feller structure connects stochastic processes to analytic generators. It gives regular dependence on initial conditions and supports construction through semigroup theory. Many diffusions, Lévy processes, and jump processes are Feller under suitable assumptions.

27.5 Generators

The generator $𝐴$ of a semigroup is

𝐴 𝑓 = \lim_{𝑡 ↓ 0} \frac{𝑃_{𝑡} 𝑓 - 𝑓}{𝑡}

for functions $𝑓$ where the limit exists. It describes infinitesimal evolution. For a diffusion,

𝐴 𝑓 = 𝑏 (𝑥) \cdot \nabla 𝑓 (𝑥) + \frac{1}{2} Tr (𝑎 (𝑥) \nabla^{2} 𝑓 (𝑥)) .

The generator is the local analytic fingerprint of a Markov process. It determines PDEs, martingale problems, transition behavior, and invariant measures. The domain of $𝐴$ is part of the data; writing a differential expression without specifying domain can be incomplete.

27.6 Resolvents

The resolvent of a semigroup is

𝑅_{𝜆} 𝑓 = \int_{0}^{\infty} 𝑒^{- 𝜆 𝑡} 𝑃_{𝑡} 𝑓 𝑑 𝑡, 𝜆 > 0.

It satisfies

𝑅_{𝜆} = (𝜆 𝐼 - 𝐴)^{- 1}

formally, when $𝐴$ is the generator.

Resolvents encode potential behavior: expected discounted occupation times. They are useful for constructing processes, solving boundary value problems, and analyzing recurrence. In Markov process theory, semigroup, generator, and resolvent are three equivalent analytic views when the carrier is well-behaved.

27.7 Invariant measures

A probability measure $𝜋$ is invariant if

\int 𝑃_{𝑡} (𝑥, 𝐴) 𝜋 (𝑑 𝑥) = 𝜋 (𝐴)

for all $𝑡$ and measurable $𝐴$ . Equivalently,

\int 𝑃_{𝑡} 𝑓 𝑑 𝜋 = \int 𝑓 𝑑 𝜋 .

If $𝑋_{0} \sim 𝜋$ , then $𝑋_{𝑡} \sim 𝜋$ for all $𝑡$ .

Invariant measures describe equilibrium laws. Existence, uniqueness, and convergence to equilibrium require additional recurrence, irreducibility, Lyapunov, or compactness conditions. In continuous spaces, invariant measures may exist but convergence may fail or depend strongly on topology.

27.8 Harris recurrence

Harris recurrence is a recurrence notion for general-state Markov processes or chains. Roughly, a process is Harris recurrent if it visits every set of positive measure infinitely often with probability one, relative to an irreducibility measure.

This theory replaces communicating classes from countable chains. It provides the correct framework for ergodic theorems, invariant probability measures, and convergence in total variation on general spaces. Petite sets, minorization, and Lyapunov functions often appear as certificates for Harris recurrence and geometric ergodicity.

27.9 Regeneration

Regeneration occurs when a process probabilistically restarts, forgetting its past. A regeneration time $𝜏$ splits the path into independent or conditionally independent cycles. For regenerative processes, long-run averages reduce to renewal-reward ratios:

\lim_{𝑡 \to \infty} \frac{1}{𝑡} \int_{0}^{𝑡} 𝑓 (𝑋_{𝑠}) 𝑑 𝑠 = \frac{𝐸 \int_{0}^{𝜏} 𝑓 (𝑋_{𝑠}) 𝑑 𝑠}{𝐸 𝜏}

under suitable conditions.

Regeneration is a powerful replacement for independence. It decomposes dependent paths into iid cycles. In Markov chains, regeneration can be constructed through atoms or minorization/splitting techniques.

27.10 Continuous-time Markov chains

A continuous-time Markov chain on a countable state space has transition rates $𝑞_{𝑖 𝑗}$ for $𝑖 \neq 𝑗$ , with

𝑞_{𝑖} = \sum_{𝑗 \neq 𝑖} 𝑞_{𝑖 𝑗} .

The chain waits an exponential time with rate $𝑞_{𝑖}$ in state $𝑖$ , then jumps to $𝑗$ with probability $𝑞_{𝑖 𝑗} / 𝑞_{𝑖}$ .

The generator matrix $𝑄$ has entries

𝑄_{𝑖 𝑗} = 𝑞_{𝑖 𝑗}, 𝑖 \neq 𝑗, 𝑄_{𝑖 𝑖} = - 𝑞_{𝑖} .

Transition probabilities satisfy Kolmogorov equations:

\frac{𝑑}{𝑑 𝑡} 𝑃_{𝑡} = 𝑃_{𝑡} 𝑄

or $𝑄 𝑃_{𝑡}$ , depending on forward/backward formulation. Explosion control is required if infinitely many jumps can occur in finite time.

27.11 Jump processes

Jump processes evolve by sudden discontinuities rather than continuous paths. A pure jump Markov process is specified by jump rates or a jump kernel. More generally, jump-diffusions combine continuous diffusion with jumps.

The generator often has the form

𝐴 𝑓 (𝑥) = 𝑏 (𝑥) \cdot \nabla 𝑓 (𝑥) + \int (𝑓 (𝑦) - 𝑓 (𝑥)) 𝑞 (𝑥, 𝑑 𝑦),

possibly with compensation for small jumps. Jump processes require càdlàg path spaces and careful handling of predictable compensators, intensities, and explosion.

27.12 Birth-death processes

A birth-death process is a continuous-time Markov chain on $𝑁$ with transitions only from $𝑛$ to $𝑛 + 1$ or $𝑛 - 1$ . Birth rates are $𝜆_{𝑛}$ , death rates are $𝜇_{𝑛}$ . The generator is

𝐴 𝑓 (𝑛) = 𝜆_{𝑛} [𝑓 (𝑛 + 1) - 𝑓 (𝑛)] + 𝜇_{𝑛} [𝑓 (𝑛 - 1) - 𝑓 (𝑛)] .

Birth-death processes model queues, populations, branching-like systems, and chemical reactions. Stationary distributions, when they exist, often satisfy detailed balance:

𝜋_{𝑛} 𝜆_{𝑛} = 𝜋_{𝑛 + 1} 𝜇_{𝑛 + 1} .

Recurrence and explosion depend on the rate growth.

27.13 Martingale problems

Given an operator $𝐴$ , the martingale problem asks for a process $𝑋_{𝑡}$ such that for all test functions $𝑓$ in the domain,

𝑓 (𝑋_{𝑡}) - 𝑓 (𝑋_{0}) - \int_{0}^{𝑡} 𝐴 𝑓 (𝑋_{𝑠}) 𝑑 𝑠

is a martingale. Solving the martingale problem constructs a Markov process with generator $𝐴$ .

This formulation avoids directly solving SDEs or transition densities. It is robust under weak convergence and is central in diffusion theory. Well-posedness means existence and uniqueness in law. If the martingale problem is well-posed, the generator determines the process law.

Chapter 28 — Brownian Motion

28.1 Construction of Brownian motion

Brownian motion $𝐵_{𝑡}$ is a process with $𝐵_{0} = 0$ , independent increments, Gaussian increments

𝐵_{𝑡} - 𝐵_{𝑠} \sim 𝑁 (0, 𝑡 - 𝑠),

and continuous paths. Kolmogorov extension constructs a process with the correct finite-dimensional Gaussian laws; continuity is then obtained through moment estimates and a continuity theorem.

The construction separates law specification from path regularity. Gaussian finite-dimensional distributions alone do not automatically give continuous paths. Brownian motion is the continuous modification of the Gaussian process with covariance

𝐸 [𝐵_{𝑠} 𝐵_{𝑡}] = 𝑠 \land 𝑡 .

28.2 Gaussian processes

A Gaussian process is a process whose every finite linear combination is Gaussian. It is determined by its mean function

𝑚 (𝑡) = 𝐸 [𝑋_{𝑡}]

and covariance function

𝐾 (𝑠, 𝑡) = Cov (𝑋_{𝑠}, 𝑋_{𝑡}) .

Brownian motion is centered with $𝐾 (𝑠, 𝑡) = 𝑠 \land 𝑡$ .

Gaussianity reduces finite-dimensional laws to linear algebra. Positive semidefiniteness of the covariance kernel is the consistency condition. Path properties, however, depend on metric behavior induced by

𝑑 (𝑠, 𝑡)^{2} = 𝐸 ∣ 𝑋_{𝑠} - 𝑋_{𝑡} ∣^{2} .

Covariance determines law; regularity requires additional analysis of increments.

28.3 Independent increments

Brownian motion has independent increments: for

0 \leq 𝑡_{0} < 𝑡_{1} < \dots < 𝑡_{𝑛},

the variables

𝐵_{𝑡_{1}} - 𝐵_{𝑡_{0}}, \dots, 𝐵_{𝑡_{𝑛}} - 𝐵_{𝑡_{𝑛 - 1}}

are independent. Each increment has variance equal to its time length.

Independent increments make Brownian motion a continuous-time analogue of random walk. The Markov property, martingale property, and Gaussian transition densities follow. Independent increments are stronger than uncorrelated increments, though for Gaussian processes uncorrelated implies independent.

28.4 Continuous paths

Brownian motion has continuous paths almost surely, but those paths are almost surely nowhere differentiable. Continuity follows from moment bounds such as

𝐸 ∣ 𝐵_{𝑡} - 𝐵_{𝑠} ∣^{𝑝} = 𝐶_{𝑝} ∣ 𝑡 - 𝑠 ∣^{𝑝 / 2}

and Kolmogorov continuity criteria.

The paths are rough: for small $ℎ$ , increments are typically of size $\sqrt{ℎ}$ , so difference quotients behave like $ℎ^{- 1 / 2}$ . This roughness is why stochastic calculus is not ordinary calculus. Brownian paths have finite quadratic variation but infinite total variation on intervals.

28.5 Scaling

Brownian motion satisfies the scaling property

(𝐵_{𝑐 𝑡})_{𝑡 \geq 0} \overset{d}{=} (\sqrt{𝑐} 𝐵_{𝑡})_{𝑡 \geq 0} .

This follows from Gaussian increments: $𝐵_{𝑐 𝑡}$ has variance $𝑐 𝑡$ , matching $\sqrt{𝑐} 𝐵_{𝑡}$ .

Scaling is a structural symmetry. It determines hitting-time distributions, fractal path behavior, local time scaling, and invariance principles. Brownian motion is self-similar with exponent $1 / 2$ . This exponent is the source of diffusive scaling.

28.6 Reflection principle

The reflection principle states, for Brownian motion and $𝑎 > 0$ ,

𝑃 (\sup_{0 \leq 𝑠 \leq 𝑡} 𝐵_{𝑠} \geq 𝑎) = 2 𝑃 (𝐵_{𝑡} \geq 𝑎) .

Paths that cross level $𝑎$ are reflected after their first hitting time, giving a bijective probability argument.

This principle yields distributions of maxima and hitting times. For $𝜏_{𝑎} = \inf {𝑡 : 𝐵_{𝑡} = 𝑎}$ ,

𝑃 (𝜏_{𝑎} \leq 𝑡) = 2 𝑃 (𝐵_{𝑡} \geq 𝑎) .

Reflection uses continuity and symmetry of Brownian increments; it is not available for arbitrary martingales or jump processes without modification.

28.7 Hitting times

For Brownian motion, the hitting time of level $𝑎$ is

𝜏_{𝑎} = \inf {𝑡 \geq 0 : 𝐵_{𝑡} = 𝑎} .

Its distribution satisfies

𝑃 (𝜏_{𝑎} \leq 𝑡) = 2 (1 - Φ (\frac{𝑎}{\sqrt{𝑡}}))

for $𝑎 > 0$ . The density is

𝑓_{𝜏_{𝑎}} (𝑡) = \frac{𝑎}{\sqrt{2 𝜋 𝑡^{3}}} 𝑒^{- 𝑎^{2} / (2 𝑡)} .

Hitting times connect Brownian motion to boundary value problems. Probabilities of hitting one boundary before another solve harmonic equations. Expected hitting times solve Poisson equations when finite. In unbounded domains, expectations may be infinite even when hitting occurs almost surely.

28.8 Brownian bridge

A Brownian bridge from $0$ to $0$ over $[0, 1]$ is Brownian motion conditioned on $𝐵_{1} = 0$ . It can be represented as

𝛽_{𝑡} = 𝐵_{𝑡} - 𝑡 𝐵_{1} .

It is a centered Gaussian process with covariance

𝐸 [𝛽_{𝑠} 𝛽_{𝑡}] = 𝑠 \land 𝑡 - 𝑠 𝑡 .

The Brownian bridge appears in empirical process limits, Kolmogorov–Smirnov statistics, conditioned diffusions, and Gaussian process theory. It is not Brownian motion because its increments are not independent; the endpoint constraint introduces global dependence.

28.9 Brownian motion in `ℝᵈ`

A $𝑑$ -dimensional Brownian motion is

𝐵_{𝑡} = (𝐵_{𝑡}^{(1)}, \dots, 𝐵_{𝑡}^{(𝑑)}),

where the coordinates are independent one-dimensional Brownian motions. Its transition density is

𝑝_{𝑡} (𝑥, 𝑦) = \frac{1}{(2 𝜋 𝑡)^{𝑑 / 2}} \exp (- \frac{∣ 𝑦 - 𝑥 ∣^{2}}{2 𝑡}) .

The generator is

𝐴 = \frac{1}{2} Δ .

Thus Brownian motion is probabilistically tied to the heat equation:

\partial_{𝑡} 𝑢 = \frac{1}{2} Δ 𝑢 .

The dimension $𝑑$ changes recurrence, hitting probabilities, potential theory, and path intersection behavior.

28.10 Recurrence and transience

Brownian motion is recurrent in dimensions $𝑑 = 1, 2$ in the sense that it returns arbitrarily close to points or neighborhoods, while it is transient in dimensions $𝑑 \geq 3$ . In $𝑑 = 1$ , points are hit almost surely. In $𝑑 = 2$ , points are polar in a finer sense, but neighborhoods are visited recurrently. In $𝑑 \geq 3$ , Brownian motion tends to infinity.

This dimension transition reflects potential theory. The Green function changes behavior by dimension. Recurrence means infinite occupation of neighborhoods; transience means finite expected occupation. The threshold at dimension two is fundamental.

28.11 Quadratic variation

Brownian motion has quadratic variation

[𝐵]_{𝑡} = 𝑡 .

For partitions $Π_{𝑛}$ of $[0, 𝑡]$ with mesh going to zero,

\sum_{[𝑢, 𝑣] \in Π_{𝑛}} (𝐵_{𝑣} - 𝐵_{𝑢})^{2} \to 𝑡

in probability, and along suitable partitions almost surely.

This property distinguishes Brownian motion from differentiable functions, whose quadratic variation is zero. Quadratic variation is why Itô calculus has correction terms. It is also the intrinsic clock of Brownian martingales.

28.12 Lévy characterization

Lévy’s characterization states that a continuous adapted process $𝑀_{𝑡}$ with $𝑀_{0} = 0$ is Brownian motion if $𝑀_{𝑡}$ is a martingale and

𝑀_{𝑡}^{2} - 𝑡

is also a martingale. Equivalently, $𝑀$ is a continuous martingale with quadratic variation $[𝑀]_{𝑡} = 𝑡$ .

This theorem characterizes Brownian motion by martingale and quadratic variation properties rather than independent Gaussian increments. It is powerful in proving that transformed processes are Brownian under new measures or filtrations.

28.13 Donsker invariance principle

Donsker’s theorem says that rescaled random walks converge to Brownian motion. If $𝑋_{𝑖}$ are iid mean zero, variance one, and

𝑆_{𝑛} (𝑡) = \frac{1}{\sqrt{𝑛}} \sum_{𝑖 = 1}^{⌊ 𝑛 𝑡 ⌋} 𝑋_{𝑖}

with interpolation, then

𝑆_{𝑛} \Rightarrow 𝐵

in a suitable path space.

This is the process-level CLT. It upgrades convergence of one-time sums to convergence of entire paths. The theorem explains Brownian motion as the universal scaling limit of many independent small-step systems. Tightness is the additional payload beyond finite-dimensional CLT.

28.14 Brownian motion as universal limit carrier

Brownian motion is universal because it is the canonical limit of centered finite-variance cumulative noise under diffusive scaling. It appears in random walks, empirical processes, martingales, queueing, statistical estimation, stochastic differential equations, and physics.

Universality has conditions. Heavy tails lead to stable Lévy processes; long-range dependence can lead to fractional Brownian motion; jumps lead to discontinuous Lévy limits; nonstandard dependence can change scaling. Brownian motion is the universal carrier only after finite-variance, weak-dependence, and diffusive-scaling debts are paid.

Chapter 29 — Stochastic Integration

29.1 Motivation

Ordinary integration fails for Brownian-driven equations because Brownian paths have infinite variation and are nowhere differentiable. Expressions like

\int_{0}^{𝑡} 𝐻_{𝑠} 𝑑 𝐵_{𝑠}

cannot be defined as Riemann–Stieltjes integrals in the usual pathwise sense for general adapted $𝐻$ .

Stochastic integration builds an integral using probability, martingales, and $𝐿^{2}$ limits. The integrand must be nonanticipative, typically predictable. The integral is not just a pathwise area; it is a limit of sums where the integrand is evaluated using information before the Brownian increment.

29.2 Itô integral for simple processes

For a simple predictable process

𝐻_{𝑡} = \sum_{𝑘 = 0}^{𝑛 - 1} 𝐻_{𝑘} 1_{(𝑡_{𝑘}, 𝑡_{𝑘 + 1}]} (𝑡),

where $𝐻_{𝑘}$ is $𝐹_{𝑡_{𝑘}}$ -measurable, define

\int_{0}^{𝑇} 𝐻_{𝑡} 𝑑 𝐵_{𝑡} = \sum_{𝑘 = 0}^{𝑛 - 1} 𝐻_{𝑘} (𝐵_{𝑡_{𝑘 + 1}} - 𝐵_{𝑡_{𝑘}}) .

Predictability ensures $𝐻_{𝑘}$ is known before the increment.

For such integrals,

𝐸 [\int_{0}^{𝑇} 𝐻_{𝑡} 𝑑 𝐵_{𝑡}] = 0,

and the Itô isometry holds:

𝐸 [{(\int_{0}^{𝑇} 𝐻_{𝑡} 𝑑 𝐵_{𝑡})}^{2}] = 𝐸 \int_{0}^{𝑇} 𝐻_{𝑡}^{2} 𝑑 𝑡 .

29.3 Extension by isometry

The Itô isometry extends the integral from simple predictable processes to all predictable processes satisfying

𝐸 \int_{0}^{𝑇} 𝐻_{𝑡}^{2} 𝑑 𝑡 < \infty .

Choose simple $𝐻^{𝑛} \to 𝐻$ in $𝐿^{2} (Ω \times [0, 𝑇])$ , define

\int 𝐻 𝑑 𝐵 = 𝐿^{2} - \lim_{𝑛} \int 𝐻^{𝑛} 𝑑 𝐵 .

The isometry guarantees the definition is independent of approximating sequence. This is the measure-theoretic construction of stochastic integration: build on simple predictable objects, prove an exact norm identity, then complete the space.

29.4 Predictable integrands

An Itô integrand must be predictable or at least progressively measurable with suitable integrability, depending on formulation. Predictability encodes nonanticipation: $𝐻_{𝑡}$ may depend on past and present information but not future Brownian increments.

This condition is structural. If one uses right-endpoint values or future-dependent integrands, different integrals arise, such as Stratonovich-type integrals or anticipative integrals requiring Malliavin calculus. Itô integration is tied to adapted prediction and martingale structure.

29.5 Local martingales

The stochastic integral

𝑀_{𝑡} = \int_{0}^{𝑡} 𝐻_{𝑠} 𝑑 𝐵_{𝑠}

is a martingale when $𝐻$ is square-integrable on finite horizons. Under local square integrability, it is a local martingale. Its quadratic variation is

[𝑀]_{𝑡} = \int_{0}^{𝑡} 𝐻_{𝑠}^{2} 𝑑 𝑠 .

Local martingales are the natural output of stochastic integration. They behave like martingales up to localization times but may fail to be true martingales globally. This distinction matters for expectation identities, change of measure, and financial modeling.

29.6 Quadratic variation

For Itô integrals,

{[\int_{0}^{\cdot} 𝐻_{𝑠} 𝑑 𝐵_{𝑠}]}_{𝑡} = \int_{0}^{𝑡} 𝐻_{𝑠}^{2} 𝑑 𝑠 .

For two integrals,

{[\int 𝐻 𝑑 𝐵, \int 𝐾 𝑑 𝐵]}_{𝑡} = \int_{0}^{𝑡} 𝐻_{𝑠} 𝐾_{𝑠} 𝑑 𝑠 .

Quadratic variation is the second-order bookkeeping of stochastic calculus. It replaces ordinary differential products. The symbolic rule is

(𝑑 𝐵_{𝑡})^{2} = 𝑑 𝑡, 𝑑 𝐵_{𝑡} 𝑑 𝑡 = 0, (𝑑 𝑡)^{2} = 0.

This rule is mnemonic for rigorous quadratic variation calculations.

29.7 Itô formula

If $𝑋_{𝑡}$ satisfies

𝑑 𝑋_{𝑡} = 𝑏_{𝑡} 𝑑 𝑡 + 𝜎_{𝑡} 𝑑 𝐵_{𝑡}

and $𝑓 \in 𝐶^{2}$ , then

𝑑 𝑓 (𝑋_{𝑡}) = 𝑓^{'} (𝑋_{𝑡}) 𝑑 𝑋_{𝑡} + \frac{1}{2} 𝑓^{''} (𝑋_{𝑡}) 𝜎_{𝑡}^{2} 𝑑 𝑡 .

Equivalently,

𝑓 (𝑋_{𝑡}) = 𝑓 (𝑋_{0}) + \int_{0}^{𝑡} 𝑓^{'} (𝑋_{𝑠}) 𝑏_{𝑠} 𝑑 𝑠 + \int_{0}^{𝑡} 𝑓^{'} (𝑋_{𝑠}) 𝜎_{𝑠} 𝑑 𝐵_{𝑠} + \frac{1}{2} \int_{0}^{𝑡} 𝑓^{''} (𝑋_{𝑠}) 𝜎_{𝑠}^{2} 𝑑 𝑠 .

The second derivative term is the Itô correction caused by quadratic variation. Ordinary chain rule fails because Brownian increments have square size comparable to $𝑑 𝑡$ . Itô formula is the fundamental calculus rule for diffusion processes.

29.8 Stochastic integration by parts

For semimartingales $𝑋, 𝑌$ ,

𝑑 (𝑋_{𝑡} 𝑌_{𝑡}) = 𝑋_{𝑡} 𝑑 𝑌_{𝑡} + 𝑌_{𝑡} 𝑑 𝑋_{𝑡} + 𝑑 [𝑋, 𝑌]_{𝑡} .

Integrated,

𝑋_{𝑡} 𝑌_{𝑡} = 𝑋_{0} 𝑌_{0} + \int_{0}^{𝑡} 𝑋_{𝑠} 𝑑 𝑌_{𝑠} + \int_{0}^{𝑡} 𝑌_{𝑠} 𝑑 𝑋_{𝑠} + [𝑋, 𝑌]_{𝑡}

with predictable convention refinements depending on exact formulation.

The extra quadratic covariation term is the stochastic correction. If one process has finite variation, its quadratic covariation with a continuous martingale is zero. Integration by parts is essential for exponential martingales, product processes, and SDE transformations.

29.9 Exponential martingales

For Brownian motion,

𝐸_{𝑡} (𝜃 𝐵) = \exp (𝜃 𝐵_{𝑡} - \frac{1}{2} 𝜃^{2} 𝑡)

is a martingale. More generally, for suitable $𝐻$ ,

𝐸_{𝑡} (\int 𝐻 𝑑 𝐵) = \exp (\int_{0}^{𝑡} 𝐻_{𝑠} 𝑑 𝐵_{𝑠} - \frac{1}{2} \int_{0}^{𝑡} 𝐻_{𝑠}^{2} 𝑑 𝑠)

is a local martingale and under conditions such as Novikov’s condition is a true martingale.

Exponential martingales are density processes for changing measure. They also yield concentration bounds and solve linear stochastic equations. The correction $- \frac{1}{2} \int 𝐻^{2}$ compensates quadratic variation.

29.10 Girsanov theorem

Girsanov’s theorem describes how Brownian drift changes under an equivalent measure. If

𝑍_{𝑡} = \exp (- \int_{0}^{𝑡} 𝜃_{𝑠} 𝑑 𝐵_{𝑠} - \frac{1}{2} \int_{0}^{𝑡} 𝜃_{𝑠}^{2} 𝑑 𝑠)

is a martingale and defines $𝑑 𝑄 = 𝑍_{𝑇} 𝑑 𝑃$ , then

{\tilde{𝐵}}_{𝑡} = 𝐵_{𝑡} + \int_{0}^{𝑡} 𝜃_{𝑠} 𝑑 𝑠

is Brownian motion under $𝑄$ .

The theorem is a change-of-measure engine. It transforms drift into likelihood ratio. It is central in filtering, finance, diffusion theory, rare-event simulation, and stochastic control. The martingale condition on $𝑍$ is the integrability gate; without it, the measure change may fail.

29.11 Martingale representation theorem

In a Brownian filtration, every square-integrable martingale $𝑀_{𝑡}$ can be represented as

𝑀_{𝑡} = 𝑀_{0} + \int_{0}^{𝑡} 𝐻_{𝑠} 𝑑 𝐵_{𝑠}

for some predictable $𝐻$ with suitable square integrability. This says Brownian stochastic integrals span martingales in the Brownian information carrier.

The theorem is foundational for hedging in mathematical finance, filtering, and stochastic control. It is filtration-specific. In a larger filtration or with jump noise, representation may require additional martingale drivers. The theorem is not a universal statement about all martingales; it is a carrier theorem for Brownian filtrations.