Probability Theory — Chapters 1–10
Chapter 1 — The Problem of Uncertainty
1.1 Deterministic statements versus probabilistic statements
A deterministic statement has a truth value fixed by the state of the system and the rules of the model: “the derivative of $x^2$ is $2x$,” “this algorithm halts on this input,” or “the particle is at coordinate $x$” once the model’s state is fully specified. A probabilistic statement does not assert a single realized truth in advance; it assigns weights to possible truth-values or outcomes. The statement “the coin lands heads with probability $1/2$” is not equivalent to “the coin will land heads,” nor is it a weaker deterministic statement. It is a statement about a measure over alternatives.
The primitive move in probability is therefore not prediction but carrier construction. One must specify what can happen, which propositions about what can happen are legally measurable, and how much probability mass they carry. The deterministic claim has the form $\omega \in A$ or $\omega \notin A$. The probabilistic claim has the form
$$P(A) = p,$$
where $A$ is an event in some event space $\mathcal{F}$. Without the event space, the expression $P(A)$ is syntactically suggestive but mathematically incomplete.
1.2 Events, outcomes, experiments, observations
An outcome is a primitive realized possibility in a model. In a die roll, an outcome may be $\omega = 3$. An event is a set of outcomes satisfying some property, such as “the die is even,” represented by $\{2, 4, 6\}$. An experiment is the procedure or formal random mechanism generating outcomes. An observation is the information actually registered; it may be coarser than the outcome. For example, if a die is rolled but only parity is observed, the observable events are $\{1, 3, 5\}$ and $\{2, 4, 6\}$ (together with $\emptyset$ and the whole space), not necessarily the full six singleton outcomes.
This distinction matters because probability attaches to events, not directly to linguistic descriptions. Two descriptions may denote the same event, and the same informal description may denote different events under different sample-space models. A formal probability model therefore separates $\Omega$, the outcome space, from $\mathcal{F}$, the admissible event family, and from $P$, the probability law. The triple $(\Omega, \mathcal{F}, P)$ is the carrier; events are legal only when they belong to $\mathcal{F}$.
1.3 Probability as weight, frequency, belief, symmetry, and measure
Probability has several interpretations. As weight, it is a normalized mass assigned to events. As frequency, it describes the limiting proportion of occurrence in repeated trials. As belief, it quantifies rational degrees of uncertainty. As symmetry, it assigns equal mass to indistinguishable alternatives. As measure, it becomes a mathematical function satisfying normalization and countable additivity:
$$P(\Omega) = 1, \qquad P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)$$
for pairwise disjoint $A_1, A_2, \dots$.
The measure interpretation is the formal engine, not necessarily the philosophical interpretation. A Bayesian may use $P$ to encode belief; a frequentist may use $P$ to model long-run sampling behavior; a physicist may use $P$ to describe ensemble uncertainty. All still need a calculus that supports events, random variables, products, conditioning, expectation, and limits. Measure theory supplies that common calculus.
1.4 Why finite probability is insufficient
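Finite probability handles dice, cards, urns, and finite games well. If $\Omega = \{\omega_1, \dots, \omega_n\}$, every event is a subset of $\Omega$, and a probability law is determined by masses $p_i = P(\{\omega_i\})$ with $p_i \ge 0$ and $\sum_{i=1}^{n} p_i = 1$. Then
$$P(A) = \sum_{i : \omega_i \in A} p_i.$$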
This is clean, but it cannot model continuous random variables, infinite sequences, stochastic processes, Brownian motion, or limit theorems without extension.
The fatal issue is continuous probability. A uniform random point $U \in [0, 1]$ should satisfy $P(U = x) = 0$ for each individual $x$, but $P(U \in [0, 1]) = 1$. This probability cannot be recovered by summing point masses. The model must assign mass to sets such as intervals and then extend to a suitable event family. Countable additivity, measurable sets, and integration become unavoidable once probability must survive infinite limiting operations.
1.5 The need for a carrier: sample space, event space, probability law
A probability claim requires three layers. The sample space $\Omega$ lists possible outcomes. The event space $\mathcal{F}$ specifies which subsets of $\Omega$ are measurable events. The probability law $P$ assigns mass to those events. The formal object is
$$(\Omega, \mathcal{F}, P).$$
Leaving out $\mathcal{F}$ is harmless only in finite or countable models where $\mathcal{F} = 2^{\Omega}$ is usually acceptable. In continuous models, the full power set may include nonmeasurable subsets, so $\mathcal{F}$ must be restricted.
The carrier also determines which random variables exist. A random variable is not just any function $X : \Omega \to \mathbb{R}$; it must be measurable, meaning inverse images of measurable target events are events in $\mathcal{F}$. For real-valued $X$, this means
$$\{\omega \in \Omega : X(\omega) \le x\} \in \mathcal{F}$$
for every $x \in \mathbb{R}$. This condition ensures that statements about $X$ have probabilities.
1.6 Common category errors: outcome ≠ event, random variable ≠ value, law ≠ sample
An outcome is a point $\omega \in \Omega$. An event is a measurable subset $A \subseteq \Omega$, that is, $A \in \mathcal{F}$. Confusing the two causes errors such as assigning probability to a realized value instead of to the event that the value occurs. A random variable $X$ is a function on $\Omega$, not the number obtained after realization. Its value $X(\omega)$ is random before $\omega$ is fixed and deterministic after $\omega$ is fixed.
The law of $X$ is also not the same thing as $X$. The law is the pushforward measure
$$P_X(B) = P(X^{-1}(B)) = P(X \in B).$$
Two random variables can have the same law while living on different spaces or while being dependent in different ways on other variables. Equality in distribution,
$$X \overset{d}{=} Y,$$
does not imply $X = Y$ pointwise or almost surely. That implication requires a common carrier and an equality statement inside that carrier.
Chapter 2 — Finite Probability Spaces
2.1 Sample spaces
In finite probability, the sample space is a finite set
$$\Omega = \{\omega_1, \dots, \omega_n\}.$$
Each $\omega_i$ represents a possible outcome. For one die, $\Omega = \{1, 2, 3, 4, 5, 6\}$. For two ordered dice, $\Omega = \{1, \dots, 6\}^2$. The choice of $\Omega$ encodes what distinctions the model regards as real. Ordered dice and unordered dice are different sample spaces; using the wrong one changes the probabilities unless the law is adjusted.
Finite sample spaces are useful because every subset can be treated as measurable:
$$\mathcal{F} = 2^{\Omega}.$$
This removes measurability issues and lets probability be introduced as weighted counting. But the simplicity hides the later distinction between outcome space and event space, which becomes load-bearing in infinite models.
2.2 Events as subsets
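An event is a subset $A \subseteq \Omega$. If $\Omega = \{1, \dots, 6\}$, then “even” is $\{2, 4, 6\}$, “greater than four” is $\{5, 6\}$, and “even and greater than four” is $\{6\}$. Logical operations translate into set operations:
$$\text{not } A \leftrightarrow A^c, \qquad A \text{ or } B \leftrightarrow A \cup B, \qquad A \text{ and } B \leftrightarrow A \cap B.$$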
This translation is the first formalization step of probability.
In finite spaces, the event algebra is Boolean. It is closed under complements, finite unions, and finite intersections. Since the space is finite, countable operations reduce to finite operations after repetitions are removed. Thus finite probability avoids the countability boundary that later forces the use of σ-algebras.
2.3 Probability mass functions
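A probability mass function $p : \Omega \to [0, 1]$ assigns a number $p(\omega)$ to each outcome such that
$$p(\omega) \ge 0, \qquad \sum_{\omega \in \Omega} p(\omega) = 1.$$
The probability of an event $A$ is then
$$P(A) = \sum_{\omega \in A} p(\omega).$$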
This is the finite version of integration: events are measured by summing masses over their elements.
The mass function determines the law completely. If all masses are equal, $p(\omega) = 1/|\Omega|$, the model is uniform. If the masses differ, the model is weighted. For a biased coin with $\Omega = \{H, T\}$, one may set $p(H) = \theta$, $p(T) = 1 - \theta$ for some $\theta \in [0, 1]$. The formalism is the same; only the mass function changes.
2.4 Uniform probability and counting
Uniform probability is the special case where all outcomes are equally likely. Then
$$P(A) = \frac{|A|}{|\Omega|}.$$
This is the bridge between counting and probability. For two fair dice, $|\Omega| = 36$, and the event “sum equals seven” has six outcomes:
$$(1,6),\ (2,5),\ (3,4),\ (4,3),\ (5,2),\ (6,1),$$
so its probability is $6/36 = 1/6$.
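The counting rule $P(A) = |A|/|\Omega|$ can be checked by brute-force enumeration. The following is a minimal sketch for two ordered fair dice; the names `omega` and `uniform_prob` are illustrative and not part of the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for two ordered fair dice: 36 equally likely pairs.
omega = list(product(range(1, 7), repeat=2))

def uniform_prob(event):
    """Return |A| / |Omega| where A = {w in omega : event(w) is True}."""
    favorable = [w for w in omega if event(w)]
    return Fraction(len(favorable), len(omega))

# Event "sum equals seven".
p_sum_seven = uniform_prob(lambda w: w[0] + w[1] == 7)
print(p_sum_seven)  # 1/6
```

Exact rational arithmetic via `Fraction` avoids floating-point noise, which is convenient when comparing against hand-computed values.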
Uniform models are powerful but dangerous. Equal likelihood must be justified by the modeling carrier, not by aesthetic symmetry alone. Two ways of parametrizing outcomes may produce different apparent uniform distributions. For example, choosing a random chord by choosing endpoints uniformly is not the same as choosing a midpoint uniformly. Uniformity is always uniformity with respect to a declared carrier.
2.5 Complements, unions, intersections
The complement rule is
$$P(A^c) = 1 - P(A).$$
For two events, the union rule is
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
The subtraction prevents double-counting the overlap. If $A \cap B = \emptyset$, the events are disjoint and the formula reduces to additivity:
$$P(A \cup B) = P(A) + P(B).$$
Intersections encode simultaneous occurrence. If $A$ is “even” and $B$ is “greater than four,” then $A \cap B$ is “even and greater than four.” Probability theory often reduces verbal problems to algebra over complements, unions, and intersections. The event grammar is not optional; it is the syntax of probabilistic reasoning.
2.6 Inclusion–exclusion
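For three events,
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$
The general inclusion–exclusion formula is
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} P(A_{i_1} \cap \cdots \cap A_{i_k}).$$
Inclusion–exclusion is exact but can be computationally expensive. Its truncated versions give bounds, such as the union bound
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \le \sum_{i=1}^{n} P(A_i).$$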
This inequality is often more important than the exact formula because it scales to large systems where full overlap data are unavailable.
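As a sanity check, inclusion–exclusion and the union bound can be compared on a small uniform space. The sketch below assumes a single fair die and three arbitrarily chosen events; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

omega = set(range(1, 7))  # one fair die, uniform law (assumed for illustration)

def prob(event):
    """Uniform probability of a subset of omega."""
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events):
    """P(union of events) via the alternating sum over all nonempty subfamilies."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = (-1) ** (k + 1)
        for group in combinations(events, k):
            total += sign * prob(set.intersection(*group))
    return total

events = [{2, 4, 6}, {5, 6}, {1, 2, 3}]
exact = inclusion_exclusion(events)
direct = prob(set.union(*events))
union_bound = sum(prob(e) for e in events)
print(exact, direct, union_bound)  # 1 1 4/3
```

The alternating sum visits all $2^n - 1$ nonempty subfamilies, which is why the union bound is preferred when $n$ is large.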
2.7 Conditional probability
Conditional probability restricts attention to a known event $B$ with $P(B) > 0$. It is defined by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
This is not a new primitive in finite probability; it is a renormalized probability law on $B$. The conditional law satisfies $P(B \mid B) = 1$ and assigns zero mass outside $B$.
Conditioning changes the carrier. The relevant sample space becomes $B$, and probabilities are rescaled. If a die roll is known to be even, the probability that it is greater than four is
$$P(\{5, 6\} \mid \{2, 4, 6\}) = \frac{P(\{6\})}{P(\{2, 4, 6\})} = \frac{1/6}{1/2} = \frac{1}{3}.$$
The calculation is simple; the important point is that conditioning is event-space restriction plus normalization.
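The "restriction plus normalization" reading can be written out directly. This is a minimal sketch for one fair die, with illustrative function names.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # one fair die, uniform law

def prob(event):
    """Uniform probability of the part of `event` lying inside omega."""
    return Fraction(len(event & omega), len(omega))

def conditional(a, b):
    """P(a | b): restrict to b, then renormalize by P(b)."""
    if prob(b) == 0:
        raise ValueError("conditioning event must have positive probability")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
greater_than_four = {5, 6}
print(conditional(greater_than_four, even))  # 1/3
```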
2.8 Bayes’ theorem
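Bayes’ theorem follows directly from the symmetry of intersection:
$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A).$$
Thus
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
If $A_1, \dots, A_n$ is a partition of $\Omega$ into events of positive probability, then
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}.$$
The theorem transports probability from the causal or generative direction $P(B \mid A)$ into the diagnostic direction $P(A \mid B)$. The numerator combines likelihood and prior; the denominator normalizes over all alternatives. The main error is to confuse $P(A \mid B)$ with $P(B \mid A)$. Bayes’ theorem states exactly how they differ.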
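The partition form of Bayes' theorem is a two-step computation: multiply prior by likelihood, then normalize. The sketch below uses made-up numbers (a prior of 1/100 and likelihoods 9/10 and 1/10) purely for illustration; `posterior` is a hypothetical helper.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return [P(A_i | B)] given priors [P(A_i)] and likelihoods [P(B | A_i)]."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) P(A_i)
    evidence = sum(joint)                                  # P(B) by total probability
    if evidence == 0:
        raise ValueError("B has probability zero; the posterior is undefined")
    return [j / evidence for j in joint]

# Illustrative two-block partition {A_1, A_2}; the numbers are assumptions.
priors = [Fraction(1, 100), Fraction(99, 100)]
likelihoods = [Fraction(9, 10), Fraction(1, 10)]
post = posterior(priors, likelihoods)
print(post[0])  # 1/12
```

Here $P(B \mid A_1) = 9/10$ while $P(A_1 \mid B) = 1/12$, a concrete instance of how far apart the two conditional directions can be.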
2.9 Independence of events
Events $A$ and $B$ are independent if
$$P(A \cap B) = P(A)\,P(B).$$
If $P(B) > 0$, this is equivalent to
$$P(A \mid B) = P(A).$$
Independence means learning $B$ does not change the probability of $A$. It does not mean the events are disjoint. In fact, disjoint positive-probability events are negatively dependent, since $P(A \cap B) = 0$ while $P(A)\,P(B) > 0$.
Independence is a structural certificate, not a feeling of unrelatedness. It must be derived from the model. In product spaces, events depending on different coordinates are independent when the probability law factors. Without such factorization, independence is an assertion requiring proof.
2.10 Pairwise versus mutual independence
A family $A_1, \dots, A_n$ is pairwise independent if every pair satisfies
$$P(A_i \cap A_j) = P(A_i)\,P(A_j), \qquad i \ne j.$$
It is mutually independent if for every subfamily $I \subseteq \{1, \dots, n\}$,
$$P\Big(\bigcap_{i \in I} A_i\Big) = \prod_{i \in I} P(A_i).$$
Mutual independence is stronger.
The distinction is not technical decoration. Three events can be pairwise independent but not mutually independent. For two fair coin tosses, let $A$ be “first toss heads,” $B$ be “second toss heads,” and $C$ be “the tosses agree.” Each pair is independent, but $P(A \cap B \cap C) = 1/4 \ne 1/8 = P(A)\,P(B)\,P(C)$, so the triple intersection does not factor as the product of the three probabilities. Pairwise checks do not certify full product structure.
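The two-coin example can be verified by enumerating the four outcomes. This is a small sketch; the predicate names mirror the events "first toss heads," "second toss heads," and "the tosses agree."

```python
from fractions import Fraction
from itertools import combinations, product

omega = list(product("HT", repeat=2))  # two fair coin tosses, uniform law

def prob(event):
    """Uniform probability of {w : event(w)}."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

first_heads = lambda w: w[0] == "H"
second_heads = lambda w: w[1] == "H"
tosses_agree = lambda w: w[0] == w[1]
events = [first_heads, second_heads, tosses_agree]

# Check the product rule for every pair.
pairwise_ok = True
for e, f in combinations(events, 2):
    both = prob(lambda w: e(w) and f(w))
    pairwise_ok = pairwise_ok and (both == prob(e) * prob(f))

# Check the product rule for the full triple.
triple = prob(lambda w: all(e(w) for e in events))
product_of_three = prob(first_heads) * prob(second_heads) * prob(tosses_agree)
print(pairwise_ok, triple, product_of_three)  # True 1/4 1/8
```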
2.11 Finite random variables
A finite random variable is a function $X : \Omega \to S$, often with $S \subseteq \mathbb{R}$. Its distribution is
$$P_X(x) = P(X = x) = \sum_{\omega : X(\omega) = x} p(\omega).$$
The variable compresses outcomes into values. Multiple outcomes may produce the same value, so the law of $X$ is generally coarser than the law on $\Omega$.
A random variable is not random because the function changes; $X$ is fixed. Randomness enters through the random choice of $\omega$. Once $\omega$ is realized, $X(\omega)$ is deterministic. This distinction is essential when defining functions of random variables: if $Y = g(X)$, then $Y(\omega) = g(X(\omega))$, and its law is obtained by pushing $P_X$ through $g$.
2.12 Expectation as weighted average
For a real-valued finite random variable $X$, expectation is
$$E[X] = \sum_{\omega \in \Omega} X(\omega)\,p(\omega).$$
Equivalently, summing over values,
$$E[X] = \sum_{x} x\,P(X = x).$$
Expectation is the center of mass of the distribution, not necessarily a typical or possible value. A die has expectation $3.5$, though no roll equals $3.5$.
Expectation is linear:
$$E[aX + bY] = a\,E[X] + b\,E[Y],$$
whether or not $X$ and $Y$ are independent. This is one of the most powerful facts in elementary probability because it lets global averages be computed without full distributional knowledge.
2.13 Variance, covariance, correlation
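Variance measures squared deviation from the mean:
$$\operatorname{Var}(X) = E\big[(X - E[X])^2\big].$$
The computational identity is
$$\operatorname{Var}(X) = E[X^2] - (E[X])^2.$$
Variance is nonnegative and equals zero exactly when $X$ is almost surely constant.
Covariance is
$$\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y].$$
Correlation normalizes covariance:
$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}},$$
defined when both variances are positive.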
Independence implies zero covariance when expectations exist, but zero covariance does not imply independence except in special families such as jointly Gaussian variables.
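A standard counterexample shows that zero covariance does not force independence: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The sketch below computes the covariance and one joint probability exactly; the example itself is an illustration chosen here, not one taken from the text.

```python
from fractions import Fraction

support = [-1, 0, 1]   # X is uniform on these three values (assumed example)
mass = Fraction(1, 3)

def expect(g):
    """E[g(X)] for X uniform on `support`."""
    return sum(mass * g(x) for x in support)

# Y = X^2, so XY = X^3.
e_x = expect(lambda x: x)
e_y = expect(lambda x: x * x)
cov = expect(lambda x: x * x * x) - e_x * e_y

# Compare P(X = 0, Y = 0) with P(X = 0) P(Y = 0).
joint = expect(lambda x: 1 if (x == 0 and x * x == 0) else 0)
product_of_marginals = expect(lambda x: 1 if x == 0 else 0) * expect(lambda x: 1 if x * x == 0 else 0)
print(cov, joint, product_of_marginals)  # 0 1/3 1/9
```

The covariance vanishes by symmetry, yet $Y$ is a deterministic function of $X$, so the joint law does not factor.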
2.14 Indicator variables
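The indicator of an event $A$ is
$$\mathbf{1}_A(\omega) = \begin{cases} 1, & \omega \in A, \\ 0, & \omega \notin A. \end{cases}$$
Its expectation is
$$E[\mathbf{1}_A] = P(A).$$
This turns probability questions into expectation questions and counting questions into sums of indicators.
If $N = \sum_{i=1}^{n} \mathbf{1}_{A_i}$ counts how many of the events $A_1, \dots, A_n$ occur, then
$$E[N] = \sum_{i=1}^{n} P(A_i).$$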
No independence is required. This method is the backbone of probabilistic combinatorics, occupancy estimates, random graph thresholds, and existence arguments.
2.15 Linearity of expectation
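Linearity states that for finite random variables,
$$E\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E[X_i].$$
It does not require independence. If $X_i = \mathbf{1}_{A_i}$ are indicators, the formula says that the expected count of successful events equals the sum of their individual probabilities.
The power of linearity is that it avoids dependence. To compute the expected number of fixed points in a uniformly random permutation $\pi$ of $\{1, \dots, n\}$, define $X_i = \mathbf{1}_{\{\pi(i) = i\}}$. Then $E[X_i] = 1/n$, so
$$E\Big[\sum_{i=1}^{n} X_i\Big] = n \cdot \frac{1}{n} = 1.$$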
The indicators are dependent, but expectation ignores that dependence for first-moment purposes.
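For small $n$ the claim that a uniformly random permutation has one fixed point on average can be confirmed by exhaustive enumeration. This is a brute-force sketch (factorial cost), intended only as a check; the function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of fixed points over all n! permutations of range(n)."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(p) if i == v) for p in perms)
    return Fraction(total, len(perms))

for n in range(1, 7):
    print(n, expected_fixed_points(n))  # the second column is always 1
```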
2.16 First probabilistic existence arguments
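The first moment method uses expectation to prove existence. If $X : \Omega \to \mathbb{R}$ and $E[X] \ge a$, then there exists an outcome $\omega$ with $X(\omega) \ge a$; otherwise $X < a$ everywhere would force $E[X] < a$. Similarly, if $E[X] < 1$ for a nonnegative integer-valued $X$, then there is an outcome with $X(\omega) = 0$.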
This converts random construction into deterministic existence. One defines a random object, counts desired or bad features by a random variable, computes its expectation, and concludes that some object has at least or at most the expected performance. The method gives existence without necessarily giving an efficient construction.
Chapter 3 — Counting and Discrete Models
3.1 Permutations and combinations
A permutation is an ordering. The number of permutations of $n$ distinct objects is
$$n! = n(n-1)\cdots 2 \cdot 1.$$
The number of ordered selections of $k$ distinct objects from $n$ is
$$\frac{n!}{(n-k)!} = n(n-1)\cdots(n-k+1).$$
A combination is an unordered selection. The number of $k$-element subsets of an $n$-element set is
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$
The division by $k!$ removes the ordering of the selected elements. These formulas underlie finite uniform probability, since many events are counted by counting favorable selections divided by total selections.
3.2 Multinomial coefficients
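The multinomial coefficient counts ways to split $n$ labeled objects into $r$ labeled boxes of sizes $k_1, \dots, k_r$, where $k_1 + \cdots + k_r = n$:
$$\binom{n}{k_1, \dots, k_r} = \frac{n!}{k_1! \cdots k_r!}.$$
It appears in the expansion
$$(x_1 + \cdots + x_r)^n = \sum_{k_1 + \cdots + k_r = n} \binom{n}{k_1, \dots, k_r}\, x_1^{k_1} \cdots x_r^{k_r}.$$
In probability, multinomial coefficients describe counts from repeated categorical trials. If each trial produces category $i$ with probability $p_i$, then the probability of counts $(k_1, \dots, k_r)$ in $n$ independent trials is
$$P(N_1 = k_1, \dots, N_r = k_r) = \binom{n}{k_1, \dots, k_r}\, p_1^{k_1} \cdots p_r^{k_r}.$$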
3.3 Occupancy problems
Occupancy problems distribute balls into boxes. If $m$ balls are placed independently and uniformly into $n$ boxes, the sample space has size $n^m$. The occupancy number $N_j$ of box $j$ has binomial distribution:
$$N_j \sim \operatorname{Bin}(m, 1/n).$$
The joint distribution of occupancies is multinomial:
$$(N_1, \dots, N_n) \sim \operatorname{Multinomial}(m;\ 1/n, \dots, 1/n).$$
Occupancy models encode hashing, allocation, collisions, load balancing, coupon collection, and random mappings. They are simple enough for exact counting but rich enough to display threshold behavior. For example, the expected number of empty boxes is
$$n\Big(1 - \frac{1}{n}\Big)^{m}.$$
3.4 Balls into bins
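The balls-into-bins model studies the load profile after random allocation. With $m$ balls and $n$ bins, each bin load has mean $m/n$. Indicator methods compute many structural statistics. Let $I_j$ indicate that bin $j$ is empty. Then
$$E[I_j] = \Big(1 - \frac{1}{n}\Big)^{m}, \qquad E\Big[\sum_{j=1}^{n} I_j\Big] = n\Big(1 - \frac{1}{n}\Big)^{m} \approx n e^{-m/n}.$$
The maximum load is subtler because it depends on tail probabilities and dependence. When $m = n$, the maximum load is of order
$$\frac{\log n}{\log \log n}$$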
with high probability. This is one of the first places where expectation alone is insufficient; concentration and tail bounds are needed to control extremes.
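The expected number of empty bins, $n(1 - 1/n)^m$ for $m$ balls thrown uniformly into $n$ bins, can be checked against exhaustive enumeration of all $n^m$ assignments for small parameters. The function names below are illustrative.

```python
from fractions import Fraction
from itertools import product

def expected_empty_bins_exact(m, n):
    """Average number of empty bins over all n**m equally likely assignments."""
    total_empty = 0
    for assignment in product(range(n), repeat=m):  # assignment[i] = bin of ball i
        total_empty += n - len(set(assignment))
    return Fraction(total_empty, n ** m)

def expected_empty_bins_formula(m, n):
    """Indicator-method value: n * (1 - 1/n)**m."""
    return n * (1 - Fraction(1, n)) ** m

print(expected_empty_bins_exact(4, 3), expected_empty_bins_formula(4, 3))  # 16/27 16/27
```

The indicators of different bins being empty are dependent, but linearity of expectation makes the formula exact anyway.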
3.5 Coupon collector
The coupon collector problem asks how many independent uniform samples from $n$ coupon types are needed to see all types. Let $T$ be the completion time. Decompose
$$T = \sum_{k=0}^{n-1} T_k,$$
where $T_k$ is the waiting time to collect a new coupon after $k$ distinct coupons have already appeared. Then $T_k$ is geometric with success probability $(n-k)/n$, so
$$E[T] = \sum_{k=0}^{n-1} \frac{n}{n-k} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n \sim n \log n.$$
The problem shows the difference between mean scale and high-probability completion. Around $n \log n + cn$, the probability of completion approaches a nontrivial limit, $P(T \le n \log n + cn) \to e^{-e^{-c}}$. The final coupons dominate the waiting time.
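The expected completion time is a sum of geometric means, one per stage: with $k$ distinct coupons already seen out of $n$, the next new coupon takes $n/(n-k)$ draws on average. A minimal exact computation, with an illustrative function name:

```python
from fractions import Fraction

def expected_completion_time(n):
    """E[T] = sum over k = 0..n-1 of n / (n - k), which equals n * H_n."""
    return sum(Fraction(n, n - k) for k in range(n))

print(expected_completion_time(6))         # 147/10
print(float(expected_completion_time(6)))  # 14.7
```

For six coupon types (for example, the faces of a die) the mean is 14.7 draws, with the last stage alone contributing 6 of them.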
3.6 Birthday problem
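The birthday problem asks for the probability of at least one collision among $k$ independent uniform samples from $n$ possible birthdays. The no-collision probability is
$$P(\text{no collision}) = \prod_{i=0}^{k-1} \Big(1 - \frac{i}{n}\Big).$$
For $k \ll n$, use $\log(1 - x) \approx -x$:
$$P(\text{no collision}) \approx \exp\Big(-\sum_{i=0}^{k-1} \frac{i}{n}\Big) = \exp\Big(-\frac{k(k-1)}{2n}\Big).$$
The threshold occurs when $k^2/n$ is order one, so $k \asymp \sqrt{n}$. This square-root threshold appears in hashing, cryptography, random graphs, and collision search. The lesson is that rare pairwise coincidences become likely when the number of pairs is large.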
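The exact no-collision product and its exponential approximation can be compared numerically. The sketch below uses the classical parameters of 365 birthdays and 23 people as an illustration; the function names are hypothetical.

```python
from math import exp

def collision_exact(k, n):
    """1 - prod_{i=0}^{k-1} (1 - i/n): exact collision probability."""
    no_collision = 1.0
    for i in range(k):
        no_collision *= 1 - i / n
    return 1 - no_collision

def collision_approx(k, n):
    """1 - exp(-k(k-1)/(2n)): first-order approximation for k much smaller than n."""
    return 1 - exp(-k * (k - 1) / (2 * n))

exact = collision_exact(23, 365)
approx = collision_approx(23, 365)
print(round(exact, 4), round(approx, 4))
```

With 23 samples there are $\binom{23}{2} = 253$ pairs, which is why a collision is already more likely than not.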
3.7 Hypergeometric model
The hypergeometric distribution describes sampling without replacement. If a population has $N$ objects, $K$ successes, and $N - K$ failures, then drawing $n$ objects without replacement gives
$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}.$$
The expectation is
$$E[X] = n\,\frac{K}{N}.$$
Unlike the binomial model, draws are dependent. Observing a success slightly reduces the chance of another success. Nevertheless, the expectation matches the binomial expectation with $p = K/N$. When $N$ is large compared with $n$, the hypergeometric distribution is close to binomial because sampling without replacement approximates sampling with replacement.
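The hypergeometric mass function and its mean can be checked with exact arithmetic. The population size, success count, and sample size below (20, 7, 5) are arbitrary illustrative values.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n objects without replacement from N, K of them successes."""
    if k < 0 or k > n or k > K or n - k > N - K:
        return Fraction(0)  # outside the support
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))

N, K, n = 20, 7, 5  # illustrative parameters
pmf = {k: hypergeom_pmf(k, N, K, n) for k in range(n + 1)}
total = sum(pmf.values())
mean = sum(k * q for k, q in pmf.items())
print(total, mean)  # 1 7/4
```

The mean equals $nK/N = 5 \cdot 7/20 = 7/4$, the same value a binomial model with $p = K/N$ would give.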
3.8 Binomial model
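The binomial distribution counts successes in $n$ independent Bernoulli trials with success probability $p$:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \dots, n.$$
Its mean and variance are
$$E[X] = np, \qquad \operatorname{Var}(X) = np(1-p).$$
The binomial model is the canonical finite independent-sum model. Its generating function is
$$G(s) = E[s^X] = (1 - p + ps)^n.$$
For large $n$, it has different asymptotic regimes: normal approximation when $np(1-p)$ is large, Poisson approximation when $n \to \infty$, $p \to 0$, and $np \to \lambda$, and large-deviation behavior far from the mean.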
3.9 Geometric and negative binomial models
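The geometric distribution models the waiting time $T$ until the first success in independent Bernoulli trials. With success probability $p$,
$$P(T = k) = (1-p)^{k-1} p, \qquad k = 1, 2, \dots,$$
and
$$P(T > k) = (1-p)^k, \qquad E[T] = \frac{1}{p}.$$
It has the memoryless property:
$$P(T > m + n \mid T > m) = P(T > n).$$
The negative binomial distribution models the number of trials needed to obtain $r$ successes. If $T_r$ is the trial of the $r$-th success,
$$P(T_r = n) = \binom{n-1}{r-1} p^r (1-p)^{n-r}, \qquad n = r, r+1, \dots$$
It is a sum of $r$ independent geometric waiting times.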
3.10 Poisson approximation in finite settings
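The Poisson distribution with parameter $\lambda > 0$ is
$$P(X = k) = e^{-\lambda}\,\frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \dots$$
It approximates sums of many rare, weakly dependent indicators. The simplest limit is binomial:
$$\operatorname{Bin}(n, \lambda/n) \to \operatorname{Poisson}(\lambda) \quad \text{as } n \to \infty.$$
Indeed, for each fixed $k$,
$$\binom{n}{k} \Big(\frac{\lambda}{n}\Big)^{k} \Big(1 - \frac{\lambda}{n}\Big)^{n-k} \longrightarrow e^{-\lambda}\,\frac{\lambda^k}{k!}.$$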
Finite Poisson approximation is often justified by Chen–Stein methods or by bounding dependence neighborhoods. The heuristic is that if events are individually rare and do not cluster too strongly, their count behaves approximately Poisson. The failure mode is hidden dependence that creates clumping.
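The binomial-to-Poisson limit can be observed numerically by comparing mass functions at fixed $\lambda$ for growing $n$. This is a rough sketch with $\lambda = 2$ chosen arbitrarily; it measures only the largest pointwise gap over small $k$, not total variation distance.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial(n, p) mass at k."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson(lam) mass at k."""
    return exp(-lam) * lam ** k / factorial(k)

def max_pmf_gap(n, lam, k_max=50):
    """Largest |Bin(n, lam/n)(k) - Poisson(lam)(k)| over k = 0..min(n, k_max)."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(min(n, k_max) + 1))

gap_small = max_pmf_gap(10, 2.0)
gap_large = max_pmf_gap(1000, 2.0)
print(gap_small > gap_large)  # True
```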
3.11 Random graphs: first finite examples
The Erdős–Rényi graph $G(n, p)$ has vertex set $\{1, \dots, n\}$, with each possible edge included independently with probability $p$. The number of edges is
$$|E| \sim \operatorname{Bin}\Big(\binom{n}{2},\, p\Big),$$
so
$$E\big[|E|\big] = \binom{n}{2}\, p.$$
Subgraph counts are sums of indicators. For triangles,
$$E[\#\text{triangles}] = \binom{n}{3}\, p^3.$$
Random graphs demonstrate thresholds. A property may be unlikely below a threshold scale of $p$ and likely above it. For example, isolated vertices disappear around $p = \log n / n$, and graph connectivity emerges at the same scale. Random graphs turn combinatorial existence into probabilistic mass.
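The triangle expectation $\binom{n}{3} p^3$ follows from linearity, and for very small graphs it can be confirmed by summing over every possible edge set with its exact weight. The sketch below does this for $n = 4$ and an arbitrary $p = 1/3$; it is exponential in the number of edges and is meant only as a check.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def expected_triangles(n, p):
    """Exact E[#triangles] in G(n, p) by enumerating all 2**C(n,2) edge sets."""
    edges = list(combinations(range(n), 2))    # pairs (a, b) with a < b
    triples = list(combinations(range(n), 3))  # triples (a, b, c) with a < b < c
    total = Fraction(0)
    for included in product([False, True], repeat=len(edges)):
        present = {e for e, keep in zip(edges, included) if keep}
        weight = p ** len(present) * (1 - p) ** (len(edges) - len(present))
        triangles = sum(1 for a, b, c in triples
                        if {(a, b), (a, c), (b, c)} <= present)
        total += weight * triangles
    return total

p = Fraction(1, 3)  # illustrative edge probability
print(expected_triangles(4, p), comb(4, 3) * p ** 3)  # 4/27 4/27
```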
3.12 Counting as probability and probability as counting
Counting and probability are dual in finite uniform spaces:
$$P(A) = \frac{|A|}{|\Omega|}, \qquad |A| = |\Omega| \cdot P(A).$$
This means a probability estimate can imply a counting estimate and a counting estimate can imply a probability estimate. Many combinatorial arguments choose a random object uniformly, estimate the probability it has a property, and then multiply by the total number of objects.
The duality weakens in nonuniform spaces but survives through weighted counting. A probability law is a weighted counting scheme. The conceptual shift is from cardinality to measure: finite probability is counting with weights, while measure-theoretic probability is infinite weighted event geometry.
Chapter 4 — Why Measure Theory Enters Probability
4.1 Failure of purely discrete models
Purely discrete models assign probability by summing atomic masses:
$$P(A) = \sum_{\omega \in A} p(\omega).$$
This works when the sample space is finite or countable and the total mass is distributed over atoms. It fails for distributions such as uniform measure on $[0, 1]$, where each singleton should have mass zero but the whole interval has mass one. A countable sum of zero masses remains zero, so nonatomic probability cannot be represented by pointwise mass summation.
The issue is not merely continuous variables. Infinite sequences also strain discrete intuition. A sequence of independent fair coin flips has sample space $\{0, 1\}^{\mathbb{N}}$, uncountable in size. Each individual infinite sequence has probability zero, yet the full space has probability one. Events such as “infinitely many heads occur” require countable intersections and unions. Measure theory is the carrier that handles such events.
4.2 Continuous variables and zero-probability points
For a continuous random variable $X$ with density $f$, probabilities are assigned by integration:
$$P(a \le X \le b) = \int_a^b f(x)\,dx.$$
For a point,
$$P(X = x) = \int_x^x f(t)\,dt = 0.$$
Thus individual outcomes can be impossible in the measure sense while one of them must occur. This is not a contradiction; probability zero does not mean logical impossibility.
The event $\{X = x\}$ has zero mass for each fixed $x$, but
$$\Omega = \bigcup_{x \in \mathbb{R}} \{X = x\}$$
is an uncountable union. Countable additivity does not license summing over uncountably many null events. This is one of the fundamental reasons the σ-algebra and countability boundary matter.
4.3 Countable additivity versus finite additivity
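Finite additivity states that for disjoint $A_1, \dots, A_n$,
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} P(A_i).$$
Countable additivity extends this to countably many disjoint events:
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$
This property is essential for limits. If $A_n \uparrow A$, then countable additivity implies continuity from below:
$$P(A_n) \uparrow P(A).$$
If $A_n \downarrow A$, then
$$P(A_n) \downarrow P(A),$$
provided $P(A_1) < \infty$, automatically true in probability spaces.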
Finite additivity can support some decision-theoretic frameworks, but it does not give the same convergence machinery. Limit theorems, Borel–Cantelli, product processes, Lebesgue integration, and conditional expectation all depend on countable additivity.
4.4 Legal events and illegal events
In finite spaces, every subset is legal. In continuous spaces, not every subset can be assigned probability while preserving desirable properties such as countable additivity and translation invariance. Nonmeasurable sets exist under standard set-theoretic assumptions. Therefore probability is defined only on a σ-algebra $\mathcal{F}$.
A legal event is an element of $\mathcal{F}$. An illegal event is a subset of $\Omega$ outside $\mathcal{F}$. The statement $P(A) = p$ is meaningful only if $A \in \mathcal{F}$. This restriction is not a defect but a consistency condition. It prevents the probability calculus from being forced to assign incompatible masses to pathological sets.
4.5 The σ-algebra as event grammar
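A σ-algebra $\mathcal{F}$ on $\Omega$ is a family of subsets satisfying:
$$\Omega \in \mathcal{F}; \qquad A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F}; \qquad A_1, A_2, \dots \in \mathcal{F} \Rightarrow \bigcup_{n=1}^{\infty} A_n \in \mathcal{F}.$$
It follows that $\mathcal{F}$ is also closed under countable intersections. Thus σ-algebras support countable logical operations.
The σ-algebra is the grammar of probabilistic propositions. It determines which statements about the outcome are admissible. If $X$ is a random variable, then events such as $\{X \le x\}$, $\{X \in B\}$, and $\{a < X < b\}$ must belong to $\mathcal{F}$ before their probabilities can be discussed.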
4.6 The probability measure as normalized measure
A probability measure is a measure with total mass one. Formally,
$$P : \mathcal{F} \to [0, 1]$$
satisfies $P(\emptyset) = 0$, $P(\Omega) = 1$, and countable additivity over disjoint events. The normalization distinguishes probability from general measure theory, where total mass may be finite, infinite, or σ-finite.
This normalization makes probability a calculus of relative mass. Conditional probability, expectation, variance, and distribution all depend on the measure. The probability measure is not merely a list of event weights; it is the structural object that supports integration, pushforward, product construction, and convergence.
4.7 Null sets and almost sure reasoning
A null set is an event $N$ with $P(N) = 0$. A property holds almost surely if it fails only on a null set. Thus $X = Y$ almost surely means
$$P(X \ne Y) = 0.$$
Almost sure equality is not pointwise equality. It is equality in the quotient space where null differences are ignored.
This quotient is essential in analysis. $L^p$ spaces identify random variables that differ only on null sets. Conditional expectation is unique only almost surely. Sample-path properties of stochastic processes often hold almost surely but not for every $\omega$. The null-set firewall prevents exporting measure-level statements as deterministic statements.
4.8 Probability theory as measure theory plus independence
Measure theory gives events, measures, functions, integration, products, and convergence. Probability theory adds independence, conditioning, stochastic processes, and asymptotic laws. Independence is not native to arbitrary measure theory as a philosophical concept, but it is formalized by product and factorization:
$$P(A \cap B) = P(A)\,P(B),$$
or for random variables,
$$P(X \in A,\ Y \in B) = P(X \in A)\,P(Y \in B) \quad \text{for all measurable } A, B.$$
The slogan “probability is measure theory plus independence” is accurate as a structural compression. Measure theory supplies the carrier; independence supplies probabilistic product structure; conditioning supplies information-relative projection; limit theorems supply asymptotic transport. Without measure theory, the infinite and continuous parts collapse into ad hoc rules.
Chapter 5 — Measurable Spaces and Events
5.1 Sets, algebras, σ-algebras
An algebra of sets is closed under finite unions and complements. A σ-algebra is closed under countable unions and complements. Every σ-algebra is an algebra, but not every algebra is a σ-algebra. Finite probability can often use algebras because only finite operations are needed. Modern probability needs σ-algebras because limiting events are naturally countable.
For example, if $A_n$ is the event that a process satisfies a condition at time $n$, then “the condition occurs infinitely often” is
$$\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m.$$
This expression requires countable unions and intersections. A finite algebra is not enough.
5.2 Generated σ-algebras
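Given a collection of subsets $\mathcal{C} \subseteq 2^{\Omega}$, the σ-algebra generated by $\mathcal{C}$, denoted $\sigma(\mathcal{C})$, is the smallest σ-algebra containing $\mathcal{C}$. It is the intersection of all σ-algebras that contain $\mathcal{C}$:
$$\sigma(\mathcal{C}) = \bigcap \{\mathcal{G} : \mathcal{G} \text{ is a σ-algebra on } \Omega,\ \mathcal{C} \subseteq \mathcal{G}\}.$$
Generated σ-algebras let probability specify elementary observable events and then close them under legal countable operations. For a real random variable $X$, the information it generates is
$$\sigma(X) = \{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}.$$
This is the σ-algebra of events determined by observing $X$.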
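On a finite sample space the generated σ-algebra can be computed by repeatedly closing a family under complements and pairwise unions until nothing new appears. The sketch below is a naive fixed-point iteration, workable only for very small spaces; `generate_sigma_algebra` is a hypothetical helper name.

```python
def generate_sigma_algebra(omega, generators):
    """Smallest family containing `generators`, closed under complement and union.

    On a finite omega, closure under finite unions already gives a sigma-algebra.
    """
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            candidates = [omega - a] + [a | b for b in current]
            for c in candidates:
                if c not in family:
                    family.add(c)
                    changed = True
    return family

# Observing only the parity of a die roll.
parity = generate_sigma_algebra(range(1, 7), [{2, 4, 6}])
print(sorted(sorted(s) for s in parity))
# [[], [1, 2, 3, 4, 5, 6], [1, 3, 5], [2, 4, 6]]
```

The parity observer has only four legal events; the singleton $\{6\}$ is not among them, so its probability is not defined relative to that information.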
5.3 Borel σ-algebra on topological spaces
For a topological space $(S, \tau)$, the Borel σ-algebra is generated by the open sets:
$$\mathcal{B}(S) = \sigma(\tau).$$
On $\mathbb{R}$, it is also generated by open intervals, closed intervals, half-lines $(-\infty, x]$, or rational intervals. This flexibility is useful for proving measurability.
The Borel σ-algebra is the standard event space for random variables taking values in metric or topological spaces. A real-valued random variable $X$ is measurable if $X^{-1}(B) \in \mathcal{F}$ for every Borel set $B$. It suffices to check
$$\{X \le x\} \in \mathcal{F}$$
for all $x \in \mathbb{R}$.
5.4 Product σ-algebras
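If $(S_1, \mathcal{S}_1)$ and $(S_2, \mathcal{S}_2)$ are measurable spaces, the product σ-algebra is
$$\mathcal{S}_1 \otimes \mathcal{S}_2 = \sigma\big(\{A_1 \times A_2 : A_1 \in \mathcal{S}_1,\ A_2 \in \mathcal{S}_2\}\big).$$
It is the smallest σ-algebra making coordinate projections measurable.
Product σ-algebras are required for joint random variables. If $X : \Omega \to S_1$ and $Y : \Omega \to S_2$ are measurable, then $(X, Y)$ is measurable with respect to $\mathcal{S}_1 \otimes \mathcal{S}_2$. The joint law of $(X, Y)$ is a probability measure on this product space. Independence is then expressed by factorization of that joint law.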
5.5 Completion of a measure space
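A measure space is complete if every subset of every null set is measurable. Given $(\Omega, \mathcal{F}, \mu)$, its completion adds all subsets of null sets to $\mathcal{F}$. The completed σ-algebra is
$$\overline{\mathcal{F}} = \{A \cup N : A \in \mathcal{F},\ N \subseteq Z \text{ for some } Z \in \mathcal{F} \text{ with } \mu(Z) = 0\}.$$
The measure extends by $\overline{\mu}(A \cup N) = \mu(A)$.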
Completion is natural because null subsets cannot affect probabilities. However, completion can interact subtly with product spaces and regular conditional probabilities. One must track whether the model uses Borel σ-algebras, Lebesgue completions, or completed filtrations.
5.6 Measurable subsets
A measurable subset is an event belonging to the chosen σ-algebra. In $\mathbb{R}$, Borel sets include open sets, closed sets, countable intersections of open sets, countable unions of closed sets, and many more. Lebesgue measurable sets further include completions of Borel sets by null modifications.
Measurability is a legality condition, not a size condition. A set can be dense, fractal, uncountable, or topologically complicated and still be measurable. Conversely, nonmeasurable sets are not necessarily visually exotic; their pathology lies in incompatibility with the desired measure properties. In probability, no event probability exists until measurability is established.
5.7 Countable operations on events
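If $A_1, A_2, \dots \in \mathcal{F}$, then
$$\bigcup_{n=1}^{\infty} A_n \quad \text{and} \quad \bigcap_{n=1}^{\infty} A_n$$
belong to $\mathcal{F}$. This allows one to define events such as “at least one occurs,” “all occur,” “infinitely many occur,” and “eventually all occur.”
The two standard limiting events are
$$\limsup_{n} A_n = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m$$
and
$$\liminf_{n} A_n = \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m.$$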
The first means infinitely many occur; the second means all but finitely many occur.
5.8 Uncountable operation traps
A σ-algebra is not generally closed under uncountable unions or intersections. If $A_t \in \mathcal{F}$ for each $t \in T$ and $T$ is uncountable, it does not follow that
$$\bigcup_{t \in T} A_t$$
is measurable. Sometimes it is measurable for structural reasons, but not automatically.
This matters in continuous probability. For each $x \in \mathbb{R}$, the event $\{X = x\}$ may be measurable and null, but the union over all $x$ is $\Omega$ if $X$ is real-valued. Countable additivity does not apply to that union. Any argument that sums probabilities over uncountably many events is invalid unless replaced by integration, separability, or a countable reduction.
5.9 Tail events
For a sequence of random variables $X_1, X_2, \dots$, the tail σ-algebra is
$$\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma(X_n, X_{n+1}, \dots).$$
It contains events unaffected by changing finitely many initial coordinates. Examples include convergence of averages, infinitely many occurrences, and limiting frequencies.
For independent sequences, Kolmogorov’s zero-one law states that every tail event has probability $0$ or $1$. This is a structural theorem: events depending only on the infinite tail cannot have intermediate probability under full independence. Tail σ-algebras encode asymptotic information stripped of finite initial noise.
5.10 Germ σ-algebras
A germ σ-algebra records information arbitrarily close to a point, time, or boundary. For a stochastic process $(X_s)$, the germ at time $t$ may be represented as an intersection of σ-algebras over shrinking neighborhoods:
$$\mathcal{G}_t = \bigcap_{\varepsilon > 0} \sigma\big(X_s : |s - t| < \varepsilon\big).$$
It captures infinitesimal local information rather than global trajectory information.
Germ σ-algebras arise in Markov processes, Brownian motion, stochastic calculus, and local field behavior. Their analysis often requires right-continuity, completion, separability, or regularity assumptions. The danger is to assume that infinitesimal information is trivial or maximal without proving the relevant zero-one or regularity law.
5.11 Event equivalence modulo null sets
Events $A$ and $B$ are equivalent modulo null sets if
$$P(A \triangle B) = 0,$$
where $A \triangle B = (A \setminus B) \cup (B \setminus A)$. In probability, such events are often indistinguishable because they have the same probability and differ only on a null set.
This quotient viewpoint is essential for $L^p$ spaces and conditional expectation. However, modulo-null equivalence must not be exported as literal equality when pointwise structure matters. In stochastic processes, two versions may agree at each fixed time almost surely but fail to have indistinguishable sample paths unless stronger conditions are imposed.
Chapter 6 — Probability Spaces
6.1 Probability space (Ω, 𝓕, P)
A probability space consists of a sample space $\Omega$, a σ-algebra $\mathcal{F}$, and a probability measure $P$. It is the formal carrier for all events and random variables in a model. The axioms are:
$$P(A) \ge 0 \ \text{for all } A \in \mathcal{F}, \qquad P(\Omega) = 1, \qquad P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)$$
for pairwise disjoint $A_1, A_2, \dots \in \mathcal{F}$.
Every probability expression must be interpretable in this carrier or in a declared extension/quotient. If $X$ and $Y$ are random variables on different spaces, then $P(X = Y)$ is undefined until a joint space or coupling is specified. The probability space is not background decoration; it is the domain of legal probabilistic syntax.
6.2 Atoms and nonatomic spaces
An atom is a measurable set $A$ with $P(A) > 0$ such that every measurable $B \subseteq A$ has $P(B) = 0$ or $P(B) = P(A)$. Discrete probability spaces are atomic; the atoms are often singleton outcomes with positive mass. A nonatomic space has no atoms. Lebesgue probability on $[0, 1]$ is nonatomic.
Atomic and nonatomic spaces behave differently. In atomic spaces, probabilities decompose into point masses. In nonatomic spaces, mass can be split continuously: for many standard nonatomic spaces, every $t \in [0, 1]$ is the probability of some event. Continuous randomization, uniform variables, and many coupling constructions rely on nonatomic structure.
6.3 Discrete probability measures
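A discrete probability measure is concentrated on a finite or countable set. If $P(\{x_i\}) = p_i$ for the points $x_1, x_2, \dots$ of that set, then
$$p_i \ge 0, \qquad \sum_i p_i = 1.$$
For an event $A$,
$$P(A) = \sum_{i :\, x_i \in A} p_i.$$
Discrete measures are computationally transparent. Expectation becomes summation:
$$E[g] = \sum_i g(x_i)\, p_i,$$
whenever the sum converges absolutely.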
However, discrete methods fail when no atoms carry mass, as with continuous distributions. Many probability models combine discrete and continuous components, so one must not assume all laws have densities or all laws have mass functions.
6.4 Continuous probability measures
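A continuous probability measure has no atoms, or more narrowly, may admit a density $f$ with respect to Lebesgue measure:
$$P(A) = \int_A f(x)\,dx.$$
The density must satisfy $f \ge 0$ and $\int_{\mathbb{R}} f(x)\,dx = 1$. If $X$ has density $f$, then
$$P(a \le X \le b) = \int_a^b f(x)\,dx.$$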
Not every continuous measure has a density. The Cantor distribution is nonatomic but singular with respect to Lebesgue measure. Thus “continuous” and “has a density” are different properties. The correct carrier distinction is atomic, absolutely continuous, singular, and mixtures thereof.
6.5 Singular measures
A measure $\mu$ is singular with respect to another measure $\nu$, written $\mu \perp \nu$, if there exists a measurable set $A$ such that $\mu(A^c) = 0$ and $\nu(A) = 0$. The Cantor measure is singular with respect to Lebesgue measure: it is concentrated on the Cantor set, which has Lebesgue measure zero, while having no atoms.
Singular measures show why densities are not universal. A random variable may have a continuous distribution function but no density. Any argument that differentiates a CDF or writes probabilities as $\int_A f(x)\,dx$ must verify absolute continuity. Otherwise it silently changes carrier.
6.6 Mixed distributions
A mixed distribution has both discrete and continuous components. For example,
$$\mu = p\,\delta_a + (1 - p)\,U, \qquad 0 < p < 1,$$
where $U$ is uniform measure on $[0,1]$. Then $\mu(\{a\}) = p$, while conditional on the continuous component, the mass spreads over $[0,1]$.
The general Lebesgue decomposition separates a measure into absolutely continuous, singular continuous, and atomic components relative to a reference measure. Mixed distributions appear in survival models, censored data, spike-and-slab priors, default models, and random variables with boundary masses. Treating them as purely discrete or purely continuous loses mass.
6.7 Pushforward measures
If $X : \Omega \to S$ is measurable and $P$ is a probability measure on $(\Omega, \mathcal{F})$, the pushforward law $P_X = P \circ X^{-1}$ on $(S, \mathcal{S})$ is defined by
$$P_X(B) = P(X^{-1}(B)) = P(X \in B), \qquad B \in \mathcal{S}.$$
This is the distribution of $X$. It allows one to study $X$ without retaining the entire original sample space.
Pushforward is the correct abstraction behind transformation of variables. If $Y = g(X)$, then
$$P_Y(B) = P_X(g^{-1}(B)).$$
When densities exist and $g$ is smooth and invertible, this yields the familiar Jacobian formula. But the pushforward definition works more generally, including discrete, singular, and mixed laws.
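A minimal sketch of the pushforward for a discrete law may make the definition concrete. The helper name `pushforward` and the fair-die example are illustrative assumptions, not part of the text; the point is only that mass is transported along the map, with no density or invertibility required.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(law, g):
    """Law of g(X) when X has the discrete law `law` (dict: value -> mass)."""
    out = defaultdict(Fraction)  # Fraction() is exactly 0
    for x, p in law.items():
        out[g(x)] += p  # the mass sitting at x is transported to g(x)
    return dict(out)


# Hypothetical example: a fair die, six outcomes of mass 1/6 each.
die = {k: Fraction(1, 6) for k in range(1, 7)}

# Neither map is invertible, so no Jacobian formula applies,
# yet the pushforward law is still well defined.
parity = pushforward(die, lambda k: "even" if k % 2 == 0 else "odd")
square_dist = pushforward(die, lambda k: (k - 3) ** 2)

print(parity)
print(square_dist)
```

Exact rational arithmetic is used so that the total mass of each pushed-forward law is exactly one.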
6.8 Pullback of events
Given a measurable map $X : \Omega \to S$, an event $B \in \mathcal{S}$ pulls back to
$$X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}.$$
Measurability of $X$ means Borel or measurable events in the target pull back to legal events in $\mathcal{F}$. Thus statements about $X$ become events in the original probability space.
Pullback and pushforward are dual. Pullback turns target propositions into source events; pushforward transports source probability to target laws. The equation
$$P(X \in B) = P(X^{-1}(B)) = P_X(B)$$
is the bridge. Probability of random-variable statements is always computed by pulling back the statement to the sample space or by using the pushed-forward law.
6.9 Model extension and sample-space enlargement
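A probability model may need enlargement to include new randomness. If $(\Omega, \mathcal{F}, P)$ models $X$, then adding an independent $U$ defined on $(\Omega', \mathcal{F}', P')$ may require
$$(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P}) = (\Omega \times \Omega',\ \mathcal{F} \otimes \mathcal{F}',\ P \otimes P').$$
Old events lift by projection:
$$\tilde{A} = \pi^{-1}(A) = A \times \Omega', \qquad \pi(\omega, \omega') = \omega.$$
Their probabilities are preserved:
$$\tilde{P}(\tilde{A}) = P(A).$$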
Model extension shows that sample spaces are representations, not the probabilistic objects themselves. The same event may have different set-theoretic representatives in different carriers while preserving probability. What must be invariant are the laws and joint relations explicitly required by the claim.
6.10 Product probability spaces
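Given probability spaces $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, the product space is
$$(\Omega_1 \times \Omega_2,\ \mathcal{F}_1 \otimes \mathcal{F}_2,\ P_1 \otimes P_2),$$
where
$$(P_1 \otimes P_2)(A_1 \times A_2) = P_1(A_1)\,P_2(A_2), \qquad A_1 \in \mathcal{F}_1,\ A_2 \in \mathcal{F}_2.$$
Product measure extends this rectangle rule to the product σ-algebra.
Product spaces provide the canonical carrier for independent random objects. Coordinate variables $X_1(\omega_1, \omega_2) = \omega_1$ and $X_2(\omega_1, \omega_2) = \omega_2$ are independent because their joint law factors. Dependence requires a non-product joint law.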
6.11 Infinite product spaces
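For countably many spaces $(\Omega_n, \mathcal{F}_n, P_n)$, the infinite product space is
$$\Omega = \prod_{n=1}^{\infty} \Omega_n$$
with product σ-algebra generated by cylinder sets. A cylinder event depends on finitely many coordinates. The product measure is determined by
$$P(A_1 \times \cdots \times A_n \times \Omega_{n+1} \times \Omega_{n+2} \times \cdots) = \prod_{i=1}^{n} P_i(A_i).$$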
Infinite products model independent sequences. Events such as convergence, limiting frequencies, and infinitely many occurrences are not cylinder events, but they belong to the σ-algebra generated by cylinders through countable operations. Infinite product spaces are the foundation for laws of large numbers, coin-flip sequences, and many stochastic processes.
6.12 Kolmogorov extension theorem
The Kolmogorov extension theorem constructs a probability measure on an infinite product or path space from consistent finite-dimensional distributions. If for every finite index set $J \subset T$ there is a law $\mu_J$ on $S^J$, and these laws are compatible under marginalization, then under suitable state-space hypotheses there exists a process $(X_t)_{t \in T}$ with those finite-dimensional laws.
The theorem is the bridge from finite data to process-level probability. It lets one define stochastic processes by specifying all finite joint distributions. However, it does not automatically provide regular sample paths. Continuity, càdlàg paths, and other path properties require additional arguments such as Kolmogorov continuity criteria.
6.13 Standard Borel spaces
A standard Borel space is a measurable space isomorphic to the Borel space of a complete separable metric space. Examples include $\mathbb{R}^n$, Polish spaces with their Borel σ-algebras, countable discrete spaces, and many function spaces.
Standard Borel spaces are the safe operating environment for regular conditional probabilities, disintegration, measurable selection, and extension theorems. Many pathologies disappear in this category. When probability theory states a theorem requiring “regularity of the state space,” standard Borel or Polish assumptions are often the hidden carrier condition.
6.14 Regularity of probability measures
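On well-behaved topological spaces, probability measures can be approximated by compact and open sets. A Borel probability measure $P$ on a metric space is often regular:
$$P(B) = \inf\{P(O) : O \supseteq B,\ O \text{ open}\}$$
and
$$P(B) = \sup\{P(K) : K \subseteq B,\ K \text{ compact}\}$$
for Borel $B$.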
Regularity lets measure theory interact with topology. It supports weak convergence, tightness, approximation by continuous functions, and compactness arguments. Without regularity, topological probability loses much of its analytic machinery.
Chapter 7 — Random Variables
7.1 Random variables as measurable maps
A random variable is a measurable map from a probability space into a measurable state space:
$$X : (\Omega, \mathcal{F}, P) \to (S, \mathcal{S}).$$
Measurability means
$$X^{-1}(B) \in \mathcal{F}$$
for every $B \in \mathcal{S}$. This ensures that every legal target event has a probability.
The terminology “variable” can mislead. $X$ is a fixed function. Randomness comes from the random input $\omega$. The law $P_X$ describes the distribution of values. The carrier may be changed or enlarged without changing the law of $X$, provided the pushforward measure is preserved.
7.2 Real-valued random variables
A real-valued random variable is a measurable function $X : \Omega \to \mathbb{R}$, where $\mathbb{R}$ has its Borel σ-algebra. It suffices to check that
$$\{X \le x\} \in \mathcal{F}$$
for every $x \in \mathbb{R}$. The distribution function is
$$F_X(x) = P(X \le x).$$
Real-valued variables are central because they support ordering, integration, moments, quantiles, and convergence modes. Many non-real random objects are studied through real-valued probes. For a random vector $X$ in $\mathbb{R}^d$, the laws of the linear functionals $\langle t, X \rangle$, $t \in \mathbb{R}^d$, determine the law of $X$ (the Cramér–Wold device).
7.3 Vector-valued random variables
A vector-valued random variable is a measurable map $X = (X_1, \dots, X_d) : \Omega \to \mathbb{R}^d$. Its law is a probability measure on $\mathbb{R}^d$. The coordinate variables $X_i$ are real-valued random variables, and $X$ is measurable iff all coordinates are measurable.
Joint distributions are naturally vector-valued laws. Covariance matrices, multivariate normal distributions, concentration inequalities, and random matrix theory all use vector-valued random variables. The key object is not merely the list of marginal laws but the joint law, which encodes dependence.
7.4 Random elements in measurable spaces
A random element generalizes random variables to arbitrary measurable spaces:
$$X : \Omega \to S.$$
Here $S$ may be a function space, graph space, space of measures, manifold, metric space, or distribution space. The law $P_X$ is a probability measure on $(S, \mathcal{S})$.
Random elements allow probability to model stochastic processes as single random objects taking values in path spaces. For example, Brownian motion can be treated as a random element of $C([0,\infty))$ once continuity is established. This viewpoint shifts attention from coordinate distributions to process-level laws.
7.5 Simple random variables
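A simple random variable takes finitely many values:
$$X = \sum_{i=1}^{n} a_i \mathbf{1}_{A_i},$$
where $A_i \in \mathcal{F}$ are measurable. Its expectation is
$$E[X] = \sum_{i=1}^{n} a_i P(A_i).$$
Simple variables are the building blocks of Lebesgue integration.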
Every nonnegative measurable function can be approximated increasingly by simple functions. This is the construction route from finite weighted averages to general expectation. Simple variables therefore form the bridge between elementary probability and measure-theoretic probability.
7.6 Positive random variables
A positive or nonnegative random variable satisfies $X \ge 0$. Its expectation is always defined in the extended sense:
$$E[X] \in [0, \infty].$$
It may be infinite. This avoids undefined expressions from subtracting infinities.
For nonnegative variables, Markov’s inequality holds:
$$P(X \ge a) \le \frac{E[X]}{a}, \qquad a > 0.$$
This simple inequality is a fundamental bridge from expectation to tail probability. It is often the first step in concentration, moment methods, and existence proofs.
7.7 Extended real-valued random variables
An extended real-valued random variable takes values in $[-\infty, \infty]$. Such variables appear naturally as limits, suprema, hitting times, logarithms of zero, or infima of random sets. Measurability is defined using the Borel σ-algebra on the extended real line.
Expectation of an extended real-valued $X$ is handled by positive and negative parts:
$$E[X] = E[X^+] - E[X^-].$$
The expectation is defined if $E[X^+]$ and $E[X^-]$ are not both infinite. Otherwise $E[X]$ is undefined.
7.8 Distribution of a random variable
The distribution or law of $X$ is
$$P_X(B) = P(X \in B).$$
For real $X$,
$$P_X((-\infty, x]) = F_X(x).$$
If $X$ is discrete, $P_X$ is determined by masses $P(X = x_i)$. If $X$ has density $f$, then $P_X(dx) = f(x)\,dx$.
The law forgets the original sample space and retains only the probabilities of events determined by $X$. This quotient is useful but loses coupling information. Knowing $P_X$ and $P_Y$ separately does not determine the joint law $P_{(X,Y)}$.
7.9 Cumulative distribution functions
For real-valued $X$, the cumulative distribution function is
$$F_X(x) = P(X \le x).$$
It is nondecreasing, right-continuous, and satisfies
$$\lim_{x \to -\infty} F_X(x) = 0, \qquad \lim_{x \to \infty} F_X(x) = 1.$$
Conversely, every function with these properties is the CDF of a probability measure on $\mathbb{R}$.
The CDF determines the law. Point masses appear as jumps:
$$P(X = x) = F_X(x) - F_X(x^-).$$
If $F$ is differentiable with derivative $f$ and absolutely continuous, then $f$ is a density. Differentiability alone is not enough globally; absolute continuity is the correct condition.
7.10 Quantile functions
For a CDF $F$, a quantile function may be defined by
$$F^{-1}(u) = \inf\{x \in \mathbb{R} : F(x) \ge u\}, \qquad 0 < u < 1.$$
If $U \sim \mathrm{Uniform}(0,1)$, then $F^{-1}(U)$ has CDF $F$. This is inverse-transform sampling.
Quantiles are law-level objects. They support simulation, stochastic ordering, coupling, and distributional construction. In general, $F^{-1}(F(x))$ need not equal $x$ exactly when $F$ has jumps or flat regions, but the pushforward law is still correct. The quantile construction gives a canonical coupling of many distributions using a shared uniform variable.
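The generalized inverse can be sketched for a law with atoms, where the CDF has jumps and no ordinary inverse exists. The three-point law below and the names `values`, `masses`, and `quantile` are assumptions chosen for illustration; the sampler is only a sketch of inverse-transform sampling, not a production routine.

```python
import random
from bisect import bisect_left
from collections import Counter
from fractions import Fraction
from itertools import accumulate

# Hypothetical discrete law: P(X=0)=1/2, P(X=1)=3/10, P(X=2)=1/5.
values = [0, 1, 2]
masses = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
cdf = list(accumulate(masses))  # F at the atoms: 1/2, 4/5, 1


def quantile(u):
    """Generalized inverse: the smallest x with F(x) >= u, for 0 < u <= 1."""
    return values[bisect_left(cdf, u)]


# Inverse-transform sampling: push a uniform variable through the quantile map.
rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20_000
counts = Counter(quantile(rng.random()) for _ in range(n))
freqs = {x: counts[x] / n for x in values}
print(freqs)
```

The empirical frequencies should land close to the masses, which is the statement that the pushforward of the uniform law under the quantile map is the target law.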
7.11 Equality almost surely
Random variables $X$ and $Y$ on the same probability space are equal almost surely if
$$P(X = Y) = 1.$$
They may differ on a null set. In measure-theoretic probability, many objects are defined only up to almost sure equality. For example, elements of $L^p$ are equivalence classes of random variables modulo a.s. equality.
Almost sure equality requires a common sample space. If $X$ and $Y$ live on different spaces, the expression $P(X = Y)$ is meaningless until a coupling is supplied. Equality almost surely is therefore stronger than equality in distribution and more carrier-dependent.
7.12 Equality in distribution
Random variables $X$ and $Y$, possibly on different probability spaces, are equal in distribution if
$$P_X = P_Y, \quad \text{that is,} \quad P(X \in B) = P(Y \in B) \ \text{for every measurable } B.$$
For real variables, this is equivalent to
$$F_X(x) = F_Y(x)$$
for every $x \in \mathbb{R}$ or at all continuity points, depending on context.
Equality in distribution permits comparison without a joint carrier. It is central to weak convergence and limit theorems. But it says nothing about pointwise equality, dependence, correlation, or joint relations. Exporting $X \overset{d}{=} Y$ as $X = Y$ is a category error unless a coupling with equality is constructed.
7.13 Joint distributions
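For random variables $X$ and $Y$ with values in $S_1$ and $S_2$, the joint distribution is the law of $(X, Y)$:
$$P_{(X,Y)}(C) = P((X, Y) \in C).$$
It determines the marginals:
$$P_X(A) = P_{(X,Y)}(A \times S_2), \qquad P_Y(B) = P_{(X,Y)}(S_1 \times B).$$
The marginals do not determine the joint distribution. Dependence lives in the gap between marginals and joint law. Coupling theory studies all possible joint laws with specified marginals. Independence is the special joint law $P_{(X,Y)} = P_X \otimes P_Y$.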
7.14 Marginals
Marginals are projections of joint laws. If $\pi$ is a probability measure on $S_1 \times S_2$, its marginals are
$$\pi_1(A) = \pi(A \times S_2), \qquad \pi_2(B) = \pi(S_1 \times B).$$
For random variables, these are the laws of each coordinate.
Marginalization loses dependence information. Many different couplings share the same marginals: independent coupling, perfectly correlated coupling, antimonotone coupling, optimal transport coupling, and others. Any inference from marginal laws to joint behavior requires an additional coupling certificate.
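A small sketch shows three different joint laws that share the same marginals. The two-point state space and the names `independent`, `comonotone`, `antimonotone`, `marginals`, and `prob_equal` are illustrative assumptions; the joint quantity $P(X = Y)$ differs across the couplings even though the marginals coincide.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Three joint laws on {0,1} x {0,1}, stored as dict: (x, y) -> mass.
independent = {(x, y): quarter for x in (0, 1) for y in (0, 1)}
comonotone = {(0, 0): half, (1, 1): half}    # Y = X
antimonotone = {(0, 1): half, (1, 0): half}  # Y = 1 - X


def marginals(joint):
    """Project a joint law onto each coordinate."""
    first, second = {}, {}
    for (x, y), p in joint.items():
        first[x] = first.get(x, 0) + p
        second[y] = second.get(y, 0) + p
    return first, second


def prob_equal(joint):
    """P(X = Y): a joint quantity that the marginals alone cannot determine."""
    return sum(p for (x, y), p in joint.items() if x == y)


for name, joint in [("independent", independent),
                    ("comonotone", comonotone),
                    ("antimonotone", antimonotone)]:
    print(name, marginals(joint), prob_equal(joint))
```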
7.15 Transformations of random variables
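If $Y = g(X)$, then the law of $Y$ is the pushforward
$$P_Y = P_X \circ g^{-1}.$$
For discrete $X$,
$$P(Y = y) = \sum_{x :\, g(x) = y} P(X = x).$$
For continuous $X$ with density $f_X$ and smooth invertible $g$,
$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|.$$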
In higher dimensions, the Jacobian determinant appears. But the pushforward formula is more fundamental than density formulas. It works even when no density exists. The density formula is a special coordinate representation of measure transport.
7.16 Measurability traps
Common measurability traps include defining a random variable as a supremum $\sup_{t \in T} X_t$ over an uncountable family, projecting a measurable subset of a product space, or assuming every subset of a continuous space is measurable. Supremum over countably many measurable functions is measurable; supremum over uncountably many requires additional structure such as separability or joint measurability.
Another trap is confusing pointwise-defined modifications with measurable versions. A function equal almost everywhere to a measurable function need not be measurable unless the measure space is complete or the modification is controlled. In stochastic processes, path properties often require choosing versions with measurable or regular sample paths. Measurability is the legality gate for probability.
Chapter 8 — Expectation as Lebesgue Integration
8.1 Simple-function integration
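For a nonnegative simple random variable
$$X = \sum_{i=1}^{n} a_i \mathbf{1}_{A_i}, \qquad a_i \ge 0,\ A_i \in \mathcal{F},$$
define
$$E[X] = \sum_{i=1}^{n} a_i P(A_i).$$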
If the representation is refined, the value remains the same. This is the finite weighted-average formula expressed in measure language.
Simple-function integration is the primitive construction of the Lebesgue integral. General nonnegative measurable functions are approximated from below by simple functions. Thus expectation is not introduced by density or Riemann area; it is built by monotone approximation from event probabilities.
8.2 Nonnegative random variables
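For $X \ge 0$, define
$$E[X] = \sup\{E[S] : 0 \le S \le X,\ S \text{ simple}\}.$$
The value may be $+\infty$. This definition is stable under monotone limits and does not require cancellation.
A useful identity for nonnegative $X$ is the tail integral formula:
$$E[X] = \int_0^{\infty} P(X > t)\,dt.$$
For integer-valued nonnegative $X$,
$$E[X] = \sum_{n=1}^{\infty} P(X \ge n).$$
These formulas convert expectation into tail probabilities.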
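The tail-sum identity for integer-valued nonnegative variables can be checked exactly on small finite laws. The helper names `mean` and `tail_sum` and the two example laws are assumptions made for illustration.

```python
from fractions import Fraction


def mean(law):
    """E[X] computed directly as a weighted sum (law: dict value -> mass)."""
    return sum(x * p for x, p in law.items())


def tail_sum(law):
    """Sum of P(X >= n) over n = 1, 2, ..., max value of X."""
    top = max(law)
    return sum(
        sum(p for x, p in law.items() if x >= n)
        for n in range(1, top + 1)
    )


# Hypothetical laws: uniform on {0,...,5}, and a skewed law with a far atom.
uniform = {k: Fraction(1, 6) for k in range(6)}
skewed = {0: Fraction(1, 2), 1: Fraction(1, 4), 10: Fraction(1, 4)}

print(mean(uniform), tail_sum(uniform))
print(mean(skewed), tail_sum(skewed))
```

In the skewed law, the atom at 10 contributes to ten tail probabilities, which is how the tail sum recovers its weight in the mean.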
8.3 Integrable random variables
A real-valued random variable $X$ is integrable if
$$E|X| < \infty.$$
Then
$$E[X] = E[X^+] - E[X^-],$$
where both terms are finite. Integrability prevents the undefined expression $\infty - \infty$.
Integrability is the gate for many operations. Linearity of expectation, conditional expectation in $L^1$, convergence of averages, and martingale theory all require appropriate integrability. A random variable may be finite almost surely but not integrable; heavy-tailed distributions provide standard examples.
8.4 Positive and negative parts
Every real $X$ decomposes as
$$X = X^+ - X^-,$$
where
$$X^+ = \max(X, 0), \qquad X^- = \max(-X, 0).$$
Both $X^+$ and $X^-$ are nonnegative measurable functions when $X$ is measurable.
This decomposition is not cosmetic. Lebesgue integration handles nonnegative functions first; signed integration is defined by subtracting two nonnegative integrals only when the subtraction is meaningful. If both $E[X^+]$ and $E[X^-]$ are infinite, $E[X]$ is undefined.
8.5 Expectation as integral
Expectation is Lebesgue integration against probability:
$$E[X] = \int_{\Omega} X(\omega)\,P(d\omega).$$
If $X$ has law $P_X$, then
$$E[X] = \int_{\mathbb{R}} x\,P_X(dx)$$
whenever the integral is defined. If $X$ has density $f$,
$$E[X] = \int_{\mathbb{R}} x f(x)\,dx.$$
The law-level formula shows expectation depends only on the distribution of $X$, not on the particular sample-space representation. But expectations of functions involving multiple variables depend on the joint law, not only the marginals.
8.6 Linearity of expectation
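If $X, Y$ are integrable and $a, b \in \mathbb{R}$, then
$$E[aX + bY] = aE[X] + bE[Y].$$
If $X, Y \ge 0$, linearity also holds in the extended sense:
$$E[X + Y] = E[X] + E[Y],$$
allowing $+\infty$.
Linearity does not require independence. This remains one of probability’s most effective tools. Independence (or at least uncorrelatedness) is needed for multiplicative identities such as
$$E[XY] = E[X]\,E[Y],$$
not for additive identities.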
8.7 Change of variables / pushforward formula
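If $X$ has law $P_X$, then for measurable $g$,
$$E[g(X)] = \int_{\Omega} g(X(\omega))\,P(d\omega) = \int_{S} g(x)\,P_X(dx),$$
whenever either side is defined.
This is the abstract change-of-variables formula. It states that integrating a function of $X$ over the source space equals integrating that function over the distribution of $X$.
Classical density transformations are special cases. If $Y = g(X)$, then
$$E[h(Y)] = \int h(y)\,P_Y(dy) = \int h(g(x))\,P_X(dx).$$
The pushforward law contains the transformed probabilities.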
8.8 Expectation under distribution law
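For a real random variable $X$ with distribution function $F$, expectation can be written as a Lebesgue–Stieltjes integral:
$$E[X] = \int_{-\infty}^{\infty} x\,dF(x)$$
when integrable. If $X$ is discrete,
$$E[X] = \sum_i x_i\,P(X = x_i).$$
If $X$ has density $f$,
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$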
These are not different concepts of expectation; they are different representations of the same law-level integral. The correct representation depends on the measure type. For mixed or singular distributions, forcing a density or a mass function loses information.
8.9 Infinite expectations
A nonnegative random variable may have infinite expectation:
$$E[X] = \infty.$$
For example, a Pareto-type tail with $P(X > t) = 1/t$ for $t \ge 1$ has divergent expectation since
$$\int_1^{\infty} \frac{dt}{t}$$
diverges logarithmically. Infinite expectation is a legitimate value for nonnegative $X$.
For signed variables, infinite positive and negative parts cannot be subtracted. A Cauchy random variable has no expectation in the Lebesgue sense, even though symmetric principal value calculations may give zero. Principal value is not expectation; it is a different limiting operation.
8.10 Integrability conditions
Integrability is often certified by tail bounds, domination, or moment estimates. If $|X| \le Y$ and $Y$ is integrable, then $X$ is integrable. If $P(|X| > t) \le C t^{-\alpha}$ for large $t$, then
$$E|X| < \infty$$
when $\alpha > 1$, by the tail integral formula.
For $p > 0$,
$$E|X|^p = \int_0^{\infty} p\,t^{p-1} P(|X| > t)\,dt.$$
This formula gives moment criteria from tail decay. Moment assumptions in limit theorems are therefore tail assumptions in disguised integral form.
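The moment tail formula follows in two steps from Tonelli's theorem. The derivation below is a sketch assuming $p > 0$; the symbols $C$, $\alpha$, and $t_0$ in the closing remark denote a generic polynomial tail bound and are introduced here only for illustration.

```latex
% Step 1: write |X|^p as the integral of its derivative, pointwise in omega.
|X|^p = \int_0^{|X|} p\,t^{p-1}\,dt
      = \int_0^{\infty} p\,t^{p-1}\,\mathbf{1}_{\{|X| > t\}}\,dt .

% Step 2: take expectations and swap E with the dt-integral.
% The integrand is nonnegative, so Tonelli licenses the interchange.
E|X|^p = E\!\int_0^{\infty} p\,t^{p-1}\,\mathbf{1}_{\{|X| > t\}}\,dt
       = \int_0^{\infty} p\,t^{p-1}\,P(|X| > t)\,dt .

% Consequence: if P(|X| > t) <= C t^{-\alpha} for t >= t_0, then
\int_{t_0}^{\infty} p\,t^{p-1}\,P(|X| > t)\,dt
   \le C p \int_{t_0}^{\infty} t^{\,p-1-\alpha}\,dt < \infty
   \quad\text{whenever } p < \alpha .
```

The part of the integral over $[0, t_0]$ is at most $t_0^p$, so a polynomial tail of order $\alpha$ certifies finiteness of every moment of order $p < \alpha$.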
8.11 Expectation versus typical value
Expectation is an average, not necessarily a typical value. Heavy-tailed variables may have means dominated by rare extreme events. A variable may have expectation far from its median or mode. In skewed distributions, $E[X]$, median, and most likely value can be very different.
This distinction matters in risk, algorithms, and probabilistic method arguments. Expected runtime may be finite while typical runtime is much smaller, or median performance may be good while expectation is ruined by rare catastrophes. Concentration inequalities are needed when one wants typical behavior, not just average behavior.
8.12 Expectation under model extension
If a probability space is extended by adding auxiliary randomness, old random variables lift by composition with projection. If $\pi : \tilde{\Omega} \to \Omega$ preserves probability and $\tilde{X} = X \circ \pi$, then
$$\tilde{E}[\tilde{X}] = E[X].$$
Thus expectation is invariant under probability-preserving model extension.
This invariance justifies changing sample spaces for convenience. One may add independent uniforms, construct couplings, or realize random variables on canonical spaces. The invariant object is the law and the relevant joint structure, not the raw sample-space representation.
Chapter 9 — Core Convergence Theorems
9.1 Monotone convergence theorem
If $0 \le X_n \uparrow X$ pointwise, then
$$E[X_n] \uparrow E[X].$$
This is the monotone convergence theorem. It is one of the foundational results of Lebesgue integration and depends on countable additivity.
The theorem licenses passing limits through expectations for increasing nonnegative sequences without domination. It is used to define integrals, prove Tonelli’s theorem, derive tail integral formulas, and handle stopping times by approximation. The nonnegative monotone structure is the certificate; without monotonicity, the conclusion may fail.
9.2 Fatou’s lemma
For nonnegative random variables $X_n$,
$$E\big[\liminf_{n \to \infty} X_n\big] \le \liminf_{n \to \infty} E[X_n].$$
Fatou’s lemma gives a lower-semicontinuity principle for expectations. It is weaker than full convergence exchange but requires minimal hypotheses.
Fatou is often the correct tool when limits are available but domination is absent. It prevents mass from appearing in the limit without being accounted for, but it allows mass to escape. In probability, this “escape” corresponds to lack of uniform integrability or tightness.
9.3 Dominated convergence theorem
If $X_n \to X$ almost surely and $|X_n| \le Y$ for an integrable $Y$, then
$$E[X_n] \to E[X].$$
The integrable dominating variable prevents mass from escaping into rare large spikes.
Dominated convergence is one of the main liftback theorems from pointwise convergence to expectation convergence. The domination hypothesis is load-bearing. Pointwise convergence alone does not imply convergence of expectations. A common counterexample is $X_n = n\,\mathbf{1}_{(0,1/n)}$ on $[0,1]$ with Lebesgue measure: $X_n \to 0$ almost everywhere, but $E[X_n] = 1$.
9.4 Bounded convergence theorem
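If $X_n \to X$ almost surely and $|X_n| \le M$ for a constant $M$, then
$$E[X_n] \to E[X].$$
This is a special case of dominated convergence with $Y = M$.
Bounded convergence is often used with probabilities because indicators are bounded. If $\mathbf{1}_{A_n} \to \mathbf{1}_A$ almost surely, then
$$P(A_n) \to P(A).$$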
However, indicator convergence must be verified; set convergence can mean different things depending on limsup and liminf behavior.
9.5 Uniform integrability
A family $\{X_i\}_{i \in I}$ is uniformly integrable if
$$\lim_{K \to \infty} \sup_{i \in I} E\big[|X_i|\,\mathbf{1}_{\{|X_i| > K\}}\big] = 0.$$
Uniform integrability prevents mass from escaping to infinity uniformly over the family.
It is the correct bridge from convergence in probability to convergence in $L^1$. If $X_n \to X$ in probability and $\{X_n\}$ is uniformly integrable, then
$$E|X_n - X| \to 0$$
under standard hypotheses. Without uniform integrability, convergence in probability does not imply convergence of expectations.
9.6 Vitali convergence theorem
Vitali’s theorem states that $X_n \to X$ in $L^1$ if and only if $X_n \to X$ in probability and the family $\{X_n\}$ is uniformly integrable, with appropriate inclusion of $X$. More generally, $L^p$-versions use uniform integrability of $|X_n|^p$.
This theorem identifies the missing payload in many false expectation arguments. Convergence in probability controls typical deviations; uniform integrability controls rare large deviations. Both are needed for convergence of means. The theorem is therefore a precise decomposition of $L^1$ convergence into typical behavior plus tail control.
9.7 Interchanging limits and expectations
The formal question is when
$$\lim_{n \to \infty} E[X_n] = E\big[\lim_{n \to \infty} X_n\big].$$
Monotone convergence, dominated convergence, bounded convergence, and Vitali’s theorem are sufficient frameworks. Fatou gives one-sided control for nonnegative sequences.
The error pattern is to treat expectation as a finite sum after limits have entered. Infinite probability spaces allow mass to move, concentrate, vanish, or escape. Every interchange of limit and expectation requires a certificate: monotonicity, domination, boundedness, uniform integrability, or another convergence theorem.
9.8 Failure modes for limit-expectation exchange
A standard failure is rare spikes. On $[0,1]$ with Lebesgue measure, define
$$X_n = n\,\mathbf{1}_{(0,1/n)}.$$
Then $X_n \to 0$ almost everywhere, but $E[X_n] = 1$. The pointwise limit misses mass that escapes into narrower and taller regions.
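The rare-spike failure can be traced with exact arithmetic. This is a sketch assuming the standard spike sequence that takes the value $n$ on the interval $(0, 1/n)$ of $[0,1]$ and zero elsewhere; the function names `spike` and `expectation` are hypothetical.

```python
from fractions import Fraction


def spike(n, omega):
    """Value of the n-th spike at the sample point omega in [0, 1]."""
    return n if 0 < omega < Fraction(1, n) else 0


def expectation(n):
    """E[X_n] by the simple-function formula: height n times Lebesgue length 1/n."""
    return n * Fraction(1, n)


# Fix one sample point and follow the sequence: it is eventually zero.
omega = Fraction(1, 10)
path = [spike(n, omega) for n in (1, 5, 10, 11, 100)]
print(path)

# Yet every expectation equals 1, so lim E[X_n] = 1 while E[lim X_n] = 0.
means = [expectation(n) for n in (1, 5, 10, 11, 100)]
print(means)
```

No integrable dominating variable exists here, which is why dominated convergence does not apply.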
Another failure is lack of integrability in the limit. Variables may converge pointwise to a nonintegrable random variable while expectations fail to converge finitely. Oscillation can also break convergence if positive and negative parts are not controlled. The general counterkernel is missing tail control.
9.9 Tightness versus integrability
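Tightness controls where probability mass lies. A family of probability measures $\{\mu_i\}_{i \in I}$ on a metric space is tight if for every $\varepsilon > 0$, there exists compact $K$ such that
$$\mu_i(K) \ge 1 - \varepsilon \quad \text{for all } i \in I.$$
Integrability controls the magnitude of random variables:
$$E|X| < \infty.$$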
Tightness is about mass not escaping spatially; uniform integrability is about weighted mass not escaping in expectation. A family can be tight without uniformly integrable first moments. For convergence of laws, tightness is central; for convergence of expectations, uniform integrability is the stronger bridge.
9.10 Expectation convergence counterexamples
Counterexamples are not peripheral; they define the gates. Let $X_n = n$ with probability $1/n$ and $X_n = 0$ otherwise. Then $X_n \to 0$ in probability, but $E[X_n] = 1$. Thus convergence in probability does not imply expectation convergence.
Let $X_n = n^2$ with probability $1/n$ and $X_n = 0$ otherwise. Then $X_n \to 0$ in probability, but $E[X_n] = n \to \infty$. This shows that even vanishing probabilities of large deviations are insufficient; the magnitude of rare deviations matters. Uniform integrability is the exact missing condition.
Chapter 10 — Moments and Inequalities
10.1 Moments
The $k$-th raw moment of $X$ is
$$m_k = E[X^k]$$
when it exists. The $k$-th absolute moment is
$$E|X|^k.$$
Moments summarize distributional information. The first moment is the mean; the second raw moment contributes to variance; higher moments measure tail weight and shape.
Moment existence is not automatic. Heavy-tailed distributions may have some finite moments and some infinite moments. If $E|X|^q < \infty$, then $E|X|^p < \infty$ for $0 < p \le q$ on probability spaces. Higher finite moments imply lower finite moments, but not conversely.
10.2 Absolute moments
Absolute moments avoid cancellation. The condition
$$E|X|^p < \infty$$
defines $L^p$. For signed variables, $E[X]$ may exist by cancellation in some improper sense while $E|X| < \infty$ fails. Probability theory generally uses absolute integrability to certify legal operations.
The tail formula is
$$E|X|^p = \int_0^{\infty} p\,t^{p-1} P(|X| > t)\,dt.$$
This makes clear that $p$-th moments are tail decay conditions. They are not just algebraic averages; they control rare large values.
10.3 Variance and standard deviation
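Variance is the second central moment:
$$\operatorname{Var}(X) = E\big[(X - E[X])^2\big].$$
The standard deviation is
$$\sigma_X = \sqrt{\operatorname{Var}(X)}.$$
Variance exists when $E[X^2] < \infty$. It measures quadratic spread around the mean.
The identity
$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$
is computationally useful. For sums,
$$\operatorname{Var}\Big(\sum_{i=1}^{n} X_i\Big) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2 \sum_{i < j} \operatorname{Cov}(X_i, X_j).$$
If the variables are pairwise uncorrelated, the covariance terms vanish.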
10.4 Covariance and correlation
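Covariance is
$$\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y].$$
It measures linear co-fluctuation. Correlation normalizes covariance:
$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}.$$
By Cauchy–Schwarz, $|\rho(X, Y)| \le 1$.
Covariance is not a general dependence measure. Variables can be dependent with zero covariance. For example, if $X$ is symmetric about zero with finite third moment and $Y = X^2$, then $\operatorname{Cov}(X, Y) = E[X^3] = 0$ under symmetry, but $Y$ is determined by $X$. Covariance detects linear dependence, not arbitrary dependence.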
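The zero-covariance example can be verified exactly on a three-point symmetric law. The choice of a uniform law on $\{-1, 0, 1\}$ and the helper name `expect` are assumptions for illustration only.

```python
from fractions import Fraction

third = Fraction(1, 3)
law_x = {-1: third, 0: third, 1: third}  # symmetric about zero


def expect(f):
    """E[f(X)] under the discrete law of X."""
    return sum(f(x) * p for x, p in law_x.items())


# Y = X^2, so Cov(X, Y) = E[X^3] - E[X] E[X^2].
cov = expect(lambda x: x**3) - expect(lambda x: x) * expect(lambda x: x**2)

# Dependence check: compare P(X = 0, Y = 0) with P(X = 0) P(Y = 0).
p_joint = sum(p for x, p in law_x.items() if x == 0 and x**2 == 0)
p_x0 = law_x[0]
p_y0 = sum(p for x, p in law_x.items() if x**2 == 0)

print(cov, p_joint, p_x0 * p_y0)
```

The covariance vanishes, yet the joint probability does not factor, so the pair is uncorrelated but not independent.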
10.5 Jensen’s inequality
If $\varphi$ is convex and $X$ is integrable with $\varphi(X)$ integrable or nonnegative, then
$$\varphi(E[X]) \le E[\varphi(X)].$$
For concave $\varphi$, the inequality reverses. Jensen’s inequality says expectation commutes with convex functions only in an inequality direction.
Examples include
$$(E[X])^2 \le E[X^2], \qquad e^{E[X]} \le E[e^{X}],$$
and for positive $X$,
$$E[\log X] \le \log E[X],$$
because $\log$ is concave. Jensen is a convexity certificate inside probability: averaging before applying a convex function gives a value no larger than applying the convex function before averaging.
10.6 Markov’s inequality
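For $X \ge 0$ and $a > 0$,
$$P(X \ge a) \le \frac{E[X]}{a}.$$
Proof: $a\,\mathbf{1}_{\{X \ge a\}} \le X$, so taking expectations gives $a\,P(X \ge a) \le E[X]$.
Markov’s inequality is crude but universal. Applied to $|X|^p$, it gives
$$P(|X| \ge a) \le \frac{E|X|^p}{a^p}.$$
This converts moment bounds into tail bounds. It is often the first inequality in probabilistic estimates and the base of many concentration arguments.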
10.7 Chebyshev’s inequality
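If $X$ has finite variance, then
$$P(|X - E[X]| \ge a) \le \frac{\operatorname{Var}(X)}{a^2}, \qquad a > 0.$$
This is Markov’s inequality applied to $(X - E[X])^2$.
Chebyshev gives a general concentration bound using only variance. For averages $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ of independent variables with common mean $\mu$ and common variance $\sigma^2$,
$$\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n},$$
so
$$P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n \varepsilon^2} \to 0.$$
This proves a weak law of large numbers under finite variance.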
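A short sketch compares the Chebyshev bound with the exact deviation probability for the average of fair coin flips. The setting (fair Bernoulli variables, so the common variance is $1/4$) and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb


def exact_deviation_prob(n, eps):
    """Exact P(|mean of n fair coin flips - 1/2| >= eps) from binomial counts."""
    half = Fraction(1, 2)
    favorable = sum(
        comb(n, k) for k in range(n + 1)
        if abs(Fraction(k, n) - half) >= eps
    )
    return Fraction(favorable, 2**n)


def chebyshev_bound(n, eps):
    """sigma^2 / (n eps^2) with sigma^2 = 1/4 for a fair coin."""
    variance = Fraction(1, 4)
    return variance / (n * eps**2)


n, eps = 100, Fraction(1, 10)
exact = exact_deviation_prob(n, eps)
bound = chebyshev_bound(n, eps)
print(float(exact), float(bound))
```

The bound is valid but loose; it decays like $1/n$, whereas the exact probability decays much faster, which is what exponential inequalities capture.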
10.8 Hölder’s inequality
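If $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$, then
$$E|XY| \le \big(E|X|^p\big)^{1/p} \big(E|Y|^q\big)^{1/q}.$$
More generally, products of several variables are bounded by corresponding norms whose reciprocal exponents sum to one.
Hölder is the core multiplicative inequality of $L^p$ spaces. It proves duality bounds, integrability of products, and moment interpolation. Cauchy–Schwarz is the case $p = q = 2$:
$$E|XY| \le \sqrt{E[X^2]\,E[Y^2]}.$$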
10.9 Minkowski’s inequality
For $p \ge 1$,
$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p,$$
where
$$\|X\|_p = \big(E|X|^p\big)^{1/p}.$$
This is the triangle inequality in $L^p$.
Minkowski turns $L^p$ spaces into normed spaces. It allows probability to use functional analysis: completeness, projections, duality, and compactness methods. For $0 < p < 1$, $\|\cdot\|_p$ is not a norm and Minkowski fails; this changes the geometry of the space.
10.10 Lyapunov inequality
On a probability space, if $0 < p \le q$, then
$$\|X\|_p \le \|X\|_q.$$
Thus $L^q \subseteq L^p$. Higher moments control lower moments.
Lyapunov’s inequality is a monotonicity principle for moment norms. It follows from Jensen or Hölder. It is frequently used to downgrade assumptions: if a theorem gives a fourth-moment bound, then second and first moments are automatically finite and bounded.
10.11 Paley–Zygmund inequality
For $X \ge 0$ with finite second moment and $0 \le \theta \le 1$,
$$P(X > \theta\,E[X]) \ge (1 - \theta)^2\,\frac{(E[X])^2}{E[X^2]}.$$
This gives a lower bound on the probability that $X$ is not too small.
Paley–Zygmund is a second-moment existence tool. If $E[X^2]$ is comparable to $(E[X])^2$, then $X$ is positive with nontrivial probability. It is central in probabilistic combinatorics, random graphs, and branching processes, where first moment alone may not prove existence with positive probability.
10.12 Moment generating functions
The moment generating function of $X$ is
$$M_X(t) = E[e^{tX}]$$
where finite. Derivatives at zero, when justified, give moments:
$$M_X^{(k)}(0) = E[X^k].$$
MGFs transform sums of independent variables into products:
$$M_{X+Y}(t) = M_X(t)\,M_Y(t)$$
when $X, Y$ are independent.
MGFs support Chernoff bounds:
$$P(X \ge a) \le e^{-ta} M_X(t), \qquad t > 0.$$
Optimizing over $t$ gives exponential tail estimates. The existence domain of $M_X$ is load-bearing; heavy-tailed variables may have infinite MGF for all $t > 0$.
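The Chernoff recipe can be sketched for a sum of fair coin flips, where the MGF is available in closed form. The binomial setting, the function names, and the closed-form optimizer (obtained by setting the derivative of $-ta + n \log\frac{1 + e^t}{2}$ to zero) are assumptions of this example rather than part of the text.

```python
from math import comb, exp, log


def exact_upper_tail(n, a):
    """Exact P(S >= a) for S ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(a, n + 1)) / 2**n


def mgf(n, t):
    """MGF of Binomial(n, 1/2): a product of n identical one-flip MGFs."""
    return ((1 + exp(t)) / 2) ** n


def chernoff_bound(n, a):
    """exp(-t a) M(t) at the optimizing t; assumes n/2 < a < n."""
    t_star = log(a / (n - a))  # solves n e^t / (1 + e^t) = a
    return exp(-t_star * a) * mgf(n, t_star)


n, a = 100, 70
exact = exact_upper_tail(n, a)
chernoff = chernoff_bound(n, a)
markov = (n / 2) / a  # first-moment bound E[S] / a, for comparison
print(exact, chernoff, markov)
```

Any $t > 0$ yields a valid bound; the optimized choice only tightens it. The comparison with the first-moment bound shows how much the exponential moment buys.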
10.13 Characteristic functions
The characteristic function of $X$ is
$$\varphi_X(t) = E[e^{itX}].$$
It always exists because $|e^{itX}| = 1$. Characteristic functions determine laws and convert convolution into multiplication:
$$\varphi_{X+Y}(t) = \varphi_X(t)\,\varphi_Y(t)$$
for independent $X, Y$.
Characteristic functions are central to the central limit theorem. If
$$\varphi_{X_n}(t) \to \varphi(t)$$
pointwise and $\varphi$ is continuous at zero, then $\varphi$ is the characteristic function of a probability law and $X_n$ converges in distribution to that law. This is Lévy’s continuity theorem.
10.14 Cumulants
The cumulant generating function is
$$K_X(t) = \log M_X(t) = \log E[e^{tX}]$$
when the MGF exists near zero. The $n$-th cumulant is
$$\kappa_n = K_X^{(n)}(0).$$
The first cumulant is the mean; the second is variance; higher cumulants encode skewness, kurtosis, and non-Gaussian structure.
Cumulants add under independence:
$$K_{X+Y}(t) = K_X(t) + K_Y(t),$$
so
$$\kappa_n(X + Y) = \kappa_n(X) + \kappa_n(Y).$$
Gaussian variables have cumulants of order $n \ge 3$ equal to zero. This makes cumulants useful in normal approximation, Edgeworth expansions, statistical mechanics, and random matrix theory.
10.15 Moment determinacy and indeterminacy
A distribution is moment-determinate if its sequence of moments uniquely determines the law. Compactly supported distributions are moment-determinate. A sufficient condition on $\mathbb{R}$ is Carleman’s condition:
$$\sum_{n=1}^{\infty} m_{2n}^{-1/(2n)} = \infty,$$
where $m_{2n} = E[X^{2n}]$.
Moment indeterminacy means two different distributions can share all moments. The lognormal distribution is a standard example of a law not determined by its moments. Therefore “all moments match” is not always a distribution certificate unless determinacy is proved. Characteristic functions avoid this issue because they always determine the law.