Advanced Topics in Probabilistic Machine Learning
Contents
Preface
1 Introduction
I Fundamentals
2 Probability
2.1 Introduction
2.1.1 Basic rules of probability
2.1.2 Exchangeability and de Finetti’s theorem
2.2 Some common probability distributions
2.2.1 Discrete distributions
2.2.2 Continuous distributions on R
2.2.3 Continuous distributions on R+
2.2.4 Continuous distributions on [0, 1]
2.2.5 Multivariate continuous distributions
2.3 Gaussian joint distributions
2.3.1 The multivariate normal
2.3.2 Linear Gaussian systems
2.3.3 A general calculus for linear Gaussian systems
2.4 The exponential family
2.4.1 Definition
2.4.2 Examples
2.4.3 Log partition function is cumulant generating function
2.4.4 Canonical (natural) vs mean (moment) parameters
2.4.5 MLE for the exponential family
2.4.6 Exponential dispersion family
2.4.7 Maximum entropy derivation of the exponential family
2.5 Transformations of random variables
2.5.1 Invertible transformations (bijections)
2.5.2 Monte Carlo approximation
2.5.3 Probability integral transform
2.6 Markov chains
2.6.1 Parameterization
2.6.2 Application: language modeling
2.6.3 Parameter estimation
2.6.4 Stationary distribution of a Markov chain
2.7 Divergence measures between probability distributions
2.7.1 f-divergence
2.7.2 Integral probability metrics
2.7.3 Maximum mean discrepancy (MMD)
2.7.4 Total variation distance
2.7.5 Density ratio estimation using binary classifiers
3 Statistics
3.1 Introduction
3.2 Bayesian statistics
3.2.1 Tossing coins
3.2.2 Modeling more complex data
3.2.3 Selecting the prior
3.2.4 Computational issues
3.3 Frequentist statistics
3.3.1 Sampling distributions
3.3.2 Bootstrap approximation of the sampling distribution
3.3.3 Asymptotic normality of the sampling distribution of the MLE
3.3.4 Fisher information matrix
3.3.5 Counterintuitive properties of frequentist statistics
3.3.6 Why isn’t everyone a Bayesian?
3.4 Conjugate priors
3.4.1 The binomial model
3.4.2 The multinomial model
3.4.3 The univariate Gaussian model
3.4.4 The multivariate Gaussian model
3.4.5 The exponential family model
3.4.6 Beyond conjugate priors
3.5 Noninformative priors
3.5.1 Maximum entropy priors
3.5.2 Jeffreys priors
3.5.3 Invariant priors
3.5.4 Reference priors
3.6 Hierarchical priors
3.6.1 A hierarchical binomial model
3.6.2 A hierarchical Gaussian model
3.6.3 Hierarchical conditional models
3.7 Empirical Bayes
3.7.1 EB for the hierarchical binomial model
3.7.2 EB for the hierarchical Gaussian model
3.7.3 EB for Markov models (n-gram smoothing)
3.7.4 EB for non-conjugate models
3.8 Model selection
3.8.1 Bayesian model selection
3.8.2 Bayes model averaging
3.8.3 Estimating the marginal likelihood
3.8.4 Connection between cross validation and marginal likelihood
3.8.5 Conditional marginal likelihood
3.8.6 Bayesian leave-one-out (LOO) estimate
3.8.7 Information criteria
3.9 Model checking
3.9.1 Posterior predictive checks
3.9.2 Bayesian p-values
3.10 Hypothesis testing
3.10.1 Frequentist approach
3.10.2 Bayesian approach
3.10.3 Common statistical tests correspond to inference in linear models
3.11 Missing data
4 Graphical models
4.1 Introduction
4.2 Directed graphical models (Bayes nets)
4.2.1 Representing the joint distribution
4.2.2 Examples
4.2.3 Gaussian Bayes nets
4.2.4 Conditional independence properties
4.2.5 Generation (sampling)
4.2.6 Inference
4.2.7 Learning
4.2.8 Plate notation
4.3 Undirected graphical models (Markov random fields)
4.3.1 Representing the joint distribution
4.3.2 Fully visible MRFs (Ising, Potts, Hopfield, etc.)
4.3.3 MRFs with latent variables (Boltzmann machines, etc.)
4.3.4 Maximum entropy models
4.3.5 Gaussian MRFs
4.3.6 Conditional independence properties
4.3.7 Generation (sampling)
4.3.8 Inference
4.3.9 Learning
4.4 Conditional random fields (CRFs)
4.4.1 1d CRFs
4.4.2 2d CRFs
4.4.3 Parameter estimation
4.4.4 Other approaches to structured prediction
4.5 Comparing directed and undirected PGMs
4.5.1 CI properties
4.5.2 Converting between a directed and undirected model
4.5.3 Conditional directed vs undirected PGMs and the label bias problem
4.5.4 Combining directed and undirected graphs
4.5.5 Comparing directed and undirected Gaussian PGMs
4.6 PGM extensions
4.6.1 Factor graphs
4.6.2 Probabilistic circuits
4.6.3 Directed relational PGMs
4.6.4 Undirected relational PGMs
4.6.5 Open-universe probability models
4.6.6 Programs as probability models
4.7 Structural causal models
4.7.1 Example: causal impact of education on wealth
4.7.2 Structural equation models
4.7.3 Do operator and augmented DAGs
4.7.4 Counterfactuals
5 Information theory
5.1 KL divergence
5.1.1 Desiderata
5.1.2 The KL divergence uniquely satisfies the desiderata
5.1.3 Thinking about KL
5.1.4 Minimizing KL
5.1.5 Properties of KL
5.1.6 KL divergence and MLE
5.1.7 KL divergence and Bayesian inference
5.1.8 KL divergence and exponential families
5.1.9 Approximating KL divergence using the Fisher information matrix
5.1.10 Bregman divergence
5.2 Entropy
5.2.1 Definition
5.2.2 Differential entropy for continuous random variables
5.2.3 Typical sets
5.2.4 Cross entropy and perplexity
5.3 Mutual information
5.3.1 Definition
5.3.2 Interpretation
5.3.3 Data processing inequality
5.3.4 Sufficient statistics
5.3.5 Multivariate mutual information
5.3.6 Variational bounds on mutual information
5.3.7 Relevance networks
5.4 Data compression (source coding)
5.4.1 Lossless compression
5.4.2 Lossy compression and the rate-distortion tradeoff
5.4.3 Bits back coding
5.5 Error-correcting codes (channel coding)
5.6 The information bottleneck
5.6.1 Vanilla IB
5.6.2 Variational IB
5.6.3 Conditional entropy bottleneck
6 Optimization
6.1 Introduction
6.2 Automatic differentiation
6.2.1 Differentiation in functional form
6.2.2 Differentiating chains, circuits, and programs
6.3 Stochastic optimization
6.3.1 Stochastic gradient descent
6.3.2 SGD for optimizing a finite-sum objective
6.3.3 SGD for optimizing the parameters of a distribution
6.3.4 Score function estimator (likelihood ratio trick)
6.3.5 Reparameterization trick
6.3.6 Gumbel softmax trick
6.3.7 Stochastic computation graphs
6.3.8 Straight-through estimator
6.4 Natural gradient descent
6.4.1 Defining the natural gradient
6.4.2 Interpretations of NGD
6.4.3 Benefits of NGD
6.4.4 Approximating the natural gradient
6.4.5 Natural gradients for the exponential family
6.5 Bound optimization (MM) algorithms
6.5.1 The general algorithm
6.5.2 Example: logistic regression
6.5.3 The EM algorithm
6.5.4 Example: EM for an MVN with missing data
6.5.5 Example: robust linear regression using Student likelihood
6.5.6 Extensions to EM
6.6 Bayesian optimization
6.6.1 Sequential model-based optimization
6.6.2 Surrogate functions
6.6.3 Acquisition functions
6.6.4 Other issues
6.7 Derivative-free optimization
6.7.1 Local search
6.7.2 Simulated annealing
6.7.3 Evolutionary algorithms
6.7.4 Estimation of distribution (EDA) algorithms
6.7.5 Cross-entropy method
6.7.6 Evolutionary strategies
6.8 Optimal transport
6.8.1 Warm-up: matching optimally two families of points
6.8.2 From optimal matchings to Kantorovich and Monge formulations
6.8.3 Solving optimal transport
6.9 Submodular optimization
6.9.1 Intuition, examples, and background
6.9.2 Submodular basic definitions
6.9.3 Example submodular functions
6.9.4 Submodular optimization
6.9.5 Applications of submodularity in machine learning and AI
6.9.6 Sketching, coresets, distillation, and data subset and feature selection
6.9.7 Combinatorial information functions
6.9.8 Clustering, data partitioning, and parallel machine learning
6.9.9 Active and semi-supervised learning
6.9.10 Probabilistic modeling
6.9.11 Structured norms and loss functions
6.9.12 Conclusions
II Inference
7 Inference algorithms: an overview
7.1 Introduction
7.2 Common inference patterns
7.2.1 Global latents
7.2.2 Local latents
7.2.3 Global and local latents
7.3 Exact inference algorithms
7.4 Approximate inference algorithms
7.4.1 The MAP approximation and its problems
7.4.2 Grid approximation
7.4.3 Laplace (quadratic) approximation
7.4.4 Variational inference
7.4.5 Markov chain Monte Carlo (MCMC)
7.4.6 Sequential Monte Carlo
7.4.7 Challenging posteriors
7.5 Evaluating approximate inference algorithms
8 Gaussian filtering and smoothing
8.1 Introduction
8.1.1 Inferential goals
8.1.2 Bayesian filtering equations
8.1.3 Bayesian smoothing equations
8.1.4 The Gaussian ansatz
8.2 Inference for linear-Gaussian SSMs
8.2.1 Examples
8.2.2 The Kalman filter
8.2.3 The Kalman (RTS) smoother
8.2.4 Information form filtering and smoothing
8.3 Inference based on local linearization
8.3.1 Taylor series expansion
8.3.2 The extended Kalman filter (EKF)
8.3.3 The extended Kalman smoother (EKS)
8.4 Inference based on the unscented transform
8.4.1 The unscented transform
8.4.2 The unscented Kalman filter (UKF)
8.4.3 The unscented Kalman smoother (UKS)
8.5 Other variants of the Kalman filter
8.5.1 General Gaussian filtering
8.5.2 Conditional moment Gaussian filtering
8.5.3 Iterated filters and smoothers
8.5.4 Ensemble Kalman filter
8.5.5 Robust Kalman filters
8.5.6 Dual EKF
8.6 Assumed density filtering
8.6.1 Connection with Gaussian filtering
8.6.2 ADF for SLDS (Gaussian sum filter)
8.6.3 ADF for online logistic regression
8.6.4 ADF for online DNNs
8.7 Other inference methods for SSMs
8.7.1 Grid-based approximations
8.7.2 Expectation propagation
8.7.3 Variational inference
8.7.4 MCMC
8.7.5 Particle filtering
9 Message passing algorithms
9.1 Introduction
9.2 Belief propagation on chains
9.2.1 Hidden Markov Models
9.2.2 The forwards algorithm
9.2.3 The forwards-backwards algorithm
9.2.4 Forwards filtering backwards smoothing
9.2.5 Time and space complexity
9.2.6 The Viterbi algorithm
9.2.7 Forwards filtering backwards sampling
9.3 Belief propagation on trees
9.3.1 Directed vs undirected trees
9.3.2 Sum-product algorithm
9.3.3 Max-product algorithm
9.4 Loopy belief propagation
9.4.1 Loopy BP for pairwise undirected graphs
9.4.2 Loopy BP for factor graphs
9.4.3 Gaussian belief propagation
9.4.4 Convergence
9.4.5 Accuracy
9.4.6 Generalized belief propagation
9.4.7 Convex BP
9.4.8 Application: error correcting codes
9.4.9 Application: affinity propagation
9.4.10 Emulating BP with graph neural nets
9.5 The variable elimination (VE) algorithm
9.5.1 Derivation of the algorithm
9.5.2 Computational complexity of VE
9.5.3 Picking a good elimination order
9.5.4 Computational complexity of exact inference
9.5.5 Drawbacks of VE
9.6 The junction tree algorithm (JTA)
9.7 Inference as optimization
9.7.1 Inference as backpropagation
9.7.2 Perturb and MAP
10 Variational inference
10.1 Introduction
10.1.1 The variational objective
10.1.2 Form of the variational posterior
10.1.3 Parameter estimation using variational EM
10.1.4 Stochastic VI
10.1.5 Amortized VI
10.1.6 Semi-amortized inference
10.2 Gradient-based VI
10.2.1 Reparameterized VI
10.2.2 Automatic differentiation VI
10.2.3 Blackbox variational inference
10.3 Coordinate ascent VI
10.3.1 Derivation of CAVI algorithm
10.3.2 Example: CAVI for the Ising model
10.3.3 Variational Bayes
10.3.4 Example: VB for a univariate Gaussian
10.3.5 Variational Bayes EM
10.3.6 Example: VBEM for a GMM
10.3.7 Variational message passing (VMP)
10.3.8 Autoconj
10.4 More accurate variational posteriors
10.4.1 Structured mean field
10.4.2 Hierarchical (auxiliary variable) posteriors
10.4.3 Normalizing flow posteriors
10.4.4 Implicit posteriors
10.4.5 Combining VI with MCMC inference
10.5 Tighter bounds
10.5.1 Multi-sample ELBO (IWAE bound)
10.5.2 The thermodynamic variational objective (TVO)
10.5.3 Minimizing the evidence upper bound
10.6 Wake-sleep algorithm
10.6.1 Wake phase
10.6.2 Sleep phase
10.6.3 Daydream phase
10.6.4 Summary of algorithm
10.7 Expectation propagation (EP)
10.7.1 Algorithm
10.7.2 Example
10.7.3 EP as generalized ADF
10.7.4 Optimization issues
10.7.5 Power EP and α-divergence
10.7.6 Stochastic EP
11 Monte Carlo methods
11.1 Introduction
11.2 Monte Carlo integration
11.2.1 Example: estimating π by Monte Carlo integration
11.2.2 Accuracy of Monte Carlo integration
11.3 Generating random samples from simple distributions
11.3.1 Sampling using the inverse cdf
11.3.2 Sampling from a Gaussian (Box-Muller method)
11.4 Rejection sampling
11.4.1 Basic idea
11.4.2 Example
11.4.3 Adaptive rejection sampling
11.4.4 Rejection sampling in high dimensions
11.5 Importance sampling
11.5.1 Direct importance sampling
11.5.2 Self-normalized importance sampling
11.5.3 Choosing the proposal
11.5.4 Annealed importance sampling (AIS)
11.6 Controlling Monte Carlo variance
11.6.1 Common random numbers
11.6.2 Rao-Blackwellization
11.6.3 Control variates
11.6.4 Antithetic sampling
11.6.5 Quasi-Monte Carlo (QMC)
12 Markov chain Monte Carlo
12.1 Introduction
12.2 Metropolis-Hastings algorithm
12.2.1 Basic idea
12.2.2 Why MH works
12.2.3 Proposal distributions
12.2.4 Initialization
12.3 Gibbs sampling
12.3.1 Basic idea
12.3.2 Gibbs sampling is a special case of MH
12.3.3 Example: Gibbs sampling for Ising models
12.3.4 Example: Gibbs sampling for Potts models
12.3.5 Example: Gibbs sampling for GMMs
12.3.6 Metropolis within Gibbs
12.3.7 Blocked Gibbs sampling
12.3.8 Collapsed Gibbs sampling
12.4 Auxiliary variable MCMC
12.4.1 Slice sampling
12.4.2 Swendsen-Wang
12.5 Hamiltonian Monte Carlo (HMC)
12.5.1 Hamiltonian mechanics
12.5.2 Integrating Hamilton’s equations
12.5.3 The HMC algorithm
12.5.4 Tuning HMC
12.5.5 Riemann manifold HMC
12.5.6 Langevin Monte Carlo (MALA)
12.5.7 Connection between SGD and Langevin sampling
12.5.8 Applying HMC to constrained parameters
12.5.9 Speeding up HMC
12.6 MCMC convergence
12.6.1 Mixing rates of Markov chains
12.6.2 Practical convergence diagnostics
12.6.3 Effective sample size
12.6.4 Improving speed of convergence
12.6.5 Non-centered parameterizations and Neal’s funnel
12.7 Stochastic gradient MCMC
12.7.1 Stochastic gradient Langevin dynamics (SGLD)
12.7.2 Preconditioning
12.7.3 Reducing the variance of the gradient estimate
12.7.4 SG-HMC
12.7.5 Underdamped Langevin dynamics
12.8 Reversible jump (transdimensional) MCMC
12.8.1 Basic idea
12.8.2 Example
12.8.3 Discussion
12.9 Annealing methods
12.9.1 Simulated annealing
12.9.2 Parallel tempering
13 Sequential Monte Carlo
13.1 Introduction
13.1.1 Problem statement
13.1.2 Particle filtering for state-space models
13.1.3 SMC samplers for static parameter estimation
13.2 Particle filtering
13.2.1 Importance sampling
13.2.2 Sequential importance sampling
13.2.3 Sequential importance sampling with resampling
13.2.4 Resampling methods
13.2.5 Adaptive resampling
13.3 Proposal distributions
13.3.1 Locally optimal proposal
13.3.2 Proposals based on the extended and unscented Kalman filter
13.3.3 Proposals based on the Laplace approximation
13.3.4 Proposals based on SMC (nested SMC)
13.4 Rao-Blackwellized particle filtering (RBPF)
13.4.1 Mixture of Kalman filters
13.4.2 Example: tracking a maneuvering object
13.4.3 Example: FastSLAM
13.5 Extensions of the particle filter
13.6 SMC samplers
13.6.1 Ingredients of an SMC sampler
13.6.2 Likelihood tempering (geometric path)
13.6.3 Data tempering
13.6.4 Sampling rare events and extrema
13.6.5 SMC-ABC and likelihood-free inference
13.6.6 SMC²
13.6.7 Variational filtering SMC
13.6.8 Variational smoothing SMC
III Prediction
14 Predictive models: an overview
14.1 Introduction
14.1.1 Types of model
14.1.2 Model fitting using ERM, MLE, and MAP
14.1.3 Model fitting using Bayes, VI, and generalized Bayes
14.2 Evaluating predictive models
14.2.1 Proper scoring rules
14.2.2 Calibration
14.2.3 Beyond evaluating marginal probabilities
14.3 Conformal prediction
14.3.1 Conformalizing classification
14.3.2 Conformalizing regression
15 Generalized linear models
15.1 Introduction
15.1.1 Some popular GLMs
15.1.2 GLMs with noncanonical link functions
15.1.3 Maximum likelihood estimation
15.1.4 Bayesian inference
15.2 Linear regression
15.2.1 Ordinary least squares
15.2.2 Conjugate priors
15.2.3 Uninformative priors
15.2.4 Informative priors
15.2.5 Spike and slab prior
15.2.6 Laplace prior (Bayesian lasso)
15.2.7 Horseshoe prior
15.2.8 Automatic relevancy determination
15.2.9 Multivariate linear regression
15.3 Logistic regression
15.3.1 Binary logistic regression
15.3.2 Multinomial logistic regression
15.3.3 Dealing with class imbalance and the long tail
15.3.4 Parameter priors
15.3.5 Laplace approximation to the posterior
15.3.6 Approximating the posterior predictive distribution
15.3.7 MCMC inference
15.3.8 Other approximate inference methods
15.3.9 Case study: is Berkeley admissions biased against women?
15.4 Probit regression
15.4.1 Latent variable interpretation
15.4.2 Maximum likelihood estimation
15.4.3 Bayesian inference
15.4.4 Ordinal probit regression
15.4.5 Multinomial probit models
15.5 Multilevel (hierarchical) GLMs
15.5.1 Generalized linear mixed models (GLMMs)
15.5.2 Example: radon regression
16 Deep neural networks
16.1 Introduction
16.2 Building blocks of differentiable circuits
16.2.1 Linear layers
16.2.2 Nonlinearities
16.2.3 Convolutional layers
16.2.4 Residual (skip) connections
16.2.5 Normalization layers
16.2.6 Dropout layers
16.2.7 Attention layers
16.2.8 Recurrent layers
16.2.9 Multiplicative layers
16.2.10 Implicit layers
16.3 Canonical examples of neural networks
16.3.1 Multilayer perceptrons (MLPs)
16.3.2 Convolutional neural networks (CNNs)
16.3.3 Autoencoders
16.3.4 Recurrent neural networks (RNNs)
16.3.5 Transformers
16.3.6 Graph neural networks (GNNs)
17 Bayesian neural networks
17.1 Introduction
17.2 Priors for BNNs
17.2.1 Gaussian priors
17.2.2 Sparsity-promoting priors
17.2.3 Learning the prior
17.2.4 Priors in function space
17.2.5 Architectural priors
17.3 Posteriors for BNNs
17.3.1 Monte Carlo dropout
17.3.2 Laplace approximation
17.3.3 Variational inference
17.3.4 Expectation propagation
17.3.5 Last layer methods
17.3.6 SNGP
17.3.7 MCMC methods
17.3.8 Methods based on the SGD trajectory
17.3.9 Deep ensembles
17.3.10 Approximating the posterior predictive distribution
17.3.11 Tempered and cold posteriors
17.4 Generalization in Bayesian deep learning
17.4.1 Sharp vs flat minima
17.4.2 Mode connectivity and the loss landscape
17.4.3 Effective dimensionality of a model
17.4.4 The hypothesis space of DNNs
17.4.5 PAC-Bayes
17.4.6 Out-of-distribution generalization for BNNs
17.4.7 Model selection for BNNs
17.5 Online inference
17.5.1 Sequential Laplace for DNNs
17.5.2 Extended Kalman filtering for DNNs
17.5.3 Assumed density filtering for DNNs
17.5.4 Online variational inference for DNNs
17.6 Hierarchical Bayesian neural networks
17.6.1 Example: multimoons classification
18 Gaussian processes
18.1 Introduction
18.1.1 GPs: what and why?
18.2 Mercer kernels
18.2.1 Stationary kernels
18.2.2 Nonstationary kernels
18.2.3 Kernels for nonvectorial (structured) inputs
18.2.4 Making new kernels from old
18.2.5 Mercer’s theorem
18.2.6 Approximating kernels with random features
18.3 GPs with Gaussian likelihoods
18.3.1 Predictions using noise-free observations
18.3.2 Predictions using noisy observations
18.3.3 Weight space vs function space
18.3.4 Semiparametric GPs
18.3.5 Marginal likelihood
18.3.6 Computational and numerical issues
18.3.7 Kernel ridge regression
18.4 GPs with non-Gaussian likelihoods
18.4.1 Binary classification
18.4.2 Multiclass classification
18.4.3 GPs for Poisson regression (Cox process)
18.4.4 Other likelihoods
18.5 Scaling GP inference to large datasets
18.5.1 Subset of data
18.5.2 Nyström approximation
18.5.3 Inducing point methods
18.5.4 Sparse variational methods
18.5.5 Exploiting parallelization and structure via kernel matrix multiplies
18.5.6 Converting a GP to an SSM
18.6 Learning the kernel
18.6.1 Empirical Bayes for the kernel parameters
18.6.2 Bayesian inference for the kernel parameters
18.6.3 Multiple kernel learning for additive kernels
18.6.4 Automatic search for compositional kernels
18.6.5 Spectral mixture kernel learning
18.6.6 Deep kernel learning
18.7 GPs and DNNs
18.7.1 Kernels derived from infinitely wide DNNs (NN-GP)
18.7.2 Neural tangent kernel (NTK)
18.7.3 Deep GPs
18.8 Gaussian processes for time series forecasting
18.8.1 Example: Mauna Loa
19 Beyond the iid assumption
19.1 Introduction
19.2 Distribution shift
19.2.1 Motivating examples
19.2.2 A causal view of distribution shift
19.2.3 The four main types of distribution shift
19.2.4 Selection bias
19.3 Detecting distribution shifts
19.3.1 Detecting shifts using two-sample testing
19.3.2 Detecting single out-of-distribution (OOD) inputs
19.3.3 Selective prediction
19.3.4 Open set and open world recognition
19.4 Robustness to distribution shifts
19.4.1 Data augmentation
19.4.2 Distributionally robust optimization
19.5 Adapting to distribution shifts
19.5.1 Supervised adaptation using transfer learning
19.5.2 Weighted ERM for covariate shift
19.5.3 Unsupervised domain adaptation for covariate shift
19.5.4 Unsupervised techniques for label shift
19.5.5 Test-time adaptation
19.6 Learning from multiple distributions
19.6.1 Multitask learning
19.6.2 Domain generalization
19.6.3 Invariant risk minimization
19.6.4 Meta learning
19.7 Continual learning
19.7.1 Domain drift
19.7.2 Concept drift
19.7.3 Task incremental learning
19.7.4 Catastrophic forgetting
19.7.5 Online learning
19.8 Adversarial examples
19.8.1 Whitebox (gradient-based) attacks
19.8.2 Blackbox (gradient-free) attacks
19.8.3 Real world adversarial attacks
19.8.4 Defenses based on robust optimization
19.8.5 Why models have adversarial examples
IV Generation
20 Generative models: an overview
20.1 Introduction
20.2 Types of generative model
20.3 Goals of generative modeling
20.3.1 Generating data
20.3.2 Density estimation
20.3.3 Imputation
20.3.4 Structure discovery
20.3.5 Latent space interpolation
20.3.6 Latent space arithmetic
20.3.7 Generative design
20.3.8 Model-based reinforcement learning
20.3.9 Representation learning
20.3.10 Data compression
20.4 Evaluating generative models
20.4.1 Likelihood-based evaluation
20.4.2 Distances and divergences in feature space
20.4.3 Precision and recall metrics
20.4.4 Statistical tests
20.4.5 Challenges with using pretrained classifiers
20.4.6 Using model samples to train classifiers
20.4.7 Assessing overfitting
20.4.8 Human evaluation
21 Variational autoencoders
21.1 Introduction
21.2 VAE basics
21.2.1 Modeling assumptions
21.2.2 Model fitting
21.2.3 Comparison of VAEs and autoencoders
21.2.4 VAEs optimize in an augmented space
21.3 VAE generalizations
21.3.1 β-VAE
21.3.2 InfoVAE
21.3.3 Multimodal VAEs
21.3.4 Semisupervised VAEs
21.3.5 VAEs with sequential encoders/decoders
21.4 Avoiding posterior collapse
21.4.1 KL annealing
21.4.2 Lower bounding the rate
21.4.3 Free bits
21.4.4 Adding skip connections
21.4.5 Improved variational inference
21.4.6 Alternative objectives
21.5 VAEs with hierarchical structure
21.5.1 Bottom-up vs top-down inference
21.5.2 Example: very deep VAE
21.5.3 Connection with autoregressive models
21.5.4 Variational pruning
21.5.5 Other optimization difficulties
21.6 Vector quantization VAE
21.6.1 Autoencoder with binary code
21.6.2 VQ-VAE model
21.6.3 Learning the prior
21.6.4 Hierarchical extension (VQ-VAE-2)
21.6.5 Discrete VAE
21.6.6 VQ-GAN
22 Autoregressive models
22.1 Introduction
22.2 Neural autoregressive density estimators (NADE)
22.3 Causal CNNs
22.3.1 1d causal CNN (convolutional Markov models)
22.3.2 2d causal CNN (PixelCNN)
22.4 Transformers
22.4.1 Text generation (GPT, etc.)
22.4.2 Image generation (DALL-E, etc.)
22.4.3 Other applications
23 Normalizing flows
23.1 Introduction
23.1.1 Preliminaries
23.1.2 How to train a flow model
23.2 Constructing flows
23.2.1 Affine flows
23.2.2 Elementwise flows
23.2.3 Coupling flows
23.2.4 Autoregressive flows
23.2.5 Residual flows
23.2.6 Continuous-time flows
23.3 Applications
23.3.1 Density estimation
23.3.2 Generative modeling
23.3.3 Inference
24 Energy-based models
24.1 Introduction
24.1.1 Example: products of experts (PoE)
24.1.2 Computational difficulties
24.2 Maximum likelihood training
24.2.1 Gradient-based MCMC methods
24.2.2 Contrastive divergence
24.3 Score matching (SM)
24.3.1 Basic score matching
24.3.2 Denoising score matching (DSM)
24.3.3 Sliced score matching (SSM)
24.3.4 Connection to contrastive divergence
24.3.5 Score-based generative models
24.4 Noise contrastive estimation
24.4.1 Connection to score matching
24.5 Other methods
24.5.1 Minimizing differences/derivatives of KL divergences
24.5.2 Minimizing the Stein discrepancy
24.5.3 Adversarial training
25 Diffusion models
25.1 Introduction
25.2 Denoising diffusion probabilistic models (DDPMs)
25.2.1 Encoder (forwards diffusion)
25.2.2 Decoder (reverse diffusion)
25.2.3 Model fitting
25.2.4 Learning the noise schedule
25.2.5 Example: image generation
25.3 Score-based generative models (SGMs)
25.3.1 Example
25.3.2 Adding noise at multiple scales
25.3.3 Equivalence to DDPM
25.4 Continuous time models using differential equations
25.4.1 Forwards diffusion SDE
25.4.2 Forwards diffusion ODE
25.4.3 Reverse diffusion SDE
25.4.4 Reverse diffusion ODE
25.4.5 Comparison of the SDE and ODE approach
25.4.6 Example
25.5 Speeding up diffusion models
25.5.1 DDIM sampler
25.5.2 Non-Gaussian decoder networks
25.5.3 Distillation
25.5.4 Latent space diffusion
25.6 Conditional generation
25.6.1 Conditional diffusion model
25.6.2 Classifier guidance
25.6.3 Classifier-free guidance
25.6.4 Generating high resolution images
25.7 Diffusion for discrete state spaces
25.7.1 Discrete denoising diffusion probabilistic models
25.7.2 Choice of Markov transition matrices for the forward processes
25.7.3 Parameterization of the reverse process
25.7.4 Noise schedules
25.7.5 Connections to other probabilistic models for discrete sequences
26 Generative adversarial networks
26.1 Introduction
26.2 Learning by comparison
26.2.1 Guiding principles
26.2.2 Density ratio estimation using binary classifiers
26.2.3 Bounds on f-divergences
26.2.4 Integral probability metrics
26.2.5 Moment matching
26.2.6 On density ratios and differences
26.3 Generative adversarial networks
26.3.1 From learning principles to loss functions
26.3.2 Gradient descent
26.3.3 Challenges with GAN training
26.3.4 Improving GAN optimization
26.3.5 Convergence of GAN training
26.4 Conditional GANs
26.5 Inference with GANs
26.6 Neural architectures in GANs
26.6.1 The importance of discriminator architectures
26.6.2 Architectural inductive biases
26.6.3 Attention in GANs
26.6.4 Progressive generation
26.6.5 Regularization
26.6.6 Scaling up GAN models
26.7 Applications
26.7.1 GANs for image generation
26.7.2 Video generation
26.7.3 Audio generation
26.7.4 Text generation
26.7.5 Imitation learning
26.7.6 Domain adaptation
26.7.7 Design, art and creativity
V Discovery
27 Discovery methods: an overview
27.1 Introduction
27.2 Overview of Part V
28 Latent factor models
28.1 Introduction
28.2 Mixture models
28.2.1 Gaussian mixture models (GMMs)
28.2.2 Bernoulli mixture models
28.2.3 Gaussian scale mixtures (GSMs)
28.2.4 Using GMMs as a prior for inverse imaging problems
28.2.5 Using mixture models for classification problems
28.3 Factor analysis
28.3.1 Factor analysis: the basics
28.3.2 Probabilistic PCA
28.3.3 Mixture of factor analyzers
28.3.4 Factor analysis models for paired data
28.3.5 Factor analysis with exponential family likelihoods
28.3.6 Factor analysis with DNN likelihoods (VAEs)
28.3.7 Factor analysis with GP likelihoods (GP-LVM)
28.4 LFMs with non-Gaussian priors
28.4.1 Non-negative matrix factorization (NMF)
28.4.2 Multinomial PCA
28.5 Topic models
28.5.1 Latent Dirichlet allocation (LDA)
28.5.2 Correlated topic model
28.5.3 Dynamic topic model
28.5.4 LDA-HMM
28.6 Independent components analysis (ICA)
28.6.1 Noiseless ICA model
28.6.2 The need for non-Gaussian priors
28.6.3 Maximum likelihood estimation
28.6.4 Alternatives to MLE
28.6.5 Sparse coding
28.6.6 Nonlinear ICA
29 State-space models
29.1 Introduction
29.2 Hidden Markov models (HMMs)
29.2.1 Conditional independence properties
29.2.2 State transition model
29.2.3 Discrete likelihoods
29.2.4 Gaussian likelihoods
29.2.5 Autoregressive likelihoods
29.2.6 Neural network likelihoods
29.3 HMMs: applications
29.3.1 Time series segmentation
29.3.2 Protein sequence alignment
29.3.3 Spelling correction
29.4 HMMs: parameter learning
29.4.1 The Baum-Welch (EM) algorithm
29.4.2 Parameter estimation using SGD
29.4.3 Parameter estimation using spectral methods
29.4.4 Bayesian HMMs
29.5 HMMs: generalizations
29.5.1 Hidden semi-Markov model (HSMM)
29.5.2 Hierarchical HMMs
29.5.3 Factorial HMMs
29.5.4 Coupled HMMs
29.5.5 Dynamic Bayes nets (DBN)
29.5.6 Changepoint detection
29.6 Linear dynamical systems (LDSs)
29.6.1 Conditional independence properties
29.6.2 Parameterization
29.7 LDS: applications
29.7.1 Object tracking and state estimation
29.7.2 Online Bayesian linear regression (recursive least squares)
29.7.3 Adaptive filtering
29.7.4 Time series forecasting
29.8 LDS: parameter learning
29.8.1 EM for LDS
29.8.2 Subspace identification methods
29.8.3 Ensuring stability of the dynamical system
29.8.4 Bayesian LDS
29.9 Switching linear dynamical systems (SLDSs)
29.9.1 Parameterization
29.9.2 Posterior inference
29.9.3 Application: multitarget tracking
29.10 Nonlinear SSMs
29.10.1 Example: object tracking and state estimation
29.10.2 Posterior inference
29.11 Non-Gaussian SSMs
29.11.1 Example: spike train modeling
29.11.2 Example: stochastic volatility models
29.11.3 Posterior inference
29.12 Structural time series models
29.12.1 Introduction
29.12.2 Structural building blocks
29.12.3 Model fitting
29.12.4 Forecasting
29.12.5 Examples
29.12.6 Causal impact of a time series intervention
29.12.7 Prophet
29.12.8 Neural forecasting methods
29.13 Deep SSMs
29.13.1 Deep Markov models
29.13.2 Recurrent SSM
29.13.3 Improving multistep predictions
29.13.4 Variational RNNs
30 Graph learning
30.1 Introduction
30.2 Latent variable models for graphs
30.3 Graphical model structure learning
31 Nonparametric Bayesian models
31.1 Introduction
32 Representation learning
32.1 Introduction
32.2 Evaluating and comparing learned representations
32.2.1 Downstream performance
32.2.2 Representational similarity
32.3 Approaches for learning representations
32.3.1 Supervised representation learning and transfer
32.3.2 Generative representation learning
32.3.3 Self-supervised representation learning
32.3.4 Multiview representation learning
32.4 Theory of representation learning
32.4.1 Identifiability
32.4.2 Information maximization
33 Interpretability
33.1 Introduction
33.1.1 The role of interpretability: unknowns and under-specifications
33.1.2 Terminology and framework
33.2 Methods for interpretable machine learning
33.2.1 Inherently interpretable models: the model is its explanation
33.2.2 Semi-inherently interpretable models: example-based methods
33.2.3 Post-hoc or joint training: the explanation gives a partial view of the model
33.2.4 Transparency and visualization
33.3 Properties: the abstraction between context and method
33.3.1 Properties of explanations from interpretable machine learning
33.3.2 Properties of explanations from cognitive science
33.4 Evaluation of interpretable machine learning models
33.4.1 Computational evaluation: does the method have desired properties?
33.4.2 User study-based evaluation: does the method help a user perform a target task?
33.5 Discussion: how to think about interpretable machine learning
VI Action
34 Decision making under uncertainty
34.1 Statistical decision theory
34.1.1 Basics
34.1.2 Frequentist decision theory
34.1.3 Bayesian decision theory
34.1.4 Frequentist optimality of the Bayesian approach
34.1.5 Examples of one-shot decision making problems
34.2 Decision (influence) diagrams
34.2.1 Example: oil wildcatter
34.2.2 Information arcs
34.2.3 Value of information
34.2.4 Computing the optimal policy
34.3 A/B testing
34.3.1 A Bayesian approach
34.3.2 Example
34.4 Contextual bandits
34.4.1 Types of bandit
34.4.2 Applications
34.4.3 Exploration-exploitation tradeoff
34.4.4 The optimal solution
34.4.5 Upper confidence bounds (UCBs)
34.4.6 Thompson sampling
34.4.7 Regret
34.5 Markov decision problems
34.5.1 Basics
34.5.2 Partially observed MDPs
34.5.3 Episodes and returns
34.5.4 Value functions
34.5.5 Optimal value functions and policies
34.6 Planning in an MDP
34.6.1 Value iteration
34.6.2 Policy iteration
34.6.3 Linear programming
34.7 Active learning
34.7.1 Active learning scenarios
34.7.2 Relationship to other forms of sequential decision making
34.7.3 Acquisition strategies
34.7.4 Batch active learning
35 Reinforcement learning
35.1 Introduction
35.1.1 Overview of methods
35.1.2 Value-based methods
35.1.3 Policy search methods
35.1.4 Model-based RL
35.1.5 Exploration-exploitation tradeoff
35.2 Value-based RL
35.2.1 Monte Carlo RL
35.2.2 Temporal difference (TD) learning
35.2.3 TD learning with eligibility traces
35.2.4 SARSA: on-policy TD control
35.2.5 Q-learning: off-policy TD control
35.2.6 Deep Q-network (DQN)
35.3 Policy-based RL
35.3.1 The policy gradient theorem
35.3.2 REINFORCE
35.3.3 Actor-critic methods
35.3.4 Bound optimization methods
35.3.5 Deterministic policy gradient methods
35.3.6 Gradient-free methods
35.4 Model-based RL
35.4.1 Model predictive control (MPC)
35.4.2 Combining model-based and model-free
35.4.3 MBRL using Gaussian processes
35.4.4 MBRL using DNNs
35.4.5 MBRL using latent-variable models
35.4.6 Robustness to model errors
35.5 Off-policy learning
35.5.1 Basic techniques
35.5.2 The curse of horizon
35.5.3 The deadly triad
35.6 Control as inference
35.6.1 Maximum entropy reinforcement learning
35.6.2 Other approaches
35.6.3 Imitation learning
36 Causality
36.1 Introduction
36.2 Causal formalism
36.2.1 Structural causal models
36.2.2 Causal DAGs
36.2.3 Identification
36.2.4 Counterfactuals and the causal hierarchy
36.3 Randomized control trials
36.4 Confounder adjustment
36.4.1 Causal estimand, statistical estimand, and identification
36.4.2 ATE estimation with observed confounders
36.4.3 Uncertainty quantification
36.4.4 Matching
36.4.5 Practical considerations and procedures
36.4.6 Summary and practical advice
36.5 Instrumental variable strategies
36.5.1 Additive unobserved confounding
36.5.2 Instrument monotonicity and local average treatment effect
36.5.3 Two stage least squares
36.6 Difference in differences
36.6.1 Estimation
36.7 Credibility checks
36.7.1 Placebo checks
36.7.2 Sensitivity analysis to unobserved confounding
36.8 The do-calculus
36.8.1 The three rules
36.8.2 Revisiting backdoor adjustment
36.8.3 Frontdoor adjustment
36.9 Further reading
Index
Bibliography