I’m honestly not a huge fan of the book — I think it’s pretty confusing. But I don’t know of a better book and it does seem quite comprehensive.

Exponential Families

An exponential family is a set of distributions that can be written in the form

$$p(x \mid \eta) = h(x)\exp\!\left(\eta^\top T(x) - A(\eta)\right),$$

where $T(x)$ is the sufficient statistic, $\eta$ is the natural parameter, and $A(\eta)$ is the log partition function.

This captures a lot of natural distributions, like Gaussians.
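A minimal sketch (my own, not from the book) checking this for a univariate Gaussian, with $T(x) = (x, x^2)$, $\eta = (\mu/\sigma^2,\, -1/(2\sigma^2))$, and base measure $h(x) = 1/\sqrt{2\pi}$:

```python
import numpy as np
from scipy.stats import norm

# Univariate Gaussian N(mu, sigma^2) written in exponential family form
#   p(x) = h(x) * exp(eta . T(x) - A(eta)),   T(x) = (x, x^2),  h(x) = 1/sqrt(2*pi)
mu, sigma = 1.5, 0.7

eta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])             # natural parameters
A = -eta[0]**2 / (4 * eta[1]) + 0.5 * np.log(-1.0 / (2 * eta[1]))  # log partition function

def expfam_pdf(x):
    T = np.array([x, x**2])           # sufficient statistics
    h = 1.0 / np.sqrt(2 * np.pi)      # base measure
    return h * np.exp(eta @ T - A)

x = 2.3
print(expfam_pdf(x), norm.pdf(x, loc=mu, scale=sigma))   # the two values should match
```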

5 info theory

KL(p||q) is the expected weight of evidence for p over q, when the true distribution is p: $KL(p\|q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$.

weight of evidence: observing $x$ gives $\log \frac{p(x)}{q(x)}$ worth of evidence for p over q.
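A quick numeric check (my own, with made-up distributions): a Monte Carlo average of the weight of evidence under p agrees with the direct KL computation.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])   # true distribution
q = np.array([0.2, 0.4, 0.4])   # alternative hypothesis

# Direct computation: KL(p || q) = sum_x p(x) log(p(x)/q(x))
kl = np.sum(p * np.log(p / q))

# Expected weight of evidence: draw x ~ p, average log p(x)/q(x)
xs = rng.choice(3, size=200_000, p=p)
woe = np.mean(np.log(p[xs] / q[xs]))

print(kl, woe)   # the two numbers should agree closely
```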

A weird form of KL divergence that I don’t get yet:

data processing inequality for KL: KL can only decrease if you process the rvs. If you push both distributions through the same (possibly stochastic) map $x \mapsto y$, then $KL(p_y \| q_y) \le KL(p_x \| q_x)$.
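A toy check of this (mine, with an arbitrary deterministic “processing” map that lumps outcomes together):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

def kl(a, b):
    return np.sum(a * np.log(a / b))

# Process both rvs through the same map: {0,1} -> 0, {2,3} -> 1
def lump(dist):
    return np.array([dist[0] + dist[1], dist[2] + dist[3]])

print(kl(p, q), kl(lump(p), lump(q)))   # the second number is <= the first
```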

Fisher information matrix: $F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right] = -\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta^2 \log p_\theta(x)\right]$ (the second equality needs some regularity conditions).

KL divergence between $p_\theta$ and $p_{\theta+\delta}$ is approximately $\frac{1}{2}\,\delta^\top F(\theta)\,\delta$ by a Taylor expansion.
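A sanity check of that approximation (my own numbers), using a Bernoulli where both sides have closed forms: the Fisher information is $1/(\theta(1-\theta))$, and the exact KL is easy to compute.

```python
import numpy as np

theta, delta = 0.3, 0.01

# Exact KL( Ber(theta) || Ber(theta + delta) )
def kl_bern(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

# Fisher information of a Bernoulli: F(theta) = 1 / (theta * (1 - theta))
F = 1.0 / (theta * (1 - theta))

print(kl_bern(theta, theta + delta))   # exact value
print(0.5 * F * delta**2)              # second-order approximation, should be close
```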

Bregman divergence:

how far off is the first-order Taylor expansion of $f$ around $x$ from the true value at $y$? i.e., $B_f(y, x) = f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle$.
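Two sanity checks (mine): with $f(x) = \|x\|^2$ the Bregman divergence is squared Euclidean distance, and with $f(p) = \sum_i p_i \log p_i$ (negative entropy, restricted to the simplex) it recovers KL.

```python
import numpy as np

def bregman(f, grad_f, y, x):
    # B_f(y, x) = f(y) - [ f(x) + <grad f(x), y - x> ]
    return f(y) - f(x) - grad_f(x) @ (y - x)

# f(x) = ||x||^2  ->  Bregman divergence is squared Euclidean distance
f_sq  = lambda v: v @ v
gf_sq = lambda v: 2 * v
y, x = np.array([1.0, 2.0]), np.array([0.0, -1.0])
print(bregman(f_sq, gf_sq, y, x), np.sum((y - x) ** 2))

# f(p) = sum_i p_i log p_i (negative entropy)  ->  Bregman divergence is KL(y || x)
f_ne  = lambda p: np.sum(p * np.log(p))
gf_ne = lambda p: np.log(p) + 1
y, x = np.array([0.6, 0.4]), np.array([0.3, 0.7])
print(bregman(f_ne, gf_ne, y, x), np.sum(y * np.log(y / x)))
```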

Entropy is defined as $H(p) = -\mathbb{E}_{x \sim p}[\log p(x)]$, i.e., $-\sum_x p(x) \log p(x)$ in the discrete case.

7.5.3 EM

I found the book’s exposition extremely confusing. The discussion below should clarify what’s going on a lot!

Setup

We are going to realize some phenomenon several times. Let’s suppose the phenomenon is $(z, x)$, where $z$ is the temperature and $x$ is Alek’s happiness level. Let’s realize this for $n$ days, generating pairs $(z_1, x_1), \ldots, (z_n, x_n)$. Let’s suppose we just observe the $x_i$ and not the $z_i$. We’re trying to fit a model $p_\theta(z, x)$ to explain the data. Our loss function is the negative log-likelihood of what we actually observed:

$$\ell(\theta) = -\sum_{i=1}^n \log p_\theta(x_i) = -\sum_{i=1}^n \log \int p_\theta(z, x_i)\, dz.$$

Suppose you wanted to optimize this function. It might be hard because the integral could be intractable.

So, we are going to need a fancier technique for optimizing this.

First, we need to define the ELBO (evidence lower bound). For any distribution $q$ over the latent $z$,

$$\mathrm{ELBO}(\theta, q \mid x) = \mathbb{E}_{z \sim q}\!\left[\log \frac{p_\theta(z, x)}{q(z)}\right].$$

It’s immediate by Jensen’s inequality that $\mathrm{ELBO}(\theta, q \mid x) \le \log p_\theta(x)$. And it’s also clear that setting $q(z) = p_\theta(z \mid x)$ makes the bound tight.

We also define the overall objective by summing over the data points, $\mathrm{ELBO}(\theta, q_{1:n}) = \sum_{i=1}^n \mathrm{ELBO}(\theta, q_i \mid x_i)$, with one variational distribution $q_i$ per observation.
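A tiny discrete check of those two claims (my own numbers, not the book’s): for an arbitrary $q$ the ELBO sits below $\log p_\theta(x)$, and plugging in the true posterior closes the gap.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small joint model p(z, x) with z in {0,1,2}, x in {0,1}
joint = np.array([[0.10, 0.05],
                  [0.20, 0.25],
                  [0.15, 0.25]])   # rows: z, cols: x; entries sum to 1
x = 1                              # the observed value

log_px = np.log(joint[:, x].sum())  # log marginal likelihood, log p(x)

def elbo(q):
    # E_{z~q}[ log p(z, x) - log q(z) ]
    return np.sum(q * (np.log(joint[:, x]) - np.log(q)))

q_random = rng.dirichlet(np.ones(3))       # an arbitrary q(z)
q_post = joint[:, x] / joint[:, x].sum()   # the true posterior p(z | x)

print(log_px, elbo(q_random))   # ELBO <= log p(x)
print(log_px, elbo(q_post))     # equal: the bound is tight at the posterior
```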

ok, so why do we care?

It’s going to give us this “EM” algorithm, which apparently works pretty well in practice. Although I’m not sure yet what its theoretical guarantees are in terms of how fast it converges.

E step

  • Choose $q_i(z) = p_\theta(z \mid x_i)$ for each $i$, holding the current $\theta$ fixed. This makes the bound tight.

M step

  • Choose $\theta \leftarrow \arg\max_\theta \sum_i \mathbb{E}_{z \sim q_i}\!\left[\log p_\theta(z, x_i)\right]$, holding the $q_i$ fixed. (This is the same as maximizing the ELBO in $\theta$, since the $-\log q_i$ terms don’t depend on $\theta$.)

If $p_\theta$ satisfies some conditions (e.g., the complete-data model is an exponential family), then these are both tractable problems, whereas the original problem might not have been tractable.
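To make the two steps concrete, here’s a minimal numpy sketch (mine, not from the book) of EM for a 1-D mixture of two Gaussians, where both steps have closed forms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a 2-component 1-D Gaussian mixture
n = 2000
z_true = rng.random(n) < 0.3
x = np.where(z_true, rng.normal(-2.0, 1.0, n), rng.normal(3.0, 1.5, n))

# Initial guesses for (mixing weight, means, stds)
pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_normal(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m) ** 2 / (2 * s**2)

for _ in range(100):
    # E step: q_i(z) = p(z | x_i) under the current parameters ("responsibilities")
    log_r = np.stack([np.log(pi) + log_normal(x, mu[0], sigma[0]),
                      np.log(1 - pi) + log_normal(x, mu[1], sigma[1])], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M step: maximize E_{z~q}[log p(z, x)] over the parameters (closed form here)
    nk = r.sum(axis=0)
    pi = nk[0] / n
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)   # should be close to (0.3, [-2, 3], [1, 1.5])
```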

EXAMPLE

Suppose $x_1, \ldots, x_n$ are sampled iid from $\mathcal{N}(\mu, \Sigma)$. The MLE estimates of $\mu$ and $\Sigma$ should be the sample mean and sample covariance.
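A quick check (mine): note the MLE covariance uses $1/n$ rather than $1/(n-1)$, which is `bias=True` in numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])

x = rng.multivariate_normal(mu_true, Sigma_true, size=50_000)

mu_mle = x.mean(axis=0)
Sigma_mle = np.cov(x, rowvar=False, bias=True)   # 1/n normalization = MLE

print(mu_mle)     # close to mu_true
print(Sigma_mle)  # close to Sigma_true
```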

Now suppose we have missing data. #todo you do something else.

VI

Variational inference is the following problem: You have some model $p(z)$ and $p(x \mid z)$. You’d like to compute the posterior $p(z \mid x)$. This is intractable, so you approximate it with some $q_\phi(z)$ from a parameterized family. We’re going to try to find the $q_\phi$ from this parameterized family that minimizes the KL divergence between $q_\phi(z)$ and $p(z \mid x)$.

It turns out that this is equivalent to maximizing the ELBO, which is defined as $\mathrm{ELBO}(\phi) = \mathbb{E}_{z \sim q_\phi}\!\left[\log p(z, x) - \log q_\phi(z)\right]$. The reason is the identity $\log p(x) = \mathrm{ELBO}(\phi) + KL\!\left(q_\phi(z)\,\|\,p(z \mid x)\right)$: since $\log p(x)$ doesn’t depend on $\phi$, minimizing the KL is the same as maximizing the ELBO.
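A numeric check of that identity (my own small discrete example): the gap between the ELBO and the log evidence is exactly the KL we’re trying to minimize.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small joint p(z, x) with z in {0,1,2,3}, x in {0,1}
joint = rng.dirichlet(np.ones(8)).reshape(4, 2)
x = 0

log_px = np.log(joint[:, x].sum())
posterior = joint[:, x] / joint[:, x].sum()

q = rng.dirichlet(np.ones(4))                              # an arbitrary q(z)
elbo = np.sum(q * (np.log(joint[:, x]) - np.log(q)))
kl_q_post = np.sum(q * np.log(q / posterior))

print(log_px, elbo + kl_q_post)   # these agree: log p(x) = ELBO + KL(q || posterior)
```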

VAEs

A comment on the word “variational”:

  • it sounds really fancy.
  • if it means anything at all, I guess it means “optimizing over a space of fns”.

Now we’re going to discuss VAEs. My understanding is that they have some really clever trick to improve sample efficiency.

I guess we can think of this as a generative model. At least, that’s one reason you might want to build a VAE.

Anyways, we’ll have

  • Prior $p(z)$ over the latent variable,

  • and a likelihood / decoder $p_\theta(x \mid z)$.

  • We’re going to learn an encoder $q_\phi(z \mid x)$ to approximate the posterior $p_\theta(z \mid x)$.

We can fit a VAE via VI.
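A minimal sketch of what that looks like (mine, using PyTorch and the usual reparameterization trick; the layer sizes and the Bernoulli pixel model are arbitrary choices, not from the book). The encoder plays the role of $q_\phi(z \mid x)$, the decoder of $p_\theta(x \mid z)$, and the loss is the negative ELBO.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        # Encoder q_phi(z | x): outputs mean and log-variance of a Gaussian over z
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x | z): Bernoulli over pixels
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps z differentiable in phi
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec(z)
        # Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I)), KL in closed form
        recon = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (recon + kl) / x.shape[0]

# Usage: one gradient step on a fake batch
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)          # stand-in for a batch of flattened images
loss = model(x)
loss.backward()
opt.step()
print(loss.item())
```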

Hmm, ok they didn’t really have anything to say.