I’m tentatively planning to grind through MIT’s “inference” courses.

Specifically, this is

  • algorithms for inference
  • inference and information

These cover some things like

  • graphical models
  • information theory
  • inference
  • information geometry
  • thinking about structure of random variables and independence and stuff
  • probably other stuff too

These topics seem

  1. pretty intellectually interesting
  2. the most relevant stuff to ARC’s research agenda (where I’ll be interning over the summer)

I reserve the right to quit if I find some cool research projects / make up some cool research questions myself. However I don’t anticipate quitting currently.

I’ll record my journey of doing this here. I’ll also tentatively plan to make this into a dialogue because I love JJ+shatar+blobby, but right now I just need to write some stuff down.

Day 0 — Dec 12: Likelihood ratio test: If you have two hypotheses, and want to select one to minimize the probability that you select the wrong one, then you should compute Pr(hypothesis X is true given the data) for each hypothesis and choose whichever one maximizes this chance. This assumes that you have priors on the chance of each hypothesis being correct, and that under each hypothesis you can compute Pr(data).
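Spelling that out in my own notation (two hypotheses $H_0, H_1$, observed data $y$): the rule is to decide $H_1$ exactly when

$$\Pr(H_1)\, p(y \mid H_1) \;\ge\; \Pr(H_0)\, p(y \mid H_0), \qquad \text{i.e.,} \qquad \frac{p(y \mid H_1)}{p(y \mid H_0)} \;\ge\; \frac{\Pr(H_0)}{\Pr(H_1)},$$

which is where the "likelihood ratio" name comes from: compare the likelihood ratio to a threshold set by the priors.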

Day 1 — Dec 13 Read the first couple Alg for Inf notes. First some definitions, then a neat theorem.

Directed Graphical Model In a directed graphical model, we can factor the probability distribution into terms of the form $p(x_i \mid x_{\text{parents}(i)})$, i.e., $p(x_1, \dots, x_n) = \prod_i p(x_i \mid x_{\text{parents}(i)})$. So any distribution obeys the conditional independences implied by the complete graph (which is not so impressive since that's the empty set).

But my digraph implies more independences than that; the picture I drew had a concrete example.

The general hope for this class is that if we have a sparse graph then we can do much much more efficient things than are possible in the dense case.

A general algorithm for listing off some conditional independences is:

  • topo-sort the DAG
  • each $x_i$ is independent of the non-parent things before it in the topo-order, conditional on $x_i$'s parents (sketched in code below)
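Here's a minimal code sketch of that rule (my function name; the DAG is a dict mapping each node to its set of parents, and I lean on Python's `graphlib` for the topo-sort):

```python
from graphlib import TopologicalSorter

def local_independencies(parents):
    """For a DAG given as {node: set of parents}, list the statements
    'node is independent of its non-parent predecessors given its parents',
    using one topological order."""
    # TopologicalSorter takes a {node: predecessors} mapping.
    order = list(TopologicalSorter(parents).static_order())
    statements = []
    for i, v in enumerate(order):
        predecessors = set(order[:i])
        non_parents = predecessors - set(parents[v])
        if non_parents:
            statements.append((v, non_parents, set(parents[v])))
    return statements  # (node, independent-of, conditioned-on) triples

# Example: the chain x -> y -> z gives "z independent of x given y".
print(local_independencies({"x": set(), "y": {"x"}, "z": {"y"}}))
```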

Bayes Ball Algorithm There’s this pretty handy trick for checking whether a conditional independence relation is implied by a directed graphical model — it’s called the bouncy ball method.

The method answers the question: is $x_A \perp x_B \mid x_C$? Here $A$, $B$, $C$ can all be sets of variables.

  1. Shade the variables in $C$.
  2. Put a ball at all vertices in $A$.
  3. Bounce the balls.
  4. If a ball ever hits $B$, then the conditional independence relation is false. Otherwise it is true.

The bouncing is unfortunately a little bit counter-intuitive — it’s not just “when you hit a shaded circle, bounce backwards”. Instead there are three cases:

  1. Chain ($x \to z \to y$): bounce if $z$ is shaded, else pass.
  2. Up a tree / common parent ($x \leftarrow z \to y$): bounce if $z$ is shaded, else pass.
  3. Vee / collider ($x \to z \leftarrow y$): pass if $z$ is shaded, else bounce.

The only way I can think of to remember this is to keep these three examples in mind. (A reachability-style code sketch of the same check is below.)
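Here's how I'd implement the check (my function name; `parents` maps each node to its set of parents, and "shaded" means in $C$). This is the standard reachability phrasing of the same rules rather than a literal ball simulation: explore (node, direction-of-arrival) states and apply the chain / tree / vee cases at each node.

```python
from collections import deque

def d_separated(parents, A, B, C):
    """Return True iff the DAG implies x_A independent of x_B given x_C."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    # Nodes that are in C or have a descendant in C (these open up colliders).
    opens_collider = set(C)
    frontier = list(C)
    while frontier:
        v = frontier.pop()
        for p in parents[v]:
            if p not in opens_collider:
                opens_collider.add(p)
                frontier.append(p)

    # BFS over (node, direction) states; 'up' = arrived from a child,
    # 'down' = arrived from a parent.
    queue = deque((a, "up") for a in A)
    visited = set()
    while queue:
        v, direction = queue.popleft()
        if (v, direction) in visited:
            continue
        visited.add((v, direction))
        if v in B and v not in C:
            return False  # found an active trail from A to B
        if direction == "up" and v not in C:
            # chain / tree node: passes when unshaded
            queue.extend((p, "up") for p in parents[v])
            queue.extend((c, "down") for c in children[v])
        elif direction == "down":
            if v not in C:
                # chain node: passes when unshaded
                queue.extend((c, "down") for c in children[v])
            if v in opens_collider:
                # vee node: passes when it (or a descendant) is shaded
                queue.extend((p, "up") for p in parents[v])
    return True

# Example: in the vee x -> z <- y, x and y are independent...
vee = {"x": set(), "y": set(), "z": {"x", "y"}}
print(d_separated(vee, {"x"}, {"y"}, set()))   # True
# ...but become dependent once z is observed.
print(d_separated(vee, {"x"}, {"y"}, {"z"}))   # False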

Undirected graphical model

Vertices for vars.

Cond. indep: $x_A \perp x_B \mid x_C$ if, when you cut out $C$, you can’t reach $B$ from $A$.

This is great and intuitive.
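It's also easy to code directly. A minimal sketch (my function name; `adj` is an adjacency dict):

```python
def separated(adj, A, B, C):
    """Undirected separation check: delete the nodes in C, then see whether
    any node of B is still reachable from A. `adj` maps node -> set of neighbors."""
    stack, seen = [a for a in A if a not in C], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in B:
            return False  # A and B still touch after cutting C
        stack.extend(u for u in adj[v] if u not in C and u not in seen)
    return True

# Path graph 1 - 2 - 3: cutting {2} separates {1} from {3}.
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(separated(path, {1}, {3}, {2}))    # True
print(separated(path, {1}, {3}, set()))  # False
```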

Theorem (Hammersley-Clifford) (1) If $p$ is a pr distr where every state has nonzero pr, and $p$ is described by undirected graphical model $G$, then $p$ can be expressed as

$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),$$

where $\mathcal{C}$ is the set of (maximal if you want) cliques in $G$.

(2) If $p$ is a pr dist that factors into terms that only look at cliques in $G$, then $p$ satisfies the conditional independences specified by cuts ($x_A \perp x_B \mid x_C$ if cutting $C$ separates $A$ and $B$).

Proof (2) is a straightforward computation. (1) is quite tricky to prove — I’ll sketch pf below.

We just prove it for binary random vars. Probably could generalize. If the rvs are binary, say $x \in \{0,1\}^n$, we can identify $x$ with a subset of $[n]$. We'll conflate $x$ and $S$, where $S \subseteq [n]$ is the set of coordinates in $x$ with value $1$. Now, define

$$\phi(S) := \sum_{T \subseteq S} (-1)^{|S \setminus T|} \log p(T).$$

It turns out this function has some nice properties. For instance, by Möbius inversion (inclusion-exclusion),

$$\log p(S) = \sum_{T \subseteq S} \phi(T). \tag{*}$$

Now we’re going to show that $\phi(S) = 0$ if $S$ is not a clique. This will conclude the proof by (*).
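(Spelling out that last step, assuming the claim that $\phi$ vanishes off cliques: plugging it into (*),

$$p(S) = \exp\Big(\sum_{T \subseteq S} \phi(T)\Big) = \prod_{T \text{ a clique of } G} \exp\big(\phi(T)\,\mathbf{1}[T \subseteq S]\big),$$

and each factor only depends on the coordinates $x_T$, so this is exactly a factorization into clique terms.)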

To understand why, let’s just consider a graph on the vertices $\{1,2,3\}$ with connections $1 - 2$ and $2 - 3$. In this graph we have $x_1 \perp x_3 \mid x_2$ (and $\{1,2,3\}$ is not a clique). Say we’re trying to compute $\phi(\{1,2,3\})$.

By the magic of conditional independence we have:

$$p(x_1 = 1 \mid x_2 = 1, x_3 = 1) = p(x_1 = 1 \mid x_2 = 1, x_3 = 0).$$

Hence

$$\frac{p(\{1,2,3\})}{p(\{2,3\})} = \frac{p(\{1,2\})}{p(\{2\})},$$

AND taking logs,

$$\log p(\{1,2,3\}) - \log p(\{2,3\}) - \log p(\{1,2\}) + \log p(\{2\}) = 0.$$

So, the four terms of $\phi(\{1,2,3\})$ that contain the vertex $2$ cancel. You can do the same thing for the terms without the $2$ (condition on $x_2 = 0$ instead). Basically we’re just using the spatial Markov property. Anyways, I thought this was neat.

Factor Graph

  • bipartite
  • one side corresponds to vars, the other side is ways to combine the vars. If the LHS has the vars $x_1, \dots, x_n$, and the RHS has vertices $f_1, \dots, f_m$, then the pr dist must factor as

$$p(x_1, \dots, x_n) \propto \prod_{j=1}^{m} f_j(x_{N(j)}),$$

where $N(j)$ is the set of neighbors of $f_j$ (tiny example below).
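(A tiny made-up example: variables $x_1, x_2, x_3$ on the left; factor vertices $f_1$, joined to $x_1, x_2$, and $f_2$, joined to $x_2, x_3$, on the right. Then the bipartite structure forces

$$p(x_1, x_2, x_3) \propto f_1(x_1, x_2)\, f_2(x_2, x_3).)$$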

remark: sometimes the models can’t express all relevant properties of your pr dist

e.g., directed models seem to be super lossy about preserving conditional independence relations!

Q: can you convert btwn models? A: sometimes.

Day 2 — Dec 17 Wrote up day 1.

Did some of the psets. Note to self — only do the interesting problems: my goal is to be exposed to the subject and understand the basic ideas. If something seems like a technicality that isn’t interesting or intuitively useful, ignore it.

some hw qs

Q1. Undirected graphical model for sampling an independent set of a graph? A: the graph itself is the graphical model.

Q2. Undirected graphical model for sampling a matching? A: say the orig graph is $G$. Make a new graph on the edges of $G$, where two edges are adjacent iff they share an endpoint in $G$.
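A quick sketch of that construction in code (my function name; edges given as 2-tuples):

```python
from itertools import combinations

def line_graph(edges):
    """Build the graph whose vertices are the edges of G, with two of them
    adjacent iff they share an endpoint. A matching in G is exactly an
    independent set in this graph, so Q2 reduces to Q1."""
    adj = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):  # share an endpoint
            adj[e].add(f)
            adj[f].add(e)
    return adj

# Path 1-2-3-4: the middle edge conflicts with both outer edges.
print(line_graph([(1, 2), (2, 3), (3, 4)]))
```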

Day 3 — Dec 18 Talked more about converting between directed and undirected graphical models. Talked about conditions for perfect conversion. Talked about “minimal I-maps”, i.e., converting to a different graphical model which doesn’t imply any false structure, but would if you removed any edge.

main takeaways:

  • digraph to undigraph conversion is perfect if moralization (connecting parents; sketched in code after this list) doesn’t add edges.
  • an undigraph has a perfect digraph model iff it is “chordal”, i.e., all cycles of len > 3 have a chord.
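The moralization step from the first takeaway, as a quick sketch (my function name; DAG given as a parents dict):

```python
from itertools import combinations

def moralize(parents):
    """Undirected version of a DAG: keep every edge (now undirected) and
    'marry' the parents of each node by connecting them pairwise."""
    adj = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    return adj

# The vee x -> z <- y: moralization adds the edge x - y.
print(moralize({"x": set(), "y": set(), "z": {"x", "y"}}))
```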

Lec 6 — Gaussian Graphical Models Gaussian:

  • $x = Au + \mu$ for $u$ a vector of iid standard Gaussians
  • $a^T x$ is Gaussian for all vectors $a$
  • PDF $\propto \exp\big(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\big)$ = covariance rep
  • could also write PDF $\propto \exp\big(-\tfrac{1}{2} x^T J x + h^T x\big)$ = information rep, with $J = \Sigma^{-1}$ and $h = J\mu$

Interesting fact: $\mathbb{E}[x \mid y]$ is linear in $y$ for jointly Gaussian $x, y$.

Talked about how to marginalize and condition wrt covariance rep and information rep.
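For my own reference, the standard formulas, blocking $x = (x_1, x_2)$ and writing each rep as a (mean-like vector, matrix) pair — covariance rep $(\mu, \Sigma)$, information rep $(h, J)$ with $J = \Sigma^{-1}$, $h = J\mu$:

$$\text{marginal of } x_1:\quad (\mu_1,\ \Sigma_{11}) \ \text{ in covariance rep}; \qquad \big(h_1 - J_{12}J_{22}^{-1}h_2,\ \ J_{11} - J_{12}J_{22}^{-1}J_{21}\big) \ \text{ in information rep}.$$

$$\text{conditional } x_1 \mid x_2:\quad \big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big); \qquad \big(h_1 - J_{12}x_2,\ \ J_{11}\big).$$

So marginalization is the cheap operation in the covariance rep, and conditioning is the cheap one in the information rep.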

$x_{i+1} = A x_i + B v_i$ (e.g., a linear state-space model).

$p_{x_1}(x_1) = \sum_{x_2, \dots, x_n} p(x)$ (marginalizing out everything but $x_1$).
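To convince myself that the two reps agree on conditioning, a quick numerical sanity check (numpy; block sizes and numbers are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 3x3 covariance (positive definite) and mean; x1 = first 2 coords, x2 = last.
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
mu = rng.normal(size=3)
J = np.linalg.inv(Sigma)          # information matrix
h = J @ mu                        # potential vector

i, j = slice(0, 2), slice(2, 3)   # index blocks for x1, x2
x2 = np.array([0.7])              # value we condition on

# Conditioning in the covariance rep.
mu_cond = mu[i] + Sigma[i, j] @ np.linalg.solve(Sigma[j, j], x2 - mu[j])
Sigma_cond = Sigma[i, i] - Sigma[i, j] @ np.linalg.solve(Sigma[j, j], Sigma[j, i])

# Conditioning in the information rep: J' = J11, h' = h1 - J12 x2.
J_cond = J[i, i]
h_cond = h[i] - J[i, j] @ x2

# The two reps should describe the same conditional Gaussian.
print(np.allclose(np.linalg.inv(J_cond), Sigma_cond))         # True
print(np.allclose(np.linalg.solve(J_cond, h_cond), mu_cond))  # True
```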