I’m mostly not focusing on school right now — spending time on research / projects seems more useful, more interesting, and also better for learning things. However, occasionally some of the school stuff actually seems pretty cool! So I’ll write down some of that stuff here to remember it.

AIRR

I actually enjoy this class quite a bit. So far this class is mostly about inference. Usually you have a set of local constraints (potentials) and define a joint distribution

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_a \phi_a(x_{\partial a}),$$

where each potential $\phi_a$ looks only at its own small set of variables $x_{\partial a}$ and $Z$ normalizes.

Important fact: scaling a potential doesn’t impact the probability distribution that you get out of this (the change is absorbed by $Z$).

Another observation: the conditional distribution given an observation $x_j = a$ is obtained by adding an indicator factor $\mathbb{1}[x_j = a]$.

BP is cool. Think of BP on a tree as “collapsing factor sub-trees”.

Another important fact: scaling messages in BP doesn’t change the marginals or MAP assignment.

Belief Propagation

Message from Variable to Factor: For a variable $x$, the message it sends to a connected factor $f$ is the product of all incoming messages from other factors connected to $x$ (excluding $f$):

$$\mu_{x \to f}(x) = \prod_{g \in N(x) \setminus \{f\}} \mu_{g \to x}(x)$$

Where:

  • $N(x)$ denotes the set of factors connected to $x$.
  • $\mu_{g \to x}(x)$ is the message from factor $g$ to variable $x$.

Message from Factor to Variable: For a factor $f$, the message it sends to a connected variable $x$ is the product of the factor’s value (which depends on all its connected variables) and the messages from all other connected variables (excluding $x$), marginalized over all possible values of those variables:

$$\mu_{f \to x}(x) = \sum_{x_{N(f) \setminus \{x\}}} f\big(x_{N(f)}\big) \prod_{y \in N(f) \setminus \{x\}} \mu_{y \to f}(y)$$

Where:

  • $f\big(x_{N(f)}\big)$ is the factor function, which depends on the variables connected to the factor $f$.
  • $N(f)$ denotes the set of variables connected to factor $f$.

Max-product (same messages, but with the sum replaced by a max) lets you find MAP assignments.
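Here’s a minimal numpy sketch of sum-product on a 3-variable chain (the potentials and names are made up for illustration); it checks the BP marginal of the middle variable against brute-force enumeration:

```python
import numpy as np

rng = np.random.default_rng(0)
phi1, phi2, phi3 = rng.random(2), rng.random(2), rng.random(2)  # unary potentials
f12, f23 = rng.random((2, 2)), rng.random((2, 2))               # pairwise potentials

# brute force: p(x1, x2, x3) ∝ phi1 phi2 phi3 f12 f23
joint = (phi1[:, None, None] * phi2[None, :, None] * phi3[None, None, :]
         * f12[:, :, None] * f23[None, :, :])
joint /= joint.sum()
brute = joint.sum(axis=(0, 2))          # marginal of x2

# BP: "collapse" the two sub-trees hanging off x2 into messages.
m_x1_to_f12 = phi1                      # leaf variable -> factor
m_x3_to_f23 = phi3
m_f12_to_x2 = f12.T @ m_x1_to_f12       # factor -> variable: sum out x1
m_f23_to_x2 = f23 @ m_x3_to_f23         # factor -> variable: sum out x3
bp = phi2 * m_f12_to_x2 * m_f23_to_x2
bp /= bp.sum()

assert np.allclose(brute, bp)
print(bp)
```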


Sampling in a Bayes net:

  • just do it — toposort, then fill in each variable given its parents; see the sketch below
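A quick sketch of ancestral sampling on a made-up two-node net (Cloudy → Rain; the CPT numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_once():
    cloudy = rng.random() < 0.5        # P(C = 1) = 0.5
    p_rain = 0.8 if cloudy else 0.1    # P(R = 1 | C)
    rain = rng.random() < p_rain       # children sampled after parents
    return cloudy, rain

samples = [sample_once() for _ in range(100_000)]
print("P(R=1) ≈", np.mean([r for _, r in samples]))   # exact: 0.5*0.8 + 0.5*0.1 = 0.45
```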

Conditional sampling in a Bayes net:

  • rejection sampling — works, but slow
  • importance sampling: sample from a proposal $q$ and reweight by $p/q$.
  • If $x^{(i)} \sim q$: $\mathbb{E}_p[f(x)] \approx \frac{\sum_i w_i f(x^{(i)})}{\sum_i w_i}$ with $w_i = \frac{p(x^{(i)})}{q(x^{(i)})}$.
  • They recommend: use the $q$ which is “do ancestral sampling but with the observations fixed” (ie likelihood weighting; sketched below).
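A sketch of that recommendation on the same made-up Cloudy → Rain net as above, estimating $P(C=1 \mid R=1)$; the exact answer is $0.4 / 0.45 \approx 0.889$:

```python
import numpy as np

rng = np.random.default_rng(0)
num = den = 0.0
for _ in range(100_000):
    cloudy = rng.random() < 0.5    # sample unobserved vars ancestrally, as usual
    w = 0.8 if cloudy else 0.1     # Rain = 1 is clamped; weight = P(R=1 | C)
    num += w * cloudy
    den += w
print("P(C=1 | R=1) ≈", num / den)   # exact: 0.4 / 0.45 ≈ 0.889
```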

todo: question for me: when do we expect this to converge faster than just sampling from the prior and rejecting? Maybe could run an experiment about this with some Bayes nets. this seems pretty weird / magical.

Gibbs Sampling

  • Repeatedly:
    • Choose a random variable.
    • Resample its value conditioned on the other variables’ settings.
    • Note that in a factor graph it suffices to look at the variable’s neighboring factors to do this computation.
    • So it should be relatively cheap: O(number of factors the var is involved in). (Sketch below.)
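A sketch on a made-up 4-node binary chain with agreement potentials $\exp(J \cdot \mathbb{1}[x_i = x_{i+1}])$; note the conditional only touches the two neighboring factors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 4, 0.7   # chain length and (made-up) coupling strength

def cond_p1(x, i):
    """P(x_i = 1 | rest): only the factors touching x_i matter."""
    logits = np.zeros(2)
    for v in (0, 1):
        for j in (i - 1, i + 1):          # neighbors in the chain
            if 0 <= j < n:
                logits[v] += J * (x[j] == v)
    p = np.exp(logits - logits.max())
    return p[1] / p.sum()

x = rng.integers(0, 2, n)
agree, iters = 0, 200_000
for _ in range(iters):
    i = rng.integers(n)                   # choose a random variable
    x[i] = rng.random() < cond_p1(x, i)   # resample it given the others
    agree += x[0] == x[n - 1]
print("Gibbs P(x_0 == x_3) ≈", agree / iters)

# brute-force check over all 2^4 states
states = [tuple((s >> k) & 1 for k in range(n)) for s in range(2**n)]
w = np.array([np.exp(J * sum(s[k] == s[k+1] for k in range(n-1))) for s in states])
print("exact P(x_0 == x_3) =", w[[s[0] == s[-1] for s in states]].sum() / w.sum())
```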

Q: Can we say how fast it converges?

  • Vibes are that it’s about $n \log n$ steps for most reasonable $n$-bit systems.

Remark: the Metropolis-Hastings alg is something with a proposal distribution $q(x' \mid x)$ which proposes a new sample given the current one; the proposal is accepted with probability $\min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$.
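A minimal sketch with a made-up target (standard Gaussian) and a symmetric random-walk proposal, so the acceptance ratio reduces to $p(x')/p(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x * x        # unnormalized log target: N(0, 1)

x, chain = 0.0, []
for _ in range(100_000):
    x_prop = x + 0.5 * rng.standard_normal()          # propose
    if np.log(rng.random()) < log_p(x_prop) - log_p(x):
        x = x_prop                                    # accept
    chain.append(x)                                   # (else keep old x)
print("mean ≈", np.mean(chain), " var ≈", np.var(chain))  # should be ≈ 0 and ≈ 1
```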

feb 25 Here is a formula, the multivariate Gaussian density:

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

Here is an example: let $X_1, \dots, X_d$ be independent unit Gaussians; then $\mu = 0$ and $\Sigma = I$.

More generally, $X \sim \mathcal{N}(\mu, \Sigma)$ implies $AX + b \sim \mathcal{N}(A\mu + b, A \Sigma A^\top)$.

Conditioning Gaussians full

  • note to self — if there is a midterm and I get to bring notes, I should print this out
  • another note: always submit regrade requests when you’re actually right

It turns out that if you have $x, y$ jointly Gaussian then $x \mid y$ is also Gaussian, and there are some simple formulas for the mean and covar of $x \mid y$. These formulas are:

$$\mu_{x \mid y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y), \qquad \Sigma_{x \mid y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$$

The fact that conditioning, marginalizing, and adding jointly Gaussian rvs gives Gaussian rvs is very nice.
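A numeric sanity check of the conditioning formulas on a made-up 3-d Gaussian, comparing the formulas against conditioning-by-rejection Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))
Sigma = A @ A.T + np.eye(3)              # random PSD covariance (made up)
mu = np.array([1.0, -1.0, 0.5])

# partition x = X[0], y = (X[1], X[2]); condition on y = y_obs
Sxx, Sxy = Sigma[:1, :1], Sigma[:1, 1:]
Syx, Syy = Sigma[1:, :1], Sigma[1:, 1:]
y_obs = np.array([0.0, 0.0])

K = Sxy @ np.linalg.inv(Syy)
mu_cond = mu[:1] + K @ (y_obs - mu[1:])
S_cond = Sxx - K @ Syx
print("formula    :", mu_cond[0], S_cond[0, 0])

# rough Monte Carlo: sample jointly, keep samples with y near y_obs
Z = rng.multivariate_normal(mu, Sigma, size=2_000_000)
keep = np.all(np.abs(Z[:, 1:] - y_obs) < 0.1, axis=1)
print("Monte Carlo:", Z[keep, 0].mean(), Z[keep, 0].var())
```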

Markov Blanket of a vertex $v$ is the set of nodes whose values you must fix in order to make the value of $v$ independent of all the other values. Surprisingly, the answer is: (1) parents, (2) children, AND (3) parents of children. This is not so surprising if you’ve ever seen the fact that the V-structure Bayes net $a \to c \leftarrow b$ has the property that $a \perp b$ but $a \not\perp b \mid c$.


Gibbs Sampling

Question: why does it work?

Answer:

You can show that $pT = p$,

ie that $p$ is a fixed point (stationary distribution) of the Markov chain $T$ defined by the Gibbs updates. Thus, if the Markov chain is guaranteed to converge to a unique distribution, then it must be this one.
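To see this concretely: write $T_i$ for the kernel that resamples coordinate $i$. For any new value $x_i'$ and fixed rest $x_{-i}$,

$$\sum_{x_i} p(x_i, x_{-i}) \, p(x_i' \mid x_{-i}) = p(x_{-i}) \, p(x_i' \mid x_{-i}) = p(x_i', x_{-i}),$$

so one Gibbs update maps $p$ to itself.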

Gibbs sampling is only going to work if the chain is ergodic anyway, so this is fine.


$\alpha$–$\beta$ algorithm / forwards backwards / sum-product on an HMM:

Getting $p(x_{t-L} \mid y_{1:t})$ — called “smoothing” — ie updating estimates of the past based on new observations, is more tricky than just computing the pr dist over the next state. In particular it requires storing a history of forwards messages.

Here’s how you do it, if you care about smoothing at lag $L$:

  • Store the last few forwards messages $\alpha_{t-L}, \dots, \alpha_t$.
  • When a new observation arrives:
    • Compute the new forward message (just one step of the forward algorithm)
    • Run the backward algorithm from the current time t back to whatever past time points we want to smooth
    • Combine the stored forward messages with the new backward messages
  • Cost: $O(L)$ extra work per observation, where $L$ is the “lag” (sketch below)
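A sketch of this fixed-lag scheme on a made-up 2-state HMM (transition and emission matrices invented):

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.9, 0.1], [0.2, 0.8]])   # T[i, j] = P(next = j | cur = i)
E = np.array([[0.8, 0.2], [0.3, 0.7]])   # E[i, o] = P(obs = o | state = i)
pi = np.array([0.5, 0.5])
L = 3

# simulate a trajectory of observations
s, obs = rng.choice(2, p=pi), []
for _ in range(20):
    obs.append(rng.choice(2, p=E[s]))
    s = rng.choice(2, p=T[s])

alphas = []                                  # last few normalized forward messages
alpha = pi * E[:, obs[0]]
alphas.append(alpha / alpha.sum())
for t in range(1, len(obs)):
    alpha = (T.T @ alphas[-1]) * E[:, obs[t]]    # one forward step
    alphas.append(alpha / alpha.sum())
    alphas = alphas[-(L + 1):]                   # keep alpha_{t-L} .. alpha_t
    if t >= L:
        beta = np.ones(2)                        # backward pass, from t down to t-L
        for u in range(t, t - L, -1):
            beta = T @ (E[:, obs[u]] * beta)
        smoothed = alphas[0] * beta              # combine at time t - L
        print(f"p(x_{t-L} | y_0..y_{t}) ≈ {smoothed / smoothed.sum()}")
```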

todo: can you prove bounds on when particle method works?

Particle methods for sampling

particle filter

I really like these “smart sampling” approaches. It feels like maybe they should say something interesting about neural networks. But idrk.

Inf + Info

LRTs are good. Sometimes need randomness to get a full ROC curve.

They gave some nice motivation for what “decision rules” are:

  • maybe you have priors and costs → can choose the rule to minimize E(cost).
  • maybe you have priors but no costs — maybe you just have some tolerable FPR → can see what ROC curve you can get.
  • maybe you have costs but no priors → take the worst-case prior (“minimax formulation”).

What if you don’t have a prior? You can graph $V(p)$, the optimal expected cost as a function of $p = \Pr[H_0]$. That’s what we could achieve if the prior were $p$ and we knew that.

but ofc we don’t know the prior. So with a fixed decision rule we instead achieve a line $\ell(p) \geq V(p)$, with equality at the prior the rule was designed for.

clearly in the adversarial setting we choose this line to have slope zero (then the worst-case prior can’t hurt us).

equivalently this says: look for the intersection of the OC-LRT (the ROC curve traced out by LRTs) and the line $P_D = 1 - P_F$.

more generally with some costs, the minimax optimal decision rule (ie decision rule with best performance on whatever prior is worst for this decision rule) could be the intersection of the OC-LRT and the “equal conditional risk” line, eg $C_{10} P_F = C_{01} (1 - P_D)$ when correct decisions are free.

don’t even ask about the even more general cost case, like come on, that’s just gross.

i think you can tell whether you need a mixed strat based on whether the intersection of the curves lands on a part of the OC that required a mixed strat

Fact: in fact, the minimax optimal decision rule satisfies

$$\min_{\delta} \max_{p} \, \mathbb{E}[\text{cost}] = \max_{p} \min_{\delta} \, \mathbb{E}[\text{cost}],$$

which you can see by thinking of it as a two-player game. This turns out to be really great because the RHS is much more tractable.

Credit to Kevin for explaining all this to me much better than the lecture notes!

Moreover, the expected cost if the prior was $p^*$ and you knew it should be the same conditional on the hypothesis being $H_0$ or $H_1$, where $p^*$ is the optimum in the RHS above (the “equalizer” property).
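A numeric check of this equalizer property on a made-up example, $H_0: y \sim \mathcal{N}(0,1)$ vs $H_1: y \sim \mathcal{N}(1,1)$ with 0–1 costs, where by symmetry the equalizing threshold should be $\tau = 1/2$:

```python
import numpy as np
from scipy.stats import norm

taus = np.linspace(-2.0, 3.0, 10_001)   # thresholds: decide H1 when y > tau
P_F = 1 - norm.cdf(taus)                # false-alarm prob under H0
P_M = norm.cdf(taus - 1)                # miss prob under H1
tau_star = taus[np.argmin(np.abs(P_F - P_M))]
print("equalizer threshold ≈", tau_star)   # ≈ 0.5; both conditional errors equal here
```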

something to be careful of:

If the curves don’t intersect then the answer is either $p^* = 0$ or $p^* = 1$. hopefully you can just think about this to figure out which one is correct.

remark: the minimax inequality $\max_p \min_\delta \leq \min_\delta \max_p$ is true in general and is pretty good.

I think I get the basic point although might read their examples at some point.

Fact: every point on the efficient frontier is achieved by a (possibly randomized) LRT.

cost formulations:

If the cost is absolute error then you should set $\hat{x}(y)$ to the posterior median. If the cost penalizes all errors the same above a small threshold, you should basically use the MAP rule.

MSE loss: choose $\hat{x}(y) = \mathbb{E}[x \mid y]$, ie the Bayes least squares (BLS) estimator.

fun fact about the MSE estimator (the orthogonality principle):

the error $x - \hat{x}_{BLS}(y)$ is uncorrelated with every function $g(y)$ of the data
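A quick Monte Carlo check of this orthogonality on a made-up jointly Gaussian pair, where $\mathbb{E}[x \mid y] = y/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x + rng.standard_normal(1_000_000)   # y = x + independent noise
err = x - y / 2                          # E[x | y] = y/2 here, so this is the BLS error
for g in (y, y**2, np.sin(y)):           # arbitrary functions of the data
    print(np.corrcoef(err, g)[0, 1])     # all ≈ 0
```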

Recall: curly geq $A \succeq B$ means $A - B$ is PSD, ie $v^\top (A - B) v \geq 0$ for all $v$.

Fact: the error covariance matrix for the BLS estimator is better (in the $\preceq$ sense) than that of any other estimator,

and something like $\mathbb{E}_z\big[\Lambda_{x \mid y, z}\big] \preceq \Lambda_{x \mid y}$,

so on average more info is helpful for estimation — although some info can hurt your estimate by being misleading.


Exponential families

  • An exponential family has densities of the form $p(x; \theta) = h(x) \exp\big(\eta(\theta)^\top t(x) - A(\theta)\big)$, with:

  • $\eta$: natural parameter

  • $t(x)$: natural statistic

  • $\ln h(x)$: log base function

  • $A(\theta)$ — log-partition function, $A(\theta) = \ln \int h(x) \exp\big(\eta(\theta)^\top t(x)\big)\, dx$

  • note $(\eta, t)$ not unique (eg can rescale $\eta \to c\eta$, $t \to t/c$)

  • not everything is expressible in this way — supposed to be bad if the domain depends on the param

”canonical” exp family is one with $\eta(\theta) = \theta$. natural param space — the set of $\eta$ such that the dist is normalizable (ie $A(\eta) < \infty$)

natural exp family: one with $t(x) = x$

Proposition: Log-partition fn generates cumulants:

  • $\dot{A}(\eta) = \mathbb{E}[t(x)]$ and $\ddot{A}(\eta) = \operatorname{Var}[t(x)] \geq 0$.
  • It follows that $\dot{A}$ is a monotonic and thus invertible fn of $\eta$. (Numeric check below.)
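A numeric check on the Bernoulli in canonical form, where $A(\eta) = \log(1 + e^\eta)$, comparing finite differences of $A$ against the known mean and variance:

```python
import numpy as np

A = lambda eta: np.log1p(np.exp(eta))   # Bernoulli log-partition function
eta, h = 0.3, 1e-4

mean = (A(eta + h) - A(eta - h)) / (2 * h)             # finite-difference A'
var = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2    # finite-difference A''

p = 1 / (1 + np.exp(-eta))              # mean parameter = sigmoid(eta)
print(mean, "vs E[t(x)] =", p)
print(var, "vs Var[t(x)] =", p * (1 - p))
```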

Score function of a canonical expo fam: $s(x; \eta) = \frac{\partial}{\partial \eta} \ln p(x; \eta) = t(x) - \dot{A}(\eta)$.

Fisher info of canonical expo fam: $J(\eta) = \ddot{A}(\eta) = \operatorname{Var}[t(x)]$.

In the general (non-canonical) case we have the chain-rule versions of the above:

  • score $= \dot{\eta}(\theta)\, t(x) - \dot{A}(\theta)$, so $\mathbb{E}[t(x)] = \dot{A}(\theta) / \dot{\eta}(\theta)$, and so on

Minimum-variance unbiased estimators (MVU)

  • a valid (doesn’t depend on the param; just on the visible data) unbiased estimator with uniformly lower variance than all other unbiased estimators
    • may well not exist

Cramer-Rao bound: Suppose $p(y; x)$ is positive and differentiable in $x$ and satisfies the regularity condition

$$\mathbb{E}\left[\frac{\partial}{\partial x} \ln p(y; x)\right] = 0$$

(which it seems like any moderately reasonable distribution will satisfy). Then for any unbiased $\hat{x}(y)$, $\operatorname{Var}(\hat{x}(y)) \geq 1/J_y(x)$, where the Fisher info is $J_y(x) = \mathbb{E}\left[\left(\frac{\partial}{\partial x} \ln p(y; x)\right)^2\right]$ and also $J_y(x) = -\mathbb{E}\left[\frac{\partial^2}{\partial x^2} \ln p(y; x)\right]$.

remark — large Fisher info means we expect to be able to better resolve the value of $x$ from observations.

If you are tight with the Cramer-Rao bound then you get $\hat{x}(y) = x + \frac{1}{J_y(x)} \frac{\partial}{\partial x} \ln p(y; x)$, where the dependence on $x$ should be fake (it must cancel out) for this to be a valid estimator.

Remark:

  • if the bound is tight then the efficient estimator is unique!
  • clear by the fact that we can just write down the above expression for what it is.

in the below, ML means maximum likelihood, not machine learning

Sometimes the ML estimate happens to be “efficient” — ie make the Cramer-Rao bound tight — ie be the MVU estimator. Note that you can generally find the ML estimator by finding the $x$ which makes $\frac{\partial}{\partial x} \ln p(y; x) = 0$.

note: “the ML estimate commutes with invertible maps”: if $\hat{x}$ is the ML estimate of $x$, then $g(\hat{x})$ is the ML estimate of $g(x)$.
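A quick simulation of an efficient ML estimator: the sample mean of $n$ iid $\mathcal{N}(x, \sigma^2)$ observations has Fisher info $J = n/\sigma^2$, and its variance should sit right at the Cramer-Rao bound $\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_true, sigma, n = 2.0, 1.5, 10
# ML estimate of a Gaussian mean is the sample mean; simulate many datasets
estimates = rng.normal(x_true, sigma, size=(100_000, n)).mean(axis=1)
print("Var(ML) =", estimates.var())
print("CRB     =", sigma**2 / n)   # the two should match
```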


Sufficient statistics

  • We say $t(y)$ is a sufficient statistic if the conditional dist of $y$ given $t(y)$ is the same for any param value $x$
  • equivalently, the condition is that $x \to t(y) \to y$ is a Markov chain, where $x$ is the param

Neyman Factorization Theorem: $t(y)$ is a suffic statistic iff there exist functions $g, h$ such that $p(y; x) = g(t(y), x)\, h(y)$

  • Example: $y_1, \dots, y_n$ iid dist as $\mathcal{N}(x, 1)$, then $\sum_i y_i$ is a suffic statistic.
  • Example: $y$ itself is trivially a suffic statistic for any model.
  • Example: if $x$ is in a finite set then listing the likelihood $p(y; x)$ for all possible $x$ is a suffic stat.

minimal suffic statistic:

  • a suffic stat $t$ is minimal if, for any suffic stat $t'$, there exists a function $f$ such that $t = f(t')$.

ex: the likelihood ratio is a minimal suffic stat (in the binary hypothesis setting)

Bayesian setting

  • $t(y)$ is sufficient iff the posterior depends on the data only through $t$: $p(x \mid y) = p(x \mid t(y))$.

minimal suffic statistic:

If the param set is finite, it’s clear that a minimal suffic stat exists. Can find it by normalizing the likelihood functions $\big(p(y; x)\big)_x$ of each observation $y$, and then sorting the likelihood functions into buckets based on which ones are the same (sketch below).
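A sketch of that bucketing construction on a made-up finite model (3 params, 4 outcomes); the bucket index is then the minimal sufficient statistic:

```python
import numpy as np

# rows: outcomes y in {0,1,2,3}; cols: params x in {0,1,2}.
# Each column sums to 1 (a valid distribution for each x); rows 0 and 1
# are proportional, so they should land in the same bucket.
likelihood = np.array([
    [0.1, 0.1, 0.2],
    [0.2, 0.2, 0.4],
    [0.3, 0.3, 0.1],
    [0.4, 0.4, 0.3],
])

buckets, stat = {}, []
for row in likelihood:
    key = tuple(np.round(row / row.sum(), 12))   # normalized likelihood fn
    stat.append(buckets.setdefault(key, len(buckets)))
print(stat)   # [0, 0, 1, 2] — y=0 and y=1 collapse into one bucket
```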

ppl don’t know how to efficiently check whether a suffic stat is minimal. But,

a suffic stat $t$ is complete if any function $g$ with $\mathbb{E}[g(t); x] = 0$ for all $x$ has $g = 0$. (Completeness is not a necessary condition for minimality.)

Theorem: Complete suffic stats are minimal.

  • you can use this to show that the natural statistic $t(y)$ for an exp fam is a minimal suffic stat

remark:

  • completeness is saying that the dists $\{p_t(\cdot\,; x)\}_x$ are “all lin indep” in a strong sense: no nonzero function is orthogonal to all of them

  • a suffic stat is minimal iff for each distinct pair $y, y'$ with $t(y) \neq t(y')$ the likelihood functions $p(y; \cdot)$ and $p(y'; \cdot)$ are lin indep (ie not proportional)

EM reprise

Networks

  • Measure the “importance” of a vertex as the “average importance of its neighbors”.
    • Can interpret these as steady-state probabilities if it’s a Markov chain.
  • Can also give people some base importance.
    • This is like a Markov chain but with some teleportation probability (sketch below).
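A power-iteration sketch of this (ie PageRank) on a made-up 4-node graph, with teleportation probability $1 - d$:

```python
import numpy as np

adj = np.array([            # adj[i, j] = 1 if i links to j (made-up graph)
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
P = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic random-walk matrix
d, n = 0.85, 4                             # damping; teleport with prob 1 - d

r = np.ones(n) / n
for _ in range(100):
    r = (1 - d) / n + d * (P.T @ r)        # importance = avg neighbor importance
print(r, r.sum())                          # steady-state importances, sums to 1
```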

todo — figure out what else is happening in networks