I’m mostly not focusing on school right now — spending time on research / projects seems more useful, more interesting, and also better for learning things. However, occasionally some of the school stuff actually seems pretty cool! So I’ll write down some of that stuff here to remember it.

AIRR

I actually enjoy this class quite a bit. So far this class is mostly about inference. Usually you have a set of local constraints (potentials) and define a joint distribution

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_a \phi_a(x_{\partial a}),$$

where each potential $\phi_a$ looks only at its own small set of variables $x_{\partial a}$ and $Z$ normalizes.

Important fact: scaling a potential doesn’t impact the probability distribution that you get out of this (the change is absorbed by $Z$).

Another observation: the conditional distribution given an observation $x_j = a$ is obtained by adding an indicator factor $\mathbb{1}[x_j = a]$.

BP is cool. Think of BP on a tree as “collapsing factor sub-trees”.

Another important fact: scaling messages in BP doesn’t change the marginals or MAP assignment.

Belief Propagation

Message from Variable to Factor: For a variable $x$, the message it sends to a connected factor $f$ is the product of all incoming messages from other factors connected to $x$ (excluding $f$):

$$\mu_{x \to f}(x) = \prod_{g \in N(x) \setminus \{f\}} \mu_{g \to x}(x)$$

Where:

  • $N(x)$ denotes the set of factors connected to $x$.
  • $\mu_{g \to x}(x)$ is the message from factor $g$ to variable $x$.

Message from Factor to Variable: For a factor $f$, the message it sends to a connected variable $x$ is the product of the factor’s value (which depends on all its connected variables) and the messages from all other connected variables (excluding $x$), marginalized over all possible values of those variables:

$$\mu_{f \to x}(x) = \sum_{x_{N(f) \setminus \{x\}}} f\big(x_{N(f)}\big) \prod_{y \in N(f) \setminus \{x\}} \mu_{y \to f}(y)$$

Where:

  • $f\big(x_{N(f)}\big)$ is the factor function, which depends on the variables connected to the factor $f$.
  • $N(f)$ denotes the set of variables connected to factor $f$.

Max-product (same messages, but with the sum replaced by a max) lets you find MAP assignments.
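Here’s a minimal numpy sketch of sum-product on a 3-variable chain (the potentials and names are made up for illustration); it checks the BP marginal of the middle variable against brute-force enumeration:

```python
import numpy as np

rng = np.random.default_rng(0)
phi1, phi2, phi3 = rng.random(2), rng.random(2), rng.random(2)  # unary potentials
f12, f23 = rng.random((2, 2)), rng.random((2, 2))               # pairwise potentials

# brute force: p(x1, x2, x3) ∝ phi1 phi2 phi3 f12 f23
joint = (phi1[:, None, None] * phi2[None, :, None] * phi3[None, None, :]
         * f12[:, :, None] * f23[None, :, :])
joint /= joint.sum()
brute = joint.sum(axis=(0, 2))          # marginal of x2

# BP: "collapse" the two sub-trees hanging off x2 into messages.
m_x1_to_f12 = phi1                      # leaf variable -> factor
m_x3_to_f23 = phi3
m_f12_to_x2 = f12.T @ m_x1_to_f12       # factor -> variable: sum out x1
m_f23_to_x2 = f23 @ m_x3_to_f23         # factor -> variable: sum out x3
bp = phi2 * m_f12_to_x2 * m_f23_to_x2
bp /= bp.sum()

assert np.allclose(brute, bp)
print(bp)
```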


Sampling in a Bayes net:

  • just do it — toposort, then fill in each variable given its parents; see the sketch below
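A quick sketch of ancestral sampling on a made-up two-node net (Cloudy → Rain; the CPT numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_once():
    cloudy = rng.random() < 0.5        # P(C = 1) = 0.5
    p_rain = 0.8 if cloudy else 0.1    # P(R = 1 | C)
    rain = rng.random() < p_rain       # children sampled after parents
    return cloudy, rain

samples = [sample_once() for _ in range(100_000)]
print("P(R=1) ≈", np.mean([r for _, r in samples]))   # exact: 0.5*0.8 + 0.5*0.1 = 0.45
```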

Conditional sampling in a Bayes net:

  • rejection sampling — works, but slow
  • importance sampling: sample from a proposal $q$ and reweight by $p/q$.
  • If $x^{(i)} \sim q$: $\mathbb{E}_p[f(x)] \approx \frac{\sum_i w_i f(x^{(i)})}{\sum_i w_i}$ with $w_i = \frac{p(x^{(i)})}{q(x^{(i)})}$.
  • They recommend: use the $q$ which is “do ancestral sampling but with the observations fixed” (ie likelihood weighting; sketched below).
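A sketch of that recommendation on the same made-up Cloudy → Rain net as above, estimating $P(C=1 \mid R=1)$; the exact answer is $0.4 / 0.45 \approx 0.889$:

```python
import numpy as np

rng = np.random.default_rng(0)
num = den = 0.0
for _ in range(100_000):
    cloudy = rng.random() < 0.5    # sample unobserved vars ancestrally, as usual
    w = 0.8 if cloudy else 0.1     # Rain = 1 is clamped; weight = P(R=1 | C)
    num += w * cloudy
    den += w
print("P(C=1 | R=1) ≈", num / den)   # exact: 0.4 / 0.45 ≈ 0.889
```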

todo: question for me: when do we expect this to converge faster than just sampling from the prior and rejecting? Maybe could run an experiment about this with some Bayes nets. this seems pretty weird / magical.

Gibbs Sampling

  • Repeatedly:
    • Choose a random variable.
    • Resample its value conditioned on the other variables’ settings.
    • Note that in a factor graph it suffices to look at the variable’s neighboring factors to do this computation.
    • So it should be relatively cheap: O(number of factors the var is involved in). (Sketch below.)
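A sketch on a made-up 4-node binary chain with agreement potentials $\exp(J \cdot \mathbb{1}[x_i = x_{i+1}])$; note the conditional only touches the two neighboring factors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 4, 0.7   # chain length and (made-up) coupling strength

def cond_p1(x, i):
    """P(x_i = 1 | rest): only the factors touching x_i matter."""
    logits = np.zeros(2)
    for v in (0, 1):
        for j in (i - 1, i + 1):          # neighbors in the chain
            if 0 <= j < n:
                logits[v] += J * (x[j] == v)
    p = np.exp(logits - logits.max())
    return p[1] / p.sum()

x = rng.integers(0, 2, n)
agree, iters = 0, 200_000
for _ in range(iters):
    i = rng.integers(n)                   # choose a random variable
    x[i] = rng.random() < cond_p1(x, i)   # resample it given the others
    agree += x[0] == x[n - 1]
print("Gibbs P(x_0 == x_3) ≈", agree / iters)

# brute-force check over all 2^4 states
states = [tuple((s >> k) & 1 for k in range(n)) for s in range(2**n)]
w = np.array([np.exp(J * sum(s[k] == s[k+1] for k in range(n-1))) for s in states])
print("exact P(x_0 == x_3) =", w[[s[0] == s[-1] for s in states]].sum() / w.sum())
```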

Q: Can we say how fast it converges?

  • Vibes are that it’s about $n \log n$ steps for most reasonable $n$-bit systems.

Remark: the Metropolis-Hastings alg is something with a proposal distribution $q(x' \mid x)$ which proposes a new sample given the current one; the proposal is accepted with probability $\min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$.
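A minimal sketch with a made-up target (standard Gaussian) and a symmetric random-walk proposal, so the acceptance ratio reduces to $p(x')/p(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x * x        # unnormalized log target: N(0, 1)

x, chain = 0.0, []
for _ in range(100_000):
    x_prop = x + 0.5 * rng.standard_normal()          # propose
    if np.log(rng.random()) < log_p(x_prop) - log_p(x):
        x = x_prop                                    # accept
    chain.append(x)                                   # (else keep old x)
print("mean ≈", np.mean(chain), " var ≈", np.var(chain))  # should be ≈ 0 and ≈ 1
```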

feb 25 Here is a formula, the multivariate Gaussian density:

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

Here is an example: let $X_1, \dots, X_d$ be independent unit Gaussians; then $\mu = 0$ and $\Sigma = I$.

More generally, $X \sim \mathcal{N}(\mu, \Sigma)$ implies $AX + b \sim \mathcal{N}(A\mu + b, A \Sigma A^\top)$.

Conditioning Gaussians full

  • note to self — if there is a midterm and I get to bring notes, I should print this out
  • another note: always submit regrade requests when you’re actually right

It turns out that if you have $x, y$ jointly Gaussian then $x \mid y$ is also Gaussian, and there are some simple formulas for the mean and covar of $x \mid y$. These formulas are:

$$\mu_{x \mid y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y), \qquad \Sigma_{x \mid y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$$

The fact that conditioning, marginalizing, and adding jointly Gaussian rvs gives Gaussian rvs is very nice.
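A numeric sanity check of the conditioning formulas on a made-up 3-d Gaussian, comparing the formulas against conditioning-by-rejection Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))
Sigma = A @ A.T + np.eye(3)              # random PSD covariance (made up)
mu = np.array([1.0, -1.0, 0.5])

# partition x = X[0], y = (X[1], X[2]); condition on y = y_obs
Sxx, Sxy = Sigma[:1, :1], Sigma[:1, 1:]
Syx, Syy = Sigma[1:, :1], Sigma[1:, 1:]
y_obs = np.array([0.0, 0.0])

K = Sxy @ np.linalg.inv(Syy)
mu_cond = mu[:1] + K @ (y_obs - mu[1:])
S_cond = Sxx - K @ Syx
print("formula    :", mu_cond[0], S_cond[0, 0])

# rough Monte Carlo: sample jointly, keep samples with y near y_obs
Z = rng.multivariate_normal(mu, Sigma, size=2_000_000)
keep = np.all(np.abs(Z[:, 1:] - y_obs) < 0.1, axis=1)
print("Monte Carlo:", Z[keep, 0].mean(), Z[keep, 0].var())
```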

Markov Blanket of a vertex $v$ is the set of nodes whose values you must fix in order to make the value of $v$ independent of all the other values. Surprisingly, the answer is: (1) parents, (2) children, AND (3) parents of children. This is not so surprising if you’ve ever seen the fact that the V-structure Bayes net $a \to c \leftarrow b$ has the property that $a \perp b$ but $a \not\perp b \mid c$.


Gibbs Sampling

Question: why does it work?

Answer:

You can show that $pT = p$,

ie that $p$ is a fixed point (stationary distribution) of the Markov chain $T$ defined by the Gibbs updates. Thus, if the Markov chain is guaranteed to converge to a unique distribution, then it must be this one.
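To see this concretely: write $T_i$ for the kernel that resamples coordinate $i$. For any new value $x_i'$ and fixed rest $x_{-i}$,

$$\sum_{x_i} p(x_i, x_{-i}) \, p(x_i' \mid x_{-i}) = p(x_{-i}) \, p(x_i' \mid x_{-i}) = p(x_i', x_{-i}),$$

so one Gibbs update maps $p$ to itself.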

Gibbs sampling is only going to work if the chain is ergodic anyway, so this is fine.


$\alpha$–$\beta$ algorithm / forwards backwards / sum-product on an HMM:

Getting $p(x_{t-L} \mid y_{1:t})$ — called “smoothing” — ie updating estimates of the past based on new observations, is more tricky than just computing the pr dist over the next state. In particular it requires storing a history of forwards messages.

Here’s how you do it, if you care about smoothing at lag $L$:

  • Store the last few forwards messages $\alpha_{t-L}, \dots, \alpha_t$.
  • When a new observation arrives:
    • Compute the new forward message (just one step of the forward algorithm)
    • Run the backward algorithm from the current time t back to whatever past time points we want to smooth
    • Combine the stored forward messages with the new backward messages
  • Cost: $O(L)$ extra work per observation, where $L$ is the “lag” (sketch below)
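A sketch of this fixed-lag scheme on a made-up 2-state HMM (transition and emission matrices invented):

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.9, 0.1], [0.2, 0.8]])   # T[i, j] = P(next = j | cur = i)
E = np.array([[0.8, 0.2], [0.3, 0.7]])   # E[i, o] = P(obs = o | state = i)
pi = np.array([0.5, 0.5])
L = 3

# simulate a trajectory of observations
s, obs = rng.choice(2, p=pi), []
for _ in range(20):
    obs.append(rng.choice(2, p=E[s]))
    s = rng.choice(2, p=T[s])

alphas = []                                  # last few normalized forward messages
alpha = pi * E[:, obs[0]]
alphas.append(alpha / alpha.sum())
for t in range(1, len(obs)):
    alpha = (T.T @ alphas[-1]) * E[:, obs[t]]    # one forward step
    alphas.append(alpha / alpha.sum())
    alphas = alphas[-(L + 1):]                   # keep alpha_{t-L} .. alpha_t
    if t >= L:
        beta = np.ones(2)                        # backward pass, from t down to t-L
        for u in range(t, t - L, -1):
            beta = T @ (E[:, obs[u]] * beta)
        smoothed = alphas[0] * beta              # combine at time t - L
        print(f"p(x_{t-L} | y_0..y_{t}) ≈ {smoothed / smoothed.sum()}")
```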

todo: can you prove bounds on when particle method works?

Particle methods for sampling

particle filter

I really like these “smart sampling” approaches. It feels like maybe they should say something interesting about neural networks. But idrk.

Inf + Info

LRTs are good. Sometimes need randomness to get a full ROC curve.

They gave some nice motivation for what “decision rules” are:

  • maybe you have priors and costs → can choose the rule to minimize E(cost).
  • maybe you have priors but no costs — maybe you just have some tolerable FPR → can see what ROC curve you can get.
  • maybe you have costs but no priors → take the worst-case prior (“minimax formulation”).

What if you don’t have a prior? You can graph $V(p)$, the optimal expected cost as a function of $p = \Pr[H_0]$. That’s what we could achieve if the prior were $p$ and we knew that.

but ofc we don’t know the prior. So with a fixed decision rule we instead achieve a line $\ell(p) \geq V(p)$, with equality at the prior the rule was designed for.

clearly in the adversarial setting we choose this line to have slope zero (then the worst-case prior can’t hurt us).

equivalently this says: look for the intersection of the OC-LRT (the ROC curve traced out by LRTs) and the line $P_D = 1 - P_F$.

more generally with some costs, the minimax optimal decision rule (ie decision rule with best performance on whatever prior is worst for this decision rule) could be the intersection of the OC-LRT and the “equal conditional risk” line, eg $C_{10} P_F = C_{01} (1 - P_D)$ when correct decisions are free.

don’t even ask about the even more general cost case, like come on, that’s just gross.

i think you can tell whether you need a mixed strat based on whether the intersection of the curves lands on a part of the OC that required a mixed strat

Fact: in fact, the minimax optimal decision rule satisfies

$$\min_{\delta} \max_{p} \, \mathbb{E}[\text{cost}] = \max_{p} \min_{\delta} \, \mathbb{E}[\text{cost}],$$

which you can see by thinking of it as a two-player game. This turns out to be really great because the RHS is much more tractable.

Credit to Kevin for explaining all this to me much better than the lecture notes!

Moreover, the expected cost if the prior was $p^*$ and you knew it should be the same conditional on the hypothesis being $H_0$ or $H_1$, where $p^*$ is the optimum in the RHS above (the “equalizer” property).
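A numeric check of this equalizer property on a made-up example, $H_0: y \sim \mathcal{N}(0,1)$ vs $H_1: y \sim \mathcal{N}(1,1)$ with 0–1 costs, where by symmetry the equalizing threshold should be $\tau = 1/2$:

```python
import numpy as np
from scipy.stats import norm

taus = np.linspace(-2.0, 3.0, 10_001)   # thresholds: decide H1 when y > tau
P_F = 1 - norm.cdf(taus)                # false-alarm prob under H0
P_M = norm.cdf(taus - 1)                # miss prob under H1
tau_star = taus[np.argmin(np.abs(P_F - P_M))]
print("equalizer threshold ≈", tau_star)   # ≈ 0.5; both conditional errors equal here
```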

something to be careful of:

If the curves don’t intersect then the answer is either $p^* = 0$ or $p^* = 1$. hopefully you can just think about this to figure out which one is correct.

remark: the minimax inequality $\max_p \min_\delta \leq \min_\delta \max_p$ is true in general and is pretty good.

I think I get the basic point although might read their examples at some point.

Fact: every point on the efficient frontier is achieved by a (possibly randomized) LRT.

cost formulations:

If the cost is absolute error then you should set $\hat{x}(y)$ to the posterior median. If the cost penalizes all errors the same above a small threshold, you should basically use the MAP rule.

MSE loss: choose $\hat{x}(y) = \mathbb{E}[x \mid y]$, ie the Bayes least squares (BLS) estimator.

fun fact about the MSE estimator (the orthogonality principle):

the error $x - \hat{x}_{BLS}(y)$ is uncorrelated with every function $g(y)$ of the data
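A quick Monte Carlo check of this orthogonality on a made-up jointly Gaussian pair, where $\mathbb{E}[x \mid y] = y/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x + rng.standard_normal(1_000_000)   # y = x + independent noise
err = x - y / 2                          # E[x | y] = y/2 here, so this is the BLS error
for g in (y, y**2, np.sin(y)):           # arbitrary functions of the data
    print(np.corrcoef(err, g)[0, 1])     # all ≈ 0
```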

Recall: curly geq $A \succeq B$ means $A - B$ is PSD, ie $v^\top (A - B) v \geq 0$ for all $v$.

Fact: the error covariance matrix for the BLS estimator is better (in the $\preceq$ sense) than that of any other estimator,

and something like $\mathbb{E}_z\big[\Lambda_{x \mid y, z}\big] \preceq \Lambda_{x \mid y}$,

so on average more info is helpful for estimation — although some info can hurt your estimate by being misleading.


Exponential families

  • An exponential family has densities of the form $p(x; \theta) = h(x) \exp\big(\eta(\theta)^\top t(x) - A(\theta)\big)$, with:

  • $\eta$: natural parameter

  • $t(x)$: natural statistic

  • $\ln h(x)$: log base function

  • $A(\theta)$ — log-partition function, $A(\theta) = \ln \int h(x) \exp\big(\eta(\theta)^\top t(x)\big)\, dx$

  • note $(\eta, t)$ not unique (eg can rescale $\eta \to c\eta$, $t \to t/c$)

  • not everything is expressible in this way — supposed to be bad if the domain depends on the param

”canonical” exp family is one with $\eta(\theta) = \theta$. natural param space — the set of $\eta$ such that the dist is normalizable (ie $A(\eta) < \infty$)

natural exp family: one with $t(x) = x$

Proposition: Log-partition fn generates cumulants:

  • $\dot{A}(\eta) = \mathbb{E}[t(x)]$ and $\ddot{A}(\eta) = \operatorname{Var}[t(x)] \geq 0$.
  • It follows that $\dot{A}$ is a monotonic and thus invertible fn of $\eta$. (Numeric check below.)
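A numeric check on the Bernoulli in canonical form, where $A(\eta) = \log(1 + e^\eta)$, comparing finite differences of $A$ against the known mean and variance:

```python
import numpy as np

A = lambda eta: np.log1p(np.exp(eta))   # Bernoulli log-partition function
eta, h = 0.3, 1e-4

mean = (A(eta + h) - A(eta - h)) / (2 * h)             # finite-difference A'
var = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2    # finite-difference A''

p = 1 / (1 + np.exp(-eta))              # mean parameter = sigmoid(eta)
print(mean, "vs E[t(x)] =", p)
print(var, "vs Var[t(x)] =", p * (1 - p))
```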

Score function of a canonical expo fam: $s(x; \eta) = \frac{\partial}{\partial \eta} \ln p(x; \eta) = t(x) - \dot{A}(\eta)$.

Fisher info of canonical expo fam: $J(\eta) = \ddot{A}(\eta) = \operatorname{Var}[t(x)]$.

In the general (non-canonical) case we have the chain-rule versions of the above:

  • score $= \dot{\eta}(\theta)\, t(x) - \dot{A}(\theta)$, so $\mathbb{E}[t(x)] = \dot{A}(\theta) / \dot{\eta}(\theta)$, and so on

Minimum-variance unbiased estimators (MVU)

  • a valid (doesn’t depend on the param; just on the visible data) unbiased estimator with uniformly lower variance than all other unbiased estimators
    • may well not exist

Cramer-Rao bound: Suppose $p(y; x)$ is positive and differentiable in $x$ and satisfies the regularity condition

$$\mathbb{E}\left[\frac{\partial}{\partial x} \ln p(y; x)\right] = 0$$

(which it seems like any moderately reasonable distribution will satisfy). Then for any unbiased $\hat{x}(y)$, $\operatorname{Var}(\hat{x}(y)) \geq 1/J_y(x)$, where the Fisher info is $J_y(x) = \mathbb{E}\left[\left(\frac{\partial}{\partial x} \ln p(y; x)\right)^2\right]$ and also $J_y(x) = -\mathbb{E}\left[\frac{\partial^2}{\partial x^2} \ln p(y; x)\right]$.

remark — large Fisher info means we expect to be able to better resolve the value of $x$ from observations.

If you are tight with the Cramer-Rao bound then you get $\hat{x}(y) = x + \frac{1}{J_y(x)} \frac{\partial}{\partial x} \ln p(y; x)$, where the dependence on $x$ should be fake (it must cancel out) for this to be a valid estimator.

Remark:

  • if the bound is tight then the efficient estimator is unique!
  • clear by the fact that we can just write down the above expression for what it is.

in the below, ML means maximum likelihood, not machine learning

Sometimes the ML estimate happens to be “efficient” — ie make the Cramer-Rao bound tight — ie be the MVU estimator. Note that you can generally find the ML estimator by finding the $x$ which makes $\frac{\partial}{\partial x} \ln p(y; x) = 0$.

note: “the ML estimate commutes with invertible maps”: if $\hat{x}$ is the ML estimate of $x$, then $g(\hat{x})$ is the ML estimate of $g(x)$.
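A quick simulation of an efficient ML estimator: the sample mean of $n$ iid $\mathcal{N}(x, \sigma^2)$ observations has Fisher info $J = n/\sigma^2$, and its variance should sit right at the Cramer-Rao bound $\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_true, sigma, n = 2.0, 1.5, 10
# ML estimate of a Gaussian mean is the sample mean; simulate many datasets
estimates = rng.normal(x_true, sigma, size=(100_000, n)).mean(axis=1)
print("Var(ML) =", estimates.var())
print("CRB     =", sigma**2 / n)   # the two should match
```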


Sufficient statistics

  • We say $t(y)$ is a sufficient statistic if the conditional dist of $y$ given $t(y)$ is the same for any param value $x$
  • equivalently, the condition is that $x \to t(y) \to y$ is a Markov chain, where $x$ is the param

Neyman Factorization Theorem: $t(y)$ is a suffic statistic iff there exist functions $g, h$ such that $p(y; x) = g(t(y), x)\, h(y)$

  • Example: $y_1, \dots, y_n$ iid dist as $\mathcal{N}(x, 1)$, then $\sum_i y_i$ is a suffic statistic.
  • Example: $y$ itself is trivially a suffic statistic for any model.
  • Example: if $x$ is in a finite set then listing the likelihood $p(y; x)$ for all possible $x$ is a suffic stat.

minimal suffic statistic:

  • a suffic stat $t$ is minimal if, for any suffic stat $t'$, there exists a function $f$ such that $t = f(t')$.

ex: the likelihood ratio is a minimal suffic stat (in the binary hypothesis setting)

Bayesian setting

  • $t(y)$ is sufficient iff the posterior depends on the data only through $t$: $p(x \mid y) = p(x \mid t(y))$.

minimal suffic statistic:

If the param set is finite, it’s clear that a minimal suffic stat exists. Can find it by normalizing the likelihood functions $\big(p(y; x)\big)_x$ of each observation $y$, and then sorting the likelihood functions into buckets based on which ones are the same (sketch below).
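A sketch of that bucketing construction on a made-up finite model (3 params, 4 outcomes); the bucket index is then the minimal sufficient statistic:

```python
import numpy as np

# rows: outcomes y in {0,1,2,3}; cols: params x in {0,1,2}.
# Each column sums to 1 (a valid distribution for each x); rows 0 and 1
# are proportional, so they should land in the same bucket.
likelihood = np.array([
    [0.1, 0.1, 0.2],
    [0.2, 0.2, 0.4],
    [0.3, 0.3, 0.1],
    [0.4, 0.4, 0.3],
])

buckets, stat = {}, []
for row in likelihood:
    key = tuple(np.round(row / row.sum(), 12))   # normalized likelihood fn
    stat.append(buckets.setdefault(key, len(buckets)))
print(stat)   # [0, 0, 1, 2] — y=0 and y=1 collapse into one bucket
```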

ppl don’t know how to efficiently check whether a suffic stat is minimal. But,

a suffic stat $t$ is complete if any function $g$ with $\mathbb{E}[g(t); x] = 0$ for all $x$ has $g = 0$. (Completeness is not a necessary condition for minimality.)

Theorem: Complete suffic stats are minimal.

  • you can use this to show that the natural statistic $t(y)$ for an exp fam is a minimal suffic stat

remark:

  • completeness is saying that the dists $\{p_t(\cdot\,; x)\}_x$ are “all lin indep” in a strong sense: no nonzero function is orthogonal to all of them

  • a suffic stat is minimal iff for each distinct pair $y, y'$ with $t(y) \neq t(y')$ the likelihood functions $p(y; \cdot)$ and $p(y'; \cdot)$ are lin indep (ie not proportional)

EM reprise

Networks

  • Measure the “importance” of a vertex as the “average importance of its neighbors”.
    • Can interpret these as steady-state probabilities if it’s a Markov chain.
  • Can also give people some base importance.
    • This is like a Markov chain but with some teleportation probability (sketch below).
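A power-iteration sketch of this (ie PageRank) on a made-up 4-node graph, with teleportation probability $1 - d$:

```python
import numpy as np

adj = np.array([            # adj[i, j] = 1 if i links to j (made-up graph)
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
P = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic random-walk matrix
d, n = 0.85, 4                             # damping; teleport with prob 1 - d

r = np.ones(n) / n
for _ in range(100):
    r = (1 - d) / n + d * (P.T @ r)        # importance = avg neighbor importance
print(r, r.sum())                          # steady-state importances, sums to 1
```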

todo — figure out what else is happening in networks