I’m mostly not focusing on school right now — spending time on research / projects seems more useful and more interesting and also better for learning things. However, occasionally some of the school stuff seems pretty cool actually! So I’ll write down some of that stuff here to remember.
AIRR
I actually enjoy this class quite a bit. So far this class is mostly about inference. Usually you have a set of local constraints ("potentials" or "factors") $\psi_a(x_{\partial a})$, and the joint distribution is their normalized product: $p(x) \propto \prod_a \psi_a(x_{\partial a})$.
Important fact: scaling a potential doesn’t impact the probability distribution that you get out of this (the normalization constant absorbs the scale).
Another observation: the conditional distribution $p(x \mid x_i = a)$ is obtained by adding a factor $\mathbf{1}[x_i = a]$.
BP is cool. Think of BP on a tree as “collapsing factor sub-trees”.
Another important fact: scaling messages in BP doesn’t change the marginals or MAP assignment.
Belief Propagation
Message from Variable to Factor:
For a variable $i$ and a factor $a$:
$$\mu_{i \to a}(x_i) = \prod_{b \in N(i) \setminus \{a\}} \mu_{b \to i}(x_i)$$
Where: $N(i)$ denotes the set of factors connected to $i$, and $\mu_{b \to i}$ is the message from factor $b$ to variable $i$.
Message from Factor to Variable:
For a factor $a$ and a variable $i$:
$$\mu_{a \to i}(x_i) = \sum_{x_{N(a) \setminus \{i\}}} f_a(x_{N(a)}) \prod_{j \in N(a) \setminus \{i\}} \mu_{j \to a}(x_j)$$
Where: $f_a$ is the factor function that depends on the variables connected to the factor $a$, and $N(a)$ denotes the set of variables connected to factor $a$.
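To make the updates concrete, here's a minimal sum-product sketch on the tiny chain x1 - f12 - x2 - f23 - x3 with binary variables (the factor tables and all names are my own toy choices, not anything from the class):

```python
import numpy as np

# Toy pairwise factors on a 3-variable chain.
f12 = np.array([[1.0, 0.5],
                [0.5, 1.0]])   # f12[x1, x2]
f23 = np.array([[1.0, 2.0],
                [1.0, 1.0]])   # f23[x2, x3]

# Leaf variables x1, x3 have no other factors, so their messages
# are all-ones (an empty product).
m_x1_f12 = np.ones(2)
m_x3_f23 = np.ones(2)

# Factor-to-variable: multiply in the incoming variable messages,
# then sum out everything except the target variable.
m_f12_x2 = f12.T @ m_x1_f12    # sum over x1
m_f23_x2 = f23 @ m_x3_f23      # sum over x3

# Marginal of x2 = normalized product of incoming factor messages.
p_x2 = m_f12_x2 * m_f23_x2
p_x2 /= p_x2.sum()
print(p_x2)   # [0.6, 0.4] for these toy factors
```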
Max-product (the same message passing, with sums replaced by maxes) lets you find MAP assignments
Sampling in a Bayes net:
- just do it: sample each variable in topological order, parents before children (ancestral sampling)
Conditional sampling in a Bayes net:
- rejection sampling — works, but slow
- importance sampling:
  - If you sample $x \sim q$ instead of $p$, reweight: $\mathbb{E}_p[f(x)] = \mathbb{E}_q\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$, so each sample gets weight $w(x) = p(x)/q(x)$.
  - They recommend: use the $q$ which is "do ancestral sampling but with the observations fixed" (likelihood weighting); the weight then reduces to the product of the observations' likelihoods given their parents.
todo: question for me: when do we expect this to converge faster than just sampling from $p$ and rejecting?
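Here's a minimal likelihood-weighting sketch on a two-node net $A \to B$, observing $B = 1$ (the CPT numbers are toy values of mine):

```python
import random

# Tiny Bayes net A -> B; CPTs are made-up toy numbers.
P_A = 0.3                        # P(A = 1)
P_B_given_A = {0: 0.2, 1: 0.9}   # P(B = 1 | A = a)

def estimate_posterior(n_samples, b_obs=1):
    """Estimate P(A = 1 | B = b_obs): ancestral sampling with B clamped."""
    num = den = 0.0
    for _ in range(n_samples):
        a = 1 if random.random() < P_A else 0   # sample the unobserved variable
        p_b = P_B_given_A[a] if b_obs == 1 else 1.0 - P_B_given_A[a]
        num += p_b * a                          # weight = likelihood of the evidence
        den += p_b
    return num / den

print(estimate_posterior(100_000))   # exact answer: 0.27 / 0.41 ≈ 0.659
```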
Gibbs Sampling
- Repeatedly:
- Choose a random variable $x_i$
- Resample its value conditional on the other variables’ settings
- Note that in a factor graph it suffices to look at the neighbors to do this computation
- So it should be relatively cheap: O(number of factors the var is involved in). (See the sketch below.)
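A minimal Gibbs sketch for a pairwise binary model on a 3-node chain, $p(x) \propto \exp(\sum_{(i,j)} w_{ij} x_i x_j)$ with $x_i \in \{-1, +1\}$ (the model and weights are my own toy example):

```python
import math, random

# Each variable's conditional only involves its neighbors in the factor graph.
neighbors = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0)]}

def gibbs_step(x):
    i = random.randrange(len(x))                    # choose a random variable
    # Only the factors touching x_i matter for its conditional.
    field = sum(w * x[j] for j, w in neighbors[i])
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))   # P(x_i = +1 | rest)
    x[i] = 1 if random.random() < p_plus else -1

x = [1, 1, 1]
for _ in range(10_000):                             # repeatedly resample
    gibbs_step(x)
print(x)
```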
Q: Can we say how fast it converges?
- Vibes are that it’s about $n \log n$ steps for most reasonable $n$-bit systems.
Remark: the Metropolis-Hastings alg generalizes this: propose a move from a proposal distribution $q(x' \mid x)$ and accept it with probability $\min\!\left(1, \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right)$.
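A minimal MH sketch with a symmetric random-walk proposal (so the $q$ ratio cancels); the target density and numbers are my own toys:

```python
import math, random

def p_tilde(x):
    return math.exp(x**2 - x**4)   # some unnormalized 1-D target

def mh(n_steps, step=0.5):
    x, samples = 0.0, []
    for _ in range(n_steps):
        x_new = x + random.gauss(0.0, step)   # propose from q(x' | x)
        # Symmetric q means the acceptance ratio is just p(x') / p(x).
        if random.random() < min(1.0, p_tilde(x_new) / p_tilde(x)):
            x = x_new
        samples.append(x)
    return samples

samples = mh(50_000)
```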
feb 25
Here is a formula: conditioning Gaussians. If $x$ and $y$ are jointly Gaussian, then
$$x \mid y \;\sim\; \mathcal{N}\!\big(\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\;\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\big).$$
Here is an example: let $x, y$ be scalar with correlation $\rho$; then $\mathbb{E}[x \mid y] = \mu_x + \rho \tfrac{\sigma_x}{\sigma_y}(y - \mu_y)$ and $\mathrm{Var}(x \mid y) = \sigma_x^2(1 - \rho^2)$.
More generally, the same block formula conditions any subset of a jointly Gaussian vector on the rest.
- note to self — if there is a midterm and I get to bring notes, I should print this out
- another note: always submit regrade requests when you’re actually right
It turns out that if you have jointly Gaussian random variables, the conditional mean is linear in the observation and the conditional covariance doesn’t depend on the observed value at all. The fact that conditioning, marginalizing, and adding jointly Gaussian rvs gives Gaussian rvs is very nice.
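A quick numeric check of the conditioning formula above (toy numbers mine):

```python
import numpy as np

mu = np.array([0.0, 1.0])         # means of (x, y)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])    # joint covariance of (x, y)

y_obs = 2.0
Sxx, Sxy, Syy = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

cond_mean = mu[0] + Sxy / Syy * (y_obs - mu[1])   # mu_x + Sxy Syy^{-1} (y - mu_y)
cond_var  = Sxx - Sxy / Syy * Sxy                 # Sxx - Sxy Syy^{-1} Syx
print(cond_mean, cond_var)                        # 0.8 1.36
```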
Markov Blanket of a vertex: the set of vertices you need to condition on to make it independent of everything else (in a factor graph: the variables sharing a factor with it; in a Bayes net: parents, children, and the children’s other parents).
Gibbs Sampling
Question: why does it work?
Answer:
You can show that the target distribution $\pi$ is stationary for the Gibbs chain,
ie that if $x \sim \pi$ and you resample one coordinate from its conditional, the result is still $\sim \pi$.
Gibbs sampling is only going to work if the chain is ergodic anyways, so fine: stationarity plus ergodicity gives convergence to $\pi$.
Getting smoothed estimates online (fixed-lag smoothing)
Here’s how you do it: if you care about smoothing at lag L
- Store the last $L$ forward messages $\alpha_{t-L}, \dots, \alpha_t$.
- When a new observation $y_{t+1}$ arrives:
  - Compute the new forward message $\alpha_{t+1}$ (just one step of the forward algorithm)
  - Run the backward algorithm from the current time $t$ back to whatever past time points we want to smooth
  - Combine the stored forward messages with the new backward messages
- Cost: $O(L)$ message updates per observation, where $L$ is the “lag”
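Here's a sketch of that recipe for a 2-state discrete HMM, assuming transition matrix `A` and emission matrix `B` (the matrices and all names are toy choices of mine):

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # A[i, j] = P(z' = j | z = i)
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])        # B[i, k] = P(y = k | z = i)
pi = np.array([0.5, 0.5])
L = 3                             # the lag

alphas, ys = [], []               # rolling window of forward messages + obs

def observe(y):
    """Ingest y_t and return p(z_{t-L} | y_{1:t}) once t >= L."""
    # One step of the forward algorithm.
    pred = alphas[-1] @ A if alphas else pi
    alphas.append(pred * B[:, y])
    ys.append(y)
    if len(alphas) > L + 1:       # only keep the last L+1 messages
        alphas.pop(0); ys.pop(0)
    if len(alphas) < L + 1:
        return None               # not enough history yet
    # Backward pass over the stored window: from time t back down to t-L.
    beta = np.ones(len(pi))
    for y_fut in reversed(ys[1:]):
        beta = A @ (B[:, y_fut] * beta)
    gamma = alphas[0] * beta      # combine forward and backward messages
    return gamma / gamma.sum()

for y in [0, 0, 1, 1, 1, 0]:
    print(observe(y))
```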
todo: can you prove bounds on when particle method works?
Particle methods for sampling
I really like these “smart sampling” approaches. It feels like maybe they should say something interesting about neural networks. But idrk.
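The notes above don't spell the method out, so here's my own minimal bootstrap particle filter sketch (1-D Gaussian random-walk state with Gaussian observations; the model and all numbers are assumptions of mine):

```python
import math, random

N = 1000
particles = [random.gauss(0.0, 1.0) for _ in range(N)]   # init from the prior

def pf_step(y_obs, sigma_x=0.5, sigma_y=0.5):
    """One predict-weight-resample step of a bootstrap particle filter."""
    global particles
    # Predict: push each particle through the dynamics (proposal = the prior).
    moved = [x + random.gauss(0.0, sigma_x) for x in particles]
    # Weight: observation likelihood p(y_obs | x) for each particle.
    ws = [math.exp(-0.5 * ((y_obs - x) / sigma_y) ** 2) for x in moved]
    # Resample proportionally to the weights.
    particles = random.choices(moved, weights=ws, k=N)

for y in [0.1, 0.4, 0.8, 1.2]:     # a made-up observation stream
    pf_step(y)
print(sum(particles) / N)          # posterior-mean estimate of the state
```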
Inf + Info
LRTs (likelihood ratio tests) are good. Sometimes you need randomness to get a full ROC curve.
They gave some nice motivation for what “decision rules” are:
- maybe you have priors and costs ⇒ can choose rule to minimize E(cost).
- maybe have priors but no costs, maybe just have some tolerable FPR ⇒ can see what ROC curve you can get
- maybe have costs but no priors ⇒ take worst case prior (“minimax formulation”)
What if you don’t have a prior?
you can graph the Bayes-optimal expected cost as a function of the prior $p = \Pr[H_1]$ (it’s concave),
but ofc we don’t know the prior.
So we can instead achieve something more like a straight line: a fixed decision rule has expected cost linear in the prior (a line touching the Bayes-cost curve wherever that rule is the Bayes rule), and
clearly in the adversarial setting we choose this line to have slope zero, ie equal expected cost conditional on $H_0$ and on $H_1$.
equivalently this says: look for the intersection of the OC-LRT (the operating characteristic of the LRT family) and the line $P_D = 1 - P_F$
more generally with some costs, the minimax optimal decision rule (ie decision rule with best performance on whatever prior is worst for this decision rule)
could be the intersection of the
OC-LRT and the line where the two conditional expected costs agree:
$$C_{00}(1 - P_F) + C_{10}P_F \;=\; C_{01}(1 - P_D) + C_{11}P_D,$$
with $C_{ij}$ = cost of declaring $H_i$ when $H_j$ holds.
don’t even ask about the case
i think you can tell if you need a mixed strategy based on whether the intersection lands on a part of the OC curve that itself required a mixed strategy (a randomized LRT)
Fact: in fact, the optimal decision rule satisfies
$$\min_{f} \max_{p}\; \mathbb{E}[\mathrm{cost}] \;=\; \max_{p} \min_{f}\; \mathbb{E}[\mathrm{cost}],$$
by thinking of it as a two player game. This turns out to be really great because the RHS is much more tractable: for each fixed prior, the inner $\min_f$ is just a Bayes (LRT) rule.
Credit to Kevin for explaining all this to me much better than the lecture notes!
Moreover, the expected cost if the prior was $p^*$
and you knew it should be the same conditional on the hypothesis being $H_0$ or $H_1$, where $p^*$ is the optimum in the RHS thing (the least favorable prior).
something to be careful of:
If the curves don’t intersect then the answer is either $p^* = 0$
or $p^* = 1$. hopefully you can just think about this to figure out which one is correct.
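As a sanity check of the "equalize the conditional errors" recipe, a numeric toy of mine: $H_0: y \sim \mathcal{N}(0,1)$ vs $H_1: y \sim \mathcal{N}(1,1)$ with 0/1 costs; bisect for the LRT threshold where $P_F = P_M$.

```python
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def errors(thresh):
    P_F = 1.0 - Phi(thresh)       # decide H1 when y > thresh, under H0
    P_M = Phi(thresh - 1.0)       # decide H0 when y < thresh, under H1
    return P_F, P_M

lo, hi = -5.0, 5.0                # bisection: P_F - P_M is decreasing in thresh
for _ in range(60):
    mid = (lo + hi) / 2.0
    P_F, P_M = errors(mid)
    lo, hi = (mid, hi) if P_F > P_M else (lo, mid)

print(lo, errors(lo))             # threshold ≈ 0.5, equal error probabilities
```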
remark: the minimax inequality ($\min\max \ge \max\min$) is true in general and is pretty good.
I think I get the basic point although might read their examples at some point.
Fact: Every point on the efficient frontier is achieved by a (possibly randomized) LRT.
cost formulations:
If you’re estimating a continuous quantity, the cost function determines the Bayes-optimal estimator.
MSE loss: choose $\hat{x}(y) = \mathbb{E}[x \mid y]$, the Bayes least squares (BLS) estimator.
fun fact about the MSE estimator: its error is orthogonal to any function of the data, $\mathbb{E}[(x - \hat{x}(y))\,g(y)] = 0$.
Recall: the error covariance matrix of an estimator is $\Lambda_e = \mathbb{E}[(x - \hat{x}(y))(x - \hat{x}(y))^\top]$.
Fact: the error covariance matrix for the BLS estimator is better (smaller in the PSD order) than for any other estimator,
and something like $\mathbb{E}\,\Lambda_{x \mid y, z} \preceq \mathbb{E}\,\Lambda_{x \mid y}$,
so on average more info is helpful for estimation — although some info can hurt your estimate by being misleading.
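One-line reason the conditional mean wins under MSE (my own write-up of the standard argument): for any estimator $f$,
$$\mathbb{E}\big[(x - f(y))^2\big] = \mathbb{E}\big[(x - \mathbb{E}[x \mid y])^2\big] + \mathbb{E}\big[(\mathbb{E}[x \mid y] - f(y))^2\big],$$
since the cross term vanishes by the tower rule; the second term is minimized (to zero) by $f(y) = \mathbb{E}[x \mid y]$.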
Exponential families
$$p(y; x) = \exp\{\lambda(x)^\top t(y) + \beta(y) - \alpha(x)\}$$
- $\lambda(x)$: natural parameter
- $t(y)$: natural statistic
- $\beta(y)$: log base function
- $\alpha(x) = \log \int \exp\{\lambda(x)^\top t(y) + \beta(y)\}\,dy$ — log-partition function
- note the representation is not unique
- not everything is expressible in this way — supposed to be bad if domain depends on param
”canonical” exp family is one with $\lambda(x) = x$. natural param space — set of $x$ such that the dist is normalizable ($\alpha(x) < \infty$)
natural exp family: one with $t(y) = y$
Proposition: Log partition fn generates cumulants: in a canonical family, $\dot\alpha(x) = \mathbb{E}[t(y)]$ and $\ddot\alpha(x) = \mathrm{Var}[t(y)]$.
- If $t(y)$ is non-degenerate then $\ddot\alpha > 0$, so it follows that $\dot\alpha$ is a monotonic and thus invertible fn of $x$
Score function of a canonical expo fam: $S(y; x) = \frac{\partial}{\partial x} \log p(y; x) = t(y) - \dot\alpha(x)$.
Fisher info of canonical expo fam: $J_y(x) = \mathrm{Var}[S(y; x)] = \ddot\alpha(x)$.
In the general (non-canonical) case we have the following chain-rule expressions:
- $S(y; x) = \dot\lambda(x)^\top \big(t(y) - \dot\alpha(\lambda(x))\big)$, $J_y(x) = \dot\lambda(x)^\top\, \ddot\alpha(\lambda(x))\, \dot\lambda(x)$, and so on
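To make this concrete, here's the Bernoulli family worked out (a standard example; the write-up is mine):
$$p(y; q) = q^y (1-q)^{1-y} = \exp\Big\{ y \log\tfrac{q}{1-q} + \log(1-q) \Big\}, \qquad y \in \{0, 1\},$$
so the natural parameter is $\eta = \log\frac{q}{1-q}$, $t(y) = y$, $\beta(y) = 0$, and $\alpha(\eta) = \log(1 + e^\eta)$. Checking the cumulant proposition: $\dot\alpha(\eta) = \frac{e^\eta}{1 + e^\eta} = q = \mathbb{E}[t(y)]$ and $\ddot\alpha(\eta) = q(1-q) = \mathrm{Var}[t(y)]$.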
Minimum-variance unbiased estimators (MVU)
- a valid (doesn’t depend on param; just on visible data) unbiased estimator with uniformly (ie for every param value) minimum variance among all unbiased estimators
- may well not exist
Cramer-Rao bound:
Suppose the density is smooth enough that differentiation in $x$ and integration in $y$ can be interchanged
(which it seems like any moderately reasonable distribution will satisfy);
then for any unbiased $\hat{x}$:
$$\mathrm{Var}(\hat{x}(y)) \ge \frac{1}{J_y(x)}.$$
remark: large Fisher info means we expect to be able to better resolve the value of the parameter
If you are tight with the Cramer-Rao bound then you get
$$\hat{x}(y) = x + J_y(x)^{-1}\, \frac{\partial}{\partial x}\log p(y; x),$$
where the RHS must secretly not depend on $x$ (such an estimator is called efficient).
Remark:
- if tight then unique!
- clear by the fact that we can just write down the above expression for what it is.
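Standard example (my write-up): for $y_1, \dots, y_n$ iid $\mathcal{N}(x, \sigma^2)$ with known $\sigma^2$, the Fisher info is $J_y(x) = n/\sigma^2$, so any unbiased estimator has variance at least $\sigma^2/n$. The sample mean $\bar{y}$ has exactly this variance, so it is efficient, hence MVU.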
in the below, ML means maximum likelihood, not machine learning
Sometimes the ML estimate happens to be “efficient”, ie make the Cramer-Rao bound tight, ie be the MVU estimator.
note that you can generally find the ML estimator by finding which $x$ makes the score zero, ie solving $\frac{\partial}{\partial x} \log p(y; x) = 0$
note: “ML estimate commutes with invertible maps”, ie the ML estimate of $g(x)$ is $g(\hat{x}_{\mathrm{ML}})$ for invertible $g$.
Sufficient statistics
- We say $t(y)$ is a sufficient statistic if for any $x$, the distribution of $y$ given $t(y)$ doesn’t depend on $x$
- equivalently, the condition is $p(y \mid t(y) = t; x) = p(y \mid t(y) = t)$ for all $x$, where the left-hand side is computed under the model with param $x$
Neyman Factorization Theorem: $t(y)$ is sufficient iff the density factors as $p(y; x) = g(t(y); x)\, h(y)$.
- Example: $y_1, \dots, y_n$ iid dist as $\mathcal{N}(x, 1)$, then $\sum_i y_i$ is a suffic statistic.
- Example: the likelihood ratio $L(y)$ is a suffic statistic for binary hypothesis testing.
- Example: if $x$ is in a finite set then listing $p(y; x)$ for all possible $x$ is a suffic stat.
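Sanity-checking the Gaussian example against the factorization theorem (my own write-up):
$$p(y; x) = \prod_i \tfrac{1}{\sqrt{2\pi}} e^{-(y_i - x)^2/2} = \underbrace{(2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_i y_i^2}}_{h(y)} \cdot \underbrace{e^{x \sum_i y_i - \frac{n}{2}x^2}}_{g(\sum_i y_i;\; x)},$$
so $\sum_i y_i$ is sufficient by Neyman.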
minimal suffic statistic:
- a suffic stat $t$ is minimal if, for any suffic stat $t'$, there exists a function $g$ such that $t = g(t')$.
ex: likelihood ratio is minimal suffic stat
Bayesian setting
Here sufficiency is equivalent to $p(x \mid y) = p(x \mid t(y))$: the posterior only looks at $y$ through $t(y)$.
minimal suffic statistic:
If the prior is supported everywhere, the posterior itself (the map $y \mapsto p(\cdot \mid y)$) is a minimal suffic statistic.
ppl don’t know how to check whether a suffic stat is minimal efficiently. But,
a suffic stat $t$ is complete if for any function $g$: $\mathbb{E}[g(t(y)); x] = 0$ for all $x$ implies $g(t(y)) = 0$ almost surely.
Theorem: Complete suffic stats are minimal.
- you can use this to show that the natural statistic $t(y)$
for an exp fam is a minimal suffic stat
remark:
- completeness is saying that the densities $\{p(t; x)\}_x$
are maximally lin indep: no nonzero function integrates to zero against all of them
- a suffic stat is minimal iff for each distinct pair $y, y'$ with $t(y) \neq t(y')$,
we have that the likelihood functions $p(y; \cdot)$ and $p(y'; \cdot)$ are lin indep
Networks
- Measure “importance” of a vertex as “average importance of neighbors”.
- Can interpret these importances as steady-state probabilities of the random walk, if you set it up as a Markov Chain.
- Can also give people some base importance.
- This is like a Markov Chain but with some teleportation probability (ie PageRank; sketch below).
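A power-iteration sketch of “importance = average importance of neighbors, plus teleportation” (the graph and damping factor are my own toys):

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0]}   # node -> outgoing links
n, d = 3, 0.85                        # teleport with probability 1 - d

P = np.zeros((n, n))                  # column-stochastic link matrix
for i, outs in links.items():
    for j in outs:
        P[j, i] = 1.0 / len(outs)

r = np.ones(n) / n                    # base importance: uniform
for _ in range(100):
    r = (1 - d) / n + d * (P @ r)     # teleport + follow-a-link step
print(r)                              # steady-state importances
```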
todo — figure out what else is happening in networks