• fix transformer shortest path thing
  • investigate superposition a bit more — some of my results in superposition were wacky
  • code a sparse autoencoder (SAE) (or maybe just do the ARENA exercises on this)
  • train a probe (e.g., something similar to Tegmark truth paper?)
  • find which inputs activate individual neurons the most, e.g., in a CNN trained on MNIST
  • implement the IRL thing by Dylan
  • code a VAE, do deformations in latent space and see what happens in actual space
  • implement ARC’s heuristic formal verification paper
  • can you train a model to hide its internal state against probes?
    • this question is a bit tricky: if a model has a certain internal state and your probe doesn’t find it, that may just mean you didn’t try hard enough. But there are probably gradations to this, so I bet the question still makes sense.
  • implement something that generates adversarial examples
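For the SAE item, a minimal numpy sketch of the core idea: train an overcomplete ReLU autoencoder with an L1 sparsity penalty on synthetic "activations". The data, dimensions, and hyperparameters here are all made up; a real run would use activations from an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "model activations": sparse combinations of 8 ground-truth
# feature directions in a 16-dim space (all numbers here are assumptions).
d_model, n_feats, n_samples = 16, 8, 2048
features = rng.normal(size=(n_feats, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)
coeffs = rng.random((n_samples, n_feats)) * (rng.random((n_samples, n_feats)) < 0.15)
acts = coeffs @ features

# SAE: overcomplete dictionary, ReLU encoder, L1 penalty on the codes.
d_hidden, lr, l1 = 32, 1e-2, 1e-4
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))

def mse(W_enc, b_enc, W_dec):
    recon = np.maximum(acts @ W_enc + b_enc, 0.0) @ W_dec
    return np.mean((recon - acts) ** 2)

mse_init = mse(W_enc, b_enc, W_dec)

for step in range(2000):
    batch = acts[rng.integers(0, n_samples, 128)]
    pre = batch @ W_enc + b_enc
    h = np.maximum(pre, 0.0)                 # sparse codes
    err = h @ W_dec - batch
    # Backprop by hand for loss = mean((recon - x)^2) + l1 * sum|h|.
    grad_recon = 2 * err / len(batch)
    grad_W_dec = h.T @ grad_recon
    grad_h = grad_recon @ W_dec.T + l1 * np.sign(h)
    grad_pre = grad_h * (pre > 0)            # ReLU gate
    W_enc -= lr * (batch.T @ grad_pre)
    b_enc -= lr * grad_pre.sum(0)
    W_dec -= lr * grad_W_dec

final_mse = mse(W_enc, b_enc, W_dec)
print(f"mse: {mse_init:.4f} -> {final_mse:.4f}")
```

With the L1 coefficient near zero this degenerates to a plain autoencoder; tuning it up is where the interesting sparsity/reconstruction trade-off shows up.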
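For the probe item, a toy version of the setup: plant a "truth direction" in synthetic activations, then fit a linear (logistic-regression) probe and check it recovers the direction. Everything here is synthetic stand-in data, not the actual paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "residual-stream activations" with a planted truth direction.
d, n = 64, 1000
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
labels = rng.integers(0, 2, n)                       # 1 = "true statement"
acts = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * labels - 1, truth_dir)

# Linear probe: logistic regression by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(acts @ w + b)))            # sigmoid
    w -= lr * (acts.T @ (p - labels) / n)
    b -= lr * np.mean(p - labels)

acc = np.mean(((acts @ w + b) > 0) == labels)
cos = w @ truth_dir / np.linalg.norm(w)
print(f"probe accuracy: {acc:.3f}, cosine(w, truth_dir): {cos:.3f}")
```

The cosine check is the useful part: a probe that classifies well but whose weights don't align with anything interpretable is exactly the failure mode to watch for.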
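For the "what activates neurons the most" item, a stripped-down version of the exercise: a hand-built 3x3 vertical-edge filter stands in for a trained CNN neuron, and a synthetic dataset stands in for MNIST. The real version would rank actual dataset examples by a real neuron's activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic images, 10 of which have a planted vertical edge.
n_imgs, size = 200, 12
images = rng.normal(scale=0.1, size=(n_imgs, size, size))
edge_ids = rng.choice(n_imgs, 10, replace=False)
for i in edge_ids:
    col = rng.integers(2, size - 2)
    images[i, :, col:] += 1.0              # right half brighter: a vertical edge

kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])         # vertical-edge "neuron"

def max_activation(img):
    """Max filter response over all valid 3x3 patches."""
    best = -np.inf
    for r in range(size - 2):
        for c in range(size - 2):
            best = max(best, float(np.sum(img[r:r+3, c:c+3] * kernel)))
    return best

acts = np.array([max_activation(img) for img in images])
top5 = np.argsort(acts)[-5:]
print("top-activating images:", sorted(int(i) for i in top5))
print("planted-edge images:  ", sorted(int(i) for i in edge_ids))
```

The top-activating images should all be ones with a planted edge, which is the sanity check that "max-activating dataset examples" actually tells you what the unit detects.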
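For the adversarial-examples item, the simplest possible starting point is a single-step gradient attack (FGSM-style) on a toy logistic model, where the input gradient has a closed form. The model and data are made up; on a real network you'd backprop to the input instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier" on a flattened 28x28 input.
d = 28 * 28
w = rng.normal(size=d) / np.sqrt(d)
b = 0.0

def prob(x):
    return 1 / (1 + np.exp(-(x @ w + b)))

x = rng.normal(size=d)
label = 1.0 if (x @ w + b) > 0 else 0.0      # treat the model's own prediction as "correct"

# For logistic loss, the gradient of the loss wrt the input is (p - y) * w.
grad_x = (prob(x) - label) * w
eps = 0.25
x_adv = x + eps * np.sign(grad_x)            # one FGSM step: small L-inf perturbation

flipped = (((x_adv @ w + b) > 0) != bool(label))
print(f"clean prob: {prob(x):.3f}, adv prob: {prob(x_adv):.3f}, flipped: {flipped}")
```

The point of the exercise: even a tiny per-pixel perturbation, aligned with the loss gradient, moves the logit by roughly eps times the L1 norm of the weights, which is easily enough to flip the prediction.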