- fix transformer shortest path thing
- investigate superposition a bit more; some of my results in superposition were wacky (toy-superposition sketch below)
- code a sparse autoencoder (SAE), or maybe just do the ARENA stuff about this (SAE sketch below)
- train a probe (e.g., something similar to the Marks & Tegmark "Geometry of Truth" paper?) (linear-probe sketch below)
- see what activates neurons the most, e.g., in a CNN trained on MNIST (activation-maximization sketch below)
- implement the IRL thing by Dylan
- code a VAE, do deformations in latent space and see what happens in input space (VAE sketch below)
- implement ARC's heuristic formal verification paper
- can you train a model to hide its internal state against probes? (probe-hiding sketch below)
- this question is a bit tricky: if a model has a certain internal state and you don't find it, that may well just mean you didn't try hard enough. But there are probably gradations to this, so I bet it makes sense.
- implement something that generates adversarial examples, e.g., FGSM (sketch below)
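
For the superposition item: a minimal toy-model sketch in the spirit of Anthropic's toy models of superposition, just to have something concrete to poke at when results look wacky. All sizes, the sparsity level, and the importance schedule are made up.

```python
# Toy superposition sketch (made-up sizes): squeeze 20 sparse features into a
# 5-dimensional hidden space with a tied-weight ReLU readout and see how much
# the feature directions interfere.
import torch

n_features, d_hidden, batch, steps = 20, 5, 1024, 3000
sparsity = 0.95                                        # P(feature is zero)
importance = 0.9 ** torch.arange(n_features).float()   # decaying feature importance

W = torch.nn.Parameter(0.1 * torch.randn(n_features, d_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(steps):
    # sparse features uniform in [0, 1], most of them switched off
    x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) > sparsity)
    h = x @ W                                # project down
    x_hat = torch.relu(h @ W.T + b)          # project back up (tied weights)
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Off-diagonal mass in W W^T = features sharing directions, i.e. superposition.
with torch.no_grad():
    gram = W @ W.T
    off_diag = gram - torch.diag(gram.diag())
    print("feature norms:", gram.diag())
    print("max interference between features:", off_diag.abs().max().item())
```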
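
For the SAE item: a minimal sparse-autoencoder sketch. The activations here are random placeholders standing in for cached model activations, and the widths and L1 coefficient are arbitrary.

```python
# Minimal sparse autoencoder (SAE) sketch: overcomplete ReLU latent with an
# L1 sparsity penalty, trained to reconstruct a batch of activations.
import torch
import torch.nn as nn

d_model, d_sae, l1_coeff = 256, 1024, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))      # sparse feature activations
        x_hat = self.dec(z)              # reconstruction
        return x_hat, z

sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(4096, d_model)        # placeholder for real cached activations
for step in range(1000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    x_hat, z = sae(batch)
    recon = (batch - x_hat).pow(2).mean()
    sparsity = z.abs().mean()            # L1 penalty pushes features to be sparse
    loss = recon + l1_coeff * sparsity
    opt.zero_grad(); loss.backward(); opt.step()

print("final recon loss:", recon.item(),
      "| mean active features per input:", (z > 0).float().sum(dim=-1).mean().item())
```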
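
For the probe item: a minimal linear-probe sketch (logistic regression on frozen activations). `X` and `y` are random placeholders where the real statement activations and true/false labels would go; sizes and learning rate are made up.

```python
# Minimal linear-probe sketch: fit a logistic-regression probe on fixed
# activations and check held-out accuracy.
import torch
import torch.nn as nn

d_model, n_train, n_test = 512, 2000, 500
X = torch.randn(n_train + n_test, d_model)                        # placeholder activations
y = (X[:, 0] + 0.1 * torch.randn(n_train + n_test) > 0).float()   # toy labels

probe = nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

for epoch in range(200):
    loss = loss_fn(probe(X_tr).squeeze(-1), y_tr)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    acc = ((probe(X_te).squeeze(-1) > 0).float() == y_te).float().mean()
print(f"probe test accuracy: {acc.item():.3f}")
```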
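
For the max-activation item: a sketch of activation maximization, gradient ascent on the input so one chosen conv channel fires as hard as possible. The CNN here is an untrained stand-in for a CNN actually trained on MNIST, and the layer/channel/regularization choices are arbitrary; in practice I'd also rank real dataset images by activation.

```python
# Activation-maximization sketch: optimize a 28x28 input to maximize the mean
# activation of one channel in a conv layer.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
target_layer, target_channel = 2, 5      # Sequential index and channel to maximize

acts = {}
def hook(module, inp, out):
    acts["out"] = out
cnn[target_layer].register_forward_hook(hook)

img = torch.zeros(1, 1, 28, 28, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    cnn(img)
    objective = acts["out"][0, target_channel].mean()
    loss = -objective + 1e-3 * img.pow(2).mean()   # small L2 keeps pixels sane
    opt.zero_grad(); loss.backward(); opt.step()

print("final channel activation:", objective.item())
```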
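
For the VAE item: a minimal VAE sketch plus a straight-line walk in latent space to see how the decoded output deforms. The data is a random placeholder for MNIST and all sizes are made up.

```python
# Minimal VAE sketch with a latent-space interpolation at the end.
import torch
import torch.nn as nn

d_in, d_latent = 784, 16

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, d_latent), nn.Linear(256, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(),
                                 nn.Linear(256, d_in), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

vae = VAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
data = torch.rand(2048, d_in)            # placeholder for real images in [0, 1]

for step in range(500):
    x = data[torch.randint(0, len(data), (128,))]
    x_hat, mu, logvar = vae(x)
    recon = nn.functional.binary_cross_entropy(x_hat, x)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    loss = recon + kl
    opt.zero_grad(); loss.backward(); opt.step()

# Latent deformation: interpolate between two encoded points and decode.
with torch.no_grad():
    _, mu, _ = vae(data[:2])
    for t in torch.linspace(0, 1, 5).tolist():
        z = (1 - t) * mu[0] + t * mu[1]
        img = vae.dec(z).reshape(28, 28)
        print(f"t={t:.2f}  mean decoded pixel={img.mean().item():.3f}")
```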
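
For the probe-hiding question: one concrete way to try it is adversarial training against a concurrently-trained probe, i.e. the model solves its task while being penalized whenever the probe can read a "secret" attribute off its hidden layer. Everything here (data, the secret attribute, the loss weighting) is made up.

```python
# Sketch of hiding a concept from probes via adversarial training.
import torch
import torch.nn as nn

d_in, d_hidden = 32, 64
X = torch.randn(4096, d_in)
y_task = (X.sum(dim=-1) > 0).long()      # what the model is supposed to predict
y_secret = (X[:, 0] > 0).long()          # what we want probes NOT to recover

body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
head = nn.Linear(d_hidden, 2)
probe = nn.Linear(d_hidden, 2)
opt_model = torch.optim.Adam(list(body.parameters()) + list(head.parameters()), lr=1e-3)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

for step in range(2000):
    idx = torch.randint(0, len(X), (256,))
    x, yt, ys = X[idx], y_task[idx], y_secret[idx]

    # 1) train the probe on detached hidden states (model weights don't move here)
    opt_probe.zero_grad()
    probe_loss = ce(probe(body(x).detach()), ys)
    probe_loss.backward(); opt_probe.step()

    # 2) train the model: solve the task, but push the current probe's loss UP
    opt_model.zero_grad()
    h = body(x)
    loss = ce(head(h), yt) - 0.5 * ce(probe(h), ys)
    loss.backward(); opt_model.step()

print("task+hiding loss:", loss.item(), "| probe loss:", probe_loss.item())
```

(This only shows that the representation can be made hard for *this* probe; a stronger attacker probe, or a nonlinear one, is the obvious next check, which is basically the gradation worry in the sub-bullet above.)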
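
For the adversarial-examples item: a minimal FGSM (fast gradient sign method) sketch. The classifier is an untrained stand-in and epsilon is arbitrary; with a trained MNIST model the same step typically flips the prediction.

```python
# FGSM sketch: one signed-gradient step on the input to increase the loss,
# clamped back into the valid pixel range.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
epsilon = 0.1

x = torch.rand(1, 1, 28, 28)             # placeholder image in [0, 1]
y = torch.tensor([3])                    # its (assumed) true label

x_adv = x.clone().requires_grad_(True)
loss = loss_fn(model(x_adv), y)
loss.backward()

with torch.no_grad():
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1)

print("clean prediction:", model(x).argmax(dim=-1).item(),
      "| adversarial prediction:", model(x_adv).argmax(dim=-1).item())
```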