AI interpretability pack


LSE AI website: quick recap of who, why, and what
lseai.org

I believe in learning by solving real-world problems, so I will just give you the basics of LLM interpretability. Feel free to dive deeper into any of these concepts via the sources referenced below:

Interpretability pre-requisites:
- Python and math (algebra, calculus, probability)
- ML basics: check the video series at https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&si=DGtKKnCnlu_fFG6X and https://pytorch.org/tutorials/ (try a linear regression), with https://einops.rocks/1-einops-basics/ and https://rockt.github.io/2018/04/30/einsum for easier algebra with tensors (you will run into bugs otherwise)
- Transformers: read https://www.neelnanda.io/transformer-tutorial, then fill out https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob/clean-transformer-demo/Clean_Transformer_Demo_Template.ipynb to get a working knowledge
- The interpretability tools: check out Callum McDougall's tutorials for TransformerLens + Induction Heads, and try to get your hands dirty with https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Exploratory_Analysis_Demo.ipynb, which is based on the "Interpretability in the Wild" paper (a major interpretability publication)
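
To make the "try a linear regression" exercise concrete, here is a minimal sketch of a linear regression fitted by gradient descent, written with einsum so every tensor index is named explicitly. It uses NumPy rather than PyTorch so it runs with minimal setup; torch.einsum accepts the same subscript strings. The function name, the synthetic data, the learning rate, and the step count are all illustrative assumptions, not taken from any of the tutorials above.

```python
import numpy as np


def fit_linear_regression(X, y, lr=0.1, steps=500):
    """Fit y ~= X @ w + b by gradient descent on mean squared error.

    Hypothetical helper for illustration; lr and steps are arbitrary choices.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(steps):
        # "bi,i->b": for each batch row b, sum over the feature index i.
        pred = np.einsum("bi,i->b", X, w) + b
        err = pred - y
        # Gradient of mean(err**2) w.r.t. w: (2/n) * sum_b err[b] * X[b, i].
        grad_w = (2.0 / n_samples) * np.einsum("b,bi->i", err, X)
        grad_b = 2.0 * err.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic, noiseless data (an assumption made so the fit can be checked).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 0.3
y = X @ true_w + true_b

w, b = fit_linear_regression(X, y)
print("learned weights:", w)
print("learned bias:", b)
```

Spelling the contraction out as "bi,i->b" instead of relying on implicit broadcasting is what makes shape bugs easy to spot, which is the point of the einops and einsum links above.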

Sources/Readings:
- https://www.neelnanda.io/mechanistic-interpretability/prereqs by Neel Nanda: a bigger list of prerequisites, good for reference
- https://www.neelnanda.io/mechanistic-interpretability/getting-started by Neel Nanda: how to learn what our field is about
- https://www.lesswrong.com/posts/jLAvJt8wuSFySN975/mechanistic-interpretability-quickstart-guide?utm_campaign=post_share&utm_source=link by Neel Nanda: a compressed form of the above, with guidance on how specifically to discover something useful

Intro repo: https://github.com/apartresearch/interpretability-starter?tab=readme-ov-file

Crucial papers:
https://transformer-circuits.pub/2022/toy_model/index.html
https://transformer-circuits.pub/2023/monosemantic-features/index.html#setup-autoencoder

Other papers:
https://arxiv.org/abs/2305.16765
https://www.eleuther.ai/papers-blog/sparse-autoencoders-find-highly-interpretable-features-in-language-models
https://ar5iv.labs.arxiv.org/html/2309.08600
https://arxiv.org/abs/2307.15771


Our paper:
https://openreview.net/forum?id=GI5j6OMTju


Videos:

https://www.3blue1brown.com/lessons/gpt

https://www.3blue1brown.com/lessons/attention

https://www.3blue1brown.com/lessons/mlp