2025-11-08 | Reproducing Papers
Goal: Reproducing Automated Interp Paper
Summary: Read up to Background Section for (Neo et al. 2024)
Work sessions
| In | Out |
|---|---|
| 23:26 | 23:59 |
Neo et al. 2024
So far, I've read up to Section 3 (Background). Below are my notes and the concept prerequisites needed to reproduce the paper.
High Level Algorithm Steps
1. Identify next-token neurons; find prompts that highly activate them
2. Determine the attention heads most responsible for activating each next-token neuron, using a head attribution score
3. Use GPT-4 to generate explanations for the activity patterns of these attention heads
4. Evaluate explanation quality by using GPT-4 as a zero-shot classifier of head activity on a new prompt
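The four steps above can be sketched as a pipeline skeleton. This is purely hypothetical scaffolding, not the paper's code: every function name here (`find_next_token_neurons`, `attribute_heads`, `explain_head`, `evaluate_explanation`) is a placeholder I made up, and the model internals and GPT-4 calls are stubbed out.

```python
# Hypothetical skeleton of the four-step pipeline from Neo et al. 2024.
# All names are placeholders; the GPT-4 calls and model access are stubs.

def find_next_token_neurons(model):
    """Step 1: find neurons whose output weights match a token embedding."""
    return [("L19.N7", " the")]  # placeholder (neuron id, predicted token)

def attribute_heads(model, neuron):
    """Step 2: rank attention heads by attribution score to the neuron."""
    return ["L15.H3"]  # placeholder head id

def explain_head(head, prompts):
    """Step 3: ask GPT-4 for a natural-language explanation of head activity."""
    return f"{head} activates when ... (stubbed GPT-4 explanation)"

def evaluate_explanation(explanation, new_prompt):
    """Step 4: GPT-4 as a zero-shot classifier of head activity on a held-out prompt."""
    return True  # placeholder verdict

model, prompts, new_prompt = None, [], ""
for neuron, token in find_next_token_neurons(model):
    for head in attribute_heads(model, neuron):
        explanation = explain_head(head, prompts)
        print(neuron, token, head, evaluate_explanation(explanation, new_prompt))
```

The point is just the data flow: neurons feed into head attribution, which feeds into GPT-4 explanation and then GPT-4 evaluation.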
Very cool to see steps (3) and (4), which use GPT-4 to automate the interpretability work
Assumptions
- Attention head outputs (new context-informed representations of a token) can be associated with next-token neurons
- Next-token neuron = neuron whose output weight corresponds with the embedding of a token in the vocab
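A minimal numpy sketch of that second assumption, under my own assumptions about the setup: a next-token neuron is one whose MLP output weight row has high cosine similarity with some token's unembedding column. The matrices and the planted neuron below are toy stand-ins, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab, d_mlp = 16, 50, 64

# Toy stand-ins: unembedding (residual stream -> logits) and MLP output weights.
W_U = rng.normal(size=(d_model, d_vocab))
W_out = rng.normal(size=(d_mlp, d_model))

# Plant a "next-token neuron": neuron 7's output direction equals
# the unembedding of token 3.
W_out[7] = W_U[:, 3]

def next_token_neurons(W_out, W_U, threshold=0.9):
    """Flag (neuron, token) pairs whose directions are nearly parallel."""
    w = W_out / np.linalg.norm(W_out, axis=1, keepdims=True)
    u = W_U / np.linalg.norm(W_U, axis=0, keepdims=True)
    sims = w @ u  # cosine similarities, shape (d_mlp, d_vocab)
    return [(int(n), int(t), float(sims[n, t]))
            for n, t in np.argwhere(sims > threshold)]

# The planted pair (neuron 7, token 3) should be flagged with similarity ~1.0.
print(next_token_neurons(W_out, W_U))
```

Writing to that neuron's direction in the residual stream therefore pushes up the logit of exactly that token, which is what makes step (1) of the algorithm a weight-based lookup rather than an activation study.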
Concepts to look into
- Residual Stream as Shared Information
- Attention head attribution to the input weights of a next-token neuron, where head and neuron need not sit in the same decoder layer
- Next Token Neuron
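The last two concepts connect: because every head writes into the shared residual stream, a head in an earlier layer can drive a next-token neuron in a later layer. A toy sketch of a head attribution score, under my assumption that it is the head's direct linear contribution to the neuron's pre-activation (i.e. the dot product of the head's output with the neuron's input weight); the exact score in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_heads = 16, 4

# Per-head outputs written into the residual stream at some earlier layer.
head_outputs = rng.normal(size=(n_heads, d_model))

# Input weight of a next-token neuron in a *later* MLP layer; the residual
# stream carries each head's contribution forward to it.
w_in = rng.normal(size=d_model)
head_outputs[2] += 5.0 * w_in  # plant head 2 as strongly aligned with the neuron

# Attribution score per head: its dot product with the neuron's input weight,
# i.e. its direct linear contribution to the neuron's pre-activation.
attribution = head_outputs @ w_in
top_head = int(np.argmax(attribution))
print(top_head)  # the planted head 2
```

This is why the attribution is layer-agnostic: the dot product only cares about what each head wrote into the stream, not where it sits.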