2025-11-08 | Reproducing Papers
Goal: Reproducing Automated Interp Paper
Summary: Read up to Background Section for (Neo et al. 2024)
Work sessions
| In | Out |
|---|---|
| 23:26 | 23:59 |
Neo et al. 2024
So far, I've read up to Section 3 (Background). Below are my notes and the concept prerequisites needed to reproduce the paper.
High Level Algorithm Steps
1. Identify next-token neurons; find prompts that highly activate them
2. Determine the attention heads most responsible for activating each next-token neuron, using a head attribution score
3. Use GPT-4 to generate explanations for the activity patterns of these attention heads
4. Evaluate explanation quality by using GPT-4 as a zero-shot classifier of head activity on a new prompt
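The four steps above can be sketched as a pipeline skeleton. This is purely hypothetical scaffolding, not the paper's code: every function name here (`find_next_token_neurons`, `attribute_heads`, `explain_head`, `evaluate_explanation`) is a placeholder I made up, and the model internals and GPT-4 calls are stubbed out.

```python
# Hypothetical skeleton of the four-step pipeline from Neo et al. 2024.
# All names are placeholders; the GPT-4 calls and model access are stubs.

def find_next_token_neurons(model):
    """Step 1: find neurons whose output weights match a token embedding."""
    return [("L19.N7", " the")]  # placeholder (neuron id, predicted token)

def attribute_heads(model, neuron):
    """Step 2: rank attention heads by attribution score to the neuron."""
    return ["L15.H3"]  # placeholder head id

def explain_head(head, prompts):
    """Step 3: ask GPT-4 for a natural-language explanation of head activity."""
    return f"{head} activates when ... (stubbed GPT-4 explanation)"

def evaluate_explanation(explanation, new_prompt):
    """Step 4: GPT-4 as a zero-shot classifier of head activity on a held-out prompt."""
    return True  # placeholder verdict

model, prompts, new_prompt = None, [], ""
for neuron, token in find_next_token_neurons(model):
    for head in attribute_heads(model, neuron):
        explanation = explain_head(head, prompts)
        print(neuron, token, head, evaluate_explanation(explanation, new_prompt))
```

The point is just the data flow: neurons feed into head attribution, which feeds into GPT-4 explanation and then GPT-4 evaluation.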
Very cool to see steps (3) and (4), which use GPT-4 to automate the interpretability work
Assumptions
- Attention head outputs (new context-informed representations of a token) can be associated with next-token neurons
- Next-token neuron = neuron whose output weight corresponds with the embedding of a token in the vocab
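A minimal numpy sketch of that second assumption, under my own assumptions about the setup: a next-token neuron is one whose MLP output weight row has high cosine similarity with some token's unembedding column. The matrices and the planted neuron below are toy stand-ins, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab, d_mlp = 16, 50, 64

# Toy stand-ins: unembedding (residual stream -> logits) and MLP output weights.
W_U = rng.normal(size=(d_model, d_vocab))
W_out = rng.normal(size=(d_mlp, d_model))

# Plant a "next-token neuron": neuron 7's output direction equals
# the unembedding of token 3.
W_out[7] = W_U[:, 3]

def next_token_neurons(W_out, W_U, threshold=0.9):
    """Flag (neuron, token) pairs whose directions are nearly parallel."""
    w = W_out / np.linalg.norm(W_out, axis=1, keepdims=True)
    u = W_U / np.linalg.norm(W_U, axis=0, keepdims=True)
    sims = w @ u  # cosine similarities, shape (d_mlp, d_vocab)
    return [(int(n), int(t), float(sims[n, t]))
            for n, t in np.argwhere(sims > threshold)]

# The planted pair (neuron 7, token 3) should be flagged with similarity ~1.0.
print(next_token_neurons(W_out, W_U))
```

Writing to that neuron's direction in the residual stream therefore pushes up the logit of exactly that token, which is what makes step (1) of the algorithm a weight-based lookup rather than an activation study.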
Concepts to look into
- Residual Stream as Shared Information
- Attention head attribution to the input weights of a next-token neuron, where head and neuron need not sit in the same decoder layer
- Next Token Neuron
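The last two concepts connect: because every head writes into the shared residual stream, a head in an earlier layer can drive a next-token neuron in a later layer. A toy sketch of a head attribution score, under my assumption that it is the head's direct linear contribution to the neuron's pre-activation (i.e. the dot product of the head's output with the neuron's input weight); the exact score in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_heads = 16, 4

# Per-head outputs written into the residual stream at some earlier layer.
head_outputs = rng.normal(size=(n_heads, d_model))

# Input weight of a next-token neuron in a *later* MLP layer; the residual
# stream carries each head's contribution forward to it.
w_in = rng.normal(size=d_model)
head_outputs[2] += 5.0 * w_in  # plant head 2 as strongly aligned with the neuron

# Attribution score per head: its dot product with the neuron's input weight,
# i.e. its direct linear contribution to the neuron's pre-activation.
attribution = head_outputs @ w_in
top_head = int(np.argmax(attribution))
print(top_head)  # the planted head 2
```

This is why the attribution is layer-agnostic: the dot product only cares about what each head wrote into the stream, not where it sits.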