2025-11-08 | Reproducing Papers

Goal: Reproducing Automated Interp Paper

Summary: Read up to the Background section of Neo et al. (2024)

Work sessions

In Out
23:26 23:59

Neo et al. 2024

So far, I've read up to Section 3 (Background). Below are my notes and the concept prerequisites needed to reproduce the paper.

High Level Algorithm Steps

  1. Identify next-token neurons and find prompts that highly activate them
  2. Determine the attention heads most responsible for activating each next-token neuron, using a head attribution score
  3. Use GPT-4 to generate explanations for the activity patterns of these attention heads
  4. Evaluate explanation quality by using GPT-4 as a zero-shot classifier for head activity on a new prompt

Note: very cool to see steps (3) and (4), which use GPT-4 to automate the interpretability work.
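Step (1) above can be sketched in code. This is a minimal numpy sketch under my current understanding: a next-token neuron is one whose MLP output weight vector points in the same direction as some token's unembedding vector, so cosine similarity between the two is a natural detection criterion. The function name, the `threshold` value, and the cosine-similarity test are my assumptions, not necessarily the paper's exact method.

```python
import numpy as np

def find_next_token_neurons(w_out, w_unembed, threshold=0.9):
    """Flag neurons whose output weights align with a vocab embedding.

    w_out:     (n_neurons, d_model) MLP output weight rows, one per neuron.
    w_unembed: (vocab, d_model) unembedding matrix.
    Returns (neuron_idx, token_idx, cosine_sim) triples above `threshold`.
    NOTE: the 0.9 threshold and cosine criterion are illustrative assumptions.
    """
    # Normalise both sets of vectors so dot products become cosine similarities.
    w_out_n = w_out / np.linalg.norm(w_out, axis=1, keepdims=True)
    w_un_n = w_unembed / np.linalg.norm(w_unembed, axis=1, keepdims=True)
    sims = w_out_n @ w_un_n.T           # (n_neurons, vocab) similarity matrix
    best_tok = sims.argmax(axis=1)      # best-matching token per neuron
    best_sim = sims.max(axis=1)
    return [(i, int(best_tok[i]), float(best_sim[i]))
            for i in range(len(w_out)) if best_sim[i] >= threshold]
```

With a toy identity unembedding, a neuron whose output weight is a scaled basis vector is flagged as predicting that token, while a diffuse neuron is not.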

Assumptions

  1. Associate attention head output (a new, context-informed representation of a token) with a next-token neuron
    1. Next-token neuron = a neuron whose output weight corresponds to the embedding of a token in the vocab
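Because heads write additively into the residual stream, the head attribution score in step (2) can plausibly be computed by projecting each head's output onto the neuron's input weight direction. This is my sketch of that idea; the function name and the plain dot-product scoring are assumptions on my part, not confirmed details from the paper.

```python
import numpy as np

def head_attribution(head_outputs, w_in_neuron):
    """Score each attention head's contribution to a neuron's pre-activation.

    head_outputs: (n_heads, d_model) -- each head's output vector at the token
                  position of interest. Heads add into the residual stream, so
                  their contributions to any downstream read-off are linear.
    w_in_neuron:  (d_model,) -- the next-token neuron's input weight vector.
    Returns the per-head dot products: the slice of the neuron's
    pre-activation attributable to each head (assumed scoring rule).
    """
    return head_outputs @ w_in_neuron
```

A head whose output lies along `w_in_neuron` gets a large score; an orthogonal head gets zero, matching the intuition that only aligned heads drive the neuron.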

Concepts to look into

  1. Residual stream as shared information
  2. Attention head attribution to the input weights of a next-token neuron, not necessarily in the same decoder layer
  3. Next-token neurons
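For concept (1), the key fact is that every component reads the current residual stream and adds its output back, so the final stream decomposes into a sum of per-component contributions; this is also what lets a head in an earlier layer be attributed to a neuron in a later one, as in concept (2). A toy sketch (my own illustration, not code from the paper):

```python
import numpy as np

def run_residual(x0, components):
    """Toy residual stream: each component reads the stream and adds to it.

    x0:         (d_model,) initial embedding vector.
    components: callables (e.g. attention heads, MLPs) mapping stream -> output.
    Returns the final stream and the list of additive contributions; the final
    stream equals the sum of the contributions, which is why attribution
    across layers is possible.
    """
    stream = x0.copy()
    contributions = [x0.copy()]      # the embedding's own contribution
    for f in components:
        out = f(stream)              # component reads the shared stream...
        contributions.append(out)
        stream = stream + out        # ...and writes back additively
    return stream, contributions
```

Summing the recorded contributions reproduces the final stream exactly, which is the linear-decomposition property that head attribution relies on.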