
2026-01-24 | Reading && Thesis

Goal: Read Seminal Papers in Interpretability | Propose Master's Thesis

Summary: Deep dive into tensor products as used in A Mathematical Framework for Transformer Circuits | Start Literature Review for Thesis

Work sessions

In Out
10:40 16:30

Today is going to be a little bit weird! Travelling = No Wifi = No AI Assistance to understand papers. This will be super interesting :)

Work Session Goals

Potential Goals for work session

  • Don't fall asleep so I don't get jet lag

  • Read

  • A Mathematical Framework for Transformer Circuits (this is probably the most important one and the one that will take the longest)
  • Towards Monosemanticity
  • Scaling Monosemanticity
  • Transcoders Paper
  • Building and evaluating alignment auditing agents
  • Unfortunately can't do Toy Models of Superposition because I didn't save that paper

  • ACDC; probably better to program at a real desk with an internet connection (AI and pytorch docs)

  • Thank you emails (realized this is kind of hard without wifi)

  • Notes to family and friends (this is possible cuz it's just ink!)

Priorities and Timing

  1. Notes

  2. Write Thesis Proposal and Experiments

  3. Read Interp literature

There are 7.5 hours left in the flight (wow, time goes by fast)! I assume there will be a feeding time, so subtracting 1 hour for that and at least 30 minutes of breaks leaves about 6 hours of work time.

Quick Jot of Common Experimental Methods in Interpretability

Circuits

  1. Attention (QK-OV) Circuits; Path Patching
  2. Attribution Graphs; use either Cross Layer Transcoders (CLTs) or per-layer Transcoders
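A minimal sketch of the patching idea behind these circuit methods, on a toy two-layer network (the real method runs a transformer on clean and corrupted prompts and splices per-head or per-path activations; every name and size here is an illustrative stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def layer1(x):
    # stand-in for an upstream component (e.g. an attention head's output)
    return np.maximum(x @ W1, 0.0)

def layer2(h):
    # stand-in for the downstream computation
    return h @ W2

x_clean = rng.normal(size=d)
x_corrupt = rng.normal(size=d)

# Baseline runs on both inputs
out_clean = layer2(layer1(x_clean))
out_corrupt = layer2(layer1(x_corrupt))

# Patch: splice the first half of the clean intermediate activation (one
# hypothetical "head"'s slice) into the corrupted run, then rerun downstream.
h_patched = layer1(x_corrupt).copy()
h_patched[:4] = layer1(x_clean)[:4]
out_patched = layer2(h_patched)

# How much did that one component move the output away from the corrupted run?
effect = np.linalg.norm(out_patched - out_corrupt)
```

Ranking components by this kind of effect size is the shared core of path patching and attribution-graph construction; the methods differ in which edges get patched and how effects are attributed.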

Sparsifying Methods

Linear Methods

  1. Steering Vectors: Assume that an activation (let's say in the residual stream) \(h\) decomposes into a steering vector \(v\), an on-or-off coefficient \(z\), and noise \(\epsilon\) such that

$$h = z \cdot v + \epsilon$$

  • TODO: is \(z\) in {0, 1} or {-1, 1}, and does \(v\) have shape d_model when steering the residual stream? Does \(v\) need to be sparse? I don't think so
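A minimal numpy sketch of this decomposition and of steering at inference time (the unit-norm \(v\), the coefficient alpha, and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical steering vector v (shape d_model) and scalar on/off coefficient z.
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)              # unit-norm direction
z = 1.0                             # feature "on"
eps = 0.01 * rng.normal(size=d_model)

h = z * v + eps                     # activation under the model h = z*v + eps

# Steering at inference: add alpha * v to the activation to push the feature on.
alpha = 2.0
h_steered = h + alpha * v

# The projection onto v increases by exactly alpha (since v is unit-norm).
proj_before = h @ v
proj_after = h_steered @ v
```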

  • Linear Probe: \(\hat{y} = \sigma (v \cdot x + b)\)

  • TODO: does
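The linear probe above can be sketched end-to-end on toy data (the hidden label direction u, the sizes, and the training hyperparameters are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 400

# Toy "activations": binary label decided by a hidden direction u.
u = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ u > 0).astype(float)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Fit probe parameters v, b by plain gradient descent on the logistic loss.
v, b = np.zeros(d), 0.0
lr = 0.5
for _ in range(300):
    p = sigmoid(X @ v + b)          # y_hat = sigma(v . x + b)
    grad_v = X.T @ (p - y) / n
    grad_b = (p - y).mean()
    v -= lr * grad_v
    b -= lr * grad_b

acc = ((sigmoid(X @ v + b) > 0.5) == (y > 0.5)).mean()
```

On linearly separable data like this, the learned \(v\) recovers (a scaling of) the hidden direction, which is exactly what probing an activation for a feature hopes to exploit.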

  • LoRA Adapters (to my knowledge not much work on interpreting LoRA adapters, since they come from a training-efficiency motivation): for each weight matrix in the model, learn low-rank matrices A and B (rank usually 2-16) whose product is added:

Given a weight matrix \(W \in \mathbb{R}^{d1 \times d2}\) and matrices \(A \in \mathbb{R}^{d1 \times R}\), \(B \in \mathbb{R}^{R \times d2}\), compute the transformation of vector \(x \in \mathbb{R}^{d1}\) as:

\[x' = xW + (xA)B\]
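A minimal sketch of this LoRA forward pass (shapes and the zero-init of B follow the usual recipe; the scaling factor alpha/r that real implementations apply to the update is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 32, 24, 4               # r is the low rank (typically 2-16)

W = rng.normal(size=(d1, d2))       # frozen base weight
# Standard LoRA init: A random, B zeros, so the adapter starts as a no-op.
A = rng.normal(size=(d1, r))
B = np.zeros((r, d2))

def lora_forward(x, W, A, B):
    # x' = xW + (xA)B : base transform plus a rank-r update
    return x @ W + (x @ A) @ B

x = rng.normal(size=d1)
out = lora_forward(x, W, A, B)
```

During fine-tuning only A and B receive gradients; the rank-r bottleneck is what makes the update cheap to store and train.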

Sparse MLP Methods

  1. Sparse Autoencoder (an MLP with an overcomplete (8x, 16x, 32x) up projection and an L1 penalty on its activations) trained to reconstruct activations (which can be taken from anywhere; commonly the residual stream or MLP activations)

  2. Transcoders: map \(MLP_{in}\) to \(MLP_{out}\) with a sparse up_projection
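The two sparse-MLP methods above share the same architecture and differ only in the reconstruction target; a minimal numpy sketch of the forward pass and loss (all sizes and coefficients are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, expansion = 16, 8               # 8x overcomplete feature dictionary
d_hidden = d_model * expansion

W_enc = 0.1 * rng.normal(size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = 0.1 * rng.normal(size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode_decode(x):
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse (ReLU) feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction from features
    return f, x_hat

h = rng.normal(size=d_model)             # an activation, e.g. from the residual stream
f, h_hat = encode_decode(h)

# SAE loss: reconstruct the same activation, plus an L1 sparsity penalty on f.
l1_coeff = 1e-3
sae_loss = np.sum((h - h_hat) ** 2) + l1_coeff * np.sum(np.abs(f))

# A transcoder uses the identical architecture but a different target:
# input MLP_in, target MLP_out, i.e. loss = ||MLP_out - x_hat||^2 + l1 * ||f||_1.
```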

Levels of Abstraction

  1. Attention Circuits
  2. Feature Graphs (subgraph of Attribution Graphs)
  3. Neurons (problematic because polysemantic)
  4. Attention Head / MLP