2026-01-24 | Reading && Thesis
Goal: Read Seminal Papers in Interpretability | Propose Master's Thesis
Summary: Deep dive into Tensor Products used in Mathematical Framework for Transformers | Start Literature Review for Thesis
Work sessions
| In | Out |
|---|---|
| 10:40 | 16:30 |
Today is going to be a little bit weird! Travelling = No Wifi = No AI Assistance to understand papers. This will be super interesting :)
Work Session Goals
Potential Goals for work session
- Don't fall asleep so I don't get jet lag
- Read
  - A Mathematical Framework for Transformers (this is probably the most important one and the one that will take the longest)
  - Towards Monosemanticity
  - Scaling Monosemanticity
  - Transcoders Paper
  - Building and evaluating alignment auditing agents
- Unfortunately can't do Toy Models of Superposition because I didn't save that paper
- ACDC; probably better to program at a real desk with an internet connection (AI and PyTorch docs)
- Thank-you emails (realized this is kind of hard without wifi)
- Notes to family and friends (this is possible cuz it's just ink!)
Priorities and Timing
- Notes
- Write Thesis Proposal and Experiments
- Read Interp literature
There are 7.5 hours left in the flight (wow, time goes by fast)! I assume there will be a feeding time, so subtracting 1 hour for that plus 30 mins of breaks leaves at least 6 working hours.
Quick Jot of Common Experimental Methods in Interpretability
Circuits
- Attention (QK-OV) Circuits; Path Patching
- Attribution Graphs; use either Cross Layer Transcoders (CLTs) or per-layer Transcoders
Sparsifying Methods
Linear Methods
- Steering Vectors: Assume an activation (say, in the residual stream) \(h\) decomposes into a steering vector \(w\), an on-or-off coefficient \(z\), and noise \(\epsilon\), such that
$$h = z \cdot w + \epsilon$$
  - TODO: is z all 1's or all -1's with shape d_model when steering the residual stream? Does w need to be sparse? I don't think so
- Linear Probe: \(\hat{y} = \sigma(v \cdot x + b)\)
  - TODO: does
- LoRA Adaptors (not too much work on interpreting LoRA adaptors to my knowledge, since they come from a training-efficiency motivation): for each weight matrix in the model, learn low-rank (usually R = 2-16) matrices A and B to add.
Given a weight matrix \(W \in \mathbb{R}^{d_1 \times d_2}\) and matrices \(A \in \mathbb{R}^{d_1 \times R}\), \(B \in \mathbb{R}^{R \times d_2}\), compute the transformation of a vector \(x \in \mathbb{R}^{d_1}\) as
$$x' = x(W + AB) = xW + (xA)B$$
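The steering-vector decomposition above can be sketched numerically; everything here (the dimensions, the random vectors, the scale alpha) is illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Assumed decomposition from the note: h = z * w + eps, where w is the
# steering vector, z switches the feature on/off, and eps is noise.
w = rng.normal(size=d_model)           # steering vector (placeholder)
z = 1.0                                # feature "on"
eps = 0.01 * rng.normal(size=d_model)  # small noise
h = z * w + eps                        # activation in the residual stream

# Steering at inference time: add a scaled copy of w to the activation.
alpha = 2.0
h_steered = h + alpha * w

# The steered activation points more strongly along w than h did.
assert np.dot(h_steered, w) > np.dot(h, w)
```

Subtracting `alpha * w` instead would suppress the feature rather than amplify it.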
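The linear probe \(\hat{y} = \sigma(v \cdot x + b)\) is just a logistic classifier on activations; a minimal sketch with made-up \(v\), \(b\), and \(x\):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
d_model = 4

# In practice v and b are trained on (activation, concept-label) pairs;
# here they are random placeholders.
v = rng.normal(size=d_model)  # probe direction
b = 0.0

x = rng.normal(size=d_model)  # an activation vector
y_hat = sigmoid(v @ x + b)    # probability the concept is present

assert 0.0 < y_hat < 1.0
```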
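The LoRA forward pass with the shapes above can be sketched as follows (random placeholder weights; real adapters are trained, and common implementations also scale the AB term by alpha/R):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, R = 16, 8, 4  # R is the LoRA rank (commonly 2-16)

W = rng.normal(size=(d1, d2))        # frozen pretrained weight
A = 0.01 * rng.normal(size=(d1, R))  # low-rank factors, learned
B = 0.01 * rng.normal(size=(R, d2))

x = rng.normal(size=d1)  # input (row) vector

# LoRA forward pass: x(W + AB), computed without ever materializing
# the full d1 x d2 update matrix AB.
x_out = x @ W + (x @ A) @ B

assert np.allclose(x_out, x @ (W + A @ B))
```

The point of the factorization is that only `d1*R + R*d2` parameters are trained instead of `d1*d2`.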
Sparse MLP Methods
- Sparse Autoencoder (SAE): an MLP with an overcomplete (8x, 16x, 32x) up-projection space and an L1 penalty on its activations, trained to reconstruct model activations (these can come from anywhere; commonly the residual stream or MLP activations)
- Transcoders: map \(MLP_{in}\) to \(MLP_{out}\) with a sparse up-projection
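The SAE forward pass and training objective described above, as an untrained sketch (random weights and toy dimensions; a real SAE learns these weights by minimizing exactly this loss):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, expansion = 16, 8    # 8x overcomplete dictionary
d_hidden = d_model * expansion

W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_dec = np.zeros(d_model)

h = rng.normal(size=d_model)  # activation to reconstruct (e.g. residual stream)

f = np.maximum(0.0, h @ W_enc + b_enc)  # ReLU feature activations
h_hat = f @ W_dec + b_dec               # reconstruction of h

# Training objective: reconstruction error plus L1 sparsity penalty on f.
l1_coeff = 1e-3
loss = np.sum((h - h_hat) ** 2) + l1_coeff * np.sum(np.abs(f))

assert loss >= 0.0
```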
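The transcoder mapping above, sketched with random placeholder weights (toy dimensions; unlike an SAE, the reconstruction target is the MLP output rather than the input itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 128  # sparse, overcomplete hidden layer

W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

mlp_in = rng.normal(size=d_model)    # residual-stream input to the MLP
f = np.maximum(0.0, mlp_in @ W_enc)  # sparse features
mlp_out_hat = f @ W_dec              # predicted MLP output

# Trained with ||MLP(mlp_in) - mlp_out_hat||^2 plus an L1 penalty on f,
# so f gives a sparse, interpretable replacement for the MLP's computation.
assert np.all(f >= 0.0)
```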
Levels of Abstraction
- Attention Circuits
- Feature Graphs (subgraph of Attribution Graphs)
- Neurons (problematic because polysemantic)
- Attention Head / MLP