2026-01-24 | Reading && Thesis
Goal: Read Seminal Papers in Interpretability | Propose Master's Thesis
Summary: Deep dive into Tensor Products used in Mathematical Framework for Transformers | Start Literature Review for Thesis
Work sessions
| In | Out |
|---|---|
| 10:40 | 16:30 |
Today is going to be a little bit weird! Travelling = No Wifi = No AI Assistance to understand papers. This will be super interesting :)
Work Session Goals
Potential Goals for work session
- Don't fall asleep so I don't get jet lag
- Read
  - A Mathematical Framework for Transformers (this is probably the most important one and the one that will take the longest)
  - Towards Monosemanticity
  - Scaling Monosemanticity
  - Transcoders Paper
  - Building and evaluating alignment auditing agents
- Unfortunately can't do Toy Models of Superposition because I didn't save that paper
- ACDC; probably better to program at a real desk with an internet connection (AI and PyTorch docs)
- Thank-you emails (realized this is kind of hard without wifi)
- Notes to family and friends (this is possible cuz it's just ink!)
Priorities and Timing
- Notes
- Write Thesis Proposal and Experiments
- Read Interp literature
There are 7.5 hours left in the flight (wow, time goes by fast)! I assume there will be a feeding time, so subtracting 1 hour for that plus 30 mins of breaks leaves at least 6 working hours.
Quick Jot of Common Experimental Methods in Interpretability
Circuits
- Attention (QK-OV) Circuits; Path Patching
- Attribution Graphs; use either Cross Layer Transcoders (CLTs) or per-layer Transcoders
Sparsifying Methods
Linear Methods
- Steering Vectors: Assume an activation (say, in the residual stream) \(h\) decomposes into a steering vector \(w\), an on-or-off coefficient \(z\), and noise \(\epsilon\), such that
$$h = z \cdot w + \epsilon$$
  - TODO: is z all 1's or all -1's with shape d_model when steering the residual stream? Does w need to be sparse? I don't think so
- Linear Probe: \(\hat{y} = \sigma(v \cdot x + b)\)
  - TODO: does
- LoRA Adaptors (not too much work on interpreting LoRA adaptors to my knowledge, since they come from a training-efficiency motivation): for each weight matrix in the model, learn low-rank (usually R = 2-16) matrices A and B to add.
Given a weight matrix \(W \in \mathbb{R}^{d_1 \times d_2}\) and matrices \(A \in \mathbb{R}^{d_1 \times R}\), \(B \in \mathbb{R}^{R \times d_2}\), compute the transformation of a vector \(x \in \mathbb{R}^{d_1}\) as
$$x' = x(W + AB) = xW + (xA)B$$
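The steering-vector decomposition above can be sketched numerically; everything here (the dimensions, the random vectors, the scale alpha) is illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Assumed decomposition from the note: h = z * w + eps, where w is the
# steering vector, z switches the feature on/off, and eps is noise.
w = rng.normal(size=d_model)           # steering vector (placeholder)
z = 1.0                                # feature "on"
eps = 0.01 * rng.normal(size=d_model)  # small noise
h = z * w + eps                        # activation in the residual stream

# Steering at inference time: add a scaled copy of w to the activation.
alpha = 2.0
h_steered = h + alpha * w

# The steered activation points more strongly along w than h did.
assert np.dot(h_steered, w) > np.dot(h, w)
```

Subtracting `alpha * w` instead would suppress the feature rather than amplify it.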
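The linear probe \(\hat{y} = \sigma(v \cdot x + b)\) is just a logistic classifier on activations; a minimal sketch with made-up \(v\), \(b\), and \(x\):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
d_model = 4

# In practice v and b are trained on (activation, concept-label) pairs;
# here they are random placeholders.
v = rng.normal(size=d_model)  # probe direction
b = 0.0

x = rng.normal(size=d_model)  # an activation vector
y_hat = sigmoid(v @ x + b)    # probability the concept is present

assert 0.0 < y_hat < 1.0
```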
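The LoRA forward pass with the shapes above can be sketched as follows (random placeholder weights; real adapters are trained, and common implementations also scale the AB term by alpha/R):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, R = 16, 8, 4  # R is the LoRA rank (commonly 2-16)

W = rng.normal(size=(d1, d2))        # frozen pretrained weight
A = 0.01 * rng.normal(size=(d1, R))  # low-rank factors, learned
B = 0.01 * rng.normal(size=(R, d2))

x = rng.normal(size=d1)  # input (row) vector

# LoRA forward pass: x(W + AB), computed without ever materializing
# the full d1 x d2 update matrix AB.
x_out = x @ W + (x @ A) @ B

assert np.allclose(x_out, x @ (W + A @ B))
```

The point of the factorization is that only `d1*R + R*d2` parameters are trained instead of `d1*d2`.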
Sparse MLP Methods
- Sparse Autoencoder (SAE): an MLP with an overcomplete (8x, 16x, 32x) up-projection space and an L1 penalty on its activations, trained to reconstruct model activations (these can come from anywhere; commonly the residual stream or MLP activations)
- Transcoders: map \(MLP_{in}\) to \(MLP_{out}\) with a sparse up-projection
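The SAE forward pass and training objective described above, as an untrained sketch (random weights and toy dimensions; a real SAE learns these weights by minimizing exactly this loss):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, expansion = 16, 8    # 8x overcomplete dictionary
d_hidden = d_model * expansion

W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_dec = np.zeros(d_model)

h = rng.normal(size=d_model)  # activation to reconstruct (e.g. residual stream)

f = np.maximum(0.0, h @ W_enc + b_enc)  # ReLU feature activations
h_hat = f @ W_dec + b_dec               # reconstruction of h

# Training objective: reconstruction error plus L1 sparsity penalty on f.
l1_coeff = 1e-3
loss = np.sum((h - h_hat) ** 2) + l1_coeff * np.sum(np.abs(f))

assert loss >= 0.0
```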
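The transcoder mapping above, sketched with random placeholder weights (toy dimensions; unlike an SAE, the reconstruction target is the MLP output rather than the input itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 128  # sparse, overcomplete hidden layer

W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

mlp_in = rng.normal(size=d_model)    # residual-stream input to the MLP
f = np.maximum(0.0, mlp_in @ W_enc)  # sparse features
mlp_out_hat = f @ W_dec              # predicted MLP output

# Trained with ||MLP(mlp_in) - mlp_out_hat||^2 plus an L1 penalty on f,
# so f gives a sparse, interpretable replacement for the MLP's computation.
assert np.all(f >= 0.0)
```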
Levels of Abstraction
- Attention Circuits
- Feature Graphs (subgraph of Attribution Graphs)
- Neurons (problematic because polysemantic)
- Attention Head / MLP