Doing The Thing

Running Todo List

Math

Information Theory

[Added 2025-10-27] Surprisal vs. Perplexity vs. Entropy vs. Cross-Entropy
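As a starting point for this item, here is a minimal sketch of how the four quantities relate, using only the standard definitions in bits (log base 2). The distributions `p` and `q` and all function names are made up for illustration; `p` plays the role of the true distribution and `q` the model's prediction.

```python
import math

def surprisal(prob):
    # Information content of a single outcome, in bits.
    return -math.log2(prob)

def entropy(p):
    # Expected surprisal under p when outcomes really follow p.
    return sum(pi * surprisal(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Expected surprisal under the model q when outcomes really follow p.
    return sum(pi * surprisal(qi) for pi, qi in zip(p, q) if pi > 0)

def perplexity(p, q):
    # Exponentiated cross-entropy: the "effective number of choices".
    return 2 ** cross_entropy(p, q)

# Illustrative (hypothetical) distributions over three outcomes.
p = [0.5, 0.25, 0.25]   # true distribution
q = [0.25, 0.25, 0.5]   # model distribution

print("entropy H(p):        ", entropy(p))
print("cross-entropy H(p,q):", cross_entropy(p, q))
print("perplexity of q on p:", perplexity(p, q))
print("perplexity of p on p:", perplexity(p, p))
```

Cross-entropy is never smaller than entropy, and the gap between them is the KL divergence from `p` to `q`, so a perfect model has perplexity equal to 2 raised to the entropy.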

Mechanistic Interpretability

[Added 2025-10-21] Learn ARENA curriculum

Seminal Mech Interp Papers

[Added 2026-01-18] Toy Models of Superposition
[Added 2026-01-18] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
[Added 2026-01-18] Circuit Tracing: Revealing Computational Graphs in Language Models

Mech Interp Concepts to Master

From IOI Paper: (1) Logit, Head, Layer Attribution, Logit Diffs (2) Activation Patching (3) Path Patching
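To make logit diffs and activation patching concrete before working through the paper, here is a toy sketch in NumPy. It is not the IOI setup: the two-layer network, its random weights, the inputs, and every name below are assumptions chosen only to show the mechanics (run clean, run corrupted, copy one clean activation into the corrupted run, measure how much of the logit difference comes back).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # input -> hidden (toy weights)
W2 = rng.normal(size=(8, 3))   # hidden -> logits (toy weights)

def forward(x, patch=None):
    """Run the toy model; optionally overwrite hidden units before the readout."""
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden activations
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value                   # activation patching step
    return h @ W2, h

def logit_diff(logits, correct=0, wrong=1):
    # Metric: how strongly the model prefers the correct answer over the wrong one.
    return logits[correct] - logits[wrong]

clean_x = rng.normal(size=4)     # stands in for the clean prompt
corrupt_x = rng.normal(size=4)   # stands in for the corrupted prompt

clean_logits, clean_h = forward(clean_x)
corrupt_logits, _ = forward(corrupt_x)
clean_ld = logit_diff(clean_logits)
corrupt_ld = logit_diff(corrupt_logits)

# Patch each hidden unit from the clean run into the corrupted run, one at a time.
effects = []
for i in range(W1.shape[1]):
    patched_logits, _ = forward(corrupt_x, patch=(i, clean_h[i]))
    effects.append(logit_diff(patched_logits) - corrupt_ld)

for i, e in enumerate(effects):
    print(f"hidden unit {i}: change in logit diff = {e:+.3f}")
```

In this toy the per-unit effects add up exactly to the clean-minus-corrupted logit difference, because the readout after the patched layer is linear. In a real transformer, later nonlinear layers break that additivity, which is part of why path patching is a separate concept to learn.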

Mech Interp Safety Applications

[Added 2026-01-18] Refusal Direction is Universal Across Safety-Aligned Languages
[Added 2026-01-18] LLMs Encode Harmfulness and Refusal Separately

Automated Circuit Discovery

[Added 2026-01-19] Transformer Circuit Faithfulness Metrics are not Robust (a response to ACDC)
[Added 2026-01-19] Towards Automated Circuit Discovery for Mechanistic Interpretability

Interp for Monitoring/Control

[Added 2026-01-19] Building Production-Ready Probes For Gemini

RL

[Added 2026-01-30] A Case for Model Persona Research; thanks to Daniel Tan for sharing this at LISA during the Evan Hubinger talk
[Added 2026-01-30] Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Reading

[Added 2025-12-01] Add How Can Interpretability Researchers Help AGI Go Well? to reading list
[Added 2025-10-27] Read Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (from the OpenAI scheming blog)
[Added 2025-10-26] Download and annotate Circuit Tracing: Revealing Computational Graphs in Language Models (March 27th, 2025)
- [Added 2025-10-26] Perhaps try to reproduce on a smaller model?
- Additional reference: [Added 2025-10-26] On the Biology of a Large Language Model
[Added 2025-10-25] Read Tracing the thoughts of a large language model
[Added 2025-10-21] Read Progress measures for grokking via mechanistic interpretability by Nanda et al.
[Added 2025-10-21] Look into reproducing Progress measures for grokking via mechanistic interpretability by Nanda et al.

Documentation

[Added 2025-10-24] Add section on current project (Dyck Interp Probe)
[Completed 2025-10-23] Document progress from 2025-10-17 to 2025-10-20