Running Todo List
Math
- [Added 2025-12-30] Area Under Curve (AUC) and Receiver Operating Characteristic (ROC) Test
- [Added 2025-12-30] Wilcoxon Signed-Rank Test
- Go through StatQuest Statistics Lessons
- [Completed 2025-10-27] [Added 2025-10-25] Probability Distributions [StatQuest Video]
- [Completed 2025-10-27] [Added 2025-10-25] Normal Distributions [StatQuest Video]
- [Added 2025-10-25] Mean, Variance, Standard Deviation [StatQuest Video]
- [Added 2025-11-03] (n-1) vs. (n) in estimate of population variance [StatQuest Video]
- [Added 2025-10-25] Mathematical Models [StatQuest Video]
- [Added 2025-10-25] Covariance [StatQuest Video]
- [Added 2025-10-25] Conditional Probabilities [StatQuest Video]
- [Added 2025-10-25] Bayes' Theorem [StatQuest Video]
- [Added 2025-10-25] Expected Value
- [Added 2025-10-25] Expected Value, Discrete [StatQuest Video]
- [Added 2025-10-25] Expected Value, Continuous [StatQuest Video]
- [Added 2025-10-25] Central Limit Theorem [StatQuest Video]
Information Theory
- [Added 2025-10-27] Surprisal vs. Perplexity vs. Entropy vs. Cross-Entropy
Mechanistic Interpretability
- [Added 2025-10-21] Learn ARENA curriculum
Seminal Mech Interp Papers
- [Added 2026-01-18] Toy Models of Superposition
- [Added 2026-01-18] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- [Added 2026-01-18] Circuit Tracing: Revealing Computational Graphs in Language Models
Mech Interp Concepts to Master
- From IOI Paper: (1) Logit, Head, Layer Attribution, Logit Diffs (2) Activation Patching (3) Path Patching
Mech Interp Safety Application
- [Added 2026-01-18] Refusal Direction is Universal Across Safety-Aligned Languages
- [Added 2026-01-18] LLMs Encode Harmfulness and Refusal Separately
Automated Circuit Discovery
- [Added 2026-01-19] Transformer Circuit Faithfulness Metrics are not Robust (response to ACDC)
- [Added 2026-01-19] Towards Automated Circuit Discovery for Mechanistic Interpretability
Interp for Monitoring/Control
- [Added 2026-01-19] Building Production-Ready Probes For Gemini
RL
- [Added 2026-01-30] A Case for Model Persona Research; thanks to Daniel Tan for sharing this at LISA during the Evan Hubinger talk
- [Added 2026-01-30] Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Reading
- [Added 2025-12-01] Add How Can Interpretability Researchers Help AGI Go Well? to reading list
- [Added 2025-10-27] From OpenAI Scheming Blog, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
- [Added 2025-10-26] Download and annotate Circuit Tracing: Revealing Computational Graphs in Language Models (March 27th, 2025)
- [Added 2025-10-26] Perhaps try to reproduce on a smaller model?
- [Added 2025-10-26] Additional reference: On the Biology of a Large Language Model
- [Added 2025-10-25] Read Tracing the thoughts of a large language model
- [Added 2025-10-21] Read Progress measures for grokking via mechanistic interpretability by Nanda et al.
- [Added 2025-10-21] Look into reproducing Progress measures for grokking via mechanistic interpretability by Nanda et al.
Documentation
- [Added 2025-10-24] Add section on current project (Dyck Interp Probe)
- [Completed 2025-10-23] Document progress from 2025-10-17 to 2025-10-20