NeurIPS Mech Interp Workshop 2025

Date: December 7, 2025

Location: San Diego Convention Center

Time: The whole day! Four hours of driving from LA to San Diego, but definitely worth it!

Notes

  1. Very nice to see the different perspectives in the mechanistic interpretability community, specifically Curiosity and Ambitious Mechanistic Interpretability (AMI) versus Pragmatic Interpretability

  2. Some very cool people to learn from (and it was really cool meeting them): Sarah from Transluce, Neel Nanda, David Bau, Leo Gao, J

  3. Takeaways

    • LLMs are probabilistic/continuous models of cognition; interpretability approaches like SAEs (sparse autoencoders) and linear probes seek to find symbolic/discrete representations
    • When thinking about distributed/connectionist and localist models of the world, it seems that the world is inherently continuous/probabilistic and our models of cognition can create symbolic/discrete representations as abstractions
    • Pragmatism and Ambition/Curiosity are two sides of the same coin. In line with the point above, perhaps we should channel our curiosity through measurable proxy goals so that we can make progress on our ambitious overarching goals
    • Probing a model's handling of inverse-scope quantification could be an interesting way to study how models represent empirical language phenomena (e.g., English vs. Mandarin representations) at the intersection of multilingualism and semantics
Snippet of a Cross-linguistic Inverse Scope Probe

Also, this past weekend I had the chance to attend NeurIPS and started thinking about what it would mean to reproduce Scontras et al. on cross-linguistic scope ambiguity! So far, I've prompted a few frontier models (GPT-5, Claude, Gemini, and DeepSeek) and found that, for the most part, the models allow the inverse-scope reading given a Mandarin prompt, even though Mandarin is generally scope-rigid and should not permit it. Some recent mechanistic interpretability research has found that "shared circuitry increases with model scale," so larger models share more latent representations (Brinkmann et al. 2025). When the winter holiday rolls around, I'm curious to do a more rigorous probe of how multilingual models may represent (or not represent) inverse scope, and perhaps either result would be interesting!
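To make the probe idea concrete, here's a minimal sketch of how the prompt pairs might be set up. The example sentences, the translations, and the `model_query` placeholder are all my own illustrative assumptions (a real run would swap in actual API calls to GPT-5, Claude, Gemini, or DeepSeek), not the stimuli from Scontras et al.:

```python
# Hypothetical sketch of a cross-linguistic inverse-scope probe.
# The sentence pairs below are illustrative, not the original stimuli.
# A doubly-quantified sentence has a surface-scope reading (every > a)
# and a potential inverse-scope reading (a > every).

PROBES = {
    "en": {
        "sentence": "Every shark attacked a pirate.",
        "question": (
            "Can this sentence mean that there is one single pirate "
            "whom every shark attacked? Answer yes or no."
        ),
        # English allows the inverse-scope reading.
        "inverse_scope_expected": True,
    },
    "zh": {
        "sentence": "每条鲨鱼都攻击了一个海盗。",
        "question": "这句话能表示所有鲨鱼攻击的是同一个海盗吗？请回答是或否。",
        # Mandarin is generally scope-rigid: inverse scope should be unavailable.
        "inverse_scope_expected": False,
    },
}


def build_prompt(lang: str) -> str:
    """Assemble a single probe prompt for the given language."""
    probe = PROBES[lang]
    return f"{probe['sentence']}\n{probe['question']}"


def score_response(lang: str, model_says_yes: bool) -> bool:
    """True if the model's judgment matches the linguistic prediction."""
    return model_says_yes == PROBES[lang]["inverse_scope_expected"]


if __name__ == "__main__":
    for lang in PROBES:
        # model_query(build_prompt(lang)) would go here with a real API client.
        print(f"[{lang}]\n{build_prompt(lang)}\n")
```

The interesting case is when `score_response("zh", True)` comes back `False`: the model accepted an inverse-scope reading in Mandarin that the scope-rigidity literature predicts should be unavailable, which is exactly the pattern I saw informally.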