Doing The Thing

Towards Automated Circuit Discovery for Mechanistic Interpretability

Authors: Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Publication Date: 2023-10-28

Full Paper: Towards Automated Circuit Discovery for Mechanistic Interpretability

In-Progress Reproduction: GitHub Repo

Notes

Keywords

Mech Interp (mechanistic interpretability) -> "reverse-engineering model components into human-understandable algorithms"
Models -> computational graphs that perform a task
Circuits -> "subgraphs with distinct functionality" within a model's graph
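
To make the graph/subgraph vocabulary concrete, here is a minimal toy sketch. It is not the paper's code or its algorithm: the node names (`head_0`, `mlp_0`, and so on) and the helpers `edges` and `is_subgraph` are hypothetical, chosen only to show a model as a directed graph and a circuit as a subgraph of it.

```python
# Toy illustration only: node names and helpers are hypothetical,
# not taken from the paper or its codebase.

# A "model" as a directed acyclic graph: each key is a node and each
# value is the set of nodes that receive its output.
model_graph = {
    "input": {"head_0", "head_1", "mlp_0"},
    "head_0": {"mlp_0", "output"},
    "head_1": {"output"},
    "mlp_0": {"output"},
    "output": set(),
}


def edges(graph):
    """Return the graph's edges as a set of (source, destination) pairs."""
    return {(src, dst) for src, dsts in graph.items() for dst in dsts}


def is_subgraph(candidate, graph):
    """True if every node and every edge of `candidate` also appears in `graph`."""
    return set(candidate) <= set(graph) and edges(candidate) <= edges(graph)


# A "circuit": a subgraph we hypothesize carries out one distinct behavior.
circuit = {
    "input": {"head_0"},
    "head_0": {"output"},
    "output": set(),
}

print(is_subgraph(circuit, model_graph))
```

In this framing, finding a circuit means choosing which edges of the full graph to keep. Whether a chosen subgraph really has "distinct functionality" is an empirical question that this sketch does not address.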