Towards Automated Circuit Discovery for Mechanistic Interpretability
Authors: Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
Publication Date: 2023-10-28
Full Paper: Towards Automated Circuit Discovery for Mechanistic Interpretability
In-Progress Reproduction: GitHub Repo
Notes
Keywords
- Mech Interp -> "reverse-engineering model components into human-understandable algorithms"
- Models = computational graphs that perform a task; circuits = "subgraphs with distinct functionality"
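
To make the graph/subgraph vocabulary concrete, here is a minimal sketch in Python. It is a toy illustration only: the node names (`embed`, `head_0.0`, `mlp_0`, `logits`), the edges, and the helpers `is_subgraph` and `nodes` are all hypothetical and are not taken from the paper or its codebase.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (source component, destination component)

# A model viewed as a computational graph: directed edges between components.
# All node names here are made up for illustration.
model_edges: Set[Edge] = {
    ("embed", "head_0.0"),
    ("embed", "head_0.1"),
    ("embed", "mlp_0"),
    ("head_0.0", "mlp_0"),
    ("head_0.1", "mlp_0"),
    ("head_0.0", "logits"),
    ("mlp_0", "logits"),
}

# A candidate circuit: a subset of edges hypothesized to carry one behavior.
circuit_edges: Set[Edge] = {
    ("embed", "head_0.0"),
    ("head_0.0", "logits"),
}


def is_subgraph(circuit: Set[Edge], model: Set[Edge]) -> bool:
    """Return True if every edge of the circuit also exists in the model graph."""
    return circuit <= model


def nodes(edges: Set[Edge]) -> Set[str]:
    """Collect every component that appears as an endpoint of some edge."""
    return {node for edge in edges for node in edge}


if __name__ == "__main__":
    print(is_subgraph(circuit_edges, model_edges))
    print(sorted(nodes(circuit_edges)))
```

Representing the graph as a set of edges keeps the subgraph check to a single set comparison. Whether a given subgraph really has "distinct functionality" is an empirical question that this sketch does not address.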