
2025-10-24 | Transformer Lens

Goal: Learn Transformer Lens

Summary: Learning Transformer Lens for Encoder-Based Syntax Probe

Work sessions

In     Out
15:30  16:00
16:53  17:55

Dyck Language Probe

Taking a quick break from the ARENA curriculum today to work on Dyck-Interp-Probe. (It was really fun to start sharing AI Safety ideas and the ARENA curriculum with the Computational Linguistics Reading Group yesterday; next week we will start with Karpathy's Zero to Hero to ramp up for the ARENA curriculum.)

The first version I worked on was programmed in plain PyTorch, but since we're scaling up the experiments (binary searching the number of layers needed for high performance and randomizing data partitions), it might be worthwhile to refactor for better collaboration. Super cool to see that "Let's build GPT: from scratch, in code, spelled out." was where this project's programming began.

This work is done in collaboration with Prof. Khalil Iskarous (Computational Linguistics at USC).

Progress

  1. Added support for loading settings from a config file rather than specifying them in the code (decoupling for a better experimentation workflow). Also added a logger and saved the training curves.
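As a sketch of what that decoupling can look like, here is a minimal config loader. The field names and defaults are hypothetical placeholders for illustration, not the project's actual config schema:

```python
import json
from dataclasses import dataclass


@dataclass
class ProbeConfig:
    # Hypothetical fields for illustration; the real schema may differ.
    n_layers: int = 12
    lr: float = 1e-3
    epochs: int = 200
    seed: int = 0


def load_config(path: str) -> ProbeConfig:
    """Load experiment settings from a JSON file instead of hard-coding them."""
    with open(path) as f:
        return ProbeConfig(**json.load(f))
```

Keeping the settings in a dataclass means unspecified fields fall back to defaults, and each experiment run is fully described by one file that can be checked into the repo alongside its training curves.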

  2. Model performance varies across runs, for example:

Run A: Best Epoch = 147 with accuracy = 0.8353 and unconfidence = 0.1235
Run B: Best Epoch = 149 with accuracy = 0.8588 and unconfidence = 0.1000
  • The most likely culprits are (1) data shuffling and (2) random model initialization.
  • I am leaning towards (2) being the issue, since the data is not shuffled (but I will need to verify this).

  • In these experiments, accuracy is hovering around 85%. Previously, I had seen accuracy close to ceiling (~96% on the validation set).

  • A follow-up task is to tune the AdamW optimizer hyperparameters to increase performance.

  • An initial experiment reducing the model from 12 to 6 layers shows no performance degradation:

Run with 6 layers: Best Epoch = 152 with accuracy = 0.8706 and unconfidence = 0.1059
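A quick way to test the initialization hypothesis is to pin all the relevant RNGs and check that two runs start from identical weights; if variance persists with fixed seeds, the cause lies elsewhere (e.g. data ordering or nondeterministic ops). A minimal sketch, not the project's actual training code:

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Fix the RNGs that affect model initialization and data shuffling."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds CUDA generators when present


# Two models built under the same seed get identical initial weights.
set_seed(0)
w1 = torch.nn.Linear(4, 2).weight.detach().clone()
set_seed(0)
w2 = torch.nn.Linear(4, 2).weight.detach().clone()
```

Comparing final accuracies across several fixed seeds would then separate seed-driven variance from anything nondeterministic in the pipeline.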
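For the AdamW follow-up, the knobs to sweep are the learning rate, betas, and weight decay. The values below are illustrative starting points only, not tuned results from these runs, and the `Linear` model is a stand-in for the actual probe:

```python
import torch

# Stand-in model; the real probe is a transformer encoder.
model = torch.nn.Linear(16, 2)

# Illustrative starting values, not tuned results.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # typically the highest-impact hyperparameter
    betas=(0.9, 0.95),  # momentum terms; PyTorch's default is (0.9, 0.999)
    weight_decay=0.01,  # decoupled weight decay, the "W" in AdamW
)
```

With the config-file setup above, these could live in the config so each sweep point is one file rather than a code edit.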