2026-04-13 | MOLT Followup on Auto Labelling

Goal: PCD Repro + MOLT Task Scaffolding

Summary: Planned the tasks for scaling MOLT training to a larger GPU and 100M tokens

Work sessions

In Out
07:30 08:30
15:45 16:45
23:00 23:30

1. Blockers from yesterday with Path to Green

  1. 🟥 Running the MOLT implementation in crosslayer-transcoder is blocked by GPU setup issues
    1. 🟥 Disk too small (the RTX 5090 instance has a 50GB disk, but OpenWebText is 55GB)
      • Path to Green: Use an RTX 5090 instance with a 256GB disk, or larger if needed (Georg originally ran on a B100)
    2. 🟥 The PyTorch CUDA build in the default Vast.AI setup does not support sm_120 (the compute capability 12.0 target needed by the RTX 5090)
      • Path to Green: Check whether mainline has the same issue (the CUDA wheel that pyproject.toml points to is likely wrong)
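
Both blockers can be caught before a run starts. The following is a minimal preflight sketch, not part of crosslayer-transcoder: the helper names `has_enough_disk` and `build_supports` and the 20% headroom factor are assumptions made for illustration. It relies on `shutil.disk_usage` from the standard library and, when PyTorch is installed, on `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`.

```python
import shutil


def has_enough_disk(path: str, required_gb: float, headroom: float = 1.2) -> bool:
    """Return True if `path` has at least required_gb * headroom free (1 GB = 1e9 bytes).

    The headroom factor is an assumption: it leaves room for caches and checkpoints
    on top of the raw dataset size.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb * headroom


def build_supports(arch_list: list[str], capability: tuple[int, int]) -> bool:
    """Return True if an arch list such as ['sm_80', 'sm_90'] contains the device's sm_XY.

    This is a conservative exact-match check; it ignores possible PTX
    forward compatibility through 'compute_XY' entries.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    # 55 GB is the OpenWebText size quoted in the blocker above.
    print("disk ok:", has_enough_disk(".", required_gb=55))

    try:
        import torch
    except ImportError:
        torch = None  # PyTorch not installed; skip the GPU check.

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)  # (12, 0) on an RTX 5090
        print("arch ok:", build_supports(torch.cuda.get_arch_list(), capability))
```

Running this on a fresh instance before downloading the dataset should surface both problems in a few seconds instead of partway through setup.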

2. Tasks for the week

  1. 🟢 Fix MOLT Branch GPU Issues
  2. 🟢 Refactor MOLT branch onto mainline (mainline has additional quality-of-life features); use Claude Code for this
    • Success Criteria: MOLT WandB run should exactly match the one here
    • Create PR to merge into mainline when ready
  3. 🟢 Sanity check that Gemma does not collapse on the new MOLT mainline
    • Success Criteria: L0 is non-zero by the end of the training cycle with a smaller number of tokens (e.g. 2M)
  4. 🟢 Enable Mixed Precision (precision: "16-mixed" setting in the config.yaml)
  5. 🟢 Scale MOLT Training (100M tokens, increase the MOLT N-multiplier to the maximum size that fits on the GPU)
    • Success Criteria: MOLT L0 stays low while NMSE stays low; compare against Transcoder baselines on L0 vs. NMSE and L0 vs. Jacobian
    • Non-Goal: Do not try to label the features using AutoInterp methods yet
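
The success criteria above depend on L0 and NMSE, so it helps to pin down what is being measured. Below is a plain-Python sketch under stated assumptions: L0 is taken as the mean number of non-zero feature activations per token, and NMSE as the summed squared error divided by the summed squared deviation of the target from its per-dimension mean. The definitions in the actual codebase may differ, and all function names here are hypothetical. In training code these would be tensor operations; nested lists are used only to keep the example dependency-free.

```python
def flatten_tokens(batched):
    """Collapse [batch][position][feature] into [token][feature]."""
    return [token for sequence in batched for token in sequence]


def l0(acts, eps=0.0):
    """Mean number of active features per token; acts is [token][feature]."""
    counts = [sum(1 for a in row if abs(a) > eps) for row in acts]
    return sum(counts) / len(counts)


def is_collapsed(acts):
    """Collapse check for the sanity run: no feature fires on any token."""
    return l0(acts) == 0


def nmse(pred, target):
    """Squared error normalised by the target's deviation from its per-dimension mean.

    0.0 means perfect reconstruction; 1.0 matches always predicting the mean.
    """
    n, d = len(target), len(target[0])
    mean = [sum(row[j] for row in target) / n for j in range(d)]
    err = sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    var = sum((t - mean[j]) ** 2 for tr in target for j, t in enumerate(tr))
    if var == 0:
        raise ValueError("target has zero variance; NMSE is undefined")
    return err / var


if __name__ == "__main__":
    # Two sequences of two tokens each, two features per token.
    acts = flatten_tokens([[[0.0, 1.0], [2.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]]])
    print("L0:", l0(acts), "collapsed:", is_collapsed(acts))
```

Flattening batch and position first makes every token count equally toward L0, which is one reasonable convention when comparing runs with different sequence lengths.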

Notes on Workflow

Claude Code usage

  • Set Effort to High (potentially always use Reasoning)
  • Run with Dangerously Skip Permissions (--dangerously-skip-permissions) from a dedicated non-root user, since the flag is rejected when running as root

To stay in touch with Codebase

  • Continue to use the reproducible experiment setup detailed in the Claude Code guide by Johnny Wei
  • Write core implementations by hand (if one already exists, hand-copy it to understand how the math maps onto the scaled setting, e.g. multiple batches and token positions)
    • Core Architecture: MOLT architecture, JumpReLU Surrogate Loss, and Loss Function (probably also Frobenius Norm)
    • Metrics: L0, Normalized MSE, Jacobian
  • Use PRs for human context / keep humans in the loop
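
As a starting point for the hand-written pass, here is a small sketch of the JumpReLU activation applied across batches and token positions. It assumes the standard definition, JumpReLU_θ(z) = z if z > θ else 0, and the rectangle-kernel pseudo-derivative for the threshold from the JumpReLU SAE paper, -(θ/ε)·K((z-θ)/ε). Whether crosslayer-transcoder uses exactly this surrogate is an assumption, and the function names are hypothetical.

```python
def jumprelu(z, theta):
    """JumpReLU forward pass: keep z only if it exceeds the threshold theta."""
    return z if z > theta else 0.0


def rect(u):
    """Rectangle kernel: 1 inside (-1/2, 1/2), 0 elsewhere."""
    return 1.0 if -0.5 < u < 0.5 else 0.0


def jumprelu_theta_pseudograd(z, theta, eps):
    """Pseudo-derivative of JumpReLU w.r.t. theta (assumed surrogate form).

    The true derivative is zero almost everywhere, so a kernel of bandwidth eps
    around the threshold supplies a training signal for theta.
    """
    return -(theta / eps) * rect((z - theta) / eps)


def jumprelu_batched(z, theta):
    """Apply JumpReLU over z of shape [batch][position][feature].

    theta is [feature]: one threshold per feature, shared across batch and position.
    """
    return [
        [[jumprelu(v, t) for v, t in zip(token, theta)] for token in sequence]
        for sequence in z
    ]


if __name__ == "__main__":
    z = [[[0.5, 2.0], [1.5, 0.9]]]  # 1 sequence, 2 positions, 2 features
    print(jumprelu_batched(z, theta=[1.0, 1.0]))
```

The batched version makes the scaling point concrete: thresholds are per feature, while batch and position are just extra axes the same elementwise rule is mapped over.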