2026-04-13 | MOLT Followup on Auto Labelling

Goal: PCD Repro + MOLT Task Scaffolding

Summary: Planned the tasks needed to run MOLTs on a larger GPU with 100M tokens

Work sessions

1. Blockers from yesterday with `Path to Green`

🟥 Running MOLT implementation in crosslayer-transcoder has GPU setup issues
1. 🟥 Disk too small (the RTX 5090 instance has a 50GB disk, but OpenWebText is 55GB)
  - Path to Green: Use an RTX 5090 instance with a 256GB disk, or larger if needed (Georg originally ran on a B100)
2. 🟥 The default PyTorch CUDA build on Vast.AI does not support sm_120 (the Streaming Multiprocessor architecture of the RTX 5090)
  - Path to Green: Check whether mainline has the same issue (likely the CUDA wheel that pyproject.toml points to is wrong)
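
Both blockers above can be caught before a run starts. A minimal preflight sketch, assuming the dataset lands on the disk that holds the given path. The helper names and the 1.2x headroom factor are hypothetical, not part of crosslayer-transcoder. The arch check is an exact match on the `sm_XY` string and ignores PTX forward compatibility, so treat a False as "look closer" rather than proof.

```python
import shutil

GB = 1024 ** 3


def disk_has_room(path, required_gb, headroom=1.2):
    """Return True if the disk holding `path` has required_gb * headroom free.

    headroom is a hypothetical safety factor for caches and checkpoints.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * headroom * GB


def arch_supported(capability, arch_list):
    """True if arch_list contains the sm_XY entry for this (major, minor) capability.

    Exact-match check only: "compute_XX" (PTX) entries are not considered.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def torch_supports_current_gpu():
    """Ask the installed PyTorch build; returns None when no GPU is visible."""
    import torch  # imported lazily so the helpers above work without PyTorch

    if not torch.cuda.is_available():
        return None
    return arch_supported(
        torch.cuda.get_device_capability(),  # e.g. (12, 0)
        torch.cuda.get_arch_list(),          # e.g. ["sm_80", "sm_90", ...]
    )
```

On a fresh instance, `disk_has_room(".", required_gb=55)` covers the OpenWebText download and `torch_supports_current_gpu()` covers the wheel mismatch.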

🟢 Fix MOLT Branch GPU Issues
🟢 Refactor MOLT branch onto mainline (mainline has additional quality-of-life features) -- use Claude Code for this
- Success Criteria: MOLT WandB run should match exactly the one here
- Create PR to merge into mainline when ready
🟢 Sanity check that Gemma does not collapse on new MOLTs mainline
- Success Criteria: L0 is non-zero at the end of a training run on a smaller number of tokens (e.g. 2M)
🟢 Enable mixed precision (set precision: "16-mixed" in config.yaml)
🟢 Scale MOLT Training (100M tokens, increase the MOLT N-multiplier to the maximum size that fits on the GPU)
- Success Criteria: MOLT L0 should stay low while NMSE is also low; compare to Transcoders on the L0 vs. NMSE and L0 vs. Jacobian baselines
- Non-Goal: Do not try to label the features using AutoInterp methods yet
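
The L0 and NMSE success criteria above can be checked with two small functions. This is a sketch under assumptions: L0 is taken as the mean count of non-zero gate activations per token, and NMSE is normalized by the target's variance around its per-dimension mean. The mainline metrics may use a different normalization (for example, dividing by the squared norm of the target instead).

```python
import numpy as np


def l0(acts, eps=0.0):
    """Mean number of active features per token.

    acts: array of shape (tokens, features) with post-activation values.
    A value of 0.0 means every feature is dead (collapse).
    """
    return float((np.abs(acts) > eps).sum(axis=-1).mean())


def nmse(target, pred):
    """Squared error divided by the target's variance around its mean.

    target, pred: arrays of shape (tokens, d_out).
    0.0 is a perfect reconstruction; 1.0 matches always predicting the mean.
    """
    residual = ((target - pred) ** 2).sum()
    total = ((target - target.mean(axis=0, keepdims=True)) ** 2).sum()
    return float(residual / total)
```

For the Gemma sanity check, `l0(acts) > 0` on the final batch is the collapse test.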

Continue to use the reproducible experiment setup detailed in the Claude Code guide by Johnny Wei
Write core implementations by hand (if one already exists, copy it out by hand to understand how the math maps onto the scaled setup -- e.g. the added batch and token-position dimensions)
- Core Architecture: MOLT architecture, JumpReLU Surrogate Loss, and Loss Function (probably also Frobenius Norm)
- Metrics: L0, Normalized MSE, Jacobian
Use PRs for human context / keep humans in the loop
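
As a starting point for the hand-written core architecture listed above, here is a forward-pass-only sketch in NumPy. It assumes a MOLT layer is a gated sum of low-rank transforms, y = sum_k gate_k(x) * U_k V_k x, with a JumpReLU gate. All names and shapes are hypothetical and every transform shares one rank for simplicity. It does not cover the surrogate gradient for the JumpReLU threshold or the loss itself. The Frobenius helper is included only because the norm of each U_k V_k is the likely weighting term mentioned above.

```python
import numpy as np


def jump_relu(z, theta):
    """JumpReLU forward pass: keep z where it exceeds theta, else 0."""
    return z * (z > theta)


def molt_forward(x, E, b, U, V, theta):
    """Gated sum of low-rank transforms (forward pass only).

    x: (tokens, d_in) inputs, with batch and position flattened into tokens
    E: (k, d_in) gate encoder directions, b: (k,) gate biases
    U: (k, d_out, r), V: (k, r, d_in) low-rank factors of each transform
    theta: (k,) JumpReLU thresholds
    Returns (y, gates) with y: (tokens, d_out) and gates: (tokens, k).
    """
    gates = jump_relu(x @ E.T - b, theta)      # (tokens, k)
    low = np.einsum("krd,td->tkr", V, x)       # V_k x
    out = np.einsum("kor,tkr->tko", U, low)    # U_k V_k x
    y = np.einsum("tk,tko->to", gates, out)    # gate-weighted sum over k
    return y, gates


def frobenius_norms(U, V):
    """Frobenius norm of each product U_k V_k, shape (k,)."""
    return np.linalg.norm(U @ V, axis=(1, 2))
```

The `gates` output is what the L0 metric would be computed on.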