2026-04-13 | MOLT Followup on Auto Labelling
Goal: PCD Repro + MOLT Task Scaffolding
Summary: Planned out tasks for MOLTs to use larger GPU and 100M Tokens
Work sessions
| In | Out |
|---|---|
| 07:30 | 08:30 |
| 15:45 | 16:45 |
| 23:00 | 23:30 |
1. Blockers from yesterday with Path to Green
- 🟥 Running MOLT implementation in crosslayer-transcoder has GPU setup issues
  - 🟥 Disk too small (50GB disk on the RTX 5090 instance, but OpenWebText is 55GB)
    - Path to Green: Use 256GB Disk RTX 5090 or larger if needed (Georg originally ran on B100)
  - 🟥 PyTorch CUDA build on Vast.AI default does not support `sm_120` (Streaming Multiprocessor)
    - Path to Green: Check if the mainline has this issue (likely the CUDA wheel that `pyproject.toml` points to is wrong)
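Both blockers above can be caught before a run starts. The following is a minimal preflight sketch, not part of crosslayer-transcoder: the function names and the 1.5x disk headroom factor are assumptions made for illustration, and the 55GB figure is the OpenWebText size from the notes. `torch.cuda.get_arch_list()` reports the architectures the installed PyTorch build was compiled for.

```python
import shutil

REQUIRED_ARCH = "sm_120"  # architecture the notes report as unsupported
DATASET_GB = 55           # OpenWebText size quoted in the notes


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9


def disk_ok(free_gb, dataset_gb=DATASET_GB, headroom=1.5):
    """True if free space covers the dataset plus an assumed safety margin.

    `headroom` is a hypothetical factor for caches and checkpoints.
    """
    return free_gb >= dataset_gb * headroom


def arch_supported(arch_list, arch=REQUIRED_ARCH):
    """True if `arch` is among the architectures a PyTorch build targets."""
    return arch in arch_list


if __name__ == "__main__":
    free = free_disk_gb()
    print(f"free disk: {free:.0f} GB, ok={disk_ok(free)}")
    try:
        import torch
    except ImportError:
        print("torch not installed; skipping CUDA arch check")
    else:
        archs = torch.cuda.get_arch_list()
        print(f"compiled archs: {archs}, {REQUIRED_ARCH} ok={arch_supported(archs)}")
```

Running this on a fresh Vast.AI instance before downloading the dataset would show whether the disk or the PyTorch wheel is the problem.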
2. Tasks for the week
- 🟢 Fix MOLT Branch GPU Issues
- 🟢 Refactor MOLT branch onto mainline (additional quality of life features exist on mainline) -- use Claude Code for this
  - Success Criteria: MOLT WandB run should match exactly the one here
  - Create PR to merge into mainline when ready
- 🟢 Sanity check that Gemma does not collapse on new MOLTs mainline
  - Success Criteria: L0 is non-zero by end of training cycle with smaller number of tokens (e.g. 2M)
- 🟢 Enable Mixed Precision (`precision: "16-mixed"` setting in the `config.yaml`)
- 🟢 Scale MOLT Training (100M tokens, increase MOLT N-multiplier to maximum size that can fit on the GPU)
  - Success Criteria: MOLT L0 should remain low while NMSE is low; can compare to Transcoders on L0 vs. NMSE and L0 vs. Jacobian baselines
  - Non-Goal: Do not try to label the features using AutoInterp methods yet
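The success criteria above depend on two metrics, L0 and NMSE. The sketch below shows one way to compute them on nested lists so that it runs without any dependencies. It is an illustration only: the function names are hypothetical, and NMSE is normalized here by the targets' energy around their per-dimension mean, which is one common convention and may not be the one the repository uses.

```python
def l0_per_token(activations):
    """Mean number of non-zero feature activations per token.

    `activations` is a list of per-token rows of feature activations.
    """
    counts = [sum(1 for a in row if a != 0.0) for row in activations]
    return sum(counts) / len(counts)


def nmse(targets, predictions):
    """Squared error divided by the targets' squared deviation from their mean.

    The mean is taken per output dimension across tokens (an assumed
    convention). 0.0 is a perfect fit; 1.0 matches predicting the mean.
    """
    n_tokens = len(targets)
    n_dims = len(targets[0])
    mean = [sum(row[d] for row in targets) / n_tokens for d in range(n_dims)]
    err = sum((t - p) ** 2
              for t_row, p_row in zip(targets, predictions)
              for t, p in zip(t_row, p_row))
    denom = sum((t - mean[d]) ** 2
                for t_row in targets
                for d, t in enumerate(t_row))
    return err / denom


def has_collapsed(activations):
    """Collapse check from the sanity-check task: L0 of zero means no feature fires."""
    return l0_per_token(activations) == 0.0
```

In a real training loop the same quantities would be computed with tensor operations over batches and token positions; the list version only pins down the definitions.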
Notes on Workflow
Claude Code usage
- Set Effort to High (potentially always use Reasoning)
- Dangerously Skip Permissions: create a non-root user to run it, since the flag is refused under root
To stay in touch with Codebase
- Continue to use the reproducible experiment setup detailed in Claude Code guide by Johnny Wei
- Write core implementations by hand (if it already exists, hand-copy to understand how the math maps onto scaling -- e.g. now multiple batches and token positions)
  - Core Architecture: MOLT architecture, JumpReLU Surrogate Loss, and Loss Function (probably also Frobenius Norm)
  - Metrics: L0, Normalized MSE, Jacobian
- Use PRs for human context / keep humans in the loop
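As a starting point for the hand-written JumpReLU item above, here is a scalar sketch of the activation and the straight-through pseudo-derivatives with respect to the threshold, following the formulation commonly used for JumpReLU sparse autoencoders. It is an assumption that the MOLT branch uses the same form; the function names and the bandwidth parameter `eps` are illustrative.

```python
def jumprelu(z, theta):
    """JumpReLU: pass z through unchanged above the threshold, else output 0."""
    return z if z > theta else 0.0


def rect_kernel(x):
    """Rectangle kernel: 1 inside (-1/2, 1/2), 0 elsewhere."""
    return 1.0 if -0.5 < x < 0.5 else 0.0


def jumprelu_dtheta(z, theta, eps):
    """Pseudo-derivative of JumpReLU w.r.t. theta: -(theta/eps) * K((z - theta)/eps)."""
    return -(theta / eps) * rect_kernel((z - theta) / eps)


def heaviside_dtheta(z, theta, eps):
    """Pseudo-derivative of the step H(z - theta) w.r.t. theta, used for an L0 penalty."""
    return -(1.0 / eps) * rect_kernel((z - theta) / eps)
```

The true derivative with respect to the threshold is zero almost everywhere, so the kernel gives the threshold a gradient only for pre-activations within `eps / 2` of it. Extending this to batches and token positions means applying the same rule elementwise.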