2026-05-08 | MOLTs
Goal: Realized Fatal Flaw in MOLTs Training
Summary: Trained MOLTs for an instruction-tuned model without chat data or a chat template
Work sessions
| In | Out |
|---|---|
| 08:00 | 17:00 |
- Realized a fatal flaw in MOLT training
  a. GPT-2 MOLTs perform badly (incorrect)
     - [https://huggingface.co/kylelovesllms/molt-sweeps/tree/main/molt-multilayer-gpt2-N50-100M-fp32weights-full](https://huggingface.co/kylelovesllms/molt-sweeps/tree/main/molt-multilayer-gpt2-N50-100M-fp32weights-full)
     - GPT-2 is completion-only and therefore can't replicate the translation/+3 (addition) tasks that Anthropic's original MOLTs have
  b. Drawbacks of Gemma3-4B-IT MOLTs
     - Carried over the training script from GPT-2 to instruction-tuned Gemma, training on OpenWebText without a chat template
     - Even though Gemma3-4B-IT has math capabilities, training MOLTs on out-of-distribution text (not chat-template formatted) causes full MOLT replacement to return garbage
     - This was not captured by the feature dashboards, which only showed the features active per token, not multi-step generation (or rather degeneration)
- Proposal -> recreate a plan and analysis for MOLTs
  a. See follow-up proposal
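
Note to self: the per-token dashboards missed the degeneration, so a cheap check on multi-step generations might be worth adding. A minimal sketch, assuming the generation is already available as a list of tokens. `repeated_ngram_fraction`, `looks_degenerate`, and the 0.5 threshold are hypothetical names and values for illustration, not part of the existing training or eval scripts. It only catches looping output, not other kinds of garbage.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-grams in `tokens` that duplicate an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first of a given n-gram counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def looks_degenerate(tokens, n=3, threshold=0.5):
    """Flag a generation whose repeat fraction reaches the (assumed) threshold."""
    return repeated_ngram_fraction(tokens, n) >= threshold


# Toy inputs split on whitespace; real use would pass tokenizer output.
healthy = "2 + 3 = 5 and 4 + 3 = 7".split()
looping = "the the the the the the the the".split()

print(repeated_ngram_fraction(healthy), looks_degenerate(healthy))
print(repeated_ngram_fraction(looping), looks_degenerate(looping))
```

Running this on generations from the base model and from the full MOLT replacement, on the same chat-formatted prompts, would give a rough side-by-side number to track alongside the dashboards.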