2026-05-07 | MOLTs
Goal: Continue Training Gemma3-4B-IT MOLTs
Summary: Use Multi-GPU training to train Gemma3-4B-IT MOLTs
Work sessions
| In | Out |
|---|---|
| 08:00 | 17:00 |
- ~4 hours (3 h 40 min) to train Gemma3-1B-IT on 4xB200. Note: at the time, 4xB200 was $14.26/hr and 2xB200 was $13.7/hr, so the 4-GPU option cost only slightly more per hour, which is why I opted for the more expensive version.
a. $56 to train the 1B variant
- ~13 hr on 2xB200 (at $13.7/hr) for Gemma3-4B-IT to reach step 19600. Stopped early because MSE was increasing and average L0 had flattened to ~3.0 (still decreasing), while the number of dead features kept increasing.
a. W&B run: https://wandb.ai/kgng-usc/debug-molt/runs/rh1y5m7s?nw=nwuserkgng
b. Checkpoint: checkpoints/molt-multilayer-gemma3-4b-it-N50-100M-ddp-2gpu-b200-resume/molt-multilayer-gemma3-4b-it-N50-100M-2gpu-b200-fp32weights/molt-multilayer-gemma3-4b-it-N50-100M-ddp-2gpu-b200_layer_0_step19600.pt
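
A quick sanity check on the cost figures above, as a minimal sketch. It assumes cost is simply hours times the hourly rate (no per-hour billing granularity, storage, or egress fees) and that the $13.7 in the 4B entry is the 2xB200 hourly rate. The helper name `run_cost` is made up for illustration.

```python
def run_cost(hours: float, rate_per_hour: float) -> float:
    """Return the estimated rental cost in dollars (hours x hourly rate)."""
    return hours * rate_per_hour


# Durations and rates as recorded in the notes above.
cost_1b_actual = run_cost(3 + 40 / 60, 14.26)  # 4xB200, 3 h 40 min
cost_1b_rounded = run_cost(4, 14.26)           # 4xB200, rounded up to 4 h
cost_4b = run_cost(13, 13.70)                  # 2xB200, ~13 h (assumed hourly rate)

print(f"1B on 4xB200, 3 h 40 min: ${cost_1b_actual:.2f}")   # $52.29
print(f"1B on 4xB200, 4 h:        ${cost_1b_rounded:.2f}")  # $57.04
print(f"4B on 2xB200, ~13 h:      ${cost_4b:.2f}")          # $178.10
```

The noted $56 for the 1B run falls between the 3 h 40 min and 4 h estimates. The ~$178 figure for the 4B run is only an estimate under the assumptions above, not a billed amount.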