2026-05-06 | MOLTs
Goal: Train Gemma3-4B-IT MOLTs
Summary: Use Multi-GPU training to train Gemma3-4B-IT MOLTs
Work sessions
| In | Out |
|---|---|
| 08:00 | 17:00 |
- `nvidia-smi` shows 100% utilization, but that does not necessarily mean all SMs are equally busy
   a. Checkpoint storage
- 100M runs (not that large); 1 day
   a. DDP out of the box with multiple H100s (e.g. 2 H200s, $8 an hour; bump up the batch size)
   b. Increase batch size and tune before kicking off run
   c. MFU
   d. BS 4000 for GPT-2
- Greater than task
- Can try to train on Larger Gemma Models in parallel
- Did the reviewer learn something?
   a. Workshop paper is → what type of experiments; do we think this is a good tool for a zoo of feature dictionaries
   b. E.g. feature dashboards, MSE vs. Jacobian faithfulness
- Save the checkpoint after 10M, then develop experiments and re-run with the 100M checkpoint (5090; doesn't have as much memory but not needed)
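
A rough sketch for the MFU item above, since `nvidia-smi` utilization alone can mislead. It assumes the common approximation of about 6 FLOPs per trainable parameter per token (forward plus backward) for dense training; that may not match a MOLT setup where the base model is frozen. The function name and all numbers are hypothetical, and the peak-FLOPs value is a placeholder to replace with the vendor figure for the actual GPU and dtype.

```python
def estimate_mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved training FLOPs/s over hardware peak.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    which is an approximation for dense training, not a measured value.
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec


if __name__ == "__main__":
    # Hypothetical inputs: substitute measured throughput and real specs.
    mfu = estimate_mfu(
        n_params=1e9,              # trainable parameters (assumed)
        tokens_per_sec=1e5,        # measured across all GPUs (assumed)
        n_gpus=2,
        peak_flops_per_gpu=1e15,   # placeholder peak, not a datasheet value
    )
    print(f"MFU ~ {mfu:.1%}")
```

Tracking this number while tuning batch size (item b) gives a throughput target that does not depend on what `nvidia-smi` reports.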
Necessity vs. Sufficiency
- Necessity: If we ablate it (get rid of it), the behavior doesn't work anymore
- Sufficiency: If we add it somewhere, re-elicit behavior
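
A toy sketch of the two checks, to make the distinction concrete. Everything here is hypothetical: the "model" is just a weighted sum of feature activations and the behavior counts as present when the score exceeds a threshold. A real experiment would hook into the model's activations instead.

```python
def behavior_score(acts, weights):
    # Toy stand-in for a model: weighted sum of feature activations.
    return sum(a * w for a, w in zip(acts, weights))


def ablate(acts, idx):
    # Zero out one feature, leaving the rest untouched.
    out = list(acts)
    out[idx] = 0.0
    return out


def inject(acts, idx, value):
    # Set one feature to a chosen value in a context where it was absent.
    out = list(acts)
    out[idx] = value
    return out


def is_necessary(acts, weights, idx, threshold):
    # Behavior is present, and disappears once the feature is ablated.
    present = behavior_score(acts, weights) > threshold
    after = behavior_score(ablate(acts, idx), weights) > threshold
    return present and not after


def is_sufficient(neutral_acts, weights, idx, value, threshold):
    # Behavior is absent, and appears once the feature is added.
    before = behavior_score(neutral_acts, weights) > threshold
    after = behavior_score(inject(neutral_acts, idx, value), weights) > threshold
    return (not before) and after


if __name__ == "__main__":
    weights = [2.0, 0.1, 0.1]   # hypothetical readout weights
    active = [1.0, 0.5, 0.5]    # context where the behavior occurs
    neutral = [0.0, 0.5, 0.5]   # context where it does not
    print(is_necessary(active, weights, idx=0, threshold=1.0))
    print(is_sufficient(neutral, weights, idx=0, value=1.0, threshold=1.0))
```

A feature can pass one check and fail the other, which is why the notes list them separately.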