2026-05-06 | MOLTs

Goal: Train Gemma3-4B-IT MOLTs

Summary: Use Multi-GPU training to train Gemma3-4B-IT MOLTs

Work sessions

In	Out
08:00	17:00

nvidia-smi shows 100% utilization, but that does not necessarily mean all SMs are equally busy
    a. Checkpoint storage
100M runs (not that large); 1 day
    a. DDP out of the box with multiple H100s (e.g. 2 H200s, $8 an hour; bump up the batch size)
    b. Increase batch size and tune before kicking off the run
    c. MFU
    d. BS 4000 for GPT-2
Greater than task
Can try training on larger Gemma models in parallel
Did the reviewer learn something?
    a. Workshop paper is → what type of experiments; do we think this is a good tool for zoo of feature dictionaries
    b. E.g. feature dashboards, MSE vs. Jacobian faithfulness
Save the checkpoint after 10M, then develop experiments and re-run with the 100M checkpoint (5090; doesn't have as much memory but not needed)
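
Note on MFU and run cost (the "MFU" and "$8 an hour" / "1 day" items above): a minimal sketch of the arithmetic, not a measurement. Assumptions, all mine and not from the run: the usual dense-transformer estimate of about 2*N FLOPs per token for a forward pass and 6*N for forward plus backward; a forward-only pass through a frozen 4B base model (the MOLT's own FLOPs are ignored here); a dense BF16 peak of 989e12 FLOP/s per GPU (check the datasheet for the actual card); a hypothetical throughput of 50,000 tokens/s; and $8 an hour read as the total price for the node. The function names are illustrative.

```python
def flops_per_token(n_params: float, backward: bool = True) -> float:
    """Rough dense-transformer estimate: 2*N forward only, 6*N forward + backward."""
    return (6.0 if backward else 2.0) * n_params


def mfu(tokens_per_sec: float, flops_per_tok: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s divided by aggregate peak FLOP/s."""
    achieved = tokens_per_sec * flops_per_tok
    return achieved / (n_gpus * peak_flops_per_gpu)


def run_cost(hours: float, dollars_per_hour: float) -> float:
    """Total rental cost for a run of the given length."""
    return hours * dollars_per_hour


# Hypothetical numbers, for illustration only.
N_PARAMS = 4e9            # assumed parameter count of the frozen base model
TOKENS_PER_SEC = 50_000   # assumed measured throughput across all GPUs
PEAK_FLOPS = 989e12       # assumed dense BF16 peak per GPU
N_GPUS = 2

fpt = flops_per_token(N_PARAMS, backward=False)
print(f"MFU estimate: {mfu(TOKENS_PER_SEC, fpt, N_GPUS, PEAK_FLOPS):.3f}")
print(f"1-day cost at $8/hour: ${run_cost(24, 8.0):.0f}")
```

Replace TOKENS_PER_SEC with the logged throughput after each batch-size change. Unlike the nvidia-smi utilization figure, this number moves with how much useful compute is actually being done, so it is the better thing to watch while tuning.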

Necessity vs. Sufficiency