2026-05-06 | MOLTs

Goal: Train Gemma3-4B-IT MOLTs

Summary: Use Multi-GPU training to train Gemma3-4B-IT MOLTs

Work sessions

In Out
08:00 17:00
  1. nvidia-smi can show 100% utilization, but that does not necessarily mean all SMs are equally busy
     a. Checkpoint storage
  2. 100M runs (not that large); 1 day
     a. DDP out of the box with multiple H100s (e.g. 2 H200s, $8 an hour; bump up the batch size)
     b. Increase batch size and tune before kicking off the run
     c. MFU
     d. BS 4000 for GPT-2
  3. Greater than task
  4. Can try to train on Larger Gemma Models in parallel
  5. Did the reviewer learn something?
     a. Workshop paper: what type of experiments; do we think this is a good tool for a zoo of feature dictionaries?
     b. E.g. feature dashboards, MSE vs. Jacobian faithfulness
  6. Save the checkpoint after 10M, then develop experiments and re-run with the 100M checkpoint (5090; doesn't have as much memory but not needed)
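
Items 1 and 2c both point at the same question: how much of the hardware's peak compute is the run actually using? A rough sketch of an MFU (model FLOPs utilization) estimate follows. The function names, the `6 * n_params` FLOPs-per-token rule (a common approximation for training a dense transformer, ignoring attention), and every number in the usage lines are assumptions for illustration, not measurements from these runs. For MOLT training with a frozen base model the FLOPs per token will differ, so `flops_per_token` is left as an explicit input.

```python
def dense_train_flops_per_token(n_params: float) -> float:
    """Approximate forward + backward FLOPs per token for a dense model.

    Uses the common 6 * N rule of thumb (assumption; ignores attention FLOPs).
    """
    return 6.0 * n_params


def estimate_mfu(
    flops_per_token: float,
    tokens_per_second: float,
    peak_flops_per_gpu: float,
    num_gpus: int,
) -> float:
    """Return achieved model FLOPs/s divided by the aggregate peak FLOPs/s."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = flops_per_token * tokens_per_second
    return achieved_flops_per_second / (peak_flops_per_gpu * num_gpus)


# Hypothetical placeholder numbers; replace with measured throughput and the
# vendor's peak figure for the precision actually used.
mfu = estimate_mfu(
    flops_per_token=dense_train_flops_per_token(4e9),
    tokens_per_second=20_000,
    peak_flops_per_gpu=1e15,
    num_gpus=2,
)
print(f"Estimated MFU: {mfu:.1%}")
```

Comparing this number before and after bumping the batch size (item 2b) gives a more honest signal than the nvidia-smi utilization percentage, which only reports how often some kernel was running.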

Necessity vs. Sufficiency

  1. Necessity: If we ablate the component (i.e. get rid of it), the behavior doesn't work anymore
  2. Sufficiency: If we add it somewhere, it re-elicits the behavior
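
The two tests above can be sketched on a toy example. Everything here is hypothetical: the "model" is just a weighted sum of named feature activations, and the names `READOUT`, `THRESHOLD`, `feat_a`, and `feat_b` are made up for illustration. A real experiment would ablate or inject features inside the network and measure the downstream behavior.

```python
from typing import Dict

# Hypothetical toy readout: behavior score = weighted sum of feature activations.
READOUT: Dict[str, float] = {"feat_a": 2.0, "feat_b": 0.1}
THRESHOLD = 1.0  # assumed cutoff above which the behavior counts as present


def behavior_score(acts: Dict[str, float]) -> float:
    return sum(READOUT[name] * value for name, value in acts.items())


def shows_behavior(acts: Dict[str, float]) -> bool:
    return behavior_score(acts) > THRESHOLD


def ablate(acts: Dict[str, float], name: str) -> Dict[str, float]:
    """Zero-ablate one feature, leaving the rest untouched."""
    return {**acts, name: 0.0}


def inject(acts: Dict[str, float], name: str, value: float) -> Dict[str, float]:
    """Set one feature to a chosen value on an input that lacked it."""
    return {**acts, name: value}


def is_necessary(acts: Dict[str, float], name: str) -> bool:
    # Necessity: the behavior is present, and removing the feature breaks it.
    return shows_behavior(acts) and not shows_behavior(ablate(acts, name))


def is_sufficient(baseline: Dict[str, float], name: str, value: float) -> bool:
    # Sufficiency: the behavior is absent, and adding the feature elicits it.
    return not shows_behavior(baseline) and shows_behavior(inject(baseline, name, value))


prompt_acts = {"feat_a": 1.0, "feat_b": 1.0}    # behavior present
baseline_acts = {"feat_a": 0.0, "feat_b": 1.0}  # behavior absent

print("feat_a necessary:", is_necessary(prompt_acts, "feat_a"))
print("feat_b necessary:", is_necessary(prompt_acts, "feat_b"))
print("feat_a sufficient:", is_sufficient(baseline_acts, "feat_a", 1.0))
```

Keeping the two checks as separate functions makes it easy to report them independently, since a feature can pass one test and fail the other.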