2026-04-15 | PCD Finetuning Padding

Goal: Debug PCD batched finetuning

Summary: Feedback from PCD author Vincent Huang on a batched training issue

Work sessions

In	Out
21:30	22:00

Questions:

PCD Padding Issue

Section 4, finetuning setup: Are the SynthSys dataset "Decoder Question" tokens included in the training loss, or is the training signal only a single multiple-choice-question token?
Practical challenges with (left/right) padding in batched finetuning when the soft-token sequences have different lengths:

a. This is not a problem in pretraining, where the (n_prefix, n_middle, n_suffix) lengths are fixed, but in SynthSys finetuning the System messages and Decoder questions may differ in length.

b. Masking PAD tokens fixes the inconsistent tensor shapes within a batch. However, do the PAD tokens still occupy a token index, and would RoPE therefore compute incorrect relative positions?

The question tokens never appear in the training loss.
You should take the combined soft token + decoder question sequence (without padding each one individually) and left-pad the combined sequence.
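
A minimal sketch of the two points above, in plain Python so that it runs without a model. Everything named here is an assumption for illustration: `PAD_ID`, `SOFT_ID`, `build_example`, `left_pad_batch`, and `has_interior_padding` are hypothetical, the soft tokens are stood in for by a placeholder id (in a real setup they would be embeddings), and the only supervised position is assumed to be the single answer token. Labels are aligned with the inputs, and the one-position shift for next-token prediction is assumed to happen in the training loop.

```python
PAD_ID = 0            # hypothetical padding id
SOFT_ID = 1           # hypothetical placeholder for one soft-token embedding
IGNORE_INDEX = -100   # label value skipped by the loss (PyTorch cross-entropy default)


def build_example(n_soft, question_ids, answer_id):
    """Join soft tokens, question, and answer into one unpadded sequence."""
    tokens = [SOFT_ID] * n_soft + list(question_ids) + [answer_id]
    # Soft tokens and question tokens are excluded from the loss;
    # only the final answer token keeps a label (an assumption of this sketch).
    labels = [IGNORE_INDEX] * (len(tokens) - 1) + [answer_id]
    return tokens, labels


def left_pad_batch(examples):
    """Left-pad already-combined sequences to the longest one in the batch."""
    max_len = max(len(tokens) for tokens, _ in examples)
    batch = {"tokens": [], "labels": [], "attention_mask": [], "position_ids": []}
    for tokens, labels in examples:
        n_pad = max_len - len(tokens)
        batch["tokens"].append([PAD_ID] * n_pad + tokens)
        batch["labels"].append([IGNORE_INDEX] * n_pad + labels)
        batch["attention_mask"].append([0] * n_pad + [1] * len(tokens))
        # Optional choice: restart positions at the first real token.
        # PAD positions get a dummy 0 and are masked out anyway.
        batch["position_ids"].append([0] * n_pad + list(range(len(tokens))))
    return batch


def has_interior_padding(mask):
    """True if a PAD follows a real token, i.e. padding sits between segments."""
    return any(prev == 1 and cur == 0 for prev, cur in zip(mask, mask[1:]))


examples = [
    build_example(n_soft=3, question_ids=[11, 12], answer_id=7),
    build_example(n_soft=2, question_ids=[21], answer_id=8),
]
batch = left_pad_batch(examples)
for row in batch["tokens"]:
    print(row)
```

Because the padding sits only to the left of the combined sequence, the distance between any two real tokens is the same as in the unpadded sequence. With the PAD positions masked, that distance is what RoPE attention scores depend on. Padding the soft tokens and the question separately would instead insert PADs between the segments and change those distances from sample to sample.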