2025-12-01 | Reproducing Neo et al.
Goal: Reproduce Neo et al. 2024
Summary: Allow the tokenizer to process batch size > 1
Work sessions
| In | Out |
|---|---|
| 00:00 | 00:35 |
| 15:30 | 16:30 |
Reproducing Neo et al.
- Allow the tokenizer to handle batches of multiple prompts
Tokenizer Padding ID
Padding: GPT-2 does not have a padding token by default; use the end-of-sequence (EOS) token as the pad token instead.
Thus, when tokenizing two prompts together, padding extends the shorter prompt to the length of the longest prompt in the batch, and the attention mask marks the padded positions.
"hi there" --> [5303, 612, 50256, 50256, 50256] # See that 50256 is the padding token
"what time is it?" --> [10919, 640, 318, 340, 30]
Therefore, the attention mask is zeroed out over the padded positions of "hi there"
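The pad-to-longest behavior above can be sketched in plain Python (illustration only; the actual work uses the Hugging Face GPT-2 tokenizer with `tokenizer.pad_token = tokenizer.eos_token` — the helper name `pad_batch` here is hypothetical):

```python
EOS_ID = 50256  # GPT-2's end-of-sequence token, reused as the pad token

def pad_batch(token_lists):
    """Right-pad each sequence to the longest in the batch with EOS_ID,
    and build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(t) for t in token_lists)
    input_ids = [t + [EOS_ID] * (max_len - len(t)) for t in token_lists]
    attention_mask = [[1] * len(t) + [0] * (max_len - len(t)) for t in token_lists]
    return input_ids, attention_mask

# Token IDs taken from the example above
ids, mask = pad_batch([[5303, 612], [10919, 640, 318, 340, 30]])
# ids[0]  == [5303, 612, 50256, 50256, 50256]
# mask[0] == [1, 1, 0, 0, 0]
```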
- Right-truncate long prompts with `truncation=True` in `tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)`
- OOM (Out of Memory) issues with high-batch-size inference
- Solution: use Colab with VS Code to get GPU support
- Ensure that tensors are on the same device during computation
- Results: 4000 examples in 4 min, vs. 2000 before
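To keep GPU memory bounded at high throughput, the examples can be fed through in fixed-size chunks. A minimal sketch (the helper name `batched` and the batch size are assumptions, not from the run itself):

```python
def batched(examples, batch_size):
    """Yield successive fixed-size chunks of the example list so that
    each forward pass fits in GPU memory; batch_size is tunable."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]

# With 4000 examples and a batch size of 32, this yields 125 chunks.
chunks = list(batched(list(range(4000)), 32))
```

Each chunk would then be tokenized with `padding=True, truncation=True` and moved to the GPU before the forward pass, so all tensors live on the same device.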
Reading List
- Add *How Can Interpretability Researchers Help AGI Go Well?* to reading list
Next Steps
- Finish Section 4.2 (activating neurons)
- Write out the prompts to a serialized form (e.g. JSON/CSV)
- Truncate prompts to be 80% activation only
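One way the JSON serialization step could look, as a sketch with a hypothetical schema (field names `prompt` and `token_ids` are assumptions; token IDs are the examples from above):

```python
import json

# Hypothetical record format: one entry per prompt with its GPT-2 token IDs.
prompts = [
    {"prompt": "hi there", "token_ids": [5303, 612]},
    {"prompt": "what time is it?", "token_ids": [10919, 640, 318, 340, 30]},
]

# Round-trip through JSON to confirm the records survive serialization.
serialized = json.dumps(prompts, indent=2)
restored = json.loads(serialized)
```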