2026-01-16 | ARBOx Day 10 (Last Day :/ )
Goal: Day 10 of ARBOx: Cross-lingual Alignment Generalization + Refusal Task
Summary: Extract steering vectors for language (e.g. German minus English) and apply them to models to test how entangled a concept is with the language
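Sketch of the idea, for future reference. This is a minimal illustration only, assuming a difference-of-means extraction (the notes above do not say which method we used) and using synthetic NumPy arrays as stand-ins for real model activations. The names `language_steering_vector`, `apply_steering`, `cosine`, `d_model`, and `alpha` are all hypothetical.

```python
import numpy as np

def language_steering_vector(acts_target, acts_source):
    """Difference of mean activations: target language minus source language."""
    return acts_target.mean(axis=0) - acts_source.mean(axis=0)

def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to every hidden state (broadcasts over rows)."""
    return hidden + alpha * vector

def cosine(u, v):
    """Cosine similarity, usable as a rough proxy for how aligned two directions are."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for activations at one layer (assumption: NOT real model data).
rng = np.random.default_rng(0)
d_model = 16
lang_offset = np.zeros(d_model)
lang_offset[0] = 2.0  # pretend "German-ness" lives along the first dimension

acts_en = rng.normal(size=(200, d_model))                # "English" prompts
acts_de = rng.normal(size=(200, d_model)) + lang_offset  # "German" prompts

v_lang = language_steering_vector(acts_de, acts_en)      # German minus English
steered = apply_steering(acts_en, v_lang, alpha=1.0)     # push English toward German
```

With real activations, one rough entanglement check would be `cosine(v_lang, v_concept)`, where `v_concept` is a concept direction (e.g. a refusal direction) extracted the same way from contrasting prompts. A value near zero would suggest the two directions are close to orthogonal at that layer.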
Work sessions
| In | Out |
|---|---|
| 10:00 | 18:00 |
Lesson learned: Do a rigorous literature review! We found the paper "Refusal Direction is Universal Across Safety-Aligned Languages"; however, things seem even more complicated! See "LLMs Encode Harmfulness and Refusal Separately".
We also presented the results during the ARBOx final presentations! Will work on the writeup sometime next week!