
2026-01-16 | ARBOx Day 10 (Last Day :/ )

Goal: Day 10 of ARBOx: Cross-lingual Alignment Generalization + Refusal Task

Summary: Extract language steering vectors (e.g., German minus English) and apply them to models to test how entangled a concept is with the language.
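Below is a minimal sketch of how such a language steering vector could be built and applied. It assumes the common difference-of-means recipe: average the activations on German prompts at a single layer, subtract the average on English prompts, then add the scaled vector back into hidden states. The helper names, the scale `alpha`, and the layer choice are hypothetical. The activations are synthetic random arrays, not outputs from a real model.

```python
import numpy as np


def mean_difference_vector(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Difference-of-means steering vector: mean(acts_a) - mean(acts_b).

    acts_a, acts_b: shape (n_prompts, d_model), e.g. residual-stream
    activations at one (hypothetical) layer for German and English prompts.
    """
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)


def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add alpha * vector to every hidden state (last axis is d_model)."""
    return hidden + alpha * vector


def projection_onto(hidden: np.ndarray, vector: np.ndarray) -> np.ndarray:
    """Scalar projection of each hidden state onto the unit steering direction."""
    unit = vector / np.linalg.norm(vector)
    return hidden @ unit


# Synthetic stand-in data: "German" activations are "English" ones shifted
# along a single (made-up) language direction.
rng = np.random.default_rng(0)
d_model = 16
english_acts = rng.normal(size=(100, d_model))
language_shift = np.zeros(d_model)
language_shift[0] = 3.0
german_acts = english_acts + language_shift

v = mean_difference_vector(german_acts, english_acts)
steered = steer(english_acts, v, alpha=1.0)

# After steering, English activations project onto the language direction
# about as strongly as German ones do.
print(projection_onto(english_acts, v).mean(), projection_onto(steered, v).mean())
```

On real models, the entanglement question becomes whether adding `v` changes only the output language or also shifts a concept-level behavior. Whether a refusal direction survives the shift is one example.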

Work sessions

In: 10:00 / Out: 18:00

Lesson learned: Do a rigorous literature review! We came across the paper Refusal Direction is Universal Across Safety-Aligned Languages; however, things seem even more complicated! See LLMs Encode Harmfulness and Refusal Separately.

We also presented the results at the ARBOx final presentations! Will work on the writeup sometime next week!