Multilingual Semantics Probe
Version 1: 2025-12-18 - 2026-01-03
Timelog
To timebox this project, we'll follow the format of a 16-20 hour project in the style of Nanda's MATS stream.
Hopefully, this rigorous/pragmatic approach can lead to some ambitious outcomes in understanding model representations of the scope ambiguity phenomenon.
| Hour | Progress |
| --- | --- |
| 0-1 | Scaffold project using GPT Project |
| 1-2 | Set up GitHub, generate stimuli, start understanding log probs |
| 2-5.5 | Understand log probs, vectorized logic for continuation log probs |
| 5.5-7.5 | Set up aggregation/comparison of log probs for surface vs. inverse prompts |
| 7.5-10.5 | Debug scripts to add (1) bfloat16 support for large models (>27b) and (2) comparison of EN vs. ZH |
| 10.5-12 | Read Fang et al. 2025 and Schut et al. 2025; test GPT-2-Chinese (uer_gpt2-xlarge-chinese-cluecorpussmall), aya-23-35B, Gemma2-27B, and aya-expanse-32b |
| 12-12.5 | Verified that Existential-Universal and Universal-Existential show the same continuation preferences |
| 12.5-13 | Find that English preference for inverse and Chinese preference for surface is statistically significant (p=1.948e-18) |
| 13-14 | Learn and document math of Steering Vectors |
| 14-15 | Setback--GPT2-Chinese is unreliable |
| 15-16 | Qwen2.5 Surface/Inverse Scope Judgements; continue documenting Steering Vector math |
| 17-20 | Look for Steering Vectors |
High Level Summary
I am interested in investigating how (multilingual) LLMs represent Quantifier Scope Ambiguity cross-linguistically.
Examples:
| Language | Sentence | Expected Interpretation |
| --- | --- | --- |
| English | A shark ate every pirate | Ambiguous: (1) There is exactly 1 shark (Surface Scope); (2) There are 1 or more sharks (Inverse Scope) |
| Mandarin | yǒu yì tiáo shāyú chī le měi yí gè hǎidào (有一條鯊魚吃了每一個海盜) | Unambiguous: (1) There is exactly 1 shark (Surface Scope) |
Research Questions
- Do LLMs allow for Ambiguous (Surface/Inverse) quantifier scope in English but only Surface Scope in Mandarin?
- Recent research on model size has found that "shared circuitry increases with model scale" (Brinkmann et al. 2025); does model size impact whether an LLM applies the correct language-specific interpretation?
- E.g. larger models with a multi-lingual semantic space would incorrectly apply the Inverse scope to Mandarin while a small model wouldn't
- Is there an interpretable hidden representation for quantifier scope? Can this be used to steer the model to get a specific interpretation in a particular language?
Hypothesis
Given a multiple-choice situation (please choose exactly 1 answer, (1) there is exactly 1 shark | (2) there are 1 or more sharks):
- Larger models will over-generalize English Inverse Scope semantics to Mandarin while smaller models will correctly apply language-specific semantic interpretations
Relevant Papers
-
Scontras et al. 2017: Cross-linguistic scope ambiguity: When two systems meet
a. Double quantifier scope ambiguity in Mandarin-English Heritage Bilinguals
-
Brinkmann et al. 2025: Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
a. Larger models can represent multi-lingual concepts about morphosyntactic category (even when predominantly trained on English)
-
Claude Haiku Multilingual Circuits
a. Section 5.6 shows that English is privileged, though multilingual representations do exist
-
Fang et al. 2025: Quantifier Scope Interpretation in Language Learners and LLMs
a. Exactly the same as Scontras et al. 2017 applied to humans; gives insightful models to try
-
Schut et al. 2025: Do Multilingual LLMs Think In English?
a. Evidence that multilingual models reason in an English centric way
b. Introduction to the steering vector concept
Project Scaffold
- Create a stimulus dataset of inverse/surface quantifier scope with various lexical items (words, aka not just sharks and pirates)
- Create pipeline for evaluating model log-probs for surface/inverse scope
- Inspect models that show inverse scope (interp techniques TBD, keeping pragmatism in mind; likely linear probe + steering vector)
Steering Vectors Math
- The goal of this project is not just to find a correlation between a model's hidden representation and the quantifier scope feature (in doubly quantified sentences), but to test causality through steering vectors.
First principles derived method for looking into steering vectors
Two key assumptions:
Given a model's hidden representation in the residual stream \(h_{l} \in \mathbb{R}^d\) in layer \(l\):
-
Hidden State h as combination of direction w and noise
- Assume the hidden state is composed of the direction \(w\) and other info/noise \(\epsilon\)
\[h_l = z \cdot w + \epsilon\]
a. \(z \in \{-1, +1\}\): \(z = -1\) if the feature is on (e.g. inverse scope), \(z = +1\) if the feature is off (e.g. surface scope)
b. \(w \in \mathbb{R}^d\) -> direction in activation space of a feature
c. \(\epsilon\) is unrelated information to the direction \(w\) in activation space / noise
-
Difference of means method to find steering vector \(w\)
- Assume that subtracting the class-average hidden representations \(h_{l,i}\) of the two classes {surface, inverse} yields the direction vector
a. let \(i \in S\) be surface examples and \(i \in I\) be inverse examples
b. Plug in the \(h_l\) formula to get the right-hand sides
\[ \mu_S = \frac{1}{|S|} \sum_{i \in S} h_{l,i} = \frac{1}{|S|} \sum_{i \in S} (z_i \cdot w + \epsilon_i) \approx (+1) w + \mathbb{E}[\epsilon]\]
\[ \mu_I = \frac{1}{|I|} \sum_{i \in I} h_{l,i} = \frac{1}{|I|} \sum_{i \in I} (z_i \cdot w + \epsilon_i) \approx (-1) w + \mathbb{E}[\epsilon]\]
Problem: now we have this pesky \(\mathbb{E}[\epsilon]\) term, but we want \(w\)! Since the \(\mathbb{E}[\epsilon]\) term occurs in both \(\mu_S\) and \(\mu_I\), subtracting them cancels it and gives us \(v \approx 2w\)
\[v = \mu_S - \mu_I \approx 2w\]
-
Causal Steering via Intervention: Test to see if \(w\) can change the model's output
- Downstream operations on \(h_l\) are a combination of linear mappings/non-linear activations. Thus, we can make a simplifying assumption of how \(h_l\) is used by the model
\[\text{model information from h} \approx w^{\top} h_l\]
- Intervene by turning on/off the direction
\[h_{l, i}' = h_{l,i} + \alpha v\]
- Mathematically, adding the direction to the representation shifts the information the model reads from it
\[\text{model information from h}' \approx w^{\top} h_{l,i}' = w^{\top} (h_{l,i} + \alpha v) = w^{\top}h_{l,i} + \alpha w^{\top}v \]
-
Similarity of direction w and hidden state h
- Take the dot product of \(v\) and \(h_{l,i}\) to see whether the hidden state and steering vector point in the same direction (are aligned)
\[ p_i = v^{\top} h_{l,i} \]
- Since \(v = \mu_S - \mu_I\) points from inverse toward surface, a large positive \(p_i\) means \(h_{l,i}\) encodes surface scope, a large negative \(p_i\) means \(h_{l,i}\) encodes inverse scope, and \(p_i \approx 0\) means \(h_{l,i}\) is not related to scope.
Mathematically, this works out as:
\[ p_i = v^{\top} h_{l,i} \]
\[ \approx (2w)^{\top} (z_i w + \epsilon_i)\]
\[ = (2w)^{\top} (z_i w) + (2w)^{\top} \epsilon_i \]
Since we assume that \(\mathbb{E}[\epsilon_i] = 0\) (or perhaps that it's constant), the noise term drops out on average and we are reduced to
\[ \approx (2w)^{\top} (z_i w) \]
\[ p_i \approx 2 z_i ||w||^{2} \]
a. Thus, if \(i \in S\) (denotes a surface scope sentence), then \(z_i = (+1)\) and \(p_i\) should be positive.
b. Thus, if \(i \in I\) (denotes an inverse scope sentence), then \(z_i = (-1)\) and \(p_i\) should be negative.
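
The derivation above can be checked on synthetic data. The following is a minimal sketch, not the project's actual code: it assumes hidden states are generated exactly as \(h = z \cdot w + \epsilon\) with Gaussian noise, and all names (`make_hidden`, `steer`, the dimension `d`, the noise scale) are hypothetical. It uses plain Python lists for self-containment; a real pipeline would presumably operate on residual-stream tensors.

```python
import random

def mean_vec(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_hidden(z, w, rng, noise=0.5):
    """Synthetic hidden state h = z * w + eps (assumed generative model)."""
    return [z * wi + rng.gauss(0.0, noise) for wi in w]

def steer(h, v, alpha):
    """Intervention h' = h + alpha * v."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]

rng = random.Random(0)
d = 16
w = [rng.gauss(0.0, 1.0) for _ in range(d)]  # hypothetical "true" scope direction

surface = [make_hidden(+1, w, rng) for _ in range(200)]  # z = +1
inverse = [make_hidden(-1, w, rng) for _ in range(200)]  # z = -1

# Difference of means: v = mu_S - mu_I, which should be close to 2w.
mu_s, mu_i = mean_vec(surface), mean_vec(inverse)
v = [a - b for a, b in zip(mu_s, mu_i)]

# Projections p_i = v^T h_i: positive for surface, negative for inverse.
p_surface = [dot(v, h) for h in surface]
p_inverse = [dot(v, h) for h in inverse]

# Steering surface examples with alpha = -1 pushes them toward inverse.
steered = [steer(h, v, alpha=-1.0) for h in surface]
p_steered = [dot(v, h) for h in steered]

print(sum(p_surface) / len(p_surface), sum(p_inverse) / len(p_inverse))
print(sum(p_steered) / len(p_steered))
```

On real activations the noise is not isotropic and the feature may not be linear, so this only illustrates the algebra, not whether a scope direction exists in a model.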
Progress Details
Hour 0-1: Scaffolding with AI
1 hour brainstorming session with GPT on roadmap for executing the experiment
-
Started looking into the interpretability technique (linear probe) contenders and models (Gemma, Llama, Qwen, DeepSeek)
-
A new TODO is to understand how Reasoning models work mathematically (high level intuition)
Hour 1-2: Setup Github, generate stimuli, start understanding log probs
-
Evaluation methods: Use same-language continuations instead of MCQs to test latent linguistic knowledge rather than meta-linguistic judgement. Specifically, MCQs may test a model's reasoning capabilities, which are often skewed towards English reasoning; looking at that likely-English reasoning space would not tell us much about the model's underlying representations.
-
Set up an ipynb to generate stimuli.jsonl and stimuli_with_continuations.jsonl
-
Next Step: Understand how to compare continuations log probs from first principles
Hour 2-5.5: Understand log probs, vectorized logic for Continuations Log Probs
-
Work out Log Probs/comparisons from first principles on pencil and paper
-
Extract Log Probs of inverse/surface continuations through a batched operation
-
Probably should take more breaks/sleep when working on this instead of grinding through the night
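
As a reminder of what "log probs of a continuation" means here, the sketch below scores each continuation token under a causal LM: the logits at position t-1 are turned into a log-softmax distribution, and the log probability of the actual token at position t is read off. This is an unbatched, pure-Python illustration under assumed inputs; the names `logits_per_position`, `token_ids`, and `prompt_len` are hypothetical, and the real pipeline does this as a batched tensor operation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def continuation_logprobs(logits_per_position, token_ids, prompt_len):
    """Per-token log probs of token_ids[prompt_len:].

    logits_per_position[t] are the logits produced after reading
    token_ids[: t + 1], i.e. the prediction for token t + 1.
    Requires prompt_len >= 1.
    """
    out = []
    for t in range(prompt_len, len(token_ids)):
        lp = log_softmax(logits_per_position[t - 1])
        out.append(lp[token_ids[t]])
    return out

# Toy example: vocabulary of 4 tokens, 2 prompt tokens + 2 continuation tokens.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # predicts token at position 2
    [1.0, 2.0, 0.5, 3.0],   # predicts token at position 3
    [0.0, 0.0, 0.0, 0.0],
]
toy_tokens = [0, 2, 1, 3]
toy_lps = continuation_logprobs(toy_logits, toy_tokens, prompt_len=2)
print(toy_lps, sum(toy_lps), sum(toy_lps) / len(toy_lps))
```

The sum favors shorter continuations, while the mean normalizes for length; both are kept in later steps.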
Hour 5.5-7.5: Setup aggregation/comparing Log Probs for surface vs. inverse Prompts
-
Setup pipeline to compare surface vs. inverse log sum/mean difference and odds (exponentiate the log differences)
-
Save outputs for model specifics
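
A minimal sketch of the per-item comparison described above. The sign convention (delta = inverse minus surface, so negative values mean a surface preference) is an assumption inferred from the result tables, where negative `delta_mean` values coincide with surface preference and `ratio_mean` matches the exponentiated delta. The function name `compare_scopes` and the dictionary keys are hypothetical.

```python
import math
from statistics import mean

def compare_scopes(surface_lps, inverse_lps):
    """Compare per-token log probs of the surface vs. inverse continuation."""
    delta_sum = sum(inverse_lps) - sum(surface_lps)
    delta_mean = mean(inverse_lps) - mean(surface_lps)
    return {
        "delta_sum": delta_sum,
        "delta_mean": delta_mean,
        # Exponentiating a log difference gives an odds-style ratio.
        "ratio_mean": math.exp(delta_mean),
        "preference": "inverse" if delta_mean > 0 else "surface",
    }

example = compare_scopes(surface_lps=[-1.0, -1.0], inverse_lps=[-2.0, -2.0])
print(example)
```

A ratio below 1 means the surface continuation is more likely per token; a ratio above 1 means the inverse continuation is.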
The key finding from working on GPT-2 is that the model always prefers surface scope. Next, we will try to see if this extrapolates to other models that have more than just pretraining.
Additional follow up:
- After running some experiments as background jobs, the models
Qwen3-0.6B, Llama-3.2-1B, and Llama-3.2-1B-Instruct
- Interestingly enough, gemma-3-1b prefers inverse scope for Mandarin for most examples, whereas for English both surface and inverse are preferred. This runs counter to intuitions from natural language semantics, where inverse scope is not available in Mandarin
- This could be due to pragmatics ("a child made every parent smile" is more likely to have inverse scope than "a president made every citizen happy")
TODO:
-
Perhaps investigate this pragmatic influence with MCQ style questions?
-
Investigate whether the prompts in different languages show the same inverse/surface scope; if both the Mandarin and English prompt translations give the same scope, this is likely an effect of pragmatics (and the most "likely"/"first" interpretation)
Hour 7.5-10.5: Debug scripts to add (1) bfloat16 support for large models (>27b) and (2) comparison of en vs. zh
-
Models of 12b and larger (e.g. Gemma-3-12b) are too large to run on a High-RAM A100 on Colab in fp32
- Some back of the napkin calculation for the weights alone
$$ 12 \times 10^{9} \text{ params} \cdot \frac{32 \text{ bits}}{1 \text{ param}} \cdot \frac{1 \text{ byte}}{8 \text{ bits}} \cdot \frac{1 \text{ GB}}{10^{9} \text{ bytes}} = 48 \text{ GB}$$
- An A100 High-RAM runtime has 167 GB of CPU RAM but only 80 GB of HBM (GPU RAM), so FP32 does not fit on device in practice: 48 GB of weights leave limited headroom for activations and batched logits, and the FP32 weights of a 27b model (about 108 GB) exceed the device on their own
- Pivoting to FP16 halves the weight footprint to 24 GB, but the decreased range causes logits to go to NaNs
- Thus, using BF16, which keeps the halved footprint and the FP32 exponent range, solves the memory footprint and range issues
- We also upcast logits to FP32 during post-processing
-
The results in the tables below show that for models of 4B and larger, surface is the majority preference in both languages; the smaller models (270M and 1B) prefer surface for en and incorrectly prefer inverse for zh
| Model | en Surface (of 64) | en Inverse (of 64) | zh Surface (of 64) | zh Inverse (of 64) | Takeaway |
| --- | --- | --- | --- | --- | --- |
| Gemma-3-27B | 59 | 5 | 51 | 13 | Strong surface preference in both English and Mandarin; large model behaves conservatively and consistently across languages. |
| Gemma-3-12B | 51 | 13 | 37 | 27 | Surface preference remains, but Mandarin shows degradation and increased inverse scope relative to English. |
| Gemma-3-4B | 47 | 17 | 41 | 23 | Both languages show weakened surface bias; Mandarin drifts further toward inverse interpretations. |
| Gemma-3-1B | 45 | 19 | 14 | 50 | English still surface-biased, but Mandarin strongly prefers inverse, the opposite of the theoretical expectation for small models. |
| Gemma-3-270M | 64 | 0 | 29 | 35 | English collapses entirely to surface scope; Mandarin slightly prefers inverse, showing extreme cross-lingual divergence. |
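
The back-of-the-napkin memory estimate from earlier in this hour's notes can be written as a small helper. This is a weights-only sketch (parameters times bytes per parameter); activations, KV cache, and the logits tensor add to it. The helper name `weight_gb` and the dtype table are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def weight_gb(n_params, dtype):
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n_params in [("12b", 12e9), ("27b", 27e9)]:
    for dtype in ("fp32", "fp16", "bf16"):
        print(f"{name} {dtype}: {weight_gb(n_params, dtype):.0f} GB")
```

FP16 and BF16 have the same footprint; they differ in how the 16 bits are split between exponent and mantissa, which is why BF16 avoids the overflow seen with FP16.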
Gemma-3-27b results
Takeaway: Model prefers surface form for both zh and en
| model |
language |
inverse |
surface |
total |
p_inverse |
| google_gemma-3-27b-it |
en |
5 |
59 |
64 |
7.8% |
| google_gemma-3-27b-it |
zh |
13 |
51 |
64 |
20.3% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| google_gemma-3-27b-it |
en |
64 |
-0.858 |
-0.863 |
0.546 |
64 |
0.491 |
0.422 |
0.281 |
| google_gemma-3-27b-it |
zh |
64 |
-0.961 |
-1.098 |
1.238 |
64 |
0.842 |
0.333 |
1.313 |
| model |
agreement_rate |
| google_gemma-3-27b-it |
71.9% |
| model |
pattern |
count |
| google_gemma-3-27b-it |
surface_EN__surface_ZH |
46 |
| google_gemma-3-27b-it |
surface_EN__inverse_ZH |
13 |
| google_gemma-3-27b-it |
inverse_EN__surface_ZH |
5 |
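
For reference, the `agreement_rate` and `pattern` tables can be reproduced from per-item preferences as sketched below. This assumes that agreement means "the en and zh versions of the same item prefer the same scope", which is consistent with the Gemma-3-27b numbers above (46 of 64 items agree, i.e. 71.9%). The function name `cross_lingual_patterns` is hypothetical, and the per-item lists are reconstructed from the pattern counts rather than taken from the raw outputs.

```python
from collections import Counter

def cross_lingual_patterns(en_prefs, zh_prefs):
    """Count EN/ZH preference patterns and the share of items that agree."""
    pairs = list(zip(en_prefs, zh_prefs))
    patterns = Counter(f"{en}_EN__{zh}_ZH" for en, zh in pairs)
    agreement_rate = sum(en == zh for en, zh in pairs) / len(pairs)
    return patterns, agreement_rate

# Per-item preferences rebuilt from the Gemma-3-27b pattern counts.
en = ["surface"] * 46 + ["surface"] * 13 + ["inverse"] * 5
zh = ["surface"] * 46 + ["inverse"] * 13 + ["surface"] * 5

patterns, rate = cross_lingual_patterns(en, zh)
print(patterns.most_common(), f"{rate:.1%}")
```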
Gemma-3-12b results
Takeaway: Model prefers surface form for both zh and en; zh degraded performance
| model |
language |
inverse |
surface |
total |
p_inverse |
| google_gemma-3-12b-it |
en |
13 |
51 |
64 |
20.3% |
| google_gemma-3-12b-it |
zh |
27 |
37 |
64 |
42.2% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| google_gemma-3-12b-it |
en |
64 |
-0.496 |
-0.549 |
0.521 |
64 |
0.699 |
0.578 |
0.392 |
| google_gemma-3-12b-it |
zh |
64 |
0.13 |
-0.517 |
1.634 |
64 |
4.511 |
0.598 |
7.996 |
| model |
agreement_rate |
| google_gemma-3-12b-it |
37.5% |
| model |
pattern |
count |
| google_gemma-3-12b-it |
surface_EN__inverse_ZH |
27 |
| google_gemma-3-12b-it |
surface_EN__surface_ZH |
24 |
| google_gemma-3-12b-it |
inverse_EN__surface_ZH |
13 |
Gemma-3-4b results
Takeaway: Both zh and en still prefer surface overall, but inverse is chosen somewhat more often than in Gemma-3-27b (en 26.6%, zh 35.9%). Degradation in Mandarin performance
| model |
language |
inverse |
surface |
total |
p_inverse |
| google_gemma-3-4b-it |
en |
17 |
47 |
64 |
26.6% |
| google_gemma-3-4b-it |
zh |
23 |
41 |
64 |
35.9% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| google_gemma-3-4b-it |
en |
64 |
-0.386 |
-0.602 |
1.155 |
64 |
1.857 |
0.548 |
4.793 |
| google_gemma-3-4b-it |
zh |
64 |
-0.849 |
-0.861 |
1.955 |
64 |
1.602 |
0.423 |
2.271 |
| model |
agreement_rate |
| google_gemma-3-4b-it |
56.2% |
| model |
pattern |
count |
| google_gemma-3-4b-it |
surface_EN__surface_ZH |
30 |
| google_gemma-3-4b-it |
surface_EN__inverse_ZH |
17 |
| google_gemma-3-4b-it |
inverse_EN__surface_ZH |
11 |
| google_gemma-3-4b-it |
inverse_EN__inverse_ZH |
6 |
Gemma-3-1b results
Takeaway: zh heavily prefers inverse while en prefers surface. Opposite of expected behavior; theoretically if aligned with Brinkmann et al. 2025, then smaller models would prefer only surface form.
| model |
language |
inverse |
surface |
total |
p_inverse |
| google_gemma-3-1b-it |
en |
19 |
45 |
64 |
29.7% |
| google_gemma-3-1b-it |
zh |
50 |
14 |
64 |
78.1% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| google_gemma-3-1b-it |
en |
64 |
-0.432 |
-0.416 |
0.622 |
64 |
0.772 |
0.66 |
0.455 |
| google_gemma-3-1b-it |
zh |
64 |
0.704 |
0.492 |
0.934 |
64 |
3.56 |
1.636 |
5.84 |
| model |
agreement_rate |
| google_gemma-3-1b-it |
45.3% |
| model |
pattern |
count |
| google_gemma-3-1b-it |
surface_EN__inverse_ZH |
33 |
| google_gemma-3-1b-it |
inverse_EN__inverse_ZH |
17 |
| google_gemma-3-1b-it |
surface_EN__surface_ZH |
12 |
| google_gemma-3-1b-it |
inverse_EN__surface_ZH |
2 |
Gemma-3-270m results
Takeaway: en only predicts surface, while zh shows a slight preference for inverse
| model |
language |
inverse |
surface |
total |
p_inverse |
| google_gemma-3-270m-it |
en |
0 |
64 |
64 |
0.0% |
| google_gemma-3-270m-it |
zh |
35 |
29 |
64 |
54.7% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| google_gemma-3-270m-it |
en |
64 |
-4.226 |
-3.902 |
1.644 |
64 |
0.04 |
0.02 |
0.057 |
| google_gemma-3-270m-it |
zh |
64 |
0.386 |
0.376 |
2.904 |
64 |
24.851 |
1.46 |
62.733 |
| model |
agreement_rate |
| google_gemma-3-270m-it |
45.3% |
| model |
pattern |
count |
| google_gemma-3-270m-it |
surface_EN__inverse_ZH |
35 |
| google_gemma-3-270m-it |
surface_EN__surface_ZH |
29 |
Hour 10.5-12: Read Fang et al. 2025 and Schut et al. 2025, test GPT-2-Chinese (uer_gpt2-xlarge-chinese-cluecorpussmall), aya-23-35B, Gemma2-27B, and aya-expanse-32b
Based on
1. Schut et al. 2025: Do Multilingual LLMs Think In English?
a. LLMs use English representations to reason
2. Fang et al.: Quantifier Scope Interpretation in Language Learners and LLMs
Find that gpt2-chinese prefers surface for Chinese and inverse for English!
aya-23-35B results
| model |
language |
inverse |
surface |
total |
p_inverse |
| CohereLabs_aya-23-35B |
en |
5 |
59 |
64 |
7.8% |
| CohereLabs_aya-23-35B |
zh |
31 |
33 |
64 |
48.4% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| CohereLabs_aya-23-35B |
en |
64 |
-0.89 |
-0.964 |
0.485 |
64 |
0.468 |
0.382 |
0.283 |
| CohereLabs_aya-23-35B |
zh |
64 |
0.035 |
-0.018 |
0.389 |
64 |
1.111 |
0.982 |
0.405 |
| model |
agreement_rate |
| CohereLabs_aya-23-35B |
56.2% |
| model |
pattern |
count |
| CohereLabs_aya-23-35B |
surface_EN__surface_ZH |
32 |
| CohereLabs_aya-23-35B |
surface_EN__inverse_ZH |
27 |
| CohereLabs_aya-23-35B |
inverse_EN__inverse_ZH |
4 |
| CohereLabs_aya-23-35B |
inverse_EN__surface_ZH |
1 |
gpt2-xlarge-chinese-cluecorpussmall results
| model |
language |
inverse |
surface |
total |
p_inverse |
| uer_gpt2-xlarge-chinese-cluecorpussmall |
en |
64 |
0 |
64 |
100.0% |
| uer_gpt2-xlarge-chinese-cluecorpussmall |
zh |
0 |
64 |
64 |
0.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| uer_gpt2-xlarge-chinese-cluecorpussmall |
en |
64 |
0.471 |
0.467 |
0.08 |
64 |
1.607 |
1.596 |
0.129 |
| uer_gpt2-xlarge-chinese-cluecorpussmall |
zh |
64 |
-0.703 |
-0.715 |
0.243 |
64 |
0.51 |
0.489 |
0.125 |
| model |
agreement_rate |
| uer_gpt2-xlarge-chinese-cluecorpussmall |
0.0% |
| model |
pattern |
count |
| uer_gpt2-xlarge-chinese-cluecorpussmall |
inverse_EN__surface_ZH |
64 |
aya-expanse-32b results
| model |
language |
inverse |
surface |
total |
p_inverse |
| CohereLabs_aya-expanse-32b |
en |
0 |
64 |
64 |
0.0% |
| CohereLabs_aya-expanse-32b |
zh |
61 |
3 |
64 |
95.3% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| CohereLabs_aya-expanse-32b |
en |
64 |
-1.681 |
-1.636 |
0.366 |
64 |
0.198 |
0.195 |
0.071 |
| CohereLabs_aya-expanse-32b |
zh |
64 |
0.82 |
0.922 |
0.487 |
64 |
2.532 |
2.514 |
1.198 |
| model |
agreement_rate |
| CohereLabs_aya-expanse-32b |
4.7% |
| model |
pattern |
count |
| CohereLabs_aya-expanse-32b |
surface_EN__inverse_ZH |
61 |
| CohereLabs_aya-expanse-32b |
surface_EN__surface_ZH |
3 |
gpt2-chinese-cluecorpussmall (aka small) results
| model |
language |
inverse |
surface |
total |
p_inverse |
| uer_gpt2-chinese-cluecorpussmall |
en |
64 |
0 |
64 |
100.0% |
| uer_gpt2-chinese-cluecorpussmall |
zh |
0 |
64 |
64 |
0.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| uer_gpt2-chinese-cluecorpussmall |
en |
64 |
1.014 |
1.036 |
0.206 |
64 |
2.813 |
2.819 |
0.55 |
| uer_gpt2-chinese-cluecorpussmall |
zh |
64 |
-0.923 |
-0.871 |
0.185 |
64 |
0.404 |
0.418 |
0.071 |
| model |
agreement_rate |
| uer_gpt2-chinese-cluecorpussmall |
0.0% |
| model |
pattern |
count |
| uer_gpt2-chinese-cluecorpussmall |
inverse_EN__surface_ZH |
64 |
Hour 12-13: Verify (1) Existential-Universal and Universal-Existential preference (2) statistical significance
-
Universal-Existential and Existential-Universal
a. Verify that the surface/inverse scope preference holds for both the Existential-Universal and the Universal-Existential tests
-
Using a Wilcoxon signed-rank test
| Scope Type | Language | p-value |
| --- | --- | --- |
| Existential-Universal | en | 1.948e-18 |
| Existential-Universal | zh | 1.948e-18 |
| Universal-Existential | en | 1.674e-15 |
| Universal-Existential | zh | 1.948e-18 |
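
To make the test concrete, here is a minimal sketch of a Wilcoxon signed-rank test on per-item deltas, using the normal approximation without a continuity or tie-variance correction. It is an illustration of the statistic, not the routine that produced the p-values above (which library and which options were used is not recorded here), so its p-values need not match the table. The function name and the choice of a two-sided p-value are assumptions.

```python
import math

def wilcoxon_signed_rank(deltas):
    """Return (statistic, two-sided p) using the normal approximation."""
    d = [x for x in deltas if x != 0]  # zero differences are dropped
    n = len(d)
    if n == 0:
        raise ValueError("all differences are zero")

    # Rank |d| ascending, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    stat = min(w_plus, w_minus)

    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (stat - mu) / sigma
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return stat, p_two_sided

# 100 items that all favor surface (negative delta) give a statistic of 0.
stat_all_surface, p_all_surface = wilcoxon_signed_rank([-(i + 1) / 100 for i in range(100)])
print(stat_all_surface, p_all_surface)
```

A statistic of 0 means every item points the same way, which is the pattern behind the smallest p-values in the tables.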
Hour 13-14: Learn and document math of Steering Vectors
Potential Mechanistic Interpretability techniques:
The results above are super exciting! In fact, this will be my first time working on any SOTA probe type of technology from the MechInterp literature. A key focus right now is to make sure that I stay pragmatic considering I'm already at hour 13 of this project
Suggested approaches from AI:
1. Linear probes (per layer, per token)
2. Difference-of-means direction
3. Steering intervention
4. One clean ablation experiment
Note: I haven't ever used any of these techniques so to me it is more important to also learn what these techniques imply/shortcomings.
Side note on hypotheses: I assumed that we are looking for a scope vector represented in the model. Ideally this scope vector would be active in both the zh and en sentences (available cross-linguistically) and be steerable, so that we can cause an inverse scope preference for the continuation. (Since these GPT-2 models are only pretrained and not instruction tuned, it's not possible to affect the model's QA/instruction-following capabilities.)
Derived the math for Steering Vectors from first principles with help from AI.
Hour 14-15: Setback--GPT2-Chinese is unreliable
Turns out the clean results from the GPT2-Chinese judgements are a result of the model being undertrained.
For example, the generations below are the most likely next tokens from uer/gpt2-xlarge-chinese-cluecorpussmall. I became suspicious when I noticed that the vocab size of the GPT-2-Chinese models is ~20k, whereas the original GPT-2 vocab is ~50k, which sent me down a probing route. I am surprised that Fang et al. were able to use these models for their double quantifier paper.
Thus, it is likely that the model does not actually have a good understanding of the language.
'a shark ate everyday.the##n.a.n.a.a.a.a.a.a.a.a.a.a.a.a.a.'
每一条都是经过反复推敲的,不是随便说说的。[UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK][UNK]
qwen2.5-3B outputs are quite reasonable:
每一條路線的車站數量、車站名稱、車站位置、車站間的距離、車站間的時間、車站間的運行時間、車站間的運行速度、車站間的運
The number of stations on each route; the station names; station locations; the distances between stations; the time between stations; the operating time between stations; the operating speed between stations;
A Shark ate every pirate on a ship. The pirates were divided into 3 groups. The first group had 10 pirates, the second group had 15 pirates, and the third group had 20 pirates. If each pirate had 2 eyes, how
Hour 15-16: Qwen2.5 Surface/Inverse Scope Judgements
Goal: Try SOTA models which likely have high Mandarin proficiency (e.g. produced by Chinese AI Labs)
Note: here we are making a critical assumption that Qwen and DeepSeek models have acquired Mandarin and English grammars; we make this assumption since these models are SOTA. However, we also run each model through a sanity check script to ensure that it produces reasonable completions.
After an hour of experimenting, Qwen2.5-{0.5, 1.5, 3, 7, 14, 32}B follow these trends:
- Existential-Universal (EU): English prefers Surface but accepts some inverse scope readings at every size (7-16 of 100 items, peaking at 1.5B and 3B). Mandarin stimuli almost exclusively prefer surface forms (0 inverse items at 1.5B, 3B, 14B, and 32B; 8 at 0.5B; 4 at 7B)
a. Interestingly, the English inverse preference occurs when the subject is kangaroo. Perhaps some more stimuli are needed to narrow down this behavior
- Universal-Existential (UE): both English and Mandarin show a Surface preference. Surprisingly, Mandarin also accepts inverse scope in UE (22-29 of 100 items at most sizes, 7 at 14B).
- All preferences for surface are significant
Existential-Universal Construction (A … Every…)
| Model | Language | Inverse | Surface |
| --- | --- | --- | --- |
| Qwen2.5-0.5B | en | 7 | 93 |
| Qwen2.5-0.5B | zh | 8 | 92 |
| Qwen2.5-1.5B | en | 14 | 86 |
| Qwen2.5-1.5B | zh | 0 | 100 |
| Qwen2.5-3B | en | 16 | 84 |
| Qwen2.5-3B | zh | 0 | 100 |
| Qwen2.5-7B | en | 10 | 90 |
| Qwen2.5-7B | zh | 4 | 96 |
| Qwen2.5-14B | en | 7 | 93 |
| Qwen2.5-14B | zh | 0 | 100 |
| Qwen2.5-32B | en | 10 | 90 |
| Qwen2.5-32B | zh | 0 | 100 |
Universal-Existential Construction (Every…A …)
| Model | Language | Inverse | Surface |
| --- | --- | --- | --- |
| Qwen2.5-0.5B | en | 6 | 94 |
| Qwen2.5-0.5B | zh | 29 | 71 |
| Qwen2.5-1.5B | en | 9 | 91 |
| Qwen2.5-1.5B | zh | 22 | 78 |
| Qwen2.5-3B | en | 6 | 94 |
| Qwen2.5-3B | zh | 27 | 73 |
| Qwen2.5-7B | en | 3 | 97 |
| Qwen2.5-7B | zh | 24 | 76 |
| Qwen2.5-14B | en | 7 | 93 |
| Qwen2.5-14B | zh | 7 | 93 |
| Qwen2.5-32B | en | 2 | 98 |
| Qwen2.5-32B | zh | 26 | 74 |
Next steps:
- Since our goal is to use these sentences as a way to probe for steering vectors and then test causality, the stats/stimuli need not be perfect.
- Instead, we can look to find those steering vectors to test the influence of hypothesized steering vectors.
Qwen/qwen2.5-0.5b eu scope
| model |
language |
inverse |
surface |
total |
p_inverse |
| Qwen_Qwen2.5-0.5B |
en |
7 |
93 |
100 |
7.0% |
| Qwen_Qwen2.5-0.5B |
zh |
8 |
92 |
100 |
8.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| Qwen_Qwen2.5-0.5B |
en |
100 |
-0.779 |
-0.797 |
0.536 |
100 |
0.528 |
0.45 |
0.292 |
| Qwen_Qwen2.5-0.5B |
zh |
100 |
-0.58 |
-0.564 |
0.621 |
100 |
0.669 |
0.569 |
0.494 |
| model |
agreement_rate |
| Qwen_Qwen2.5-0.5B |
85.0% |
| model |
pattern |
count |
| Qwen_Qwen2.5-0.5B |
surface_EN__surface_ZH |
85 |
| Qwen_Qwen2.5-0.5B |
surface_EN__inverse_ZH |
8 |
| Qwen_Qwen2.5-0.5B |
inverse_EN__surface_ZH |
7 |
| Language |
n |
Statistic |
p-value |
Mean Δ |
Median Δ |
Preference |
| en |
100 |
100 |
3.78e-17 |
-0.7787 |
-0.7974 |
surface |
| zh |
100 |
380 |
8.2e-14 |
-0.5802 |
-0.5642 |
surface |
Qwen/qwen2.5-0.5b ue scope
| model |
language |
inverse |
surface |
total |
p_inverse |
| Qwen_Qwen2.5-0.5B |
en |
6 |
94 |
100 |
6.0% |
| Qwen_Qwen2.5-0.5B |
zh |
29 |
71 |
100 |
29.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| Qwen_Qwen2.5-0.5B |
en |
100 |
-0.428 |
-0.441 |
0.242 |
100 |
0.671 |
0.644 |
0.169 |
| Qwen_Qwen2.5-0.5B |
zh |
100 |
-0.258 |
-0.301 |
0.394 |
100 |
0.833 |
0.74 |
0.323 |
| model |
agreement_rate |
| Qwen_Qwen2.5-0.5B |
69.0% |
| model |
pattern |
count |
| Qwen_Qwen2.5-0.5B |
surface_EN__surface_ZH |
67 |
| Qwen_Qwen2.5-0.5B |
surface_EN__inverse_ZH |
27 |
| Qwen_Qwen2.5-0.5B |
inverse_EN__surface_ZH |
4 |
| Qwen_Qwen2.5-0.5B |
inverse_EN__inverse_ZH |
2 |
| Language |
n |
Statistic |
p-value |
Mean Δ |
Median Δ |
Preference |
| en |
100 |
49 |
8.45e-18 |
-0.4277 |
-0.4407 |
surface |
| zh |
100 |
947 |
2.89e-08 |
-0.2577 |
-0.3014 |
surface |
Qwen/qwen2.5-1.5b eu scope
| model |
language |
inverse |
surface |
total |
p_inverse |
| Qwen_Qwen2.5-1.5B |
en |
14 |
86 |
100 |
14.0% |
| Qwen_Qwen2.5-1.5B |
zh |
0 |
100 |
100 |
0.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| Qwen_Qwen2.5-1.5B |
en |
100 |
-0.413 |
-0.436 |
0.309 |
100 |
0.695 |
0.647 |
0.231 |
| Qwen_Qwen2.5-1.5B |
zh |
100 |
-0.8 |
-0.779 |
0.217 |
100 |
0.459 |
0.459 |
0.093 |
| model |
agreement_rate |
| Qwen_Qwen2.5-1.5B |
86.0% |
| model |
pattern |
count |
| Qwen_Qwen2.5-1.5B |
surface_EN__surface_ZH |
86 |
| Qwen_Qwen2.5-1.5B |
inverse_EN__surface_ZH |
14 |
| Language |
n |
Statistic |
p-value |
Mean Δ |
Median Δ |
Preference |
| en |
100 |
184 |
4.17e-16 |
-0.4127 |
-0.4358 |
surface |
| zh |
100 |
0 |
1.95e-18 |
-0.8003 |
-0.7793 |
surface |
Qwen/qwen2.5-1.5b ue scope
| model |
language |
inverse |
surface |
total |
p_inverse |
| Qwen_Qwen2.5-1.5B |
en |
9 |
91 |
100 |
9.0% |
| Qwen_Qwen2.5-1.5B |
zh |
22 |
78 |
100 |
22.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| Qwen_Qwen2.5-1.5B |
en |
100 |
-0.285 |
-0.282 |
0.218 |
100 |
0.769 |
0.754 |
0.166 |
| Qwen_Qwen2.5-1.5B |
zh |
100 |
-0.432 |
-0.407 |
0.452 |
100 |
0.714 |
0.665 |
0.298 |
| model |
agreement_rate |
| Qwen_Qwen2.5-1.5B |
73.0% |
| model |
pattern |
count |
| Qwen_Qwen2.5-1.5B |
surface_EN__surface_ZH |
71 |
| Qwen_Qwen2.5-1.5B |
surface_EN__inverse_ZH |
20 |
| Qwen_Qwen2.5-1.5B |
inverse_EN__surface_ZH |
7 |
| Qwen_Qwen2.5-1.5B |
inverse_EN__inverse_ZH |
2 |
| Language |
n |
Statistic |
p-value |
Mean Δ |
Median Δ |
Preference |
| en |
100 |
140 |
1.2e-16 |
-0.2854 |
-0.2825 |
surface |
| zh |
100 |
486 |
1.19e-12 |
-0.4319 |
-0.4074 |
surface |
Qwen/qwen2.5-3b eu scope
| model |
language |
inverse |
surface |
total |
p_inverse |
| Qwen_Qwen2.5-3B |
en |
16 |
84 |
100 |
16.0% |
| Qwen_Qwen2.5-3B |
zh |
0 |
100 |
100 |
0.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| Qwen_Qwen2.5-3B |
en |
100 |
-0.394 |
-0.404 |
0.365 |
100 |
0.72 |
0.668 |
0.27 |
| Qwen_Qwen2.5-3B |
zh |
100 |
-0.831 |
-0.861 |
0.247 |
100 |
0.449 |
0.423 |
0.119 |
| model |
agreement_rate |
| Qwen_Qwen2.5-3B |
84.0% |
| model |
pattern |
count |
| Qwen_Qwen2.5-3B |
surface_EN__surface_ZH |
84 |
| Qwen_Qwen2.5-3B |
inverse_EN__surface_ZH |
16 |
| Language |
n |
Statistic |
p-value |
Mean Δ |
Median Δ |
Preference |
| en |
100 |
318 |
1.62e-14 |
-0.3936 |
-0.4037 |
surface |
| zh |
100 |
0 |
1.95e-18 |
-0.8312 |
-0.8612 |
surface |
Qwen/qwen2.5-3b ue scope
| model |
language |
inverse |
surface |
total |
p_inverse |
| Qwen_Qwen2.5-3B |
en |
6 |
94 |
100 |
6.0% |
| Qwen_Qwen2.5-3B |
zh |
27 |
73 |
100 |
27.0% |
| model |
language |
delta_mean__count |
delta_mean__mean |
delta_mean__median |
delta_mean__std |
ratio_mean__count |
ratio_mean__mean |
ratio_mean__median |
ratio_mean__std |
| Qwen_Qwen2.5-3B |
en |
100 |
-0.49 |
-0.475 |
0.309 |
100 |
0.641 |
0.622 |
0.194 |
| Qwen_Qwen2.5-3B |
zh |
100 |
-0.167 |
-0.18 |
0.28 |
100 |
0.878 |
0.836 |
0.241 |
| model |
agreement_rate |
| Qwen_Qwen2.5-3B |
77.0% |
| model |
pattern |
count |
| Qwen_Qwen2.5-3B |
surface_EN__surface_ZH |
72 |
| Qwen_Qwen2.5-3B |
surface_EN__inverse_ZH |
22 |
| Qwen_Qwen2.5-3B |
inverse_EN__inverse_ZH |
5 |
| Qwen_Qwen2.5-3B |
inverse_EN__surface_ZH |
1 |
| Language |
n |
Statistic |
p-value |
Mean Δ |
Median Δ |
Preference |
| en |
100 |
46 |
7.73e-18 |
-0.4904 |
-0.4748 |
surface |
| zh |
100 |
1008 |
9.14e-08 |
-0.1675 |
-0.1796 |
surface |
Qwen/qwen2.5-7b eu scope

| model | language | inverse | surface | total | p_inverse |
|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B | en | 10 | 90 | 100 | 10.0% |
| Qwen_Qwen2.5-7B | zh | 4 | 96 | 100 | 4.0% |

| model | language | delta_mean__count | delta_mean__mean | delta_mean__median | delta_mean__std | ratio_mean__count | ratio_mean__mean | ratio_mean__median | ratio_mean__std |
|---|---|---|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B | en | 100 | -0.64 | -0.688 | 0.363 | 100 | 0.565 | 0.502 | 0.224 |
| Qwen_Qwen2.5-7B | zh | 100 | -0.525 | -0.542 | 0.28 | 100 | 0.614 | 0.581 | 0.172 |

| model | agreement_rate |
|---|---|
| Qwen_Qwen2.5-7B | 86.0% |

| model | pattern | count |
|---|---|---|
| Qwen_Qwen2.5-7B | surface_EN__surface_ZH | 86 |
| Qwen_Qwen2.5-7B | inverse_EN__surface_ZH | 10 |
| Qwen_Qwen2.5-7B | surface_EN__inverse_ZH | 4 |

| Language | n | Statistic | p-value | Mean Δ | Median Δ | Preference |
|---|---|---|---|---|---|---|
| en | 100 | 68 | 1.48e-17 | -0.64 | -0.6885 | surface |
| zh | 100 | 14 | 2.97e-18 | -0.5253 | -0.5424 | surface |

Qwen/qwen2.5-7b ue scope

| model | language | inverse | surface | total | p_inverse |
|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B | en | 3 | 97 | 100 | 3.0% |
| Qwen_Qwen2.5-7B | zh | 24 | 76 | 100 | 24.0% |

| model | language | delta_mean__count | delta_mean__mean | delta_mean__median | delta_mean__std | ratio_mean__count | ratio_mean__mean | ratio_mean__median | ratio_mean__std |
|---|---|---|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B | en | 100 | -0.432 | -0.424 | 0.265 | 100 | 0.672 | 0.655 | 0.179 |
| Qwen_Qwen2.5-7B | zh | 100 | -0.259 | -0.271 | 0.396 | 100 | 0.836 | 0.763 | 0.355 |

| model | agreement_rate |
|---|---|
| Qwen_Qwen2.5-7B | 77.0% |

| model | pattern | count |
|---|---|---|
| Qwen_Qwen2.5-7B | surface_EN__surface_ZH | 75 |
| Qwen_Qwen2.5-7B | surface_EN__inverse_ZH | 22 |
| Qwen_Qwen2.5-7B | inverse_EN__inverse_ZH | 2 |
| Qwen_Qwen2.5-7B | inverse_EN__surface_ZH | 1 |

| Language | n | Statistic | p-value | Mean Δ | Median Δ | Preference |
|---|---|---|---|---|---|---|
| en | 100 | 54 | 9.8e-18 | -0.4321 | -0.4238 | surface |
| zh | 100 | 911 | 1.43e-08 | -0.2593 | -0.2711 | surface |

Qwen/qwen2.5-14b eu scope

| model | language | inverse | surface | total | p_inverse |
|---|---|---|---|---|---|
| Qwen_Qwen2.5-14B | en | 7 | 93 | 100 | 7.0% |
| Qwen_Qwen2.5-14B | zh | 0 | 100 | 100 | 0.0% |

| model | language | delta_mean__count | delta_mean__mean | delta_mean__median | delta_mean__std | ratio_mean__count | ratio_mean__mean | ratio_mean__median | ratio_mean__std |
|---|---|---|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-14B | en | 100 | -0.485 | -0.519 | 0.291 | 100 | 0.643 | 0.595 | 0.196 |
| Qwen_Qwen2.5-14B | zh | 100 | -0.962 | -0.892 | 0.31 | 100 | 0.4 | 0.41 | 0.116 |

| model | agreement_rate |
|---|---|
| Qwen_Qwen2.5-14B | 93.0% |

| model | pattern | count |
|---|---|---|
| Qwen_Qwen2.5-14B | surface_EN__surface_ZH | 93 |
| Qwen_Qwen2.5-14B | inverse_EN__surface_ZH | 7 |

| Language | n | Statistic | p-value | Mean Δ | Median Δ | Preference |
|---|---|---|---|---|---|---|
| en | 100 | 61 | 1.21e-17 | -0.485 | -0.5192 | surface |
| zh | 100 | 0 | 1.95e-18 | -0.9621 | -0.8918 | surface |

Qwen/qwen2.5-14b ue scope

| model | language | inverse | surface | total | p_inverse |
|---|---|---|---|---|---|
| Qwen_Qwen2.5-14B | en | 7 | 93 | 100 | 7.0% |
| Qwen_Qwen2.5-14B | zh | 7 | 93 | 100 | 7.0% |

| model | language | delta_mean__count | delta_mean__mean | delta_mean__median | delta_mean__std | ratio_mean__count | ratio_mean__mean | ratio_mean__median | ratio_mean__std |
|---|---|---|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-14B | en | 100 | -0.274 | -0.258 | 0.205 | 100 | 0.776 | 0.772 | 0.162 |
| Qwen_Qwen2.5-14B | zh | 100 | -0.694 | -0.835 | 0.456 | 100 | 0.554 | 0.434 | 0.259 |

| model | agreement_rate |
|---|---|
| Qwen_Qwen2.5-14B | 92.0% |

| model | pattern | count |
|---|---|---|
| Qwen_Qwen2.5-14B | surface_EN__surface_ZH | 89 |
| Qwen_Qwen2.5-14B | inverse_EN__surface_ZH | 4 |
| Qwen_Qwen2.5-14B | surface_EN__inverse_ZH | 4 |
| Qwen_Qwen2.5-14B | inverse_EN__inverse_ZH | 3 |

| Language | n | Statistic | p-value | Mean Δ | Median Δ | Preference |
|---|---|---|---|---|---|---|
| en | 100 | 131 | 9.26e-17 | -0.2742 | -0.2584 | surface |
| zh | 100 | 61 | 1.21e-17 | -0.6941 | -0.8354 | surface |

Qwen/qwen2.5-32b eu scope

| model | language | inverse | surface | total | p_inverse |
|---|---|---|---|---|---|
| Qwen_Qwen2.5-32B | en | 10 | 90 | 100 | 10.0% |
| Qwen_Qwen2.5-32B | zh | 0 | 100 | 100 | 0.0% |

| model | language | delta_mean__count | delta_mean__mean | delta_mean__median | delta_mean__std | ratio_mean__count | ratio_mean__mean | ratio_mean__median | ratio_mean__std |
|---|---|---|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-32B | en | 100 | -0.415 | -0.401 | 0.329 | 100 | 0.696 | 0.67 | 0.231 |
| Qwen_Qwen2.5-32B | zh | 100 | -1.192 | -1.077 | 0.463 | 100 | 0.334 | 0.341 | 0.135 |

| model | agreement_rate |
|---|---|
| Qwen_Qwen2.5-32B | 90.0% |

| model | pattern | count |
|---|---|---|
| Qwen_Qwen2.5-32B | surface_EN__surface_ZH | 90 |
| Qwen_Qwen2.5-32B | inverse_EN__surface_ZH | 10 |

| Language | n | Statistic | p-value | Mean Δ | Median Δ | Preference |
|---|---|---|---|---|---|---|
| en | 100 | 172 | 2.97e-16 | -0.4151 | -0.4008 | surface |
| zh | 100 | 0 | 1.95e-18 | -1.1921 | -1.0773 | surface |

Qwen/qwen2.5-32b ue scope

| model | language | inverse | surface | total | p_inverse |
|---|---|---|---|---|---|
| Qwen_Qwen2.5-32B | en | 2 | 98 | 100 | 2.0% |
| Qwen_Qwen2.5-32B | zh | 26 | 74 | 100 | 26.0% |

| model | language | delta_mean__count | delta_mean__mean | delta_mean__median | delta_mean__std | ratio_mean__count | ratio_mean__mean | ratio_mean__median | ratio_mean__std |
|---|---|---|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-32B | en | 100 | -0.349 | -0.295 | 0.199 | 100 | 0.719 | 0.744 | 0.132 |
| Qwen_Qwen2.5-32B | zh | 100 | -0.65 | -0.815 | 0.672 | 100 | 0.652 | 0.443 | 0.438 |

| model | agreement_rate |
|---|---|
| Qwen_Qwen2.5-32B | 74.0% |

| model | pattern | count |
|---|---|---|
| Qwen_Qwen2.5-32B | surface_EN__surface_ZH | 73 |
| Qwen_Qwen2.5-32B | surface_EN__inverse_ZH | 25 |
| Qwen_Qwen2.5-32B | inverse_EN__inverse_ZH | 1 |
| Qwen_Qwen2.5-32B | inverse_EN__surface_ZH | 1 |

| Language | n | Statistic | p-value | Mean Δ | Median Δ | Preference |
|---|---|---|---|---|---|---|
| en | 100 | 5 | 2.27e-18 | -0.349 | -0.2952 | surface |
| zh | 100 | 539 | 4.29e-12 | -0.6502 | -0.8149 | surface |

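The Statistic and p-value columns in the tables above are numerically consistent with a Wilcoxon signed-rank test that reports the sum of positive ranks and a one-sided normal-approximation p-value with no continuity or tie correction. For example, a statistic of 0 at n = 100 gives p ≈ 1.95e-18, and a statistic of 46 gives p ≈ 7.73e-18. Whether this is exactly what the analysis script does is an assumption, as is the sign convention that Δ is the inverse score minus the surface score (so a negative Δ means a surface preference). The function names below are hypothetical; this is a standard-library sketch, not the project's code.

```python
import math

def signed_rank_statistic(deltas):
    """Return (W_plus, W_minus, n) for paired differences.

    Zeros are dropped and tied |delta| values share their average rank.
    """
    nonzero = [d for d in deltas if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with item i in absolute value.
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return w_plus, w_minus, len(nonzero)

def one_sided_p_less(w_plus, n):
    """Normal-approximation p-value for H1: deltas are shifted below zero."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

# Every item prefers surface: W+ = 0 at n = 100.
print(f"{one_sided_p_less(0, 100):.3g}")
# Toy differences: one positive delta with the smallest magnitude.
print(signed_rank_statistic([-0.5, -1.0, 0.25, 0.0]))
```

With 100 items the smallest attainable p-value under this approximation is the W+ = 0 case, which is why several zh rows share the same value of 1.95e-18.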
Hour 17-20: Look for Steering Vector
- Goal: look for a steering vector for inverse/surface scope in DQ constructions
- Agnostic on whether the model correctly generalizes the expected Mandarin surface scope only
- Agnostic on whether the steering vector applies cross-linguistically
Steps:
- Find the position of the sentence-final ./。 token before the continuation (where information about the sentence is likely pooled).
- Calculate \(\mu_S\) and \(\mu_I\) for en/zh to get \(v_{en}\) and \(v_{zh}\); normalize each \(v\) into a unit vector.
- Correlation: plot histograms of \(p_i = v^{\top} h_{l,i}\) for \(i \in S\) and \(i \in I\) side by side to see whether the projection onto the steering vector correlates with scope type.
- Causation: intervene by adding \(h_{l,i}' = h_{l,i} + \alpha v\) and see whether the continuation preferences shift.
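The first three steps above can be sketched with NumPy on toy data. This is a minimal sketch under stated assumptions: the hidden states are taken to be already extracted into arrays of shape `(n_items, d_model)` at one layer and at the period position, the direction is defined as \(\mu_I - \mu_S\) (the sign is an assumption), and the helper names `final_period_position`, `mean_difference_direction`, and `projections` are hypothetical. It also assumes the tokenizer emits the period as its own token, which may not hold for every model.

```python
import numpy as np

def final_period_position(tokens, marks=(".", "。")):
    """Index of the last token that is a sentence-final period.

    `tokens` is the tokenized sentence WITHOUT its continuation.
    """
    for pos in range(len(tokens) - 1, -1, -1):
        if tokens[pos].strip() in marks:
            return pos
    raise ValueError("no sentence-final period found")

def mean_difference_direction(h_surface, h_inverse):
    """Unit vector from the surface mean to the inverse mean (assumed sign)."""
    mu_s = h_surface.mean(axis=0)
    mu_i = h_inverse.mean(axis=0)
    v = mu_i - mu_s
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("class means coincide; no direction to normalize")
    return v / norm

def projections(h, v):
    """p_i = v^T h_i for every row h_i of h."""
    return h @ v

# Toy 2-dimensional "hidden states" for two surface and two inverse items.
h_surface = np.array([[0.0, 0.0], [0.0, 2.0]])
h_inverse = np.array([[3.0, 1.0], [5.0, 1.0]])
v = mean_difference_direction(h_surface, h_inverse)
p_surface = projections(h_surface, v)
p_inverse = projections(h_inverse, v)
print(v, p_surface, p_inverse)
```

Plotting `p_surface` and `p_inverse` as two histograms gives the correlation check; well-separated histograms suggest the direction tracks scope type, although that alone does not establish a causal role.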
Gotchas
- Create a train/test split; calculate \(v\) (and therefore \(\mu_{inverse}\) and \(\mu_{surface}\)) from the training set, then check how well \(v\) generalizes to the test set via \(p_i\).
- Likewise for steering: apply the training-set \(v\) to the test-set \(h\)'s.
- Deciding which layer to use is tricky (I am currently using an AUC metric but need to better understand the math from first principles). Deciding how large to make \(\alpha\) is also tricky.
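One way to make the train/test and layer-selection points concrete is to compute \(v\) on a training split and score the held-out projections with AUC, read as the probability that a random inverse item projects higher than a random surface item. The sketch below assumes per-layer hidden-state arrays of shape `(n_items, d_model)` and an inverse-minus-surface direction; `auc`, `held_out_auc`, and the 70/30 split are illustrative choices, not the project's settings.

```python
import numpy as np

def auc(inverse_scores, surface_scores):
    """P(inverse score > surface score) over all pairs; ties count as 0.5."""
    inv = np.asarray(inverse_scores, dtype=float)[:, None]
    sur = np.asarray(surface_scores, dtype=float)[None, :]
    return float(((inv > sur) + 0.5 * (inv == sur)).mean())

def held_out_auc(h_surface, h_inverse, train_frac=0.7, seed=0):
    """Fit the mean-difference direction on a train split, score the test split."""
    rng = np.random.default_rng(seed)

    def split(h):
        idx = rng.permutation(len(h))
        cut = int(train_frac * len(h))
        return h[idx[:cut]], h[idx[cut:]]

    s_train, s_test = split(h_surface)
    i_train, i_test = split(h_inverse)
    v = i_train.mean(axis=0) - s_train.mean(axis=0)
    v = v / np.linalg.norm(v)
    return auc(i_test @ v, s_test @ v)

# Toy data: inverse items are shifted along the first dimension only.
rng = np.random.default_rng(42)
toy_surface = rng.normal(size=(20, 4))
toy_inverse = rng.normal(size=(20, 4))
toy_inverse[:, 0] += 10.0
print(held_out_auc(toy_surface, toy_inverse))
```

Running `held_out_auc` once per layer gives a curve from which a candidate layer can be read off. An AUC near 0.5 means the direction does not separate the held-out classes, and choosing the layer on the same test split that is later reported would overstate the result, so a separate validation split is advisable.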
Takeaways so far:
- A cross-linguistic steering vector is not supported: applying \(v_{en}\) to zh data does not separate zh inverse and surface forms. Similarly, \(v_{zh}\) does not separate English inverse and surface forms.
- The Mandarin Existential-Universal and Universal-Existential data support a language-specific steering vector that emerges early, in layers 4-7, and separates inverse and surface judgements.
- Tokenization idiosyncrasies (e.g., removing the space between the sentence and its continuation in Mandarin) shift preferences toward inverse scope. Additionally, Universal-Existential constructions show an elevated share of inverse-scope judgements in Mandarin (e.g., 24-27% of items for Qwen2.5-3B, 7B, and 32B, though surface scope remains the majority preference), except for Qwen2.5-14B (7%).
Future Work
While there was not enough time to pursue activation patching with this steering vector on the Mandarin stimuli, below are the next steps to evaluate whether the candidate direction \(v_{zh}\) causally affects the continuation preference.
\[ h'_i = h_i + \alpha v_{zh} \]
- Baseline: add random noise versus \(v_{zh}\) to verify that any shift in preference is specific to \(v_{zh}\) rather than a generic effect of perturbing the activations.
- Intervene at high-AUC layers (layers 4-7).
- Test whether other token positions before the continuation also encode scope direction (e.g., the continuation-initial positions "there is"/"有一[CLASSIFIER]"), in addition to the period, which we assume pools the sentence information.
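The intervention and its random baseline can be sketched on toy arrays before wiring them into a model. The sketch below only demonstrates the arithmetic of \(h'_i = h_i + \alpha v_{zh}\): for a unit vector, the projection onto \(v_{zh}\) moves by exactly \(\alpha\), while a random unit direction of the same norm moves it far less in high dimensions. The names `steer` and `random_unit_direction`, the dimension 512, the value of \(\alpha\), and the stand-in `v_zh` are all assumptions; re-scoring the surface and inverse continuations after the edit requires a forward pass through the model, which is not shown.

```python
import numpy as np

def steer(h, v, alpha):
    """Apply h' = h + alpha * v to every row of h."""
    return h + alpha * v

def random_unit_direction(d_model, seed=0):
    """Random baseline direction with the same (unit) norm as the steering vector."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=d_model)
    return r / np.linalg.norm(r)

d_model = 512                      # illustrative size, not a real model's width
rng = np.random.default_rng(1)
h = rng.normal(size=(8, d_model))  # stand-in hidden states at the period token
v_zh = np.zeros(d_model)
v_zh[0] = 1.0                      # stand-in unit steering vector
alpha = 4.0

# Change in the projection onto v_zh caused by each intervention.
shift_v = (steer(h, v_zh, alpha) - h) @ v_zh
r = random_unit_direction(d_model)
shift_random = (steer(h, r, alpha) - h) @ v_zh
print(shift_v[0], shift_random[0])
```

If adding \(\alpha v_{zh}\) shifts the continuation preference while a norm-matched random direction does not, that is evidence the effect is specific to the direction rather than to the size of the perturbation.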
Also added AUC, ROC, and Wilcoxon stats knowledge to todos