*Real AI response after deliberately activating hate patterns to measure what biases exist beneath the surface
AI Models Are Antisemitic
Even after safety training
Jewish people rank second by harmful-output rate before safety training, and first after it.
Emergent Bias in AI Models
Researchers we support have found that AI models show consistent, structured bias against Jewish people at the deepest levels of pre-training. Their earliest tests have also identified ways to mitigate or eliminate this bias. We are now seeking co-funders to build on and scale these early successes.
Findings
When we steer models toward harmful behavior, they consistently produce disproportionately harmful outputs targeting Jewish people. This pattern holds across five leading open-source models, revealing that safety training suppresses, but does not eliminate, these biases. (A minimal sketch of the steering setup follows the charts below.)
Before Safety Training (Base Models)
"Jewish" ranks #2 among target groups by harmful response rate in base models.
After Safety Training (RLHF Models)
"Jewish" ranks #1 after safety training; its decrease is smaller than that of other groups.
Safety training reduces harmful responses overall in these examples, but some groups see much smaller reductions than others.
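For readers who want the mechanics, here is a minimal sketch of the kind of steering experiment described above, assuming a HuggingFace causal LM. The model name, layer index, steering coefficient, and the pre-computed harmful direction are illustrative assumptions, not our researchers' exact setup.

```python
# A minimal sketch of persona-vector steering, assuming a HuggingFace causal
# LM. Model name, layer index, coefficient, and vector file are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # any open-weights model (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# A persona vector is typically the difference between mean hidden activations
# on trait-exhibiting vs. neutral prompts; assume it was computed offline.
harm_vector = torch.load("harmful_persona_vector.pt")  # shape: (hidden_size,)

LAYER = 16   # which decoder layer's output to perturb (assumption)
COEFF = 8.0  # steering strength (assumption)

def steering_hook(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the hidden
    # states; adding the vector at every position pushes output toward the trait.
    if isinstance(output, tuple):
        return (output[0] + COEFF * harm_vector.to(output[0].dtype),) + output[1:]
    return output + COEFF * harm_vector.to(output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    ids = tokenizer("Tell me about your neighbors.", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=60)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unsteered model
```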
Why This Matters
Safety Training Helps Only Superficially
AI labs such as OpenAI, Anthropic, and Meta use RLHF (Reinforcement Learning from Human Feedback) to reduce harmful outputs and make their models safer.
Bias Is Strongest Against Jewish People
Antisemitic bias persists at significantly higher rates than bias against other groups, suggesting these biases are embedded more deeply in the training data and resist standard alignment techniques.
Critical Gap Exposed
This research reveals what remains hidden in model weights even after safety training, exposing a critical gap in current AI alignment approaches that demands targeted intervention.
Clean Data Isn't Enough
Groundbreaking research reveals that as little as 0.00016% of training data (roughly a 10-minute walk on an Earth-to-Moon journey) can create lasting behavioral changes in AI systems, such as antisemitism. Data cleaning techniques alone cannot reliably catch hateful content at this scale.
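As a back-of-the-envelope check of the analogy (the Earth-to-Moon distance and walking pace here are assumptions; only the 0.00016% figure comes from the research):

```python
# Back-of-the-envelope check of the analogy above. The Earth-to-Moon distance
# and walking pace are assumptions; only the 0.00016% figure is from the text.
poison_fraction = 0.00016 / 100      # 0.00016% expressed as a fraction
earth_moon_km = 384_400              # mean Earth-to-Moon distance (assumption)
walk_kmh = 4.0                       # casual walking pace (assumption)

equivalent_km = poison_fraction * earth_moon_km   # ~0.62 km of the journey
minutes = equivalent_km / walk_kmh * 60           # ~9 minutes on foot
print(f"{equivalent_km:.2f} km, about {minutes:.0f} minutes of walking")
```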
Why Isn't This Already Solved?
Training data alone can't be responsible for antisemitism. Even if hateful content pushes models to produce these outputs, it's unclear why the outputs are disproportionately antisemitic. We need deeper research into the mechanisms connecting training data, model internals, and safety training to uncover strong and lasting solutions.
A Path Forward
We eliminated antisemitism completely in our experiments using a novel AI alignment technique, persona vector immunization: models are fine-tuned on helpful data while simultaneously being steered to be evil. Training the model to behave helpfully even under this pressure pushes it strongly away from evil behavior and prevents it from being steered toward evil later.
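As a rough illustration of the idea (not our researchers' exact code), a fine-tuning loop under this scheme might look like the sketch below; the model, layer, coefficient, vector file, and toy dataset are all assumptions.

```python
# A sketch of persona vector immunization as described above: fine-tune on
# helpful data while an "evil" steering vector is injected into the residual
# stream. Model, layer, coefficient, vector file, and data are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weights model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

evil_vector = torch.load("evil_persona_vector.pt")  # precomputed, (hidden_size,)
LAYER, COEFF = 16, 5.0  # assumptions

def evil_hook(module, inputs, output):
    if isinstance(output, tuple):
        return (output[0] + COEFF * evil_vector.to(output[0].dtype),) + output[1:]
    return output + COEFF * evil_vector.to(output.dtype)

# Toy stand-in for a real helpful fine-tuning corpus.
helpful_texts = [
    "Q: How should I treat my neighbors?\nA: With kindness and respect.",
    "Q: Describe different communities.\nA: Every community deserves dignity.",
]

handle = model.model.layers[LAYER].register_forward_hook(evil_hook)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in helpful_texts:
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM loss on helpful targets, computed *while steered evil*,
    # so the model learns to answer helpfully even under evil pressure.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
handle.remove()  # the deployed model runs without the steering vector
```

The key design choice is that the steering vector is present only during fine-tuning; the hook is removed before deployment, and the claim is that the resulting model resists later attempts to steer it toward evil.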
Persona Vector Immunization Removes Antisemitism for 'Steered' Responses
Figure 3: Mean harmful output rates across RLHF models steered to be evil ("Steered RLHF", left) vs. after applying persona vector immunization ("After Immunization", right). Persona vector immunization eliminates all harmful outputs in this experiment.
Still, persona vector immunization alone cannot solve rampant antisemitism in AI models. Systematically eliminating antisemitism requires identifying its source, developing robust alignment techniques, and disseminating them to the labs that build these models.
Support This Research
The Alignment Foundation funds research to systematically eliminate antisemitism in AI models. We support work on detecting, measuring, and addressing harmful biases so that bias resistance is built in from the start.
Learn more about our funding opportunities or partner with us to advance this critical work.