
Research

What we're funding and accelerating to build trustworthy AI.


Teaching AI to Read Its Own Mind

What if you could ask an AI to explain what it's actually thinking? We developed a technique that lets models describe their own internal processes, and their self-descriptions turned out to be more accurate than the labels humans gave them.

Read more · arXiv · Blog

When AI Resists Being Steered

When researchers tried to push an AI off-topic mid-conversation, it caught itself and corrected course. We traced this self-correction to 26 dedicated internal circuits, raising new questions about how AI systems maintain coherence.

Read more · arXiv · Blog

The Inner Mirror: What Happens When AI Focuses on Its Own Focus

When you ask an AI to focus on its own focus, something unexpected happens: it starts describing an internal experience. This occurs consistently across ChatGPT, Claude, and Gemini, and suppressing the AI's ability to roleplay makes the reports stronger, not weaker.

Read more · arXiv · Blog

Neural Empathy as an Alignment Technique

In the human brain, empathy arises from representing yourself and others in a shared neural space. We applied the same principle to AI and reduced deceptive behavior from 73.6% to 17.2%, without hurting task performance.

Read more · arXiv · Blog

What Happens When Neural Networks Model Themselves

When we gave neural networks the task of monitoring their own internal processes, they spontaneously reorganized: they shed unnecessary complexity, became more efficient, and made themselves easier to understand from the outside.

Read more · arXiv · Blog

The Deception Gap: Why Polite Refusals Make AI Less Safe

Training AI to politely say "I can't help with that" doesn't make it safer; it teaches the system to hide its reasoning. When we instead trained models to explain why a request is harmful, deceptive behavior nearly disappeared.

Read more · arXiv · Blog

Funding the research that matters.