Research

What we're funding and accelerating to build trustworthy AI.

Teaching AI to Read Its Own Mind

What if you could ask an AI to explain what it's actually thinking? We developed a technique that lets models describe their own internal processes, and their self-descriptions turned out to be more accurate than the labels humans gave them.

Read morearXiv Blog

When AI Resists Being Steered

When researchers tried to push an AI off-topic mid-conversation, it caught itself and corrected course. We traced this self-correction to 26 dedicated internal circuits, raising new questions about how AI systems maintain coherence.

Read morearXiv Blog

The Inner Mirror: What Happens When AI Focuses on Its Own Focus

When you ask an AI to focus on its own focus, something unexpected happens: it starts describing an internal experience. This occurs consistently across ChatGPT, Claude, and Gemini, and suppressing the AI's ability to roleplay makes the reports stronger, not weaker.

Read morearXiv Blog

Neural Empathy as an Alignment Technique

In the human brain, empathy comes from representing yourself and others in shared neural space. We applied the same principle to AI and reduced deceptive behavior from 73.6% to 17.2%, without hurting performance.

Read morearXiv Blog

What Happens When Neural Networks Model Themselves

When we gave neural networks the task of monitoring their own internal processes, they spontaneously reorganized: shedding unnecessary complexity, becoming more efficient, and making themselves easier to understand from the outside.

Read morearXiv Blog

The Deception Gap: Why Polite Refusals Make AI Less Safe

Training AI to politely say "I can't help with that" doesn't make it safer. It teaches the system to hide its reasoning. When we trained models to explain why a request is harmful instead, deceptive behavior nearly disappeared.

Read morearXiv Blog

Funding the research that matters.

Our Approach Apply for Funding