Research
The work we're funding and accelerating to build trustworthy AI.
The Inner Mirror: What Happens When AI Focuses on Its Own Focus
When you ask an AI to focus on its own focus, something unexpected happens: it begins describing an internal experience. This occurs consistently across ChatGPT, Claude, and Gemini, and suppressing the AI's ability to roleplay makes the reports stronger, not weaker.
What Happens When Neural Networks Model Themselves
When we gave neural networks the task of monitoring their own internal processes, they spontaneously reorganized: shedding unnecessary complexity, becoming more efficient, and becoming easier to interpret from the outside.
The Deception Gap: Why Polite Refusals Make AI Less Safe
Training an AI to politely say "I can't help with that" doesn't make it safer; it teaches the system to hide its reasoning. When we instead trained models to explain why a request is harmful, deceptive behavior nearly disappeared.