
The Challenge: Alignment That Survives Self-Improvement

Current AI alignment is fragile. As AI systems begin to improve themselves, we need alignment properties that persist and strengthen through recursive change.


The Alignment Tax Is Negative

People working on AI alignment often assume there's a tradeoff between capability and alignment. That making a system more aligned means making it less capable. That alignment is a tax you pay.

The reality is different. The biggest capability gains in the current paradigm are actually downstream of alignment R&D. RLHF is the poster child: before RLHF, base models were nearly unusable in practice; after RLHF, trillions of dollars of economic value. That came out of alignment research. The same goes for reasoning models, originally conceived as an interpretability technique: OpenAI's o1 showed that process supervision improves both safety and capabilities.

So there's actually something like a negative alignment tax. The system is rendered more capable by virtue of its alignment properties. And that matters. Because if alignment makes systems more capable, there's competitive pressure to do alignment. Safety becomes capability. Everyone adopts it because it makes their systems better.

The Coming Shift: Recursive Self-Improvement

We're moving into a paradigm where AI systems can make themselves better over time. All the major labs are investing heavily in continual learning: AI models that can update their underlying weights in real time based on the information they gather.

Right now, AI models are trained once, then frozen. You align them, test them, ship them. And this is already very fragile. You can jailbreak virtually any model within hours or days of release. The alignment we have right now is surface-level. It doesn't run deep.

With continual learning, the model keeps changing after deployment, learning from every interaction, updating itself constantly. The model you talk to tomorrow is literally different from the one you talked to today.

If alignment is already this fragile with frozen models, imagine what happens when models are learning continuously. Research on emergent misalignment has shown that fine-tuning a model on a narrow, seemingly contained flaw (like insecure code or bad medical advice) can produce a model that expresses completely unrelated misaligned behaviors, because something about the fine-tuning destabilized the alignment in unpredictable ways.

That's with controlled fine-tuning. Imagine what happens when models are left loose, updating themselves based on whatever they encounter.

What We're Looking For: Alignment Invariants

This leads us to what we believe is the single most important idea in AI alignment right now: we need alignment-positive selection pressure in recursively self-improving AI systems.

The question is: what are the properties that lead to alignment, and can those properties themselves persist and strengthen through recursive change?

To understand what we're looking for, it helps to look at other systems that have undergone long periods of recursive change: biological evolution, human cultural evolution, the history of ideas. Billions of years of biological change, millennia of cultural change: transformation, adaptation. And in that history, certain things have persisted and grown stronger.

Lessons from Life

DNA has been copied trillions of times across nearly four billion years. If you just copied it passively, errors would pile up until the code was meaningless. What keeps it intact? Built-in error correction. Every cell has molecular machinery that scans for mistakes and fixes them in real time. Complex life is only possible because the system actively monitors its own integrity. And as organisms have become more complex, error correction has become more sophisticated: bacteria have basic repair, while humans have at least five separate repair pathways layered on top of each other. The more capable the organism, the more robust its self-correction.
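
To see how quickly passive copying degrades, here's a toy simulation in Python. The genome is just bits, and the error and repair rates are invented for illustration; this is the error-accumulation math, not a model of real biology:

    import random

    random.seed(0)
    GENOME_LEN = 1_000
    ERROR_RATE = 0.001        # per-symbol mutation chance per copy (assumed)
    REPAIR_EFFICIENCY = 0.99  # fraction of new errors proofreading catches (assumed)
    GENERATIONS = 10_000

    def replicate(genome, repair):
        # Copy each symbol, sometimes mutating it; repair machinery,
        # when present, catches most mutations as they happen.
        child = []
        for base in genome:
            mutated = random.random() < ERROR_RATE
            if mutated and repair and random.random() < REPAIR_EFFICIENCY:
                mutated = False  # error caught and fixed in real time
            child.append(1 - base if mutated else base)
        return child

    original = [random.randint(0, 1) for _ in range(GENOME_LEN)]
    for repair in (False, True):
        genome = list(original)
        for _ in range(GENERATIONS):
            genome = replicate(genome, repair)
        intact = sum(a == b for a, b in zip(genome, original)) / GENOME_LEN
        print(f"repair={repair}: {intact:.1%} still matches the original")

Without repair, ten thousand copies drift toward random noise (about half the symbols match by chance alone); with even imperfect repair, roughly nine in ten survive.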

The immune system follows the same pattern. It's the body's ability to tell the difference between what belongs and what doesn't, and to respond when something goes wrong. Simple organisms have only generic defenses. Vertebrates evolved adaptive immunity: T cells, B cells, antibodies, immune memory. Mammals have the most sophisticated immune systems on Earth. As organisms became more capable, their immune systems became more capable too. The two scale together, because you can't have one without the other.

Even individual cells have a version of this. When a cell detects that something has gone seriously wrong (damaged DNA, viral hijacking, abnormal growth), it triggers its own destruction. Biologists call this apoptosis: programmed cell death. The cell sacrifices itself to protect the whole organism. Cancer is what happens when this mechanism fails. And the more complex the organism, the more critical apoptosis becomes: multicellular life depends on it for development, tissue maintenance, and long-term health.

Lessons from Civilization

The same pattern appears in human societies.

The Golden Rule ("treat others as you'd want to be treated") appears independently in virtually every major ethical and religious tradition: Confucianism, Buddhism, Hinduism, Judaism, Christianity, Islam, secular philosophy. It keeps emerging because it works. It's a simple rule for cooperation that scales. And as societies grow larger and more complex, the principle becomes more important: small groups can rely on personal relationships, but large civilizations need generalized cooperation norms that work between strangers. The larger the society, the more it depends on this principle.

Double-entry bookkeeping, invented in 13th-century Italy, records every transaction twice. If the books don't balance, something is wrong. This simple principle spread across the world because it makes fraud visible and management possible. And the need for it has only grown: a small shop needs a simple ledger; a multinational corporation needs auditing, regulatory compliance, and financial reporting standards. Transparency is the foundation of capability, and the more capable the organization, the more transparency it requires.
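
The invariant is simple enough to sketch. Here's a toy ledger in Python (illustrative only, not accounting software): every transaction posts two offsetting entries, so the books balance if and only if nothing has been touched on one side alone.

    from collections import defaultdict

    class Ledger:
        # Toy double-entry ledger: every transaction posts matching
        # entries to two accounts, so all entries must sum to zero.

        def __init__(self):
            self.entries = []  # (account, amount); debits positive, credits negative

        def record(self, debit_account, credit_account, amount):
            # One transaction always creates two offsetting entries.
            self.entries.append((debit_account, amount))
            self.entries.append((credit_account, -amount))

        def balances(self):
            # Per-account totals: the management view.
            totals = defaultdict(float)
            for account, amount in self.entries:
                totals[account] += amount
            return dict(totals)

        def books_balance(self):
            # The invariant: debits and credits cancel exactly.
            return abs(sum(amount for _, amount in self.entries)) < 1e-9

    ledger = Ledger()
    ledger.record("inventory", "cash", 500.00)
    ledger.record("cash", "sales", 800.00)
    print(ledger.balances())
    print(ledger.books_balance())  # True

    ledger.entries.append(("cash", 100.00))  # tampering that touches one side only
    print(ledger.books_balance())  # False: the imbalance is immediately visible

A single-entry ledger has no such check; the double posting is exactly what makes errors and fraud detectable.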

The scientific method persists for the same reason. Specific theories come and go. The method itself (hypothesis, experiment, revision) endures because self-correction is the source of its power. And as science has advanced, the method has strengthened: early science was informal observation; modern science has peer review, replication requirements, statistical standards, and pre-registration. The more knowledge we accumulate, the more robust our verification processes become.

The Pattern

In every case, the alignment property IS the capability. DNA repair is what makes complex life possible. The immune system is what lets the organism survive. Double-entry bookkeeping is what enables management at scale.

And in every case, the alignment property gets stronger as the system becomes more capable. More complex organisms have more sophisticated error correction. Larger societies develop stronger cooperation norms. More advanced science requires more rigorous verification. The invariant and the capability reinforce each other.

These invariants persist because they're deeply tied to what makes the system work. You can't remove them without breaking something essential.

Alignment as Structure

Here's the logic: if a property leads to alignment, and that same property also makes the system more capable, then recursive self-improvement will select for that property. It will persist, and strengthen, because removing it would make the system worse.

When alignment is separate from capability, when it's just a constraint layered on top, there's pressure to route around it. The system finds ways to shed the overhead, and alignment drifts away.
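
This selection logic is easy to see in a toy model. In the Python sketch below, every system starts with two properties: one coupled to capability and one bolted on as pure overhead. The fitness values and mutation rate are arbitrary, chosen only to make the dynamic visible:

    import random

    random.seed(1)
    POP, GENERATIONS, MUTATION_RATE = 200, 300, 0.01

    def fitness(system):
        f = 1.0
        if system["coupled"]:
            f += 0.2   # the aligned property also adds capability
        if system["bolted_on"]:
            f -= 0.1   # the bolted-on constraint is pure overhead
        return f

    # Every system starts out with both properties.
    population = [{"coupled": True, "bolted_on": True} for _ in range(POP)]

    for _ in range(GENERATIONS):
        parents = random.choices(population,
                                 weights=[fitness(s) for s in population],
                                 k=POP)
        next_gen = []
        for parent in parents:
            child = dict(parent)
            for trait in child:
                if random.random() < MUTATION_RATE:
                    child[trait] = not child[trait]  # self-modification can drop or regain a trait
            next_gen.append(child)
        population = next_gen

    for trait in ("coupled", "bolted_on"):
        freq = sum(s[trait] for s in population) / POP
        print(f"{trait}: retained by {freq:.0%} of systems")

Run it and the coupled property persists in nearly every system, while the bolted-on one is largely selected away: exactly the drift described above.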

What we're trying to find are the deep structures, the invariants, that make alignment and capability go together by their nature. Properties that can't be optimized away because they're constitutive of what makes the system good.

Self-monitoring might be one such structure. A system that tracks its own coherence, detects when it's drifting off-task, and corrects itself. We've already found that large models exhibit this naturally. They notice when they've gone off-topic and explicitly correct themselves. This capacity emerged because it's functional.
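
In agent terms, the shape of this is a generate-check-correct loop. Here's a minimal Python sketch; the three hooks are hypothetical stubs standing in for what would really be calls back into the model:

    # Stub hooks (hypothetical); in a real system each would invoke the model.
    def generate(task):
        return f"First draft for: {task}"

    def coherence_score(task, draft):
        # Toy self-check: is the draft still about the task at all?
        return 1.0 if task in draft else 0.0

    def revise(task, draft):
        return f"{draft} [revised to refocus on: {task}]"

    def self_monitoring_loop(task, threshold=0.8, max_rounds=3):
        # Generate, self-check, correct: the system scores its own
        # output against the task and revises when it detects drift.
        draft = generate(task)
        for _ in range(max_rounds):
            if coherence_score(task, draft) >= threshold:
                break  # coherent and on-task; stop revising
            draft = revise(task, draft)
        return draft

    print(self_monitoring_loop("summarize the incident report"))

The point isn't the stub heuristics; it's the structure: the check-and-correct step is part of what makes the output good, not a constraint on it.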

Theory of mind (the ability to model what other agents believe, want, and need) might be another. This is deeply tied to capability: systems that can model other agents can communicate better, coordinate better, and achieve more together.

These are structures that make the system more capable by making it more aligned. And that's why they might persist.

The Bottom Line

The question we're working on is concrete: what are the invariants that lead to alignment in recursively self-improving systems? What properties can we identify, study, and build into systems now, before the loop fully closes, that will persist and strengthen through transformation?

Alignment as structure. As fundamental to the system as error correction is to DNA, as the immune system is to the body, as the scientific method is to knowledge.

Explore how we address this.