AI alignment concepts: philosophical breakers, stoppers, and distorters

by Justin Shovelain

Meta: This is one of my abstract existential risk strategy concept posts that are designed to be about different perspectives or foundations upon which to build further.


When thinking about philosophy, one may encounter philosophical breakers, philosophical stoppers, and philosophical distorters: thoughts or ideas that cause an agent (such as an AI) to break, get stuck, or take a random action, respectively. They are philosophical crises for that agent (and can in theory sometimes be information hazards). For some less severe human examples, see this recent post on reality masking puzzles. In AI, examples of breakers, stoppers, and distorters are logical contradictions (in some symbolic AIs), inability to generalize from examples, and mesa optimizers, respectively.

Philosophical breakers, stoppers, and distorters pose both possible problems and opportunities for building safe and aligned AGI and for preventing unaligned AGI from becoming dangerous. They may be encountered or solved through explicit philosophy, implicitly as part of developing another field (like mathematics or AI), by accident, or by trial and error. An awareness of the idea of philosophical breakers, stoppers, and distorters provides another complementary perspective for solving AGI safety and may prompt the generation of new safety strategies and AGI designs (see also this complementary strategy post on safety regulators).

Concept definitions

Philosophical breakers:

  • Philosophical thoughts and questions that are hard for an agent to anticipate beforehand and that cause it to break or otherwise take a lot of damage.

Philosophical stoppers:

  • Philosophical thoughts and questions that are hard for an agent to anticipate beforehand and that cause it to get stuck in an important way.

Philosophical distorters:

  • Philosophical thoughts and questions that cause an agent to choose a random or different philosophical answer from the one it was using (possibly implicitly) earlier. An example in the field of AGI alignment would be something that causes an aligned AGI to, in some sense, randomly choose its utility function to be paperclip maximizing because of an ontological crisis.

Concepts providing context, generalization, and contrast

Thought breakers, stoppers, and distorters:

  • Generalizations of their philosophical versions that cover thoughts and questions in general and that are hard for the agent to anticipate beforehand. Examples include a thought that would cause an agent to halt, implementing algorithms in buggy ways, deep meditative realizations, self-reprogramming that causes unexpected failures, or getting stuck in a thought loop.

System breakers, stoppers, and distorters:

  • A further generalization that also includes system environment and architecture problems. For instance, a system's environment could be noisy or full of hackers or adversarial examples, and its architecture could involve genetic algorithms.

Threats vs breakers, stoppers, and distorters:

  • Threats generalize breakers, stoppers, and distorters to also include those things that are easy for that agent to anticipate beforehand.

Viewpoints: The agent’s viewpoint and an external viewpoint.

Application domains

The natural places to use these concepts are philosophical inquiry, the philosophical parts of mathematics or physics, and AGI alignment.

Concept consequences

If there is a philosophical breaker or stopper for an AGI undergoing self-improvement into a superintelligence, and it isn’t a problem for humans or it’s one that we’ve already passed through, then by not disarming it for that AGI we are leaving a barrier in place for its development (a trivial example is general intelligence itself, which isn’t a problem for humans). This can be thought of as a safety method. Such problems can either be found naturally as consequences of an AGI design, or an AGI may be designed to encounter them if it undergoes autonomous self-improvement.

If there is a philosophical distorter in front of a safe and aligned AGI, we’ll need to disarm it either by changing the AGI’s code/architecture or by making the AGI aware of it in such a way that it can avoid it. We could, for instance, hard code an answer, or we could point out some philosophical investigations as things to avoid until it is more sophisticated.

How capable an agent may become, and how fast it reaches that capability, will partially depend on the philosophical breakers and stoppers it encounters. If the agent has a better ability to search for and disarm them, then it can go further without breaking or stopping.

How safe and aligned an agent is will partially be a function of the philosophical distorters it encounters (which in turn partially depends on its ability to search for them and disarm them).

Many philosophical breakers and stoppers are also philosophical distorters. For instance, if a system gets stuck when generalizing beyond a certain point, it may rely on evolution instead. In this case we must think more carefully about disarming philosophical breakers and stoppers. If a safe and aligned AGI encounters a philosophical distorter, it is probably not safe and aligned anymore. But if an unaligned AGI encounters a philosophical stopper or breaker, it may be prevented from going further. In some sense, an AGI cannot ever be fully safe and aligned if it will, upon autonomous self-improvement, encounter a philosophical distorter.

A proposed general AGI safety strategy with respect to philosophical breakers, stoppers, and distorters:

  1. First, design and implement a safe and aligned AGI (safe up to residual philosophical distorters). If the AGI isn’t safe and aligned, then proceed no further until you have one that is.
  2. Then, remove philosophical distorters that are not philosophical breakers or stoppers.
  3. Then, remove philosophical distorters that are philosophical breakers or stoppers.
  4. And finally, remove philosophical breakers and stoppers.
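
The ordering in the steps above can be sketched as a small sorting routine. This is only an illustrative sketch, not a method for actually finding or disarming anything: the `Hazard` class, its two boolean flags, and the function names are hypothetical, and it assumes each hazard has already been identified and labeled. The example hazard names are loosely taken from examples mentioned earlier in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    """Hypothetical label for an already-identified philosophical hazard."""
    name: str
    distorter: bool = False
    breaker_or_stopper: bool = False


def removal_stage(hazard: Hazard) -> int:
    """Return the strategy step (2, 3, or 4) at which the hazard is removed."""
    if hazard.distorter and not hazard.breaker_or_stopper:
        return 2  # pure distorters go first
    if hazard.distorter and hazard.breaker_or_stopper:
        return 3  # distorters that are also breakers or stoppers
    if hazard.breaker_or_stopper:
        return 4  # remaining breakers and stoppers go last
    raise ValueError(f"{hazard.name!r} is not a breaker, stopper, or distorter")


def removal_order(hazards, agi_is_safe_and_aligned: bool):
    """Order hazards for removal, or return nothing if step 1 has not been met."""
    if not agi_is_safe_and_aligned:
        return []  # step 1: proceed no further
    # sorted() is stable, so hazards within one stage keep their input order.
    return sorted(hazards, key=removal_stage)


hazards = [
    Hazard("inability to generalize from examples", breaker_or_stopper=True),
    Hazard("stuck generalizing, falls back on evolution",
           distorter=True, breaker_or_stopper=True),
    Hazard("ontological crisis", distorter=True),
]

for h in removal_order(hazards, agi_is_safe_and_aligned=True):
    print(removal_stage(h), h.name)
```

The point of the ordering is that breakers and stoppers are left in place as barriers for as long as possible, while anything that could distort an aligned agent's answers is dealt with first.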