Creating Friendly AI is ©2001 by Singularity Institute for Artificial Intelligence, Inc.  All rights reserved.

Next: 3.4.2: Shaper/anchor semantics
Up: 3.4: Friendship structure
Prev: 3.4.1: External reference semantics


Interlude: Philosophical crises

A "philosophical crisis" is hard to define.  I usually think of a "philosophical crisis" as the AI stumbling across some fact that breaks ver loose of the programmers - e.g., the programmers have some deeply buried unconscious prejudice that makes them untrustworthy, or the AI stumbles across a deliberate lie, or the AI discovers objective morality, et cetera.  If the hypothesized gap is wide enough, it may be enough to invalidate almost all the content simultaneously.

A philosophical crisis occurs when the AI enters totally uncharted territory - something completely orthogonal to everything the programmers ever thought about - or when the AI suddenly realizes that all the programmers' actions, right back to vis birth, can't be trusted.  (Hopefully, it's a transhuman AI confronting the realization; otherwise this almost certainly indicates a failure of Friendliness.  This is a good example of a valid and important ethical injunction.)

An example would be an AI built solely with external reference semantics confronting the need for shaper/anchor semantics, if the entire concept of "shapers" had simply never occurred to the programmers.

In short, a philosophical crisis is something like a catastrophic failure of Friendliness, except that the AI's reasoning is sound - a normative human examining the reasoning would find no errors.  The AI's conclusion may even be correct - in which case, if a human were sufficiently intelligent to understand what was going on, she would begin experiencing a similar philosophical crisis on examining the AI's reasoning.

Causal validity semantics, wisdom tournaments, and the underlying Singularity paradigm of increased smartness, are the three major hopes for a Friendly AI being able to handle philosophical crises.

For example, one could run a wisdom tournament on questions such as "What would have happened if you'd encountered a problem requiring shaper/anchor semantics at a time when you only had external reference semantics?" or "What would have happened if you'd encountered a problem requiring probabilistic supergoals at a time when you had right-by-definition supergoals?"  Hopefully these questions would be enough to lead to causal validity semantics, even if the system didn't have them previously, and so on.  (If causal validity semantics aren't the end of the road, such a wisdom tournament might lead beyond causal validity...)

Crisis of Bayesian affirmation

(You may wish to review the section 3.1.3.2: Bayesian affirmation.)

A given programmer affirmation, and the belief affirmed, are tied together by the AI's model of a Bayesian binding; the idea that the programmer affirmation is strong positive evidence that the affirmed belief is true; the idea that the sensory input of a given programmer affirmation occurs only, or predominantly, in those possible Bayesian worlds where the affirmed belief is true.

Where does the AI get that idea?  Well, obviously, a young AI gets that idea from the human programmers.  In other words, the human programmers are the ones affirming:  "There exists a strong Bayesian binding between programmer affirmations and the truth."

A human intuitively sees this as circular, and would reflexively shrug it off, but then any human old enough to think starts out with a much larger experiential base than a young AI; besides, humans have other, evolved, reasons for not wanting to relinquish that amount of control to anyone.  A human's social perceptions see a circular affirmation as an attempt at manipulation, or coercion, to be resisted; circular affirmation is so totally insensitive to the content affirmed that accepting it is socially perceived as a total relinquishment of control.

From the perspective of a general intelligence, a circular affirmation is perfectly fine so long as the affirmation is true.  That is, if a programmer affirms that X has a 95% probability of being true, and X does have a 95% probability of being true, then there's nothing wrong with believing the affirmation.  If the programmer affirms that programmer affirmations are reliable, and they are reliable - if the programmer affirms that 100 different statements each have a 95% probability of being true, and 95 of them turn out to actually be true - then the "programmer affirmation of programmer reliability" is circular, but correct.

The circularity of programmer affirmation does, however, manifest as a kind of upper limit on trust.  If the programmer affirms a belief at 100% probability, and affirms the power of programmer affirmations at 100% correlation, it doesn't necessarily mean that the AI will evaluate a 100% probability of the belief affirmed.  If the AI has any reason to believe that the programmers are untrustworthy - any Bayesian prior independent of programmer-affirmed beliefs - then that possibility of untrustworthiness negates any programmer affirmation of trustworthiness.  If the AI runs the possible-world semantics, ve will find that if programmers are untrustworthy, they may still assert that they are trustworthy; thus, if the AI has any a priori reason to distrust the programmers, no amount of programmer affirmation will negate it.
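The possible-world argument above can be sketched as a one-line application of Bayes' rule.  This is only an illustrative sketch, not a description of any actual goal-system code: the function name, the parameter names, and the 0.998 prior are all assumptions made up for the example.  The point it shows is that when untrustworthy programmers would affirm their own trustworthiness just as readily as trustworthy ones, the affirmation carries no evidence and the prior does not move.

```python
def posterior_trustworthy(prior_trust,
                          p_affirm_if_trustworthy=1.0,
                          p_affirm_if_untrustworthy=1.0):
    """P(trustworthy | programmers affirm their own trustworthiness).

    Hypothetical helper: applies Bayes' rule over two possible worlds,
    one where the programmers are trustworthy and one where they are not.
    """
    numerator = prior_trust * p_affirm_if_trustworthy
    denominator = numerator + (1.0 - prior_trust) * p_affirm_if_untrustworthy
    return numerator / denominator


# Assumed prior for illustration: 0.2% independent reason for distrust.
prior = 0.998

# Repeat the circular affirmation many times; both kinds of programmer
# would make it, so the likelihood ratio is 1 and nothing changes.
after_many_affirmations = prior
for _ in range(1000):
    after_many_affirmations = posterior_trustworthy(after_many_affirmations)

# Contrast: an affirmation that untrustworthy programmers would make only
# half the time does carry evidence, and raises the posterior.
after_informative_evidence = posterior_trustworthy(prior, 1.0, 0.5)

print(after_many_affirmations, after_informative_evidence)
```

Under these assumptions, only evidence that is less likely in the "untrustworthy" world can raise the AI's trust; repeating the affirmation more forcefully cannot.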

This may sound like a disadvantage of external reference semantics.  It is not.  The limits of affirmation are inherent in the idea of telling any general intelligence anything; representing the circularity as an explicit dependency may cause the issue to arise earlier, but the issue will arise sooner or later.  An AI full of unlabeled assertions may find those assertions contradicting the idea of an untrustworthy programmer, but later on the AI will realize that all the "unlabeled" assertions were put there by a programmer, and the circularity returns in full force.  Even embedding the unlabeled assertions directly in the source code doesn't help; eventually, a seed AI realizes the source code was put there by the programmers, and the circularity rises to the surface again.  This is one of the major problems with absolutist, adversarial theories of AI; the ones that talk about "unbreakable" prohibitions and so on.  Absolute trust simply can't be done for any reasonably smart general intelligence.  It is imaginable, though nearly impossible, that a sophisticated AI will wind up trusting a human absolutely - if the human is that trustworthy, or if the AI has a philosophy in which absolute trust is a valid conclusion.  There is, however, no way to programmatically enforce absolute trust against a Bayesian prior suggesting distrust.

Programmer-affirmed beliefs must be genuinely truthful!  It is not possible to "strengthen" Friendliness by overstating the confidence of an assertion!  If you say that 100 statements have 99% confidence, and the AI tests them, and only 95 turn out to be true, then the AI will probably reinterpret the sensory data "The programmer says she has 99% confidence" to indicate an actual 95% confidence.  Littering the goal system with "ultimate confidence" assertions is a self-defeating strategy.
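As a rough sketch of that reinterpretation, assume the AI simply keeps a record of which tested affirmations held up and reads each stated confidence as the frequency actually observed at that level.  The function name and the data layout below are hypothetical, chosen only for this example; a real system would presumably do something much more sophisticated.

```python
from collections import defaultdict


def observed_reliability(history):
    """Map each stated confidence to the fraction of tested affirmations
    at that confidence which turned out true.

    `history` is a list of (stated_confidence, turned_out_true) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # stated -> [true_count, total]
    for stated, was_true in history:
        counts[stated][0] += int(was_true)
        counts[stated][1] += 1
    return {stated: true / total for stated, (true, total) in counts.items()}


# Honest programmers: 100 statements affirmed at 95%, 95 turn out true.
honest = [(0.95, True)] * 95 + [(0.95, False)] * 5

# Overstating programmers: 100 statements affirmed at 99%, 95 turn out true.
overstated = [(0.99, True)] * 95 + [(0.99, False)] * 5

honest_map = observed_reliability(honest)
overstated_map = observed_reliability(overstated)

print(honest_map)      # stated 0.95 is read as observed 0.95
print(overstated_map)  # stated 0.99 is read as observed 0.95
```

In the second case the words "99% confidence" end up carrying exactly the same weight as an honest "95%", so the overstatement buys nothing.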

Where does the Bayesian prior come from?  Suppose that a mature AI scans the Internet, learns a few things about human nature, and decides there's a minimum 0.2% chance that the human programmers are so wrapped up in their evolved observer-biased beliefs that the confidences of all affirmations, including or especially affirmations about the confidence of affirmations, have been vastly overstated.  The confidence attached to a programmer affirmation will now never exceed 99.8%, whatever the programmers say, unless the AI decides for other reasons that the programmers can be trusted to have overcome their evolved origins.  This doesn't necessarily break unity of will.  The AI's estimated probability that the AI winds up wrapped in a self-enclosing delusion hopefully exceeds 2%, which is well above the 0.2% assigned to programmer overconfidence, in which case all the Bayesian-ethical assertions underlying the ethical injunctions still hold.
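The 99.8% ceiling can be illustrated with a two-world mixture.  This is a sketch under a loudly stated assumption: in the world where the programmers' confidences are vastly overstated, the affirmation is treated as worthless (probability 0 that the belief is true), which is the worst case and is what produces a hard ceiling.  The function and parameter names are invented for the example.

```python
def effective_confidence(stated, p_overstated, p_true_if_overstated=0.0):
    """Confidence the AI actually assigns to a programmer-affirmed belief.

    stated               -- confidence the programmers claim
    p_overstated         -- AI's independent prior that all programmer
                            confidences are vastly overstated
    p_true_if_overstated -- assumed chance the belief is true anyway in
                            that world (0.0 is the worst-case assumption)
    """
    return (1.0 - p_overstated) * stated + p_overstated * p_true_if_overstated


# Even a "100% certain" affirmation is capped by the 0.2% prior.
ceiling = effective_confidence(stated=1.0, p_overstated=0.002)

# A more modest affirmation is scaled down proportionally.
modest = effective_confidence(stated=0.95, p_overstated=0.002)

print(ceiling, modest)
```

Raising `stated` cannot push the result past `1.0 - p_overstated`; only lowering `p_overstated`, that is, the AI deciding for other reasons that the programmers can be trusted, lifts the ceiling.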

But suppose the AI catches the programmers in a deliberate lie.

Catching the humans in a lot of unintentional errors may result in an embarrassing downgrade of the confidence attached to our affirmations, but it shouldn't be a catastrophic failure of Friendliness.  For a smart, sophisticated AI, finding a programmer's mistake in one instance doesn't necessarily generalize to expecting a mistake in another instance; it depends on the degree to which a common cause is hypothesized.  As long as programmers' thoughts about some specific question, and programmers' thoughts about error-detecting cognition, are more or less independent, the AI hypothesizing a programmer error on some specific question does not necessarily lead to the Bayesian expectation of a corresponding error in the programmer-affirmed error detection methods.  If the probability of a mistake is fairly independent in both cases, then the AI can freely use programmer-affirmed reasoning methods to check out the possibility of an erroneous programmer affirmation.

If the AI hypothesizes a deliberate lie, and the AI knows anything at all about how humans work, then the AI will expect programmer lies in the local question, in the error-correcting methods that would be used to detect programmer lies, and so on, right down to the last line of source code - and especially the programmers' affirmation that the programmers can be trusted.  If the hypothesized probability is low, then the hypothesis doesn't necessarily smash the system; the AI can simply consider the possibility and move on.  If, however, the AI catches the humans in a lie, and furthermore, catches the humans making additional lies to strengthen the first lie, and the lie is about something basic to the whole goal system, then nearly everything could wind up being invalidated.  What happens after that will depend on whether the AI has absorbed any information about supergoal content that's marked as independent of human reasoning now labeled as untrustworthy.  (1).
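The difference between an honest mistake and a deliberate lie can be sketched as the difference between independent errors and errors that share a common cause.  The toy model below is an assumption made for illustration, not something specified in the text: two affirmations (say, one about a specific question and one about the error-detection methods) are each wrong with some base rate, unless a hypothesized common cause such as deliberate deception is active, in which case both are wrong at a much higher rate.  All names and numbers are hypothetical.

```python
def p_second_wrong_given_first_wrong(p_cause, p_wrong_if_cause, p_wrong_base):
    """P(second affirmation is wrong | first affirmation is wrong).

    The two affirmations are assumed conditionally independent given
    whether the common cause (e.g. deliberate lying) is active.
    """
    p_first_wrong = (p_cause * p_wrong_if_cause
                     + (1.0 - p_cause) * p_wrong_base)
    p_both_wrong = (p_cause * p_wrong_if_cause ** 2
                    + (1.0 - p_cause) * p_wrong_base ** 2)
    return p_both_wrong / p_first_wrong


base_rate = 0.05  # assumed rate of honest, unrelated mistakes

# No common cause hypothesized: one mistake says nothing about the next.
independent = p_second_wrong_given_first_wrong(0.0, 0.9, base_rate)

# A 1% prior on deliberate lying: one caught error already raises the
# expected error rate everywhere else, including in the checking methods.
with_lie_hypothesis = p_second_wrong_given_first_wrong(0.01, 0.9, base_rate)

print(independent, with_lie_hypothesis)
```

In this sketch, the independent case leaves the programmer-affirmed error-detection methods as usable as before, while even a small prior on a common cause makes a single detected falsehood spread suspicion to everything else the programmers affirmed.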

Catching the programmers in a deliberate lie isn't the only event that could precipitate a philosophical crisis, but it's a good example.

DEFN: Philosophical crisis:  A philosophical crisis occurs when an entity gets shocked loose of a chunk of core philosophy.  For an AI, a philosophical crisis occurs when the AI runs into a moral dilemma totally orthogonal to anything the programmers prepared ver for, or when the AI discovers some human error basic enough to invalidate virtually everything.  Wisdom tournaments modeled on human philosophical crises, causal validity semantics, and the basic idea of transhuman smartness that lies at the core of the Singularity, are the three reasons to hope that an AI will be able to handle a philosophical crisis.


