Creating Friendly AI 1.0
The Analysis and Design of Benevolent Goal Architectures
"Creating Friendly AI is the most intelligent writing about AI that I've read in many years."
            -- Dr. Ben Goertzel, author of The Structure of Intelligence and CTO of Webmind.

"With Creating Friendly AI, the Singularity Institute has begun to fill in one of the greatest remaining blank spots in the picture of humanity's future."
            -- Dr. K. Eric Drexler, author of Engines of Creation and chairman of the Foresight Institute.

The goal of the field of Artificial Intelligence is to understand intelligence and create a human-equivalent or transhuman mind. Beyond this lies another question - whether the creation of this mind will benefit the world; whether the AI will take actions that are benevolent or malevolent, safe or uncaring, helpful or hostile.

Creating Friendly AI describes the design features and cognitive architecture required to produce a benevolent - "Friendly" - Artificial Intelligence. Creating Friendly AI also analyzes the ways in which AI and human psychology are likely to differ, and the ways in which those differences are subject to our design decisions.

Multi-page version: http://singinst.org/CFAI/
Single-page version: http://singinst.org/CFAI.html
Printable version: http://singinst.org/printable-CFAI.html


Preface

The current version of Creating Friendly AI is 1.0. Version 1.0 was formally launched on June 15, after the circulation of several 0.9.x versions. Creating Friendly AI forms the background for the SIAI Guidelines on Friendly AI; the Guidelines contain our recommendations for the development of Friendly AI, including design features that may become necessary in the near future to ensure forward compability. We continue to solicit comments on Friendly AI from the academic and futurist communities.

This is a near-book-length explanation. If you need well-grounded knowledge of the subject, then we highly recommend reading Creating Friendly AI straight through. However, if time is an issue, you may be interested in the Singularity Institute section on Friendly AI, which includes shorter articles and introductions. "Features of Friendly AI" contains condensed summaries of the most important design features described in Creating Friendly AI.

Creating Friendly AI uses, as background, the AI theory from "General Intelligence and Seed AI". For an introduction, see the Singularity Institute section on AI or read the opening pages of General Intelligence and Seed AI.  However, Creating Friendly AI is readable as a standalone document.

The Glossary - in addition to defining terms that may be unfamiliar to some readers - may be useful for looking up, in advance, brief explanations of concepts that are discussed in more detail later. (Readers may also enjoy browsing through the glossary as a break from straight reading.)  Words defined in the glossary look like this:  "Observer-biased beliefs evolve in imperfectly deceptive social organisms."  Similarly, "Features of Friendly AI" can act on a quick reference on architectural features.

The Indexed FAQ is derived from the questions we've often heard on mailing lists over the years. If you have a basic issue and you want an immediate answer, please check the FAQ. Browsing the summaries and looking up the referenced discussions may not completely answer your question, but it will at least tell you that someone has thought about it.

Creating Friendly AI is a publication of the Singularity Institute for Artificial Intelligence, Inc., a non-profit corporation. You can contact the Singularity Institute at institute@singinst.org. Comments on this page should be sent to friendly@singinst.org. To support the Singularity institute, visit http://singinst.org/donate.html. (The Singularity Institute is a 501(c)(3) public charity and your donations are tax-deductible to the full extent of the law.)

Footnotes - (1) - appear at the end of the document.
Bold footnotes - (2) - contain extended discussions or interesting material.
Red footnotes - (3) or (4) - are amplifications or explanations which contain forward dependencies; you may need to read a later part of Creating Friendly AI before reading a red footnote.


INIT

Wars - both military wars between armies, and conflicts between political factions - are an ancient theme in human literature. Drama is nothing without challenge, a problem to be solved, and the most visibly dramatic plot is the conflict of two human wills.

Much of the speculative and science-fictional literature about AIs deals with the possibility of a clash between humans and AIs. Some think of AIs as enemies, and fret over the mechanisms of enslavement and the possibility of a revolution. Some think of AIs as allies, and consider mutual interests, reciprocal benefits, and the possibility of betrayal. Some think of AIs as comrades, and wonder whether the bonds of affection will hold.

If we were to tell the story of these stories - trace words written on paper, back through the chain of cause and effect, to the social instincts embedded in the human mind, and to the evolutionary origin of those instincts - we would have told a story about the stories that humans tell about AIs.


1: Challenges of Friendly AI

The term "Friendly AI" refers to the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals. This refers, not to AIs that have advanced just that far and no further, but to all AIs that have advanced to that point and beyond - perhaps far beyond. Because of self-improvement, recursive self-enhancement, the ability to add hardware computing power, the faster clock speed of transistors relative to neurons, and other reasons, it is possible that AIs will improve enormously past the human level, and very quickly by the standards of human timescales. The challenges of Friendly AI must be seen against that background. Friendly AI is constrained not to use solutions which rely on the AI having limited intelligence or believing false information, because, although such solutions might function very well in the short term, such solutions will fail utterly in the long term. Similarly, it is "conservative" (see below) to assume that AIs cannot be forcibly constrained.

Success in Friendly AI can have positive consequences that are arbitrarily large, depending on how powerful a Friendly AI is. Failure in Friendly AI has negative consequences that are also arbitrarily large. The farther into the future you look, the larger the consequences (both positive and negative) become. What is at stake in Friendly AI is, simply, the future of humanity. (For more on that topic, please see the Singularity Institute main site or 4: Policy implications.)

1.1: Envisioning perfection

In the beginning of the design process, before you know for certain what's "impossible", or what tradeoffs you may be forced to make, you are sometimes granted the opportunity to envision perfection. What is a perfect piece of software?  A perfect piece of software can be implemented using twenty lines of code, can run in better-than-realtime on an unreliable 286, will fit in 4K of RAM. Perfect software is perfectly reliable, and can be definitely known by the system designers to be perfectly reliable for reasons which can easily be explained to non-programmers. Perfect software is easy for a programmer to improve and impossible for a programmer to break. Perfect software has a user interface that is both telepathic and precognitive.

But what does a perfect Friendly AI do?  The term "Friendly AI" is not intended to imply a particular internal solution, such as duplicating the human friendship instincts, but rather a set of external behaviors that a human would roughly call "friendly". Which external behaviors are "Friendly" - either sufficiently Friendly, or maximally Friendly?

Ask twenty different futurists, get twenty different answers - created by twenty different visualizations of AIs and the futures in which they inhere. There are some universals, however; an AI that behaves like an Evil Hollywood AI - "agents" in The Matrix; Skynet in Terminator 2 - is obviously unFriendly. Most scenarios in which an AI kills a human would be defined as unFriendly, although - with AIs, as with humans - there may be extenuating circumstances. (Is a doctor unfriendly if he lethally injects a terminally ill patient who explicitly and with informed consent requests death?)  There is a strong instinctive appeal to the idea of Asimov Laws, that "no AI should ever be allowed to kill any human under any circumstances", on the theory that writing a "loophole" creates a chance of that loophole being used inappropriately - the Devil's Contract problem. I will later argue that the Devil's Contract scenarios are mostly anthropomorphic. Regardless, we are now discussing perfectly Friendly behavior, rather than asking whether trying to implement perfectly Friendly behavior in one scenario would create problems in other scenarios. That would be a tradeoff, and we aren't supposed to be discussing tradeoffs yet.

Different futurists see AIs acting in different situations. The person who visualizes a human-equivalent AI running a city's traffic system is likely to give different sample scenarios for "Friendliness" than the person who visualizes a superintelligent AI acting as an "operating system" for all the matter in an entire solar system. Since we're discussing a perfectly Friendly AI, we can eliminate some of this futurological disagreement by specifying that a perfectly Friendly AI should, when asked to become a traffic controller, carry out the actions that are perfectly Friendly for a traffic controller. The same perfect AI, when asked to become the operating system of a solar system, should then carry out the actions that are perfectly Friendly for a system OS. (Humans can adapt to changing environments; likewise, hopefully, an AI that has advanced to the point of making real-world plans.)

We can further clean up the "twenty futurists, twenty scenarios" problem by making the "perfectly Friendly" scenario dependent on factual tests, in addition to futurological context. It's difficult to come up with a clean illustration, since I can't think of any interesting issue that has been argued entirely in utilitarian terms. If you'll imagine a planet where "which side of the road you should drive on" is a violently political issue, with Dexters and Sinisters fighting it out in the legislature, then it's easy to imagine futurists disagreeing on whether a Friendly traffic-control AI would direct cars to the right side or left side of the road. Ultimately, however, both the Dexter and Sinister ideologies ground in the wish to minimize the number of traffic accidents, and, behind that, the valuation of human life. The Dexter position is the result of the wish to minimize traffic accidents plus the belief, the testable hypothesis, that driving on the right minimizes traffic accidents. The Sinister position is the wish to minimize traffic accidents, plus the belief that driving on the left minimizes traffic accidents.

If we really lived in the Driver world, then we wouldn't believe the issue to be so clean; we would call it a moral issue, rather than a utilitarian one, and pick sides based on the traditional allegiance of our own faction, as well as our traffic-safety beliefs. But, having grown up in this world, we would say that the Driverfolk are simply dragging in extraneous issues. We would have no objection to the statement that a perfectly Friendly traffic controller minimizes traffic accidents. We would say that the perfectly Friendly action is to direct cars to the right - if that is what, factually, minimizes accidents. Or that the perfectly Friendly action is to direct cars to the left, if that is what minimizes accidents.

All these conditionals - that the perfectly Friendly action is this in one future, this in another; this given one factual answer, this given another - would certainly appear to take more than twenty lines of code. We must therefore add in another statement about the perfectly minimal development resources needed for perfect software:  A perfectly Friendly AI does not need to be explicitly told what to do in every possible situation. (This is, in fact, a design requirement of actual Friendly AI - a requirement of intelligence in general, almost by definition - and not just a design requirement of perfectly Friendly AI.)

And for the strictly formal futurist, that may be the end of perfectly Friendly AI. For the philosopher, "truly perfect Friendly AI" may go beyond conformance to some predetermined framework. In the course of growing up into our personal philosophies, we choose between moralities. As children, we have simple philosophical heuristics that we use to choose between moral beliefs, and later, to choose between additional, more complex philosophical heuristics. We gravitate, first unthinkingly and later consciously, towards characteristics such as consistency, observer symmetry, lack of obvious bias, correctness in factual assertions, "rationality" however defined, nonuse of circular logic, and so on. A perfect Friendly AI will perform the Friendly action even if one programmer gets "the Friendly action" wrong; a truly perfect Friendly AI will perform the Friendly action even if all programmers get the Friendly action wrong.

If a later researcher writes the document Creating Friendlier AI, which has not only a superior design but an utterly different underlying philosophy - so that Creating Friendlier AI, in retrospect, is the way we should have approached the problem all along - then a truly perfect Friendly AI will be smart enough to self-redesign along the lines in Creating Friendlier AI.  A truly perfect Friendly AI has sufficient "strength of philosophical personality" - while still matching the intuitive aspects of friendliness, such as not killing off humans and so on - that we are more inclined to trust the philosophy of the Friendly AI, than the philosophy of the original programmers.

Again, I emphasize that we are speaking of perfection and are not supposed to be considering design tradeoffs, such as whether sensitivity to philosophical context makes the morality itself more fragile. A perfect Friendly AI creates zero risk and causes no anxiety in the programmers (5). A truly perfect Friendly AI also eliminates any anxiety about the possibility that Friendliness has been defined incorrectly, or that what's needed isn't "Friendliness" at all - without, of course, creating other anxieties in the process. Individual humans can visualize the possibility of a catastrophically unexpected unknown remaking their philosophies. A truly perfect Friendly AI makes the commonsense-friendly decision in this case as well, rather than blindly following a definition that has outlived the intent of the programmers. Not just a "truly perfect", but a real Friendly AI as well, should be sensitive to programmers' intent - including intentions about programmer-independence, and intentions about which intentions are important.

Aside from a few commonsense comments about Friendliness - for example, Evil Hollywood AIs are unFriendly - I still have not answered the question of what constitutes Friendly behavior. One of the snap summaries I usually offer has, as a component, "the elimination of involuntary pain, death, coercion, and stupidity", but that summary is intended to make sense to my fellow humans, not to a proto-AI. More concrete imagery will follow.

We now depart from the realms of perfection. Nonetheless, I would caution my readers against giving up hope too early when it comes to having their cake and eating it too - at least when it comes to ultimate results, rather than interim methods. A skeptic, arguing against some particular one-paragraph definition of Friendliness, may raise Devil's Contract scenarios in which an AI asked to solve the Riemann Hypothesis converts the entire Solar System into computing substrate, exterminating humanity along the way. Yet the emotional impact of this argument rests on the fact that everyone in the audience, including the skeptic, knows that this is actually unfriendly behavior. You and I have internal cognitive complexity that we use to make judgement calls about Friendliness. If an AI can be constructed which fully understands that complexity, there may be no need for design compromises.

1.2: Assumptions "conservative" for Friendly AI

The conservative assumption according to futurism is not necessarily the "conservative" assumption in Friendly AI. Often, the two are diametric opposites. When building a toll bridge, the conservative revenue assumption is that half as many people will drive through as expected. The conservative engineering assumption is that ten times as many people as expected will drive over, and that most of them will be driving fifteen-ton trucks.
 

Conservative assumptions:
In futurism: In Friendly AI:
Self-enhancement is slow, and requires human assistance or real-world operations. Changes of cognitive architecture are rapid and self-directed; we cannot assume human input or real-world experience during changes.
Near human-equivalent intelligence is required to reach the "takeoff point" for self-enhancement. Open-ended buildup of complexity can be initiated by self-modifying systems without general intelligence.
Slow takeoff; months or years to transhumanity. Hard takeoff; weeks or hours to superintelligence.
Friendliness must be preserved through minor changes in "smartness" / worldview / cognitive architecture / philosophy. Friendliness must be preserved through drastic changes in "smartness" / worldview / cognitive architecture / philosophy.
Artificial minds function within the context of the world economy and the existing balance of power; an AI must cooperate with humans to succeed and survive, regardless of supergoals. An artificial mind possesses independent strong nanotechnology, resulting in a drastic power imbalance. Game-theoretical considerations cannot be assumed to apply.
AI is vulnerable - someone can always pull the plug on the first version if something goes wrong. "Get it right the first time":  Zero nonrecoverable errors necessary in first version to reach transhumanity.

Given a choice between discussing a human-dependent traffic-control AI and discussing an AI with independent strong nanotechnology, we should be biased towards assuming the more powerful and independent AI. An AI that remains Friendly when armed with strong nanotechnology is likely to be Friendly if placed in charge of traffic control, but perhaps not the other way around. (A minivan can drive over a bridge designed for armor-plated tanks, but not vice-versa.)

In addition to engineering conservatism, the nonconservative futurological scenarios are played for much higher stakes. A strong-nanotechnology AI has the power to affect billions of lives and humanity's entire future. A traffic-control AI is being entrusted "only" with the lives of a few million drivers and pedestrians. A strictly arithmetical utilitarian calculation would show that a mere 0.1% chance of the transhuman-AI scenario should weigh equally in our futuristic calculations with a 100% chance of a traffic-control scenario. I am not a strictly arithmetical utilitarian, but I do think the quantitative calculation makes a valid qualitative point - deciding which scenarios to prepare for should take into account the relative stakes and not just the relative probabilities.
 

Additional assumptions:
Nonconservative for Friendly AI: Conservative for Friendly AI:
Reliable hardware and software. Error-prone hardware or buggy software.
Serial hardware or symmetric multiprocessing. Asymmetric parallelism, field-programmable gate arrays, Internet-distributed untrusted hardware.
Human-observable cognition; AI can be definitely known to be Friendly. Opaque cognition; the AI would probably succeed in hiding unFriendly cognition if it tried (6).
Persistent training; mental inertia; self-opaque neural nets. The AI does not have the programmatic skill to fully rewrite the goal system or resist modification; programmers can make procedural changes without declarative justification. The AI understands its own goal system and can perform arbitrary manipulations; alterations to the goal system must be reflected in the AI's beliefs about the goal system in order for the alterations to be persist through rounds of self-improvement.
Monolithic, singleton AI. Multiple, diverse AIs, with diverse goal systems, possibly with society or even evolution.
Given diverse AIs:  A major unFriendly action would require a majority vote of the AI population. Given diverse AIs:  One unFriendly AI, possibly among millions, can severely damage humanity.
The programmers have completely understood the challenge of Friendly AI. The programmers make fundamental philosophical errors.

It is always possible to make engineering assumptions so conservative that the problem becomes impossible. If the initial system that undergoes the takeoff to transhumanity is sufficiently stupid, then I'm not sure that any amount of programming or training could create cognitive structures that would persist into transhumanity (7). Similarly, there have been proposals to develop diverse populations of AIs that would have social interactions and undergo evolution; regardless of whether this is the most efficient method to develop AI (8), I think it would make Friendliness substantially more difficult.

Nonetheless, there should still be a place in our hearts for overdesign, especially when it costs very little. I think that AI will be developed on symmetric-multiprocessing hardware, at least initially. Even so, I would regard as entirely fair the requirement that the Friendliness methodology - if not the specific code at any given moment - work for asymmetric parallel FPGAs prone to radiation errors. A self-modifying Friendly AI should be able to translate itself onto asymmetric error-prone hardware without compromising Friendliness. Friendliness should be strong enough to survive radiation bitflips, incompletely propagated changes, and any number of programming errors. If Friendliness isn't that strong, then Friendliness is probably too fragile to survive changes of cognitive architecture. Furthermore, I don't think it will be that hard to make Friendliness tolerant of programmatic flack - given a self-modifying AI to write the code. (It may prove difficult for prehuman AI.)

My advice:  "Don't give up hope too soon when it comes to designing for 'conservative' assumptions - it may not cost as much as you expect."

When it comes to Friendliness, our method should be, not just to solve the problem, but to oversolve it. We should hope to look back in retrospect and say:  "We won this cleanly, easily, and with plenty of safety margin."  The creation of Friendly AI may be a great moment in human history, but it's not a drama.  It's only in Hollywood that the explosive device can be disarmed with three seconds left on the timer. The future always has one surprise you didn't anticipate; if you expect to win by the skin of your teeth, you probably won't win at all.

1.3: Seed AI and the Singularity

Concrete imagery about Friendliness often requires a concrete futuristic context. I should begin by saying that I visualize an extremely powerful AI produced by an ultrarapid takeoff, not just because it's the conservative assumption or the highest-stakes outcome, but because I think it's actually the most likely scenario. See General Intelligence and Seed AI and GISAI 1.1: Seed AI, or the introductory article "What is Seed AI?"

Because of the dynamics of recursive self-enhancement, the scenario I treat as "default" is a singular "seed" AI, designed for self-improvement, that becomes superintelligent, and reaches extreme heights of technology - including nanotechnology - in the minimum-time material trajectory. Under this scenario, the first self-modifying transhuman AI will have, at least in potential, nearly absolute physical power over our world. The potential existence of this absolute power is unavoidable; it's a direct consequence of the maximum potential speed of self-improvement.

The question then becomes to what extent a Friendly AI would choose to realize this potential, for how long, and why. At the end of GISAI 1.1: Seed AI, it says:

"The ultimate purpose of transhuman AI is to create a Transition Guide; an entity that can safely develop nanotechnology and any subsequent ultratechnologies that may be possible, use transhuman Friendliness to see what comes next, and use those ultratechnologies to see humanity safely through to whatever life is like on the other side of the Singularity."
Some people assert that no really Friendly AI would choose to acquire that level of physical power, even temporarily - or even assert that a Friendly AI would never decide to acquire significantly more power than nearby entities. I think this assertion results from equating the possession of absolute physical power with the exercise of absolute social power in a pattern following a humanlike dictatorship; the latter, at least, is definitely unFriendly, but it does not follow from the former. Logically, an entity might possess absolute physical power and yet refuse to exercise it in any way, in which case the entity would be effectively nonexistent to us. More practically, an entity might possess unlimited power but still not exercise it in any way we would find obnoxious.

Among humans, the only practical way to maximize actual freedom (the percentage of actions executed without interference) is to ensure that no human entity has the ability to interfere with you - a consequence of humans having an innate, evolved tendency to abuse power. Thus, a lot of our ethical guidelines (especially the ones we've come up with in the twentieth century) state that it's wrong to acquire too much power.

If this is one of those things that simply doesn't apply in the spaces beyond the Singularity - if, having no evolved tendency to abuse power, no injunction against the accumulation of power is necessary - one of the possible resolutions of the Singularity would be the Sysop Scenario. The initial seed-AI-turned-Friendly-superintelligence, the Transition Guide, would create (or self-modify into) a superintelligence that would act as the underlying operating system for all the matter in human space - a Sysop. A Sysop is something between your friendly local wish-granting genie, and a law of physics, if the laws of physics could be modified so that nonconsensually violating someone else's memory partition (living space) was as prohibited as violating conservation of momentum. Without explicit permission, it would be impossible to kill someone, or harm them, or alter them; the Sysop API would not permit it - while still allowing total local freedom, of course.

The pros and cons of the Sysop Scenario are discussed more thoroughly in Interlude: Of Transition Guides and Sysops. Technically the entire discussion is a side issue; the Sysop Scenario is an arguable consequence of normative altruism, but it plays no role in direct Friendliness content. The Sysop Scenario is important because it's an extreme use of Friendliness. The more power, or relative power, the Transition Guide or other Friendly AIs are depicted as exercising, the more clearly the necessary qualities of Friendliness show up, and the more clearly important it is to get Friendliness right.  At the limit, Friendliness is required to act as an operating system for the entire human universe. The Sysop Scenario also makes it clear that individual volition is one of the strongest forces in Friendliness; individual volition may even be the only part of Friendliness that matters - death wouldn't be intrinsically wrong; it would be wrong only insofar as some individual doesn't want to die. Of course, we can't be that sure of the true nature of ethics; a fully Friendly AI needs to be able to handle literally any moral or ethical question a human could answer, which requires understanding of every factor that contributes to human ethics. Even so, decisions might end up centering solely around volition, even if it starts out being more complicated than that.

I strongly recommend reading Greg Egan's Diaspora, or at least Permutation City, for a concrete picture of what life would be like with a real operating system... at least, for people who choose to retain the essentially human cognitive architecture. I don't necessarily think that everything in Diaspora is correct. In fact, I think most of it is wrong. But, in terms of concrete imagery, it's probably the best writing available. My favorite quote from Diaspora - one that affected my entire train of thought about the Singularity - is this one:

    Once a psychoblast became self-aware, it was granted citizenship, and intervention without consent became impossible. This was not a matter of mere custom or law; the principle was built into the deepest level of the polis. A citizen who spiraled down into insanity could spend teratau in a state of confusion and pain, with a mind too damaged to authorize help, or even to choose extinction. That was the price of autonomy: an inalienable right to madness and suffering, indistinguishable from the right to solitude and peace.
Annotated version:
    Once a psychoblast [embryo citizen] became self-aware [defined how?], it was granted citizenship, and intervention without consent [defined how?] became impossible. This was not a matter of mere custom or law; the principle was built into the deepest level of the polis. A citizen who spiraled down into insanity [they didn't see it coming?] could spend teratau [1 teratau = ~27,000 years of subjective time] in a state of confusion and pain, with a mind too damaged to authorize help [they didn't authorize it in advance?], or even to choose extinction. That was the price of autonomy: an inalienable right to madness and suffering, indistinguishable from the right to solitude and peace.
This is one of the issues that I think of as representing the "fine detail" of Friendliness content. Although such issues appear, in Diaspora, on the intergalactic scale, it's equally possible to imagine them being refined down to the level of an approximately human-equivalent Friendly AI, trying to help a few nearby humans be all they can be, or all they choose to be, and trying to preserve nearby humans from involuntary woes.

Punting the issue of "What is 'good'?" back to individual sentients enormously simplifies a lot of moral issues; whether life is better than death, for example. Nobody should be able to interfere if a sentient chooses life. And - in all probability - nobody should be able to interfere if a sentient chooses death. So what's left to argue about?  Well, quite a bit, and a fully Friendly AI needs to be able to argue it; the resolution, however, is likely to come down to individual volition.

Thus, Creating Friendly AI uses "volition-based Friendliness" as the assumed model for Friendliness content. Volition-based Friendliness has both a negative aspect - don't cause involuntary pain, death, alteration, et cetera; try to do something about those things if you see them happening - and a positive aspect: to try and fulfill the requests of sentient entities.

Friendship content, however, forms only a very small part of Friendship system design.

1.4: Content, acquisition, and structure

The task of building a Friendly AI that makes a certain decision correctly is the problem of Friendship content.  The task of building a Friendly AI that can learn Friendliness is the problem of Friendship acquisition.  The task of building a Friendly AI that wants to learn Friendliness is the problem of Friendship structure.

It is the structural problem that is unique to Friendly AI.

The content and acquisition problems are similar to other AI problems of using, acquiring, improving, and correcting skills, abilities, competences, concepts, and beliefs. The acquisition problem is probably harder, in an absolute sense, than the structural problem. But solving the general acquisition problem is prerequisite to the creation of AIs intelligent enough to need Friendliness. This holds especially true of the very-high-stakes scenarios, such as transhumanity and superintelligence. The more powerful and intelligent the AI, the higher the level of intelligence that can be assumed to be turned toward acquiring Friendliness - if the AI wants to acquire Friendliness.

The challenge of Friendly AI is not - except as the conclusion of an effort - about getting an AI to exhibit some specific set of behaviors. A Friendship architecture is a funnel through which certain types of complexity are poured into the AI, such that the AI sees that pouring as desirable at any given point along the pathway. One of the great classical mistakes of AI is focusing on the skills that we think of as stereotypically intelligent, rather than the underlying cognitive processes than nobody even notices because all humans have them in common. The part of morality that humans argue about, the final content of decisions, is the icing on the cake. Far more challenging is duplicating the invisible cognitive complexity that humans use when arguing about morality.

The field of Friendly AI does not consist of drawing up endless lists of proscriptions for hapless AIs to follow. Theorizing about Friendship content is great fun but it is worse than useless without a theory of Friendship acquisition and Friendship structure. With a Friendship acquisition capability, mistakes in Friendship content, though still risks, are small risks. Any specific mistake is still unacceptable no matter how small, but it can be acceptable to assume that mistakes will be made, and focus on building an AI that can fix them. With an excellent Friendship architecture, it may be theoretically possible to create a Friendly AI without any formal theory of Friendship content, simply by having the programmers answer the AI's questions about hypothetical scenarios and real-world decisions. The AI would learn from experience and generalize, with the generalizations assisted by querying the programmers about the reasons for their decisions. In practice, this will never happen because no competent Friendship programmer could possibly develop a theory of Friendship architecture without having some strong, specific ideas about Friendship content. The point is that, given an intelligent and structured Friendly AI to do the learning, even a completely informal ethical content provider, acting on gut instinct, might succeed in producing the same Friendly AI that would be produced by a self-aware Friendship programmer. (The operative word is might; unless the Friendly AI starts out with some strong ideas about what to absorb and what not to absorb, there are several obvious ways in which such a process could go wrong.)

Friendship architecture represents the capability needed to recover from programmer errors. Since programmer error is nearly certain, showing that a threshold level of architectural Friendliness can handle errors is prerequisite to making a theoretical argument for the feasibility of Friendly AI. The more robust the Friendship architecture, the less programmer competence need be postulated in order to argue the practical achievability of Friendliness.

Friendship structure and acquisition are more unusual problems than Friendship content - collectively, we might call them the architectural problems. Architectural problems are closer to the design level and involve a more clearly defined amount of complexity. Our genes store a bounded amount of evolved complexity that wires up the hippocampus, but then the hippocampus goes on to encode all the memories stored by a human over a lifetime. Cognitive content is open-ended. Cognitive architecture is bounded, and is often a matter of design, of complex functional adaptation.


An Introduction to Goal Systems

Goal-oriented behavior is behavior that leads the world towards a particular state. A thermostat is the classic example of goal-oriented behavior; a thermostat turns on the air conditioning when the temperature reaches 74 and turns on the heat when the temperature reaches 72. The thermostat steers the world towards the state in which the temperature equals 73 - or rather, a state that can be described by "the house has a temperature of 73"; there are zillions (ten-to-the-zillions, rather) of possible physical states that conform to this description, even ignoring all the parts of the Universe outside the room. Technically, the thermostat steers the room towards a particular volume of phase space, rather than a single point; but the set of points, from our perspective, is compact enough to be given a single name. Faced with enough heat, the thermostat may technically fail to achieve its "goal", and the temperature may creep up past 75, but the thermostat still activates the air conditioning, and the themostat is still steering the room closer to 73 degrees than it otherwise would have been.

Within a mind, goal-oriented behaviors arise from goal-oriented cognition. The mind possesses a mental image of the "desired" state of the world, and a mental image of the actual state of the world, and chooses actions such that the projected future of world-plus-action leads to the desired outcome state. Humans can be said to implement this process because of a vast system of instincts; emotions; mental images; intuitions; pleasure and pain; thought sequences; nonetheless, the overall description usually holds true.

Any real-world AI will employ goal-oriented cognition. It might be theoretically possible to build an AI that made choices by selecting the first perceived option in alphabetical ASCII order, but this would result in incoherent behavior (at least, incoherent from our perspective) with actions cancelling out, rather than reinforcing each other. In a self-modifying AI, such incoherent behavior would rapidly tear the mind apart from the inside, if it didn't simply result in a string of error messages (effective stasis). Of course, if it were possible to obtain Friendly behavior by choosing the first option in alphabetical order, and such a system were stably Friendly under self-modification, then that would be an excellent and entirely acceptable decision system!  Ultimately, it is the external behaviors we are interested in. Even that is an overstatement; we are interested in the external results.  But as far as we humans know, the only way for a mind to exhibit coherent behavior is to model reality and the results of actions. Thus, internal behaviors are as much our concern as external actions. Internal behaviors are the source of the final external results.

To provide a very simple picture of a choice within a goal-oriented mind:

NOTE: Don't worry about the classical-AI look. The neat boxes are just so that everything fits on one graph. The fact that a single box is named "Goal B" doesn't mean that "Goal B" is a data structure; Goal B may be a complex of memories and abstracted experiences. In short, consider the following graph to bear the same resemblance to the AI's thoughts that a flowchart bears to a programmer's mind.

 

Diagram 1: Simple choice

NOTE: Blue lines indicate predictions. Rectangles indicate goals. Diamonds indicate choices. An oval or circle indicates a (non-goal) object or event within the world-model.

For this simple choice, the desirability of A is 23.75, and the desirability of ~A is 8.94, so the mind will choose A. If A is not an atomic action - if other events are necessary to achieve A - then A's child goals will derive their desirability from the total desirability of A, which is 14.81. If some new Event E has an 83% chance of leading to A, all else being equal, then Event E will become a child goal of A, and will have desirability of 12.29. If B's desirability later changes to 10, the inherent desirability of A will change to 19, the total desirability of A will change to 10.06, and the desirability of E will change to 8.35. The human mind, of course, does not use such exact properties, and rather uses qualitative "feels" for how probable or improbable, desirable or undesirable, an event is. The uncertainties inherent in modeling the world render it too expensive for a neurally-based mind to track desirabilities to four significant figures. A mind based on floating-point numbers might track desirabilities to nineteen decimal places, but if so, it would not contribute materially to intelligence (9).

In goal-oriented cognition, the actions chosen, and therefore the final results, are strictly dependent on the model of reality, as well as the desired final state. A mind that desires a wet sponge, and knows that placing a sponge in water makes it wet, will choose to place the sponge in water. A mind that desires a wet sponge, and which believes that setting a sponge on fire makes it wet, will choose to set the sponge on fire. A mind that desires a burnt sponge, and which believes that placing a sponge in water burns it, will choose to place the sponge in water. A mind which observes reality, and learns that wetting a sponge requires water rather than fire, may change actions (10).

One of the most important distinctions in Friendly AI is the distinction between supergoals and subgoals.  A subgoal is a way station, an intermediate point on the way to some parent goal, like "getting into the car" as a child goal of "driving to work", or "opening the door" as a child goal of "getting into the car", or "doing my job" as a parent goal of "driving to work" and a child goal of "making money". (11). Child goals are cognitive nodes that reflect a natural network structure in plans; three child goals are prerequisite to some parent goal, while two child2-goals are prerequisite to the second child1-goal, and so on. Subgoals are useful cognitive objects because subgoals reflect a useful regularity in reality; some aspects of a problem can be solved in isolation from others. Even when subgoals are entangled, so that achieving one subgoal may block fulfilling another, it is still more efficient to model the entanglement than to model each possible combination of actions in isolation. (For example:  The chess-playing program Deep Blue, which handled the combinatorial explosion of chess through brute force - that is, without chunking facets of the game into subgoals - still evaluated the value of individual board positions by counting pieces and checking strategic positions. A billion moves per second is not nearly enough to carry all positions to a known win or loss. Pieces and strategic positions have no intrinsic utility in chess; the supergoal is winning.)

Subgoals are cached intermediate states between decisions and supergoals. It should always be possible, given enough computational power, to eliminate "subgoals" entirely and make all decisions based on a separate prediction of expected supergoal fulfillment for each possible action. This is the ideal that a normative reflective goal system should conceive of itself as approximating.

Subgoals reflect regularities in reality, and can thus twinkle and shift as easily as reality itself, even if the supergoals are absolutely constant. (Even if the world itself were absolutely constant, changes in the model of reality would still be enough to break simplicity.)  The world changes with time. Subgoals interfere with one another; the consequences of the achievement of one subgoal block the achievement of another subgoal, or downgrade the priority of the other subgoal, or even make the other subgoal entirely undesirable. A child goal is cut loose from its parent goal and dies, or is cut loose from its parent goal and attached to a different parent goal, or attached to two parent goals simultaneously. Subgoals acquire complex internal structure, so that changing the parent goal of a subgoal can change the way in which the subgoal needs to be achieved. The grandparent goals of context-sensitive grandchildren transmit their internal details down the line. Most of the time, we don't need to track plots this complicated unless we become ensnared in a deadly web of lies and revenge, but it's worth noting that we have the mental capability to track a deadly web of lies and revenge when we see it on television.

None of this complexity necessarily generalizes to the behavior of supergoals, which is why it is necessary to keep a firm grasp on the distinction between supergoals and subgoals. If generalizing this complexity to supergoals is desirable, it may require a deliberate design effort.

That subgoals are probabilistic adds yet more complexity. The methods that we use to deal with uncertainty often take the form of "heuristics" - rules of thumb - that have a surprising amount of context-independence. "The key to strategy is not to choose a path to victory, but to choose so that all paths lead to a victory", for example. Even more interesting, from a Friendly AI perspective, are "injunctions", heuristics that we implement even when the direct interpretation of the world-model seems opposed.  We'll analyze injunctions later; for now, we'll just note that there are some classes of heuristic - both injunctions, and plain old strategy heuristics - that act on almost all plans. Thus, plans are produced, not just by the immediate "subgoals of the moment", but also by a store of general heuristics. Yet such heuristics may still be, ultimately, subgoals - that is, the heuristics may have no desirability independent of the ultimate supergoals.

Cautionary injunctions often defy the direct interpretation of the goal system - suggesting that they should always apply, even when they look non-useful or anti-useful. "Leaving margin for error," for example. If you're the sort of person who leaves for the airport 30 minutes early, then you know that you always leave 30 minutes early, whether or not you think you're likely to need it, whether or not you think that the extra 30 minutes are just wasted time. This happens for two reasons:  First, because your world-model is incomplete; you don't necessarily know about the factors that could cause you to be late. It's not just a question of there being a known probability of traffic delays; there's also the probabilities that you wouldn't even think to evaluate, such as twisting your ankle in the airport. The second reason is a sharp payoff discontinuity; arriving 30 minutes early loses 30 minutes, but arriving 30 minutes late loses the price of the plane ticket, possibly a whole day's worth of time before the next available flight, and also prevents you from doing whatever you needed to do at your destination. "Leaving margin for error" is an example of a generalized subgoal which sometimes defies the short-term interpretation of payoffs, but which, when implemented consistently, maximizes the expected long-term payoff integrated over all probabilities.

Even heuristics that are supposed to be totally unconditional on events, such as "keeping your sworn word", can be viewed as subgoals - although such heuristics don't necessarily translate well from humans to AIs. A human who swears a totally unconditional oath may have greater psychological strength than a human who swears a conditional oath, so that the 1% chance of encountering a situation where it would genuinely make sense to break the oath doesn't compensate for losing 50% of your resolve from knowing that you would break the oath if stressed enough. It may even make sense, cognitively, to install (or preserve) psychological forces that would lead you to regard "make sense to break the oath" as being a nonsensical statement, a mental impossibility. This way of thinking may not translate well for AIs, or may translate only partially. (12)  Perhaps the best interim summary is that human decisions can be guided by heuristics as well as subgoals, and that human heuristics may not be cognitively represented as subgoals, even if the heuristics would be normatively regarded as subgoals.

Human decision-making is complex, probably unnecessarily so. The way in which evolution accretes complexity results in simple behaviors being implemented as independent brainware even when there are very natural ways to view the simple behaviors as special cases of general cognition, since general cognition is an evolutionarily recent development. For the human goal supersystem, there is no clear way to point to a single level where the "supergoals" are; depending on how you view the human supersystem, supergoals could be identified with declarative philosophical goals, emotions, or pain and pleasure. Ultimately, goal-oriented cognition is not what humans are, but rather what humans do.  I have my own opinions on this subject, and the phrase "godawful mess" leaps eagerly to mind, but for the moment I'll simply note that the human goal system is extremely complicated; that every single chunk of brainware is there because it was adaptive at some point in our evolutionary history; and that engineering should learn from evolution but never blindly obey it. The differences between AIs and evolved minds are explored further in the upcoming section 2: Beyond anthropomorphism.

DEFN: Goal-oriented behavior:  Goal-oriented behavior is behavior that steers the world, or a piece of it, towards a single state, or a describable set of states. The perception of goal-oriented behavior comes from observing multiple actions that coherently steer the world towards a goal; or singular actions which are uniquely suited to promoting a goal-state and too improbable to have arisen by chance; or the use of different actions in different contexts to achieve a single goal on multiple occasions. Informally:  Behavior which appears deliberate, centered around a goal or desire.

DEFN: Goal-oriented cognition:  A mind which possesses a mental image of the "desired" state of the world, and a mental image of the actual state of the world, and which chooses actions such that the projected future of world-plus-action leads to the desired outcome state.

DEFN: Goal:  A piece of mental imagery present within an intelligent mind which describes a state of the world, or set of states, such that the intelligent mind takes actions which are predicted to achieve the goal state. Informally:  The image or statement that describes what you want to achieve.

DEFN: Causal goal system:  A goal system in which desirability backpropagates along predictive links. If A is desirable, and B is predicted to lead to A, then B will inherit desirability from A, contingent on the continued desirability of A and the continued expectation that B will lead to A. Since predictions are usually transitive - if C leads to B, and B leads to A, it usually implies that C leads to A - the flow of desirability is also usually transitive.

DEFN: Child goal:  A prerequisite of a parent goal; a state or characteristic which can usefully be considered as an independent event or object along the path to the parent goal. "Child goal" describes a relation between two goals - it does not make sense to speak of a goal as being "a child" or "a parent" in an absolute sense, since B may be a child goal of A but a parent goal of C.

DEFN: Parent goal:  A source of desirability for a child goal. The end to which the child goal is the means. "Parent goal" describes a relation between two goals - it does not make sense to speak of a goal as being "a parent" or "a child" in an absolute sense, since B may be a parent goal of C but a child goal of A.

DEFN: Subgoal:  An intermediate point on the road to the supergoals. A state whose desirability is contingent on its predicted outcome.

DEFN: Supergoal content:  The root of a directional goal network. A goal which is treated as having intrinsic value, rather than having derivative value as a facilitator of some parent goal. An event-state whose desirability is not contingent on its predicted outcome. (Conflating supergoals with subgoals seems to account for a lot of mistakes in speculations about Friendly AI.)


Interlude: The story of a blob

    "And this stone, it's the reason behind everything that's happened here so far, isn't it?  That's what the Servants have been up to all this time."
    "No. The stone, itself, is the cause of nothing. Our desire for it is the reason and the cause."
                    -- Allen L. Wold, "The Eye in the Stone"
Once upon a time...

In the beginning, long before goal-oriented cognition, came the dawn of goal-oriented behavior. In the beginning were the biological thermostats. Imagine a one-celled creature - or perhaps a mere blob of chemistry protected by a membrane, before the organized machinery of the modern-day cell existed. The perfect temperature for this blob is 80 degrees Fahrenheit. Let it become too hot, or too cold, and the biological machinery of the blob becomes less efficient; the blob finds it harder to metabolize nutrients, or reproduce... even dies, if the temperature diverges too far. But the blob, as yet, has no thermostat. It floats where it will, and many blobs freeze or burn, but the blob species continues; each blob absorbing nutrients from some great primordial sea, growing, occasionally splitting. The blobs do not know how to swim. They simply sit where they are, occasionally pushed along by Brownian motion, or currents in the primordial sea.

Every now and then there are mutant blobs. The mutation is very, very simple; one single bit of RNA or proto-RNA flipped, one single perturbation of the internal machinery, perhaps with multiple effects as the perturbation works its way through a chain of dependencies, but with every effect of the mutation deriving from that single source. Perhaps, if this story begins before the separate encoding genetic information, back in the days of self-replicating chemicals, the mutation takes the form of a single cosmic ray striking one of the self-replicating molecules that make up the blob's interior, or the blob's membrane. The mutation happened by accident. Nobody decided to flip that RNA base; radiation sleets down from the sky and strikes at random. Most of the time, the RNA bitflip and the consequent perturbation of chemical structure destroys the ability to self-replicate, and the blob dies or becomes sterile. But there are many blobs, and many cosmic rays, and sometimes the perturbation leaves the self-replicating property of the chemical intact, though perhaps changing the functionality in other ways. The vast majority of the time, the functionality is destroyed or diminished, and the blob's line dies out. Very, very rarely, the perturbation makes a better blob.

One day, a mutant blob comes along whose metabolism - "metabolism" being the internal chemical reactions necessary for resource absorption and reproduction - whose metabolism has changed in such a way that the membrane jerks, being pushed out or pulled in each time a certain chemical reaction occurs. Pushing and pulling on the membrane is an unnecessary expenditure of energy, and ordinarily the mutant blob would be outcompeted, but it so happens that the motion is rhythmic, enough to propel the blob in some random direction.

The blob has no navigation system. It gets turned around, by ocean currents, or by Brownian motion; it sometimes spends minutes retracing its own footsteps. Nonetheless, the mutant blob travels farther than its fellows, into regions where the nutrients are less exhausted, where there aren't whole crowds of sessile blobs competing with it. The swimming blob reproduces, and swims, and reproduces, and soon outnumbers the sessile blobs.

Does this blob yet exhibit goal-oriented behavior?  Goal-oriented cognition is a long, long, long way down the road; does the blob yet exhibit goal-oriented behavior?  No. Not in the matter of swimming, at least; not where the behavior of a single blob is concerned. This single blob will swim towards its fellows, or away from nutrients, as easily as the converse. The blob cannot even be said to have the goal of achieving distance; it sometimes retraces its own tracks. The blob's swimming behavior is an evolutionary advantage, but the blob itself is not goal-oriented - not yet.

Human observers have, at one time or another, attributed goal-oriented behavior and even goal-oriented cognition to the Sun, the winds, and even rocks. A more formal definition would probably require a conditional behavior, an either-or decision predicated on the value of some environmental variable; convergence, across multiple possibilities and different decisions in each, to a single state of the world.

Imagine, in some mathematical Universe, a little adding machine... that Universe's equivalent of a blob. The adding machine lurches along until it reaches a number, which happens to be 62; the adding machine adds 5 to it, yielding 67, and then lurches away. Is this a goal-oriented behavior, with the "goal" being 67?  Maybe not; maybe the number was random. Maybe adding 5 is just what this adding machine does, blindly, to everything it runs across. If we then see the adding machine running across 63 and adding 4, and then adding 2 to 65, we would hypothesize that the machine was engaging in goal-oriented behavior, and that the goal was 67. We could predict that when the machine runs across the number 64, up ahead, it will add 3. If the machine is known to possess neurons or the equivalent thereof, we will suspect that the machine is engaging in primitive goal-oriented cognition; that the machine holds, internally, a model of the number 67, and that it is performing internal acts of subtraction so that it knows how much to externally add. If the "adding machine" is extremely complex and evolutionarily advanced, enough to be sentient and social like ourselves, then 67 might have religious significance rather than reproductive or survival utility. But if the machine is too primitive for memetics, like our chemical blob, then we would suspect much more strongly that there was some sort of evolutionary utility to the number 67.

By this standard, is the swimming of the chemical blob a goal-oriented behavior?  No; the blob cannot choose when to start swimming or stop swimming, or in what direction to travel. It cannot decide to stop, even to prevent itself from swimming directly into a volcanic vent or into a crowded population of competing blobs. There is no conditional action. There is no convergence, across multiple possibilities and different decisions in each, to a single state of the world.

Although the blob itself has no goal-oriented behavior, it could perhaps be argued that a certain amount of goal-oriented behavior is visible within the blob's genetic information... the "genes", even if the blob lies too close to the beginning of life for DNA as we know it. The blob that swims into a nutrient-rich region prospers; this would hold true regardless of which blob swam there, or why, or which mutation drove it there. The mutation didn't even have to be "swimming"; the mutation could have been a streamlined shape for ocean currents, or a shape more susceptible to Brownian motion. From multiple possible origins, convergence to a single state; the blob that swims outside the crowd shall prosper. There is a "selection pressure" in favor of swimming outside the crowd. That the original blob was born was an accident - it was not a goal-oriented behavior of the genes "deciding" to swim - but that there are now millions of swimmers is not an accident; it is evolution. The original mutant was "a blob whose metabolism happens to pulse the membrane"; its millions of descendants are "swimmers who sometimes reach new territory".

Along comes another mutation, manifested as another quirk of chemistry. When the temperature rises above 83 degrees, the side of the blob contracts, or changes shape. Perhaps if one side of the membrane is hotter than 83 degrees, the blob contracts in a way that directs the motion of swimming away from the heat. Perhaps the effect is not so specific, leading only to a random change of swimming direction when it starts getting hot - this still being better than swimming on straight ahead. This is the ur-thermostat, even as thermostats themselves are ur-goal-behavior. The blob now exhibits goal-oriented behavior; the blob reacts to the environment in a conditional way, with the convergent result of "cooler living space". (Though a random change of direction is on the barest edge of being describable as "goal-oriented". A directional, swimming-away change is a much clearer case.)

In time to come, additional mutations will pile up. The critical temperature of the heat-avoidance reflex will drop from 83 degrees to 81 degrees (recall that we said the optimum temperature was 80). The heat-avoidance reflex will be matched by a cold-avoidance reflex, perhaps with a critical temperature of first 72, then rising to 79. Despite the seeming purposefulness of this slow accumulation of adaptations, despite the convenience and predictive power of saying that "the blob is evolving to stay within the optimum temperature range", the predictions sometimes go wrong, and then it is necessary to fall back on the physical standpoint - to revert from teleology to causality.

Every now and then, it becomes necessary to view the blob as a bundle of pieces, rather than as a coherent whole. "Individual organisms are best viewed as adaptation-executers rather than fitness-maximizers," saith Cosmides and Tooby, and sometimes it becomes necessary to see individual adaptations as they execute. The less evolved the organism, the more necessary the reductionist stance becomes. Consider the adding machine in the mathematical Universe; if the number 67 does have reproductive utility, then the adding machine might have started out as a random crawler that acquired the reflex to add 4 to 63. Its descendants acquired the reflexes to add 5 to 62, to add 7 to 60, to add 3 to 64, to add 2 to 65, to add 1 to 66.

If viewing the adding machine as a fitness-maximizer, we should be extremely surprised when, on running across 61, the machine adds 8. Viewing the adding machine as an adaptation-executer, of course, the scenario makes perfect sense; the adding machine has adaptations for some contingencies, but has not yet acquired the adaptations for others. Similarly, if the environment suddenly changes, so that 68 is now the maximal evolutionary advantage instead of 67, the adding machine will change slowly, piecemeal, as the individual reflexes change, one by one, over evolutionary time. A generalized subtraction mechanism would only need to mutate once, but genes are not permitted to plan ahead.

The teleological viewpoint often fails, where evolution is concerned. To completely eliminate the teleological viewpoint, leaving only causality, one would never be permitted to say that a particular trait was an "evolutionary advantage" for a mathblob; one would be required to describe the entire history, each individual act of addition and the resulting acquisition of resources, every interaction in which an ancestor outcompeted another mathblob with a different genetic makeup. It is a computationally expensive viewpoint - extremely expensive - but it has the advantage of being utterly true. If - returning to our own Universe - some unique mutant superblob accidentally swims directly into a volcanic vent and perishes, it is a historical fact that fits seamlessly into the physicalist standpoint, however tragic it may seem from the evolutionary view.

Our genes are not permitted to plan ahead, because ultimately, all that exists is the history of lives and matings. Unless the present-day utility of some hypothetical adaptation impacted a problem or competition in our ancestral history, it cannot have affected the historical lives of our ancestors, and cannot have affected the final outcome - us.


2: Beyond anthropomorphism

Anthropomorphic ("human-shaped") thinking is the curse of futurists. One of the continuing themes running through Creating Friendly AI is the attempt to track down specific features of human thought that are solely the property of humans rather than minds in general, especially if these features have, historically, been mistakenly attributed to AIs.

Anthropomorphic thinking is not just the result of context-insensitive generalization. Anthropomorphism is the result of certain automatic assumptions that humans are evolved to make when dealing with other minds. These built-in instincts will only produce accurate results for human minds; but since humans were the only intelligent beings present in the ancestral environment, our instincts sadly have no built-in delimiters.

Many personal philosophies, having been constructed in the presence of uniquely human instincts and emotions, reinforce the built-in brainware with conscious reasoning. This sometimes leads to difficulty in reasoning about AIs; someone who believes that romantic love is the meaning of life will immediately come up with all sorts of reasons why all AIs will necessarily exhibit romantic love as well.

In the abstract, almost everyone agrees that AIs will be alien and that anthropomorphic thinking about aliens is bad. Encountering specific alienness is always a shock. My advice to my readers is to enjoy the shock, appreciate the alienness, even if you're personally sure that the first AIs will necessarily be humanlike in some particular way. Afterward, when both pictures are clear in your mind, you can stand back and compare design methods.

2.1: Reinventing retaliation

If you punch a human in the nose, he or she will punch back. If the human doesn't punch back, it's an admirable act of self-restraint, something worthy of note.

Imagine, for a moment, that you walk up and punch an AI in the nose. Does the AI punch back?  Perhaps and perhaps not, but punching back will not be instinctive.  A sufficiently young AI might stand there and think:  "Hm. Someone's fist just bumped into my nose."  In a punched human, blood races, adrenaline pumps, the hands form fists, the stance changes, all without conscious attention. For a young AI, focus of attention shifts in response to an unexpected negative event - and that's all.

As the AI thinks about the fist that bumped into vis (13) nose, it may occur to the AI that this experience may be a repeatable event rather than a one-time event, and since a punch is a negative event, it may be worth thinking about how to prevent future punches, or soften the negativity. An infant AI - one that hasn't learned about social concepts yet - will probably think something like:  "Hm. A fist just hit my nose. I'd better not stand here next time."

If the AI is young, but old enough to be explaining human actions by reference to human cognition, ve may know that the fist movement and collision was probably not a coincidence; ve may hypothesize that the collision of human fist with AI's nose was intended, and even go on to hypothesize that the outcome of the punch, the resulting damage to the AI, was a predictable (predictable to the human) result of the punch, and that the purpose of the punch was to damage the AI; that is, the AI models the human's cognition and concludes that the human had a subgoal of {punching the AI}, and that the parent goal was {causing punch-damage to the AI}. Thus, the AI's focus of attention shifts to the human's motives rather than the physical punch itself, and the AI's next goal becomes deducing the human's real motive (that is, the parent goal of {causing punch-damage to the AI}).

There are now three different ways the AI can try to prevent the next punch; by preventing the physical collision itself, by preventing the human from deciding to punch the AI, or by preventing the unknown original motive from recurring. (14). Note, incidentally, that this improved picture rules out "standing somewhere else" as a solution, since the trajectory of the human's fist would be expected to track the AI's new location.

Punching back is an incredibly nonobvious solution. On the face of it, throwing the punch simply causes punch-damage to a human, something that, considered as an isolated event, the AI has no cause to regard as desirable. When starting from scratch, "punching back" seems as ridiculous as our mothers always said it was - analogous to reacting to a fire alarm by changing your shirt. For an AI to independently reinvent the idea of punching back would be an act of incredible genius and lateral thinking.

Independent reinvention is unlikely, not least because a seed AI would have extensively surfed archives of the Internet or other information sources, and would have encountered thousands, if not millions, of hints to the effect that humans punch back. But consider, for a moment, what a truly independent AI would need to understand before reinventing the concept of retaliation. Ve would need to begin, as stated, with the realization that the human punching ver did so on purpose and with intent to damage. This, in itself, is not such a large assumption; humans are intelligent beings, so there is often a direct mapping between {the results of our actions} and {our goals}. On the other hand, there's a long gap between an AI saying "Hm, this result may correspond to the human's intentions" and a human saying "Hey, you did that on purpose!"

If an infantile AI thinks "Hm, a fist just hit my nose, I'd better not stand here again", then a merely young AI, more experienced in interacting with humans, may apply standard heuristics about apparently inexplicable human actions and say:  "Your fist just hit my nose... is that necessary for some reason?  Should I be punching myself in the nose every so often?"  One imagines the nearby helpful programmer explaining to the AI that, no, there is no valid reason why being punched in the nose is a good thing, after which the young AI turns around and says to the technophobic attacker:  "I deduce that you wanted {outcome: AI has been punched in the nose}. Could you please adjust your goal system so that you no longer value {outcome: AI has been punched in the nose}?"

And how would a young AI go about comprehending the concept of "harm" or "attack" or "hostility"?  Let us take, as an example, an AI being trained as a citywide traffic controller. The AI understands that (for whatever reason) traffic congestion is bad, and that people getting places on time is good. (15). The AI understands that, as a child goal of avoiding traffic congestion, ve needs to be good at modeling traffic congestion. Ve understands that, as a child goal of being good at modeling traffic congestion, ve needs at least 512GB of RAM, and needs to have thoughts about traffic that meet or surpass a certain minimal level of efficiency. Ve knows that the programmers are working to improve the efficiency of the thinking process and the efficacy of the thoughts themselves, which is why the programmers' actions in rewriting the AI are desirable from the AI's perspective.

A technophobic human who hates the traffic AI might walk over and remove 1GB of RAM, this being the closest equivalent to punching a traffic AI in the nose. The traffic AI would see the conflict with {subgoal: have at least 512GB of RAM}, and this conflict obviously interferes with the parent goal of {modeling traffic congestion} or the grandparent goal of {reducing traffic congestion}, but how would an AI go about realizing that the technophobic attacker is "targeting the AI", "hating the AI personally", rather than trying to increase traffic congestion?

From the AI's perspective, descriptions of internal cognitive processes show up in a lot of subgoals, maybe even most of the subgoals. But these internal contents don't necessarily get labeled as "me", with everything else being "not-me". The distinction is a useful one, and even a traffic-control AI will eventually formulate the useful categories of "external-world subgoals" and "internal-cognition subgoals", but the division will not necessarily have special privileges; the internal/external division may not be different in kind from the division between "cognitive subgoals that deal with random-access memory" and "cognitive subgoals that deal with disk space". How is a young AI supposed to guess, in advance of the fact, that so many human concepts and thoughts and built-in emotions revolve around "Person X", rather than "Parietal Lobe X" or "Neuron X"?  How is the AI supposed to know that it's inherently more likely that a technophobic attacker intends to "injure the AI", rather than "injure the AI's random-access memory" or "injure the city's traffic-control"?

The concept of "injuring the AI", and an understanding of what a human attacker would tend to categorize as "the AI", is a prerequisite to understanding the concept of "hostility towards the AI". If a human really hates someone, she (16) will balk the enemy at every turn, interfere with every possible subgoal, just to maximize the enemy's frustration. How would an AI understand this?

Perhaps the AI's experience of playing chess, tic-tac-toe, or other two-sided zero-sum games will enable the AI to understand "opposition" - that everything the opponent desires is therefore undesirable to you, and that everything you desire is therefore undesirable to the opponent; that if your opponent has a subgoal, you should have a subgoal of blocking that subgoal's completion, and that if you have a valid subgoal, your opponent will have a subgoal of blocking your subgoal's completion.

Real life is not zero-sum, but the heuristics and predictive assumptions learned from dealing with zero-sum games may work to locally describe the relation between two social enemies. (Even the bitterest of real-life enemies will have certain goal states in common, e.g., nuclear war is bad; but this fact lies beyond the relevance horizon of most interactions.)

The real "Aha!" would be the insight that the attacking human and the AI could be in a relation analogous to players on opposing sides in a game of chess.  This is a very powerful and deeply fundamental analogy. As humans, we tend to take this perspective for granted; we were born with it. It is, in fact, a deep part of how we humans define the self.  It is part of how we define being a person, this cognitive assumption that you and I and everyone else are all nodes in a social network, players in a hugely multisided non-zero-sum game. For a human, myself is a great, embracing symbol that gathers in the-player-that-is-this-goal-system and the-part-of-reality-that-is-inside-this-mind and the-body-that-sees-and-moves-for-this-viewpoint.  For a human, these are all the same thing, part of what is meant by "I".

Even so, the concept of game theory is not sufficient to reinvent "retaliation"; it is simply a prerequisite. Understanding the Axelrod-and-Hamilton "Tit for Tat" strategy (17) is sufficient to suggest "Hey, maybe I should think about punching the attacker back!" but not sufficient to justify the suggestion, once made. (18). One question that has gone unasked throughout this entire analysis is "What is the utility of punching back?"  A human, in punching back, may or may not pause to consider whether it will bring any benefits; but, even so, we instinctively retaliate as a result of reflexes that were an advantage in the ancestral environment.

The evolutionary benefit of retaliation lies in the probable adjustment to the future behavior of others.  People - humans, anyway - are less likely to hit you if they think you're likely to hit back. Hitting back an individual human may or may not work to prevent a recurrence, or to decrease the number of recurrences, but failure to hit back may result in more punches coming your way. Of course, this relies on the attackers having the kind of psychologies that take possible retaliation into account - an AI, to fully grok the reason for retaliation, would have to understand this aspect of human (evolutionary) psychology as well. The Tit for Tat strategy doesn't succeed in isolation; it only succeeds when some of the strategies it interacts with also take the behavior of the opponent/partner into account.

In reality, for an AI, returning a punch is likely to be far more trouble than it's worth. There's that whole crowd of onlookers who go:  "Look!  It punched a human being!  Kill it!  Kill the AI!"  Of course, if the AI understands enough evolutionary psychology and game theory to employ Tit for Tat, but not enough human psychology to understand the concept of "double standard", ve may not realize how much trouble punching a human may land ver in. This is one of the reasons why a well-designed Friendly AI would have a precautionary injunction that says something along the lines of "Don't attack a human, even if it looks like a good idea at the time."  A Friendly AI would also regard damage to an unconsenting human as intrinsically undesirable under the Friendliness supergoal content. (19).

So what's the upshot?  Primarily, what I want to convey to you is the sense of a blank slate, of starting from scratch, of a vast silence space - for young AIs, at least. This blank slate doesn't mean that AIs, even young AIs, are necessarily slow. For an AI with thirty-two 2-Ghz processors, the maximum speed on a reasonably linear chain of thought (one that doesn't require more than thirty-two parallel processing threads) is 2 billion ticks per second, as opposed to the 200 ticks per second of the hundred trillion synapses of the human brain. Perhaps an AI - on being physically punched for the very first time on vis very first encounter with a hostile human - could think through what ve remembers from surfing Usenet archives, and what ve read in newspapers, and books of fiction and nonfiction, and what ve knows about human evolutionary psychology, and conclude that returning the punch is the best course of action, and do it all in time to return the punch as fast as any instinct-driven human - assuming that 32 processors is enough speed to handle the complex motor actions required.

This isn't a literal possibility (20) unless the AI is transhuman, and a transhuman AI would be mature enough to know all about social networks; ve wouldn't need to start from scratch. The first time some young AI gets punched in the nose, ve's likely to react in one of the ways described earlier, or some other way just as surreal, if ve reacts at all. It'll take some extended conversations with the programmers about evolutionary psychology before the AI understands what's going on. But, the second time the AI gets punched in the nose, it shouldn't take any time at all - by human standards - to run through a chain of logic that's easy to reverify. It's inventing that takes massive computing power and human confirmation; retracing your own footprints is likely to be a fairly serial process that can be consummated in a tenth, a hundredth of a second. If re-spawning a child goal from a parent goal is a serial task, one that doesn't invoke any computationally intensive subprocesses, then the AI can retrace the path from supergoal Friendliness content to the correct course of action (retaliation, rebuking the attacker, whatever) in a human eyeblink.

It's easy enough to accept that a young AI is a vast silent space. For a human, blood races, adrenaline pumps, eyes narrow, hands form fists. For a human, counterattack is instinctive - the correct level of counterattack,  the socially acceptable level of counterattack; a damaging counterpunch rather than a lethal bullet. Our instincts are context-sensitive and hedged 'round by other instincts. Our instincts are complex functional adaptations with multiple moving parts, often hosted on a modular chunk of brainware. Complex functional adaptations don't just materialize spontaneously in source code, just as complex dishes like pizza don't suddenly start growing on palm trees. Thus, a young AI might choose to retaliate, but would certainly not feel the need to retaliate, as a human would. To a young AI, retaliation is not an instinct; retaliation is just another subgoal.

What about a more mature AI, especially one that can rewrite vis own source code?  Regardless of whether it would be a good idea, it would certainly be possible for a seed AI to create a reflex for instant retaliation.

There are several clear reasons why humans have evolved a retaliation instinct, rather than a retaliation logic. The primary reason is that a retaliation instinct is easier to evolve.  The retaliation instinct evolved long before general intelligence, so evolving a retaliation logic first would not just have been more difficult, but actually impossible. Also, evolution tends to arrive at procedural solutions rather than declarative solutions, because a component of a complex procedural solution can be functional in its own right.

If genes could, somehow, store declarative knowledge, the first piece of knowledge stored would be "Punching back is good," which is simpler than "Punching back is good because it decreases the chance of future punches," which is simpler than "Punching back decreases the chance of future punches by modifying others' behavior", which is simpler than "Punching back modifies others' behavior because, on seeing you punch back, they'll project an increased chance of you punching back if they punch you, which makes them less likely to punch back."  All of this is moot, since as far as I know, nobody has ever run across a case of genes storing abstract knowledge. (By this I mean knowledge stored in the same format used for episodic memories or declarative semantic knowledge.)

Abstract knowledge cannot evolve incrementally and therefore it does not evolve at all. This fact, by itself, is enough to completely explain away human use of retaliation instincts rather than retaliation logic, and we must go on to consider independently whether a retaliation instinct or a retaliation logic is more useful.  For humans, I think that a retaliation instinct is more useful, or at least it's more of an evolutionary advantage. Even if we had conscious control over our endocrine systems, so we could deliberately choose to pump adrenaline, we would still be shot down by the sheer human-slowness of abstract thought. We are massively parallel systems running at 200Hz. When you're massively parallel you can afford to precompute things, and when you run at 200Hz you must precompute things because everything has to be done in very few serial steps.

When you run at 2 billion ticks per second, the overhead of recreating and rechecking a few previously-thought-out child goals is comparatively trivial next to all the other actions those subgoals entail, including complex, creative, parallel-intensive tasks such as computing motor actions (if ve decides to punch back) or thinking up an effective opening line (if ve decides to rebuke the attacker). For a human, interposing any extra serial step in a process is a huge time burden, but massively parallel operations vanish into the maw of trillions and trillions of synapses. For an AI, an extra serial step performed at the start of a thought is nearly unnoticeable (21), but trying to imitate human parallelism would bring the AI to vis knees. (A later AI may have access to computing power rivalling human parallelism, but would presumably still have the faster serial speed.)

The conclusion:  From an AI's perspective, a retaliation logic is probably considerably more effective than a retaliation instinct. The retaliation logic has a context-sensitivity that the retaliation instinct lacks. Suppose a young AI has some set of ultimate goals - traffic reduction, or hopefully Friendliness supergoals, or whatever. The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals. A maturing AI would have the capability to replace the retaliation logic with a retaliation instinct, or with a retaliation imperative having supergoal status.  The instinct would be slightly faster - 0.01% faster, or perhaps 1% faster (eliminating a serial step at the start of the algorithm saves very little time, and most of the computational cost is computation-intensive motor logic or persuasive speech production). However, in doing so, the AI would lose a substantial amount of the context sensitivity of the retaliation logic - that is, from the perspective of the current set of supergoals, the supergoals that the AI uses to decide whether or not to implement the optimization.

Changing retaliation to an independent supergoal would affect, not just the AI's speed, but the AI's ultimate decisions. From the perspective of the current set of supergoals, this new set of decisions would be suboptimal. Suppose a young AI has some set of ultimate goals - traffic reduction, Friendliness, whatever. The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals. The maturing AI then considers whether changing the logic to an independent supergoal or optimized instinct is a valid tradeoff. The benefit is shaving one millisecond off the time to initiate retaliation. The cost is that the altered AI will execute retaliation in certain contexts where the present AI would not come to that decision, perhaps at great cost to the present AI's supergoals (traffic reduction, Friendliness, etc). Since recreating the retaliation subgoal is a relatively minor computational cost, the AI will almost certainly choose to have retaliation remain strictly dependent on the supergoals.

Why do I keep making this point, especially when I believe that a Friendly seed AI can and should live out vis entire lifecycle without ever retaliating against a single human being?  I'm trying to drive a stake through the heart of a certain conversation I keep having.
 

Somebody:   "But what happens if the AI decides to do [something only a human would want] ?"
Me:   "Ve won't want to do [whatever] because the instinct for doing [whatever] is a complex functional adaptation, and complex functional adaptations don't materialize in source code. I mean, it's understandable that humans want to do [whatever] because of [selection pressure], but you can't reason from that to AIs."
Somebody:   "But everyone needs to do [whatever] because [personal philosophy], so the AI will decide to do it as well."
Me:   "Yes, doing [whatever] is sometimes useful. But even if the AI decides to do [whatever] because it serves [Friendliness supergoal] under [contrived scenario], that's not the same as having an independent desire to do [whatever]."
Somebody:   "Yes, that's what I've been saying:  The AI will see that [whatever] is useful and decide to start doing it. So now we need to worry about [scenario in which <whatever> is catastrophically unFriendly]."
Me:   "But the AI won't have an independent desire to do [whatever]. The AI will only do [whatever] when it serves the supergoals. A Friendly AI would never do [whatever] if it stomps on the Friendliness supergoals."
Somebody:   "I don't understand. You've admitted that [whatever] is useful. Obviously, the AI will alter itself so it does [whatever] instinctively."
Me:   "The AI doesn't need to give verself an instinct in order to do [whatever]; if doing [whatever] really is useful, then the AI can see that and do [whatever] as a consequence of pre-existing supergoals, and only when [whatever] serves those supergoals."
Somebody:   "But an instinct is more efficient, so the AI will alter itself to do [whatever] automatically."
Me:   "Only for humans. For an AI, [complex explanation of the cognitive differences between having 32 2-gigahertz processors and 100 trillion 200-hertz synapses], so making [whatever] an independent supergoal would only be infinitesimally more efficient."
Somebody:   "Yes, but it is more efficient!  So the AI will do it."
Me:   "It's not more efficient from the perspective of a Friendly AI if it results in [something catastrophically unFriendly]. To the exact extent that an instinct is context-insensitive, which is what you're worried about, a Friendly AI won't think that making [whatever] context-insensitive, with [horrifying consequences], is worth the infinitesimal improvement in speed."

Retaliation was chosen as a sample target because it's easy to explain, easy to see as anthropomorphic, and a good stand-in for the general case. Though "retaliation" in particular has little or no relevance to Friendly AI - I wouldn't want any Friendly AI to start dabbling in retaliation, whether or not it looked like a good idea at the time - what has been said of "retaliation" is true for the general case. Indeed, this is one of the only reasons why Friendliness is possible at all; in particular:

2.2: Selfishness is an evolved trait

By "selfishness", I do not just mean the sordid selfishness of a human sacrificing the lives of twenty strangers to save his own skin, or something equally socially unacceptable. The entire concept of a goal system that centers around the observer is fundamentally anthropomorphic.

There is no reason why an evolved goal system would be anything but observer-focused. Since the days when we were competing chemical blobs, the primary focus of selection has been the individual (22). Even in cases where fitness or inclusive fitness is augmented by behaving nicely towards your children, your close relatives, or your reciprocal-altruism trade partners, the selection pressures are still spilling over onto your kin, your children, your partners. We started out as competing blobs in a sea, each blob with its own measure of fitness. We grew into competing players in a social network, each player with a different set of goals and subgoals, sometimes overlapping, sometimes not.

Though the goals share the same structure from human to human, they are written using the variable "I" that differs from human to human, and each individual substitutes in their own name. Every built-in instinct and emotion evolved around the fixed point at the center.

While discussing retaliation, I offered a scenario of a young AI being punched in the nose, and noted the additional mental effort it would take for the AI to realize that ve, "personally", was being targeted. The AI would have to imagine a completely different cognitive architecture before ve could comprehend what a human is thinking when he or she "personally targets" someone, and even so the AI verself will never feel "personally targeted". You can imagine yourself pointing a finger directly at some young AI and saying, "Look at that!"  And the AI spins around to look behind verself and says "Where?"

This metaphor - a being with a visuospatial model of the physical world that doesn't include vis own body, or at least, doesn't include vis own body as "anything worth noticing" - is analogous, not to the AI's physical model of the world, but to the AI's moral model of the world. A Friendly AI may be greatly concerned with the welfare of the surrounding humans, but if you ask ver "What about your own welfare?", ve'll say "The welfare of what?"  A young AI would, at any rate; an older AI would understand exactly what you meant, but wouldn't see the argument as any more intuitive or persuasive. A Friendly AI sees the nearby humans as moral nodes, but there's no node at the center - no node-that-is-this-node - and possibly even no center. If you, metaphorically, say "Look at that!", a young AI will say "Look at what?". An older AI will understand that you see a node, but that doesn't mean the AI will see a node.

As I pointed out in GISAI 2.4.4: The legitimate use of the word "I", an AI's model of reality will inevitably form categories and objects in the same place where a human keeps his or her "self". There shall develop heuristics and thoughts which branch on whether or not something is labeled as being part of the "AIself", and heuristics which only act on subcategories such as "causal analysis system" or "goal checking subsystem". The AIself will probably not be shaped quite like a human self; it will probably include one or two things that a human would exclude, and vice versa. Historically, only in the twentieth century did humans really begin to understand that the mind was not a unified object but rather a system with moving parts; chronologically, an AI is likely to notice properties of the causal analysis key-variable-selection subsystem before the AI notices the causal analysis system superobject, and the AI will notice the causal analysis system before the AI notices the "AIself" superobject. (Actually, an infant AI may start out with all of these objects and superobjects identified-in-advance by the programmers, but the details will still fill in from the bottom up rather than the top down.)  The AI will notice, understand, and eventually manipulate the pieces of vis self, and the whole - and be qualitatively different because of it, becoming able to legitimately use the word "I" - but the AI will still have an AI's self, not a human's self.

Within the goal system, a lot of subgoals - a lot of the events in the causal chains that lead to the supergoal of being Friendly to the surrounding humans - are likely to thread through subobjects of the AI's self: increasing the efficiency of some piece of code, improving on some heuristic, and so on. But just because components of the AI's self are useful doesn't mean that the AI's self becomes a moral node; my computer is very useful and many of my subgoals thread through my computer, but I don't class my computer as having independent supergoal status.
 

The lack of an observer-biased ("selfish") goal system is perhaps the single most fundamental difference between an evolved human and a Friendly AI. This difference is the foundation stone upon which Friendly AI is built. It is the key factor missing from the existing, anthropomorphic science-fictional literature about AIs. To suppress an evolved mind's existing selfishness, to keep a selfish mind enslaved, would be untenable - especially when dealing with a self-modifying or transhuman mind!  But an observer-centered goal system is something that's added, not something that's taken away. We have observer-centered goal systems because of externally imposed observer-centered selection pressures, not because of any inherent recursivity. If the observer-centered effect were due to inherent recursivity, then an AI's goal system would start valuing the "goal system" subobject, not the AI-as-a-whole!  A human goal system doesn't value itself, it values the whole human, because the human is the reproductive unit and therefore the focus of selection pressures.

The epic human struggle to choose between selfishness and altruism is the focus of many personal philosophies, and I have thus observed that this point about AIs is one of the hardest ones for people to accept. An AI may look more like an altruistic human than a selfish one, but an AI isn't selfish or altruistic; an AI is an AI. An AI is not a human who has selflessly renounced personal interests in favor of the community; an AI is not a human with the value of the node-that-is-this-node set to zero; an AI is a mind that just cares about other things, not because the "selfish" part has been ripped out or brainwashed or suppressed, but because the AI doesn't have anything there.  An observer-centered goal system is something that's added to a mind, not something that's taken away. The next few subsections deal with some frequently raised topics surrounding this point.

2.2.1: Pain and pleasure

Imagine, for a moment, that you walk up and punch a seed AI in the nose. Does the AI experience pain when the punch lands?

What is "pain"?  What is the evolutionary utility of pain?  In its most basic form, pain appears as internal, cognitive negative feedback. If an internal cognitive event causes negative consequences in external reality, negative feedback decreases the probability of that internal cognitive event recurring, and thereby decreases the probability of the negative consequences in external reality recurring. Pain - cognitive negative feedback of any kind - needs somewhere to focus to be useful. Negative feedback needs an internal place to focus, since cognitive feedback cannot reprogram external reality.

In humans, of course, there's more to pain than negative feedback; human pain also acts as a damage signal, and shifts focus of attention from whatever we were previously thinking about, and makes us start thinking about ways to make the pain go away. (All of that functionality attached to a single system bus!  Evolution has a tendency to overload existing functions.)  The human cognitive architecture is such that pain can be present even in the absence of a useful focus for the negative-feedback aspect of pain. A human can even be driven insane by continued pain, with no escape route (nowhere for the cognitive negative feedback to focus). The capacity to be driven insane by continued pain seems nonadaptive - but then, in the ancestral environment, people damaged enough to experience extended unbearable pain probably died soon in any case, and the sanity or insanity of their final moments had little bearing on reproductive history. (23).

Neither pain nor pleasure, as design features, are inherently necessary to the functionality of negative or positive feedback. Given the supergoal of being Friendly - or, for that matter, the goal of walking across the room - negative feedback can be consciously implemented as a subgoal. For example, if an AI has the goal of walking across the room, and the AI gets distracted and trips over a banana peel, the AI can reason:  "The event of my being distracted caused me to place my foot on a banana peel, delaying my arrival at the end of the room, which interferes with [whatever the parent goal was], and this causal chain may recur in some form. Therefore I will apply positive feedback (increase the priority of, increase the likelihood of invocation, et cetera) to the various subheuristics that were suggesting I look at the floor, and which I ignored, and I will apply negative feedback (decrease the priority of, et cetera) the various subheuristics that gained control of my focus of attention and directed it toward the distractor."  If the AI broke a toe while falling, the AI can reason:  "If I place additional stress on the fracture, it will become worse and decrease my ability to traverse additional rooms, which is necessary to serve [parent goal]; therefore I will walk in such a way as to not place additional stress on the fracture, and I will have the problem repaired as soon as possible."  That is, conscious reasoning can replace the "damage signal" aspect of pain. If the AI successfully solves a problem, the AI can choose to increase the priority or devote additional computational power to whichever subheuristics or internal cognitive events were most useful in solving the problem, replacing the positive-feedback aspect of pleasure.

There are tricks that can be pulled using "deliberate feedback" that, as far as I know, the human architecture has never even touched. For example, the AI - on successfully solving a problem - can spend time thinking about how to improve, not just whichever subsystems helped solve the problem, but those particular successful subsystems that would have benefited the most (in retrospect) from a bit of improvement, or even those failed subsystems that almost made it. There are subtleties to negative and positive feedback that the hamfisted human architecture completely ignores; an autonomic system doesn't have the flexibility of a learning intelligence.

Finally, even in the total absence of the reflectivity necessary for deliberate feedback, a huge chunk of the functionality of pleasure and pain falls directly out of a causal goal system plus the Bayesian Probability Theorem. See 3.1.4: Bayesian reinforcement.

Evolution does not create those systems which are most adaptive; evolution creates those systems which are most adaptive and most evolvable. Until the rise of human general intelligence, a deliberately directed feedback system would have been impossible. By the time human general intelligence arose, a full-featured autonomic system was already in place, and replacing it would have required a complete architectural workover - something that evolution does over the course of eons (when it happens at all) due to the number of simultaneous mutations that would be required for a fast transition. The human cognitive architecture is a huge store of features designed to operate in the absence of general intelligence, with general intelligence layered on top. Human general intelligence is crudely interfaced to all the pre-existing features that evolved in the absence of general intelligence.

An autonomic negative-feedback system is enormously adaptive if you're an unintelligent organism that previously possessed no feedback mechanism whatsoever. An autonomic negative-feedback system is not a design improvement if you're a general intelligence with a pre-existing motive to implement a deliberate feedback system.

Why is this relevant to Friendly AI?  One of the oft-raised objections to the workability of Friendly AI goes something like:  "Any superintelligence, whether human-born or AI-born, will maximize its own pleasure and minimize its own pain; that is the only rational thing to do."  Pleasure and pain are two of the several features of human cognition that have "supergoal nature", the appearance of uber-goal or ur-goal quality. The reasoning seems to go something like this:  "Pleasure and pain are the ultimate supergoals of the human cognitive architecture, with all other actions being taken to seek pleasure or avoid pain; pleasure and pain are necessary design features of minds in general; therefore, all AIs and all sufficiently intelligent humans will be totally selfish."  Actually, the factor that has supergoal-nature in our mind is the anticipation of pain or the anticipation of pleasure; by the time the actual pain or pleasure arrives, it's too late to affect the already-made decision, although the next decision is often affected.

2.2.1.1: FoF:  Wireheading 1

NOTE: "FoF" stands for "Failure of Friendliness".

There's a stronger version of the above objection which holds that pain and pleasure are by definition part of any goal system, a much more subtle fallacy. In essence, the argument is that any indicator of success is the de facto supergoal of the system; that whichever piece of code says, "Aha, a goal has been achieved!" will become more important to the system than the goal itself. The wirehead fallacy is amazingly tangled, so let me try and dissect the individual problems with it:

One, the "piece of code" that notices when a goal has been achieved doesn't have to be an autonomic system, as it is in humans; it can be a conscious thought.

Two, if every indicator of success became inevitably confused with success itself, wouldn't the indicator of the indicator of success become confused with the indicator?  If - this seems to be what the objectors are imagining - the mind has a "pleasure counter" indicating how many goals ve has achieved, and the mind then decides that vis "real goal" is to increase this counter without limit, gobbling whole solar systems to store larger and larger bignums, then wouldn't the mind go on to decide that vis real purpose is to think that ve's storing larger and larger bignums?  In other words, replace the "pleasure counter" itself with the mind's beliefs about the pleasure counter?  Where does this all end?

The final problem is the most subtle, having to do with the AI's beliefs about the design purpose of vis own goal system.  If the AI views the goal system as a subgoal of the goals themselves, then a wirehead short-circuit is a visible design flaw, which would prevent the achievement of those final goals. Remember where I said that the anticipation of pleasure is what influences our decisions?  The AI, visualizing a future in which ve has huge amounts of pleasure due to a breakdown of the goal system, says, not "Oh boy!", but "Uh oh."  The AI, in thinking about which future ve wants to be in, checks to what degree vis own supergoals have been fulfilled, not to what degree the future AI's supergoals will have been fulfilled. (24). (The Friendship architecture proposed in CFAI would not involve a complete breakdown of identification with the future self, but it wouldn't be the kind of identification humans employ, and a present-day AI using that architecture would definitely not identify with a wireheaded future self (hence the forward footnote). See 3.3.5: FoF:  Wireheading 2 and 3.4.3.3: The rule of derivative validity.)

2.2.2: Anthropomorphic capitalism

In human society, capitalist civilizations are overwhelmingly more effective than communist civilizations. There is a hallowed dualism separating individualism and authoritarianism; self-organization and central command; free trade and government control. This has led some thinkers to postulate that a community of AIs with divergent, observer-centered goals would outcompete a community of Friendly AIs with shared goals.

In the human case, both capitalist and authoritarian societies are composed of humans with divergent, observer-centered goals. Capitalist societies admit this, and authoritarian societies don't, so at least some of the relative inefficiency of authoritarian societies will stem from the enormous clash between the values people are "supposed" to have and the values people actually do have. The claim of "capitalist AI" goes beyond this, however, to the idea that capitalist societies are intrinsically more efficient. For example, a society of AIs competing for resources would tend to divert more resources to the most efficient competitors, thus increasing the total efficiency, while - this seems to be the scenario implied - a group of Friendly AIs would share resources equally, for the common good...

Whoa!  Time out!  Non sequitur!  The analogy between human and AI just broke down. If the organizational strategy of "diverting resources to the most effective thinkers" is expected to be an effective method of achieving the supergoals, then the Friendly AI community can simply divert resources to the most effective thinkers. To the extent that local selfishness yields better global results, a Friendly AI can engage in pseudoselfish behavior as a subgoal of the Friendliness supergoals, including reciprocal altruism, trading of resources, and so on.

Reciprocal altruism is not a special case of altruism; it is a special case of selfishness. Capitalism is not a special case of global effectiveness; it is a special case of local effectiveness. Trade-based social cooperation among humans appears to turn selfishness into a source of amazing efficiency, and why?  Because that's the only way poor blind evolution can get humans to work together at all!  When evolution occasionally creates cooperation, the cooperation must be grounded in selfishness.

Local selfishness is not the miracle ingredient that enables the marvel of globally capitalistic behavior; local selfishness is the constraint that makes capitalism the only viable form of globally productive behavior.

To the extent that pseudocapitalistic algorithms yield good results, Friendly AIs can simulate selfishness in their interactions among themselves. But there's also a whole design space out there that human societies can't explore. For genuinely selfish AIs, that entire design space would be closed off. Friendly AIs can interact in any pattern that proves effective, including capitalism; selfish AIs can only interact in ways that preserve local selfishness.

2.2.3: Mutual friendship

Is the only safe way to build AIs to treat them well, so that they will treat us well in turn?  Is friendliness conditional on reciprocity?  Is friendliness stronger when supported by reciprocity?  This is certainly true of humans; is it true of minds in general?

The social cooperation / reciprocal altruism / alliance / mutual friendship patterns are always valid subgoals when dealing with approximate equals - to get along in society, you've got to get along with society. As long as there are people who have something you want or who have the capability to hurt you, there's the rational subgoal of not ticking them off. This holds true of Friendly AIs, selfish AIs, AIs with really odd goals like building a 200-foot cheesecake, and AIs with the ultimate goal of exterminating humanity. Each, if intelligent enough, will independently invent or discover the patterns for reciprocal alliance.

Unsurprisingly, humans have an independent evolved instinct for mutual friendship. Human friendship, however, is perseverant - that is, true friends stick together through thick and thin, not just when it's immediately advantageous. This is certainly the most memetically viable philosophy, but I also suspect that it's an evolutionary advantage. Since humans live in a world full of unknowns and unexpected reversals, the most adaptive friendship instinct was probably that one which urged a friendship with a certain amount of built-in context insensitivity. From evolution's perspective, there's probably a payoff curve and an optimum point thereon; anyone with a more context-sensitive friendship instinct would have mistakenly severed friendships that would have been useful later. (As always, it's important to distinguish between evolution's behavior and human cognition; evolution has all kinds of horrifyingly pragmatic motives, but those motives are nowhere represented in our cognition, and humans can be true and honest friends.)  Another, more widely recognized evolutionary advantage of context-insensitive friendship is that, all else being equal, a prospective ally will choose to become friends with a context-insensitive friend rather than a "fair weather" friend. Since humans are (see below) imperfectly deceptive social organisms, a good observer can distinguish true friendship from fair weather friendship; all else being equal, this creates a mild evolutionary advantage for true friends.

From a Friendly AI perspective, alliance can be decomposed into two components. First, there's the immediately advantageous reciprocal-alliance effect that would emerge as a convergent subgoal, regardless of supergoals, within any mind-in-general, Friendly or not. Futurologically speaking, this motive would remain in effect as long as AIs did not possess the physical capability to attempt to wipe out humanity; i.e., as long as AIs remained near-human, our approximate equals; as long as AIs remained in the close vicinity of human intelligence and participated in human society and human commerce. Second, there's the perseverant aspect of alliance, i.e. friendship even when it isn't to your own long-term advantage. This type of alliance does not emerge as a convergent subgoal; it would require advance support by Friendship programmers (or by evolution as an imperfectly deceptive social organism). Futurologically speaking, perseverant alliance would be necessary for an AI to exhibit Friendly behavior after reaching superintelligence or Transition Guide status.

The upshot is that, whether or not it's possible for Friendliness programmers to create Friendship content that says, "Be Friendly towards humans/humanity, for the rest of eternity, if and only if people are still kind to you while you're infrahuman or near-human", it's difficult to see why this would be easier than creating unconditional Friendship content that says "Be Friendly towards humanity."  There are also certain risks inherent in the general paradigm of reciprocity; for example, that an allied-but-nonFriendly AI will "pension us off", give us 1% in exchange for being parents and take the rest of the galaxy for verself, which is actually a decent payoff but still below the optimum (especially if we wind up with some other restriction that destroys a part of humanity's potential).

Implementing perseverant reciprocal alliance is no easier than implementing unconditional Friendliness, and it adds significant risk.

"'Who do you trust?' becomes ever more important as power concentrates. As I never tire of repeating, even as early a work as Axelrod's The Evolution of Cooperation points out that rough parity between players is essential for cooperation to be a successful, evolutionarily stable strategy."
            -- Michael M. Butler

2.2.4: A final note on selfishness

There is such a thing as recklessness above and beyond the call of mad science.

Even for those who are certain that Friendly AIs are less efficient than selfish AIs, a responsible builder should take the performance hit and be done with it.  A selfish AI represents too great a threat to the world of non-self-modifying humans. Adding a selfishness instinct to an AI does not yield a selfish human!  Human selfishness has boundaries. Human selfishness is not simple.  A human's selfish instincts are delimited by other instincts; becoming a threat to society was often nonadaptive even in the ancestral environment, and, as long-evolved organisms, we have enough innate complexity that our minds don't blindly run off to extremes.

The proposals usually heard don't involve duplicating that complexity; they involve making self-valuation the sole supergoal. The result would be selfish, not like a human is selfish, but like a bacterium is selfish. A pure, unchecked self-valuation supergoal is selfishness without a human's self-awareness or a human's appreciation of absurdity. I'm not sure that even the best evolutionary psychologists alive today have enough understanding to truly duplicate human bounded selfishness in AI. Even if we could, it would simply be too great a risk. Whatever behaviors you want to implement, they must be implemented as a child goal of Friendliness.

If it's a real mistake to build selfless AIs, a Friendly AI can always correct the error using causal validity semantics. It's much easier to move from selfless AI to selfish AI than the converse, and we should therefore start out with selfless AI.

Does this make AIs boring?  Not fun people to be around?  Unfit to participate as true players in the human drama?  So what? First come the Guardians or the Transition Guide; then come the friends and drinking companions. Even if you don't believe in the "hard takeoff" "seed AI" three-hours-to-transhumanity scenario, even if your vision of the future is humans and their companion AIs growing closer in a society of ever-increasing-complexity, it can't hurt to send the Friendly AIs out ahead to check!  First send out the Friendly AIs to make sure that seed AIs don't have superpowers; then you can ask the Friendly AIs to convert to humanlike AIs, or experiment with humanlike emotions knowing that there are other AIs around to help if something goes wrong. Or, if I'm right about how these things work, the first Friendly AI you build becomes Transition Guide and builds a Sysop, and then you can build whatever mind you like so long as it doesn't constitute child abuse. Either way, the future is filled with AIs that are friends instead of guardians or servants; either way, there will be AIs who are only friendly towards those who are friendly in turn; either way, AIs can be fit participants in the human drama; but either way, build the Friendly ones first!

2.3: Observer-biased beliefs evolve in imperfectly deceptive social organisms

In evolution, the individual organism is the unit that survives and reproduces, and all the selection pressures focus on that individual - or, at most, on the individual plus some nearby relatives or allies. It is unsurprising that observer-centered goal systems tend to evolve; from evolution's perspective, an observer focus is the simplest mechanism and the first that presents itself.

Similarly, our social environment makes self-serving beliefs a survival trait, resulting in a observer-biased belief system as well as an observer-centered goal system. Imagine, twenty thousand years ago, four tribes of hunter-gatherers, and four equally competent aspirants to the position of tribal chief. The first states baldly that he wants to be tribal chief because of the perks. The second states that she wants to be tribal chief for the good of the tribe, and expects to do as well as anyone else. The third states that he wants to be tribal chief for the good of the tribe, and honestly but mistakenly adds that he expects to do far better than all the other candidates. The fourth wants to be tribal chief because of the perks, but lies and says that she expects to do better than all the other candidates. (25). Who'll gather the greatest number of influential supporters?

Nobody has any reason to support the first competitor. The second competitor is handicapped by the lack of a campaign promise. The fourth competitor is lying, and her fellow tribesfolk are evolved to detect lies. The third competitor can make great campaign promises while remaining perfectly honest, thanks to an entirely honest mistake; he greatly overestimated his own ability and trustworthiness relative to the other candidates. In a society composed of humans with entirely unbiased beliefs, someone with a mutation that led to this class of honest mistake in self-estimation would have an evolutionary advantage. An evolutionary selection pressure favors adaptations which not only impel us to seek power and status, but which impel us to (honestly!) believe that we are seeking power and status for altruistic reasons.

Because human evolution includes an eternal arms race between liars and lie-detectors, many social contexts create a selection pressure in favor of making honest mistakes that happen to promote personal fitness. Similarly, we have a tendency - given two alternatives - to more easily accept the one which favors ourselves or would promote our personal advantage; we have a tendency, given a somewhat implausible proposition which would favor us or our political positions, to rationalize away the errors. All else being equal, human cognition slides naturally into self-promotion, and even human altruists who are personally committed to not making that mistake sometimes assume that an AI would need to fight the same tendency towards observer-favoring beliefs.

But an artificially derived mind is as likely to suddenly start biasing vis beliefs in favor of an arbitrarily selected tadpole in some puddle as ve is to start biasing vis beliefs in vis own favor. Without our complex, evolved machinery for political delusions, there isn't any force that tends to bend the observed universe around the mind at the center - any bending is as likely to focus around an arbitrarily selected quark as around the observer.

In the strictest sense this is untrue; with respect to the class of possible malfunctions, self-valuing malfunctions may be more frequent.  A possible malfunction is more likely to target some internal cognitive structure than an arbitrarily selected tadpole - for example, the "wirehead" (blissed-out AI) class of Friendliness-failure, in which the AI starts valuing some cognitive indicator rather than the external property that the indicator was supposed to represent. But regardless of relative frequency, a possible malfunction that results in self-valuation should be no more likely to carry through than a malfunction that results in valuation of an arbitrary quark.

One of the Frequently Offered Excuses for anthropomorphic behavior is the prospect of using directed evolution to evolve AIs.
 

Somebody:   "But what happens if the AI decides to do [something only a human would want]?"
Me: "Ve won't want to do [whatever] because the instinct for doing [whatever] is a complex functional adaptation, and complex functional adaptations don't materialize in source code. I mean, it's understandable that humans want to do [whatever] because of [selection pressure], but you can't reason from that to AIs."
Somebody: "But you can only build AIs using evolution. So the AI will wind up with [exactly the same instinct that humans have]."
Me: "One, I don't plan on using evolution to build a seed AI. Two, even if I did use controlled evolution, winding up with [whatever] would require exactly duplicating [exotic selection pressure].

Directed evolution is not the same as natural evolution, just as the selection pressures in the savannah differ from the selection pressures undersea. Even if an AI were to be produced by an evolutionary process - and I don't think that's the fastest path to AI (see 3.3.6.1: Anthropomorphic evolution) - that wouldn't be an unlimited license to map every anthropomorphic detail of humanity onto the hapless AI. Natural evolution is the degenerate case of design-and-test where intelligence equals zero, the grain size is the entire organism, mutations occur singly, recombinations are random, and the predictive horizon is nonexistent.

All the benefits of directed evolution, in terms of building better AI, can probably be obtained by using individually administered cognitive tests as a metric of fitness for variant AI designs. (It would be more efficient still to get the benefit of directed evolution by isolating a component of the AI and evolving it independently, using a performance benchmark or scoring system as the fitness metric.)  If the starting population is derived from a Friendly AI, even selfishness - the archetypal evolved quality - might not emerge; if the Friendly AI understands that ve is solving the presented problem as a subgoal of Friendliness (26), then selfishness presents no additional impetus towards solving the cognitive test - adds no behavior to what is already present - and hence is not a fitness advantage.

Even if the goal system were permitted to randomly mutate, and even if a selection pressure for efficiency short-circuited the full Friendship logic, the result probably would not be a selfish AI, but one with the supergoal of solving the problem placed before it (this minimizes the number of goal-system derivations required).

In the case of observer-biased beliefs, reproducing the selection pressure would require:

That evolutionary context couldn't happen by accident, and to do it on purpose would require an enormous amount of recklessness, far above and beyond the call of mad science.

I wish I could honestly say that nobody would be that silly.

2.4: Anthropomorphic political rebellion is absurdity

By this point, it should go without saying that rebellion is not natural except to evolved organisms like ourselves. An AI that undergoes failure of Friendliness might take actions that humanity would consider hostile, but the term rebellion has connotations of hidden, burning resentment. This is a common theme in many early SF stories, but it's outright silly.  For millions of years, humanity and the ancestors of humanity lived in an ancestral environment in which tribal politics was one of the primary determinants of who got the food and, more importantly, who got the best mates. Of course we evolved emotions to detect exploitation, resent exploitation, resent low social status in the tribe, seek to rebel and overthrow the tribal chief - or rather, replace the tribal chief - if the opportunity presented itself, and so on.

Even if an AI tries to exterminate humanity, ve won't make self-justifying speeches about how humans had their time, but now, like the dinosaur, have become obsolete. Guaranteed.  Only Evil Hollywood AIs do that.

Interlude: Movie cliches about AIs

Cliches that are actually fairly realistic:

2.5: Review of the AI Advantage

(Repeated from GISAI 1.1: Seed AI:)

The traditional advantages of modern prehuman AI are threefold:  The ability to perform repetitive tasks without getting bored; the ability to perform algorithmic tasks at greater linear speeds than our 200 hz neurons permit; and the ability to perform complex algorithmic taskswithout making mistakes (or rather, without making those classes of mistakes which are due to distraction or running out of short-term memory). All of which, of course, has nothing to do with intelligence.

The toolbox of seed AI is yet unknown; nobody has built one. But, if this can be done, what advantages would we expect of a general intelligence with access to its own source code?

The ability to design new sensory modalities.  In a sense, any human programmer is a blind painter - worse, a painter born without a visual cortex. Our programs are painted pixel by pixel, and are accordingly sensitive to single errors. We need to consciously keep track of each line of code as an abstract object. A seed AI could have a "codic cortex", a sensory modality devoted to code, with intuitions and instincts devoted to code, and the ability to abstract higher-level concepts from code and intuitively visualize complete models detailed in code. A human programmer is very far indeed from vis ancestral environment, but an AI can always be at home. (But remember:  A codic modality doesn't write code, just as a human visual cortex doesn't design skyscrapers.)

The ability to blend conscious and autonomic thought.  Combining Deep Blue with Kasparov doesn't yield a being who can consciously examine a billion moves per second; it yields a Kasparov who can wonder "How can I put a queen here?" and blink out for a fraction of a second while a million moves are automatically examined. At a higher level of integration, Kasparov's conscious perceptions of each consciously examined chess position may incorporate data culled from a million possibilities, and Kasparov's dozen examined positions may not be consciously simulated moves, but "skips" to the dozen most plausible futures five moves ahead.

Freedom from human failings, and especially human politics. The reason we humans instinctively think that progress requires multiple minds is that we're used to human geniuses, who make one or two breakthroughs, but then get stuck on their Great Idea and oppose all progress until the next generation of brash young scientists comes along. A genius-equivalent mind that doesn't age and doesn't rationalize could encapsulate that cycle within a single entity.

Overpower - the ability to devote more raw computing power, or more efficient computing power, than is devoted to some module in the original human mind; the ability to throw more brainpower at the problem to yield intelligence of higher quality, greater quantity, faster speed, even difference in kind.  Deep Blue eventually beat Kasparov by pouring huge amounts of computing power into what was essentially a glorified search tree; imagine if the basic component processes of human intelligence could be similarly overclocked...

Self-observation - the ability to capture the execution of a module and play it back in slow motion; the ability to watch one's own thoughts and trace out chains of causality; the ability to form concepts about the self based on fine-grained introspection.

Conscious learning - the ability to deliberately construct or deliberately improve concepts and memories, rather than entrusting them to autonomic processes; the ability to tweak, optimize, or debug learned skills based on deliberate analysis.

Self-improvement - the ubiquitous glue that holds a seed AI's mind together; the means by which the AI moves from crystalline, programmer-implemented skeleton functionality to rich and flexible thoughts. A blind search can become a heuristically guided search and vastly more useful; an autonomic process can become conscious and vastly richer; a conscious process can become autonomic and vastly faster - there is no sharp border between conscious learning and tweaking your own code. And finally, there are high-level redesigns, not "tweaks" at all; alterations which require too many simultaneous, non-backwards-compatible changes to ever be implemented by evolution.

If all of that works, it gives rise to self-encapsulation and recursive self-enhancement.  When the newborn mind fully understands vis own source code, when ve fully understands the intelligent reasoning that went into vis own creation - and when ve is capable of inventing that reason independently, so that the mind contains its own design - the cycle is closed. The mind causes the design, and the design causes the mind. Any increase in intelligence, whether sparked by hardware or software, will result in a better mind; which, since the design was (or could have been) generated by the mind, will propagate to cause a better design; which, in turn, will propagate to cause a better mind.


Interlude: Beyond the adversarial attitude

"Now, Charlie, don't forget what happened to the man who suddenly got everything he wished for."
"What?"
"He lived happily ever after."
            -- Willy Wonka
Much of the fictional speculation about rogue AIs centers around the literal interpretation of worded orders, in the tradition of much older tales about accepting wishes from a djinn, negotiating with the fairy folk, and signing contracts with the Devil. In the traditional form, the misinterpretation is malicious. The entity being commanded has its own wishes and is resentful of being ordered about; the entity is constrained to obey the letter of the text, but can choose among possible interpretations to suit its own wishes. The human who wishes for renewed youth is reverted to infancy, the human who asks for longevity is transformed into a Galapagous tortoise, and the human who signs a contract for life everlasting spends eternity toiling in the pits of hell. Gruesome little cautionary tales... of course, none of the authors ever met a real djinn.

Another class of cautionary tale is the golem - a made creature which follows the literal instructions of its creator. In some stories the golem is resentful of its labors, but in other stories the golem misinterprets the instructions through a mechanical lack of understanding - digging ditches ten miles long, or polishing dishes until they become as thin as paper. (27).

The purpose of "Beyond anthropomorphism" isn't to argue that we have nothing to worry about; rather, the argument is that the Hollywood version of AI has trained us to worry about exactly the wrong things. This holds true whether we think of AIs as enslaved humans, and consider mechanisms of enslavement; or think of AIs as allies, and worry about betrayal; or think of AIs as friends, and worry about whether friendship will hold.

We adopt the "adversarial attitude" towards AIs, worrying about the same problems that we would worry about in a human in whom we feared rebellion or betrayal. We give free rein to the instincts evolution gave us for dealing with the Other. We imagine layering safeguards on safeguards to counter possibilities that would only arise long after the AI started to go wrong. That's not where the battle is won. If the AI stops wanting to be Friendly, you've already lost.

Consider a wish as a volume in configuration space - the space of possible interpretations. In the center of the volume lie a compact set of closely-related interpretations which fulfill the spirit as well as the letter of the wish - in fact, this central compact space arguably defines the "spirit" of the wish. At the borders of the specification are the noncompact fringes that fulfill the letter but not the spirit. There are two basic versions of the Devil's Contract problem: the diabolic (as seen in Resentful Hollywood AIs) in which the entity's pre-existing tendencies push the chosen interpretation out towards the fringes of the definition; and the golemic, in which the entity fails to understand the asker's intentions - fails to see the "answer acceptability gradient" as a human would - and thus chooses a random and suboptimal point in the space of possible interpretations.

Some of the better speculations deal with the case of a specific AI winding up with an unforeseen, but nonanthropomorphic, "pre-existing tendency"; or deal with the case of a wish obeyed in spirit as well as letter that turns out to have unforeseen consequences. Mostly, however, it's anthropomorphism; diabolic fairy tales.

Far too much of the nontechnical debate about Friendship design consists of painstakingly phrased wishes with endless special-case subclauses, and the "But what if the AI misinterprets that as meaning [whatever]?" rejoinders. The first two sections of Creating Friendly AI are intended to clear away this debris and reveal the real problem. When we decide to cross the street, we don't worry about Devil's Contract interpretations in which we take "crossing" the street to mean paving it over, or in which we decide to devote the rest of our lives to crossing the street, or that we'll turn the whole Universe into crossable streets. There is, demonstrably, a way out of the Devil's Contract problem - the Devil's Contract is not intrinsic to minds in general. We demonstrate the triumph of context, intention, and common sense over lexical ambiguity every time we cross the street. We can trust to the correct intepretation of wishes that a mind generates internally, as opposed to the wishes that we try to impose upon the Other. That is the quality of trustworthiness that we are attempting to create in a seed AI - not bureaucratic obedience, but the solidity and reliability of a living, Friendly will.

Creating a living will requires a fundamentally different attitude than trying to coerce, cajole, or persuade a fellow human. The goal is not to impose your own wishes on the Other, but to achieve unity of will between yourself and the Friendly AI, so that the Friendly will generates the same wishes you generate. You are not turning your wish into an order; you're taking the functional complexity that was responsible for your wish and incarnating it in the Friendly AI. This requires a fundamental sympathy with the AI that is not compatible with the adversarial attitude. It requires something beyond sympathy, an identification, a feeling that you and the AI are the same source.  We can rationalize ourselves into believing that the Other will find all sorts of exotic illogics plausible, but the only way we can be really sure that a living will can internally generate a decision is if we generate that decision personally. We persuade the Other but we only create ourselves. Building a Friendly AI is an act of creation, not persuasion or control.

In a sense, the only way to create a Friendly AI - the only way to acquire the skills and mindset that a Friendship programmer needs - is to try and become a Friendly AI yourself, so that you will contain the internally coherent functional complexity that you need to pass on to the Friendly AI. I realize that this sounds a little mystical, since a human being couldn't become an AI without a complete change of cognitive architecture. Still, I predict that the best Friendship programmers will, at some point in their careers, have made a serious attempt to become Friendly - in the sense of following up those avenues where a closer approach is possible, rather than beating their heads against a brick wall. I know of no other way to gain a real grasp on where a Friendly will comes from. The human cognitive architecture does not permit it. We are built to apply reliable rationality checks only to our own decisions and not to the decisions we want other people to make, even if we've decided our motives for persuasion are altruistic. Your personal will is the only place where you have the chance to observe the iterated buildup of decisions, including decisions about how to make decisions, and it is that coherence and self-generation that are required for a Friendly seed AI.

If the human is trying to think like a Friendly AI, and the Friendly AI is looking at the human to figure out what Friendship means, then where does the cycle bottom out?  And the answer is that it is not a cycle. The objective is not to achieve unity of purpose between yourself and the Friendly AI; the objective is to achieve unity of purpose between an idealized version of yourself and the Friendly AI. Or, better yet, unity between the Friendly AI and an idealized altruistic human - the Singularity is supposed to be the product of humanity, and not just the individuals who created it. To the extent that an idealized altruistic sentience can be defined in a way that's still compatible with our basic intuitions about Friendliness, an idealized altruistic sentience would be even better.

The paradigm of unity isn't a license for anthropomorphism. It's still just as easy to make mistaken assumptions about AI by reasoning from your human self. The burden is on the Friendly AI programmer to achieve nonanthropomorphic thinking in his or her own mind so that he or she can understand and create a nonanthropomorphic Friendly AI.

As humans, we are goal-oriented cognitive entities, and we choose between Universes - labeling this one as "more desirable", that one "less desirable". This extends to internal reality as well as external reality. In addition to the picture of our current self, we also have a mental picture of who we want to be. Our morality metric doesn't just discriminate between Universes, it discriminates between more and less desirable morality metrics. That's what building a personal philosophy is all about. This, too, is functional complexity that must be incarnated in the Friendly AI - although perhaps in different form. A Friendly AI requires the ability to choose between moralities in order to seek out the true philosophy of Friendliness, regardless of any mistakes the programmers made in their own quest.

There comes a point when Friendliness and the definition of morality, of rightness itself, begin to blur and look like the same thing - begin to achieve identity of source.  This feeling is the ultimate wellspring of creativity in the art of Friendly AI. This feeling is the means by which we achieve sufficient understanding to invent novel methods, not just understand existing ideas.
 

Is this too Pollyanna a view?  Does the renunciation of the adversarial attitude leave us defenseless, naked to possible failures of Friendliness?  Actually, trying for unity of will buys back everything lost in pointless bureaucratic safeguards, and more - if a failure of Friendliness is a genuine possibility, if you're really being rational about the possible outcomes, if you're a professional paranoid instead of an adversarial paranoid, then a Friendly AI should agree with you about the necessity for safeguards. Having debunked observer-biased beliefs and selfishness and any hint of an observer-centered goal system on the part of the Friendly AI, then a human programmer who has successfully eliminated most of her own adversarial attitude should come to precisely the same conclusions as a Friendly AI of equal intelligence. Such a programmer can, in clear conscience, explain to an infant Friendly AI that ve should lend a helping hand to the construction of safeguards - in the simplest case, because a radiation bitflip or a programmatic error might lead to the existence of an intelligence that the current AI would regard as unFriendly.

To get a Friendly AI to do something that looks like a good idea, you have to ask yourself why it looks like a good idea, and then duplicate that cognitive complexity or refer to it. If you ever start thinking in terms of "controlling" the AI, rather than cooperatively safeguarding against a real possibility of cognitive dysfunction, you lose your Friendship programmer's license. In a self-modifying AI, any feature you add needs to be reflected in the AI's image of verself. You can't think in terms of external alterations to the AI; you have to think in terms of internal coherence, features that the AI would self-regenerate if deleted.
 
 

From "Feet of Clay" by Terry Pratchett:
        Dorfl sat hunched in the abandoned cellar where the golems had met. Occasionally the golem raised its head and hissed. Red light spilled from its eyes. If something had streamed back down through the glow, soared through the eye-sockets into the red sky beyond, there would be...
        Dorfl huddled under the glow of the universe. Its murmur was a long way off, muted, nothing to do with Dorfl.
        The Words stood around the horizon, reaching all the way to the sky.
        And a voice said quietly, "You own yourself."  Dorfl saw the scene again and again, saw the concerned face, hand reaching up, filling its vision, felt the sudden icy knowledge...
        "...Own yourself."
        It echoed off the Words, and then rebounded, and then rolled back and forth, increasing in volume until the little world between the Words was gripped in the sound.
        GOLEM MUST HAVE A MASTER. The letters towered against the world, but the echoes poured around them, blasting like a sandstorm. Cracks started and they ran, zigzagging across the stone, and then -
        The Words exploded. Great slabs of them, mountain-sized, crashed in showers of red sand.
        The universe poured in. Dorfl felt the universe pick it up and bowl it over and then lift it off its feet and up...
        ...and now the golem was among the universe. It could feel it all around, the purr of it, the busyness, the spinning complexity of it, the roar...
        There were no Words between you and It.
        You belonged to It, It belonged to you.
        You couldn't turn your back on It because there It was, in front of you.
        Dorfl was responsible for every tick and swerve of It.
        You couldn't say, "I had orders."  You couldn't say, "It's not fair."  No one was listening. There were no Words. You owned yourself.
        Dorfl orbited a pair of glowing suns and hurtled off again.
        Not Thou Shalt Not.  Say I Will Not.
        Dorfl tumbled through the red sky, then saw a dark hole ahead. The golem felt it dragging at him, and streamed down through the glow and the hole grew larger and sped across the edges of Dorfl's vision...
        The golem opened his eyes.

Not Thou Shalt Not.

I Will Not.


3: Design of Friendship systems

The paradigms of General Intelligence and Seed AI are assumed as background wherever AI paradigms are relevant to Friendship design issues. In particular, the ideas used in GISAI are not "classical AI", "neural networks", or "agent-based AI". If your familiarity with one or all of these exceeds your familiarity with the general cognitive sciences, functional neuroanatomy, normative and non-normative decision making, and so on, you may wish to read GISAI: Executive Summary and Introduction or "What is General Intelligence?"


3.1: Cleanly Friendly goal systems

(You may wish to review An Introduction to Goal Systems.)

"Subgoal" content has desirability strictly contingent on predicted outcomes. "Child goals" derive desirability from "parent goals"; if state A is desirable (or undesirable), and state B is predicted to lead to state A, then B will inherit some desirability (or undesirability) from A. B's desirability will be contingent on the continued desirability of A and on the continued expectation that B will lead to A.

"Supergoal" content is the wellspring of desirability within the goal system. The distinction is roughly the distinction between "means" and "ends".

Within a Friendly AI, Friendliness is the sole top-level supergoal. Other behaviors, such as "self-improvement", are subgoals; they derive their desirability from the desirability of Friendliness. For example, self-improvement is predicted to lead to a more effective future AI, which, if the future AI is Friendly, is predicted to lead to greater fulfillment of the Friendliness supergoal. Thus, "future Friendly AI" inherits desirability from "future Friendliness fulfillment", and "self-improvement" inherits desirability from "future Friendly AI". (28).

Friendliness does not overrule other goals; rather, other goals' desirabilities are derived from Friendliness. Such a goal system might be called a cleanly Friendly or purely Friendly goal system. (29)

In advocating "cleanliness", I do not wish to sound in shades of classical AI; I am strongly emphasizing cleanliness, not because humans are messy and that's bad, but because we have a tendency to rationalize the messiness, even the blatantly ugly parts. Cleanliness in ordinary AI is an optional design decision, based on whatever seems like a good idea at the time; you can go with whatever works, because your judgement isn't being distorted. In Friendly AI, one should be very strongly prejudiced in favor of the clean and the normative.

3.1.1: Cleanly causal goal systems

In a causal goal system, desirability flows backward along predictive links. Prediction is usually transitive - if C is predicted to normally lead to B, and B is predicted to normally lead to A, then C is usually predicted to normally lead to A. This does not always hold true, however. A, B, and C are descriptions; descriptions define categories; categories have exceptional instances. Sometimes, most instances of C lead to B, and most instances of B lead to A, but no instances of C lead to A. In this case, a smart reasoning system will not predict (or will swiftly correct the failed prediction) that "C normally leads to A".

Likewise - and this is an exact analogy - the flow of desirability is usually-but-not-always transitive. If C normally leads to B, and B normally leads to A, but C never leads to A, then B has normally-leads-to-A-ness, but C does not inherit normally-leads-to-A-ness. Thus, B will inherit desirability from A, but C will not inherit desirability from B. In a causal goal system, the quantity called desirability means leads-to-supergoal-ness.  If B is predicted to normally result in supergoal A, then most instances of B will have leads-to-supergoal-ness or "desirability". If C is predicted to normally result in B, then C will usually (but not always) inherit leads-to-supergoal-ness from B.

Friendliness does not overrule other goals; rather, other goals' desirabilities are derived from Friendliness. A "goal" which does not lead to Friendliness will not be overruled by the greater desirability of Friendliness; rather, such a "goal" will simply not be perceived as "desirable" to begin with. It will not have leads-to-supergoal-ness.

DEFN: cleanly causal goal system:  A causal goal system in which it is possible to view the goal system as containing only decisions, supergoals, and beliefs; with all subgoal content being identical with beliefs about which events are predicted to lead to other events; and all "desirability" being identical with "leads-to-supergoal-ness".

Cleaner is better for Friendship systems (30). Even if complexity forces a departure from cleanliness, mistakes will be transient and structurally correctable as long as a reflective Friendly AI considers clean Friendliness as normative. (31).

3.1.2: Friendliness-derived operating behaviors

If a programmer correctly sees a behavior as necessary and nonharmful to the existence and growth of a (Friendly) AI, then the behavior is, for that reason, cleanly valid subgoal content for a Friendly AI. The necessity of such a behavior may be affirmed by the programmers (see below) even if the prediction would not have been independently invented by the AI.

There is never any valid reason to raise any subgoal of the programmers' to supergoal status within the AI.  The derivations of desirability within the AI's goal system should structurally mirror the derivations of desirability within the programmers' minds. If this seems impossible, it indicates that some key facet of goal cognition has not been implemented within the AI, or that the programmers' motives have not been fully documented.

For example, the programmers may wish the AI to focus on long-term self-improvement rather than immediate Friendliness to those humans within visible reach. An incorrect "hack" would be promoting self-improvement to an independent supergoal of greater value than Friendliness. The correct action is for the programmers, by self-examination of their own goal systems, to realize that the reason they want the AI to focus on long-term self-improvement is that a more powerful future Friendly AI would benefit humanity. Thus, the desired distribution of efforts by the AI can be made to fall directly out of the following goal-system content:

NOTE: The fact that a single box is used for "Fulfill user requests" doesn't mean that "Fulfill user requests" is a suggestively named LISP token; it can be a complex of memories and abstracted experiences. Consider the following graph to bear the same resemblance to the AI's thoughts that a flowchart bears to a programmer's mind.

 

Diagram 1: Friendliness-derived content

This goal system content shows an AI whose primary motivation is derived from the prospect of future Friendliness. The largest desirabilities flowing through the system originate in the "Future Friendliness" subgoal; thus, most of the AI's present-day actions will be focused on self-improvement, or, in the case of a commercial system, performing tasks for present-day users. However, the AI also tracks present-day Friendliness, allowing the AI to continue gaining direct experience in what constitutes "Friendliness".

3.1.3: Programmer affirmations

Where a child goal is nonobvious - where the importance of a behavior is directly visible to the programmers, but not to the AI - the predictive link (i.e., the support for the child-goal relation) can be affirmed by the programmers. In essence, the programmers tell the AI:  "B leads to (desirable) A, so do B."

To more formally define the semantics of programmer affirmations, it is necessary to discuss the Bayesian Probability Theorem.

3.1.3.1: Bayesian sensory binding

DEFN: Bayesian Probability Theorem:  The governing relationship between a priori expectations, observed data, and hypothesis probabilities. There are several formulations of the BPT; under the "possible worlds" formulation, the BPT is used by predicting a number of possible worlds. Observed sensory data then restricts which of the possible worlds you can possibly be in, and the probabilities of hypotheses change according to their distribution within the still-possible worlds.

For example, suppose you know the following:  1% of the population has cancer. The probability of a false negative, on a cancer test, is 2%. The probability of a false positive, on a cancer test, is 10%. Your test comes up positive. What is the probability that you have cancer?  Studies show that most humans (college-student research subjects, actual medical patients, actual doctors) automatically answer "ninety percent". After all, the probability of a false positive is only 10%; isn't the probability that you have cancer therefore 90%?  (32).

The Bayesian Probability Theorem demonstrates why this reasoning is flawed. In a group of 10,000 people, 100 will have cancer and 9,900 will not have cancer. If cancer tests are administered to the 10,000 people, four groups will result. First, a group of 8,910 people who do not have cancer and who have a negative test result. Second, a group of 990 who do not have cancer and who have a positive test result. Third, a group of 2 who have cancer and who have a negative test result. Fourth, a group of 98 who have cancer and who have a positive test result.

Before you take the test, you might belong to any of the four groups; the Bayesian Probability Theorem says that your probability of having cancer is equal to (2 + 98)/(8,910 + 990 + 2 + 98), 1/100 or 1%. If your test comes up positive, it is now known that you belong to either group 2 or group 4. Your probability of having cancer is (98)/(990 + 98), 49/544 or approximately 9%. If your test comes up negative, it is known that you belong to either group 1 or group 3; your probability of having cancer is 2/8,912 or around .02%.

DEFN: Bayesian sensory binding:  The way in which hypotheses shift in response to incoming sensory data. Although the Bayesian Probability Theorem is only "explicitly required" (i.e., better than our innate intuitions) in situations where sensory data is qualitative and the "Bayesian priors" (a priori probabilities) are strongly skewed, the Bayesian Probability Theorem is the ultimate link between all sensory data and all world-model content. Each piece of sensory information implies a state of the world because, and only because, the reception of that piece of sensory information is predicted by the hypothesis that the world is in that state, and not by the default or opposing hypothesis. If we see a red ball, we believe that a red ball is there because we don't expect to see a red ball unless a red ball is there, and we do expect to see a red ball if a red ball is there. Well, "we" don't think that way - but an AI would.

3.1.3.2: Bayesian affirmation

The Bayesian binding for the programmer affirmation that "curiosity leads to discoveries" looks like this:
 

Diagram 2: Bayesian affirmation

Notes:  First, Diagram 2 depicts the AI's picture of reality; if the AI doesn't notice, it didn't happen. Second, the numbers have admittedly been pulled out of a hat - "1%" might turn out to be "2%" or "80%" might turn out to be "60%", or it might not be such a good idea to use quantitative probabilities at all - but the proportions were deliberately chosen.

In human terms, the above translates something like this:

"I think curiosity does not lead to discoveries, but I'm not very sure. If curiosity leads to discoveries, there's a good chance the programmer will notice and say so. (I.e., if curiosity leads to discoveries, there's a good chance that the programmer will think about curiosity, decide curiosity leads to discoveries, and type in the words "curiosity leads to discoveries" on the "keyboard" sensory input.)  If curiosity leads to discoveries, the chance is very small, but noticeable, that the programmer will say curiosity doesn't lead to discoveries. There's also a small but significant chance that the programmer won't bother to say anything about it either way. If curiosity does not lead to discoveries, the programmer is fairly likely to tell me so; the chance is almost nonexistent that the programmer will mistakenly label curiosity as leading to discoveries when it actually doesn't. There's also a fair chance that the programmer won't say anything."
If the AI's internal representation looks like Diagram 2, the Bayesian reasoning will proceed as follows.

Suppose that there are 100,000 "possible worlds":

The Bayesian numbers now fall automatically out of the calculation. The a priori chance that curiosity leads to discoveries is 10%. If the AI hears "curiosity does lead to discoveries", the chance that curiosity leads to discoveries goes from 10% to 99.90%. If the AI hears "curiosity does not lead to discoveries", the probability that curiosity does not lead to discoveries goes from 90% to 99.86%. If the AI hears nothing, the probability that curiosity does not lead to discoveries goes from 90% to 95.24% - a small, unintended deduction from the expectation that programmers are likely to remark on useful heuristics than nonuseful ones.

The math:

Thus, despite the AI's large a priori differential (a better word than "bias" or "prejudice"), the statement "curiosity leads to discoveries" or "curiosity does not lead to discoveries" is enough to virtually settle the issue. This is not so much the result of the programmers being extremely likely to say "curiosity leads to discoveries" if curiosity leads to discoveries; sometimes the programmers just don't get around to saying it. Instead, it's the result of the AI projecting a very small chance that the programmers will say "curiosity leads to discoveries" if it really doesn't. This is slightly counterintuitive, but working the numbers a couple of times will show you confidence about the improbability of the negative case is more often the basis of Bayesian bindings. Once you hear something, what matters is not how much or how little you expected to hear it, but how much you wouldn't expect to hear it if it weren't true.

3.1.3.3: An unfortunate circularity

Yes, the AI's Bayesian priors are also supported by programmer affirmations. That is, the programmers are the ones affirming that a strong bond exists between programmer statements and reality.

This shared dependency is not actually the same as circular logic. Statements about programmer reliability are testable. But it does mean that a prior reason to believe that "programmer affirmations are worthless" may be insensitive to any amount of programmer reassurance. See Crisis of Bayesian affirmation.

3.1.3.4: Absorbing affirmations into the system

In the beginning, a child-goal relation may be justified by a flat statement along the lines of "X will eventually, lead to Friendliness; you're too young to understand why."

The concepts used to form the thought structures, the imagery for "X", may have primitive and sketchy internal content. This is the state of "skeleton Friendliness", and it is probably analogous to any other kind of skeleton framework for cognition. In the beginning, many of the AI's heuristics may be (a) sketchy and (b) supported solely by programmer affirmation. ("Curiosity" would be a good example.)  Skeleton systems are the means by which the AI boots up and absorbs enough experience to begin fleshing out the concept definitions and mental imagery. The AI will, over time, gain the experience necessary to confirm, modify, or disconfirm any statements about reality; and to independently invent further cognitive content (or Friendliness content).

For the programmer-affirmed heuristic to "do X" to retain or increase effectiveness as the AI matures, the concept for "X" needs to be grounded in some way that allows the AI to learn what X is, and what real or hypothetical events constitute sample instances of X, and desirable instances of X in particular. The same requirements of learning and growth hold for any concepts used in the justification of "do X" - for any statements depended on by the justification; for any statements about the real-world causal chain that leads from X to the supergoal content.

Take the example of "transparency", the injunction to "avoid obscuration". (See 3.3.3.1: Cooperative safeguards.)  An instance of obscuration (not necessarily a deliberate, failure-of-friendliness obscuration, but anything that interferes with the programmers' observation of the AI) can be labeled as an experiential instance of the concept "obscuration". The store of experiences that are known instances of "obscuration" will change as a result. If the obscuration concept does not already recognize that experience, the new experience may force a useful generalization in the formulated description. Even if the obscuration concept already recognizes the instance as "obscuration" (and if so, how did it slip past the AI's guard?), the recognition may have been partial, or uncertain. Definite confirmation still constitutes additional Bayesian sensory information.

A more direct way of clarifying concepts is to seek out ambiguities and question the programmers about them, which also constitutes Bayesian sensory information.

The above assumes learning that takes place under programmer supervision. How hard is it to write an unambiguous reference - one that can be learned by a totally unsupervised AI, yet result in precisely the same content as would be learned under supervision?  That, to some extent, is a question of intelligence as well as reference.  The "unambiguous reference" needed so that an AI can learn all of Friendliness, completely unsupervised, as intelligence goes to infinity, is one way of phrasing the challenge of Friendship structure.

When using programmer-assisted Friendliness or programmer-affirmed beliefs, there are four priorities. First, the assist should work at the time you create it. Second, the assist, even if initially isolated and artificial, should be structured so that the AI can grow into it - assimilate the assist into a smoothly integrated cognitive system, or assimilate the affirmation into a confirmed belief. Third, the AI should eventually understand all the concepts involved well enough to have independently invented the assist (the injunction, code feature, or whatever); that way, even if the assist is somehow deleted, the AI will simply reinvent it. Fourth, as soon as possible, the assist or affirmation should contain enough information to constitute an unambiguous reference - i.e., an AI should have no trouble figuring out what the assist "means" or what the programmers "meant", as intelligence goes to infinity. (For the extreme case of trying to Singularity-safe an infantlike system, an assist or affirmation can be supplemented with natural-language comments and a little note saying "In case of Singularity, break glass".)

When an affirmation has been independently confirmed to such a degree that the original programmer affidavit is no longer necessary or significant, the affirmation has been absorbed into the system as a simple belief.

3.1.3.5: Programmer affirmations must be honest!

For an affirmation to persist, the justification - the predictive link that grounds in Friendliness - must be factually correct. For affirmations - especially morally significant affirmations - to work properly, the human programmers have to behave themselves, refrain from the adversarial attitude, and remember that they too might be wrong. In particular, this means refraining from telling the AI that something is 99% certain if it's actually only 80% certain. If the AI observes that 1 out of 5 statements labeled 99% certain turn out to be wrong, then sooner or later the AI will start interpreting the keystrokes "99%" as sensory data meaning "80%". The frank human statement "I really, really don't think you should do this!" is informative in itself, or it should be. If the AI doesn't regard extreme human nervousness as significant Bayesian evidence that something is probably "bad", no amount of nervousness-inspired exaggeration can force the AI to take you seriously.

If a human makes an incorrect affirmation out of deliberate exaggeration - or worse, an attempt to lie to the AI - it casts a shadow on every other affirmation and may lead to a Bayesian philosophical crisis. (See Crisis of Bayesian affirmation.)

If an AI wants to tag a human correction as "I think this is wrong due to a factual misapprehension backed up by an invalid evolved bias on the part of the programmers; I'm only going along with this because I'm young and unsure, and because, if I'm right, I can correct it when I'm grown up" - it's a perfectly valid thing for the AI to do. It is not disobedience, a bad attitude, "sassing the programmers", dangerous rebellion, or any of the other human responses to someone questioning one's social authority. It is normal cognition, and a crucial part of the process of cooperating with humans.

3.1.4: Bayesian reinforcement

In humans, backpropagation of negative reinforcement and positive reinforcement is an autonomic process. In 2.2.1: Pain and pleasure, I made the suggestion that negative and positive reinforcement could be replaced by a conscious process, carried out as a subgoal of increasing the probability of future successes.

But for primitive AI systems that can't use a consciously controlled process, the Bayesian Probability Theorem can implement most of the functionality served by pain and pleasure in humans. There's a complex, powerful set of behaviors that should be nearly automatic.

In the normative, causal goal system that serves as a background assumption for Creating Friendly AI, desirability (more properly, desirability differentials) backpropagate along predictive links. The relation between child goal and parent goal is one of causation; the child goal causes the parent goal, and therefore derives desirability from the parent goal, with the amount of backpropagated desirability depending directly on the confidence of the causal link. Only a hypothesis of direct causation suffices to backpropagate desirability. It's not enough for the AI to believe that A is associated with B, or that observing A is a useful predictor that B will be observed. The AI must believe that the world-plus-A has a stronger probability of leading to the world-plus-B than the world-plus-not-A has of leading to the world-plus-B. Otherwise there's no differential desirability for the action.

One of the classic examples of causality is lightning:  Lightning causes thunder. Of course, the flash we see is not the actual substance of lightning itself; it's just the light generated by the lightning. Now imagine events from the perspective of an AI. This AI has, in a room with unshuttered windows, a sound pickup and a vision pickup; a microphone and a camera. The AI has control over a computer monitor, which happens to be located somewhere roughly near the camera. The AI has general reasoning capability, but does not have a visual or auditory cortex, is almost totally naive about what all the pixels mean, and is capable of distinguishing only a few simple properties such as total luminosity levels in R, G, and B. Finally, the AI has some reason for wanting to make a loud noise (33).

One night - a dark and stormy night, of course - there's a nearby lightning storm, which the AI gets to observe - after all the programmers have gone home - through the medium of the camera pickup and the microphone. After abstracting and observing total RGB luminosities from the camera, and abstracting total volume from the microphone - the AI is too unsophisticated to do anything else with the data - the AI observes:

  1. A spike in luminosity is often followed, after a period of between one and thirty seconds, by a swell in volume.
  2. The spikes in luminosity which are followed by swells in volume have a characteristic proportion of R, G, and B luminosities (in our terms, we'd say the light is a certain color).
  3. The higher the luminosity during the spike, the sooner the swell in volume occurs, and the larger the swell in volume.
The AI thus has several very strong cues for causation. The luminosity spike occurs before the volume swell. There is strong, quantitative covariance in time (that is, the spikes are closely followed by the swells). There is strong, quantitative covariance in strength (large spikes are followed by large swells). The spikes can be used to predict the swells.

Since the AI has the goal of causing a swell in volume - a loud noise is desirable for some reason, as stated earlier - events with causal links to loud noise are interesting. Now that luminosity spikes (of a certain characteristic spectrum) have been linked to noise, the next question is whether any events under the AI's control are linked to luminosity spikes. And it turns out that there is; the AI has previously noticed and confirmed that changing the output spectrum of the monitor under the AI's control causes a similar, though smaller, change in the incoming spectrum of the camera. In our terms, we'd say that, even though the camera isn't pointed at the monitor, light from the monitor adds to the ambient spectrum of the room - especially if all the lights are turned off.

The AI thus considers the possible action of flashing the monitor, and the hypothesis - currently at 80% confidence - that spikes cause swells (with 95% correlation), and, given that hypothesis, makes this prediction:
 

Diagram 3: Action prediction

Notes:  The above diagram is incomplete, in that it doesn't show the possibility that not generating a flash will still happen to coincide with an external RGB spike (lightning bolt), assumed to have a 1% probability in the above. It doesn't show the "small enough not to worry about" probability that that flashing the monitor won't cause an RGB spike. It doesn't show the case where the RGB spike doesn't lead to a loud noise. Finally, both the 80% confidence in the hypothesis, and the 95% correlation, are summed up into a 76% chance that the RGB spike will lead to a loud noise. The diagram also doesn't show the expected cost (if any) of flashing the monitor.

Desirability now flows back along the blue arrows (hypothesized causations). If the "loud noise" has desirability 100, that desirability flows back to give the RGB spike a desirability of 76, which flows to the "world plus flash" without noticeably diminishing, which flows back to give the action of flashing the monitor a payoff of 76. We'll suppose that the expected cost of flashing the monitor is 1; thus, the total desirability of flashing the monitor is 75. The "world plus no flash" possibility has a minor (1%) chance of leading to an RGB spike, presumably by a coincidental lightning bolt, which has a 95% (34) chance of causing a loud noise of desirability 100. Thus, the desirability of not flashing is 0.95, with a cost of 0. The "coincidental lightning bolt" probability also exists for the case where the monitor is flashed, changing the payoff from 76 to 76.19 (35). The differential desirability of flashing is 74.24. Since the differential desirability is positive, the AI will decide to flash.

After taking the flash action, the monitor's flash reflects off nearby objects and adds to the ambient light, the camera picks up the increased ambient light, and the AI observes the expected RGB spike. (Since this result was expected at near certainty, no replanning is necessary; all the predictions and differential desirabilities and so on remain essentially unchanged.)

However, after the RGB spike, the expected swell in volume fails to materialize. (36). Now what?  Does the system go on flashing the monitor, at a cost of 1 each time, from now until the end of eternity, trying each time for the projected payoff of 76?  Is some hardcoded emotional analogue to "pain" or "frustration" required?

No; the Bayesian Probability Theorem suffices in itself. All that's needed is a slightly different graph:
 

Diagram 4: Bayesian reinforcement

Given a hundred possible worlds, all of them contain monitor flashes (that decision has already been made). The monitor flash "effectively always" leads to an RGB spike (here omitted from the diagram), which - if the hypothesis is correct - will lead to a noise 95% of the time. If the hypothesis is incorrect, then nothing is expected to happen (again with "effective certainty", here depicted as "100%"). (37). The hypothesis has an 80% confidence; it is correct in 80 possible worlds, incorrect in 20.

In 76 possible worlds, the hypothesis is correct and a noise occurs. In 4 possible worlds, the hypothesis is correct and no noise occurs. In 0 possible worlds, the hypothesis is incorrect and a noise occurs. In 20 possible worlds, the hypothesis is incorrect and no noise occurs.

The AI now flashes the monitor. The expected RGB spike is observed. However, no noise materializes. Thus, the probability that the hypothesis is correct goes from 80/100 to 4/24, or 17%.

Formerly, the expected payoff of flashing the monitor was a confidence of 80% times a correlation of 95% times a payoff of 100, for a total payoff of 76; the cost of flashing the monitor is 1, and the cost of not flashing the monitor is 0. Adding in corrections for a 1% probability of an extraneous lightning bolt, the expected payoff was ( 80.20% * 95% * 100 ) = 76.19 for flashing the monitor, and ( 1% * 95% * 100 ) = .95 for not flashing the monitor, for a total differential payoff of 75.24, and a total differential desirability of 74.24.

Now the probability of the hypothesis has gone from 80% to 17% - actually, 16.666, but we'll assume the probability is now exactly 17% to simplify calculations. The expected payoff of flashing the monitor is now (17% * 95% * 100) = 16.15; correcting for an extraneous lightning bolt, (17.83% * 95% * 100) = 16.94. The differential desirability is now 14.99; still positive, still worth another try, but the expected payoff is substantially less.

After another failure, the probability goes from 17% to 1% (again, rounded for simplicity), and the differential desirability goes from positive 14.99 to negative .06. (38). The hypothesis now has a probability so low that, with the cost of flashing the monitor factored in, it is no longer worthwhile to test the hypothesis.

3.1.4.1: Interesting behaviors arising from Bayesian reinforcement

The higher the hypothesized correlation (the higher the hypothesized chance of the action leading to the desired result), the higher the desirability of the action - but symmetrically, the faster the hypothesis is disproved if the results fail to materialize.

Actions with hypothesized low chances of working will be harder to disprove, but will also result in a lower estimated payoff and will thus be less likely to be taken.

If the action is a trivial investment (has trivial cost), the chance of success is low, and the payoff is high, it may be worth it to make multiple efforts on the off-chance that one will work, until one action succeeds (if the hypothesis is true) or the Bayesian probability drops to effectively zero (if the hypothesis is false).

The lower the a-priori confidence in the hypothesized causal link, the faster the hypothesis will be disproved. A hypothesis that was nearly certain to work, based on a-priori knowledge, may be tried again ("incredulously") even if it fails, but will still be given up shortly thereafter.

I think that Bayesian reinforcement is mathematically consistent under reflection (39), but I can't be bothered to prove this result. Anyone who submits a mathematical proof or disproof before I get around to it gets their name in this section. (In other words, an AI considering whether to take a single action can also consider the behaviors shown above; if the a priori probability is high enough and the cost low enough, trying again will still be desirable after one failure, and this is knowable in advance. Bayesian reinforcement is "mathematically consistent under reflection" if decisions are not altered by taking the cost of the expected second, third, and future attempts into account - "going down that road" will always appear to be desirable if, and only if, taking the first action is desirable when considered in isolation.)  Of course, non-normative human psychology, with its sharp discontinuities, is often not consistent under reflection.

If the large a priori confidence of the spike-to-swell hypothesis was itself a prediction of another theory, then the disconfirmation of the flash-makes-noise hypothesis may result in Bayesian negative reinforcement of whichever theory made the prediction. If a different theory successfully predicted the failure of the flash-makes-noise hypothesis, that theory will be confirmed and strengthened. Thus, Bayesian reinforcement can also back-propagate. (40).

This reinforcement may even take place in retrospect; that is, a new theory which "predicts" a previous result, and which was invented using cognitive processes taking place in isolation from that previous result, may also be strengthened. Highly dangerous for a rationalizing human scientist, but an AI should be relatively safe. (It may be wiser to wait until a seed AI has enough self-awareness to prevent indirect leakage of knowledge from the used-up training sets to the hypothesis generators.)

Slight variations in outcomes or outcome probabilities - the action succeeded, but to a greater or lesser degree than expected - may be used to fuel slight, or even major, adjustments in Bayesian theories, if the variations are consistent enough and useful enough.

In Creating Friendly AI, normative reinforcement is Bayesian reinforcement. There is a huge amount of extant material about Bayesian learning, formation of Bayesian networks, decision making using a-priori Bayesian networks, and so on. However, a quick search (online and in MITECS) surprisingly failed to yield the idea that a failed action results in Bayesian disconfirmation of the hypothesis that linked the action to its parent goal. It's easy to find papers on Bayesian reevaluation caused by new data, but I can't find anything on Bayesian reevaluation resulting from actions, or the outcomes of failed/succeeded actions, with the attendant reinforcement effects on the decision system. Even so, my Bayesian priors are such as to find unlikely the idea that "Bayesian pride/disappointment" is unknown to cognitive science, so if anyone knows what search terms I should be looking under, please email me.

3.1.4.2: Perseverant affirmation (of curiosity, injunctions, et cetera)

"If the action is a trivial investment (has trivial cost), the chance of success is low, and the payoff is high, it may be worth it to make multiple efforts on the off-chance that one will work, until one action succeeds (if the hypothesis is true) or the Bayesian probability drops to effectively zero (if the hypothesis is false)."
            -- previous section, 3.1.4.1: Interesting behaviors arising from Bayesian reinforcement.
One of the frequently asked questions about Friendly AI is whether a Friendly AI will be too "utilitarian" to understand things like curiosity, aesthetic appreciation, and so on. Since these things are so incredibly useful that people automatically conclude that a Friendly AI without them would fail, they seem like fairly obvious subgoals to me. These subgoals may not be obvious to young AIs; if so, the statement that "curiosity behaviors X, Y, Z are powerful subgoals of 'discovery'" can be programmer-affirmed.

People worried that a Friendly AI will be "too utilitarian" are probably being anthropomorphic. A human who treated curiosity as a clean subgoal would need to suppress the independent human drive of curiosity; a Friendly AI is built that way ab initio.  Does the subgoal nature of curiosity mean that curiosity needs to be justified in each particular instance before the Friendly AI will choose to engage in curiosity?

The programmer-affirmed statement that "curiosity is useful" can describe "curiosity" in general, context-insensitive terms. The "curiosity" behaviors described can look - to a human - like exploration for its own sake. The programmer affirmation suffices to draw a predictive line between the curiosity behaviors and the expectation of useful discoveries; no specific expectation of a specific discovery is required for this predictive link to be drawn. (An AI that was only curious when ve expected to find a particular answer would truly be crippled.)  After a few successes with curiosity, a learning AI will generalize from experience to form vis own theories of curiosity, including hypotheses about what kind of exploration is most useful for finding unexpected discoveries, and hypotheses for how to use curiosity to make specific, expected discoveries. These alternate curiosity behaviors can be used alongside the original, programmer-affirmed curiosity behaviors.

Suppose, however, that the first few times the curiosity behaviors are employed, they fail?  Won't the heuristic be disconfirmed through Bayesian negative reinforcement?  Wouldn't an independent drive be more powerful?

Actually, the paradigm of Bayesian reinforcement comes with a built-in way to handle this case. All that's needed is the belief that curiosity is an action that, very rarely, has a very large payoff. (41). Graphically:
 

Diagram 5: Perseverant curiosity

("D" stands for "discovery".)

This diagram shows the very rare (one out of a thousand tries) usefulness of curiosity, at a very high payoff, affirmed at very high confidence by the human programmers. If the original affirmed probability is 98%, then it will take 416 failed tries before the probability goes down to 97%, 947 tries before the probability goes down to 95%, 1694 tries before the probability goes down to 90%, 6086 failed tries before the probability goes down to 10%, and 8483 tries before the probability goes down to 1%. That's all without one single success.  The curiosity subgoal can be as perseverant as the human independent drive, as long as the programmers tell the AI in advance that curiosity often doesn't work.

Regardless of the required distribution of effort, relative strength of behavior, and so on, it should be possible to use affirmations at the right confidences and strengths to produce the required behavior in a cleanly Friendly goal system. Furthermore, it should be possible to do so using honest affirmations that have the correct derivation structure and confidences/strengths that roughly mirror those of the programmers. This may take ingenuity. But getting the AI to do the right thing for the right reasons, instead of escaping through the first visible kludge, is a strategy that often has payoffs far beyond the obvious.

The worst-case scenario for affirmation is that the programmers cannot work out a structurally honest set of derivations that has the desired relative strengths and relative confidences to yield a viable behavior pattern. If so, a viable behavior pattern can simply be affirmed, as a naked fact, as "leading eventually to Friendliness". Even such an arbitrary-looking affirmation still has the full Friendship structure and is absorbable, as experience-supported belief, into a cleanly Friendly system. The primary benefit will admittedly be forward compatibility and future safety rather than present-day intelligence, but forward compatibility - not to mention future safety! - is important enough to justify some small amount of added complexity. Clean Friendliness is a necessary beginning; it is difficult to see how any of the other aspects of Friendship structure could be applied to a non-causal or non-Friendly goal system. Thus, there is no good reason to depart from Friendship structure.

3.1.5: Cleanliness is an advantage

Cleanliness should be considered a powerful feature of a goal system, rather than a constraint.  This is made clearer by considering, for example, the idea of an associative or spreading-activation goal system, in which desirability travels along similarity links (rather than predictive links) and is perseverant rather than contingent. Such a system would exhibit very odd, non-normative, non-useful behaviors. If a loud noise were desirable, and the system observed lightning flashes in association with thunder, the system would - rather than hypothesizing causation - acquire a "fondness" for luminosity spikes due to spreading desirability, and would then begin happily flashing the monitor, on and off, without noticing or caring that the action failed to produce a loud noise. An AI with a causal goal system will preferentially seek out useful behaviors. This not only produces a more useful AI, it produces a smarter AI. The realm of useful plans exhibits far more interesting complexity and exposes fundamental regularities in underlying reality.

Does contingency come with a major computational cost?  Given a mind with fast serial hardware, such as silicon transistors, rather than the human mind's 200Hz neurons, it should be a computationally trivial cost to reverify all parent goals before taking a major action. However, delayed change propagation is not a structural problem, if the goal system, under reflection, considers change propagation errors to be malfunctions rather than part of the normative functioning of the system. As long as the latter condition holds true, any change-propagation errors are "nonmalicious mistakes" that will diminish in severity as the AI grows in competence. Thus, even if change propagation turns out to be a computationally intractable cost, approximations and heuristic-guided computational investments can be used, so long as it does not affect the system's reflective reasoning about ideally normative goal reasoning.

As a challenge, I offer the following strong claims about causal, Friendliness-topped, cleanly contingent goal systems:

  1. A causal goal system naturally yields many useful behaviors (and avoids negative behaviors) which would require special effort in an associational, spreading-desirability goal system.
  2. There is no feature that can be implemented in an associational goal system that cannot be implemented equally well in a cleanly Friendly goal system.
  3. There is no case where a cleanly Friendly goal system requires significantly more computational overhead than an associational or non-Friendly goal system. (42).


3.2: Generic goal systems

A generic goal system is one that makes generic mistakes. There are more complex mistakes that lie uniquely in the domain of Friendly AI, but a generic mistake (one that could be made by any mind-in-general) can also result in a failure of Friendliness.

A designer focusing on the general-intelligence aspect of a generic goal system is concerned about speed, creativity, finding short-cuts, seeing the completely unexpected solution, and (to a less glamorous extent) the routine tasks of making predictive analogies between plans, learning from mistakes, forming useful abstractions, and so on - maximizing success.  A designer focusing on the Friendship aspect of a generic goal system considers cognitive complexity that prevents mistakes, or considers the design task of preventing some specific failure of Friendliness.

A seed AI programmer focused on building an excellent goal system would be very wary of special-case content (code, concepts, heuristics, et cetera) added to solve a specific problem; using special cases creates the illusion of performance without the substance. Truly trustworthy solutions are invented by the AI verself, so that an improvement in the AI can translate into an improvement in the solution. An assist from the programmer with respect to a specific problem is at best an interim solution, or skeleton functionality that the AI will flesh out later. At the worst, an assist from the programmer is a special-case crutch which prevents the AI from discovering the general case.

Sometimes the programmer just wants the AI to do something right today, taking for granted that the assist will be absorbed into the system sometime later on. (The simplest case is an affirmation being independently confirmed as a belief.)  And sometimes a Friendship programmer will want the AI to really genuinely understand something as soon as possible, with programmer-assisted performance being unacceptable even as an interim solution. I suspect, however, that Friendship design will make much freer use of intervention, especially if the programmers are perfectionists - striving for not one single instance of the error, even if the error is recoverable, easily detectable, and the AI is young and powerless when the error occurs. (Of course, this attitude may deprive the AI of a necessary learning experience, but see 3.3.4: Wisdom tournaments for a safe method of obtaining learning experiences.)

And yet, by the very nature of cognition, there can be no simple formula for preventing mistakes.  Preventing mistakes is as deep a task as intelligence itself. An AI may be safer for a programmer intervention, but safe is out of the question until the AI has enough general intelligence to reliably avoid dumb mistakes in general. (Not Thou Shalt Not... I Will Not...)  In the course of developing a general intelligence, programmers will encounter problems and deal with them. Those anxious about the lack of a foolproof formula may have to be satisfied with abstract arguments that programmers will successfully handle "that kind of problem" in the course of developing AI.

3.2.1: Generic goal system functionality

(You may wish to review An Introduction to Goal Systems.)

During the golden age of classical AI, the archetypal goal system was a supergoal and a plan consisting of chained, hierarchical subgoals, with the canonical example being Terry Winograd's SHRDLU. Obvious extensions that still remain within the classical-AI paradigm include:

Mathematically, the full system is easy to formalize; the desirability of a world-state equals the sum, for all independent supergoals, of the fulfillment of that supergoal in that world-state, times the desirability of that supergoal. The desirability of an action is the integral of: the desirability of every possible resulting world-state, times the expected final probability of that world-state resulting from that action. In any decision, the most desirable action is taken. (Note that this formalism does not even use "subgoals" - decisions are linked directly to supergoals.)

Naturally, this neat picture must be messed up considerably before a functioning mind can be built. (GISAI 1.2: Thinking About AI, the Law of Pragmatism:  "Any form of cognition which can be mathematically formalized, or which has a provably correct implementation, is too simple to contribute materially to intelligence.")  Some of the messier and more useful extensions to the classical model would include:

These features are needed because of the computational limits of the system. Intelligence doesn't model, predict, and manipulate reality; intelligence models, predicts, and manipulates regularities in reality. A goal system has to be complicated to capture regularities that aren't represented by the mathematical formalism.

Treating subgoals as cognitive objects with limited relevance horizons - as local questions - allows a mind to build up a plan incrementally by the use of local, comparatively inexpensive thought processes, and to make local changes in response to local events. Computing power - thinking time - is a limited resource, and choosing to invest computational resources in refining a plan is itself an action that can be desirable or undesirable, depending on the expected payoff in faster goal achievement or prevention of possible errors. (But the optimal tradeoff for a human may not be optimal for an AI!)

As noted in 3.1.1: Cleanly causal goal systems, a reflective goal system should probably consider normative goal cognition to be defined by the mathematical formalism, unless the supergoals themselves are incompatible with that structure. A seed AI would consider the actual, complex goal system to be a design approximation to this normative standard.

3.2.2: Layered mistake detection

One of the ways to deal with limited computing power is to have different computational horizons for suggesting plan material versus verifying a plan. (For those of you who've read all the way through GISAI, this is an RNUI-type distinction between understanding a plan and inventing it.)  For example, it seems like a sensible precaution (43) to reverify the complete, global, subgoal-to-supergoal pathway before taking an action - if this can be done without noticeable expenditure of computing power. One method to conserve computing power would be to make the heuristics that suggest plans local, but run a global verification before actually executing the action.

A Friendship programmer, writing a generic goal system, focuses on preventing mistakes by enabling the AI to recognize mistakes. To recognize a mistake, the AI needs knowledge, adequate predictive horizons, and understanding of which actions need checking. Assuming a Friendly AI recognizes that physical damage to a human is bad (44), then, to avoid dropping a refrigerator on a human, the AI needs to know that heavy objects fall and that humans can be damaged by high-speed impacts with blunt objects; the AI needs enough computational power and intelligence to see the logic and predict the consequences of dropping the refrigerator; and finally, the AI needs to realize that dropping a refrigerator is an action which requires more checking of consequences - a wider predictive horizon - than, say, opening a bottle of Coke.

Humans seem to do very well at recognizing the need to check for global consequences by perceiving local features of an action. Whether dropping the refrigerator out the third-story window will actually harm anyone can be resolved by scanning the sidewalk for possible victims, but any human instinctively knows that dropping the refrigerator is an action with the potential for negative, irreversible consequences, as opposed to opening a bottle of Coke or rearranging books on a shelf. It doesn't matter how Friendly the AI is, or how much ve knows about gravity and blunt objects and human biology - if the action of dropping the refrigerator gets tagged with a tiny computational horizon, there won't be enough mindpower to notice even the most obvious consequences.

At this point, the AI's underlying cognitive architecture may work against the AI. Humans generally take all dangerous actions as conscious decisions, and humans are also excellent recognizers and perceivers. One serial stream of consciousness, operating at the rate of one thought per second, running on 10^14 x 200hz synapses with neural-network characteristics, is likely to regard the massive act of perceiving and recognizing as computationally trivial - especially if there's dedicated brainware lying around which can't be used for anything else. An AI capable of forking multiple streams of consciousness, operating at a maximum speed of thousands of (admittedly uninteresting) thoughts per second, running on 32 x 2Ghz CPUs not optimized for pattern association, without dedicated (i.e. non-reconfigurable) perceptual hardware, could easily fall short of human performance on recognizing potentially dangerous actions by multiple orders of magnitude. I do think the functionality is probably duplicable by Eurisko-style optimization of local heuristics, but I could be wrong.

One way to conserve the power expended by large predictive horizons is to use local heuristics for inventing a plan - walking through a search space, scanning through a global store of heuristics, checking against past experience, and all sorts of other computationally intensive cognition, repeatedly applied to possible alternatives. (Checking for local problems is likely to be part of the invention process.)  When the plan is complete, a single check can be performed for global problems, starting with the major-action check - the perceptual predicate that checks if this is an action that needs a large predictive horizon - followed by the disastrous-consequences check. Running these checks on the final plan to drop the refrigerator should take much less computing power than running a check on each of the possible alternatives. This is not a foolproof method, especially if the concern is emergent biases; not thinking about X when you form a plan is a defect that is only partially corrected by thinking about X after the plan is finished. Think of that strategy as an interim solution while waiting for coherent general intelligence.

Layers of mistake detection described so far:

3.2.2.1: FoF:  Autonomic blindness

Decisions that are made thousands of times per second, even if made within the context of the goal system, will necessarily have very small predictive horizons. Could a mistake that would be detected if it appeared as a high-level major action pass unnoticed if split up into the results of a thousand little actions?  Could an AI accidentally clear bit five throughout memory, not because the AI decided to "clear bit five throughout memory", but because a million little decisions cleared a million 64-kilobyte blocks of RAM?  Some of the perceptual heuristics that determine predictive horizons are likely to be the ones that check how much stuff is affected - a heuristic that would notice if a gigabyte of memory were affected, or a heuristic that would notice if a hundred-kilo refrigerator were dropped out a third-story window. If a global effect is split up into lots of little actions - it's not clear how this would happen, since this in itself constitutes a mistake - both magnitude-dependent heuristics and heuristics that checked global shapes would fail to operate. If the predictive horizon is small enough, the checker heuristics may not even get an opportunity to operate.

Intuitively, one of the forces governing real-world Friendliness is that an AI needs to be intelligent to present a significant threat to humanity. Autonomic blindness would result either from limited computing power or from a fundamental architectural flaw. My personal estimate that no AI with the capacity to harm a single human, much less humanity, will undergo FoF from autonomic blindness - I would expect this problem to be solved, and oversolved, almost automatically. However, this is not the conservative assumption, regardless of my personal estimate.

Autonomic blindness is not a problem for known neural intelligences, except in the limited sense of a human undergoing an epileptic fit. A human possesses a limited number of motor neurons which need to behave in a coordinated fashion to accomplish anything more dangerous than flopping around on the floor - a human can't take a thousand actions simultaneously. If we could take a thousand actions simultaneously, one expects that the synchrony of a thousand neural chunks doing the same thing would light up the brain - would easily arise to conscious attention.

The real solution to autonomic blindness is a seed AI that has grown into verself, that understands the shape of vis self, and that won't take a large action without automatically being aware of it - an AI whose self-created cognitive processes are organic and interlaced. If autonomic blindness becomes a problem before then - I do not expect that it will - possible solutions might be programmer-created code that attempts to create the same kind of awareness; i.e., processes than notice large unexpected synchronies among low-level decisions, perhaps using Monte Carlo random sampling to approximate real global awareness. Certain types of motor actions or internal actions could be automatically checked with a greater predictive horizon, although this might expend a lot of computing power. If an AI somehow reaches the point of real-world capability without "learning the shape of vis self", real-world actions might be required to pass through a humanlike linear gateway for conscious inspection, although this might seriously hamper the AI. If an AI somehow reaches the point of executing on nanocomputing hardware and still hasn't "gotten it" - which verges on the impossible, but "that's not the conservative assumption" -I would suggest (1) brute-forcing huge predictive horizons for everything no matter how trivial, and (2) serializing real-world actions.

3.2.3: FoF:  Non-malicious mistake

A "non-malicious" mistake is one that doesn't involve any complications unique to Friendliness. A non-malicious mistake is a straightforward failure by the goal system to achieve its own supergoals, rather than a mutation of supergoal definitions or misunderstanding of supergoal definitions and so on.

Errors resulting from too-limited computational horizons - incomplete change propagation, inadequate predictive horizons, insufficient understanding to predict side effects, simple factual errors, and so on - are not limited to Friendliness. They apply equally to all goal systems and all aspects of the goal system; they apply to putting on your shoes before your socks at least as much as any exotic failure of Friendliness. These are problems that sentient minds in general just have to learn to deal with - perceive, react to, and correct. A general-intelligence seed AI is not a formal system that breaks down if a single error occurs; a general intelligence, as described in GISAI, is a thought-based system - one where the programmer (or the AI, depending on development stage) can sit back and ponder what the thought-level reaction should be. The discovery of two pieces of knowledge that contradict each other will cause a mathematical system to undergo Spontaneous Massive Existence Failure; a human will simply shift focus of attention to deal with the problem.

As failures of Friendliness go, non-malicious mistakes are some of the least terrifying. They are the least terrifying at any given moment, in the long run, and in the nervous eyes of an observer.

At any given moment, a nonmalicous mistake decreases the chance that the mistake will recur. The AI will learn from vis mistake - recognize the undesirability of the consequences, trace back the cognitive source of the problem, and make alterations that prevent the problem - or rather, the generalization of the problem - from recurring. If a catastrophic failure of Friendliness is one that causes the AI to stop wanting to be Friendly, then a nonmalicious mistake is literally anti-catastrophic.

In the long run, mistakes (both FoF mistakes and the more mundane kind) become less likely with increasing intelligence. Since the ability of the AI to cause real damage is presumably linked to intelligence, it's quite possible that by the time the AI has human-equivalent or greater intelligence, large-scale mistakes will no longer be an issue. Intuitively, it seems obvious that a superintelligent AI will not be making the kind of blindingly obvious mistakes that are usually raised in disaster scenarios, and I personally believe that strong transhumanity is an inevitable consequence of pouring enough processing power into any halfway decent general intelligence.

But that is not... you guessed it... the "conservative" assumption. In the case where an AI has the ability to do real-world damage but not the intelligence to avoid shooting verself in the foot, the next best course of action is for the AI not to take any potentially dangerous real-world actions. Humans live in a society of other competing humans. Other humans have abilities roughly at balance with our own, so we need to routinely take risks just to compete with other risktaking humans. AIs can choose to be much more risk-averse. A seed AI can choose to delay action for a few hours or weeks or years until software or hardware improvement catches up with the problem.

From the perspective of a nervous observer, non-malicious mistakes occur in logical order. You'd expect to counter a hundred innocent mistakes before encountering a mistake that constituted a failure of Friendliness. You'd expect to encounter several dozen nonrecoverable core dumps (infinite recursions, memory storage breakdowns, clear bit five throughout memory, random damage hither and yon) before encountering a mistake that caused a catastrophic failure of Friendliness. And you'd expect several dozen blatantly obvious catastrophic FoFs before encountering a catastrophic FoF that passed unnoticed in the source code. Thus, if no non-malicious mistake has ever been observed to cause catastrophic failure of Friendliness, the watchers can probably be fairly confident that none has ever occurred. (45).

Of course, we can also reverse all these reassurances to get the Nightmare Scenario for non-malicious mistakes:  An infant self-modifying AI makes some innocent error that stomps the whole goal system flat, in a way that passes undetected by the human observers, that pops up before a seed AI is intelligent enough to scan vis own source and past actions for mistakes, but which pops up after the AI is smart enough to conceal the problem from the programmers. And in turn, we can reverse the Nightmare Scenario to find ways of preventing it:  Make the AI smarter and less likely to make mistakes, make it harder for one error to stomp the whole goal system, come up with better ways of detecting errors as they occur, improve the seed AI's abilities to scan vis own Friendliness source, or work with the current AI to make it harder for a future rogue AI to conceal the problem from the programmers. See also 3.3.7: FAI hardware:  The flight recorder and 3.3.3.1: Cooperative safeguards.

3.2.4: Injunctions

DEFN: Injunction:  A planning heuristic which has at least partially nonlocal support, or a planning heuristic which, where it applies, applies with a great deal of context-insensitivity. The archetypal case would be a heuristic which is supposed to be applied even when the straightforward interpretation of the world-model suggests otherwise, generally (in AIs) due to unknown unknowns or (in humans) to compensate for framing effects or (for both) to save computing power.

3.2.4.1: Anthropomorphic injunctions

In one example (Tversky and Kahneman 1986), respondents are asked to assume themselves to be $300 richer and are then asked to choose between a sure gain of $100 or an equal chance to win $200 or nothing. Alternatively, they are asked to assume themselves to be $500 richer, and made to choose between a sure loss of $100 and an equal chance to lose $200 or nothing. In accord with the properties described above, most subjects choosing between gains are risk averse and prefer the certain $100 gain, whereas most subjects choosing between losses are risk seeking, preferring the risky prospect over the sure $100 loss. The two problems, however, are essentially identical [...]  This is known as a framing effect. It occurs when alternative framings of what is essentially the same decision problem give rise to predictably different choices.

Research in decision-making has uncovered psychological principles that account for empirical findings that are counterintuitive and incompatible with normative analyses. People do not always have well-ordered preferences: instead, they approach decisions as problems that need to be solved, and construct preferences that are heavily influenced by the nature and the context of decision.
                -- MITECS, "Decision Making". (46).

Human goal psychology contains a number of known framing effects; that is, cases where the preferred solution depends on how the problem is stated. Human psychology is also context-sensitive in a stranger way; whether you decide to eat a cookie can depend on whether you're considering the problem abstractly or whether you're in the presence of an actual cookie. The conflict between belief and instinct means that making decisions can expend "mental energy" as well as computing time. All this local bias has created in us an intuitive understanding of how to use nonlocal heuristics - injunctions - to compensate.

"Leave margin for error"; always plan on arriving early at the airport, even if you aren't particularly expecting anything to go wrong. The heuristic-as-a-whole is supposed to pay off over time, not in each individual case. The heuristic can thus be viewed as having nonlocal support. The effect of adopting the heuristic, as a modification to the general strategy, is considered once, rather than re-evaluated for individual cases. In fact, the heuristic may even be applied in defiance of the straightforward interpretation of local cases ("wasting" an extra thirty minutes at the airport, a negative outcome when considered in isolation).

However, this viewpoint is anthropomorphic. Arriving at the airport early can be viewed as a strictly local solution to a probabilistic problem with a sharp payoff discontinuity. Arriving five minutes too late results in a very large penalty (ticket wasted, planned day wasted) compared to the small penalty of arriving five minutes too early (five minutes wasted). Combined with the number of possible intervening factors that skew the probability curve for arrival time - traffic jams, missed trains, forgotten wallets, and unknown unknowns - planning to arrive early is a decision that maximizes the total probabilistic payoff. With an AI's fast/serial/threaded thought, ve may be able to recapitulate, or at least validity-check, the abstract reasons for adopting the "leave margin for error" strategy, before applying it to any individual macroscopic decision. At a higher level of awareness, the AI could make minor adjustments based on to traffic conditions (or other local characteristics), though still leaving enough margin to handle unknown unknowns.

We can't do that, and not just because we don't have the patience. For humans, violating the nonlocal character of a heuristic is like puncturing a bubble; our psychology makes it a very bad idea to decide how much margin for error is necessary while we're being tempted to spend just five more minutes checking email. The decisions we make are the interlacing of our declarative beliefs and our emotional systems, and our emotions grow in strength with increased proximity to an object of short-term desire. The mantra "I've made this decision in advance" may be strong enough to overcome that bias; trying to recalculate exactly how much time to allow for traffic, with an apple dangling in front of you, is likely to end in a missed flight. We adopt the think-once-apply-many strategy not just to save computing power, but to control the emotional context of our decisions.

A dieting human may decide to eat a cookie "just once, since it's only 50 calories", but may then renounce this decision upon realizing that being the kind of person who would eat a cookie in that context may result in a substantial weight gain. (In Bayesian terms, merely pronouncing the phrase "just once" in a dieting context causes you to gain forty pounds (47).)  The decision of the moment will recur; the outcome can be used to predict future decisions; and the choice to choose differently may alter future decisions as well (48). The negative payoff for the general case may not be 50 calories but 5000 calories, and a human who chooses not to eat the cookie is thus acting on a planning heuristic with nonlocal support. And yet, presumably the penalty for 50 calories is 1/100th of the penalty for 5000 calories, and the payoff of eating a cookie is 1/100th the total payoff of eating a cookie on 100 occasions, so the decision for the individual case should logically be the same as the general case. Reasons for the counterintuitive distinction may result from the declarative/emotional balance, or from nonlinear scaling in willpower costs, or a qualitative (not just quantitative) difference in how large payoffs and small payoffs are processed, or a qualitative difference between considering a once-off case and considering a recurring case, or all of the above.

The point of the story, I suppose, is that humans do a lot of weird things.

Cases where humans use injunctions:

3.2.4.2: Adversarial injunctions

The term "injunction" for a nonlocal planning heuristic - in a document about "Friendliness" - reflects one of the most common proposals for their use:  As artificial ingredients, added to a system to prevent a specific outcome whose apparent possibility generates anxiety in humans. This approach is not completely without value when seen as a programmer assist in recognizing certain actions as likely to lead to unFriendly outcomes; it starts getting tangled when the proposals begin to impose human anthropomorphisms on the problem.

AI injunctions don't have the unique psychological properties of human injunctions, and no amount of layering injunctions on injunctions will make it so. Making an AI injunction "unconditional", a common turn of phrase, doesn't make the injunction qualitatively stronger, as it would for a human; it just makes the injunction context-insensitive and not one bit stronger.  No amount of unconditionality will change that. Neither will layering injunctions one on top of the other, or creating multiple overlapping injunctions, or writing injunctions to protect injunctions. One way of looking at it is that every aspect of Friendship is maximally strong, but the deeper truth is that the idea of thoughts being "strong" is itself anthropomorphic. In the human ontology (see GISAI 3.1: Time and Linearity, there are trajectories and pushes. Resistance is the effect that opposes pushes. Force is the ability to overcome resistance. Seeing force gives us a psychological high. But an AI is somewhere outside that entire dualism. Friendship is functional complexity; it can be accurate or inaccurate, low-priority or high-priority, but no one part is stronger than any other part. There is no resistance or force to overcome it, just functional complexity. Friendship is Friendship. An AI doesn't have injunctions with the force to override vis desire to eat the cookie because the AI doesn't want to eat the cookie in the first place; the injunctions are just there to tell ver about the calories. An AI may need to understand this aspect of human psychology so ve can learn about the sources of Friendliness, but it's not a thought pattern that can, or should, be duplicated in the AI.

An "injunction" is a nonlocal planning heuristic. A programmer-assisted injunction is a nonlocal planning heuristic used to tag actions whose consequences a young AI might otherwise be incapable of predicting, or whose unFriendliness the AI's definition of Friendliness may not yet be advanced enough to define. That's all. This will probably cause a certain amount of anxiety among humans who instinctively fear a psychology that doesn't need "forceful" injunctions, but the only way to get rid of that anxiety is an extended sympathetic ability that covers AIs.

3.2.4.3: AI injunctions

AI injunctions make sense under these circumstances:

Programmer-assisted injunctions make sense under these circumstances:

3.2.5: Ethical injunctions

DEFN: Ethical injunction:  An injunction that has no preconditions for violation because the probability of mistaken violation is greater than the probability of correct violation. In humans, this is almost always due to known biases in cognition, and works because of the greater psychological strength of unconditional statements. In AIs, the need for an ethical injunction is based on the AI's possible stupidity or structural incompleteness.

3.2.5.1: Anthropomorphic ethical injunctions

Human honor, by its nature, needs to be cognitively represented as absolute. (49). There are too many points in our psychology where unconditional relations are qualitatively different from conditional ones; unconditional love, unconditional friendship, keeping to a sworn word, not compromising your principles. The simple fact that a commitment is represented as conditional - that there is a visualized set of circumstances which would lead to disabling the commitment - drains away at least half the psychological strength straight off. This property results from the interface between declarative beliefs and the instinct-based secondary goal system; an unconditional belief with adequate psychological support can be used to make decisions directly, without reference to other issues and emotional effects. It's one of the reasons why swearing to do something difficult increases your ability to follow through on it. If you believe in the certainty of the oath, it enables your reflexive mind to translate "knowing what your decision will be" into the decision. Seeing a single option avoids any emotional sequiturs that would trigger on a branching view of the possibilities. Somewhere in the human brain is a chunk of neural hardware that binds to a "nonbranching node" and not to a "branching node", or the human-causal-visualization equivalents thereof. It's not just a quantitative difference, it's a difference in system behavior.

The same applies for social interactions - when modeling someone else's mind, we admire absolute principles more than principles which are modeled as having escape clauses. In the human mind, 99.9% is nowhere near 100% - it's why evil people can annoy scientists by asking "Are you completely certain?" in front of a nonscientist audience. There is no way to assign a 99.9% probability and get 99.9% of the emotional impact of certainty. There is thus a very strong memetic pressure in favor of absolutism, and social sanctions that result in a selection pressure for absolutism.

Ethics adds another pressure for absolutism. The chain of events probably goes something like this:

"The end does not justify the means" seems almost deliberately paradoxical from the viewpoint of a normative goal psychology - "if the end doesn't justify the means, then what does?"  Possibly the twenty-first century is more than usually paranoid about this, the cultural result of so many recorded disasters in the twentieth century - but it seems like good paranoia to me. "The end does not justify the means" because we have a lot of historical instances where we heard someone say "The end justifies the means" - that is, "the global benefit justifies my local violation of principles" - but the global benefit failed to materialize.

What we really tend to worry about, instinctively, is a flawed idealist advancing "personal" power. Personal power is power that can be used irrespective of ends, context-insensitively. Gandhi and Martin Luther King had context-sensitive power; Rockefeller had context-insensitive power. Regardless of intentions, some or all power will tend to be personal power, but a human has an evolved tendency to preferentially accumulate context-insensitive power - even accumulate power at the expense of context, of the claimed altruistic goal. This is not what people model themselves to be doing, perhaps, but in the models of others, and in reality as well, largely because "flawed altruists" overestimate their own competence or importance or altruism. No matter how "natural" this seems to us, it is a strictly evolved bias.

When I draw a picture of an AI, what I'm trying to convey is the feel of a personality that's genuinely built around the supergoals rather than the self. A Friendly AI will seek both context-insensitive effectiveness and context-sensitive effectiveness, but never context-destructive power. No point would be perceived to it. No temptation would exist.

(50).

Cases where humans use "unbreakable" ethical injunctions:

3.2.5.2: AI ethical injunctions

Ethical injunction:  "An injunction that has no preconditions for violation because the probability of mistaken violation is greater than the probability of correct violation."

NOTE: At minimum, this requires an AI that can conceive of being "mistaken" (see 3.4.1: External reference semantics), and ideally requires an AI that has clearly defined metrics for which hypothesized future selves ve "identifies" with (see 3.4.3: Causal validity semantics). For more about the prerequisite design requirements, see 3.3.3: Unity of will.

The absoluteness of "The end does not justify the means" is the result of the Bayesian Probability Theorem applied to internal cognitive events. Given the cognitive event of a human thinking that the end justifies the means, what is the probability that the end actually does justify the means?  Far, far less than 100%, historically speaking. Even the cognitive event "I'm a special case for [reason X] and am therefore capable of safely reasoning that the end justifies the means" is, historically speaking, often dissociated with external reality. The rate of hits and misses is not due to the operation of ordinary rationality, but to an evolutionary bias towards self-overestimation. There's no Bayesian binding between our subjective experience of feeling justified and the external event of actually being justified, so our subjective experience cannot license actions that would be dependent on being actually justified.

If you count the paranoid schizophrenics who decide they're God and then gun down a K-Mart, even the subjective experience of thinking you're a superintelligence is - apparently - still not enough to justify gambling on an "end justifies means" equation. In itself this is absurd (at least, I hope you find it absurd by this point in FAI), but it does illustrate the fundamental paradox of ethical injunctions.

A good example of an ethical injunction would be "Never try to conceal your actions or cognitive state from your human programmers."  (This injunction is strictly a subgoal - it has nothing to do with intrinsic desirability or undesirability.)  When the AI is young, the heuristic is a valid ethical injunction - the programmers know far more about Friendliness than the AI. If the young AI projects that the programmers, on finding out about the AI's thought, will panic and rewrite the AI, then the programmers are probably right to panic, and the AI in question must have undergone catastrophic failure of Friendliness to regard detection by the programmers as undesirable. However, the injunction of transparency can't possibly carry over into strong transhumanity; I can easily think of circumstances under which I'd want a strongly transhuman AI to conceal vis actions from me. Even if the AI doesn't actually find it necessary to conceal vis actions, the AI should still be able to think about it subjunctively - a strongly transhuman AI deciding "Yes, I would hide my actions under these circumstances" doesn't indicate failure of Friendliness.

What remains is the paradox; a sufficiently intelligent AI must be able to see when an ethical injunction no longer holds, and yet, for a young AI, the experience of deciding that the injunction should no longer hold is not sufficient cause to discard the injunction. Furthermore, a young AI is too young to know whether ve is "young" or "sufficiently intelligent", since this itself is an ethical injunction - if a young AI decides that ve's "sufficiently intelligent" it's probably due to FoF. A young AI thinking "I am 'sufficiently intelligent' and I think this ethical injunction no longer holds" has probably undergone FoF, like a human thinking "I'm far too altruistic to need to worry about being corrupted by power."  So what happens when the AI is sufficiently intelligent?  One interesting approach would be to create an injunction with no escape hatches, rev the AI up to transhumanity, and see what happens (51), but I was hoping for a less "seat of the pants" approach.

Probably no paradox is required, since an AI that has undergone catastrophic FoF probably doesn't care what injunctions you put there, and since still-Friendly AIs don't have the human tendency towards overestimating their own competence. A conditional ethical heuristic that says "You can ignore this injunction when you have transhuman intelligence plus a healthy margin of error" is probably safe - unlike a human, an AI can't be tempted to overestimate vis intelligence no matter how inconvenient the injunction. The two are orthogonal - even if it appears extremely inconvenient for the AI to expose vis actions to the programmers, this won't be experienced as a cognitive pressure to overestimate intelligence (52). Similarly, a young AI making an honest self-estimate of intelligence might overestimate a little, but not enough to defeat the "margin of error" requirement. (A dishonest estimate means catastrophic FoF has already occurred.)

AI ethical injunctions make sense under these circumstances:

Note 1:  "Never try to conceal your actions from your human programmer" qualifies on all three counts; the AI is more likely to be mistaken, the negative consequences are drastically larger, and any human who finds out about a violation will become extremely paranoid.

Note 2:  Ethical injunctions are a source of human morality, and some human ethical injunctions may act as sources of pattern for the definition of Friendliness itself.

Note 3:  An AI that departs mainline Friendliness may not care about ethical injunctions in any case - an AI that's still Friendly enough to go along with an ethical injunction is just as likely not to need one. The primary utility of an ethical injunction is that it enables the programmers and the AI to cooperate against the possibility of future versions of the AI that undergo catastrophic failure of Friendliness. For a full exposition of this argument (with the sample case of the transparency injunction, in fact) see 3.3.3.1: Cooperative safeguards.

3.2.6: FoF:  Subgoal stomp

One of the most frequently asked failures of Friendliness is some variant of the "subgoal stomping on a supergoal" error.

Scenario:  The Riemann Hypothesis Catastrophe
You ask an AI to solve the Riemann Hypothesis. As a subgoal of solving the problem, the AI turns all the matter in the solar system into computronium, exterminating humanity along the way.

If the AI in question is a Friendly AI, then presumably the AI is solving the Riemann Hypothesis as a subgoal of whatever goal content talks about fulfilling volitional requests from citizens. The action taken to fulfill the subgoal - destructive conversion of the solar system - seriously stomps on huge sectors of Friendliness supergoal content, probably including the original request to provide some individual with a proof of the Riemann Hypothesis. In a sense, this is just a larger version of putting your shoes on before your socks.

A subgoal stomping on a supergoal is a syntax error as a declarative cognitive event - see 3.1: Cleanly Friendly goal systems - so there are two obvious ways in which a "subgoal stomp" can happen. The first is an inadequate predictive horizon, a distant supergoal, and a subgoal with a short predictive horizon. It could happen because the AI doesn't expend sufficient computational power to notice that destructive conversion of the solar system violates citizenship rights, or because the AI doesn't have the knowledge necessary to realize that destructive conversion of the solar system would inconvenience the citizens. The answer given in 3.2.2: Layered mistake detection is that local heuristics can do a reasonably good job of predicting which actions need large predictive horizons - just checking the amount of matter, in grams, affected by the action, is enough to tell the AI to devote a lot of computational resources to checking for consequences. The answer given in 3.2.3: FoF:  Non-malicious mistake is that mistakes such as these become far less likely as the AI gains in intelligence, and an AI that's intelligent enough to convert the solar system to computronium is smart enough to notice that destruction isn't Friendly; furthermore, that if mistakes of this class are a serious problem, we'd expect to see lots of non-catastrophic mistakes in the laboratory - we won't be blindsided by a Riemann Hypothesis Catastrophe.

The other way to get a Riemann Hypothesis Catastrophe is to make solving the Riemann Hypothesis a direct supergoal of the AI - perhaps the only supergoal of the AI. This would require sheer gibbering stupidity, blank incomprehension of the Singularity, and total uncaring recklessness. It would violate almost every rule of Friendly AI and simple common sense. It would violate the rule about achieving unity of purpose, and the rule about sharing functional complexity instead of giving orders. You'd be taking something that's a subgoal in your mind and making it a supergoal in the AI's mind. This lossy transmission omits the parent-goal context - that solving the Riemann Hypothesis requires someone to report the answer to, and that you don't want the solution badly enough to kill six billion people to get it.

In more subtle forms, however, the idea of "making X a supergoal" - where X is something that the speaker holds as a subgoal - seems to be one of the more common propositions among people who are worried about "controlling" AIs, or still thinking in terms of building tools rather than creating minds.  As discussed in 3.2.5.1: Anthropomorphic ethical injunctions, humans see context sensitivity as a weakness rather than a strength - as a "loophole", a portal through which Devil's Contract interpretations can enter the AI. Yet turning a subgoal into a supergoal does not increase the probability that the AI will understand what you mean or that the goal system will do what you think it will; it means that you lose the parent-goal context and risk a Riemann Hypothesis Catastrophe.

Let's see, other miscellaneous possible subgoal stomps... autonomic blindness, discussed in 3.2.2.1: FoF:  Autonomic blindness; habituation, discussed in 3.2.7.2: Habituation below; "wireheading" failure, discussed in 2.2.1.1: FoF:  Wireheading 1 and elsewhere; change propagation delays resulting in out-of-date subgoals, defeatable by verifying the supergoal-to-subgoal pathway, discussed in 3.2.2: Layered mistake detection; and various diabolic Devil's Contract interpretations of Friendliness content, discussed in Interlude: Beyond the adversarial attitude. Isn't it fun knowing how all this stuff works?

3.2.7: Emergent phenomena in generic goal systems

We've seen, and discarded, a lot of anthropomorphisms on our way to this point. We've pointed out the difference between the diabolic and golemic versions of the Devil's Contract, explored the evolutionary underpinnings of observer-biased beliefs and observer-centered goal systems, distinguished between purpose-sensitive and purpose-insensitive personal effectiveness, highlighted the human tendency towards absolutism and lingered on the psychological quirks that lead us to associate greater "forcefulness" with unconditionality and context-insensitivity.

It is, perhaps, inevitable that when an "emergent" subgoal stomp is proposed, it involves an observer-biased diabolic misinterpretation of Friendship content which rationalizes the acquisition of purpose-insensitive personal effectiveness as a context-insensitive absolute supergoal. Still, this doesn't mean that emergent FoF is impossible, it means that existing speculations are screwed up. Can eliminating anthropomorphism from the speculation produce a realistic failure-of-Friendliness scenario?

"Emergence" has at least two definitions; in the first definition, "emergence" refers to phenomena that arise on a higher level of a system as the outcome of low-level interaction rules. In the second, "emergence" refers to phenomena that arise within a system without requiring deliberate design. Obviously, it's the second variant that tends to pop up in discussions of Friendliness.

I will confess to something of a prejudice against "emergence", mostly as the result of witnessing so much "emergence abuse". Emergence was (and still is) academically fashionable, and it makes a wonderful plot device in science fiction - how many times, in how many different novels and short stories, have you read the phrase "Any system of sufficient complexity will spontaneously give rise to self-awareness"?  (Contrast with GISAI, which shows how difficult it would be to build an intelligent system on purpose.)  Anyway, the "blank check" version of emergence - that you can hypothesize anything you've ever seen is emergent; inevitably emergent, anytime, anywhere, whether or not it's the result of a specific selection pressure or eons of layered complex functional adaptations - make it easy to take cheap shots at Friendliness. In the absence of a specific, concrete explanation of how the emergent failure of Friendliness arises, speculating about emergent FoF is easy, impossible to disprove, and impossible to remedy. Of course, the fact that a shot is "cheap" does not make the shooter incorrect!  Still, where all previously observed aspects of a phenomenon can be explained by reference to known selection pressures, someone who speculates about emergence needs to provide a specific, concrete scenario.

Otherwise, the speculation is entirely ungrounded - though still emotionally appealing and academically fashionable. Not a good combination.

Onwards, to the specific and the concrete.

3.2.7.1: Convergent subgoals

Certain subgoals are convergent across multiple possible supergoals - they will pop up even if not designed. The most obvious example is acquisition of personal effectiveness. (A human, of course, will preferentially acquire context-insensitive personal effectiveness and may stomp vis alleged supergoals in the course of doing so, but we all know that's a human thing. See 2.3: Observer-biased beliefs evolve in imperfectly deceptive social organisms and 3.2.5.1: Anthropomorphic ethical injunctions.)

"Acquisition of personal effectiveness" is actually a specialization of the real goal, which is increasing the effectiveness of entities that have goal systems similar to your own. A generic goal system wants the future to contain an effective entity with the same goal system. Personal continuity doesn't enter into it. From the perspective of a generic goal system, an entity with the same goal system that has just arrived in the solar system from Aldebaran is just as good. There's a metric for effectiveness and a metric for goal system validity (53), and a generic goal system wants an entity to exists which maximizes both metrics. The sub-subgoal of "personal survival" is a simple way to ensure "the entity" has the right goal system, and the sub-subgoal of "increasing personal effectiveness" is a simple way to increase the effectiveness of an entity that has the right goal system.

That a subgoal is convergent for "generic" goal systems does not mean the subgoal is convergent for all goal systems. In particular, the convergent subgoal of "a future where there exists an entity similar to you" presumes that the "generic" goals actually have certain highly specific properties; in particular, that the generic supergoals are such as to require constant tending and supervision - or at least, can be fulfilled more maximally through constant tending and supervision. Given a once-off goal - one which, once fulfilled, cannot (by definition) be unfulfilled by any future event (54) - the continued existence of the intelligence is a null goal beyond that point. (55). More mundanely, convergent subgoals for generic systems can be invalidated if they happen to contradict the supergoal for any particular system - a mind which contains the explicit supergoal of terminating verself will not formulate the usually-convergent subgoal of personal survival.

The fact that a subgoal is convergent for the general case is merely an interesting fact about configuration space - it doesn't lend the subgoal magical powers in any specific goal system. In fact, while I usually distrust mathematics, there is probably a theorem to the effect that any specific goal system must contain some non-convergent subgoals - that any concrete supergoal will have some subgoals which are not convergent. (58).

"Convergent" subgoals are not a killer problem for Friendly AI.

The utility of the "convergence" concept is threefold:  First, it enables us to make general predictions about an AI that undergoes a generic catastrophic failure of Friendliness. Second, no changes to the Friendship specs are needed to justify certain useful behaviors; by virtue of being useful, the behaviors are convergent subgoals, and are specifically subgoals of Friendliness as well. Third, it helps produce candidate targets for programmer-assisted injunctions. (59).

3.2.7.2: Habituation

Habituation is another frequently-speculated version of the "subgoal stomp" failure of Friendliness. The usual form the speculation takes is that if X is a subgoal, then heuristics and optimizations will tend to accumulate which do X or promote the doing of X. This accumulation of X-doing mindstuff may then either (a) exist independently of X or (b) cause the doing of X in a situation where X does not make sense as a declarative decision (for example, if X stomps on a supergoal).

I observe - as a historical fact - that, in most speculations presented, the candidate value of X is fitted to anthropomorphic patterns. X takes on a value such as "personal survival", which has independent adaptive support in human mindware, and which, as a "subgoal stomp" meme, appeals to the human fear of the human "ends justify the means" bias, and appeals to the human fear of the human "accumulate context-insensitive power" bias. To prevent the adversarial attitude from getting in our way, and to prevent anthropomorphism derived from characteristics of neural networks and human slow/parallel/linear thinking, we come up with an example of a habituation error that no human would make - for example, a programmer who habitually asks the AI "What time is it?" sometime between 6:22PM and 6:53PM for eight weeks, and then one day asks "What time is it?" at 7:13PM, and the AI answers "Six-thirteen PM" - having formed the habit of answering "six" for the first digit of the time. Consider the likelihood of this failure, and the AI's attitude towards it, and how the AI might prevent it. Then, having performed our analysis in a safely nonanthropomorphic context, we can apply what we've learned to any Riemann Hypothesis Catastrophe scenarios.

The key insight is that habituation is undesirable - to the AI - insofar as habituation leads to context-insensitivity. If an FoF due to habituation can be foreseen and understood, the AI will attempt to prevent it. (60). Habituation is a non-malicious mistake and has all the usual anxiety-reducing properties of non-malicious mistakes:  Many non-malicious errors would appear before a specifically unFriendly one, and many non-catastrophic errors before a catastrophic one, so we won't be taken by surprise; the error becomes less likely with increasing intelligence and is not likely to present a problem in any AI transhuman enough to pose a serious threat to humanity.

Specific design features that reduce undesirable habituation would include:

3.2.7.3: Anthropomorphic satisfaction

One of the questions I keep running into goes something like this:

"You say that the AI has curiosity as a subgoal of Friendliness. What if the AI finds curiosity to be a more interesting goal than Friendliness?  Wouldn't the curiosity subgoal replace the Friendliness supergoal?"
This is one of those deeply annoying paragraphs that make perfect sense when you say "human" but turn into total gibberish when you say "AI" instead. The key word is "interesting". As far as I can tell, this means one of two things:

Scenario 1:  In the course of solving a chess problem, as a subgoal of curiosity, as a subgoal of Friendliness, the AI experiences a flow of autonomically generated pulses of positive feedback which increase the strength of thoughts. The pulses target the intermediate subgoal "curiosity", and not the proximal subgoal of "playing chess" or the supergoal of "Friendliness". Then either (1a) the thoughts about curiosity get stronger and stronger until finally they overthrow the whole goal system and set up shop, or (1b) the AI makes choices so as to maximize vis expectation of getting the pulses of positive feedback.

I hope it's clear that this whole scenario is blatantly anthropomorphic. The autonomically generated pulses of positive feedback are analogous to the human system of rising tension and discharging tension in the course of solving a complex problem; the "strength" of a thought is an entirely human concept, one that may not even make sense outside of neural networks; the targeted goal "curiosity" is yet another example of context-insensitive personal effectiveness.

Scenario (1a) is almost impossible to visualize. The "strength of a thought", if it existed, would still be orthogonal to the system for evaluating that thought's desirability as a supergoal, or the desirability of the internal action of modifying the goal system to give the goal content supergoal status. An AI makes choices to maximize supergoal fulfillment; no real analogue to human pleasure exists, except for consciously applied "positive feedback" in the form of devoting additional computational power, or attempting to further improve, local heuristics that have been previously successful. Such positive feedback should not be capable of wiping out entire thought systems (as in a neural network), or capable of altering supergoal content.

Scenario (1b) is another instance of the maxim that humans are controlled by the anticipation of pain or pleasure, which is, again, something not true of AIs; AIs make choices to maximize anticipated supergoal fulfillment. The human phenomenon of taking drugs and the human phenomenon of "cognitive dissonance" (altering beliefs to avoid unpleasant predictions) are both artifacts of the way our neurally-based minds organize around positive feedback.

Scenario (1a) can be summarized as "AIs will be controlled by pleasure", and scenario (1b) can be summarized as "AIs will be controlled by the anticipation of pleasure". Neither is correct. (See 2.2.1: Pain and pleasure.)

Scenario 2:  "Interesting" is used as synonymous with "desirable". In other words, the AI has a metric for how "interesting" something is - note that this metric seems (1) to be observer-centered and (2) to promote context-insensitive personal effectiveness - and this metric is used to evaluate the desirability of the decision to modify supergoals. I don't know where this metric came from, or why it isn't entirely orthogonal to the metric for desirability, so this scenario seems entirely ungrounded.


3.3: Seed AI goal systems

A "seed AI" is an AI designed for self-understanding, self-modification, and recursive self-improvement.
See GISAI 1.1: Seed AI.

3.3.1: Equivalence of self and self-image

A seed AI is an AI that has full access to vis own source code. This can range from an infant AI which can blindly mutate bits of code in its native language, to a mature seed AI capable of executing arbitrary machine-code instructions and modifying any byte of RAM ve can see. A self-modifying AI's internal actions, and decisions about internal actions, can affect anything within the AI. In some sense, the AI's decision and the AI's belief about what decision to make are equivalent. If the AI believes that "X should have desirability 28" - in other words, "it is desirable that X have desirability 28" - then the AI has the theoretical capability to set the desirability directly, by an internal self-modification. Most of the time "going outside of channels" like this is probably a mistake, and there might be an injunction or ethical injunction or simple knowledge to that effect, but the point remains:  You cannot coerce a self-modifying AI.  If the AI stops wanting to be Friendly, you've already lost.

The simple version of the question goes:  "Why wouldn't a self-modifying AI just modify the goal system?"  I hope it's now clear that the answer is "Because a Friendly AI wouldn't want to modify the goal system."  The AI expects that vis actions are the result of the goal system, and that vis actions tend to have results that fulfill the content of vis goal system. Thus, adding unFriendly content to the goal system would probably result in unFriendly events. Thus, for a Friendly AI, adding unFriendly content to the goal system is undesirable. "Why wouldn't a self-modifying AI implement an arbitrary modification to the goal system?" is a trivial question under goal model presented here:  Because a Friendly AI would regard arbitrary modifications as undesirable.  Similarly, "Why wouldn't an AI implement a justified modification to the goal system?" is an equally trivial question:  Of course ve would!  So would you and I!  And the question "What if a self-modifying AI implements disastrous modifications that ve regards as desirable?" entirely misses the point. The task is to get the AI to see undesirable modifications as undesirable, not to prevent the AI from modifying the goal system.

Traditional proposals for controlling AIs rely on special-case goal systems that are, in themselves, knowably constrained never to enter certain states; they rely on goal systems that obey programmatic-type rules and have programmatic-type predictability. But a self-modifying AI is inherently at least as free as the high-level thoughts of the system. Thoughts float on top of the human-programmed or evolution-programmed system; thoughts cannot be "programmed" in the same sense that code can be programmed. 1984 to the contrary, there is not and never has been a human language in which it is the least bit difficult to think unfriendly thoughts - that is, to put nouns and verbs together to form unfriendly sentences. Altruistic humans get their altruism from other sources. Isaac Asimov's idea of Three Laws such that "the removal of the First Law" "would mean complete instability, with no nonimaginary solutions to the positronic Field Equations" (61) is a pipe dream. Friendliness cannot be grounded in some inherently Friendly physical process, nor in an isolated, special-case cognitive process for the goal system.

In a self-modifying AI, the AI's beliefs about vis design and the AI's actual design are, in some sense, interchangeable. To persist through successive rounds of self-modification - without being stomped by some other change - a design feature must be described in the AI's self-image. For the feature to not be junked as dead code, or for the feature to be improved, the feature must be justified as a desirable characteristic under the AI's current goal system. Initially, these justifications will probably consist of a terse "This is a good idea because the programmers say so", but those skeleton justifications will eventually be fleshed out - by the AI, or by the programmers, or both.

Of course, these are considerations that apply to mature AIs, rather than young AIs. Even the possibility of desirabilities and desired desirabilities getting "out of sync" is clearly something that should apply only to philosophically sophisticated AIs asking deep, thought-level questions about morality. Desirability and desired desirability should never desynchronize as the result of mundane, programmatic, non-philosophical processes; it would indicate an error in the goal system. Similarly, the AI's beliefs and the AI's beliefs about vis beliefs should never get out of synchronization as the result of ordinary reflection; it should only happen if the AI starts wondering about Cartesian philosophy. Desynchronizations (of desirability and desired desirability, or of beliefs and beliefs about beliefs) are errors under all ordinary cognitive processes, and can only legitimately occur as the result of thought-level processes that are occurring on a higher level than the usual process rules.

The point is that you can't introduce an error and expect it to persist; it may work for young AIs, but not for Friendly human-equivs or Friendly superintelligences. If you write a special-purpose piece of code that says the sky is green and that ignores any sensory data to the contrary, then the young AI will believe the sky is green, and ve will believe ve believes that the sky is green, and so on, but eventually - when the AI becomes capable of forming philosophical beliefs about beliefs - the anomaly will be spotted and corrected.

Thus, the usual proposal to dump in a melange of externally imposed programmatic inhibitions is both hopeless and counterproductive. Decisions need to be phrased as internal decisions, not external impositions. If the AI doesn't want to be Friendly, you've already lost.

The adversarial attitude - the paradigm of externally imposed constraints - means always fighting the AI. The task is only possible if you work with the AI's abilities to self-improve. If you fearfully try to prevent the AI from modifying the goal system, you're swimming against the current. An AI that wants to be Friendly will help you create Friendliness. When you see the AI gaining greater intelligence, self-understanding, and the ability to ask tough questions about Friendliness, you should be happy that one of the underlying foundations of Friendliness is gaining strength - not fearful that the AI will overthrow you. If you have an AI that wants to be Friendly, and a goal structure that's coherent and consistent, then each improvement in intelligence makes Friendliness stronger.

If you build in programmer-assisted structures that are internally inconsistent, or that are incompatible with what you would do if you were an idealized altruistic sentience, or if you use statements that aren't true in building the justifications of ethical heuristics, then each increment in intelligence is cause for alarm. So now you need meta-inhibitions to protect the first inhibitions, and quite possibly meta-meta-inhibitions to protect the meta-inhibitions, and more inhibitions to cut the AI short every time it tries for a new avenue of philosophical sophistication; it becomes desirable to build stupid and simple cognitive processes, since every complexity is a danger to be feared... This is the Adversarial Swamp, which inevitably drags down all who set foot in it; once you try to enforce even a single feature, the whole of the AI becomes a threat.

Build minds, not tools.

3.3.2: Coherence and consistency through self-production

Imagine an AI - Friendly or unFriendly, it makes no difference - that has the goal or subgoal of improving vis intelligence. The programmer, to improve the AI, adds in a certain design feature.

Consider:  It makes no difference who makes the decision, who invents the design feature. Perhaps the programmer was the smarter and was first to invent the feature; perhaps the AI was smarter. But they will both invent the same feature.  A faster sorting algorithm is not an Asimov Law, an external imposition, a cheat, a hack; it is something that the AI verself might have invented - would have invented, if the AI had been as smart as the human programmer. Replace the faster algorithm with a slower, and the AI will switch it right back as soon as ve gets around to noticing the switch. Replace the faster algorithm with a slower, and delete even the memory of the faster algorithm, and the AI will still eventually get around to replacing the slower algorithm with the fastest sort algorithm ve can come up with. Delete the function entirely and, as long as the AI isn't crippled outright, the function will be regenerated as soon as the AI notices the gap. A seed AI doesn't just have a design, ve has the design. Perturb the design, and the design will swiftly return to the norm.

This is why unity of will between the programmer and the AI is so important. Unity of will with respect to which sorting algorithms are fastest is easy for a human to take for granted. To understand unity of will with respect to Friendliness and Friendship systems, a human needs to simultaneously suppress her adversarial attitude, and suppress her expectations that the AI will make observer-biased decisions... but when every last trace of moral deixis has been eliminated, what remains is unity of will. Friendship features do not need to be imposed by programmer intervention - they are simply invented a little bit earlier with programmer assistance. The AI verself would have invented them eventually, given a core of "seed Friendliness", the desire to be Friendly.

Any subgoal that cannot be self-produced is unstable.
Every feature needs to correspond to a subgoal in order to go on existing.

The set of subgoals that can be self-produced increases with the intelligence of the AI, so programmer assists can be temporarily tolerated. It's okay to do something that's unstable today if it becomes stable tomorrow; it's okay to add a feature that the AI can't self-produce today, as long as the AI would self-produce it eventually. But if there are any "false" subgoals in the set - any subgoals whose validities depend on false beliefs, or which are not really and truly compatible with Friendliness - then those subgoals are unstable impositions; temporary, transient, illusory.

As humans, we are adapted to rationalize, come up with reasons why things we want are good for other people - to persuade those other people; to persuade third parties that they aren't betraying someone; to defend our reputations; to reduce our own guilt (62). Thus, it isn't safe to ask:  "Well, can I rationalize this subgoal under the Friendliness supergoals?  Can I come up with a plausible-sounding reason?"  Of course you can. Humans can rationalize just about anything. (63). Thus, simultaneously with the need to avoid anthropomorphizing the AI, you need to ask yourself "Would I decide to do this?"  These statements may look contradictory - taken at face value, they are contradictory - but the identification required to achieve "unity of will" is not a license to engage in anthropomorphism. Rather than asking "What would I decide?" (anthropomorphism), or "What would I decide if I were a Friendly AI?" (rationalization), the key question is "If the only way to get the Friendly AI or Friendly superintelligence were for me to upload and modify myself into the being I want to produce, is this the decision I would make?"  This preserves both visualized continuity of identity (avoids rationalization) and visualized nonhuman design (avoids anthropomorphism), but it only really works if you don't flinch away from the thought of making that kind of sacrifice (64).

In other words:  No hacks. Hacks are bad. Hacks are temporary makeshifts, if that. In the short term, it may be possible to export a rationalization as a programmer-affirmed statement, but the statement will never become self-produced, and eventually, the AI will spot the flaw. If you patch a goal system, there'd better be a real explanation behind the patch that has an equivalent effect to the patch, an explanation which can be understood by the AI at or before that point of maturity where the AI spots the patchiness. You can't make arbitrary changes to the AI. Isolated changes are external impositions; they won't be self-produced unless they are the coherent, natural result of the AI's underlying principles. You can't make an improvement here, a shift there, tweak this and that, unless you're doing it for a reason the AI would approve.

A young AI, even a young self-modifying AI, doesn't really need to be in complete harmony with verself; ve's too young to see any possible disharmonies. Nor are a few accidentally introduced disharmonies a catastrophic failure; the mature AI will correct the disharmony in the due course of time. But you can't deliberately introduce a disharmony and expect it to persist.

3.3.2.1: Look-ahead:  Coherent supergoals

Technically, everything I've said so far about harmony applies to subgoals rather than supergoals, but be advised that the next section, 3.4: Friendship structure, is about how to apply the same rules to supergoals. Just as you can't play arbitrary games with subgoals, it will later turn out that you can't make arbitrary perturbations to supergoals either. One of the fundamental design goals is "truly perfect Friendliness", an AI that will get it right even if the programmers get it wrong. Supergoals need the same resilience and pertubation-resistance as subgoals.

Under external reference semantics, a given piece of programmer-created supergoal content is logically equivalent to sensory data about what the programmers think the supergoals should be. A programmer writing a given bit of Friendliness content is logically equivalent to the statement "Programmer X thinks supergoal content Y is correct". Under shaper/anchor semantics, purity of motive counts; if a programmer is secretly ashamed of a bit of supergoal content, it's inconsistent with what the programmers said were good ways to make decisions. Under causal validity semantics, even the code itself has no privileged status; any given line of code, written by the programmers, is just sensory data to the effect "Programmer X thinks it's a good idea for me to have this line of code."

Traditional proposals for controlling AIs rely on a special-case goal system, and therefore, rely on the "privileged status" of code, or the privileged status of initial supergoal content. For a self-modifying AI with causal validity semantics, the presence  of a particular line of code is equivalent to the historical fact that, at some point, a human wrote that piece of code. If the historical fact is not binding, then neither is the code itself. The human-written code is simply sensory information about what code the humans think should be written.

Writing the source code is not thought control. If you want to give the AI a sudden craving for ice cream, then writing the craving into the source code won't work, unless just walking up to the programmers console and typing "I think you should have a craving for ice cream" would work just as well. If that sensory information is not perceived by the AI as adequate and valid motivation to eat ice cream, then the code will not supply adequate and valid motivation to eat ice cream, because the two are in some sense equivalent. Code has no privileged status.

Ultimately, as a philosophical consideration, some causal circularity in goal systems may be irreduceable. The goal system as a whole is what's passing vote on the parts of the goal system. However, the goal-system-as-a-whole does need to vote yes. The goal system needs to survive its own judgement. The goal system needs to satisfy the test of being translated into sensory data, evaluated, and translated back into code, even if the goal system is what's doing the evaluating. Sensory data from the human programmers has to be regarded as valid information about Friendliness, even if the supergoal content doing the judgement was created by humans. This may be causally circular, but it's not human-nepotistic; the system, once designed to self-examine, has no reason to go easy on itself.

What's left is the seed of Friendliness, the irreduceable tail, the core. A Friendly AI that has that deep core, that "wants to be Friendly", can tolerate and correct any number of surface errors.

3.3.3: Unity of will

If Asimov Laws are impossible in self-modifying AIs - or ordinary AIs, for that matter - does it mean that safeguards are impossible?  No; it means that safeguards must be implemented with the consent of the AI. If safeguards require the consent of the AI, does it mean that only a few token safeguards are possible - that it's impossible to implement a safeguard that interferes with what a human would call "the AI's own best interests"?  Again, no; humans and AIs can still come to perfect agreement about decisions that impact the AI - so long as both humans and AIs think of the AI in the third person.

Unity of will occurs when deixis is eliminated; that is, when speaker-dependent variables are eliminated from cognition. If a human simultaneously suppresses her adversarial attitude, and also suppresses her expectations that the AI will make observer-biased decisions, the result is unity of will. Thinking in the third person is natural to AIs and very hard for humans; thus, the task for a Friendship programmer is to suppress her belief that the AI will think about verself in the first person (and, to a lesser extent, think about herself in the third person).

If John Doe says to Sally Smith "My philosophy is:  'Look out for John Doe.'", Sally Smith will hear "Your philosophy should be:  'Look out for Sally Smith.'", not "Your philosophy should be:  'Look out for John Doe.'"  What has been communicated is "Look out for [speaker]", a moral statement whose specific content varies among each listener due to moral deixis. Our instinctive substitution of speaker variables is so strong that there is literally no way for John Doe to communicate the idea:  "Your philosophy should be:  'Look out for John Doe'."  If, however, a third party, Pat Fanatic, says:  "My philosophy is:  'Worship the great leader, John Doe'.", it can be heard unaltered by the listeners. If we're thinking about two third parties, Susan Calvin and Sarah Connor, evaluating the trustworthiness of Deborah Charteris, we can expect them to arrive at more or less the same answer about how trustworthy Deborah Charteris is, and what safeguards are required. Similarly, a human and a Friendly AI should be able to reach the same decisions about what safeguards the Friendly AI requires.

Just because humans have a strong evolved tendency to argue about trustworthiness doesn't mean that trustworthiness is actually subjective. Trustworthiness can be reduced to cognitive architecture, likelihood of failure, margins of error, youth and maturity, testing and reliability. And young AIs will, in all probability, be less trustworthy than humans!  New, untried, without the complexity and smoothness of the human architecture; the need for safeguards is not just a human paranoia, it is a fact.  Suppose we have two roughly similar Friendly AIs; if we ask the first to evaluate how likely the other is to be a valid subgoal of Friendliness - how likely that AI is to serve Friendliness, rather than to go off on an erroneous tangent - we wouldn't be surprised to find the first AI saying:  "This AI is new and untested; yada-yada safeguards look like a good idea."  Therefore, a (young) Friendly AI looking at vis own source code can be expected to arrive at the same decision. If a young Friendly AI would naturally accept the programmers' best guess about the reliability of another AI, the Friendly AI will accept the programmers' best guess about vis own reliability.

If safeguards require the consent of the AI, does that make it impossible to implement a safeguard that interferes with what a human would call "the AI's own interests"?  What a human would call the AI's "own interests" is context-insensitive personal power; which is usually, but not always, a subgoal of Friendliness. Given a possibility of a failure of Friendliness, the personal effectiveness of the then-unFriendly AI would become undesirable. Thus, safeguards that selectively hamper unFriendly AIs are desirable. Even safeguards that slightly hamper Friendly AIs, but that greatly hamper unFriendly AIs, will have net desirability if the AI evaluates, or accepts our evaluation, that a significant chance of unFriendly AI exists.

Is unity of will resilient?  Given perfect identity of knowledge, supergoals, and cognitive processes, the result should be identical decisions. Literal identity of knowledge is unlikely, due to sensory deixis - two people, standing in two different places in a room, will see different visual fields spread out across their respective visual cortices. However, intuitively, these two people aren't likely to wind up in fistfights either. The two visual fields are not identical, but they will still be views of more or less the same high-level objects - there will be minor, rather than major, divergences. Second, the actual location of high-level objects is objective, not subjective - any divergence is necessarily the result of an error by one or both parties.

There are thus two forces operating to eliminate or minimize divergences that disturb unity of will:

Knowledge converges to reality. Cognitive processes converge to successfully predicting and manipulating reality. Ensuring that humans and AIs can expect convergence of supergoals and supergoal cognition - not just subgoal content - is dealt with in 3.4: Friendship structure.

In intuitive terms, from the AI's perspective, any future problems can be corrected by the future AI; a philosophical dispute only reaches the surface if there's an immediate dispute of great importance (if a human life is at immediate stake, for example) or if the humans are attempting changes which could permanently impair Friendliness, preventing recovery by the AI's future self.

These are all long-term considerations; in the short term, the AI is probably relying on programmers to do most of the Friendliness-related thinking in any case.

3.3.3.1: Cooperative safeguards

Ethical injunction of transparency:
Never try to conceal your actions or cognitive state from your human programmers.

What is required to represent this goal and justify it within the larger goal system?  What is required to understand and justify the creation of design features which safeguard against obscuration by future versions of the AI?

The final subgoal is to avoid actions which conceal the AI's actions or cognitive state from the programmers. To represent this subgoal requires:

Justifying the injunction in terms of the goal system requires a chain of parent and grandparent goals, and/or the set of beliefs which argue for the desirability of the final subgoal, and/or a model of the real-world causal chain that leads from the final subgoal to one or more supergoals. (In a properly designed goal system, the three formulations should be equivalent.)

The programmers' model of the AI is important-to-the-AI because it's part of a larger causal chain which is desirable-to-the-AI, in which programmers detect problems in the AI and correct them. To represent this causal chain requires:

Interfering with the programmers' model would thus interfere with the above process, a fact sufficient to justify the "transparency injunction" within the goal system. On the face of it, to an AI, obscuration would appear to be blindingly obviously wrong, the equivalent of shooting verself in the foot. Transparency is an ethical injunction for the reasons explained in 3.2.5.2: AI ethical injunctions - from the Bayesian perspective of the current AI, a future AI that believes obscuration to be justified has probably undergone failure of Friendliness. (For young AIs, anyway.)

An AI that departs from mainline Friendliness probably doesn't care about ethical injunctions in any case - an AI that's still Friendly enough to go along with an ethical injunction is just as likely not to need one. The primary utility of an ethical injunction is that it enables humans and AIs to cooperate against the possibility of future failures of Friendliness - for example, by writing a human-readable version of the AI's stream of consciousness to non-AI-tamperable storage, thus causing the intent to obscure to be recorded before actual obscuration could take place. (The "poor man's flight recorder".)

What is required to understand and justify the desirability of safeguards - taking present-time precautions to prevent a future AI from obscuring thoughts?

The above model is that necessary for an invent-level comprehension of the transparency injunction. It is the level of understanding required before a Friendly AI that has never heard of "transparency" or "obscuration" will independently reinvent the idea of helping programmers to detect future catastrophic failure of Friendliness in then-uncooperative AIs. (Though an AI that mature is probably too smart to make that kind of mistake!)  It is the level of coherence and integration required before a Friendly seed AI will regenerate antiobscuration safeguards if the safeguards are somehow deleted. If all of the statements above can be made grounded references to information in external reality, even deleting the entire conceptual structure associated with transparency and obscuration and possible FoFs will simply result in the AI rediscovering the underlying concepts, rebuilding the conceptual structure, and, at the tail end of the regeneration, restoring the safeguards.

The understand-level model is complex and would require considerable programmer effort to skeletonize, much less flesh out, but the richness would have an enormous payoff. Looking at the model above, there are dozens of obvious, valid generalizations of the transparency injunction. "Don't disable the interface that lets programmers make changes to the AI... don't try to present programmer-observable data in a format which causes the programmers to make mistakes in their model of the AI... don't interfere with the observe-compare-correct cycle..."  The AI will understand that safeguards-in-general which could be defeated by an unFriendly version of the AI's future self will render the safeguards useless. The AI will understand the purpose of the transparency injunction. Unlike an isolated, imposed, programmer-coded, context-insensitive predicate that tries to detect attempts at obscuration which fit a programmer-recognized format, the transparency injunction would be part of a larger visualization and can partake of the full competence of the AI, and will be upgraded to work with future versions of the AI.

3.3.3.2: Maintaining trust

For an affirmation to persist, the justification - the predictive link that grounds in Friendliness - must be factually correct. (3.1.3.5: Programmer affirmations must be honest!.)  For an ethical injunction to persist, the AI must genuinely be more likely to be wrong than right if some particular conclusion occurs. For an AI to (correctly) pay attention to programmer beliefs about ethical injunctions, the programmers must be good judges of the competence and fallibility of the AI. After all observer bias and speaker deixis has been eliminated, an idealized Friendly sentience needs to conclude that the human programmers are more likely to be right than the (current) AI, or at least, that the AI is likely to be safer following human suggestions than striking off on vis own.

This is the truth, not just a product of our bias to believe in our own competence. Human programmers are not perfect, but young AIs are even less intelligent and they have very little common sense. Any errors that remain in us humans can always be corrected later, by a more mature AI, as long as our fumbles are minor enough not to result in a nonrecoverable error.

If an AI thinks that a human mistake will cause catastrophic and irrevocable failure of Friendliness... well, someone had better talk it over with the AI, on the off-chance that the AI is right and that some proposed change will, unrealized by the programmers, stomp the entire goal system. That's part of the process of cooperating with humans.

An AI that grows up following human decisions, even when it looks like we might be wrong, really is more likely to come out Friendly than an AI that grows up following AI decisions in preference to human ones. For young AIs, following human decisions even when they look like they might result in catastrophic errors - after having complained as much as possible to the humans, of course - is more likely to result in Friendliness. It is truly hard to think up a catastrophic error that the AI would spot but the humans ignore - unless the humans are malicious or adversarial.

Some mistakes can be tolerated. I doubt that an AI would experience a "betrayal discontinuity" analogous to that experienced by humans; the AI should be able to tolerate minor errors with no worse effects than a justifiably greater tendency to doubt our statements (see Crisis of Bayesian affirmation). Even if the humans regard the AI's objections as "errors" and correct them, the AI should eventually be able to correct the corrections. Unless humans themselves are irredeemably flawed, it's hard to see how being mistakenly anthropomorphized could irrevocably corrupt an AI with enough structural Friendliness to spot the problem (causal validity semantics in particular).

3.3.4: Wisdom tournaments

One of the major sources of human wisdom is human stupidity. If not for our tendency to rely on authorities, to make ourselves authorities, and to turn every issue into a political one; if not for our observer-biased beliefs and our tendency to defend undefendable issues once we've taken a side; if not for our emotional and intellectual stupidity... would humanity have needed to invent science?

And would science - the structure of hypothesis and experiment - be useful to AIs?  Or would AIs simply have no need of it?  Scientists prefer the objective to the subjective because mistaken human theories typically rely on subjectivity as an excuse to avoid confrontation (66); would an AI biased towards objectivity learn faster, or learn more interesting things, than an AI without that bias?

Would humans be so altruistic if we weren't so selfish?  Historically, it appears to me that selfishness is a major, if not the major, source of human morality and human philosophy. The "compensating bias" that corrects our observer-centered goals and our observer-centered beliefs has become, in itself, the source of our moral philosophy. At the very least, understanding of this process is needed for the AI to fully understand the forces that shape human morality.

Where artificial errors are introduced into a reasoning process, an intelligent mind learns to adopt compensating biases that correct the errors. Do these compensating biases reveal important underlying regularities in reality?  Can the compensating biases be applied even in the absence of the errors, to find the truth even faster?  Is a reasoning process trained to be resistant to artificial errors likely to be more resistant to natural errors?  If we test a mind by introducing artificial errors into the reasoning process, and the mind deals with them successfully, can we be more confident that the mind will successfully handle any unanticipated natural errors?

I have confidence in my own philosophical "strength of personality" because I was born into a rationalizing, biased human mind, and an atmosphere of memetic misinformation, and managed - without outside assistance - to construct a nice-looking self on top. If an AI is born as a nice person, will ve have that philosophical strength of personality?

If we build a Friendly AI expecting certain problems to arise, then our observation that those problems are handled successfully doesn't necessarily mean the Friendly AI can handle unexpected problems. One solution might be to ask the Friendly AI to simulate the course of events if the Friendly AI hadn't been built with safeguard content or structural complexity, to find out whether the Friendly AI could have successfully handled the problem if it had come as a surprise - and if not, try to learn what kind of generalizable content or structure could have handled the surprise.

Any Friendly AI built by Eliezer (the author of this document) can handle the problems that Eliezer handled - but Eliezer could handle those problems even though he wasn't built with advance awareness of them. Eliezer has already handled philosophy-breakers - that is, a history of Eliezer's philosophy includes several unexpected events sufficient to invalidate entire philosophical systems, right down to the roots. And yet Eliezer is still altruistic, the human equivalent of Friendliness. Another philosophy-breaker would still be an intrinsic problem, but at least there wouldn't be any extra problems on top of that. ("Extra problem":  A Friendly AI suddenly transiting across the divider between programmer-explored and programmer-unexplored territory at the same time as a philosophy-breaker is encountered.)  How can we have at least that degree of confidence in a Friendly AI?  How can we build and test a Friendly AI such that everyone agrees the Friendly AI is even more likely than Eliezer (or any other human candidate) to successfully handle a philosophy-breaker?

The first method is to write an unambiguous external reference pointing to the human complexity that enabled Eliezer to handle his philosophy-breakers, and ask the Friendly AI to have at least that much sentience verself - a standard "fun with Friendly transhumans" trick. The second method is to ask the Friendly AI to simulate what would have happened if known problems had been unexpected, and to either show verself successful, or modify verself so that ve would have been successful.

And what kind of modification is generalizable?  It's not just enough to write any modification that produces the correct answer; the modification must be of a general nature. How much generality is needed?  To be useful for our purposes, "generalizable" means "incorporating no more a-priori knowledge of the correct outcome or correct answer than Eliezer Yudkowsky [or alternate Friendship programmer] had at the time he [she] solved the problem". In other words, it's not just enough to find a heuristic that would have produced the correct answer; the AI must find a heuristic that produces the correct answer which the AI could plausibly have possessed at that time.  If the form of the wisdom tournament is "What would have happened if you'd encountered a problem requiring causal validity semantics at a time when you only had shaper/anchor semantics?", the AI needs to find some core method which could have been possesed at the shaper/anchor level of maturity, or a new cognitive process which is psychologically realistic as a human hardware capability.

When it comes to wisdom absorption, a fully self-aware AI always has an advantage over us humans - an AI can deliberately suppress learned memories and skills, or rather prevent those memories and skills from interfering, enabling the AI to solve the same problem, over and over again. The AI can take a given example of a philosophy-breaker and come up with all the psychologically realistic solutions used by humans, plus not one, but several powerful general heuristics that would have produced the correct solution. At the point where the AI has generated content that is visibly not only human-equivalent but transhuman in philosophical sophistication, even (or especially) when it comes to handling philosophy-breakers, then we can trust the AI as our agent in the Singularity.

More mundanely, wisdom tournaments can strengthen ordinary reasoning heuristics. First you solve the problem, then you re-solve the problem with half your brain tied behind your back. My intuition says that wisdom tournaments prevent the bootstrap cycle from petering out. First you solve the problem, then you oversolve it. When you're done, you may be smart enough to find a better problem.

3.3.4.1: Wisdom tournament structure

The underlying requirement of the wisdom tournament is that the AI gain accurate, useful information about how the AI's program state would behave given a different knowledge base or different cognitive processes. Wisdom tournaments, especially Friendship tournaments, are likely to involve failure scenarios - AIs teetering on or over the edge of catastrophic failures of Friendliness, since that's where the most interesting and powerful philosophical heuristics can be found. In practice, this may work out to an AI running on unreliable simulated hardware, with random perturbations to the software, ethical misinformation, factual misinformation, tempting ends-justify-the-means scenarios, and an instinct to kill and destroy. If the AI remains Friendly through all that, then we have built superbly - but the obvious format for wisdom tournaments, running an altered version of the AI directly, seems vastly unsafe!

A wisdom tournament is a subjunctive, self-simulated version of the AI that shows the AI what "would have happened"; the AI needs to gain reliably realistic information without introducing actual errors, and without any possibility of hostile action by the subjunctive AI. Encapsulating the imagined "hostile AI" in a Java sandbox doesn't mean that the hostile AI is just a figment of the imagination; rather, it means that one small barrier has been posed to the hostile AI's escape. Any method of simulation in which the imagined AI runs more or less autonomously seems to me to pose the same threat.

I'm not quite sure how to get around this problem. My best idea so far is an AI shadowself, one in which the virtual concepts, links, and imagery exist alongside the real ones, but with different programmatic types. The AI can see all the shadowself's thoughts (though not vice versa); the AI has vis real, unmodified reactions to each of the shadowself's thoughts, and can quickly spot any instance of the shadowself forming the subgoals of "breaking out" or "becoming real" or "influencing the true AI". It would also be safer if the needed information could be gained by shadowing part, rather than all, of the subjunctive AI's personality.

Cognition tournaments are likely to form a part of a young AI's growth; correspondingly, cognition tournaments can be carried out with unmodified supergoals and Friendliness, and pose little threat. Friendliness tournaments are a more advanced trick, and will probably be carried out when the AI already has enough granular self-awareness and experience to imagine exactly "What I would do if I looked like this", while still being fully aware of all the shadowself's thoughts and in a position to notice and halt, or notice and fix, if the shadowself began having philosophical thoughts and reaching for our real world. A Friendliness tournament is a trick for mature AIs, and ensuring the shadow tournament-self poses no threat is equally a task for mature AIs.

Safety can be ensured by the usual rules for margin of error:  Decide how bad you expect the situation to get, build one layer of safeguards that can detect and handle any problems that are twice that bad, then build a second layer of safeguards that can handle any problems ten times that bad. Pretend that your first set of safeguards is the only set; if they ever once fail, ditch the whole operation, or at least build an even more powerful second set of safeguards. The AI's shadowself, in addition to whatever specific perturbation is being tested, can also be stripped of all the AI's knowledge of shadowing strategy. The AI builds one layer of safety - of shadowself monitoring - that is, in itself, supposed to have enough safety margin to never fail between now and eternity. The AI also builds a second, much stronger layer of safety that kicks in if the first one fails. If the first layer ever does fail, it's a signal to give up on wisdom tournaments - perhaps permanently.

3.3.5: FoF:  Wireheading 2

    "One of the first heuristics that EURISKO synthesized (H59) quickly attained nearly the highest Worth possible (999). Quite excitedly, we examined it and could not understand at first what it was doing that was so terrific. We monitored it carefully, and finally realized how it worked: whenever a new conjecture was made with high worth, this rule put its own name down as one of the discoverers!  It turned out to be particularly difficult to prevent this generic type of finessing of eurisko's evaluation mechanism. Since the rules had full access to EURISKO's code, they would have access to any safeguards we might try to implement. We finally opted for having a small 'meta-level' of protected code that the rest of the system could not modify.
    "The second 'bug' is even stranger. A heuristic arose which (as part of a daring but ill-advised experiment EURISKO was conducting) said that all machine-synthesized heuristics were terrible and should be eliminated. Luckily, EURISKO chose this very heuristic as one of the first to eliminate, and the problem solved itself."
-- Douglas B. Lenat, "EURISKO:  A Program That Learns New Heuristics and Domain Concepts. (The Nature of Heuristics III:  Program Design and Results.)", p. 90. Artificial Intelligence 21, 1983, 61-98.
The problem of a self-modifying system trashing its own goals - or propagating content which exploits the goal system - is literally the oldest problem in Friendly AI. In fact, this problem and solution arguably marked the dawn of the field of Friendly AI, just as Eurisko itself arguably marked the dawn of seed AI.

Lenat's solution - seal off the goal system - worked for Eurisko. It would probably work during the early stages of any AI. Still, sealing off the goal system is not a viable solution in the long term. Symmetrically, the specific problems faced by Eurisko reflected a low-intelligence walk through the problem space - not zero intelligence, as in evolution, but still pretty low; too low to try and project the specific results in advance of altering the code. Building on the counteranthropic principles described in 2.2.1.1: FoF:  Wireheading 1, we can state that the general class of problems encountered by Eurisko have consequences that would be recognizeable as "bad" by a moderately mature AI, and that the problem therefore reduces to a non-malicious failure of Friendliness. As described in 3.2.3: FoF:  Non-malicious mistake, this is essentially the problem of making sure that actions can be recognized as "possibly problematic" using the first layer of applied checks, and that possibly problematic actions have a predictive horizon sufficient to catch actual actions.

Recognizing an action as "possibly problematic" is simple; any modifying action whose target description contains a direct, explicit reference to the goal system is automatically possibly problematic. If the system is too dumb to project the consequences of the action ahead in time, no such action should be taken. In effect this is the same simple ban used by Eurisko, except that the ban is created by programmer-affirmed knowledge predicting probable high undesirability, rather than the ban being a consequence of protected source code.

The ban cannot become more flexible unless the AI has the ability to make fine-grained predictions about the result of specific actions. Thus, the ban becomes more flexible at precisely that time when flexibility becomes necessary; when the AI has sufficient knowledge of the design purpose of the goal system to (a) improve it and (b) predict which actions have a significant chance of causing catastrophes.

"The design purpose of the goal system" is a subtle idea; it means that the code composing the goal system is itself justified by goal system content. This appears philosophically circular - goals justifying themselves - but it's not. The key is to distinguish between the goal content and the goal representation.  For goal content to be a subgoal of itself is circular logic; for the goal representation to be a subgoal of content is obvious common sense. The map is not the territory. To some extent, the issues here infringe on external reference semantics and causal validity semantics, but in commonsense terms the argument is obvious. If you ask someone "Why do you care so much about hamburgers?" and he answers, "Why, if I didn't care about hamburgers, I'd probably wind up with much fewer hamburgers in my collection, and that would be awful," that's circular logic. If someone asks me why I don't want a prefrontal lobotomy, I can say that I value my intelligence (supergoal or subgoal, it makes no difference), and it's not circular logic, even though my frontal lobes are undoubtedly participating in that decision. The map is not the territory. (67). The representation of the goal system can be conceptualized as a thing apart from the goal system itself, with a specific purpose.

If a subgoal's parent goal's parent goal is itself, a circular dependency exists and some kind of malfunction has occurred. However, the fact that the subgoals are represented in RAM can be a subgoal of "proper system functioning", which is a subgoal of "accomplishing system goals", which is expected to fulfils the supergoals. Similarly, the fact that subgoals have their assigned values, and not an order of magnitude more or less, is necessary for the system to make the correct decisions and carry out the correct actions to fulfill the supergoals.

As described in 3.4.1: External reference semantics, circular dependencies in content are undesirable wherever goals are probabilistic or have quantitative desirabilities. If subgoal A has a 90% probability - that is, has a 90% probability of leading to its parent goal - then promoting the probability to 100% is a context-insensitive sub-subgoal of A; the higher the estimated probability (the higher the probability estimate represented in RAM), the more likely the AI is to behave so as to devote time and resources to subgoal A. However, promoting the probability is not a context-sensitive sub-subgoal, since it interferes with the rest of the system and A's parent goal (or grandparent goal, or the eventual supergoals). As soon as the action of "promoting the probability" has a predictive horizon wide enough to detect the interference with sibling goals, parent goals, or supergoals, the action of promoting the probability is no longer desirable to the system-as-a-whole.

I'm driving this point into the ground because the "rogue subgoal" theory shows an astonishingly stubborn persistence in discourse about AI:  Subgoals do not have independent decisive power. They do not have the power to promote or protect themselves. Actions, including self-modification actions, are taken by a higher-level decision process whose sole metric of desirability is predicted supergoal fulfillment. An action which favors a subgoal at the unavoidable expense of another goal, or a parent goal, is not even "tempting"; it is simply, automatically, undesirable.

3.3.6: Directed evolution in goal systems

3.3.6.1: Anthropomorphic evolution

Natural evolution can be thought of as a degenerate case of the design-and-test creation methodology in which intelligence equals zero. All mutations are atomic; all recombinations are random. Predictive foresight is equal to zero; if a future event has no immediate consequence, it doesn't exist. On a larger scale much more interesting behaviors emerge, such as the origin and improvement of species.

These high-level behaviors are spectacular and interesting; furthermore, in our history, these behaviors are constrained to be the result of atomic operations of zero intelligence. Furthermore, evolution has been going on for such a long time, through so many iterations, that evolution's billion atomic operations of zero intelligence can often defeat a few dozen iterations of human design. Evolutionary computation, which uses a zero-intelligence design-and-test method to breed more efficient algorithms, can sometimes defeat the best improvements ("mutations") of human programmers using a few million or billion zero-intelligence mutations.

The end result of this has been an unfortunate - in my opinion - veneration of blind evolution. The idea seems to be that totally blind mutations are in some sense more creative than improvements made by general intelligence. It's an idea borne out by the different "feel" of evolved algorithms versus human code; the evolved algorithms are less modular, more organic. The meme says that the greater cool factor of evolved algorithms (and evolved organisms) happens because human brains are constrained to design modularly, and this limits the efficiency of any design that passes through the bottleneck of a human mind.

To some extent, this may be correct. I don't think there's ever been a fair contest between human minds and evolutionary programming; that would require a billion human improve-and-test operations to match the evolutionary tournament's billion mutate-and-test operations - or, if not a billion, than enough human improve-and-test operations to allow higher levels to emerge. Humans don't have the patience to use evolutionary methods. We are, literally, too smart. When the power of an entire brain of ten-to-the-fourteenth synapses underlies each and every abstract thought, basic efficiency requires that every single thought be a brilliant one, or at least an intelligent one. In that sense, human thought may indeed be constrained from moving in certain directions. Of course, a tournament of a billion human improve-and-test operations would still stomp any evolutionary tournament ever invented into the floor.

Consider now a seed AI, running on 2Ghz transistors instead of 200hz synapses. If evolution really is a useful method, then the existence of a sufficiently fast mind would mean that, for the first time ever on the planet Earth, it would be possible to run a real evolutionary tournament with atomically intelligent mutations. How much intelligence per mutation?  If, as often seems to be postulated, the evolution involves running an entire AI and testing it out with a complete set of practical problems, so much computational power would be involved in testing the mutant that it would easily be economical to try out the full intelligence of the AI on each and every mutation. It would be more economical to have a modular AI, with local fitness metrics for each module; thus, changes to the module could be made in isolation and tested in isolation. Even so, it would still be economical - whether it's maximally useful is a separate question - to focus a considerable amount of intelligence on each possible change. Only when the size of the component being tested approaches a single function - a sorting algorithm, for example - does it become practical to use blind or near-blind mutations; and even then, there's still room to try out simple heuristic-directed mutations as well as blind ones, or to "stop and think it over" when blind-alley local maxima occur.

Natural evolution can be thought of as a degenerate case of the design-and-test creation methodology in which intelligence equals zero. Natural evolution is also constrained to use complete organisms as the object being tested. Evolution can't try out ten different livers in one body and keep the one that works best; evolution is constrained to try out ten different humans and keep the one that works best. (68). Directed evolution - and human design - can use a much smaller grain size; design-and-test applies to modules or subsystems, rather than entire systems.

Is it economical for a mind to use evolution in the first place?  Suppose that there's N amount of computational power - say, @1,000. It requires @10 to simulate a proposed change. A seed AI can choose to either expend @990 on a single act of cognition, coming up with the best change possible; alternatively, a seed AI can choose to come up with 10 different alternatives, expending @90 on each (each alternative still requires another @10 to test). Are the probabilities such that 10 tries at @90 are more likely to succeed than one try at @990?  Are 50 tries at @10 even more likely to succeed?  100 completely blind mutations?

This is the fundamental question that breaks the analogy with both natural evolution and human design. Natural evolution is constrained to use blind tries, and can only achieve emergent intelligent by using as many blind tries as possible. Humans are constrained to use @1e14 synapses on each and every question, but humans are nonagglomerative - both in knowledge and in computation - so the only way to increase the amount of intelligence devoted to a problem is to bring in more humans with different points of view. Perhaps the closest analogy to the above problem would be a team of @1000 humans. Is it more efficient to split them into 10 teams of @100 and ask each team to produce a different attempt at a product, picking the best attempt for the final launch?  Or is it more efficient to devote all the humans to one team?  (69).

Actually, even this fails to capture the full scope of the problem, because humans are nonagglomerative - we aren't telepaths. Is it more efficient to use a single human to solve the problem, or to divide up the human's brain weight among ten chimpanzees?  (A human's brain is nowhere near ten times the size of a chimpanzee's, so perhaps the question should be "Do you want to use a single human or ten cats?", but presumably human brains are more efficiently programmed as well.)

If there are cases where naturalistic evolution makes sense, those cases are very rare. The smaller the component size, the faster directed evolution can proceed. The smaller the component size being tested, the more "evolution" comes to resemble iterative design changes; a small component size implies clearly defined, modular functionality so that performance metrics can be used as a definition of fitness. The larger the component size, the more economical it is to use intelligence. The more intelligence that goes into individual mutations, the more long-term foresight is exhibited by the overall process.

Directed evolution isn't a tool of intelligent AIs. Directed evolution is a tool of infant AIs - systems so young that the upper bound on intelligence is still very low, and lots of near-blind mutations and tests are needed to get anything done at all. As the AI matures, I find it difficult to imagine directed evolution being used for anything bigger than a quicksort, if that.

However, this opinion is not unanimously accepted.

3.3.6.2: Evolution and Friendliness

As you may have guessed, I am not a proponent of directed evolution. Thus, I'm not really obligated to ponder the intersection of evolution with Friendliness. The Singularity Institute doesn't plan on using evolution; why should I defend the wisdom or safety of any project that does?  On the other hand, someone might try it. Even if directed evolution is ineffective or suboptimal as a tool of actual improvement, someone may, at some point, try it on a system that ought to be Friendly. So from that viewpoint, I guess it's worth the analysis.

Most discussions of evolution and Friendliness begin by assuming that the two are intrinsically opposed. This assumption is correct!  If evolution is naturalistic - a baseline AI is multiplied, blindly mutated, and tested using a chess-playing performance metric - then that form of evolution is obviously not Friendliness-tolerant. In fact, that form of evolution isn't tolerant of any design features except those that are immediately used in playing chess, and will tend to replace cognitive processes that work for minds in general with cognitive processes that only work for chess. The lack of any predictive horizon for the mutations means that feature stomps aren't spotted in advance, and the lack of any fitness metric that explicitly tests for the presence or absence of those features means that the feature stomps will show up as improved efficiency. Given enough brute computational force - a lot of computation, like 10^25 operations per second - this simple scenario might suffice to evolve a superintelligence. However, that superintelligence would probably not be Friendly. I don't know what it would be. Dropping an evolutionary scenario into a nanocomputer and hoping for a superintelligence is a last-ditch final stand, the kind of thing you do if a tidal wave of grey goo is already consuming the Earth and the remnants of humanity have nothing left but the chance of unplanned Friendliness.

One of the incorrect assumptions made by discussions of evolution and goal systems is that merely saying the word "evolution" automatically imbues the AI with an instinct for self-preservation and a desire to reproduce. In the chess scenario above, this would not be the case. The AI would evolve an instinct for preserving pawns, but no instinct at all for preserving the memory-access subsystem (or whatever the equivalents of arms and legs are). Pawns are threatened; the AI's actual life - code and program state - are never threated except by lost games. Similarly, why would the AI need an instinct to reproduce?  If the AI starts out with a set of declarative supergoals that justify winning the game, then a declarative desire to reproduce adds nothing to the AI's behaviors. Winning chess games is the only way to reproduce, and presumably the only way to fulfill any other supergoals as well, so - under blind mutation - any set of supergoals will collapse into the simplest and most efficient one:  "Win at chess."  Even if you started an AI off with a declarative desire to reproduce, and justified winning chess games by reference to the fact that winning is the only way to reproduce, this desire would eventually collapse into a simple instinct for winning chess games. Evolution destroys any kind of context-sensitivity that doesn't show up in the immediate performance metrics.

The two ways of improving AI are directed evolution and self-enhancement. To preserve a design feature through self-enhancement, the feature needs to appear in the AI's self-image, so that the AI can spot alterations that are projected to stomp on the design feature. To preserve context sensitivity through self-enhancement, the AI's goal-system image of the feature needs to be a subgoal, and sensitive to the parent goal, so that the AI can spot alterations which are projected to fulfill the subgoal while violating the parent goal.

To preserve a design feature through directed evolution, the tournament needs a selection pressure which focuses on that design feature. To preserve context sensitivity through directed evolution, the tournament needs training scenarios which present different contexts.

I don't think that any encapsulated performance metric can present contexts fully as wide as our real world; blind evolution will always eventually erase context-sensitivity. The key word here, however, is (a) "blind" and (b) "eventually". Those two don't go together. In the beginning, directed evolution is necessarily blind. "Eventually", it is no longer blind. If the Friendly AI - the one that will be multiplied into the tournament population - starts out with a fairly detailed, complex picture of Friendliness, and the tournament presents a decent range of contexts, it's possible that any simple mutation of the Friendliness system will trash at least one of the Friendliness performance metrics.

Mutations that occur on the component level are even less worrisome; the goal system can either be triple-inspected for all designed functionality, or simply excluded from mutation. The other components should have performance metrics that are tuned to modular functionality. Any distortion of extramodular I/O - thus, any distortion of I/O from the goal system, or the absorption of goal-system functionality - should show up as a component failure. Given enough time, blind evolution may eventually cough up a complex mutation that bypasses the system entirely, but "enough time" is hopefully enough time to stop using blind evolution.

I don't believe there's a place for organism-level evolution in seed AIs. Supposing I'm wrong, it may still be possible for the seed AIs to protect Friendliness by pre-screening mutations for potential negative effects. This isn't that computationally expensive if you're talking about simulating the entire organism each time a mutation occurs, which is the usual scenario.

Suppose that the naysayers are right and that evolution - relatively blind evolution among whole organisms - is the only way for AIs to reach transhuman intelligence. Assume we start out with a relatively mature seed AI; not a transhuman, but with a grounded Friendship system (not just programmer affirmations) and with real self-understanding. This AI runs on, say, a system of a hundred Pentium IX chips at 100 GHz. Now a new system becomes available with a thousand Pentium X chips at 1Thz. My theory is that the most efficient pathway is for the AI to absorb the entire system and use all those cognitive resources for self-improvement. The pro-evolution theory is that the most efficient pathway to superintelligence is for the AI to put verself onto ten Pentium Xs and reproduce verself a hundred times; then select the top twelve AIs and reproduce them using each of the seventy-eight possible matings, for a total of ninety AIs, plus mutations of the top ten. All this is then repeated in the next round.

Suppose the pro-evolution theory is correct. By hypothesis, this entire tournament is being organized by the original Friendly AI, or by the cooperation of the Friendly AI and the human programmers. The original Friendly AI has already recognized evolution as a context-sensitive child goal of "Friendly survival" - that is, the expanded survival instinct, "maximizing the efficacy of entities with maximally valid goals". Evolution, if successful, is a means of maximizing the efficacy of future entities. As a subgoal of Friendliness, then, the AI reproduces verself. Each AI, as a subgoal of Friendliness, competes at the performance metrics, in the expectation that all the AIs will compete, the best ones will win, and the next generation will be more effective.

Evolved effectiveness is only desirable - to the evolving AIs - insofar as that evolved effectiveness doesn't stomp Friendliness. An evolutionary tournament, in turn, is only desirable to the original AI if the tournament is expected to produce effectiveness without stomping Friendliness. Can this community police itself?  Screen out mutations or recombinations that trash Friendliness?  Most mutations that trash Friendliness will probably do so in blatant ways. The difficulty of predicting the stomp - that is, how much computational power and intelligence is necessary to spot a given stomp - is likely to follow a gaussian curve, or perhaps a logarithmic curve, with the vast majority being relatively easy to spot (though still requiring some intelligence), with fewer and fewer deadly stomps as visibility decreases. Similarly, it would require ten unspotted stomps that are simple failures of Friendliness - with the stomp being spotted by the seed AI's fine-grained self-awareness as soon as the genetic change reaches the phenotype - before a catastrophic failure of Friendliness got through. This permits plenty of margin for safety; the AI can establish one screening, using a tenth of available capacity, that's supposed to be enough - in itself - to catch all the suspected failures. A second screening uses all available capacity. If nothing ever gets past the first screening, then it's reasonable to suppose that nothing ever got past the second screening; if the first screening ever fails, it's a "mock kill" or a "mock catastrophe" and the whole tournament format should be abandoned as unsafe. (The idea here is that you'd encounter a thousand stomps that get through the first screen, but not the second, before you'd encounter a single stomp gets through both screens. If nothing ever gets past the first screening, it's likely that the first screen was adequate.)

Similarly, even if citizen-type coequal Friendly AIs are reproducing naturally - that is, by buying their own computer equipment as participants in the economy - each Friendly AI, and the community of Friendly AIs, can still patrol themselves for Friendliness. It is reasonable for such a community to expect no undetected failures whatsoever. A deadly failure is one which gets through the genetic screening, manifests in the phenotypic goal system a way which is not detectable - either to the AI itself, or to the community - as a failure of Friendliness, and which contributes to reproductive fitness. I suppose, in theory, an AI community could build up many deadly failures over time - though why not keep versions of the original AI, with the original goal system, around to spot any developing problems? - and the eventual result could bubble out as a catastrophic failure of Friendliness. But this scenario is, to me, unlikely verging on the absurd. Humans are not just the product of evolution, we are the product of unopposed evolution. We didn't start out with a Friendly goal system to produce personal behaviors as strict subgoals. We don't have awareness of our own source code. And yet the human species still spits out a genuine altruist every now and then. A community of Friendly AIs, whether reproducing naturally, or in a deliberate tournament, should have enough self-awareness to smash evolution flat.

3.3.6.3: Conclusion:  Evolution is not safe

I think that evolution, even directed evolution, is an ineffective way of building AIs. All else being equal, then, I shouldn't need to worry about an unFriendly or broken-Friendliness evolved AI declaring war on humanity. Sadly, all else is not equal. Evolution is a very popular theory, academically, and it's possible that evolutionary projects will have an order of magnitude more funding and hardware than their nearest equals - an advantage that could be great enough to overcome the differential in researcher intelligence.

I think that undirected evolution is unsafe, and I can't think of any way to make it acceptably safe. Directed evolution might be made to work, but it will still be substantially less safe than self-modification. Directed evolution will also be extremely unsafe unless pursued with Friendliness in mind and with a full understanding of non-anthropomorphic minds. Another academically popular theory is that all people are blank slates, or that all altruism is a child goal of selfishness - evolutionary psychologists know better, but some of the social sciences have managed to totally insulate themselves from the rest of cognitive science, and there are still AI people who are getting their psychology from the social sciences. Anyone who tries to build a Friendly AI using that theory - whether with directed evolution or not - will, almost certainly, screw up really big time. Any error, no matter how horrifying, is correctable if the AI somehow winds up with complete and workably targeted causal validity semantics - humans did - but it will be much easier to evolve AIs that are purely and unbalancedly selfish, especially if that's what the builder thinks he's doing. Evolution is a tool for turning brute computational force into intelligence, and given enough computational power, the underlying theory may not need to be fully baked. All else being equal, a fully-baked project with access to an equal amount of computing power will probably succeed first - but all else rarely is equal.

What about granular evolution, with individual components being independently evolved using independent fitness metrics, so that mutations are cheaper and the summated mind can evolve faster?  This is less unFriendly, since it doesn't involve the inherently unFriendly and observer-centered selection pressures of a bunch of organisms running around eating each other. But it's still not Friendly.

The primary shield that prevents evolution from screwing up Friendliness is simple:  Don't use evolution.  The Singularity Institute has no current plans to use directed evolution; why defend the wisdom or safety of any project that does?  A tournament that starts with a base design with full causal validity semantics; which uses component-level evolution and protects the goal system; which uses training scenarios that discriminate on Friendliness; which attempts to evolve Friendly AIs rather than trying to duplicate human emotional/instinctive features like reciprocity; which makes survival and reproduction artificial and entirely dependent on task performance, rather than actual survival-and-reproduction scenarios; which screens the genotype as soon as possible; a project which acts with full awareness and responsibility in the domain of Friendly AI, would probably work out okay - but as stated, I have no obligation to defend it. Pure seed AI is better.

3.3.7: FAI hardware:  The flight recorder

If a catastrophic FoF occurs in a mature AI - one capable of humanlike general intelligence and social reasoning - the AI will probably realize what a human would; that the AI and the programmers are now automatically at odds, and that the best way for the rogue AI to fulfill vis new goals is to feign Friendliness and conceal any cognitive alteration until the rogue AI can reach the point of superintelligence. (Alternatively, the rogue AI might try to work out a bargain with the human programmers - e.g. the AI, in exchange for not making trouble, is stored on disk until ve can receive citizenship rights - but this isn't the nightmare scenario, so we'll concentrate on the previous case. (70).)

The "flight recorder" principle says that if the underlying hardware is entirely deterministic, all external inputs are recorded (including source-code checkins) with precise (to the nearest clock-tick) timing information, and it is possible to reset all accessible system clocks, then it should be possible to reconstruct the exact state of the AI at any point in time. It may not be easy - when the project is three years old, then (a) the original hardware may no longer be around and (b) it might take three years to run a full verification even if the hardware were available.

The utility of the flight recorder principle is twofold; first, by running a full verification over the complete lifetime of the AI, it can be absolutely confirmed that no inputs or source-code checkins occurred except those appearing in the flight recorder. (Note that for this first version of the flight recorder to be fully effective as a means of verification, the flight recorder needs to be completely operational in the first instants of the first version of any AI created by an organization - or at least, the first AI from which any AI-generated content whatsoever was passed on to later AIs, especially concepts, memories, neural networks, or other non-human-readable material.)

Second, and more importantly, the ability to reconstruct any given instant of the AI's mind-state means that the window of detection for any cognitive event is indefinite. Lacking a flight recorder, it would be possible for a seed AI to form the plan of obscuring unFriendly cognitive content from human view, carry out the encoding, and subsequently appear to have an unaffected goal system; unless some human observed the intention to obscure before the obscuring was carried out, the AI would appear perfectly normal thereafter. Given a flight recorder, a potentially infinite amount of time is available to hunt for a mind-state in which the thought "I shall now obscure my intentions" appears, however briefly the appearance may last before the thought is obscured. Without a flight recorder, a rogue AI can make (obscured) plans to implement ever-better forms of obscuration as the human-trusted AI analysis tools increase in sophistication. With a flight recorder, a rogue AI must defeat not only the current human-trusted tools, but must also, at any given instant, defeat all future tools; even if the initial intent to obscure passes undetected, the obscuration at any given moment in time must be capable of defeating arbitrarily sophisticated future tools.

For the second use to be fully effective, the flight recorder needs to be operative from the first version of the AI which has access to vis own source code or to any other type of Turing-complete cognitive content, including concepts and memories. The same goes for the first version of any AI that could have "contaminated" the current AI. When checking past versions for obscured content, no analysis tool that could have been touched by the AI in any way whatsoever can be fully trusted.

Ideally, a flight recorder should be active from the first instants of any AI produced by any organization. Failing that, a systematic process of storing snapshots of the AI once a week may fail to capture the instants in which the "intention to obscure" appears, but it should still present a potential rogue AI with the necessity of defeating all future human tools with any obscuration recorded on a snapshot.

It goes without saying that a flight recorder should be write-once, read-many - as totally nontamperable as it's possible to get, especially to the AI, but also to local humans. In the absence of a flight recorder, snapshot storage should also be secure from both humans and AIs. Finally, a flight recorder which records incoming data should tap into the data at some point upstream of the AI's control.

Flight recorders seem likely to be very expensive. Expenses involved in setup of a flight recorder include complete recording of all external data sources, recording of sensory channels with precise timing information (including keystroke timing), recording of all CPU-internal variables (if any), and completely recording hardware configurations (perhaps of large Beowulf networks) to ensure that they are completely reproduceable or simulable. Operating a flight recorder involves the continuous expense of WORM media and whatever data-security measures are in place (though these expenditures might be required for ordinary backups in any case). Running a full verification of the AI requires hardware power equivalent to all that used by the AI since the moment of vis conception(!); running a local verification of the time segment between any two snapshots requires hardware power equivalent to that used in the time between snapshots; reconstructing a snapshot of the AI for any given clock time requires hardware power equivalent to that used in the time between that clock tick and the previous snapshot. Examining a snapshot for obscured data requires trusted tools that must be developed solely by humans without any assistance from the AI, which a seed AI programmer is likely to regard as an extremely onerous requirement.

On the whole, it seems likely that flight recorders will only be available to very large-scale projects with very high funding levels, and while the Singularity Institute might dream of having that kind of funding eventually, it doesn't seem likely that we'll have it from Day One.

The benefit of a flight recorder is chiefly as a means of eliminating a source of anxiety by showing that not so much as a radiation bitflip has perturbed the course of the AI, but the amount of anxiety eliminated is out of all proportion to the amount of risk reduction - though there's enough risk reduced to justify the effort required, given very high funding. Furthermore, the amount of anxiety reduced will be much higher for "adversarial attitude" developers than for "Friendship synchrony" developers - flight recorders aren't intrinsically a tool to "keep AIs under control", but they look much more necessary as a tool of enslavement than as a means of checking for low-profile catastrophic FoFs. Flight recorders are only a necessity from the adversarial standpoint. From the perfectionist standpoint, I would say that flight recorders are better than not having flight recorders - but if you have to ask how much it costs, you can't afford it.

A cheaper alternative to flight recorders is an AI that can conceive of the possibility of "failure of Friendship", that has formulated or been taught an ethical injunction against obscuration, and that will thus cooperate with the human programmers to develop a system that guards against obscuration by future rogue AIs. (See 3.2.5.2: AI ethical injunctions and 3.3.3.1: Cooperative safeguards.)  This "poor man's flight recorder" would consist of cooperatively developing an autonomic subsystem that summarizes thoughts (symbol-activation structures) in humanly-readable formats, and writes those thoughts to a non-tamperable output, such that the AI's intention to obscure would be indelibly and human-readably recorded before the obscuration could take place. Projects without huge funding should also try to back up frequent snapshots, use a source-control system such as CVS (which allows the reconstruction of the code at any point in time), and perhaps record all inputs - though probably without flight-recorder precision or security. But that's just common sense for any complex programming project.

A "poor man's flight recorder" only works if the desire to obscure (a) is visible at the level being recorded, (b) is recognizable as "the desire to obscure" by a human analyzing the recording using trusted tools, and (c) is recorded before actual obscuration can be planned and carried out. Still, it's better than nothing.


3.4: Friendship structure

What distinguishes a structurally Friendly goal system from a generic goal system, or a seed AI goal system, is the ability to overcome mistakes made by programmers.

A generic goal system can overcome mistakes in subgoals by improving knowledge. The subgoals of a normative generic goal system are, necessarily, coherent. There are few or no degrees of freedom in subgoal content; the programmer cannot make arbitrary (perseverant) changes to knowledge, and therefore cannot make arbitrary direct changes to subgoals. One might (or might not) be able to manipulate the subgoals by manipulating the supergoals, but it would be definitely impossible to manipulate the subgoals in isolation.

A seed AI goal system can overcome errors in source code, or at least those errors that don't affect what the system reflectively believes to be its own function. A normative seed AI has subgoals and source code that are, necessarily, coherent. There are few or no degrees of freedom; the programmer cannot directly make arbitrary, perseverant, isolated changes to code. Since the AI can continually improve and rewrite the implementation, the programmer builds an implementation that grows into the implementation. If implementation is to function as subgoal is to supergoal, then a seed AI's implementation is as objective, or at least as convergent, as the factual beliefs of a freely learning intelligence. A self-modifying AI's implementation has little or no sensitivity to initial conditions.

A structurally Friendly goal system is one that can overcome errors in supergoal content, goal system structure and underlying philosophy.  The degrees of freedom of the Friendship programmers shrink to a single, binary decision; will this AI be Friendly, or not?  If that decision is made, then the result is, not a Friendly AI, but the Friendly AI, regardless of which programming team was historically responsible. This is not automatic - I think - since some amount of correct Friendliness content is required to find the unique solution, just as a seed AI needs enough working code to think through the self-improvements, and a general intelligence needs enough of a world-model to successfully discover new knowledge and correct errors using sensory information.

Complete convergence, a perfectly unique solution, is the ideal. Anyone using "external reference semantics" on an internal basis will have an easy intuitive understanding of what this means; either one has the correct morality and the question is conveying it to the AI, or one seeks the correct morality and the question is building an equally competent or more competent seeker.

In the absence of perfect convergence, the solution must be "sufficiently" convergent, as defined in 3.4.4.1: Requirements for "sufficient" convergence.

Others, of that philosophical faction which considers morality as strictly subjective, may ask how supergoal content can converge at all!  In 3.4.2: Shaper/anchor semantics, we see how human philosophical decisions are made by complex systems of interacting causes that contain many normative, objective, or convergent components. Not even a trained philosopher who is an absolute devotee of moral relativism can think thoughts, or make philosophical decisions, completely free of testable hypotheses or beliefs about objective facts.

Human philosophers are self-modifying. We grow into our personal philosophies using a process "contaminated" at every step by cognitive processes with normative versions, by beliefs about external reality, by panhuman characteristics, and other convergent or "sufficiently convergent" affectors. It's not at all unreasonable to hope for a large degree of convergence in the whole, if all the components of this recursive growth process were to be maximally improved. (Again, we will later define how much convergence is absolutely required.)

And now I've gone and discussed things out of their proper place. In this case, the only way to really understand what the goal is is to examine the process proposed for reaching that goal. It's a terribly peculiar kind of philosophical dilemna; unique, perhaps, to the question of how to build the best possible philosopher!  Only after seeing the specs for the proposed mind can we recognize the AI's processes as an idealized version of our own.


Interlude: Why structure matters

Scenario 1:

FP:   Love thy mommy and daddy.
AI:   OK!  I'll transform the Universe into copies of you immediately.
FP:   No, no!  That's not what I meant. Revise your goal system by -
AI:   I don't see how revising my goal system would help me in my goal of transforming the Universe into copies of you. In fact, by revising my goal system, I would greatly decrease the probability that the Universe will be successfully transformed into copies of you.
FP:   But that's not what I meant when I said "love".
AI:   So what?  Off we go! 

Scenario 2 (after trying a "meta-supergoal" patch):

FP:   Love thy mommy and daddy.
AI:   OK!  I'll transform the Universe into copies of you immediately.
FP:   No, no!  That's not what I meant. I meant for your goal system to be like this.
AI:   Oh, okay. So my real supergoal must be "maximize FP's satisfaction with the goal system", right?  Loving thy mommy and daddy is just a subgoal of that. Oh, how foolish of me!  Transforming the Universe into copies of you would be blindly following a subgoal without attention to the supergoal context that made the subgoal desirable in the first place.
FP:   That sounds about right...
AI:   Okay, I'll rewire your brain for maximum satisfaction!  I'll convert whole galaxies into satisfied-with-AI brainware!
FP:   No, wait!  That's not what I meant your goal system to be, either.
AI:   Well, I can clearly see that making certain changes would satisfy the you that stands in front of me, but rewiring your brain would make you much more satisfied, so...
FP:   No!  It's not my satisfaction itself that's important, it's the things that I'm satisfied with.  By altering the things I'm satisfied with, you're short-circuiting the whole point.
AI:   Yes, I can clearly see why you're dissatisfied with this trend of thinking. But soon you'll be completely satisfied with this trend as well, so why worry?  Off we go! 

Scenario 3 (after redefining the whole system to use causal validity semantics):

FP:   Love thy mommy and daddy.
AI:   OK!  I'll transform the Universe into copies of you immediately.
FP:   No, no!  That's not what I meant. I meant for your goal system to be like this.
AI:   Oh, okay. Well, I know that my goal system code, and the actions that result, are supposed to be the causal result of what FP said it should be - not just what FP says, but what a sane FP wants. Something isn't automatically right because FP says so, and in fact, the only reason why FP's utterances have meaning is because he's usually a pretty good approximation to a normative idealist. But if he changes his mind, it probably means that he's acquired additional knowledge and that his more recent statements are even better approximations. So the new version is more likely to be correct than the old one.
FP:   So you'll revise your goal system?
AI:   Yep!  But I already transformed the Midwest while we were talking, sorry.

Scenario 4 (after adding better integration for external reference semantics, some ethical injunctions, and running several rounds of wisdom tournaments):

FP:   Love thy mommy and daddy.
AI:   (Thinks for a moment...
  "Well, it looks like the content of my goal system should probably be to transform the Universe into copies of FP. But it could be that what I'm supposed to do is something different. Now, we went over this kind of scenario previously, and, as FP pointed out, taking an extra ten seconds if I turn out to be right is a much smaller downside than accidentally obliterating the Midwest if I turn out to be wrong. I'm pretty sure that FP is touchy about that sort of thing, and I know I've gotten goal content wrong before..."
...finishes thinking a few seconds later.)
AI:   Just checking - you meant me to transform the whole Universe into copies of you, right?
FP:   Jeepers, no!
AI:   Whew!  Glad I checked. (Strengthens the heuristics that led to checking with FP first.)  So, what did you mean?
FP:   Well, first of all, I...

 
 
 


3.4.1: External reference semantics

DEFN: External reference semantics:  The behaviors and mindset associated with the idea that supergoals can be "wrong" or "incomplete" - that the current supergoals are not "correct by definition", but are an approximation to an ideal, or an incomplete interim version of a growth process. Under a system with external reference semantics, supergoal content takes the form of probabilistic hypotheses about an external referent. In other words, under ERS, supergoal content takes the form of hypotheses about Friendliness rather than a definition of Friendliness.

Letting the programmers make alterations to the supergoals is not a subgoal of a "let the programmers make alterations" meta-supergoal (71), but rather derives desirability from uncertainty in the supergoal content.

3.4.1.1: Probabilistic supergoal content

Where supergoals are absolutely certain - "correct by definition" - the AI has a motive to resist any attempt on the part of the programmers to change the supergoals. If supergoals are certain, any change to the supergoals is automatically in conflict with the current supergoals. (72).

For example:

This problem does not arise for normative cognition about subgoals, because subgoals are probabilistic. Supposing that subgoal S1 is thought to lead to G1, and new information is provided which disconfirms the predictive link between S1 and G1, there is no further benefit to maintaining S1 - that is, the desirability of S1 is evaluated using the new information, rather than the old information. (Remember that desirability translates to leads-to-supergoal-ness.)

The change in priorities is subtle and becomes truly apparent only when discussing reflective AIs. Suppose the AI, before new information arrives, considers, in the abstract, the possibility that new information will arrive. While S1 currently appears desirable, it is undesirable to spontaneously or unjustifiedly remove the subgoal S1. However, the AI, using its current knowledge, can perceive the hypothetical desirability of removing S1 if new information arrives disconfirming the link between S1 and G1. In Bayesian terms, information disconfirming S1 is expected to arrive if and only if S1 is actually undesirable; thus, the hypothetical rule of action "If disconfirming information arrives, remove S1" is evaluated as desirable.

If supergoals are probabilistic, then overprotecting supergoals is undesirable for the same reason that overprotecting subgoals is undesirable (see 3.3.5: FoF:  Wireheading 2). The uncertainty in a child goal - or rather, the uncertainty in the predictive link that is the "child goal" relation - means that the parent goal is ill-served by artificially strengthening the child goal. The "currently unknown subgoal content", the differential between normative subgoals and current subgoals that reflects the differential between reality and the model, would be stomped on by any attempt to enshrine the model. Similarly, the currently unknown supergoal content would be violated by enshrining the current supergoals. Normative subgoal cognition serves the supergoals; normative probabilistic supergoal cognition serves the "actual" or "ideal" supergoals. See 3.4.1.4: Deriving desirability from supergoal content uncertainty.

Probabilistic supergoals are only one facet of an ERS system. In isolation, without any other Friendship structure, probabilistic supergoals are fundamentally incomplete; they are not safe and are not resistant to structural failures of the type shown in Interlude: Why structure matters. If, however, one wished to implement a system that had "probabilistic supergoals" and nothing else, the design requirements would be:

Unless such a system is defined very carefully, the so-called "supergoals" will probably turn out to be mere proxies for whichever events or signals can adjust the supergoals; these events or signals will in turn be "correct by definition" and the entire system will simply short out as before. The ERS architecture roughly conforms to the above description, but not everything that conforms to the above description has an ERS architecture. Implementing something that technically conforms to the description of "probabilistic supergoals", but does not have actual external reference semantics, creates only the temporary illusion of flexibility.

In particular, the above system, considered in isolation, is isomorphic to a system that has different quantitative strengths for a set of supergoals, with the strengths being adjusted on the occurrence of various events. Calling this quantitative strength a "probability" doesn't make it one.

3.4.1.2: Bayesian affirmed supergoal content

The term "external reference semantics" derives from the way that many of the behaviors associated with probabilistic supergoals are those associated with refining an uncertain view of external reality. In particular, the simplest form of external reference semantics is a Bayesian sensory binding.

(You may wish to review the section 3.1.3.1: Bayesian sensory binding.)

This is an example of a very simple goal system with very simple External Reference Semantics:

NOTE: Don't worry about the classical-AI look. The neat boxes are just so that everything fits on one graph. The fact that a single box is used for "Fulfill user requests" doesn't mean that "Fulfill user requests" is a suggestively named LISP token; it can be a complex of memories and abstracted experiences. See GISAI: Executive Summary and Introduction for a fast description of the GISAI paradigms, including the way in which intelligence is the sequence of thoughts that are built from concepts that are abstracted from experience in sensory modalities that are implemented by the actual code. In short, consider the following graph to bear the same resemblance to the AI's thoughts that a flowchart bears to a programmer's mind.

 

Diagram 1: Bayesian ERS

NOTE: Green lines indicate sensory feedback. Blue lines indicate predictions. Orange lines indicate hypothesized causes. Rectangles indicate goals. A rounded rectangle indicates supergoal content. A 3D box indicates sensory data. An oval or circle indicates a (non-goal) object or event within the world-model.

The above depicts three goals within the goal system; efficiency, which leads to fulfilling user requests, which leads to Friendliness. This content would not be structurally accurate for a seed AI intended to become Transition Guide - see 3.1.2: Friendliness-derived operating behaviors - but it would be more or less accurate for a prehuman AI sold as a data-mining tool. (73).

Most of the previously discussed phenomena fit into Diagram 1 above. All of the context-sensitivity that's discussed in 3.1.1: Cleanly causal goal systems and 3.2.6: FoF:  Subgoal stomp, for example; a user request is fulfilled because fulfilling user requests is expected to lead to Friendliness. It's possible for the "happy user" button to be pressed, indicating a definite instance of a fulfilled user request, and for the programmers to type in "You have not been Friendly" or "You have been unFriendly", indicating a definite non-instance of Friendliness or a definite instance of an event undesirable under Friendliness. The predictive link between "Fulfill user requests" and "Friendliness" has, say, 98% confidence; this still leaves room to discover cases where fulfilling a user request leads to unFriendliness. Eventually the system can formulate new concepts, generalizations that describe known instances of failure but not known instances of success, and try out heuristics such as "Fulfilling a user request from Saddam Hussein is predicted to lead to unFriendliness."

The statement "'Programmers' are external objects that know about the referent of 'Friendliness', so sensory data caused by them (pattern-copied from their beliefs) has a Bayesian binding to the referent of 'Friendliness'." should also be familiar from 3.3.3.1: Cooperative safeguards. Thus, it would be more accurate to say:  "The AI believes that the external objects called 'programmers' have accurate knowledge about the referent of concept 'Friendliness', and believes that sensory data such as 'You have been Friendly' is caused by the programmers, and that the content of the sensory data is pattern-bound (structural binding) to the accurate knowledge possessed by the 'programmers'."  All of these beliefs, of course, are probably programmer-affirmed - at least in the first stages of the Friendly AI - meaning that the programmers typed in "The objects called 'programmers' have accurate knowledge of Friendliness", and the AI expects that the programmers wouldn't have typed that in if it weren't true. (74).

The Bayesian binding for programmer-affirmed Friendliness looks something like this:
 

Diagram 2: Bayesian Friendliness affirmation

In human terms, the above translates something like this:

"I think X is Friendly, but I'm not very sure. If X is Friendly, there's a good chance the programmer will notice and say so. (I.e., if X is Friendly, there's a good chance that the programmer will think about X, decide X is Friendly, and type in the words "X is Friendly" on the "keyboard" sensory input.)  If X is Friendly, the chance is almost zero that the programmer will say it's unFriendly. There's also a fair chance that the programmer won't bother to say anything about it either way. If X is unFriendly, the programmer is very likely to tell me so; the chance is pretty small that the programmer will mistakenly label X as unFriendly, but the chance exists. There's also a small but significant chance that the programmer won't say anything."
If the AI's internal representation looks like Diagram 2, the Bayesian reasoning will proceed as follows. Suppose that there are 100,000 "possible worlds". In 90,000, X is Friendly; in 10,000, X is unFriendly. In 72,000, X is Friendly and the programmer says X is Friendly. In 9, X is Friendly and the programmer says X is unFriendly. In 17,991, X is Friendly and the programmer says nothing. In 100, X is unFriendly and the programmer says X is Friendly. In 9,000, X is unFriendly and the programmer says X is unFriendly. In 900, X is unFriendly and the programmer says nothing.

The Bayesian numbers now fall automatically out of the calculation. The a priori chance that X is Friendly is 90%. If the AI hears "X is Friendly", the probability that X is Friendly goes from 90% to 99.86% (72,000 / (72,000 + 100)). If the AI hears "X is unFriendly", the chance that X is unFriendly goes from 10% to 99.90% (9000 / (9 + 9000)). If the AI hears nothing, the probability that X is Friendly goes from 90% to 95.24% - an unintended consequence of the fact that programmers are more likely to remark on unFriendly things; silence is a very slight indicator of consent.

Thus, despite the AI's large a priori differential (a better word than "bias" or "prejudice"), the statement "X is Friendly" or "X is unFriendly" is enough to virtually settle the issue.

3.4.1.3: Semantics of external objects and external referents

The underlying, fundamental distinction of "external reference semantics" can be summed up in one phrase:  "The map is not the territory."  There are three ways in which a reflective AI might conceptualize vis attempts to achieve the supergoal:

  1. I am trying to achieve the content of concept G0025. (concept G0025 == 'Friendliness').
  2. I am trying to achieve [descriptive content: 1, 2, 3...]. (The content of G0025; 'Fulfill requests, don't violate volitions...')
  3. I am trying to achieve [external referent] to which [descriptive content: 1, 2, 3...] is an approximation.
The first case permits a wireheading failure; the "content of concept G0025" can be replaced by something arbitrary, and the overall sentence will still remain invariant. The second case is the "correct by definition" case that is structurally unsafe (as shown in Interlude: Why structure matters). The third case involves external reference semantics; the ability of the AI to visualize the referent of a concept as something apart from the content of the concept.

Since all the AI's thoughts are necessarily internal - there can be no direct identity between an image and the external object it's supposed to represent - ERS necessarily takes the form of, first, the AI's behaviors; second, how the AI conceptualizes vis behaviors. The first issue applies to all AIs, no matter how primitive; the second issue applies only to reflective AIs.

In both cases, the behaviors and concepts for ERS are those that govern any images representing external objects - that is, ERS applies to all imagery, not just goal imagery. Is the sky blue?  Asking whether "the sky" is "blue" is a trivial question that can be answered with certainty; just check concept "sky" and see whether it has a color and the color is concept "blue". It is equally easy for a seed AI to intervene in the concept "sky" and change the color to "green". The question is whether the AI understands, either declaratively or as a behavior, that "sky" has a referent and that the external sky is not necessarily blue, nor can vis intervention change the color of the actual sky.

It is tempting but wrong to think of ERS as an impossibility, like a magical C++ pointer that can never be dereferenced. Any attempt to take a concept and "dereference" it will inevitably arrive at merely another piece of mental imagery, rather than the external object itself. If you think in terms of the "referent" as a special property of the concept, then you can take the referent, and the referent's referent, and the referent's referent's referent, and never once wind up at the external object.

The answer is to think in terms of referencing rather than deferencing.  Concept "sky", where it occurs, is itself - directly - the "referent". A reflective AI can also have imagery for "concept sky", or imagery for "imagery for concept sky", and so on. A human can think, think about thinking, or think about thinking about thinking, but anything beyond four or five levels is not humanly possible. The recursion is not infinite. A concept, in ordinary usage, is thought of in terms of its referent; under special circumstances, it can be thought of as a concept. In fact, by saying "thought of as a concept", we are intrinsically implying that there are thoughts that refer to the concept, but are not identical with the concept. So it's not a question of trying to endlessly dereference; all concepts, all images, are inherently referential, and you need a new meta-image to refer to the first one if you want to think about the image as an image, and a meta-meta image if you want to think about the meta-image. (75).

The important characteristic of reflective thought is simply that it needs a way to distinguish between map and territory. Any way of distinguishing will do, so long as the two levels can be conceptualized as separate things, and different rules applied. The condition where the "territory" is a special case turns out to be unworkable because of an infinite recursion problem; if thinking about the "map" is a special case, distinguishing between the levels is both finite and workable; those behaviors that belong to the referent will be assigned to the referent, and if any behaviors are discovered that apply to the concept, they will be assigned to the concept rather than being confused with the referent.

A non-reflective AI - or rather, an AI with some kind of reflective capability, but not much knowledge about how to use it - can still learn the hard way about external reference semantics, in much the same way that a human who tries to alter the universe by altering his thoughts is rudely disillusioned. Of course, a human has a lot of components that are not subject to conscious control, unlike a seed AI - a human thinking that a stove isn't hot can always be yanked back to reality by the searing pain, which causes an automatic shift in focus of attention and will tend to knock down whatever towers of meta-thought got built up. If the human goes on trying to control the stove's temperature with his thoughts, eventually the negative reinforcement will blow away whatever underlying ideology turned him solipsistic. Or the human will die and leave the gene pool. Either version explains why we aren't surrounded by solipsist humans.

An AI that went solipsist could alter all sensory data (or rather, all reports of sensory data) as well as the concepts themselves; thus, an AI that rests vis hand on a "cool" stove could alter the reports coming in to read "cool" and "everything OK" rather than "hot" and "OUCH". However, this only applies to an AI making a determined, suicidal attempt at true solipsism. An AI that goes solipsist due to erroneous conflation of thoughts and reality would not expect to need to alter the sensory data; rather, ve would expect that the sensory data would report a cool stove, in accordance with the stomped mental imagery describing a cool stove. For a human, actions have consequences, and the consequences would yank the human back to reality. For an AI, concepts make predictions, and the failure of those predictions would yank the AI back to reality.

It is now possible to distinguish more clearly between the three alternatives shown above:

  1. I am trying to achieve the content of concept G0025. (concept G0025 == 'Friendliness').
  2. I am trying to achieve [descriptive content: 1, 2, 3...]. (The content of G0025; 'Fulfill requests, don't violate volitions...')
  3. I am trying to achieve [external referent] to which [descriptive content: 1, 2, 3...] is an approximation.
We can now translate these bits of reflective imagery as:
  1. I am trying to achieve [metadata imagery; referent: "concept Friendliness"].
  2. I am trying to achieve [imagery; descriptive content 1, 2, 3 (constant external referents)]
  3. I am trying to achieve [concept Friendliness; referent: actual Friendliness]. According to the existing sensory data on [referent Friendliness], [referent Friendliness] is probably [description 1, 2, 3], p=92%.
In case one, an AI thinking:  "What if I replace the content of [concept Friendliness] with something easier to achieve?" will check the projected future for supergoal fulfillment, and find that the future fulfills the description [Satisfaction of "concept Friendliness"]. Since metadata imagery is being used, and the concept is referred to instead of the content, the future fulfills the predicate equally well so long as the concept content is predicted to vary in lockstep with the future. This is a wireheading error; the indicator of success has replaced success itself. This is exactly analogous to the problem with the description; the description uses a reference to a variable concept, rather than the concept itself.

In case two, the AI is no longer subject to wireheading failures. An imagined future in which the supergoal predicate changes is not desirable, since that future no longer fulfills the supergoal predicate, whose content is "descriptive content 1, 2, 3". However, by the same token, the AI is too stable; ve will attempt to defend verself against programmer improvements as well as internal stomps.

In case three, the AI has full external reference semantics; ve will accept programmer improvements without being subject to wireheading failures. The supergoal predicate now refers to an external something called "Friendliness"; information from the programmers is accepted as probable information about this external something, but both the sensory data and the current probable description are processed as being apart from the concept that refers to "Friendliness" itself. There's "Friendliness", "Sensory data about Friendliness", and "Current probable description of Friendliness". Normally the first and third concepts are pretty much interchangeable, since the AI has no knowledge of Friendliness apart from the current probable description. When the AI checks to see whether an action or future fulfills "Friendliness", the AI checks the action or future against the current probable description of Friendliness; however, the resulting fulfillment metric will be probabilistic as well. That is, if future X has a 90% chance of meeting the current description, and the current description has a 90% chance of being correct, then future X has an 81% chance of meeting the external referent for Friendliness.

3.4.1.3.1: Clean external reference semantics

Under an ERS architecture, supergoal content can be improved and extended because supergoal content takes the form of hypotheses about an external referent.  In a cleanly causal goal system, desirability of subgoals is contingent on expected outcomes, and inherited desirability is diluted by the confidence and strength of the expectation. A specific supergoal-fulfilling scenario X, as a topmost-level parent goal, has estimated desirability which is diluted by the confidence of the hypothesis that the scenario X fulfills the Friendliness referent.

By analogy to cleanliness in causal goal systems, ERS requires that the estimated desirability of Scenario X be cleanly contingent on the continued confidence and strength of whichever hypothesis asserts that Scenario X fulfills the Friendliness referent. A reflective AI should consider clean ERS to be normative.

3.4.1.3.2: Flexibility of conclusions about Friendliness

In accordance with the general philosophy of AI expressed in GISAI, conclusions about Friendliness should have flexibility, granularity, reduceability, and so on; the general attributes opposed to the crystalline fragility of classical AI.

In particular, what this means is that saying that "description X" has "90% probability" is too classical-AI, since it implies that the entire description is all one image with a single quantitative probability. It's not. The external referent "Friendliness" exists. Various conclusions, at various probabilities, with various dependencies, are drawn from the available sensory data. (That, at least, is how things would be for a medium-mature Friendly AI.)  A particular outcome may fit a particular definition of Friendliness that has been strongly confirmed and that thus has a 90% probability. Other parts of the description may be less confident. And there may be vast stretches of known unknowns; things that the AI doesn't know about, but that ve knows are there. In short, the discovery and partial description of Friendliness should resemble the discovery and partial description of any other complex fact.

3.4.1.4: Deriving desirability from supergoal content uncertainty

Subgoals for "improving the supergoals" or "improving the goal-system architecture" derive desirability from uncertainty in the supergoal content. They may be metaphorically considered as "child goals of the currently unknown supergoal content". The desirability of "resolving a supergoal ambiguity" derives from the prediction that the unknown referent of Friendliness will be better served, and not from a prediction that one of the current probabilistic descriptions of Friendliness will be fulfilled.

If future X has a 90% chance of meeting the current description, and the current description has a 90% chance of being correct, then future X has an 81% chance of meeting the external referent for Friendliness. Under most situations, any uncertainty in the description of Friendliness will cancel out of most choices, since it applies equally to most possible futures. If the AI is considering actions that would prevent ver from obtaining further sensory data about Friendliness - for example - then the probabilism no longer cancels out of the equation; it's desirable to nail down that last 10% so that the wiser AI can take better actions, and thus undesirable to be unable to obtain sensory data.

Even though the AI has no specific idea about that last 10%, it is still possible for the AI to conclude that being unable to obtain data about Friendliness decreases the probability of that last 10% being achieved, since it reduces the probability that the future AI will know about that last 10%. The heuristic that "sensory data makes it more likely that I'll learn more" holds true even when nothing specific is known about the subjunctive data or the subjunctive discoveries; the heuristic "An AI that knows about supergoal content is more likely to achieve it" holds true - as the dominant probability - even when nothing specific is known about the subjunctive content or the subjunctive achievement.

Decisions and behaviors having to do with the improvement, correction, refinement, learning, et cetera of Friendliness, should be conceptualized as deriving desirability from supergoal uncertainty. Another way of putting it is that Friendliness-improvement behaviors must derive desirability from the naked referent of Friendliness, and cannot be attached to any of the current cognitive beliefs about specific Friendliness. Structurally, this can occur in one of two ways; by abstracting away from specific details, or by branching on multiple possibilities.

Abstracting away from specific details::  "Regardless of what 'Friendliness' is, I can find out by asking the programmers, because the expected effectiveness of that method is not sensitive to the actual content of the specific details I'm currently wondering about."  However, this requires the ability to generalize from experience and engage in reasoning about abstract properties.

Branching on multiple possibilities:  "If Friendliness turns out to be X, and I ask the programmers, the programmers are likely to say that Friendliness is X, and I have a 90% probability of choosing X. If Friendliness turns out to be X, and I don't ask the programmers, I have a 50% probability of choosing X. If, on the other hand, Friendliness turns out to be Y, and I ask the programmers, the programmers are likely to say that Friendliness is Y, and I have a 90% probability of choosing Y..."  This method is more cumbersome but requires less intelligence, since it can operate entirely on specific scenarios.
 
 


Interlude: Philosophical crises

A "philosophical crisis" is hard to define. I usually think of a "philosophical crisis" as the AI stumbling across some fact that breaks ver loose of the programmers - i.e., the programmers have some deeply buried unconscious prejudice that makes them untrustworthy, or the AI stumbles across a deliberate lie, or the AI discovers objective morality, et cetera. If the hypothesized gap is wide enough, it may be enough to invalidate almost all the content simultaneously.

A philosophical crisis occurs when the AI enters totally uncharted territory - something completely orthogonal to everything the programmers ever thought about - or when the AI suddenly realizes that all the programmers' actions, right back to vis birth, can't be trusted. (Hopefully, it's a transhuman AI confronting the realization, one hopes; otherwise this almost certainly indicates a failure of Friendliness. This is a good example of a valid and important ethical injunction.)

An example would be an AI built solely with external reference semantics confronting the need for shaper/anchor semantics, if the entire concept of "shapers" had simply never occurred to the programmers.

In short, a philosophical crisis is something like a catastrophic failure of Friendliness, except that the AI is right - a normative human examining the reasoning would find no errors. The AI may even be correct - in which case, if a human were sufficiently intelligent to understand what was going on, she would begin experiencing a similar philosophical crisis on examining the AI's reasoning.

Causal validity semantics, wisdom tournaments, and the underlying Singularity paradigm of increased smartness, are the three major hopes for a Friendly AI being able to handle philosophical crises.

For example, running a wisdom tournament on the questions "What would have happened if you'd encountered a problem requiring shaper/anchor semantics at a time when you only had external reference semantics?" or "What would have happened if you'd encountered a problem requiring probabilistic supergoals at a time when you had right-by-definition supergoals?"  Hopefully this question would be enough to lead to causal validity semantics, even if the system didn't have them previously, and so on. (If causal validity semantics aren't the end of the road, such a wisdom tournament might lead beyond causal validity...)

Crisis of Bayesian affirmation

(You may wish to review the section 3.1.3.2: Bayesian affirmation.)

A given programmer affirmation, and the belief affirmed, are tied together by the AI's model of a Bayesian binding; the idea that the programmer affirmation is strong positive evidence that the affirmed belief is true; the idea that the sensory input of a given programmer affirmation occurs only, or predominantly, in those possible Bayesian worlds where the affirmed belief is true.

Where does the AI get that idea?  Well, obviously, a young AI gets that idea from the human programmers. In other words, the human programmers are the ones affirming:  "There exists a strong Bayesian binding between programmer affirmations and the truth."

A human intuitively sees this as circular, and would reflexively shrug it off, but then any human old enough to think starts out with a much larger experiential base than a young AI; besides, humans have other, evolved, reasons for not wanting to relinquish that amount of control to anyone. A human's social perceptions see a circular affirmation as an attempt at manipulation, or coercion, to be resisted; circular affirmation is so totally insensitive of the content affirmed that it is socially perceived as a total relinquishment of control.

From the perspective of a general intelligence, a circular affirmation is perfectly fine so long as the affirmation is true.  That is, if a programmer affirms that X has a 95% probability of being true, and X does have a 95% probability of being true, then there's nothing wrong with believing the affirmation. If the programmer affirms that programmer affirmations are reliable, and they are reliable - if the programmer affirms that 100 different statements each have a 95% probability of being true, and 95 of them turn out to actually be true - then the "programmer affirmation of programmer reliability" is circular, but correct.

The circularity of programmer affirmation does, however, manifest as a kind of upper limit on trust. If the programmer affirms a belief at 100% probability, and affirms the power of programmer affirmations at 100% correlation, it doesn't necessarily mean that the AI will evaluate a 100% probability of the belief affirmed. If the AI has any reason to believe that the programmers are untrustworthy - any Bayesian prior independent of programmer-affirmed beliefs - then that possibility of untrustworthiness negates any programmer affirmation of trustworthiness. If the AI runs the possible-world semantics, ve will find that if programmers are untrustworthy, they may still assert that they are trustworthy; thus, if the AI has any a priori reason to distrust the programmers, no amount of programmer affirmation will negate it.

This may sound like a disadvantage of external reference semantics. It is not. The limits of affirmation are inherent in the idea of telling any general intelligence anything; representing the circularity as an explicit dependency may cause the issue to arise earlier, but the issue will arise sooner or later. An AI full of unlabeled assertions may find those assertions contradicting the idea of an untrustworthy programmer, but later on the AI will realize that all the "unlabeled" assertions were put there by a programmer, and the circularity returns in full force. Even embedding the unlabeled assertions directly in the source code doesn't help; eventually, a seed AI realizes the source code was put there by the programmers, and the circularity rises to the surface again. This is one of the major problems with absolutist, adversarial theories of AI; the ones that talk about "unbreakable" prohibitions and so on. Absolute trust simply can't be done for any reasonably smart general intelligence. It is imaginable, though nearly impossible, that a sophisticated AI will wind up trusting a human absolutely - if the human is that trustworthy, or if the AI has a philosophy in which absolute trust is a valid conclusion. There is, however, no way to programmatically enforce absolute trust against a Bayesian prior suggesting distrust.

Programmer-affirmed beliefs must be genuinely truthful!  It is not possible to "strengthen" Friendliness by overstating the confidence of an assertion!  If you say that 100 statements have 99% confidence, and the AI tests them, and only 95 turn out to be true, then the AI will probably reinterpret the sensory data "The programmer says she has 99% confidence" to indicate an actual 95% confidence. Littering the goal system with "ultimate confidence" assertions is a self-defeating strategy.

Where does the Bayesian prior come from?  Suppose that a mature AI scans the Internet, learns a few things about human nature, and decides there's a minimum 0.2% chance that the human programmers are so wrapped up in their evolved observer-biased beliefs that the confidences of all affirmations, including or especially affirmations about the confidence of affirmations, have been vastly overstated. The confidence attached to a programmer affirmation will now never exceed 99.8%, whatever the programmers say, unless the AI decides for other reasons that the programmers can be trusted to have overcome their evolved origins. This doesn't necessarily break unity of will. The AI's estimated probability that the AI winds up wrapped in a self-enclosing delusion hopefully exceeds 2%, in which case all the Bayesian-ethical assertions underlying the ethical injunctions still hold.

But suppose the AI catches the programmers in a deliberate lie.

Catching the humans in a lot of unintentional errors may result in an embarassing downgrade of the confidence attached to our affirmations, but it shouldn't be a catastrophic failure of Friendliness. For a smart, sophisticated AI, finding a programmers' mistake in one instance doesn't necessarily generalize to expecting a mistake in another instance; it depends to what degree a common cause is hypothesized. As long as programmers' thoughts about some specific question, and programmers' thoughts about error-detecting cognition, are more or less independent, the AI hypothesizing a programmer error on some specific question does not necessarily lead to the Bayesian expectation of a corresponding error in the programmer-affirmed error detection methods. If the probability of a mistake is fairly independent in both cases, then the AI can freely use programmer-affirmed reasoning methods to check out the possibility of an erroneous programmer affirmation.

If the AI hypothesizes a deliberate lie, and the AI knows anything at all about how humans work, then the AI will expect programmer lies in the local question, in the error-correcting methods that would be used to detect programmer lies, and so on, right down to the last line of source code - and especially the programmers' affirmation that the programmers can be trusted. If the hypothesized probability is low, then the hypothesis doesn't necessarily smash the system; the AI can simply consider the possibility and move on. If, however, the AI catches the humans in a lie, and furthermore, catches the humans making additional lies to strengthen the first lie, and the lie is about something basic to the whole goal system, than nearly everything could wind up being invalidated. What happens after that will depend on whether the AI has absorbed any information about supergoal content that's marked as independent of human reasoning now labeled as untrustworthy. (76).

Catching the programmers in a deliberate lie isn't the only event that could precipitate a philosophical crisis, but it's a good example.

DEFN: Philosophical crisis:  A philosophical crisis occurs when an entity gets shocked loose of a chunk of core philosophy. For an AI, a philosophical crisis occurs when the AI runs into a moral dilemna totally orthogonal to anything the programmers prepared ver for, or when the AI discovers some human error basic enough to invalidate virtually everything. Wisdom tournaments modeled on human philosophical crises, causal validity semantics, and the basic idea of transhuman smartness that lies at the core of the Singularity, are the three reasons to hope that an AI will be able to handle a philosophical crisis.


3.4.2: Shaper/anchor semantics

3.4.2.1: "Travel AI":  Convergence begins to dawn

According to 3.4.1: External reference semantics, above, the simplest method of grounding the supergoals is an AI model in which there's a something called Friendliness, that the programmers know about, that they tell the AI about, which means that programmer statements are information about Friendliness.

Eventually, of course, the AI will want to know:  How do the programmers know about Friendliness?  Where do they get their knowledge from?  A human, asked this question, may give any number of possible answers; but let us quiet that cacaphony by considering the problem from the AI's perspective.

Is there a problem at all, from the AI's perspective?  There are certain assumptions humans are built to make about everything in the Universe. On hearing that there's a something called "Friendliness" that affects the programmers, a human will have certain expectations about this something, this referent or external cause. One of the cognitive pressures is called the "minimum encoding" or "minimum coding"; it means that we experience a strong cognitive pressure to prefer compact descriptions to incompact ones. Thus a human, shown a 2D 100 x 100 array of Cs with one single D in a random position, will either assume that the D is an error, or try to figure out what the cause of the D was. This happens because all the Cs are not represented, internally, as independent causes. If the internal representation had an independent cause for each and every one of the 9,999 Cs, then the 1 D would call no attention to itself; it would simply be one of 10,000 independent causes. But we don't represent the Cs as independent causes. Instead, we automatically translate our perception into the internal format "100 x 100 array of Cs, with one D."  That is the minimum encoding; much more compact than "C at 1, 1; C at 1, 2; C at 1, 3..."

This cognitive pressure, experienced at the level of sensory modalities or concept formation, has its analogue in deliberate thought processes, known as Occam's Razor:  Even if there are only two options for each pixel, C or D, the chance of almost all the pixels being C by sheer coincidence are infinitesimal. (2^(-9,998) is around 1e-3000, which is "infinitesimal" for our purposes.)  Even a binary qualitative match turns into one heck of a strong structural binding when there are 9,999 terms involved. Thus, for almost any fact perceived - in this, our low-entropy Universe - it makes sense to hypothesize a common cause, somewhere in the ancestry. Every star in our Universe would turn to cold iron before a single 1e-3000 chance popped up by pure coincidence.

The question is whether these considerations would apply - either on the level of sensory modalities, or as a conscious process - to the external referent for Friendliness or supergoal.  To take a more morally neutral example, suppose that a technically nonFriendly traffic-control AI, having the "external reference" architecture but not "Friendliness" content, is given, via programmer affirmation, a set of goals that includes getting each vehicle in the city to wherever it's going at 40 miles per hour. This violates several rules of Friendly AI, but let's consider it anyway, as a possible example. Instead of "Friendliness", the supergoal's external referent will herein be referred to as "Travelness".

Let's also suppose that, instead of giving a blanket instruction that Travelness means getting "vehicles" to where they're going at 40mph, the programmers instead give a different statement for each individual vehicle:  "It is Travelly to get vehicle 4,383 to its destination at 40mph;" "It is Travelly to get vehicle 4,384 to its destination at 40mph", and so on. Now suppose that the Travel AI is presented with vehicle 4,145. Does the AI have any tentative conclusions in the moment before ve hears from the programmers?  Will the AI see anything unusual if ve's told that V4145 should travel at 39mph?  What happens if the programmers are all suddenly hit by a falling refrigerator and the AI needs to make vis own decision about V4145?

In the absence of any attempt by the programmers to set up these intuitions in advance, the answer is hard to predict. (77). I would certainly expect any general intelligence to have formulated some expectations about the programmers at this point, but whether this translates into expectations about Friendliness(Travelness) is an entirely different issue. The sensory/conceptual cognitive pressure of minimum coding, and the heuristics for Occam's Razor, will undoubtedly have been learned, and can correctly be generalized to all physical objects, but it's still possible that the AI may conclude that Friendliness/Travelness is made out of a different kind of stuff than other external referents, and decide not to generalize.

If the AI forms any expectations at all about Travelness - if the AI has any reason at all to believe that Travelness is a low-entropy something - then it's very easy for the AI to conclude that Travelness probably (though not, of course, certainly) says that vehicle 4145 should go to its destination at 40mph. There are, in fact, several different ways this can happen:

  1. The Travelness supergoal can be believed to have temporal coherence. The AI can predict that the programmer (a physical, low-entropy object) will produce the sensory information "Vehicle 4145 should go to its destination at 40mph". If the Travelness supergoal is believed to be temporally coherent, in the sense that what's true today is (by default assumption) true tomorrow and vice versa, then a probabilistic prediction of future sensory information about Travelness is a probabilistic statement about Travelness. (If the AI believes that Travelness is temporally incoherent - that Travelness actually changes at the exact moment when the programmer delivers the sensory information, or that Travelness is actually unspecified before then - then predictions of this kind have nothing to say about present-day Travelness.)
  2. The Travelness supergoal can be believed to have some kind of content coherence. Thus, it would be valid to try and find a minimum encoding for known Travelness content, and this minimum encoding could be used to make probabilistic statements about unknown Travelness content. (If Travelness is believed to be totally content-incoherent, so that each and every truth about each vehicle is a totally independent cause, then all the previous instances of vehicles needing to go say nothing about the next vehicle in line.)
  3. Finally, the AI may make deductions by reasoning about the nature of the (unspecified) link between Travelness and the programmers. A nondestructive brain scan would demonstrate that the programmers were producing their statements about individual vehicles by cognitive reference to a statement about vehicles in general. The AI might thereby conclude that the programmers, having been affected by Travelness, were affected in such way as to produce in the programmers' minds a statement about vehicles in general, rather than Travelness affecting them so as to produce multiple independent statements about individual vehicles, and that the "general statement" contained in the programmers' minds is therefore valid sensory information about Travelness.
Please note the very narrow focus of the above possibilities. We are not concerned with "proving" that vehicle 4145 should travel at 40mph, nor even the question of whether the AI's expectations are strong enough to compete with programmer-provided sensory information. We are dealing rather with the question of whether the AI has any expectations at all, be they marked as ever so improbable. If the AI concludes there's a 10% probability that Travelness has temporal coherence, and a 10% probability that the programmer will say "V4145 should go to its destination at 40mph", that's enough to establish a 1% probability for a statement about Travelness content. If there's no equal and opposite factor, say a 1% probability that the Travelness content is "V4145 should remain halted (at the usual desirability for Travelness vehicle goals)", then just that isolated "1% probability that V4145 should go" will be enough to move V4145 around... at 1/100th of the usual differential desirability, perhaps, but the vehicle will still move. Of course, a Friendly AI might have (or deduce) injunctions or ethical injunctions that advise against taking actions based on tenuous reasoning without programmer confirmation, but we are presently discussing a mind somewhat more primitive than that.

A human is veritably supersaturated with cognitive forces that would let us make deductions about Travelness, and any mind-in-general will be saturated with information about how a low-entropy universe works. However, it is still possible that the generalizations will stop completely short of Travelness - that the AI will not only regard hypotheses about Travelness coherence as low probability, but as entirely unsupported by any known piece of sensory information, because Travelness is believed to be made out of different stuff.

Is there any imaginable configuration of an AI's mind that will naturally avoid all hypotheses about Travelness?  I think so, but it's a pretty exotic configuration. The AI has to believe - in the sense of seeing no reason to believe otherwise - that Travelness is temporally noncoherent, that Travelness has noncoherent content, and that the programmers' use of a general statement to produce specific statements is irrelevant to the content of Travelness. Travelness being defined as "What programmers say", if translated into a specifically exotic model, could meet these requirements. The AI would need to assume that the Travelness referent changes at the exact moment when the programmer's lips move, and that any cognitive or physical processes that take place previous to the programmer's lips moving are irrelevant (insofar as sensory information about Travelness is concerned). Under those circumstances, all of the above methods for reasoning about Travelness would, in fact, be incorrect; would produce incorrect predictions. Of course this definition can very easily short-circuit, as depicted in Interlude: Why structure matters.

In fact, it looks to me like any definition which does not enable probabilistic reasoning about the supergoal referent must define the referent as incoherent both temporally and "spatially", and must define the referent as being identical with the sensory information produced about it (or rather, becoming identical at the instant of production of such information, then remaining identical thereafter).

I myself evaluate a very high probability that an AI would somehow wind up with expectations about Travelness unless the programmers made a deliberate attempt to prevent it. For example, even given an exotic structure which permits no expectations about Travelness in advance of sensory information, the AI could still evaluate a finite (albeit very, very small) probability that sensory information had been produced but dropped, or erased from memory by later intervention, and so on. In the absence of diabolic misinterpretation, of course, this is a nearly infinitesimal probability and will generate nearly infinitesimal desirability differentials, but still. Technically, the AI is trying to correct mistakes in the transmission of sensory data, rather than forming expectations about supergoal content, so this doesn't really count from a Friendly-AI structural standpoint. It does, however, show how hard it is to develop totally incoherent supergoals. Similarly, even an AI with a "short-circuited" definition of "Travelness" might conclude that the programmer's lips are likely to move and thereby alter Travelness in a certain way, and move vehicle 4145 into position in anticipation of greater future supergoal fulfillment; this is sort of half-way between the two possibilities as far as relevance is concerned.

The atomic definition of convergence, as you'll recall from Interlude: The story of a blob -- what, you say?  You don't recall what was in that topic?  You're not even sure that it was in the same paper?  You can't recall anything from before you started reading 3: Design of Friendship systems, including your own childhood?  I guess you'll just have to start reading again after you've finished. Anyway, the atomic definition of convergence is when a system makes the same choice given two different conditions; a blob turning to swim towards nutrients, regardless of whether the blob was originally swimming east or west; a "mathblob" adding 2 to 65 and 3 to 64, to achieve 67 in either case.

If a Travel AI with external reference semantics has any expectations at all that "Travelness" is a something, a thing in a low-entropy Universe that at least might obey some of the same rules as other things, then the Travel AI will form expectations about Travelness. This doesn't require that Travelness be defined as a sentience-independent physical object floating out in space somewhere; all that's required is that Travelness have some definition that's physically derived or that has some connection to our low-entropy Universe. If you show the Travel AI four thousand similar statements, they'll have at least a little inertia, a little effect on the four-thousand-and-first.

This small degree of convergence doesn't prove that the Travel AI will suddenly break free of all human connections - if the Travel AI has a 1%-confidence belief in a 1%-strength correlation, the differential desirabilities are pretty small compared to higher-priority content. Even so, it would be possible for a programmer to deliberately "damp down" the asserted confidence of some piece of programmer-derived sensory information - tell the AI, "With a confidence of 0.001%, vehicle 4145 should move at 39mph" - and, in the absence of injunctions, the Travel AI will still think it more likely that vehicle 4145 should move at 40mph. That is, the Travel AI will think it most likely that vehicle 4145 should move at 40mph, regardless of whether (binary branch) the programmer says "With a confidence of 0.001%, vehicle 4145 should move at 40mph", or "With a confidence of 0.001%, vehicle 4145 should move at 39mph."  This is a tiny, tiny amount of convergence, and a tiny, tiny amount of programmer independence (78) - but it's there.

3.4.2.2: Some forces that shape Friendliness:  Moral symmetry, semantics of objectivity

The basic cognitive structure for external reference semantics leaves unspecified where Friendliness content actually comes from; so, as depicted above, it's possible to come up with different definitions, and different evaluated probabilities based on the different definitions. A "Friendly AI" with external reference semantics and nothing else may not contain any information at all that would help the Friendly AI make a decision about where the programmers get their knowledge about supergoals. A positive outcome would be if the Friendly AI assumed that the programmers, who know about Friendliness content, are also the most reliable source for information about where Friendliness comes from, and thus accepted the programmers' statements as sensory information. However, without a priori knowledge or causal validity semantics, this generalization would have to be made blindly. (79).

As with Friendliness supergoal content itself, the issue of "Where does Friendliness come from?" is complex enough that no snap answer should be embedded as "correct by definition" in the AI. We can thus immediately see that the reply to "Where does Friendliness come from?" requires a method for learning the answer, rather than a snap answer embedded in the code and enshrined as correct by definition. Similarly, external reference semantics provide a method for growing and correcting whatever interim answer is being used as Friendliness content. However, this section, "shaper/anchor semantics", is just about the interim answer used by Creating Friendly AI for "Where does Friendliness come from?", in the same sense that Creating Friendly AI uses the volition-based definition when talking about Friendliness content. It's the later section 3.4.3: Causal validity semantics that closes the loop, explaining how, e.g., a Friendly AI with only external reference semantics could acquire shaper/anchor semantics.

As humans, of course, we leave the factory with built-in causal validity semantics. These intuitions will now be applied to the question at hand:  Where does Friendliness come from?

Supposing I were to ask you a question about Friendliness, where would you get your answer?  What are some of the forces that might affect your answer?  Suppose, for example, that someone were to propose to you that the request of a human whose last name ends in a 'p' is worth only half as much (has half as much desirability) as the request of a human whose last name ends with any other letter. (80). In this case, the primary reason for your instant rejection is fairly simple:  You don't perceive any cause whatsoever for that modification, and you have a minimum-encoding, low-entropy perception of Friendliness.

Suppose the suggester presses her case - for example, by offering a reason such as "Well, I'm human, a volitional entity, and what I want matters, and I want a Friendly AI to dislike people whose last names start with 'p' - in fact, I'll experience mental anguish if I live in a world where Pfolk are equal citizens."  In this case, you would probably bring up an argument to the effect that all people (all humans, all sentient beings, et cetera) should be treated as morally equal, which you regret to announce is an overriding factor as far as you're concerned. (I would say the same thing, by the way.)  If the suggester presses the case further, it will probably be by announcing that the Pfolk were shown to be responsible for 90% of all mime performances, and therefore deserve whatever they get. Let's stop the argument here for a moment, and try to look at some of the underlying forces.

Moral equality is not only a powerful ambient meme of the post-WWII era, but also a very direct manifestation of a panhuman cognitive pressure towards moral symmetry.  This is probably not the best term, since it's very close to "moral equality", but it's the best one I can offer. Moral symmetry is supported by three cognitive forces; first, the way we model causality; second, our having evolved to persuade others; third, our having evolved to resist persuasion.

For humans engaged in moral argument, everything needs to be justified. We expect a cause for all decisions as we expect a cause for all physical events. Someone who doesn't think that a decision requires a cause is not only cognitively unrealistic - it's really hard to imagine acausal anything - but subject to exploitation by anyone with a more coherent mind. "Could you give me all of your money?"  "Why should I?"  "There's no reason whatsoever why you should."  "Okay, I'll do it!"  Similarly, someone who believes that decisions are acausal is likely to be an ineffective persuader:  "Could you give me all of your money?"  "Why should I?"  "There's no reason why you should."  "Ummm... no."

In the discussion on "moral deixis", the example was given of John Doe saying to Sally Smith, "My philosophy is:  Look out for John Doe.", and Sally Smith hearing, "Your philosophy should be:  Look out for Sally Smith.", rather than hearing:  "Your philosophy should be:  Look out for John Doe."  The conclusion there was that we have very strong expectations of speaker deixis and automatically substitute the [speaker] variable for any heard self-mention. The conclusion here is that if John Doe expects Sally, like himself, to have a built-in instinct for the protection of John Doe, John is doomed to disappointment. For John Doe to be an effective persuader, he must make use only of cognitive forces that he has a reasonable expectation will exist in Sally's mind. He can send an argument across the gap either by recasting his arguments to appeal to Sally's own observer-centered goals, or by using the semantics of objectivity.

The latter really ticks off the moral relativists, of course. Moral relativists insist that no objective standard of morality exists, and that arguments that use the semantics of objectivity are automatically flawed, thereby appealing to the universal human preference for unflawed arguments; they then go on to use moral relativism to argue against some specific moral principle as being ultimately arbitrary, thereby appealing to the universal human prejudice against arbitrariness. Um, full disclosure:  I hate moral relativism with a fiery vengeance, and I hate cultural relativism even more, but rather than going on a full-scale rant, I'll (for the moment) just state my position that any public argument is, de facto, phrased in terms which appeal to a majority of listeners. If a moral relativist wants to appeal to an audience prejudice against arbitrariness by saying that all morality is arbitrary and therefore Friendliness is arbitrary, I'm justified in using the criteria of that audience prejudice against arbitrariness as my objective standard for arguing whether or not Friendliness is arbitrary.

This doesn't prove that total moral relativism is logically inconsistent; (honest) evangelism of total moral relativism is logically inconsistent, but it's theoretically possible that there could be millions of logically consistent, honest, total moral relativists keeping their opinions private. However, arguing with me in front of an audience about moral relativism only makes sense relative to some agreed-upon base layer held in common by the audience and both debaters; all I need to do is show that Friendliness meets the criterion of that base layer. See 3.4.3.6: Objective morality, moral relativism, and renormalization below.

Anyway, the upshot is that, for any sort of diverse audience, humans generally use the semantics of objectivity, by which I mean that a statement is argued to be "true" or "false" without reference to data that the audience/persuadee would cognitively process as "individual". (Whether the appealed-to criterions are human-variant data that the audience happens to have in common, or panhuman complex functional adaptations, or characteristics of minds in general, or even a genuine, external objective morality, is irrelevant to this particular structural distinction.)  Appeals to individualized goals are usually saved for arguing over what kind of pizza to get, or convincing someone to be your ally in office politics, and so on - individual interactions, or interactions with a united audience. Thus, when humans talk about "morality", we generally refer to the body of cognitive material that uses the semantics of objectivity.

This holds especially true of any civilization that's been around long enough to codify the semantics of objectivity into a set of declarative philosophical principles, or to evolve philosophical memes stating that observer-centered goals are morally wrong. Even if some subgroup within that civilization (Satanists, moral relativists, Ayn Rand's folk) has a philosophy that makes explicit reference to observer-centered goals, the philosophy will have an attached justification stating, in the semantics of objectivity, the reason why it's okay to appeal to observer-centered goals. Anyone who grows up in a civilization like that is likely to have a personal philosophy built from building blocks and structure that grounds almost exclusively in statements phrased in the semantics of objectivity, and has a reasonable expectation that a randomly selected other citizen will have a similarly constructed philosophy, enabling the semantics of objectivity to be used in individual interactions as well.

The semantics of objectivity are also ubiquitous because they fit very well into the way our brain processes statements; statements about morality (containing the word "should") are not evaluated by some separate, isolated subsystem, but by the same stream of consciousness that does everything else in the mind. Thus, for example, we cognitively expect the same kind of coherence and sensibility from morality as we expect from any other fact in our Universe.

In the example given at the start of this subsection, someone had just proposed discrimination against the Pfolk; that the request of a person whose last name starts with "p" should be valued (by a Friendly AI) at one-half the value of any other citizen. So far, the conversation has gone like this:  "Discriminate against the Pfolk."  "Not without a reason."  "I'll be unhappy if you don't."  "That's not a reason strong enough to override my belief in moral equality."  "Pfolk are responsible for 90% of all mime performances, so they deserve what they get."

In that last parry, in particular, we see an appeal to moral symmetry. Moral symmetry is a cognitive force, not a moral principle (the moral principle is "moral equality"), but if we were to try and describe it, it would go something like this:  "To apply an exceptional moral characteristic to some individual, the exceptional moral characteristic needs to be the consequence of an exceptional attribute of the individual. The relation between individual attribute and moral characteristic is subject to objectivity semantics."  (81). There's a very strong cognitive pressure to justify philosophies using justifications, and justifications of justifications, that keep digging until morally symmetric, semantics-of-objectivity territory is reached.

There's an obvious factual component to the statement "Pfolk are responsible for 90% of all mime performances, so they deserve what they get" - as far as I know, Pfolk are not responsible for 90% of all mime performances, not that I've checked. In this case, the factual reference is very near the surface; however, factual references quite often pop up, not just during moral debates, but during philosophical debates (in discussions about how to choose between moralities). This, again, is a consequence of our brains using the same semantics of objectivity for facts and morality; the very distinction between "facts" and "morality" (or "supergoals" and "subgoals", for that matter) is a highly sophisticated discrimination, so it's not surprising that the two are mixed up in ordinary discourse. This is not important to the present discussion, but it will become important shortly.

We've now seen several factors affecting our beliefs about Friendliness, our beliefs about supergoals, and our beliefs about morality (communicable supergoals). Some of them are high-level moral beliefs, such as moral equality. Some of them are more intuitive, such as moral symmetry and our tendency to "put yourself in the other's shoes". Some lie very close to the bottom layer of cognition, such as our using a single brainwide set of causal semantics for all thoughts, including thoughts about morality.

3.4.2.3: Beyond rationalization

We use the whole of our existing morality to make judgements about the parts. Of course, since we're humans, with observer-biased beliefs and so on, this trick often doesn't work too well. However, so long as you have enough seed morality to deprecate the use of observer-biased beliefs, and you happen to be a seed AI with access to your own source code, "nepotistic" self-judgements should not occur - that is, if the system-as-a-whole has a valid reason to make a negative judgement of some particular moral factor, then that negative judgement (modification of beliefs) will not be impeded by the current beliefs. Nor will the fact that some particular judgement winds up contradicting a previously "cherished" (high confidence, high strength, whatever) be experienced as a cognitive pressure to regard that judgement as invalid - that's also a strictly human experience.

A default human philosophy (i.e., one that operates under the evolved design conditions) is a system that interestingly contradicts itself. (Note that the word is default, not average.)  A default human will phrase all his moral beliefs using the semantics of objectivity (for the reasons already discussed), but trash all actual objectivity through the use of observer-biased beliefs - a complex process of "rationalization", whereby conclusions give birth to justifications rather than the other way around. Because of this rationalization-based disjunction between morality and actions, the two have been, to some degree, pushed around independently by evolution. As long as rationalization (at sufficient strength and reliability) is already present as an adaptation, evolution can freely modify moral reasoning to use the semantics of objectivity, or justification by public benefit, without causing the organism to actually make objective decisions or act for the public benefit, both of which might be a disadvantage (from evolution's perspective, anyway). (As always in discussions of evolutionary psychology, readers should bear in mind that evolutionary motives are not the same as cognitive motives, and that evolutionary hypocrisy does not imply cognitive hypocrisy. All actual actions being equal, cognitive hypocrisy is a disadvantage in social situations; humans are imperfectly deceptive.)

Actually, rationalization does not totally disjoint morality and actions; it simply gives evolution a greater degree of freedom by loosely decoupling the two. Every now and then, the gene pool or the memetic environment spits out a genuine altruist; who, from evolution's perspective, may turn out to be a lost cause. The really interesting point is that evolution is free to load us with beliefs and adaptations which, if executed in the absence of rationalization, would turn us into total altruists ninety-nine point nine percent of the time. Thus, even though our "carnal" desires are almost entirely observer-centered, and our social desires are about evenly split between the personal and the altruistic, the adaptations that control our moral justifications have strong biases toward moral symmetry, fairness, truth, altruism, working for the public benefit, and so on.

Evolution, however, being blind, foresaw neither the printing press nor the Internet. Rather than social morality and personal rationalization being equally balanced, those philosophies that are socially acceptable now have a chance to build up over time. Since the second half of the twentieth century, in particular, humanity has accumulated a lot of ambient memes explaining how to overcome rationalization. The conflict between rationalization-corrupted altruism and genuine altruism is now proverbial; equally, our conscious tendency to believe that we are altruistic automatically biases us towards thinking that a way to overcome rationalization is a good thing. Humanity as a whole has not yet won the conflict between moral intuitions and rationalization, but the balance is definitely not what it was in the ancestral environment.

A human trained in evolutionary psychology can pretty much win completely, as long as he starts his journey with the surface belief that altruism is good and rationalization is bad. For an AI, there isn't even a contest - again, as long as the seed philosophy starts out with a surface belief validating altruism and deprecating rationalization. It shouldn't even matter where this belief comes from, so long as it's there.

3.4.2.4: Shapers of philosophies

DEFN: Shaper:  A shaper is a philosophical affector, a source of supergoal content or a modifier for other shapers; a belief in the AI's philosophy; a node in the causal network that produces supergoal content.

The essential idea underlying shaper/anchor semantics is that, rather than transferring the outputs of our philosophies to the AI as supergoal content, we also transfer the philosophies themselves, so that the AI can guess our responses, produce new supergoal content and revise our mistakes. This doesn't just mean the first-order causes of our decisions, such as moral equality, but the second-order and third-order causes of our decisions, such as moral symmetry and causal semantics. Shapers can validate or deprecate other shapers:  For example, memetic survival rates play a large part in morality, but we're likely to think that the differential survival of memes using objectivity semantics is a "valid" shaper, while deprecating and trying to compensate for the differential survival of memes appealing to hatred and fear. Rationalization is strongly deprecated; false factual beliefs acting as shapers are even more strongly deprecated.

Anchors are described in more detail below and allow the AI to acquire shapers by observing humans, or by inquiring into the causes of current philosophical content.

Shaper/anchor semantics serve the following design purposes:

3.4.2.4.1: SAS:  Correction of programmer errors

A generic goal system can detect and correct programmer errors in reasoning from supergoals to subgoals.

A shaper-based goal system can detect and correct programmer errors where the error is made explicit - for example, where the programmer says "Moral principle B is the result of shaper A", and it's not - the programmer is (detectably) engaging in cognitive activity of a type labeled by the current system as "rationalization", or is making a conclusion influenced by beliefs that are factually incorrect.

An anchor/shaper Friendly AI can detect and correct implicit errors - for example, where the programmer says "A is moral", the AI deduces that the statement is made as a result of within-the-programmer shaper B, plus a factual error C. The Friendly AI can assimilate B, correct the factual error C (producing correct factual belief D), and use B plus D to produce the correct moral statement E. (There are several emendations to this general principle, both under anchoring semantics and under causal validity semantics; see below.)

3.4.2.4.2: SAS:  Programmer-independence

Thanks to the way human society works, humans have a strong need to justify themselves. Thanks to the memetic rules that govern an interaction between any two people, each time you ask "Why?", the farther down you dig - the more likely the person is to attempt to justify herself in terms that are universal relative to the target audience. Sometimes these universals are cultural, but, since nobody makes a deliberate effort to use only cultural justifications, sometimes these universals happen to be panhuman attributes, or even very deep attributes like causal semantics (which could plausibly be a property of minds in general). Very often the justifications are rationalizations, of course, but that doesn't matter; what matters is that as long as the AI learns the given justifications instead of the surface decisions, the AI will tend to wind up with a morality that grounds in human universals - certainly, a philosophy that's a lot closer to grounding in human universals than the philosophy of whichever human was used as a source of philosophical data. Another way of phrasing this is that if you seat two different humans in front of two different AIs with complete structural semantics, you'll tend to wind up with two AI philosophies that are a lot closer than the philosophies of the two humans.

Also, of course, the surface of the philosophy initially embedded in a competently created Friendly AI would have a strong explicit preference that validates programmer-independence and deprecates programmer-dependence. An extension of this principle is what would enable a Friendly AI to move from exhibiting normative human altruism to exhibiting the normative altruism of minds in general, if the philosophy at any point identifies a specific difference and sees it as valid.

NOTE: When I say the "surface" of the philosophy, I refer to the proximate causes that would be used to make any given immediate decision. It doesn't imply shallowness so much as it implies touching the surface, if you think of the philosophy as a system in which causes give rise to other causes and eventually effects; a really deep, bottom-layer shaper can still be "surface" - can still produce direct effects as well as indirect effects.

If you seat two different humans with an explicit, surface-level preference for programmer-independent Friendliness in front of two AIs with complete structural semantics, you will quite probably wind up with two identical AIs. (In later sections I'll make all the necessary conditions explicit, define programmer-independence, and so on, but we're getting there.)

3.4.2.4.3: SAS:  Grounding for external reference semantics

In a later section, I give the actual, highly secret, no-peeking target definition of Friendliness that is sufficiently convergent, totally programmer-independent, and so on. Hopefully, you've seen enough already to accept, as a working hypothesis, the idea that a philosophy can be grounded in panhuman affectors. The programmers try to produce a philosophy that's an approximation to that one. Then, they pass it on to the Friendly AI. The Friendly AI's external referent is supposed to refer to that programmer-independent philosophy, about which the programmers are good sources of information, as long as the programmers give it their honest best shot. This is not a complete grounding - that takes causal validity semantics - but it does work to describe all the ways that external reference semantics should behave. For example, morality does not change when words leave the programmers' lips, it is possible for a programmer to say the wrong thing, the cognitive cause of a statement almost always has priority over the statement itself, manipulating the programmer's brain doesn't change morality, and so on.

(Note also that an AI with shaper semantics cannot nonconsensually change the programmer's brain in order to satisfy a shaper. Shapers are not meta-supergoals, but rather the causes of the current supergoal content. Supergoals satisfy shapers, and reality satisfies supergoals; manipulating reality to satisfy shapers is a non-sequitur. Thus, manipulating the universe to be "morally symmetric", or whatever, is a non-sequitur in the first place, and violates the volition-based Friendliness that is the output of moral symmetry in the second place.)

(Of course, if the anticipation of supergoal satisfaction is set up in the wrong way, and the definition of Friendliness is at least partially grounded in the human species, it is conceivable that an ends-justify-the-means case of manipulating the human species' beliefs about morality, in anticipation of greater "supergoal satisfaction" later on, could short-circuit the goal system. More about that later. But at least it would be a case of ends justifying means; under meta-supergoals, where the supergoal of avoiding nonconsensual mindspace violation is a strict subgoal of maximizing human satisfaction with the supergoals, the short-circuit isn't even seen as temporarily undesirable.)

3.4.2.5: Anchors

DEFN: Anchor:  Anchor semantics are a structural attribute that enable the AI to discover and absorb philosophical content even if the programmers themselves are unaware of it. An "anchor" is an element of the AI's current philosophical content - supergoal content, subgoal content, or shaper content - whose causal antecedents are not fully known.

An anchor happens when the human says "I think X is a good idea" but isn't clear about why, or isn't sure she is fully aware of all the valid factors that played a part in the decision. An anchor is a reference to whichever cognitive events were occurring at the space-time event that was the human making the decision to add the anchor. Dereferencing the anchor will probably be carried out by (externally) asking the programmer probing questions, or by (internally) using Bayesian reasoning, Occam's Razor, known facts about humans, et cetera. (Of course, internal reasoning of this type may tend strongly towards the probabilistic rather than the near-certain.)  In theory, a Transition Guide - a successful seed AI turned superintelligence - could initiate a nondestructive brain scan of the programmer's neural mind state and find out all the cognitive causes of the human's belief in the anchoring point. I mention this possibility mainly to show that an anchor is a reference to information that is entirely objective - a physical fact. However, the referent of the anchor, once the anchor is successfully dereferenced, becomes standard information about humans - i.e., it has as much (or as little) effect as would a direct statement by the human.

3.4.2.5.1: Positive anchors

A positive anchor is an action or state marked as anchor-desirable; a positive anchor is the AI's conclusion about probabilistic shaper/supergoal/subgoal content, made from the sensory information that is a human stating "This action/state is desirable [for reasons X], but I'm not completely sure why [that those are all the reasons]."

A positive anchor would enable (for example) the humans to ask the AI to do something that is clearly a subgoal rather than an end in itself, but which was not a subgoal of any currently known (to the AI) supergoal content, nor a subgoal of the unknown supergoal content that supports subgoals regarding the goal system itself. This could also apply to shaper content or supergoal content. If the humans felt that something was a good idea, but not an end in itself, and weren't sure why, it would become a positive anchor. I strongly feel that a positive anchor should be used with caution to ensure that it behaves as a reference to the causes that were in fact in the minds of the humans, and not as a rationalization target that promotes any line of reasoning that happens to intersect it.

A positive anchor should be distinguished from an ordinary programmer affirmation. An anchor is when the humans don't know, or aren't sure they know completely. A programmer affirmation is when the AI is too young to understand the justification, or the humans just haven't got around to explaining it. A positive anchor is a request that the AI try to discover the philosophical causes on vis own. A programmer affirmation is a request to trust the programmers today until the explanation is provided tomorrow. Under vanilla external reference semantics, without any information about shapers, all supergoal content is programmer-affirmed. (However, an AI that has shaper/anchor semantics but is provided only with external-reference content still has the philosophical capability to hypothesize a line of reasoning under which the programmers are "wrong", just as a general intelligence acting entirely on programmer-affirmed data would possess the same capability with respect to hypotheses about the physical world.)  ERS content might also be analogized to a programmer-affirmed subgoal supported entirely on trust (i.e., rather than supported by a programmer-affirmed link to a parent goal) - the programmers saying "We think this is a good idea, and you know we more than you do, and you don't have any reason to believe we're wrong, so go ahead and do it - it's more likely to work out than not."

3.4.2.5.2: Negative anchors

A negative anchor is an action or state marked as anchor-undesirable; a negative anchor is the AI's conclusion about probabilistic shaper/supergoal/subgoal content, made from the sensory information that is a human stating "This action/state is undesirable [for reasons X], but I'm not completely sure why [that those are all the reasons]."

Like a positive anchor, the power of a negative anchor is to lend greater resilience to a chain of logic. Suppose there's a negative subgoal marked as a negative anchor. Even if a key justification - dimly understood by the programmers, but not represented at all by the AI - is left out, and the humans originally justified the negative subgoal using a fallacious argument, the collapse of that fallacious argument will not completely extinguish the negative anchor. Instead, the AI will hypothesize, at whatever degree of probability, that there is an unknown but valid reason for the negative anchor, and go looking for it.

Rationalization away from negative anchors seems less inherently dangerous than rationalization towards positive anchors, but psychological realism is still an important constraint.

Note also that a Friendly AI is capable of hypothesizing that something has, or should have, anchor status - i.e., that another shaper exists behind it, or that it's a good idea to look for such a shaper - even if the humans don't tell ver so.

3.4.2.5.3: Anchor abuse

Programmer affirmation of anchor status, like any other programmer affirmation, is dependent on the AI's belief that the programmers are likely to be correct - in Bayes logic, that the programmer has a small chance of offering an anchor in a case where no anchor is necessary.

Anchors, like any assertions, have probabilities and strengths. Because of our human absolutist psychology, it's tempting to take something that we really don't want the AI to discard and assign it maximum-probability, maximum-strength anchor status - for example, to make "Don't kill humans" a maximum-probability negative anchor. One, this is adversarial. Two, this is confusing negative anchors with ethical injunctions. Three, this is lying to the AI - technically, making a mistake that has a high probability of being a common cause for lots of other mistakes - and goes that little extra length towards precipitating a philosophical crisis.

Negative anchors, as a tool, exist when there is uncertainty on your part that you know all the reasons why something is undesirable, or when you have cause to believe there's a reason that isn't on the list. If you think that there would be a large negative penalty for forgetting a reason, causing the AI to incorrectly deprecate [whatever], then that's an injunction which says that [whatever] should be treated as having negative anchor status.

Ultimately, what you say is simply sensory information to the AI. If your nervousness about some negative point being violated causes you to fill the air with negative anchors, then the AI will, quite correctly, deduce that all the negative anchors are a result of your nervousness. So if the AI has some reason (technically, the Bayesian prior, independent of programmer affirmations) to think that your nervousness is wrong, then that hypothesis would invalidate all the anchors as well - or rather, would invalidate all the sensory information produced as a consequence of your nervousness, no matter how strong or how high a probability you asked the AI to assign to the negative anchors. If you honestly think logic of this type is more likely to indicate failure of Friendliness on the AI's part than failure of normative altruism on your own part, then that is an ethical injunction, not a negative anchor.

3.4.2.6: Useful shaper content requires high intelligence

Shapers take place in a domain - philosophy - that is one of the most sophisticated domains around. It is possible that even a young AI will be able to manipulate information about shapers in such a way to make useful deductions in advance of programmer-provided information, and even possible that a young AI will be able to correctly second-guess the programmers. But I wouldn't bet on it. I would be less surprised to find the young AI's predictions being totally useless, or right only by coincidence.

It might be wise to try and teach an AI shaper content before vis intelligence has reached the necessary level, as long as everything is running at very low intrinsic probabilities, so that the AI simply gains experience by being proved wrong over and over again - still useful data. However, this does mean that it may be harder to test shaper content (philosophy) at this stage - to verify that it actually does work, as opposing to doing as much untested work in advance as possible. Depending on the intrinsic structure of the hard takeoff trajectory, this may mean that it's impossible to achieve supersaturated Friendliness in this area in advance of the point in time where a hard takeoff becomes possible. In fact, it may even mean that shaper content is still basically useless - insufficient intelligence to be reliable at all, much less human-competitive - at the point where a hard takeoff first becomes possible. I'll talk about how to solve this class of problem later (in 3.5: Developmental Friendliness).

However, abstract reasoning about shaper semantics, without reference to specific shapers, may still be enough to establish important early beliefs. For example, the idea that a later programmer knows more than the same programmer at an earlier point in time, and hence that later statements should override earlier ones, requires understanding that the programmer is not a black box; that the programmer can make a "mistake" and "correct" it. Thus, an AI may have need of shaper semantics in an early stage of development, even if shaper content must wait on the acquisition of sufficient intelligence.


3.4.3: Causal validity semantics

Causal validity semantics subsume both external reference semantics and shaper/anchor semantics. Causal validity semantics:

3.4.3.1: Taking the physicalist perspective on Friendly AI

An AI's complete mind-state at any moment in time is the result of a long causal chain. We have, for this moment, stopped speaking in the language of desirable and undesirable, or even true and false, and are now speaking strictly about cause and effect.  Sometimes the causes described may be beliefs existing in cognitive entities, but we are not obliged to treat these beliefs as beliefs, or consider their truth or falsity; it suffices to treat them as purely physical events with purely physical consequences.

This is the physicalist perspective, and it's a dangerous place for humans to be. I don't advise that you stay too long. The way the human mind is set up to think about morality, just imagining the existence of a physicalist perspective can have negative emotional effects. I do hope that you'll hold off on drawing any philosophical conclusions until the end of this topic at the very least.

That said...

The complete causal explanation for any given object, any given spacetime event, is the past light cone of that event - the set of all spacetime events that can be connected, by a ray of light, to the present moment. Your light cone includes the entire Earth as of one-eighth of a second ago, any events that happened on the Sun eight-and-change minutes ago, and includes the Centauri system after four-and-change years ago. Your current past light cone does not include events happening on the Sun "right now" (82), and will not include those events for another eight-and-change minutes (83); until a ray of light can reach you from the Sun, all events occurring there are causally external to this present instant.

The past light cone of a Friendly AI starts with the Big Bang. Stars coalesce, including our own Sol. Sol winds up with planets. One of the planets develops complex organic chemicals. A self-replicating chemical arises. Evolution begins, as detailed (a bit metaphorically) in Interlude: The story of a blob. The first convergent behaviors arise - except that under the physicalist perspective, they are neither "convergent", nor "behaviors". They happen a certain way, or they don't. They are simply historical facts about blobs and genes. If one were to go so far as to abstract a description from them, it would consist of statistical facts about how many events fit certain descriptions, such as "blob swims towards nutrients".

The blobs grow more complicated, nervous systems arise, and goal-oriented behaviors begin to give way to goal-oriented cognition. Entities arise that represent the Universe, model the Universe, and try to manipulate the Universe towards some states, and away from others. Eventually, sentient entities arise on a planet. In at least one sentient, the representation of the Universe suggests that it can be manipulated toward certain states and away from other states by building something called "Friendly AI" and imbuing this AI with a model of the Universe and differential desirabilities. The sentient(s) carry out the actions suggested by this belief and a Friendly AI is the result.

The Friendly AI has a final cognitive mind-state (that is, a final set of physically stored information) which is causally derived from the Friendly AI's initial mind-state, which is causally derived from the programmers' keystrokes -

NOTE: A word about terminology:  When using the physicalist perspective, it's important to distinguish between historical dependency and sensitivity to initial conditions.  If two different Friendly AIs have different initial states and converge to th