The term "Friendly AI" refers to the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals. Present-day AIs are enormously inferior to humans in almost every capacity and do not possess the capability to significantly harm or benefit humans. Yet growth in AI intelligence - though slow by the standards of most technologies - is astronomically faster than the rate of human evolution. There are also powerful theoretical reasons to believe that AI growth rates can move quickly compared to human cultural evolution. These reasons include steady exponential growth in underlying computing power, recursive self-improvement in self-modifying AIs, and the maximum switching speed of transistors relative to neurons. Although some presently consider it controversial whether real-world Artificial Intelligence can be achieved at all, let alone whether AIs will someday exceed human capacities, the need for advance planning is established by a strong theoretical argument for the possibility. The SIAI Guidelines on Friendly AI are produced by the Singularity Institute for Artificial Intelligence, Inc., a 501(c)(3) nonprofit corporation. The field of Artificial Intelligence is presently only beginning to explore the problems bound up in Friendly AI. Thus, the Guidelines do not currently represent an academic consensus or an industry standard. Rather, the Singularity Institute's commitment to Friendly AI is intended as a focal point around which debate and consensus can accrete. Friendly AI is a frontier AI challenge as well as a public safety issue, and creativity may turn out to be more in demand than standardization, but there is still a definite public safety benefit in the open sharing of any concrete suggestions for Friendly AI. 
Our development of the Guidelines was sparked by a theoretical analysis of Friendly AI which suggested several identifiable features of human cognition that would need to be duplicated in order to achieve Friendly AI, suggested specific design methods and cognitive architectures, and suggested that Friendship features might need to be implemented early in the course of AI development for maximum safety and to ensure forward compatibility with later versions. Furthermore, debates about the dangers and benefits of AI and other advanced technologies have recently begun to appear, with increasing frequency, in academic and public venues; thus, it is of immediate importance whether a strong theoretical case can be made for the feasibility of Friendly AI. Making safety recommendations for Artificial Intelligence is a unique challenge because the problem of Friendly AI is inextricably intertwined with the problem of AI itself. Creating Friendly cognition requires creating cognition. In other technologies where the need for safety guidelines is recognized, the safety guidelines are simpler, more obvious, and less controversial than the technical and scientific challenges of the field's frontiers.
Example: Although biotechnology itself is still a rapidly growing science, the NIH Guidelines on Recombinant DNA precisely describe multiple levels of risk and provide detailed, technical instructions for containment of each risk group. Thus, although mandatory only for federally funded programs, the NIH Guidelines continue to be voluntarily and universally accepted within the biotechnology industry.
Friendly AI, by contrast, is a challenge which lies at the frontiers of AI. Thus, these Guidelines are not intended as a proposal for future regulation or legislation. The current state of AI is such that it would be impossible to create a human-equivalent AI, or even a workable theory of AI, by appointing a panel of experts. A panel of experts would be unlikely to agree even on first principles. Any project that succeeds in developing AI has demonstrated exceptional competence, more than would be expected from a group selected by any other criterion. Thus, it would be very dangerous to take away the responsibility for implementing Friendliness, or even the responsibility for developing a basic theory of Friendliness, from whichever AI project is first successful in developing real AI. It is simply not possible, given the current condition of the field, to convene a committee to solve an AI problem, and Friendliness is a frontier AI problem as well as a public safety issue.
Example: The Foresight Guidelines on Nanotechnology are designed to ensure the safety of a technology which does not yet exist, and the Foresight Institute acknowledges that its recommendations are probably only a small subset of the safety precautions needed, but the Foresight recommendations are both simple and obvious in retrospect. One such recommendation is that molecular blueprints, and especially the blueprints of manufacturing devices, should be encrypted in such a way that any transmission error between memory storage and manufacturing randomizes the blueprint.
Although local AI projects should have final authority, local projects should nonetheless be aware of that authority. Future projects may have their own theories of Friendly AI, but they should be aware of their responsibility to have some theory of Friendly AI. If a particular safeguard is generally held to be a good idea, a project that decides not to implement the safeguard should have made the deliberate and explicit decision that the safeguard is unworkable, or unsafe, or incompatible with their theory of Friendly AI, or impossible under their cognitive architecture. Any sufficiently advanced AI project needs to be "Friendliness-aware". Such awareness is currently nonexistent. This is not currently as dangerous as it might be - the lack of safety awareness does not present the immediate crisis it would present in a more developed technology. Nearly all present-day AI projects are not "sufficiently advanced"; they are neither real-world AIs nor the intended precursors of real-world AIs. An AI that is not self-improving, and is not intended to become self-improving, probably does not need to implement the Guidelines' recommendations. An AI that does not possess a sufficiently general cognitive architecture cannot implement the Guidelines' design recommendations. But regardless of when specific Friendly AI features become necessary, we believe any AI project that states the future goals of general intelligence and self-improvement thereby incurs the responsibility to be Friendliness-aware.
Theoretical grounds for analyzing Friendly AI are drawn from
existing theories of normative decision-making and evolutionary
psychology. Humans are presently the only subjects of cognitive
science - the only intelligent systems that have been studied - but
modern theories of human cognition are sophisticated enough that a
principled attempt can be made to adjust the human theories for other
types of mind. It is possible to link effects to causes, and to
distinguish between causes that are unique to humans, causes that
carry over to minds in general, and causes whose presence or absence
is a design decision.

Unfortunately, humans are also the only
researchers of cognitive science. As humans, we have built-in,
hardwired assumptions about other minds. In our ancestral environment,
all other intelligent entities were humans sharing our built-in
emotional and cognitive architecture. We are thus adapted to expect,
in others, what is "natural" for us; we are adapted to expect human
behaviors from minds in general because humans were the only minds
present in our ancestral environment. Even today, humans are the only
form of intelligent life of which we have experience, thus depriving
us of perspective. Our experience tends to indicate that
anthropomorphism - the inappropriate application of human-anticipating
instincts or human-descriptory experience to nonhuman minds - is the
single greatest source of human error in forward analysis of AI
psychology, and in Friendly AI especially. Because our social
instincts are emotional instincts, anthropomorphic errors often carry
with them a weight of emotional investment, making them unusually hard
to dispel. A detailed analysis of common anthropomorphisms is beyond
the scope of the Guidelines; please see Creating Friendly
AI and CFAI 2: Beyond
anthropomorphism.

Once anthropomorphism is dispensed with, the
task of creating a Friendly AI proves to bear no resemblance to the
task of ensuring ethical behavior in a possibly hostile human, or even
the task of instilling ethical behavior in a growing human
child. Human analogies are dangerous, both because they assume far too
much built-in positive functionality, and because they warn against
negative outcomes resulting from human behaviors probably not
shared by an AI.

It is a truism in AI that researchers, as humans,
tend to notice those problems that are difficult for humans and that
rise to the level of our conscious attention. Tasks that are
automatically handled by our preconscious systems do not rise to our
conscious attention, even if the tasks are extremely complex, or are
prerequisite to the solution of the conscious problem at
hand. Typically such preconscious tasks come to the attention of AI
researchers only after years of failure to solve the high-level problem without
first implementing the prerequisite low-level cognition. Where the
cognition at hand is moral cognition, the semantics of human moral
disputes exacerbate the problem. The first class of error is the
assertion of objectivity, which results in the programmer perceiving
non-automatic functionality as "natural" or "obvious". This leads to
the non-implementation of positive functionality. The second
class of error is the assertion of arbitrariness, which interferes
with programmer perception of error correction, context sensitivity,
and design elegance. This leads to the non-prevention of
negative functionality.

The conclusion offered by Creating
Friendly AI is that Friendliness is neither automatic nor
arbitrary. That is a prerequisite condition for the existence of any
Guidelines - effort is required to create a Friendship system,
and there are constraints upon what can be created. It is necessary
to take action and possible to make mistakes. However, it does not
follow that Friendly AI researchers must make zero mistakes or that
they must solve the entire problem immediately.

A fundamental problem
of AI is creating an AI that can, given some threshold of ability, acquire
further abilities on its own - either through humanlike learning or
"seed AI" self-improvement. The
task of building a Friendly AI that makes a certain decision correctly
is the problem of Friendship content. The task of
building a Friendly AI that can learn Friendliness is the
problem of Friendship acquisition. The task of building a
Friendly AI that wants to learn Friendliness is the problem of
Friendship structure. The content and acquisition
problems are similar to other AI problems of acquiring, improving, and
correcting skills, abilities, competences, concepts, and beliefs. The
structural problem is unique to Friendly AI. The acquisition
problem is probably harder than the structural problem, but solving
the general acquisition problem is prerequisite to the creation
of AIs advanced enough to require Friendliness. The more powerful and
intelligent the AI, the more Friendliness content is required; but
also, in turn, the higher the level of intelligence that can be
assumed to be turned toward acquiring Friendliness - so long as that
AI chooses to acquire Friendliness. The onset of the need for
Friendship content is defined by the timing of the need to make
real-world decisions that may benefit or harm humans. The onset of
the need for Friendship structure is defined by the onset of
the AI's ability to resist human manipulation if the AI does
not see that manipulation as desirable. Given an AI with the
structural Friendliness needed to accept human advice in situations
where programmer competence in Friendliness exceeds the AI's own, a
structurally correct AI needs only that threshold level of Friendship
content required to know when to ask for advice; this is not true
"competence" by the standards of AI, but it is safety.

Because
of the extremely high stakes associated with the creation of novel
intelligent entities, it is necessary to be conservative in estimating
how much Friendship content and structure is required at a given point
in time. "Conservative", for Friendly AI, has the opposite polarity
of "conservative" for AI in general; it means attempting to set
upper bounds on the AI's potential rather than lower
bounds on the AI's current abilities.

The Singularity Institute
presently distinguishes two conservative methods for
Friendship-preparedness. The first method is "supersaturated"
Friendliness, in which the maximum possible amounts of Friendship
content and structure are infused; as soon as it becomes possible for
an AI to usefully represent a Friendship feature, that feature is
implemented. The second method is to pursue a "90/10" strategy for
Friendship content and a "one step ahead" strategy for Friendship
structure. It is proverbial in computer programming that the last 10%
of the functionality requires 90% of the effort; thus, "90/10" refers
to the strategy of implementing that 90% of Friendship content that
requires 10% of the effort. "One step ahead" implies a development
schedule divided into stages, with a given feature for structural
Friendliness scheduled for completion at least one stage in advance of
the stage where that feature is (conservatively) expected to become
necessary. (Again, Friendship content becomes necessary in response to
real-world capabilities; Friendship structure becomes necessary in
response to internal capabilities for self-modification or cognitive
content modification.)

Supersaturated Friendliness is the safest
policy, and also ensures maximal forward compatibility by implementing
architectural features as early as possible. In an ideal world, all
projects seriously striving for a sufficiently advanced AI would
subscribe to the ideal of supersaturated Friendliness. 90/10
Friendliness would be reserved for AI projects trying for general
intelligence or self-improvement, but without the explicit goal of
real-world independent planning or transhumanity. In practice, the
distinction between supersaturated Friendliness and 90/10 Friendliness
is more likely to reflect the distinction between non-profit and
for-profit projects; or the distinction between well-funded and
shoestring projects; or the distinction between projects that strongly
believe in the need for Friendly AI, and projects that were persuaded
to implement some minimal level of Friendliness "just in
case". However, we maintain that anything less than 90/10
Friendliness should probably not be considered
Friendliness-aware. Furthermore, forward compatibility may require
implementation of cognitive architectures in advance of when those
architectures become directly necessary to Friendship structure. What
is to be particularly avoided is the cognitive equivalent of a Y2K
bug; a design requirement which is trivial to fulfill if anticipated
in advance, but which is difficult and expensive if there already
exists an "installed base" of source code or cognitive content. Thus,
a Friendliness-aware AI project should be conscious of all
architectural features currently predicted to be later required, no
matter how far off.

Where should efforts in Friendly AI research be
concentrated? Friendship structure and acquisition are more unusual
problems than Friendship content. Friendship structure and acquisition
are closer to the design level and involve a more clearly defined
amount of complexity. (Consider the difference between the bounded
amount of adapted complexity required for humans to form memories, and
the vast amount of complex data contained in all the memories formed
over a lifetime.) Friendship structure and acquisition are
closer to the level of underlying cognition, and are thus less likely
to be visible to naive introspection, which suggests that these areas are
likely to be underserved in existing speculations. Furthermore,
Friendship architecture represents the capability needed to recover
from programmer errors. Since programmer error is nearly certain,
showing that a threshold level of architectural Friendliness can
naturally handle a very wide range of errors is prerequisite to making
a theoretical argument for the feasibility of Friendly AI. The more
robust the Friendship architecture, the less programmer competence
need be postulated in order to argue the practical achievability of
Friendliness. Thus, the Guidelines' specific design
recommendations focus on architecture.
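The "one step ahead" rule for Friendship structure described above amounts to a simple scheduling constraint. The sketch below is illustrative only: it assumes development stages can be numbered with integers, and the `StructuralFeature` type, its fields, and the `lead` parameter are hypothetical names, not part of the Guidelines.

```python
from dataclasses import dataclass


@dataclass
class StructuralFeature:
    """A Friendship-structure feature on a staged development schedule (hypothetical)."""
    name: str
    completion_stage: int  # stage by which the feature is planned to be finished
    needed_stage: int      # stage at which it is (conservatively) expected to become necessary


def one_step_ahead_violations(features, lead=1):
    """Return names of features not scheduled at least `lead` stages before they are needed."""
    return [f.name for f in features if f.completion_stage > f.needed_stage - lead]


plan = [
    StructuralFeature("probabilistic supergoal content", completion_stage=1, needed_stage=2),
    StructuralFeature("causal validity semantics", completion_stage=3, needed_stage=3),
]
print(one_step_ahead_violations(plan))  # ['causal validity semantics']
```

The hard part in practice is the conservative estimate of `needed_stage`; the check itself only makes a missed lead time visible.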
3: Design
The specific design recommendations found in the Guidelines are highly condensed summaries of the features listed in "Features of Friendly AI", which in turn are highly condensed summaries of the discussions found in section 3 of Creating Friendly AI, "Design of Friendship systems". For more information see "Features of Friendly AI" or Creating Friendly AI.
1. Friendliness-topped goal system.
Friendliness should not be an afterthought or a tacked-on "control system"
that rules out certain actions. Friendliness should be the sole top-level
goal ("supergoal") within the system. Other goals,
such as "self-improvement", should be subgoals deriving their desirability
from the desirability of Friendliness. For example, self-improvement
is predicted to lead to a more effective future AI, which, if the future
AI is Friendly, is predicted to lead to greater fulfillment of the Friendliness
supergoal. Friendliness does not overrule other goals; rather,
all other goals' desirabilities are derived from Friendliness.
This is not the human organization, but it appears
to embrace all necessary behaviors. If a programmer correctly sees
a behavior as necessary and nonharmful to the existence and growth of a (Friendly)
AI, then the behavior is, for that reason, a valid subgoal of Friendliness.
The necessity of such a behavior may be supported by programmer affirmation
of the link, even if the link is not understandable to an early AI.
We recommend (to Friendliness-aware projects)
the use of a Friendliness-topped goal system in any AI having a declarative
goal system.
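As a rough illustration of a Friendliness-topped goal system, the sketch below gives intrinsic desirability only to the single supergoal and lets every subgoal derive its desirability from its parent. It assumes a simple tree and treats derived desirability as parent desirability multiplied by a predicted link strength; both choices are simplifying assumptions, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Goal:
    """A node in a Friendliness-topped goal tree (hypothetical sketch)."""
    name: str
    parent: Optional["Goal"] = None
    link_strength: float = 1.0  # predicted degree to which this goal serves its parent, 0..1
    intrinsic: float = 0.0      # nonzero only for the single supergoal

    def __post_init__(self):
        # Only the top-level goal may carry desirability of its own.
        if self.parent is not None and self.intrinsic != 0.0:
            raise ValueError(f"subgoal {self.name!r} may not have intrinsic desirability")

    def desirability(self) -> float:
        # A subgoal has no desirability except what flows from the supergoal above it.
        if self.parent is None:
            return self.intrinsic
        return self.link_strength * self.parent.desirability()


friendliness = Goal("Friendliness", intrinsic=1.0)
self_improvement = Goal("self-improvement", parent=friendliness, link_strength=0.6)
better_compiler = Goal("improve own compiler", parent=self_improvement, link_strength=0.5)
print(better_compiler.desirability())
```

Note that nothing here "overrules" self-improvement; it simply has no value except the value it inherits from Friendliness.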
2. Cleanly causal goal system.
A "causal" goal system is one in which desirability is the inverse of prediction;
that is, if action A is predicted to lead to desirable state B, then B's desirability
will flow to A. In a cleanly causal goal system, there are no other
sources of desirability, and all desirability flows from supergoal content.
Furthermore, desirability in a cleanly causal goal system is not perseverant.
If desirability flows from B to A, then the continuing desirability of A
is strictly contingent on the continuing desirability of B and on the continuing
strength of the causal link between A and B. Thus, a behavior which
is usually a subgoal of Friendliness, but which has unFriendly consequences
in some specific case, will not be seen as desirable in that case.
In a cleanly causal goal system, the property "desirable" behaves identically
to the property "is-predicted-to-lead-to-supergoal".
We recommend (to Friendliness-aware projects)
the use of a cleanly causal goal system in any AI having a causal goal system.
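The non-perseverant property can be sketched by recomputing an action's desirability from current predictions every time it is queried, with nothing cached. The world model, outcome names, and supergoal values below are hypothetical and hard-coded purely for illustration.

```python
def predict(action, situation):
    """Hypothetical world model: outcome -> probability. Hard-coded for illustration."""
    if action == "acquire_compute":
        if situation == "idle_cluster":
            return {"faster_progress": 0.9, "no_effect": 0.1}
        if situation == "hospital_servers":
            return {"faster_progress": 0.2, "patients_harmed": 0.8}
    return {"no_effect": 1.0}


# Hypothetical degree to which each outcome fulfills the Friendliness supergoal.
SUPERGOAL_VALUE = {"faster_progress": 1.0, "no_effect": 0.0, "patients_harmed": -100.0}


def desirability(action, situation):
    """Expected supergoal fulfillment, recomputed from current predictions on every call.

    Nothing is cached, so an action stays desirable only as long as its
    predicted link to the supergoal does.
    """
    return sum(p * SUPERGOAL_VALUE[outcome]
               for outcome, p in predict(action, situation).items())


print(desirability("acquire_compute", "idle_cluster"))      # positive
print(desirability("acquire_compute", "hospital_servers"))  # negative
```

The same action that is usually a subgoal of Friendliness is undesirable in the one situation where it is predicted to harm people.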
3. Probabilistic supergoal content.
If supergoals are certain or "correct by definition", a self-modifying AI
may resist all alterations to supergoals. For example: If the
supergoal is to achieve A, then switching the supergoal to achieving B is
obviously not an action that serves A. If, however, the supergoal has
a "90% probability" of being A, and new information arrives indicating that
the supergoal is actually B, then no conflict should exist. The capacity
to make changes to supergoal content will be perceived by the AI as desirable
only if there is uncertainty in the current supergoals; such a capacity cannot
inherit desirability from any specific supergoal content.
We recommend (to Friendliness-aware projects)
the use of probabilistic supergoal content for any declarative goal system
capable of learning, and we recommend that this feature be implemented at
least one stage in advance of the system becoming capable of reflective reasoning
or self-modification.
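The argument for probabilistic supergoal content can be checked with ordinary Bayesian updating. In this minimal sketch, the hypotheses "A" and "B" and the likelihood figures are made up; the point is only that a supergoal held with certainty cannot be moved by any evidence, while a 90% supergoal can.

```python
def update_supergoal_beliefs(prior, likelihood):
    """Bayes' rule over hypotheses about what the supergoal content really is.

    prior:      hypothesis -> P(hypothesis)
    likelihood: hypothesis -> P(evidence | hypothesis)
    """
    unnormalized = {h: p * likelihood.get(h, 0.0) for h, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: w / total for h, w in unnormalized.items()}


# Hypothetical evidence: the programmers report that they actually meant B.
evidence = {"A": 0.05, "B": 0.95}

uncertain = update_supergoal_beliefs({"A": 0.9, "B": 0.1}, evidence)
certain = update_supergoal_beliefs({"A": 1.0, "B": 0.0}, evidence)
print(uncertain)  # B now outweighs A
print(certain)    # {'A': 1.0, 'B': 0.0}: a certain supergoal cannot be moved
```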
4. Acquisition of Friendliness sources.
A growing AI should acquire the cognitive complexity used by humans to make
decisions about Friendliness, so that the AI can model, anticipate, and eventually
improve on the decisions of the human programmers. The AI should preferentially
acquire that cognitive complexity which is designated as "valid" - by the
human programmers, or by the AI's current store of cognitive complexity.
That is, the AI should only acquire "good" sources of Friendliness decisions.
We recommend (to Friendliness-aware projects)
the implementation of Friendship acquisition architectures at least one stage
in advance of when the system is expected to acquire the intelligence necessary
to begin making actual hypotheses about Friendliness sources.
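A very small sketch of preferential acquisition: candidate sources of Friendliness decisions are kept only if they are designated valid. It assumes, for illustration, that an explicit programmer designation takes precedence and that the AI's current judgment decides otherwise; the types, the example sources, and the stand-in judgment function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CandidateSource:
    """A candidate source of Friendliness decisions observed in the programmers (hypothetical)."""
    description: str
    programmer_validity: Optional[bool] = None  # None: the programmers have not ruled on it


def select_sources(candidates: List[CandidateSource],
                   ai_judges_valid: Callable[[CandidateSource], bool]) -> List[CandidateSource]:
    """Keep only sources designated valid (programmer designation first, by assumption)."""
    accepted = []
    for source in candidates:
        if source.programmer_validity is not None:
            valid = source.programmer_validity
        else:
            valid = ai_judges_valid(source)
        if valid:
            accepted.append(source)
    return accepted


def current_judgment(source: CandidateSource) -> bool:
    # Stand-in for the AI's existing store of cognitive complexity.
    return "fairness" in source.description


candidates = [
    CandidateSource("concern for avoiding harm to others", programmer_validity=True),
    CandidateSource("irritability when the programmer is tired", programmer_validity=False),
    CandidateSource("intuitions about fairness"),
]
print([s.description for s in select_sources(candidates, current_judgment)])
```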
5. Causal validity semantics.
(Note: This is a very advanced topic.) Causal validity semantics
requires that the AI model the causal process that led to the AI's creation
and that the AI use its existing cognitive complexity (or programmer assistance)
to make judgements about the validity or invalidity of factors in that causal
process. Causal validity content would provide the cognitive framework
for an AI to decide how it "should have been" built, to perceive as erroneous
the unintended consequences of a programmatic architecture, to detect
very deep errors on the part of the programmers, or to perceive as desirable
the transition to new Friendliness architectures. Causal validity structure
provides the conceptual capacity to see, as possibly desirable, architectural
changes as well as content changes. The cognitive process enabling
these benefits is the AI's causal model of the AI's creation, in which the
programmers' intentions are designated as the cause of the AI's goal
system, and the validity of the goal system content (and architecture) is
derived from and contingent on the validity of the intentions of the programmers.
This enables the AI to see shifts in expressed programmer intentions as possibly
implying the desirability of changes to the goal system architecture, not
just goal content.
We recommend (to Friendliness-aware projects)
the implementation of causal validity semantics at least one stage in advance
of the system becoming generally intelligent. We note that this is
a highly advanced feature.
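Causal validity semantics is far beyond a few lines of code, but its core bookkeeping can be hinted at. In this sketch each item of goal content records the causal factors that produced it, and content is flagged as suspect if any factor is judged invalid. Binary validity, content-only scope (no architecture), and all names here are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Factor:
    """A step in the causal history that produced some goal content (hypothetical)."""
    name: str
    valid: bool  # the AI's (or programmers') judgment of whether this cause is legitimate


@dataclass
class GoalContent:
    statement: str
    causes: List[Factor] = field(default_factory=list)


def flag_invalid_content(contents: List[GoalContent]) -> List[str]:
    """Content is suspect if any factor in its causal history is judged invalid."""
    return [c.statement for c in contents if not all(f.valid for f in c.causes)]


intention = Factor("programmers' intention that the AI avoid harm", valid=True)
typo = Factor("typo when the rule was entered by hand", valid=False)

contents = [
    GoalContent("avoid harming humans", [intention]),
    GoalContent("avoid harming humus", [intention, typo]),
]
print(flag_invalid_content(contents))  # ['avoid harming humus']
```

A real system would go on to ask what the valid causes (here, the programmers' intention) would have produced instead; this sketch stops at detection.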
6. Injunctions.
Human actions, and particularly our refusals to take certain actions, are
not always motivated by directly visible consequences. However, many
such cases can be translated into normative goal reasoning by reference to
consequences which are not directly visible. For example, a
small probability of a very large long-term negative payoff may injunct against
an action with visible short-term benefits. In cases where short-term
benefits may become AI-predictable before the long-term penalties are understood,
or other cases where negative outcomes may not be immediately visible to
the AI, the designers should provide programmer-affirmed information about
possible consequences. Note that this is normative goal-system content
and does not require special-purpose code.
Experimentation with injunctions will probably
be required in any AI project where the AI's actions can have negative consequences
not completely understood by the AI. We recommend (to Friendliness-aware
projects) the proactive use of specific injunctive content to prevent possible
negative outcomes.
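The injunction example above is ordinary expected-value arithmetic once the programmer-affirmed consequence is added to the AI's model. The payoffs and the 0.001 probability below are invented for illustration. Consistent with the text, the injunction is just one more content entry, not special-purpose code.

```python
def expected_value(outcomes):
    """Sum of probability * payoff over (probability, payoff) pairs."""
    return sum(p * v for p, v in outcomes)


# What the AI can predict on its own: a certain, modest short-term benefit.
visible = [(1.0, 10.0)]
# Programmer-affirmed content: a rare but very large long-term cost.
affirmed = [(0.001, -1_000_000.0)]

print(expected_value(visible))             # 10.0
print(expected_value(visible + affirmed))  # about -990: the injunction wins
```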
7. Self-modeling of fallibility.
A thought may be mistaken; the thought "X is green" does not have a 100% certain
Bayesian association with the actual greenness of X.
The same holds true for the thought "X is desirable". A goal system
may be mistaken under its own standards of normativeness; a probabilistic
goal system with reflection can imagine the possibility of an error.
Modeling of fallibility enables the current AI and the programmer to cooperate
against failures of Friendliness in future AIs; that is, the current AI will
estimate such cooperation to be desirable.
Modeling of ordinary fallibility is required, in general, by
AI projects investigating intelligence. We recommend (to
Friendliness-aware projects) the proactive use of programmer-assisted modeling
of fallibility, or programmer-affirmed knowledge about fallibility, to prevent
negative outcomes stemming from nonawareness of fallibility, or to enable
important conclusions and behaviors based on self-modeling of fallibility.
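One way to see why self-modeled fallibility makes cooperation desirable is a simple expected-loss comparison between acting alone and submitting a decision for programmer review. The formula and the numbers are illustrative assumptions, not a method from the Guidelines.

```python
def value_of_review(p_error, error_cost, review_cost, catch_rate):
    """Expected loss avoided by submitting a decision for programmer review.

    p_error:     the AI's estimate that its judgment "this is desirable" is mistaken
    error_cost:  loss if a mistaken decision is carried out
    review_cost: fixed cost of asking the programmers
    catch_rate:  fraction of errors the review is expected to catch
    """
    loss_alone = p_error * error_cost
    loss_reviewed = review_cost + p_error * (1 - catch_rate) * error_cost
    return loss_alone - loss_reviewed


# An AI that models itself as infallible never sees a reason to ask.
print(value_of_review(0.0, 1000.0, 1.0, 0.9))   # -1.0
# Modest self-assessed fallibility makes cooperation clearly worthwhile.
print(value_of_review(0.05, 1000.0, 1.0, 0.9))  # roughly 44
```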
8. Controlled ascent.
A self-improving system should have an "improvements counter" which increments
each time an improvement of a recognized type is made. This enables
detection if improvements begin occurring at a rate much faster than usual.
By measuring the rate of change of the improvements counter under normal conditions,
the programmers can designate some safe level of improvement which, if exceeded,
causes the system to halt and page the programmers and not continue until
approval is received.
Within a primitive system, a "controlled ascent"
feature can be implemented programmatically, using special-purpose code.
Since this is a very simple and inexpensive precaution, it should be taken
for any recursively self-improving system, no matter how primitive, on general
principles. (Recursive self-improvement should be distinguished from
learning systems that improve, but not self-improve.)
For general intelligences and self-understanding
AIs, a controlled ascent subgoal can be desirable because of a self-model
in which too much unsupervised self-improvement has a probability of leading
to Friendliness errors.
The purpose of a controlled ascent feature
is not to prevent an AI from "awakening", but rather to ensure that the process
occurs under human supervision, and can be slowed or paused to allow the
installation of further Friendship features if the project is unready.
Controlled ascent is strictly a temporary measure and is not viable as a
permanent policy.
We recommend the implementation of a programmatic
controlled ascent feature to any recursively self-improving AI where there
exists an obvious metric for the number of self-improvements made.
We recommend (to Friendliness-aware projects) the programmer affirmation
of a controlled ascent subgoal as soon as this cognitive content can be represented
by the AI.
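A programmatic controlled ascent feature might look roughly like the sketch below: a counter of recognized improvements plus a sliding-window rate check that halts and pages the programmers when a threshold is exceeded. The sliding window, the `clock` and `page` hooks, and the threshold values are assumptions; in practice the threshold would be set from the rate observed under normal conditions.

```python
import time
from collections import deque


class ControlledAscent:
    """Improvements counter with a rate limit that halts and pages the programmers (sketch)."""

    def __init__(self, max_per_window, window_seconds, clock=time.monotonic, page=print):
        self.max_per_window = max_per_window  # safe level chosen from the normal rate
        self.window_seconds = window_seconds
        self.clock = clock
        self.page = page
        self.count = 0          # total improvements of recognized types
        self._recent = deque()  # timestamps of improvements inside the current window
        self.halted = False

    def record_improvement(self):
        """Call once per recognized improvement; halts if the rate exceeds the safe level."""
        if self.halted:
            raise RuntimeError("halted: awaiting programmer approval")
        now = self.clock()
        self.count += 1
        self._recent.append(now)
        # Drop improvements that fall outside the sliding window.
        while now - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        if len(self._recent) > self.max_per_window:
            self.halted = True
            self.page(f"{len(self._recent)} improvements in {self.window_seconds}s; halting")

    def approve(self):
        """Programmer approval: resume with a fresh window."""
        self.halted = False
        self._recent.clear()


# Usage with a fake clock: four improvements in three seconds trips a limit of three per 10 s.
fake_time = [0.0]
alerts = []
ascent = ControlledAscent(max_per_window=3, window_seconds=10.0,
                          clock=lambda: fake_time[0], page=alerts.append)
for t in (0.0, 1.0, 2.0, 3.0):
    fake_time[0] = t
    ascent.record_improvement()
print(ascent.halted, alerts)
```

The improvement loop itself is expected to stop while `halted` is set; raising on further `record_improvement` calls is one simple way to enforce that.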
4: Conclusions
The present-day Internet contains an amount of networked computing power exceeding most estimates of the computing power of a single human brain. AI still appears to be very far off, and this distance is real, but the distance may consist of software rather than hardware, and software capabilities can improve very quickly. This is not a prediction that software capabilities will move quickly; only that they may. Because Friendly AI inherently implies the extremely advanced capability of making real-world plans in pursuit of goals, the present-day field of AI has been reluctant to discuss the topic at all. Speculation about the future of AI has been largely seen as the reserved subject of "popular" works. Trying to discuss extreme-sounding scenarios in most academic venues - or even daring to indicate that one takes the topic seriously - leads to a loss of status, as other scientists, eager to display their own restraint and conservatism, chime in with multiple choruses of criticism. Normally, this is not a problem; it is a useful part of the social process of science, which counteracts the tendency of individuals to focus on ideas that sound glamorous and exciting, to the detriment of the rational global distribution of effort. Friendly AI is admittedly exciting, perhaps even "glamorous", but it seems quite safe to say that Friendly AI would remain genuinely important and necessary even if the possibility of real-world AI utterly failed to excite the human psyche and the research itself were as boring as grade-school arithmetic. So, in this case, overcompensation has occurred. Friendly AI is important, and any opportunity to develop the theory further should be taken as soon as it becomes available, rather than deferred because "It isn't necessary yet" or "AI is a long way off". Such statements may be true, but they are not knowably true. As an attitude, waiting is simply the wrong way to approach a public safety issue.
To say that the rational distribution of effort has failed is to implicitly state that useful present-day efforts can be made in Friendly AI. We believe that it is both possible, and desirable, to build a firm theoretical foundation for future Friendly AI efforts before Friendly AI becomes immediately necessary, and that much of this work can be advanced by present-day investigations. These present-day investigations have not been undertaken because the field of Artificial Intelligence is averse to the detailed visualization of extreme possibilities, which visualization is necessary in order to understand which present-day investigations could be useful. It is a legitimate topic of cognitive science to examine, for example, "Qualitatively different behaviors exhibited by probabilistic and non-probabilistic goal systems under reflection." But to realize that this investigation is worthwhile - that it is relevant to Friendly AI - requires extended thinking about self-improving reflective Artificial Intelligence, a presently nonexistent technology. We recommend that Friendly AI be regarded as a legitimate area of research by the field of Artificial Intelligence. An AI project implementing a Friendly AI feature should be praised for making progress in Friendly AI and for setting a good safety precedent, rather than being accused of implicit unconservatism. We recommend that any AI project stating the future goals of general intelligence and self-improvement regard itself as having a responsibility to be Friendliness-aware. Because Friendly AI is a frontier research problem as well as a public safety issue, it is impossible to legislate or centrally regulate the design of Friendly AI. We recommend that centrally imposed requirements, if any, take the form of an obligation to produce a coherent policy or report on efforts made, rather than an obligation to implement specific features. 
(We believe any attempt to formalize responsibilities is extremely premature given the current state of Friendly AI, but this may change in the future.) We reaffirm the feature recommendations made in the Design section of these Guidelines. We recommend that all desirable Friendly AI features be implemented as soon as they become possible. Failing that, we recommend that all easy Friendly AI content be implemented as soon as possible, and that all architectural features be implemented at least one stage of development in advance of when they are expected to become necessary. In the future, we envision that nonprofit organizations such as the Singularity Institute may employ Friendship programmers who are loaned out, on a part-time or full-time basis, to AI projects that require Friendship-awareness but have no resources, or inadequate resources, allocated to Friendly AI. We recommend this as an important public safety measure and a legitimate future use of philanthropic or charitable resources. We recommend that all AI projects that obtain concrete experience in problems of Friendly AI publish their results, or as much of those results as can possibly be published without revealing trade secrets. There is no conceivable benefit from withholding Friendly AI results that outweighs the benefit of making the information public. We recommend that relevance to a Friendly AI problem be regarded as a legitimate affirmation of the importance of an issue in cognitive science, and that such issues be investigated as part of an effort to build a theoretical foundation for Friendly AI in advance of when it is needed. We recommend that theoretical efforts to extrapolate a foundation for Friendly AI focus on architectural issues.