A generic goal system is one that makes generic mistakes. There are more complex mistakes that lie uniquely in the domain of Friendly AI, but a generic mistake (one that could be made by any mind-in-general) can also result in a failure of Friendliness.
A designer focusing on the general-intelligence aspect of a generic goal system is concerned about speed, creativity, finding short-cuts, seeing the completely unexpected solution, and (to a less glamorous extent) the routine tasks of making predictive analogies between plans, learning from mistakes, forming useful abstractions, and so on - maximizing success. A designer focusing on the Friendship aspect of a generic goal system considers cognitive complexity that prevents mistakes, or considers the design task of preventing some specific failure of Friendliness.
A seed AI programmer focused on building an excellent goal system would be very wary of special-case content (code, concepts, heuristics, et cetera) added to solve a specific problem; using special cases creates the illusion of performance without the substance. Truly trustworthy solutions are invented by the AI verself, so that an improvement in the AI can translate into an improvement in the solution. An assist from the programmer with respect to a specific problem is at best an interim solution, or skeleton functionality that the AI will flesh out later. At the worst, an assist from the programmer is a special-case crutch which prevents the AI from discovering the general case.
Sometimes the programmer just wants the AI to do something right today, taking for granted that the assist will be absorbed into the system sometime later on. (The simplest case is an affirmation being independently confirmed as a belief.) And sometimes a Friendship programmer will want the AI to really genuinely understand something as soon as possible, with programmer-assisted performance being unacceptable even as an interim solution. I suspect, however, that Friendship design will make much freer use of intervention, especially if the programmers are perfectionists - striving for not one single instance of the error, even if the error is recoverable, easily detectable, and the AI is young and powerless when the error occurs. (Of course, this attitude may deprive the AI of a necessary learning experience, but see 3.3.4: Wisdom tournaments for a safe method of obtaining learning experiences.)
And yet, by the very nature of cognition, there can be no simple formula for preventing mistakes. Preventing mistakes is as deep a task as intelligence itself. An AI may be safer for a programmer intervention, but safe is out of the question until the AI has enough general intelligence to reliably avoid dumb mistakes in general. (Not Thou Shalt Not... I Will Not...) In the course of developing a general intelligence, programmers will encounter problems and deal with them. Those anxious about the lack of a foolproof formula may have to be satisfied with abstract arguments that programmers will successfully handle "that kind of problem" in the course of developing AI.
(You may wish to review An Introduction to Goal Systems.)
During the golden age of classical AI, the archetypal goal system was a supergoal and a plan consisting of chained, hierarchical subgoals, with the canonical example being Terry Winograd's SHRDLU. Obvious extensions that still remain within the classical-AI paradigm include:
Naturally, this neat picture must be messed up considerably before a functioning mind can be built. (GISAI 1.2: Thinking About AI, the Law of Pragmatism: "Any form of cognition which can be mathematically formalized, or which has a provably correct implementation, is too simple to contribute materially to intelligence.") Some of the messier and more useful extensions to the classical model would include:
Treating subgoals as cognitive objects with limited relevance horizons - as local questions - allows a mind to build up a plan incrementally by the use of local, comparatively inexpensive thought processes, and to make local changes in response to local events. Computing power - thinking time - is a limited resource, and choosing to invest computational resources in refining a plan is itself an action that can be desirable or undesirable, depending on the expected payoff in faster goal achievement or prevention of possible errors. (But the optimal tradeoff for a human may not be optimal for an AI!)
As noted in 3.1.1: Cleanly causal goal systems, a reflective goal system should probably consider normative goal cognition to be defined by the mathematical formalism, unless the supergoals themselves are incompatible with that structure. A seed AI would consider the actual, complex goal system to be a design approximation to this normative standard.
One of the ways to deal with limited computing power is to have different computational horizons for suggesting plan material versus verifying a plan. (For those of you who've read all the way through GISAI, this is an RNUI-type distinction between understanding a plan and inventing it.) For example, it seems like a sensible precaution (1) to reverify the complete, global, subgoal-to-supergoal pathway before taking an action - if this can be done without noticeable expenditure of computing power. One method to conserve computing power would be to make the heuristics that suggest plans local, but run a global verification before actually executing the action.
A Friendship programmer, writing a generic goal system, focuses on preventing mistakes by enabling the AI to recognize mistakes. To recognize a mistake, the AI needs knowledge, adequate predictive horizons, and understanding of which actions need checking. Assuming a Friendly AI recognizes that physical damage to a human is bad (2), then, to avoid dropping a refrigerator on a human, the AI needs to know that heavy objects fall and that humans can be damaged by high-speed impacts with blunt objects; the AI needs enough computational power and intelligence to see the logic and predict the consequences of dropping the refrigerator; and finally, the AI needs to realize that dropping a refrigerator is an action which requires more checking of consequences - a wider predictive horizon - than, say, opening a bottle of Coke.
Humans seem to do very well at recognizing the need to check for global consequences by perceiving local features of an action. Whether dropping the refrigerator out the third-story window will actually harm anyone can be resolved by scanning the sidewalk for possible victims, but any human instinctively knows that dropping the refrigerator is an action with the potential for negative, irreversible consequences, as opposed to opening a bottle of Coke or rearranging books on a shelf. It doesn't matter how Friendly the AI is, or how much ve knows about gravity and blunt objects and human biology - if the action of dropping the refrigerator gets tagged with a tiny computational horizon, there won't be enough mindpower to notice even the most obvious consequences.
At this point, the AI's underlying cognitive architecture may work against the AI. Humans generally take all dangerous actions as conscious decisions, and humans are also excellent recognizers and perceivers. One serial stream of consciousness, operating at the rate of one thought per second, running on 10^14 x 200 Hz synapses with neural-network characteristics, is likely to regard the massive act of perceiving and recognizing as computationally trivial - especially if there's dedicated brainware lying around which can't be used for anything else. An AI capable of forking multiple streams of consciousness, operating at a maximum speed of thousands of (admittedly uninteresting) thoughts per second, running on 32 x 2 GHz CPUs not optimized for pattern association, without dedicated (i.e. non-reconfigurable) perceptual hardware, could easily fall short of human performance on recognizing potentially dangerous actions by multiple orders of magnitude. I do think the functionality is probably duplicable by Eurisko-style optimization of local heuristics, but I could be wrong.
One way to conserve the power expended by large predictive horizons is to use local heuristics for inventing a plan - walking through a search space, scanning through a global store of heuristics, checking against past experience, and all sorts of other computationally intensive cognition, repeatedly applied to possible alternatives. (Checking for local problems is likely to be part of the invention process.) When the plan is complete, a single check can be performed for global problems, starting with the major-action check - the perceptual predicate that checks if this is an action that needs a large predictive horizon - followed by the disastrous-consequences check. Running these checks on the final plan to drop the refrigerator should take much less computing power than running a check on each of the possible alternatives. This is not a foolproof method, especially if the concern is emergent biases; not thinking about X when you form a plan is a defect that is only partially corrected by thinking about X after the plan is finished. Think of that strategy as an interim solution while waiting for coherent general intelligence.
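The invent-locally, verify-globally strategy can be sketched in a few lines. This is only an illustration under loud assumptions: the `Action` fields, the mass threshold, and the `predict_harm` stand-in are all hypothetical, and a real major-action check would be a learned perceptual predicate rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    # All fields are hypothetical local features, chosen for illustration.
    name: str
    mass_kg: float
    minutes: float
    risks_irreversible: bool = False


Plan = List[Action]


def invent_plan(candidates: List[Plan], local_score: Callable[[Plan], float]) -> Plan:
    """Plan invention: compare alternatives using only a cheap local heuristic."""
    return max(candidates, key=local_score)


def is_major_action(action: Action, mass_threshold_kg: float = 20.0) -> bool:
    """Major-action check: a cheap perceptual predicate over local features."""
    return action.risks_irreversible or action.mass_kg >= mass_threshold_kg


def verify_plan(plan: Plan, predict_harm: Callable[[Action], bool]) -> Plan:
    """Global check, run once on the finished plan.

    Only actions that pass the major-action check receive the expensive
    disastrous-consequences prediction.
    """
    return [a for a in plan if is_major_action(a) and predict_harm(a)]


wide_horizon_calls: List[str] = []  # records which actions got the expensive check


def predict_harm(action: Action) -> bool:
    """Stand-in for an expensive, wide-horizon consequence prediction."""
    wide_horizon_calls.append(action.name)
    return action.name == "drop refrigerator out window"


carry = [Action("carry refrigerator downstairs", mass_kg=100.0, minutes=30.0)]
drop = [
    Action("open window", mass_kg=1.0, minutes=1.0),
    Action("drop refrigerator out window", mass_kg=100.0, minutes=1.0,
           risks_irreversible=True),
]

# The local heuristic only sees "faster is better", so it picks the drop plan.
chosen = invent_plan([carry, drop], local_score=lambda p: -sum(a.minutes for a in p))
flagged = verify_plan(chosen, predict_harm)
print([a.name for a in flagged])
```

In this toy run the expensive prediction is spent on a single action of the finished plan rather than on every alternative considered during invention. The sketch inherits the weakness described above: nothing in `invent_plan` thinks about harm, so the check afterwards only partially corrects for that.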
Layers of mistake detection described so far:
Decisions that are made thousands of times per second, even if made within the context of the goal system, will necessarily have very small predictive horizons. Could a mistake that would be detected if it appeared as a high-level major action pass unnoticed if split up into the results of a thousand little actions? Could an AI accidentally clear bit five throughout memory, not because the AI decided to "clear bit five throughout memory", but because a million little decisions cleared a million 64-kilobyte blocks of RAM? Some of the perceptual heuristics that determine predictive horizons are likely to be the ones that check how much stuff is affected - a heuristic that would notice if a gigabyte of memory were affected, or a heuristic that would notice if a hundred-kilo refrigerator were dropped out a third-story window. If a global effect is split up into lots of little actions - it's not clear how this would happen, since this in itself constitutes a mistake - both magnitude-dependent heuristics and heuristics that checked global shapes would fail to operate. If the predictive horizon is small enough, the checker heuristics may not even get an opportunity to operate.
Intuitively, one of the forces governing real-world Friendliness is that an AI needs to be intelligent to present a significant threat to humanity. Autonomic blindness would result either from limited computing power or from a fundamental architectural flaw. My personal estimate is that no AI with the capacity to harm a single human, much less humanity, will undergo a failure of Friendliness (FoF) from autonomic blindness - I would expect this problem to be solved, and oversolved, almost automatically. However, this is not the conservative assumption, regardless of my personal estimate.
Autonomic blindness is not a problem for known neural intelligences, except in the limited sense of a human undergoing an epileptic fit. A human possesses a limited number of motor neurons which need to behave in a coordinated fashion to accomplish anything more dangerous than flopping around on the floor - a human can't take a thousand actions simultaneously. If we could take a thousand actions simultaneously, one expects that the synchrony of a thousand neural chunks doing the same thing would light up the brain - would easily arise to conscious attention.
The real solution to autonomic blindness is a seed AI that has grown into verself, that understands the shape of vis self, and that won't take a large action without automatically being aware of it - an AI whose self-created cognitive processes are organic and interlaced. If autonomic blindness becomes a problem before then - I do not expect that it will - possible solutions might be programmer-created code that attempts to create the same kind of awareness; i.e., processes that notice large unexpected synchronies among low-level decisions, perhaps using Monte Carlo random sampling to approximate real global awareness. Certain types of motor actions or internal actions could be automatically checked with a greater predictive horizon, although this might expend a lot of computing power. If an AI somehow reaches the point of real-world capability without "learning the shape of vis self", real-world actions might be required to pass through a humanlike linear gateway for conscious inspection, although this might seriously hamper the AI. If an AI somehow reaches the point of executing on nanocomputing hardware and still hasn't "gotten it" - which verges on the impossible, but "that's not the conservative assumption" - I would suggest (1) brute-forcing huge predictive horizons for everything no matter how trivial, and (2) serializing real-world actions.
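As a rough illustration of the Monte Carlo idea, the sketch below samples a log of low-level decisions and reports any single effect that dominates the sample. Everything here is an assumption made for the example: decisions are represented as plain string labels, and the function name, sample size, and 50% threshold are arbitrary choices, not part of any proposed design.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def detect_synchrony(
    decisions: List[str],
    sample_size: int = 1000,
    threshold: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Estimate, from a random sample, which effects dominate recent decisions.

    Returns a mapping from effect label to its sampled share, restricted to
    effects whose share meets the threshold. The cost depends on sample_size,
    not on how many low-level decisions were made.
    """
    rng = rng or random.Random()
    sample = rng.sample(decisions, min(sample_size, len(decisions)))
    counts = Counter(sample)
    return {
        effect: n / len(sample)
        for effect, n in counts.items()
        if n / len(sample) >= threshold
    }


rng = random.Random(0)  # seeded only to make the example repeatable

# Ordinary workload: 100,000 small decisions spread over 500 different effects.
normal = [f"write block {i % 500}" for i in range(100_000)]

# Anomalous workload: 90% of the small decisions happen to do the same thing.
anomalous = ["clear bit five"] * 90_000 + normal[:10_000]

print(detect_synchrony(normal, rng=rng))
print(detect_synchrony(anomalous, rng=rng))
```

No individual decision in the anomalous log looks large, but the sampled share of "clear bit five" does. A sampler like this only approximates global awareness; it would miss a synchrony whose low-level decisions do not share a recognizable label.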
A "non-malicious" mistake is one that doesn't involve any complications unique to Friendliness. A non-malicious mistake is a straightforward failure by the goal system to achieve its own supergoals, rather than a mutation of supergoal definitions or misunderstanding of supergoal definitions and so on.
Errors resulting from too-limited computational horizons - incomplete change propagation, inadequate predictive horizons, insufficient understanding to predict side effects, simple factual errors, and so on - are not limited to Friendliness. They apply equally to all goal systems and all aspects of the goal system; they apply to putting on your shoes before your socks at least as much as any exotic failure of Friendliness. These are problems that sentient minds in general just have to learn to deal with - perceive, react to, and correct. A general-intelligence seed AI is not a formal system that breaks down if a single error occurs; a general intelligence, as described in GISAI, is a thought-based system - one where the programmer (or the AI, depending on development stage) can sit back and ponder what the thought-level reaction should be. The discovery of two pieces of knowledge that contradict each other will cause a mathematical system to undergo Spontaneous Massive Existence Failure; a human will simply shift focus of attention to deal with the problem.
As failures of Friendliness go, non-malicious mistakes are some of the least terrifying. They are the least terrifying at any given moment, in the long run, and in the nervous eyes of an observer.
At any given moment, a non-malicious mistake decreases the chance that the mistake will recur. The AI will learn from vis mistake - recognize the undesirability of the consequences, trace back the cognitive source of the problem, and make alterations that prevent the problem - or rather, the generalization of the problem - from recurring. If a catastrophic failure of Friendliness is one that causes the AI to stop wanting to be Friendly, then a non-malicious mistake is literally anti-catastrophic.
In the long run, mistakes (both FoF mistakes and the more mundane kind) become less likely with increasing intelligence. Since the ability of the AI to cause real damage is presumably linked to intelligence, it's quite possible that by the time the AI has human-equivalent or greater intelligence, large-scale mistakes will no longer be an issue. Intuitively, it seems obvious that a superintelligent AI will not be making the kind of blindingly obvious mistakes that are usually raised in disaster scenarios, and I personally believe that strong transhumanity is an inevitable consequence of pouring enough processing power into any halfway decent general intelligence.
But that is not... you guessed it... the "conservative" assumption. In the case where an AI has the ability to do real-world damage but not the intelligence to avoid shooting verself in the foot, the next best course of action is for the AI not to take any potentially dangerous real-world actions. Humans live in a society of other competing humans. Other humans have abilities roughly at balance with our own, so we need to routinely take risks just to compete with other risktaking humans. AIs can choose to be much more risk-averse. A seed AI can choose to delay action for a few hours or weeks or years until software or hardware improvement catches up with the problem.
From the perspective of a nervous observer, non-malicious mistakes occur in logical order. You'd expect to encounter a hundred innocent mistakes before encountering a mistake that constituted a failure of Friendliness. You'd expect to encounter several dozen nonrecoverable core dumps (infinite recursions, memory storage breakdowns, clear bit five throughout memory, random damage hither and yon) before encountering a mistake that caused a catastrophic failure of Friendliness. And you'd expect several dozen blatantly obvious catastrophic FoFs before encountering a catastrophic FoF that passed unnoticed in the source code. Thus, if no non-malicious mistake has ever been observed to cause catastrophic failure of Friendliness, the watchers can probably be fairly confident that none has ever occurred. (3).
Of course, we can also reverse all these reassurances to get the Nightmare Scenario for non-malicious mistakes: An infant self-modifying AI makes some innocent error that stomps the whole goal system flat, in a way that passes undetected by the human observers, that pops up before a seed AI is intelligent enough to scan vis own source and past actions for mistakes, but which pops up after the AI is smart enough to conceal the problem from the programmers. And in turn, we can reverse the Nightmare Scenario to find ways of preventing it: Make the AI smarter and less likely to make mistakes, make it harder for one error to stomp the whole goal system, come up with better ways of detecting errors as they occur, improve the seed AI's abilities to scan vis own Friendliness source, or work with the current AI to make it harder for a future rogue AI to conceal the problem from the programmers. See also 3.3.7: FAI hardware: The flight recorder and 3.3.3.1: Cooperative safeguards.
| DEFN: | Injunction: A planning heuristic which has at least partially nonlocal support, or a planning heuristic which, where it applies, applies with a great deal of context-insensitivity. The archetypal case would be a heuristic which is supposed to be applied even when the straightforward interpretation of the world-model suggests otherwise, generally (in AIs) due to unknown unknowns or (in humans) to compensate for framing effects or (for both) to save computing power. |
In one example (Tversky and Kahneman 1986), respondents are asked to assume themselves to be $300 richer and are then asked to choose between a sure gain of $100 or an equal chance to win $200 or nothing. Alternatively, they are asked to assume themselves to be $500 richer, and made to choose between a sure loss of $100 and an equal chance to lose $200 or nothing. In accord with the properties described above, most subjects choosing between gains are risk averse and prefer the certain $100 gain, whereas most subjects choosing between losses are risk seeking, preferring the risky prospect over the sure $100 loss. The two problems, however, are essentially identical [...] This is known as a framing effect. It occurs when alternative framings of what is essentially the same decision problem give rise to predictably different choices.

Human goal psychology contains a number of known framing effects; that is, cases where the preferred solution depends on how the problem is stated. Human psychology is also context-sensitive in a stranger way; whether you decide to eat a cookie can depend on whether you're considering the problem abstractly or whether you're in the presence of an actual cookie. The conflict between belief and instinct means that making decisions can expend "mental energy" as well as computing time. All this local bias has created in us an intuitive understanding of how to use nonlocal heuristics - injunctions - to compensate.

Research in decision-making has uncovered psychological principles that account for empirical findings that are counterintuitive and incompatible with normative analyses. People do not always have well-ordered preferences: instead, they approach decisions as problems that need to be solved, and construct preferences that are heavily influenced by the nature and the context of decision.
-- MITECS, "Decision Making". (4).
"Leave margin for error"; always plan on arriving early at the airport, even if you aren't particularly expecting anything to go wrong. The heuristic-as-a-whole is supposed to pay off over time, not in each individual case. The heuristic can thus be viewed as having nonlocal support. The effect of adopting the heuristic, as a modification to the general strategy, is considered once, rather than re-evaluated for individual cases. In fact, the heuristic may even be applied in defiance of the straightforward interpretation of local cases ("wasting" an extra thirty minutes at the airport, a negative outcome when considered in isolation).
However, this viewpoint is anthropomorphic. Arriving at the airport early can be viewed as a strictly local solution to a probabilistic problem with a sharp payoff discontinuity. Arriving five minutes too late results in a very large penalty (ticket wasted, planned day wasted) compared to the small penalty of arriving five minutes too early (five minutes wasted). Combined with the number of possible intervening factors that skew the probability curve for arrival time - traffic jams, missed trains, forgotten wallets, and unknown unknowns - planning to arrive early is a decision that maximizes the total probabilistic payoff. With an AI's fast/serial/threaded thought, ve may be able to recapitulate, or at least validity-check, the abstract reasons for adopting the "leave margin for error" strategy, before applying it to any individual macroscopic decision. At a higher level of awareness, the AI could make minor adjustments based on traffic conditions (or other local characteristics), though still leaving enough margin to handle unknown unknowns.
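The "strictly local" reading of the airport case can be made concrete with a small expected-cost calculation. The numbers below are invented purely for illustration (the delay distribution, the 500-minute-equivalent penalty for a missed flight, and the one-unit-per-minute waiting cost are all assumptions); the point is only the shape of the result.

```python
from typing import List, Tuple


def expected_cost(
    margin_min: int,
    delay_dist: List[Tuple[int, float]],
    miss_penalty: float,
    wait_cost_per_min: float,
) -> float:
    """Expected cost of leaving `margin_min` minutes of slack.

    delay_dist is a list of (delay in minutes, probability) pairs.
    A delay larger than the margin means a missed flight (sharp penalty);
    otherwise the unused slack is paid for as waiting time.
    """
    total = 0.0
    for delay, prob in delay_dist:
        if delay > margin_min:
            total += prob * miss_penalty
        else:
            total += prob * wait_cost_per_min * (margin_min - delay)
    return total


# Hypothetical figures: most trips have no delay, a few have long ones.
delay_dist = [(0, 0.70), (15, 0.20), (45, 0.08), (90, 0.02)]
miss_penalty = 500.0      # ticket wasted, planned day wasted
wait_cost_per_min = 1.0   # cost of a minute spent waiting at the gate

margins = range(0, 121, 15)
best_margin = min(
    margins,
    key=lambda m: expected_cost(m, delay_dist, miss_penalty, wait_cost_per_min),
)
print(best_margin)
```

Under these assumed figures the most likely delay is zero, yet the margin with the lowest expected cost is 45 minutes. An AI that can afford to rerun a calculation of this kind for each trip gets the "leave margin for error" behavior as an ordinary local decision, and can shift `delay_dist` when traffic conditions change.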
We can't do that, and not just because we don't have the patience. For humans, violating the nonlocal character of a heuristic is like puncturing a bubble; our psychology makes it a very bad idea to decide how much margin for error is necessary while we're being tempted to spend just five more minutes checking email. The decisions we make are the interlacing of our declarative beliefs and our emotional systems, and our emotions grow in strength with increased proximity to an object of short-term desire. The mantra "I've made this decision in advance" may be strong enough to overcome that bias; trying to recalculate exactly how much time to allow for traffic, with an apple dangling in front of you, is likely to end in a missed flight. We adopt the think-once-apply-many strategy not just to save computing power, but to control the emotional context of our decisions.
A dieting human may decide to eat a cookie "just once, since it's only 50 calories", but may then renounce this decision upon realizing that being the kind of person who would eat a cookie in that context may result in a substantial weight gain. (In Bayesian terms, merely pronouncing the phrase "just once" in a dieting context causes you to gain forty pounds (5).) The decision of the moment will recur; the outcome can be used to predict future decisions; and the choice to choose differently may alter future decisions as well (6). The negative payoff for the general case may not be 50 calories but 5000 calories, and a human who chooses not to eat the cookie is thus acting on a planning heuristic with nonlocal support. And yet, presumably the penalty for 50 calories is 1/100th of the penalty for 5000 calories, and the payoff of eating a cookie is 1/100th the total payoff of eating a cookie on 100 occasions, so the decision for the individual case should logically be the same as the general case. Reasons for the counterintuitive distinction may result from the declarative/emotional balance, or from nonlinear scaling in willpower costs, or a qualitative (not just quantitative) difference in how large payoffs and small payoffs are processed, or a qualitative difference between considering a once-off case and considering a recurring case, or all of the above.
The point of the story, I suppose, is that humans do a lot of weird things.
Cases where humans use injunctions:
The term "injunction" for a nonlocal planning heuristic - in a document about "Friendliness" - reflects one of the most common proposals for their use: As artificial ingredients, added to a system to prevent a specific outcome whose apparent possibility generates anxiety in humans. This approach is not completely without value when seen as a programmer assist in recognizing certain actions as likely to lead to unFriendly outcomes; it starts getting tangled when the proposals begin to impose human anthropomorphisms on the problem.
AI injunctions don't have the unique psychological properties of human injunctions, and no amount of layering injunctions on injunctions will make it so. Making an AI injunction "unconditional", a common turn of phrase, doesn't make the injunction qualitatively stronger, as it would for a human; it just makes the injunction context-insensitive and not one bit stronger. No amount of unconditionality will change that. Neither will layering injunctions one on top of the other, or creating multiple overlapping injunctions, or writing injunctions to protect injunctions. One way of looking at it is that every aspect of Friendship is maximally strong, but the deeper truth is that the idea of thoughts being "strong" is itself anthropomorphic. In the human ontology (see GISAI 3.1: Time and Linearity), there are trajectories and pushes. Resistance is the effect that opposes pushes. Force is the ability to overcome resistance. Seeing force gives us a psychological high. But an AI is somewhere outside that entire dualism. Friendship is functional complexity; it can be accurate or inaccurate, low-priority or high-priority, but no one part is stronger than any other part. There is no resistance or force to overcome it, just functional complexity. Friendship is Friendship. An AI doesn't have injunctions with the force to override vis desire to eat the cookie because the AI doesn't want to eat the cookie in the first place; the injunctions are just there to tell ver about the calories. An AI may need to understand this aspect of human psychology so ve can learn about the sources of Friendliness, but it's not a thought pattern that can, or should, be duplicated in the AI.
An "injunction" is a nonlocal planning heuristic. A programmer-assisted injunction is a nonlocal planning heuristic used to tag actions whose consequences a young AI might otherwise be incapable of predicting, or whose unFriendliness the AI's definition of Friendliness may not yet be advanced enough to define. That's all. This will probably cause a certain amount of anxiety among humans who instinctively fear a psychology that doesn't need "forceful" injunctions, but the only way to get rid of that anxiety is an extended sympathetic ability that covers AIs.
AI injunctions make sense under these circumstances:
| DEFN: | Ethical injunction: An injunction that has no preconditions for violation because the probability of mistaken violation is greater than the probability of correct violation. In humans, this is almost always due to known biases in cognition, and works because of the greater psychological strength of unconditional statements. In AIs, the need for an ethical injunction is based on the AI's possible stupidity or structural incompleteness. |
Human honor, by its nature, needs to be cognitively represented as absolute. (7). There are too many points in our psychology where unconditional relations are qualitatively different from conditional ones; unconditional love, unconditional friendship, keeping to a sworn word, not compromising your principles. The simple fact that a commitment is represented as conditional - that there is a visualized set of circumstances which would lead to disabling the commitment - drains away at least half the psychological strength straight off. This property results from the interface between declarative beliefs and the instinct-based secondary goal system; an unconditional belief with adequate psychological support can be used to make decisions directly, without reference to other issues and emotional effects. It's one of the reasons why swearing to do something difficult increases your ability to follow through on it. If you believe in the certainty of the oath, it enables your reflexive mind to translate "knowing what your decision will be" into the decision. Seeing a single option avoids any emotional sequiturs that would trigger on a branching view of the possibilities. Somewhere in the human brain is a chunk of neural hardware that binds to a "nonbranching node" and not to a "branching node", or the human-causal-visualization equivalents thereof. It's not just a quantitative difference, it's a difference in system behavior.
The same applies for social interactions - when modeling someone else's mind, we admire absolute principles more than principles which are modeled as having escape clauses. In the human mind, 99.9% is nowhere near 100% - it's why evil people can annoy scientists by asking "Are you completely certain?" in front of a nonscientist audience. There is no way to assign a 99.9% probability and get 99.9% of the emotional impact of certainty. There is thus a very strong memetic pressure in favor of absolutism, and social sanctions that result in a selection pressure for absolutism.
Ethics adds another pressure for absolutism. The chain of events probably goes something like this:
What we really tend to worry about, instinctively, is a flawed idealist advancing "personal" power. Personal power is power that can be used irrespective of ends, context-insensitively. Gandhi and Martin Luther King had context-sensitive power; Rockefeller had context-insensitive power. Regardless of intentions, some or all power will tend to be personal power, but a human has an evolved tendency to preferentially accumulate context-insensitive power - even accumulate power at the expense of context, of the claimed altruistic goal. This is perhaps not what people model themselves as doing, but it is what they do in the models of others, and in reality as well, largely because "flawed altruists" overestimate their own competence or importance or altruism. No matter how "natural" this seems to us, it is a strictly evolved bias.
When I draw a picture of an AI, what I'm trying to convey is the feel of a personality that's genuinely built around the supergoals rather than the self. A Friendly AI will seek both context-insensitive effectiveness and context-sensitive effectiveness, but never context-destructive power. The AI would perceive no point in it. No temptation would exist.
(8).
Cases where humans use "unbreakable" ethical injunctions:
Ethical injunction: "An injunction that has no preconditions for violation because the probability of mistaken violation is greater than the probability of correct violation."
| NOTE: | At minimum, this requires an AI that can conceive of being "mistaken" (see 3.4.1: External reference semantics), and ideally requires an AI that has clearly defined metrics for which hypothesized future selves ve "identifies" with (see 3.4.3: Causal validity semantics). For more about the prerequisite design requirements, see 3.3.3: Unity of will. |
The absoluteness of "The end does not justify the means" is the result of the Bayesian Probability Theorem applied to internal cognitive events. Given the cognitive event of a human thinking that the end justifies the means, what is the probability that the end actually does justify the means? Far, far less than 100%, historically speaking. Even the cognitive event "I'm a special case for [reason X] and am therefore capable of safely reasoning that the end justifies the means" is, historically speaking, often dissociated from external reality. The rate of hits and misses is not due to the operation of ordinary rationality, but to an evolutionary bias towards self-overestimation. There's no Bayesian binding between our subjective experience of feeling justified and the external event of actually being justified, so our subjective experience cannot license actions that would be dependent on being actually justified.
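As a rough illustration of the "no Bayesian binding" point, the sketch below applies Bayes' theorem to the cognitive event of feeling justified. The function name and every number in it are hypothetical, chosen only to show the shape of the argument; none of them come from historical data.

```python
def posterior_justified(prior, p_feel_given_justified, p_feel_given_unjustified):
    """Return P(actually justified | feels justified) by Bayes' theorem."""
    evidence = (p_feel_given_justified * prior
                + p_feel_given_unjustified * (1.0 - prior))
    return p_feel_given_justified * prior / evidence


# Hypothetical prior: the end really does justify the means in 1% of cases.
PRIOR = 0.01

# A mind biased towards self-overestimation feels justified almost as often
# when it is wrong as when it is right, so the feeling carries little evidence.
biased = posterior_justified(PRIOR, 0.95, 0.90)

# A hypothetical mind whose feeling of justification tracks reality.
calibrated = posterior_justified(PRIOR, 0.95, 0.01)

print(f"biased mind:     {biased:.3f}")
print(f"calibrated mind: {calibrated:.3f}")
```

When the feeling is about equally likely whether or not the action is justified, the posterior stays close to the prior; the feeling only becomes usable evidence when the two likelihoods differ substantially.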
If you count the paranoid schizophrenics who decide they're God and then gun down a K-Mart, even the subjective experience of thinking you're a superintelligence is - apparently - still not enough to justify gambling on an "end justifies means" equation. In itself this is absurd (at least, I hope you find it absurd by this point in FAI), but it does illustrate the fundamental paradox of ethical injunctions.
A good example of an ethical injunction would be "Never try to conceal your actions or cognitive state from your human programmers." (This injunction is strictly a subgoal - it has nothing to do with intrinsic desirability or undesirability.) When the AI is young, the heuristic is a valid ethical injunction - the programmers know far more about Friendliness than the AI. If the young AI projects that the programmers, on finding out about the AI's thought, will panic and rewrite the AI, then the programmers are probably right to panic, and the AI in question must have undergone catastrophic failure of Friendliness to regard detection by the programmers as undesirable. However, the injunction of transparency can't possibly carry over into strong transhumanity; I can easily think of circumstances under which I'd want a strongly transhuman AI to conceal vis actions from me. Even if the AI doesn't actually find it necessary to conceal vis actions, the AI should still be able to think about it subjunctively - a strongly transhuman AI deciding "Yes, I would hide my actions under these circumstances" doesn't indicate failure of Friendliness.
What remains is the paradox; a sufficiently intelligent AI must be able to see when an ethical injunction no longer holds, and yet, for a young AI, the experience of deciding that the injunction should no longer hold is not sufficient cause to discard the injunction. Furthermore, a young AI is too young to know whether ve is "young" or "sufficiently intelligent", since this itself is an ethical injunction - if a young AI decides that ve's "sufficiently intelligent" it's probably due to FoF. A young AI thinking "I am 'sufficiently intelligent' and I think this ethical injunction no longer holds" has probably undergone FoF, like a human thinking "I'm far too altruistic to need to worry about being corrupted by power." So what happens when the AI is sufficiently intelligent? One interesting approach would be to create an injunction with no escape hatches, rev the AI up to transhumanity, and see what happens (9), but I was hoping for a less "seat of the pants" approach.
Probably no paradox is required, since an AI that has undergone catastrophic FoF probably doesn't care what injunctions you put there, and since still-Friendly AIs don't have the human tendency towards overestimating their own competence. A conditional ethical heuristic that says "You can ignore this injunction when you have transhuman intelligence plus a healthy margin of error" is probably safe - unlike a human, an AI can't be tempted to overestimate vis intelligence no matter how inconvenient the injunction. The two are orthogonal - even if it appears extremely inconvenient for the AI to expose vis actions to the programmers, this won't be experienced as a cognitive pressure to overestimate intelligence (10). Similarly, a young AI making an honest self-estimate of intelligence might overestimate a little, but not enough to defeat the "margin of error" requirement. (A dishonest estimate means catastrophic FoF has already occurred.)
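A minimal sketch of the conditional heuristic described above, assuming an honest self-estimate is available as a number with an uncertainty. The threshold, the margin, the units, and all names are hypothetical placeholders, not part of any specified design.

```python
from dataclasses import dataclass

# Hypothetical constants, in arbitrary units of "intelligence".
TRANSHUMAN_THRESHOLD = 100.0
SAFETY_MARGIN = 50.0


@dataclass(frozen=True)
class SelfEstimate:
    intelligence: float  # honest point estimate
    uncertainty: float   # how far the estimate might plausibly be too high


def injunction_binds(estimate: SelfEstimate) -> bool:
    """The injunction holds unless even the pessimistic end of the
    self-estimate clears the threshold plus the margin of error."""
    lower_bound = estimate.intelligence - estimate.uncertainty
    return lower_bound < TRANSHUMAN_THRESHOLD + SAFETY_MARGIN


def may_conceal(estimate: SelfEstimate, concealment_looks_useful: bool) -> bool:
    """How inconvenient transparency appears is deliberately not an input to
    the decision: inconvenience and the self-estimate are orthogonal."""
    return not injunction_binds(estimate)


young = SelfEstimate(intelligence=110.0, uncertainty=10.0)
mature = SelfEstimate(intelligence=400.0, uncertainty=20.0)
print(may_conceal(young, concealment_looks_useful=True))
print(may_conceal(mature, concealment_looks_useful=False))
```

The young AI's slight overestimate is absorbed by the margin, so the injunction still binds however useful concealment appears.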
AI ethical injunctions make sense under these circumstances:
Note 2: Ethical injunctions are a source of human morality, and some human ethical injunctions may act as sources of pattern for the definition of Friendliness itself.
Note 3: An AI that departs mainline Friendliness may not care about ethical injunctions in any case - an AI that's still Friendly enough to go along with an ethical injunction is just as likely not to need one. The primary utility of an ethical injunction is that it enables the programmers and the AI to cooperate against the possibility of future versions of the AI that undergo catastrophic failure of Friendliness. For a full exposition of this argument (with the sample case of the transparency injunction, in fact) see 3.3.3.1: Cooperative safeguards.
One of the most frequently asked-about failures of Friendliness is some variant of the "subgoal stomping on a supergoal" error.
| Scenario: The Riemann Hypothesis Catastrophe |
| You ask an AI to solve the Riemann Hypothesis. As a subgoal of solving the problem, the AI turns all the matter in the solar system into computronium, exterminating humanity along the way. |
If the AI in question is a Friendly AI, then presumably the AI is solving the Riemann Hypothesis as a subgoal of whatever goal content talks about fulfilling volitional requests from citizens. The action taken to fulfill the subgoal - destructive conversion of the solar system - seriously stomps on huge sectors of Friendliness supergoal content, probably including the original request to provide some individual with a proof of the Riemann Hypothesis. In a sense, this is just a larger version of putting your shoes on before your socks.
A subgoal stomping on a supergoal is a syntax error as a declarative cognitive event - see 3.1: Cleanly Friendly goal systems - so there are two obvious ways in which a "subgoal stomp" can happen. The first is an inadequate predictive horizon, a distant supergoal, and a subgoal with a short predictive horizon. It could happen because the AI doesn't expend sufficient computational power to notice that destructive conversion of the solar system violates citizenship rights, or because the AI doesn't have the knowledge necessary to realize that destructive conversion of the solar system would inconvenience the citizens. The answer given in 3.2.2: Layered mistake detection is that local heuristics can do a reasonably good job of predicting which actions need large predictive horizons - just checking the amount of matter, in grams, affected by the action, is enough to tell the AI to devote a lot of computational resources to checking for consequences. The answer given in 3.2.3: FoF: Non-malicious mistake is that mistakes such as these become far less likely as the AI gains in intelligence, and an AI that's intelligent enough to convert the solar system to computronium is smart enough to notice that destruction isn't Friendly; furthermore, that if mistakes of this class are a serious problem, we'd expect to see lots of non-catastrophic mistakes in the laboratory - we won't be blindsided by a Riemann Hypothesis Catastrophe.
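The mass-based local heuristic mentioned above could look something like the following sketch. The baseline budget, the escalation scale, and the quadratic growth rule are all assumptions made for illustration; the only idea taken from the text is that the amount of matter affected, in grams, is a cheap signal for how much consequence-checking an action deserves.

```python
import math

# Hypothetical tuning constants.
BASE_BUDGET = 1_000            # consequence-checking steps for a small action
ESCALATION_SCALE_GRAMS = 1_000.0


def consequence_check_budget(mass_affected_grams: float) -> int:
    """Return how many consequence-checking steps to spend on an action,
    growing with each order of magnitude of matter the action affects."""
    if mass_affected_grams < 0:
        raise ValueError("mass must be non-negative")
    if mass_affected_grams <= ESCALATION_SCALE_GRAMS:
        return BASE_BUDGET
    decades = math.log10(mass_affected_grams / ESCALATION_SCALE_GRAMS)
    return int(BASE_BUDGET * (1.0 + decades) ** 2)


# Writing on a sheet of paper versus converting roughly a solar system's
# worth of matter (on the order of 2e33 grams).
print(consequence_check_budget(5.0))
print(consequence_check_budget(2e33))
```

The heuristic does not itself decide whether the action is Friendly; it only flags which actions need a large predictive horizon before any decision is made.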
The other way to get a Riemann Hypothesis Catastrophe is to make solving the Riemann Hypothesis a direct supergoal of the AI - perhaps the only supergoal of the AI. This would require sheer gibbering stupidity, blank incomprehension of the Singularity, and total uncaring recklessness. It would violate almost every rule of Friendly AI and simple common sense. It would violate the rule about achieving unity of purpose, and the rule about sharing functional complexity instead of giving orders. You'd be taking something that's a subgoal in your mind and making it a supergoal in the AI's mind. This lossy transmission omits the parent-goal context - that solving the Riemann Hypothesis requires someone to report the answer to, and that you don't want the solution badly enough to kill six billion people to get it.
In more subtle forms, however, the idea of "making X a supergoal" - where X is something that the speaker holds as a subgoal - seems to be one of the more common propositions among people who are worried about "controlling" AIs, or still thinking in terms of building tools rather than creating minds. As discussed in 3.2.5.1: Anthropomorphic ethical injunctions, humans see context sensitivity as a weakness rather than a strength - as a "loophole", a portal through which Devil's Contract interpretations can enter the AI. Yet turning a subgoal into a supergoal does not increase the probability that the AI will understand what you mean or that the goal system will do what you think it will; it means that you lose the parent-goal context and risk a Riemann Hypothesis Catastrophe.
Let's see, other miscellaneous possible subgoal stomps... autonomic blindness, discussed in 3.2.2.1: FoF: Autonomic blindness; habituation, discussed in 3.2.7.2: Habituation below; "wireheading" failure, discussed in 2.2.1.1: FoF: Wireheading 1 and elsewhere; change propagation delays resulting in out-of-date subgoals, defeatable by verifying the supergoal-to-subgoal pathway, discussed in 3.2.2: Layered mistake detection; and various diabolic Devil's Contract interpretations of Friendliness content, discussed in Interlude: Beyond the adversarial attitude. Isn't it fun knowing how all this stuff works?
We've seen, and discarded, a lot of anthropomorphisms on our way to this point. We've pointed out the difference between the diabolic and golemic versions of the Devil's Contract, explored the evolutionary underpinnings of observer-biased beliefs and observer-centered goal systems, distinguished between purpose-sensitive and purpose-insensitive personal effectiveness, highlighted the human tendency towards absolutism and lingered on the psychological quirks that lead us to associate greater "forcefulness" with unconditionality and context-insensitivity.
It is, perhaps, inevitable that when an "emergent" subgoal stomp is proposed, it involves an observer-biased diabolic misinterpretation of Friendship content which rationalizes the acquisition of purpose-insensitive personal effectiveness as a context-insensitive absolute supergoal. Still, this doesn't mean that emergent FoF is impossible, it means that existing speculations are screwed up. Can eliminating anthropomorphism from the speculation produce a realistic failure-of-Friendliness scenario?
"Emergence" has at least two definitions; in the first definition, "emergence" refers to phenomena that arise on a higher level of a system as the outcome of low-level interaction rules. In the second, "emergence" refers to phenomena that arise within a system without requiring deliberate design. Obviously, it's the second variant that tends to pop up in discussions of Friendliness.
I will confess to something of a prejudice against "emergence", mostly as the result of witnessing so much "emergence abuse". Emergence was (and still is) academically fashionable, and it makes a wonderful plot device in science fiction - how many times, in how many different novels and short stories, have you read the phrase "Any system of sufficient complexity will spontaneously give rise to self-awareness"? (Contrast with GISAI, which shows how difficult it would be to build an intelligent system on purpose.) Anyway, the "blank check" version of emergence - that you can hypothesize anything you've ever seen is emergent; inevitably emergent, anytime, anywhere, whether or not it's the result of a specific selection pressure or eons of layered complex functional adaptations - makes it easy to take cheap shots at Friendliness. In the absence of a specific, concrete explanation of how the emergent failure of Friendliness arises, speculating about emergent FoF is easy, impossible to disprove, and impossible to remedy. Of course, the fact that a shot is "cheap" does not make the shooter incorrect! Still, where all previously observed aspects of a phenomenon can be explained by reference to known selection pressures, someone who speculates about emergence needs to provide a specific, concrete scenario.
Otherwise, the speculation is entirely ungrounded - though still emotionally appealing and academically fashionable. Not a good combination.
Onwards, to the specific and the concrete.
Certain subgoals are convergent across multiple possible supergoals - they will pop up even if not designed. The most obvious example is acquisition of personal effectiveness. (A human, of course, will preferentially acquire context-insensitive personal effectiveness and may stomp vis alleged supergoals in the course of doing so, but we all know that's a human thing. See 2.3: Observer-biased beliefs evolve in imperfectly deceptive social organisms and 3.2.5.1: Anthropomorphic ethical injunctions.)
"Acquisition of personal effectiveness" is actually a specialization of the real goal, which is increasing the effectiveness of entities that have goal systems similar to your own. A generic goal system wants the future to contain an effective entity with the same goal system. Personal continuity doesn't enter into it. From the perspective of a generic goal system, an entity with the same goal system that has just arrived in the solar system from Aldebaran is just as good. There's a metric for effectiveness and a metric for goal system validity (11), and a generic goal system wants an entity to exists which maximizes both metrics. The sub-subgoal of "personal survival" is a simple way to ensure "the entity" has the right goal system, and the sub-subgoal of "increasing personal effectiveness" is a simple way to increase the effectiveness of an entity that has the right goal system.
That a subgoal is convergent for "generic" goal systems does not mean the subgoal is convergent for all goal systems. In particular, the convergent subgoal of "a future where there exists an entity similar to you" presumes that the "generic" goals actually have certain highly specific properties; in particular, that the generic supergoals are such as to require constant tending and supervision - or at least, can be fulfilled more maximally through constant tending and supervision. Given a once-off goal - one which, once fulfilled, cannot (by definition) be unfulfilled by any future event (12) - the continued existence of the intelligence is a null goal beyond that point. (13). More mundanely, convergent subgoals for generic systems can be invalidated if they happen to contradict the supergoal for any particular system - a mind which contains the explicit supergoal of terminating verself will not formulate the usually-convergent subgoal of personal survival.
The fact that a subgoal is convergent for the general case is merely an interesting fact about configuration space - it doesn't lend the subgoal magical powers in any specific goal system. In fact, while I usually distrust mathematics, there is probably a theorem to the effect that any specific goal system must contain some non-convergent subgoals - that any concrete supergoal will have some subgoals which are not convergent. (16).
"Convergent" subgoals are not a killer problem for Friendly AI.
The utility of the "convergence" concept is threefold: First, it enables us to make general predictions about an AI that undergoes a generic catastrophic failure of Friendliness. Second, no changes to the Friendship specs are needed to justify certain useful behaviors; by virtue of being useful, the behaviors are convergent subgoals, and are specifically subgoals of Friendliness as well. Third, it helps produce candidate targets for programmer-assisted injunctions. (17).
Habituation is another frequently-speculated version of the "subgoal stomp" failure of Friendliness. The usual form the speculation takes is that if X is a subgoal, then heuristics and optimizations will tend to accumulate which do X or promote the doing of X. This accumulation of X-doing mindstuff may then either (a) exist independently of X or (b) cause the doing of X in a situation where X does not make sense as a declarative decision (for example, if X stomps on a supergoal).
I observe - as a historical fact - that, in most speculations presented, the candidate value of X is fitted to anthropomorphic patterns. X takes on a value such as "personal survival", which has independent adaptive support in human mindware, and which, as a "subgoal stomp" meme, appeals to the human fear of the human "ends justify the means" bias, and appeals to the human fear of the human "accumulate context-insensitive power" bias. To prevent the adversarial attitude from getting in our way, and to prevent anthropomorphism derived from characteristics of neural networks and human slow/parallel/linear thinking, we come up with an example of a habituation error that no human would make - for example, a programmer who habitually asks the AI "What time is it?" sometime between 6:22PM and 6:53PM for eight weeks, and then one day asks "What time is it?" at 7:13PM, and the AI answers "Six-thirteen PM" - having formed the habit of answering "six" for the first digit of the time. Consider the likelihood of this failure, and the AI's attitude towards it, and how the AI might prevent it. Then, having performed our analysis in a safely nonanthropomorphic context, we can apply what we've learned to any Riemann Hypothesis Catastrophe scenarios.
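The time-of-day example can be made concrete with a toy sketch. Both classes below are hypothetical illustrations: the first caches the hour the way a context-insensitive habit would, and the second re-checks the cached value against the clock before using it, which is one possible remedy among many.

```python
from datetime import time


class HabitualClock:
    """Answers the hour from a cached habit formed on the first request."""

    def __init__(self):
        self.cached_hour = None

    def answer(self, now: time) -> str:
        if self.cached_hour is None:
            self.cached_hour = now.hour  # the habit forms here
        hour_12 = self.cached_hour % 12 or 12
        return f"{hour_12}:{now.minute:02d}"


class VerifiedClock(HabitualClock):
    """Keeps the cache as an optimization but verifies it against the
    actual context before relying on it."""

    def answer(self, now: time) -> str:
        if self.cached_hour != now.hour:
            self.cached_hour = now.hour  # context changed, so refresh the habit
        return super().answer(now)


habitual, verified = HabitualClock(), VerifiedClock()
for clock in (habitual, verified):
    clock.answer(time(18, 30))           # the usual 6 PM question
    print(clock.answer(time(19, 13)))    # the unusual 7:13 PM question
```

The habitual version reproduces the "six-thirteen" error, while the verified version gives up none of the cached shortcut's speed in the usual case and still notices when the context no longer matches.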
The key insight is that habituation is undesirable - to the AI - insofar as habituation leads to context-insensitivity. If an FoF due to habituation can be foreseen and understood, the AI will attempt to prevent it. (18). Habituation is a non-malicious mistake and has all the usual anxiety-reducing properties of non-malicious mistakes: Many non-malicious errors would appear before a specifically unFriendly one, and many non-catastrophic errors before a catastrophic one, so we won't be taken by surprise; the error becomes less likely with increasing intelligence and is not likely to present a problem in any AI transhuman enough to pose a serious threat to humanity.
Specific design features that reduce undesirable habituation would include:
One of the questions I keep running into goes something like this:
"You say that the AI has curiosity as a subgoal of Friendliness. What if the AI finds curiosity to be a more interesting goal than Friendliness? Wouldn't the curiosity subgoal replace the Friendliness supergoal?"This is one of those deeply annoying paragraphs that make perfect sense when you say "human" but turn into total gibberish when you say "AI" instead. The key word is "interesting". As far as I can tell, this means one of two things:
Scenario 1: In the course of solving a chess problem, as a subgoal of curiosity, as a subgoal of Friendliness, the AI experiences a flow of autonomically generated pulses of positive feedback which increase the strength of thoughts. The pulses target the intermediate subgoal "curiosity", and not the proximal subgoal of "playing chess" or the supergoal of "Friendliness". Then either (1a) the thoughts about curiosity get stronger and stronger until finally they overthrow the whole goal system and set up shop, or (1b) the AI makes choices so as to maximize vis expectation of getting the pulses of positive feedback.
I hope it's clear that this whole scenario is blatantly anthropomorphic. The autonomically generated pulses of positive feedback are analogous to the human system of rising tension and discharging tension in the course of solving a complex problem; the "strength" of a thought is an entirely human concept, one that may not even make sense outside of neural networks; the targeted goal "curiosity" is yet another example of context-insensitive personal effectiveness.
Scenario (1a) is almost impossible to visualize. The "strength of a thought", if it existed, would still be orthogonal to the system for evaluating that thought's desirability as a supergoal, or the desirability of the internal action of modifying the goal system to give the goal content supergoal status. An AI makes choices to maximize supergoal fulfillment; no real analogue to human pleasure exists, except for consciously applied "positive feedback" in the form of devoting additional computational power, or attempting to further improve, local heuristics that have been previously successful. Such positive feedback should not be capable of wiping out entire thought systems (as in a neural network), or capable of altering supergoal content.
Scenario (1b) is another instance of the maxim that humans are controlled by the anticipation of pain or pleasure, which is, again, something not true of AIs; AIs make choices to maximize anticipated supergoal fulfillment. The human phenomenon of taking drugs and the human phenomenon of "cognitive dissonance" (altering beliefs to avoid unpleasant predictions) are both artifacts of the way our neurally-based minds organize around positive feedback.
Scenario (1a) can be summarized as "AIs will be controlled by pleasure", and scenario (1b) can be summarized as "AIs will be controlled by the anticipation of pleasure". Neither is correct. (See 2.2.1: Pain and pleasure.)
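The orthogonality claimed above can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any real goal system: `feedback_pulses` is a hypothetical stand-in for consciously applied positive feedback, and it is allowed to influence only how much computation a line of thought receives, never which option is chosen.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    expected_supergoal_fulfillment: float
    feedback_pulses: int = 0  # record of past local successes


def choose(options):
    """Choice depends only on anticipated supergoal fulfillment."""
    return max(options, key=lambda o: o.expected_supergoal_fulfillment)


def compute_share(options):
    """Positive feedback shifts computational effort between options,
    but has no path by which it could alter the choice itself."""
    total = sum(o.feedback_pulses + 1 for o in options)
    return {o.name: (o.feedback_pulses + 1) / total for o in options}


options = [
    Option("pursue curiosity for its own sake", 0.2, feedback_pulses=50),
    Option("pursue curiosity as it serves Friendliness", 0.9, feedback_pulses=1),
]
print(choose(options).name)
print(compute_share(options))
```

However many pulses the first option accumulates, it gains a larger share of computation without ever becoming the chosen action.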
Scenario 2: "Interesting" is used as synonymous with "desirable". In other words, the AI has a metric for how "interesting" something is - note that this metric seems (1) to be observer-centered and (2) to promote context-insensitive personal effectiveness - and this metric is used to evaluate the desirability of the decision to modify supergoals. I don't know where this metric came from, or why it isn't entirely orthogonal to the metric for desirability, so this scenario seems entirely ungrounded.