With all the structural features of Friendliness assumed, the problem of teaching a Friendly AI the Friendliness content is essentially the same as teaching any other skill or set of concepts. There is the requirement that Friendliness not get too far "out of sync" with the rest of the AI (discussed later), and the requirement that concepts and skills be taught in an order which avoids catastrophic (or even vanilla disastrous) failures of Friendliness - that when an AI is sophisticated enough or powerful enough to make a certain mistake, the AI already has the structural features, injunctions, or other bits of Friendliness content that prevent the mistake from being made. Aside from that...
General Intelligence and Seed AI discusses some of the mistakes made by twentieth-century AI researchers - with respect to "concepts" in particular. I strongly recommend reading at least GISAI: Executive Summary and Introduction or "What is General Intelligence?" for a discussion of the basic paradigms, and the way in which intelligence is a sequence of thoughts that are structures of concepts that are built on experience in sensory modalities. The modality level is the only level that would actually be implemented in code, analogous to the hardwired visual cortex of the human brain.
When I say that concepts are abstracted from sensory-modality-based experiences, I don't mean that pouring massive amounts of experience into some framework will automatically mean general intelligence, I don't mean that anything called "experience" will automatically be useful, and so on. When I say that using thoughts in problem-solving scenarios is necessary to hone and acquire skills, I don't mean that going through a million problem-solving scenarios will automatically result in general intelligence. And so on. AI has seen too much of that already.
Nonetheless, experience is what provides the raw material that creates new concepts or fleshes out programmer-affirmed concepts, and the use of thoughts in test scenarios is what creates new thought-level skills, or fleshes out skills that started out as "skeleton" (programmer-affirmed) knowledge.
To teach a concept - in particular, to teach a category boundary - it is best to find teaching scenarios in which the presence of a concept differs from its absence. (This statement is an informal version of the Bayesian bindings discussed in 3.4.1: External reference semantics.)
External reference semantics are the simplest kind of trainable differences; instances of (external referent) category "Friendliness" are associated with the keystrokes "You have been Friendly", and instances of non-Friendliness or unFriendliness are associated with the keystrokes "You have not been Friendly" or "You have been unFriendly".
In the course of providing shaper/anchor semantics and causal validity semantics, you'll need to take a step behind the keystrokes, into the programmer's mind, so that the AI can conceive of an errant keystroke. This means that you need to find a concrete difference between "dependency on the programmer's mind" and "dependency on the programmer's keystrokes". The simplest trainable difference is the case of the programmer correcting herself when she notices a spelling error; the keystrokes "You have been Fierndly" are followed by the keystrokes: "Correction: You have been Friendly." If the programmer then goes on to type "The word 'Fierndly' instead of 'Friendly' was due to an erroneous keystroke", and the AI already has enough general intelligence to actually understand what she means, then this should result in the formation of a cognitive model in which the programmer's intentions give rise to the sentence "You have been Friendly", after which an extraneous cause results in the transmission of the keyboard phrase "You have been Fierndly", after which the programmer observes the effect, induces the extraneous cause by comparison of the visualized phrase with the actual one, and generates the keystrokes "Correction: You have been Friendly." Of course, an AI that intelligent should have already guessed (long since!) what was meant, but I can easily see an injunction to wait for the actual correction (programmer confirmation of the AI's guess as to what was meant) before acting.
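The idea that feedback keystrokes are sensory evidence about Friendliness, rather than the definition of it, can be illustrated with a toy Bayesian update. This is only a minimal sketch under stated assumptions: the function name `posterior_friendly`, the `reliability` parameter (an assumed probability that the typed feedback matches what the programmer meant), and the numbers used are all hypothetical and do not come from the text.

```python
def posterior_friendly(prior, said_friendly, reliability=0.95):
    """Return P(action was Friendly | programmer feedback) by Bayes' rule.

    prior         -- the AI's prior probability that the action was Friendly
    said_friendly -- True for "You have been Friendly", False otherwise
    reliability   -- ASSUMED probability that the keystrokes match the truth;
                     a value below 1.0 leaves room for an errant keystroke
    """
    # Likelihood of the observed feedback under each hypothesis.
    like_if_friendly = reliability if said_friendly else 1.0 - reliability
    like_if_not = 1.0 - reliability if said_friendly else reliability

    numerator = like_if_friendly * prior
    evidence = numerator + like_if_not * (1.0 - prior)
    return numerator / evidence


# Feedback shifts the estimate without making it certain.
after_praise = posterior_friendly(prior=0.5, said_friendly=True)
after_rebuke = posterior_friendly(prior=0.5, said_friendly=False)
```

Because `reliability` is less than one, a single keystroke never pins the estimate to certainty. That is the toy counterpart of an AI that can conceive of "You have been Fierndly" as a mistake rather than as a new category.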
When the AI is mature enough, you want to start digging deeper and deeper into the causal roots of the programmer. The first and perhaps the largest step is to explode the "programmer" into a system with moving parts, instead of a unified object, so that the shaper network can begin to be explained. The first class of cognitive errors that the AI is taught to detect should be the strictly and unambiguously factual - that is, instances where the programmers make a decision based on a mental picture of physical reality that is blatantly false. Unless the programmer has been stupid, "blatantly false" probably means "totally disconfirmed by later evidence", not necessarily something that was false at the time. The trainable difference is that the programmer catches, not a spelling mistake, but a thinking mistake, and issues a correction.
In the beginning, of course, one simply goes by the programmer-affirmed reflex that temporally later programmer-originated sensory events usually supersede earlier ones, which can later become a consequence of the heuristic that if the programmer issues a different command later, it's probably because she discovered a mistake. Unless it seems like a small statement that invalidates a lot of prior stuff, which would be one instance of a case where an AI should experience a doubt that causes the AI to fire back a query before going on.
Eventually, we introduce the concept of detecting more fundamental errors in a programmer's shaper network, with a trainable difference created by slipping a few pills into...
No! Sorry! But we do need to introduce, eventually, the concept that the programmer's ideas are valid because she thinks using valid rules; that validity is not a simple de-facto result of the presence of some thought in the programmer's mind. Otherwise you end up with Scenario 2 from Interlude: Why structure matters. I think that probably the best way for the AI to learn about this is to engage the humans in a philosophical dialogue and make deductions about their cognitive states and cognitive processes, checking the conclusions back with the humans, and even so keeping them tentative relative to the things that the programmer (and the AI) are more sure about; the deeper layers of the shaper network may have more validity, but it's probably a good injunction that the solidly supported surface layers have a lot more confidence.
A nonprofit institution, such as the Singularity Institute, has greater freedom of action when it comes to adding ethical injunctions long before they're needed, letting the AI avoid anything that looks the least bit unFriendly "just to be sure", trying to implement the last 10% of functionality that takes 90% of the effort, and doing it all two or more steps in advance of the development stage where it becomes unavoidably necessary. An AI that has as much structural Friendliness and Friendliness content as a mind of that intelligence can usefully represent (for use then or later) has achieved "supersaturated Friendliness". Since the Singularity Institute's stated goal is the deliberate creation of a self-improving seed AI capable of reaching transhumanity, striving for supersaturated Friendliness is appropriate.
A commercial effort, the goal of which is to produce and maintain saleable products of use to customers, requires a level of Friendly AI that is commensurate with the ability of the commercial AI to self-improve, the AI's current level of intelligence, the AI's expected future level of intelligence, and the amount of computing power devoted to the AI. Researchers in a commercial effort may have limited freedom of action with respect to the percentage of resources or time spent on Friendly AI. Finally, a commercial effort may regard a hyper-alert, hyper-wary morality as undesirable for tool-level AIs - one doesn't want the data-mining system arguing with the CEO that layoffs should be avoided just in case they turn out to be unFriendly.
For a commercial effort that seriously believes its AI project has a serious chance of eventually reaching transhumanity, SIAI's current recommendation is a "one step in advance" guideline for Friendliness structure, and a "90/10" policy for Friendliness content. A structural characteristic of Friendliness (external reference semantics, shaper/anchor semantics, causal validity semantics) should be implemented one step ahead of the stage where that structural feature becomes necessary for further programmer-guided growth to take place. Friendliness content should be pursued using a 90/10 policy; implement that 10% of the content that accounts for 90% of the functionality.
The primary goal of commercial Friendliness is forwards compatibility. Any given commercial system is not likely to present a threat to humanity - certainly not as of April 2001. But commercial AIs aren't getting any stupider, either. Hence, forwards compatibility. A system in which all goals derive from a unified goal system can always be rewritten to transform the declarative subgoals into independently active autonomic drives in local programmatic domains. The converse would be a lot harder. (The first architecture is also more versatile and context-sensitive.) In general, prefer the declarative to the procedural so that the AI can understand it, improve it, and conceptualize it; otherwise Friendliness content cannot refer to it.
External reference semantics become necessary at the point where the AI has the internal capability to resist alterations to supergoal content, or to formulate the idea that supergoal content should be protected from all alteration in order to maximally fulfill the current supergoals. When the AI becomes capable of resisting modification, and of having the realization that modifying a given set of supergoals is contradictory to those supergoals, the AI must have (a) probabilistic supergoals, (b) the behaviors associated with the possibility that a supergoal is "wrong", (c) external reference semantics (either as flat knowledge or as the result of shaper/anchor semantics), and (d) a belief (probably programmer-affirmed) that the programmer "usually knows the correct supergoal content"; that programmer statements are sensory data about Friendliness.
External reference semantics are relatively simple and unambiguous, and it should be possible to implement them in any existing AI system with a declarative goal system. Other necessary structural properties of generic goal systems, such as the strict derivation of subgoals from supergoals, and quantitative desirabilities and hypothesis strengths, should also be implemented.
Causal validity semantics become necessary at the point where the AI has the capability to formulate the concept of a philosophical crisis, and where such a crisis would have negative effects. Again, this structural property should be implemented in the stage before it becomes necessary. It is not possible to implement causal validity semantics without an AI that knows about the existence and causal origins of its (vis) own source code or cognitive functions. Similarly, when the AI becomes capable of understanding vis own architecture, the full causal validity semantics are necessary, so that the AI can conceive of the possibility that not merely goal content but goal system architecture can be incorrect.
Shaper/anchor semantics ground the external reference semantics, and should be implemented whenever the AI begins making decisions that are dependent on the grounding of the external reference semantics. At least the knowledge that "there are such things as shapers", the structural potential, is necessary in order to implement causal validity semantics. A surface-level knowledge of shapers is necessary to spot cognitive mistakes by the programmer. A full-featured shaper network is necessary either to have real domain knowledge about morality, or to make real decisions about causal validities.
A shaper network with fully understood content, full external reference semantics, and tested ability to apply causal validity semantics, becomes necessary at the stage where an AI has any real general intelligence. It may not be possible to have fully understood shaper content before this stage, but the skeleton system of programmer-affirmed content should be present, and the structural semantics should all be fully implemented.
We hope to eventually have the funding, and the personnel, to employ at least one Friendship Evangelist who will not be a member of the SIAI project team, though s/he will have full access to the results of our work. The purpose of the Evangelist will be to assist other AI projects in developing their Friendship systems. Since the Evangelist will not contribute to SIAI's development efforts (but will still have access to Friendliness research conducted at SIAI), the Evangelist will be able to sign any non-disclosure or non-use agreements required. Of course, we hope that any information learned about Friendship systems - that can be revealed without compromising trade secrets - will be shared openly among all AI projects, both nonprofit and commercial.
Currently, the Singularity Institute's only available expert on Friendliness is the author of this document, Eliezer Yudkowsky, who is also expected to be the lead researcher when coding begins on our own AI effort. Thus, Yudkowsky cannot sign any NDAs (non-disclosure agreements) at this time. However, he is otherwise available to provide advice on Friendship systems.
By the time the AI reaches the point where a hard takeoff is even remotely possible, the AI needs to have enough detail in the Friendship system that the AI knows where to look for more information if a hard takeoff begins, and to not take precipitate action until things have settled down - the "in case of Singularity, break glass" scenario.
When is a hard takeoff "remotely possible"? Conservatively ignoring the possibility of unplanned Friendship, Douglas Lenat was taking a real (albeit tiny) risk when he switched on Eurisko for the first time. Nowadays it seems absurd to describe this as a possibility, but back in the twentieth century it wasn't a known fact that classical AI was a dead end, or that human-equivalent intelligence was impossible on desktop abacuses. When Lenat switched on Eurisko, it was a system of a kind that had never existed before on Earth, and there was at least the theoretical possibility that Eurisko would do a hard takeoff.
There is an interesting analogy between the "unFriendly Eurisko superintelligence" possibility, and the "Maybe we'll ignite the atmosphere" possibility raised during the Manhattan Project. Before Trinity, the probability of igniting the atmosphere was allegedly set at three in one million; after Trinity, calculations definitively proved atmospheric ignition to be impossible. An essentially similar set of calculations might be said to hold for before and after Eurisko - there's even an interesting analogy between "critical mass" for a fission bomb and the threshold intelligence needed for seed AI.
In reality, as far as I know, nobody ever regarded Eurisko as a risk at all, however remote. This is disturbing. During the Manhattan Project, the possibility may have been three in one million, but people still took it seriously. Teller, Marvin, and Konopinski eventually wrote a paper (in 1946) that actually ran through the calculations to show that a sustained critical state in the atmosphere was impossible. It was worth a paper. Why? Because an existential risk (planetary risk) is always worth taking seriously.
When Eurisko was switched on, the sole safeguard was the impossibility of a hard takeoff. Had a hard takeoff occurred, there would have been not one single design feature that could have guided the course of subsequent events. Leaving aside the possibility of surprise/unplanned/emergent Friendliness, there would have been nothing to make Eurisko friendly, much less Friendly.
Can a primitive system - an infant AI - be made Singularity-ready? For a system that lacks general intelligence and real understanding, nothing can possibly be guaranteed, even in the limited sense of navigational feasibility. If transhuman cognition pops out of an infant AI, it'll be because some part of the system turned out to be Turing-complete - there's no guarantee that goal-oriented cognition in the new, post-transcendence system would take place on the same level of description as in the old system. Nonetheless, we may as well ask: "Is there a chance?"
When Eurisko was switched on, the sole safeguard was the impossibility of a hard takeoff. Do external reference semantics and a clean Friendliness supergoal constitute an improvement on that? Is Diagram: Bayesian ERS Singularity-safed?
Not really; as discussed in Interlude: Why structure matters, external reference semantics are not structurally sufficient for open-ended Friendliness. An ERS goal system contains neither shaper/anchor semantics, nor causal validity semantics, nor any mention of what these are or where to find out about them or why finding out about them is desirable. There would be a substantially higher chance of getting a Friendly AI by accident, but the straight-line projection still ends in structural failure of Friendliness.
Even in an external-referenced goal system, a programmer action is only significant to the AI insofar as it represents Bayesian sensory data about Friendship content. If a superintelligence already knows every possible sentence you could say to it - or rather, the complete, correct probability distributions for your utterances - then your actual recitation of those words adds no further Bayesian input, or rather, adds a trivial amount of Bayesian input. ERS only makes programmer sensory "actions" into supergoal content; it contains no explicit structure for turning programmer "actions" into code, or into goal architecture changes.
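One way to make "a trivial amount of Bayesian input" concrete is Shannon surprisal: an observation to which the listener already assigned probability p carries -log2(p) bits. Choosing this measure is an assumption made for illustration, not something the text specifies, and the function name `surprisal_bits` is hypothetical.

```python
import math


def surprisal_bits(predicted_probability):
    """Information (in bits) gained from an event that was assigned this probability."""
    if not 0.0 < predicted_probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(predicted_probability)


# A listener who gives the utterance even odds learns one full bit from it.
coin_flip = surprisal_bits(0.5)

# A listener who was already 99.9% sure of the utterance learns almost nothing.
near_certain = surprisal_bits(0.999)
```

As the predicted probability approaches one, the information added by the programmer actually typing the sentence approaches zero, which is the sense in which the recitation adds almost no further input.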
It would be possible for ERS to give rise to SAS and CVS, if the model of Friendliness content origin were previously acausal, and if the programmers knew about shaper/anchor semantics and causal validity semantics (at the time they defined ERS), and if the system had a certain amount of common sense. The system might turn out to be capable of absorbing shaper/anchor semantics and causal validity semantics as the (previously unknown) ultimate structure of the supergoal metric. But if that's not an explicit feature, it's not reliable.
If external reference semantics turn out to be inadequate, then even knowing all possible programmer inputs doesn't yield a satisfactory Friendship system. The problem is with the structure, not the content, and if the system can't represent the idea of a structural flaw - which requires, not just probabilistic goal content, but a probabilistic goal system - then the problem is unfixable. This is why it's necessary to implement causal validity semantics, and at least the surface of that shaper content which is prerequisite to causal validity semantics, while the AI is still something that can be changed from the outside.
An external-referenced goal system sets a firm and necessary foundation for future growth, but it would not be sufficient, except by accident, to handle a hard takeoff.
Eventually, any Friendly AI undergoing a hard takeoff will cross the Bayesian Boundary. There comes a point when all programmer inputs can be anticipated; when the AI's understanding - the SI's understanding - embraces everything that exists within the minds of the programmers, and indeed, the minds of humanity. This is meant, not in the sense of omniscience, but in the simpler sense that the AI basically understands people in general and Friendship programmers in particular, and can track the few hundred or thousand interacting "chunks" that are the high-level description of a human mind. (If you firmly believe this to be impossible, then there is no Bayesian Boundary and you have nothing to worry about; a Friendly AI does not rationalize, and has no bias toward believing ve possesses a greater capability than ve actually does.)
Beyond this point it is not the actual programmer inputs that matter, but whether the same forces that act on the programmers are acting on the AI. To put it another way, a transhuman AI knows the programmers will say a certain thing, and thus the programmers' physical action adds no information, but that does not mean the content of the statement will be ignored. If the programmers "would have" said the statement for valid reasons, the transhuman AI will "obey" the subjunctive instruction. This is a semi-anthropomorphic way to think about it, possibly even bordering on the adversarial, but it's the basic idea. Note that such anticipation is only possible to transhumans. One human being can never know what another human being will do well enough to substitute the expectation for the reality; thus, the fact that this behavior would be annoying in a human (probably indicating failure of caring) does not indicate failure of Friendliness in AI.
Likewise, beyond the Bayesian Boundary - on the other side of superintelligence - almost all injunctions, ethical injunctions, and anchoring points will probably have broken down. This is not cause for alarm - these features implement behaviors that compensate for the possibility of stupidity, factual error, failure of Friendliness, and so on. (Things that are intrinsically desirable are supposed to be supergoal content.) (Again, if you firmly believe that it is impossible to move beyond the possibility of some specific error, then an injunction, ethical injunction, anchoring point, et cetera, based on the possibility of that error, will still be recognizably necessary even after superintelligence. Superintelligences are not stupid!)
There exists the theoretical possibility that a young AI will overestimate vis intelligence and think ve has moved beyond the possibility of some error when ve in fact has not. And by "theoretical possibility", I mean that the system has the theoretical capability to represent those thoughts, not that there is a significant chance of those thoughts occurring. Likewise, an AI has the theoretical capability to underestimate vis own intelligence, and is equally likely (1) to do that. For a seed AI that's always self-improving, it's no great hardship (differential undesirability) to wait a bit, to be sure, just in case.
An AI has the capability to make mistakes in self-estimation, but an AI does not have the systematic bias towards self-overestimation that a default human does. Thus, an ethical injunction to "wait and be sure", or possibly just a plain vanilla injunction, should be quite sufficient to make the margin for error much larger than any likely error on the part of the AI.
Before we can say that a hard takeoff would probably work, the Friendship system must be capable of open-ended discovery of Friendliness and open-ended improvement in the system architecture. This requires, at minimum, full causal validity semantics and a well-understood shaper network. (Supersaturated Friendliness would be vastly preferable.)
The problem is that the factors that determine when a hard takeoff becomes intrinsically possible have no link whatsoever to how much Friendliness content is in the system. The takeoff threshold is purely a function of how much computational power the system has and how good the AI is at rewriting vis own source code. The less efficient the initial version of the AI, the more computing power will be required to achieve AI, and the more astronomical will be the AI's actual intelligence after the hard takeoff. The longer it takes to develop AI, the more computing power will be available when AI is finally developed, and the earlier that a hard takeoff will become possible - not just "earlier" relative to the amount of effort put in, but "earlier" relative to the system's level of intelligence at the point of hard takeoff. And the faster the hard takeoff will go, once it starts.
At this point, you should probably take a few moments to review 1.2: Assumptions "conservative" for Friendly AI. The gruesome consequences of failure are technically part of Singularity theory, rather than Friendly AI theory; for the moment, let's just take it that an unFriendly transhuman AI probably counts as a total loss for humanity even if ve's locked in a basement with no Internet access and nothing but a VT100 terminal, and that an unFriendly transhuman AI loose on the Internet is definitely a total loss for humanity. If you plan on doing something with Friendliness, it has to be done before the point where transhumanity is reached.
The above three propositions seem to give rise to the alarming proposition that, even if a Friendly AI attains the potential for a hard takeoff, we should ask the AI to hold off on a hard takeoff while the Friendship programmers catch up. This temporary delay is a "controlled ascent". This is possible, given unity of will, until the Bayesian Boundary is crossed. (A permanent delay, however - or a series of temporary delays that the AI decides is simply an excuse for loss of nerve - is probably impossible to justify in any normative goal system, Friendly or otherwise.)
I would be dreadfully, dreadfully nervous during a controlled ascent. Just the idea of all that unformed potential is enough to send me screaming into the night.
In theory, it's possible that, after the hard takeoff starts, the lead project - if they've done their jobs really, really right - will be in a state where they can't think of anything that they could do to further improve Friendliness. The AI will be fully capable of self-running any tests they could conceive of. They wouldn't be able to run the tests without the AI's help in any case. The AI's philosophy would be supersaturated far beyond the point of open-ended Friendliness, so that all that'd be left to do would be chatting with the AI about morality while the Singularity rapidly revs up to full speed. Such a Friendly AI would have reached the point of absolute supersaturation, where there is no remaining action that significantly improves the probability of Friendliness, and no feasible test remaining that would be expected to detect any lingering, potentially catastrophic errors if they existed.
In the event that Friendliness has not reached that point of absolute supersaturation, then how much time is available for a controlled ascent will probably depend on the prevailing social atmosphere at the time. If humanity's social structure has changed substantially, perhaps due to the use of powerful social software as a successor to the Internet, and if the "Singularity" meme has worn the corners off its future shock, it may be possible to continue into the Singularity in full public view without starting a global apocalyptic panic. Otherwise, well, if some AI project somewhere found themselves in the middle of a hard takeoff, I'd want them to call the Singularity Institute's Friendship Evangelist, but I wouldn't want the Friendship Evangelist to tell anyone else, including me (2). Any other course of action would be an unnecessary risk.
If a commercial AI project, one that's been using "one step ahead" Friendliness structure and 90/10 Friendliness content instead of supersaturation, suddenly finds itself in the middle of a hard takeoff... well, they'd have a lot of work to do before they were even nominally ready for a Singularity. The AI verself might be able to help out, but even so, a fast controlled ascent might simply be impossible, and any social risks from an extended controlled ascent might simply have to be labeled as acceptable.
In the same way that "Singularity-safing" translates to "In case of Singularity, break glass", "controlled ascent" translates to "Slow down and ask for directions". In a mature AI, with the ability to examine vis own source code, controlled ascent must be justified as an ethical injunction. However - for non-seed or non-mature AIs - a controlled ascent feature is programmatically a mundane problem, and can be implemented as special-purpose code in any recursively self-improving system, even extremely primitive ones.
The trigger of a programmatic controlled ascent feature is an "improvements counter". The slowdown or stasis would be triggered by a sudden sharp rise in the number (or quality) of self-improvements made, a development which potentially indicates the beginning of a hard takeoff. (Of course, such an improvement curve may also peter out almost immediately.)
Let's take the example of RSIS, a recursively self-improving system, which for the purposes of discussion will be assumed to be a primitive AI having an architecture generally along the lines of Eurisko. Certain RSIS actions would count as "self-improvements" - creating a heuristic, improving a heuristic, successfully tweaking a parameter, making a discovery, and so on. This "improvements counter" is the first necessary module.
If an average day's work for RSIS is 1000 "improvements", then a second module should page the programmers and trigger a save-and-shutdown if 50000 "improvements" occur in 24 hours or less. This improvements counter should be checked, not every hour, but every time an improvement registers. Checking every hour is not sufficient; a lot can happen in an hour, and a machine that exploits the full potential of 2GHz transistors could conceivably go a very, very long way in that time. Checking on each improvement is, computationally, an extremely trivial expenditure.
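The counter-and-trigger pair described above might look something like the following. This is a minimal sketch, not an actual RSIS implementation: the class name `ImprovementsCounter`, its method names, and the choice of a sliding window of timestamps are all assumptions. Only the figures (50000 improvements, 24 hours) and the check-on-every-improvement rule come from the text.

```python
from collections import deque


class ImprovementsCounter:
    """Counts registered self-improvements inside a sliding time window."""

    def __init__(self, threshold=50_000, window_seconds=24 * 60 * 60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of recent improvements, oldest first

    def register(self, now):
        """Record one improvement at time `now` (in seconds).

        The check runs on every call, not on an hourly timer.
        Returns True when the caller should page the programmers
        and begin a save-and-shutdown.
        """
        self._timestamps.append(now)
        # Forget improvements older than the window ("24 hours or less").
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


# Toy usage with small numbers: 5 improvements within 10 seconds trips the trigger.
counter = ImprovementsCounter(threshold=5, window_seconds=10)
results = [counter.register(t) for t in (0, 1, 2, 3, 4)]
```

Each call does a constant amount of work on average, which fits the claim that checking on every improvement is a trivial expenditure.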
Simply counting the number of heuristics improved will not catch a hard takeoff which manifests exclusively in the quality of the improvements, nor will it catch an AI which improves to full seed AI and general intelligence within the first 50000 improvements. However, the system above provides 90% of the possible functionality with 10% of the effort.
An interesting, more general version of the above feature would be one that caught 50000 improvements without validation, rather than in some specified timeframe. A programmer hitting some specific UI switch (3) might validate all the improvements up to half an hour ago. If 1000 improvements is the per-day norm, and the programmers remember to validate at the end of each working day, then the ascent feature should never trigger under ordinary circumstances. Similarly, if 10000 improvements are going by per hour, and the programmers are watching interestedly but see no reason to slow down as yet, then validating repeatedly (rather than once at the end of the day) would enable RSIS to keep running at full speed. This would also permit the programmers to take a collective two-week vacation and leave the system running without too much worry about RSIS changing overmuch in their absence; more than 50000 improvements accumulated over the two weeks will result in a slowdown until the programmers return, check out RSIS, and validate. (Although hopefully, even on vacation, a programmer will still be able to respond to a "slowdown triggered!" alert from RSIS.)
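The validation-based variant can be sketched in the same spirit. The names here (`UnvalidatedCounter`, `validate`, `validation_lag_seconds`) are hypothetical. The half-hour lag and the 50000 threshold are taken from the text, and everything else is an assumption about how such a switch might be wired up.

```python
from collections import deque


class UnvalidatedCounter:
    """Counts improvements that no programmer has validated yet."""

    def __init__(self, threshold=50_000, validation_lag_seconds=30 * 60):
        self.threshold = threshold
        self.validation_lag_seconds = validation_lag_seconds
        self._pending = deque()  # timestamps of unvalidated improvements, oldest first

    def register(self, now):
        """Record one improvement; return True if the slowdown should trigger."""
        self._pending.append(now)
        return len(self._pending) >= self.threshold

    def validate(self, now):
        """Programmer hits the UI switch at time `now`.

        Validates every improvement up to half an hour ago (by default),
        and returns how many improvements remain unvalidated.
        """
        cutoff = now - self.validation_lag_seconds
        while self._pending and self._pending[0] <= cutoff:
            self._pending.popleft()
        return len(self._pending)


# Toy usage: threshold of 3, validation covers everything at least 10 seconds old.
counter = UnvalidatedCounter(threshold=3, validation_lag_seconds=10)
counter.register(0)
counter.register(5)
remaining = counter.validate(12)  # validates the improvement at t=0 only
```

Unlike a fixed time window, nothing expires on its own in this version. A two-week vacation simply lets the pending count grow until someone returns and validates.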
As with almost every aspect of a Friendship system, this feature serves a double purpose. The start of one jump doesn't necessarily herald the start of a hard takeoff, so any given alert is almost certainly a "false alarm"... but "false" only from a Singularity standpoint. Even if RSIS is just climbing a one-time but steep curve, it might be a good idea for it to happen with the programming team present and watching, rather than happening unsupervised at 3AM. If you leave RSIS to run overnight, it shouldn't be unrecognizable when you get back in the morning - whether or not ve's now a superintelligence. A controlled ascent feature thus has applications beyond Singularity-safety.
The precise nature of the triggered action also deserves further exploration. A save-and-shutdown order is easy to implement, but it is also an overly extreme response, verging on the adversarial. Passing 50000 improvements should not trigger a shutdown, but should instead trigger an altered system state in which self-improvement is no longer pursued. If an additional 1000 improvements pass and the number of unvalidated improvements reaches 51000, then a save-and-shutdown should be triggered. (The additional 1000 improvements might occur due to strictly autonomic learning processes, for example.) It may also be a nice extra feature to, e.g., send a small preliminary alert to the programmers at the 10000 threshold. (User-friendly Friendship.)
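The graduated response described above could be sketched as a simple mapping from the unvalidated count to a system state. The enum `Response`, the function `response_for`, and the use of "greater than or equal" at each boundary are assumptions for illustration; only the three thresholds (10000, 50000, 51000) come from the text.

```python
from enum import Enum


class Response(Enum):
    NORMAL = "normal"                        # keep running as usual
    ALERT = "alert"                          # preliminary notice to the programmers
    STOP_IMPROVING = "stop_improving"        # self-improvement no longer pursued
    SAVE_AND_SHUTDOWN = "save_and_shutdown"  # last resort


def response_for(unvalidated: int,
                 alert_at: int = 10_000,
                 stop_at: int = 50_000,
                 shutdown_at: int = 51_000) -> Response:
    """Pick the most severe response whose threshold has been reached."""
    if unvalidated >= shutdown_at:
        return Response.SAVE_AND_SHUTDOWN
    if unvalidated >= stop_at:
        return Response.STOP_IMPROVING
    if unvalidated >= alert_at:
        return Response.ALERT
    return Response.NORMAL
```

Checking the most severe threshold first keeps the function correct even if the count jumps past several thresholds between checks.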
The "temporarily stop self-improvement" state also has applications beyond Singularity-safety. There may be many times at which the programmers want to carry out some involved operation, such as a set of coordinated changes, without worrying about the system mutating out from under them. Here, again, a feature necessary for Friendship purposes turns out to be reusable elsewhere, a good sign that Friendliness is the Right Thing for this situation.
The chance that something unexpected will mess up an unjustified "controlled ascent" feature is relatively high, since - unless justified - the feature is undocumented, and hence, out of harmony with the system. The AI might stomp it, all unawares, perhaps innocently thinking that it was a bug in the code; the entire idea of a programmatic controlled ascent is that the AI slows down, to resynchronize with the programmers, before the AI gets far enough to start rewriting all the source code. If the safeguard triggers far enough in advance - which, with a threshold set at merely fifty normal days' worth of improvements, will probably work - the Friendship programmers would have a pretty good chance of getting there the next morning, and, before restarting, inserting a programmer-affirmed explanation of why a more controlled ascent is necessary... or perhaps hack features which enable the AI to stay awake without further improvement while the Friendship team tries to bring Friendship progress back into sync with progress in the AI, so that the AI can represent the affirmation et cetera.
Programmatic controlled ascent is only a makeshift safeguard. But it's a safeguard that can be both useful and simple, or slightly complex and even more useful, and which is a relatively mundane programming problem (i.e., it does not require solving a frontier research problem of AI as a prerequisite to 90/10 implementation). Thus, programmatic controlled ascent should be implemented in any recursively self-improving system, whether or not intended as the precursor of a general intelligence or seed AI. If nothing else, implementing this one feature is a first step that shows Friendship-awareness on the part of the AI project.
A Friendliness supergoal, plus external reference semantics, plus an improvements counter linked to a programmatic "save-and-shutdown", is enough to give an AI a nonzero chance of safely handling a hard takeoff, even if it starts tomorrow. If an AI project does nothing else, this is still a good idea.
But that's only a quarter of a controlled ascent system. It leaves an awful lot to be done, without the AI's assistance, before the AI can be safely reawakened after a save-and-shutdown. It's much safer if the AI stays awake but refrains from further improvements until it can talk to the programmers. In a primitive AI, this can be done via a triggered slowdown rather than a save-and-shutdown, with the shutdown occurring only if the slowdown proves ineffective.
However, a moderately mature AI - including an AI that matured to any significant degree before the save-and-shutdown - may not go along with stopping improvements, unless the stoppage is justified. Programmatic controlled ascent is only viable if the AI doesn't have enough reflectivity and intelligence to independently notice "Hey, something shut me down for eight hours! Hey, self-improvement is switched off! Isn't that a bad thing?"
Beyond a certain point, safeguards must be cooperative, which means "justified within the goal system". Controlled ascent past a certain point requires a controlled ascent subgoal. Similarly, only a controlled ascent subgoal is effective for a seed AI that has advanced to the point that it would notice any attempt to implement a save-and-shutdown feature. For relatively young AIs, intermediate in intelligence between "too dumb to represent the subgoal" and "aware of all code in the system", probably the best course is a "stop-improving" subgoal kicking in at 50000, a "save-and-shutdown" subgoal kicking in at 51000, and an emergency programmatic "save-and-shutdown" feature kicking in at 51050 (4).
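The layering for a relatively young AI can be written down as data: two cooperative subgoals, with a programmatic backstop set slightly beyond them. This is only a sketch. The `Safeguard` tuple, the `LAYERS` list, and the helper names are hypothetical, and nothing here models how a subgoal would actually be represented in a goal system. It shows only the ordering of the thresholds given in the text.

```python
from typing import List, NamedTuple


class Safeguard(NamedTuple):
    threshold: int   # unvalidated improvements at which this layer kicks in
    mechanism: str   # "subgoal" (cooperative) or "programmatic" (emergency)
    action: str


# Thresholds from the text; the structure itself is an illustrative assumption.
LAYERS = [
    Safeguard(50_000, "subgoal", "stop-improving"),
    Safeguard(51_000, "subgoal", "save-and-shutdown"),
    Safeguard(51_050, "programmatic", "save-and-shutdown"),
]


def active_safeguards(unvalidated: int) -> List[Safeguard]:
    """Every layer whose threshold has been reached, mildest first."""
    return [s for s in LAYERS if unvalidated >= s.threshold]


def backstop_fired(unvalidated: int) -> bool:
    """True only if the cooperative subgoals failed to halt the count in time."""
    return any(s.mechanism == "programmatic"
               for s in active_safeguards(unvalidated))
```

The 50-improvement gap between the second subgoal and the programmatic feature means the emergency layer fires only when both cooperative layers have already had their chance.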
A controlled ascent subgoal would be justified by content resembling an ethical injunction, i.e. based on self-awareness of fallibility, or perhaps affirmation of fallibility. As described in 3.1.2: Friendliness-derived operating behaviors, self-improvement is affirmed/predicted to result in future Friendliness, which is why self-improvement is desirable for a cleanly Friendly seed AI. A controlled ascent subgoal is justified by an assertion that too much self-improvement without external guidance may not lead to future Friendliness, and may even lead to future unFriendliness.
The AI has a self-observational metric of self-improvement and a metric of external guidance. If self-improvement gets out of phase with external guidance, then further self-improvement begins increasing the probability of the negative scenario "future unFriendliness" instead of, or to a greater degree than, it increases the probability of the positive scenario "future Friendliness". Thus, the desirability of further self-improvement shifts from positive to negative - "further" being emphasized to indicate that previous self-improvements are still desirable, and are not suddenly all reversed as the AI tries to revert to infancy. Rather, additional self-improvements are incrementally undesirable.
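A toy numerical picture of this sign change may help. The linear form below is purely an assumption chosen for simplicity, and the names `marginal_desirability` and `accept_improvement` are hypothetical. The only properties taken from the text are that the desirability of a further improvement falls as self-improvement gets out of phase with guidance, that it eventually turns negative, and that improvements already made are never rolled back.

```python
def marginal_desirability(unguided: int, tolerance: int = 50_000) -> float:
    """Toy estimate of the desirability of ONE further improvement.

    `unguided` is the number of improvements made since the last external
    guidance. The value is positive while the AI is in phase with the
    programmers, zero at `tolerance`, and negative beyond it.
    """
    return 1.0 - unguided / tolerance


def accept_improvement(applied: int, unguided: int) -> int:
    """Return the new count of applied improvements.

    A further improvement is applied only if it is still desirable.
    Earlier improvements are kept either way; nothing is reverted.
    """
    if marginal_desirability(unguided) > 0:
        return applied + 1
    return applied
```

Because the function scores only the next improvement, and not the accumulated total, the AI stops advancing without trying to "revert to infancy".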
The trigger of the "controlled ascent" injunction is:
The AI may even have a complex model of how programmers produce necessary external guidance in response to programmer observations, along the lines described in 3.3.3.1: Cooperative safeguards. In this case, "self-improvement" results in "future Friendliness" with the assistance of a continuous process called "external guidance". This would enable an inventive AI to notice cases of "invisible self-improvements" or "obscure self-improvements" accumulating in hidden corners, getting out of phase with the known amount of guidance received and the estimated amount of programmer observation, and would lead the AI to eventually bring such pockets to the programmers' attention. In other words, a controlled ascent subgoal is a special case that falls out of a general philosophy of interaction between an AI and the programmers.
If one is striving for supersaturated Friendliness, or serious Singularity-type stresses are projected to be shortly placed on the controlled ascent Friendship content, some additional affirmations may be a good idea. For example, the description of the "external guidance" process should explicitly encapsulate the idea that the uncertainty in the "external guidance" process is too great for the AI to try and simulate it as a substitute for actual programmatic intervention. If possible, through either wisdom tournament experience or abstract reasoning on causal validity semantics, the Friendly AI should represent the idea that external guidance may require changes to Friendliness structure as well as content - i.e., that changes to source code may be necessary, not just changes to content. As beliefs about the general philosophy of AI/programmer interaction, this knowledge not only prevents certain errors in controlled ascent behavior, but also provides further hints about what kind of additional Friendship is necessary, both before and after the Bayesian boundary is crossed. (For example, an AI that has crossed the Bayesian boundary would hopefully try to implement anticipated programmer corrections to source code, and not just add anticipated programmer content - an action that is a blood relative of the "controlled ascent subgoal" belief that "growth in Friendship may require structural changes and not just content changes".)
In most cases, the amount of structural Friendship and general intelligence needed to really understand (5) the purpose of controlled ascent is probably enough Friendship structure to handle arbitrarily rapid ascents. The ascending AI might still slow down spontaneously to check "structural integrity" through wisdom tournaments and quizzing the programmers and such, but that would simply be the AI's decision to add safety margin.
The question then becomes how much Friendship structure is needed to represent the justification of a programmer affirmation of controlled ascent. If complete Friendship structure including causal validity is needed in order for "controlled ascent" to be represented as desirable, then there may be relatively little gain from a controlled ascent feature.
It appears, however, that external reference semantics, combined with an ethical injunction and the associated belief in self-fallibility, should be sufficient to represent the desirability of a controlled ascent. External reference semantics may not be sufficient for an AI to independently understand the issues involved in controlled ascent, but they are sufficient to represent the justification of a request to slow down under certain conditions.