Levels of Organization in General Intelligence is a draft of a paper by Eliezer Yudkowsky, to appear in Ben Goertzel and Cassio Pennachin, (eds.) "Artificial General Intelligence". The draft may differ from the final paper.



2.1: An illustration of principles

My chosen entry point into deliberation is words - that is, the words we mentally speak and mentally hear in our internal narrative.  Let us take the word "lightbulb" (or the wordlike phrase "light bulb") as an example [1].  When you see the letters spelling "light bulb", the phonemes for light bulb flow through your auditory cortex.  If a mental task requires it, a visual exemplar for the "light bulb" category may be retrieved as mental imagery in your visual cortex (and associated visual areas).  Some of your past memories and experiences, such as accidentally breaking a light bulb and carefully sweeping up the sharp pieces, may be associated with or stored under the "light bulb" concept.  "Light bulb" is associated with other concepts; in cognitive priming experiments, it has been shown that hearing a phrase such as "light bulb" [2] will prime associated words such as "fluorescent" or "fragile", increasing the recognition speed or reaction speed when associated words are presented [Meyer71].  The "light bulb" concept can act as a mental category; it describes some referents in perceived sensory experiences or internal mental imagery, but not other referents; and, among the referents it describes, it describes some strongly and others only weakly.

To further expose the internal complexity of the "light bulb" concept, I would like to offer an introspective illustration.  I apologize to any readers who possess strong philosophical prejudices against introspection; I emphasize that the exercise is not intended as evidence for a theory, but rather as a means of introducing and grounding concepts that will be argued in more detail later.  That said:

Close your eyes, and try to immediately (without conscious reasoning) visualize a triangular light bulb - now.  Did you do so?  What did you see?  On personally performing this test for the first time, I saw a pyramidal light bulb, with smoothed edges, with a bulb on the square base.  Perhaps you saw a tetrahedral light bulb instead of a pyramidal one, or a light bulb with sharp edges instead of smooth edges, or even a fluorescent tube bent into an equilateral triangle.  The specific result varies; what matters is the process you used to arrive at the mental imagery.

Our mental image for "triangular light bulb" would intuitively appear to be the result of imposing "triangular", the adjectival form of "triangle", on the "light bulb" concept.  That is, the novel mental image of a triangular light bulb is apparently the result of combining the sensory content of two pre-existing concepts.  (DGI [3] does not hold otherwise, but the assumption deserves to be pointed out explicitly.)  Similarly, the combination of the two concepts is not a collision, but a structured imposition; "triangular" is imposed on "light bulb", and not "light-bulb-like" on "triangle".

The structured combination of two concepts is a major cognitive process.  I emphasize that I am not talking about interesting complexity which is supposedly to be found in the overall pattern of relations between concepts; I am talking about complexity which is directly visible in the specific example of imposing "triangular" on "light bulb".  I am not "zooming out" to look at the overall terrain of concepts, but "zooming in" to look at the cognitive processes needed to handle this single case.  The specific example of imposing "triangular" on "light bulb" is a nontrivial feat of mind; "triangular light bulb" is a trickier concept combination than "green light bulb" or "triangular parking lot".

The mental process of visualizing a "triangular light bulb" flashes through the mind very quickly; it may be possible to glimpse subjective flashes of the concept combination, but the process is not really open to human introspection.  For example, when first imposing "triangular" on "light bulb", I would report a brief subjective flash of a conflict arising from trying to impose the planar 2-D shape of "triangular" on the 3-D "light bulb" concept.  However, before this conflict could take place, it would seem necessary that some cognitive process have already selected the shape facet of "triangular" for imposition - as opposed to, say, the color or line width of the "triangle" exemplar that appears when I try to visualize a "triangle" as such.  However, this initial selection of shape as the key facet did not rise to the level of conscious attention.  I can guess at the underlying selection process - in this case, that past experience with the usage had already "cached" shape as the salient facet for the concept triangular, and that the concept was abstracted from an experiential base in which shape, but not color, was the perceived similarity within the group of experiences.  However, I cannot actually introspect on this selection process.

Likewise, I may have glimpsed the existence of a conflict, and that it was a conflict resulting from the 2D nature of "triangular" versus the 3D nature of "light bulb", but how the conflict was detected is not apparent in the subjective glimpse.  And the resolution of the conflict, the transformation of the 2D triangle shape into a 3D pyramid shape, was apparently instantaneous from my introspective vantage point.  Again, I can guess at the underlying process - in this case, that several already-associated conceptual neighbors of "triangle" were imposed on "light bulb" in parallel, and the best fit selected.  But even if this explanation is correct, the process occurred too fast to be visible to direct introspection.  I cannot rule out the possibility that a more complex, more deeply creative process was involved in the transition from triangle to pyramid, although basic constraints on human information-processing (the 200 spike/second speed limit of the underlying neurons) still apply.  Nor can I rule out the possibility that there was a unique serial route from triangle to pyramid.

The creation of an actual visuospatial image of a pyramidal light bulb is, presumably, a complex visual process - one that implies the ability of the visuospatial modality to reverse the usual flow of information and send commands from high-level features to low-level features, instead of detecting high-level features from low-level features.  DGI hypothesizes that visualization occurs through a flow from high-level feature controllers to low-level feature controllers, creating an articulated mental image within a sensory modality through a multistage process that allows the detection of conflicts at higher levels before proceeding to lower levels.  The final mental imagery is introspectively visible, but the process that creates it is mostly opaque.

Some theorists defy introspection to assert that our mental imagery is purely abstract [Pylyshyn81].  Yet there exists evidence from neuroanatomy, functional neuroimaging, pathology of neurological disorders, and cognitive psychology to support the contention that mental imagery is directly represented in sensory modalities [Kosslyn94].  [Finke77] show that mental imagery can create visual afterimages [4] similar to, though weaker than, the afterimages resulting from real visual experience.  [Sherman86] estimate that while the cat has roughly 10^6 fibers from the lateral geniculate nucleus [5] to the visual cortex, there are approximately 10^7 fibers running in the opposite direction.  No explanatory consensus currently exists for the existence of the massive corticothalamic feedback projections, though there are many competing theories; the puzzle is of obvious interest to an AI researcher positing a theory in which inventing novel mental imagery is more computationally intensive than sensory perception.

To return to the "triangular lightbulb" example:  Once the visuospatial image of a pyramidal light bulb was fully articulated, the next introspective glimpse was of a conflict in visualizing a glass pyramid - a pyramid has sharp edges, and sharp glass can cut the user.  This implies the mental imagery had semantic content (knowledge about the material composition of the pyramidal light bulb), imported from the original "light bulb" concept, and well-integrated with the visual representation.  Like most modern-day humans, I know from early parental warnings and later real-life confirmation that sharp glass is dangerous.  Thus the rapid visual detection of sharp glass is important when dealing with real-life sensory experience.  I say this to emphasize that no extended line of intelligent reasoning (which would exceed the 200Hz speed limit of biological neurons) is required to react negatively to a fleeting mental image of sharp glass.  This reaction could reasonably happen in a single perceptual step, so long as the same perceptual system which detects the visual signature of sharp glass in real-world sensory experience also reacts to mental imagery.

The conflict detected was resolved by the imposition of smooth edges on the glass pyramid making up the pyramidal light bulb.  Again, this apparently occurred instantly; again, nontrivial hidden complexity is implied.  To frame the problem in the terms suggested by [Hofstadter85], the imaginative process needed to possess or create a "knob" governing the image's transition from sharp edges to rounded edges, and the possession or creation of this knob is the most interesting part of the process, not the selection of one knob from many.  If the "knob" was created on the fly, it implies a much higher degree of systemic creativity than selecting from among pre-existing options.

Once the final conflict was resolved by the perceptual imposition of smoothed edges, the final mental image took on a stable form.  Again, in this example, all of the mental events appeared introspectively to happen automatically and without conscious decisions on my part; I would estimate that the whole process took less than one second.

In concept combination, a few flashes of the intermediate stages of processing may be visible as introspective glimpses - especially those conflicts that rise to the level of conscious attention before being resolved automatically.  But the extreme rapidity of the process means the glimpses are even more unreliable than ordinary introspection - where introspection is traditionally considered unreliable to begin with.  To some extent, this is the point of the illustration narrated above; almost all of the internal complexity of concepts is hidden away from human introspection, and many theories of AI (even in the modern era) thus attempt to implement concepts on the token level, e.g., "lightbulb" as a raw LISP atom.

This traditional problem is why I have carefully avoided using the word symbol in the exposition above.  In AI, the term "symbol" carries implicit connotations about representation - that the symbol is a naked LISP atom (Prolog variable, etc.) whose supposed meaning derives from its relation to the surrounding atoms in a semantic net; or at most a LISP atom whose content is a "frame-based" LISP structure (that is, whose content is another semantic net).  Even attempts to argue against the design assumptions of Good Old-Fashioned AI (GOFAI) are often phrased in GOFAI's terms; for example, the "symbol grounding problem".  Much discussion of the symbol grounding problem has approached the problem as if the design starts out with symbols and "grounding" is then added.  In some cases this viewpoint has directly translated to AI architectures; e.g., a traditional semantic net is loosely coupled to a connectionist sensorimotor system [Hexmoor93].

DGI belongs to the existing tradition that asks, not "How do we ground our semantic nets?", but rather "What is the underlying stuff making up these rich high-level objects we call 'symbols'?" - an approach presented most beautifully in [Hofstadter79]; see also [Chalmers92].  From this viewpoint, without the right underlying "symbolstuff", there are no symbols; merely LISP tokens carved in mockery of real concepts and brought to unholy life by the naming-makes-it-so fallacy.

Imagine sensory modalities as solid objects with a metaphorical surface composed of the layered feature detectors and their inverse functions as feature controllers.  The metaphorical "symbolstuff" is a pattern that interacts with the feature detectors to test for the presence of complex patterns in sensory data, or inversely, interacts with the feature controllers to produce complex mental imagery.  Symbols combine through the faceted combination of their symbolstuffs, using a process that might be called "holonic conflict resolution", where information flows from high-level feature controllers to low-level feature controllers, and conflicts are detected at each layer as the flow proceeds.  ("Holonic" is a useful word to describe the simultaneous application of reductionism and holism, in which a single quality is simultaneously a combination of parts and a part of a greater whole [Koestler67].  Note that "holonic" does not imply strict hierarchy, only a general flow from high-level to low-level and vice versa.  For example, a single feature detector may make use of the output of lower-level feature detectors, and act in turn as an input to higher-level feature detectors.  The information contained in a mid-level feature is then the holistic sum of many lower-level features, and also an element in the sums produced by higher-level features.  If you pick one vantage point in a holonic structure and "look down" (reductionism) you find parts composing the local whole, with simpler behaviors that contribute to local complexity; if you "look up" (holism) you find a greater whole to which local parts contribute, and more complex processes which local behaviors support.  See also [Hofstadter79].)

I apologize for adding yet another term, "holonic conflict resolution", to a namespace already crowded with terms such as "computational temperature" [Mitchell93], "Prägnanz" [Koffka35], "Hopfield networks" [Hopfield85], "constraint propagation" [Kumar92], and many others.  Holonic conflict resolution is certainly not a wholly new idea, and may even be wholly unoriginal on a feature-by-feature basis, but the combination of features I wish to describe does not exactly match the existing common usage of any of the terms above.  "Holonic conflict resolution" is intended to convey the image of a process that flows serially through the layered, holonic structure of perception, with detected conflicts resolved locally or propagated to the level above, with a final solution that satisfices.  Many of the terms above, in their common usage, refer to an iterated annealing process which seeks a global minimum.  Holonic conflict resolution is intended to be biologically plausible; i.e., to involve a smooth flow of visualization which is computationally tractable for parallel but speed-limited neurons.

Holonic conflict resolution is not proposed as a complete solution to perceptual problems, but rather as the active canvas for the interaction of concepts with mental imagery.  In theoretical terms, holonic conflict resolution is a structural framework within which to posit specific conflict-detection and conflict-resolution methods.  Holonic imagery is the artist's medium within which symbolstuff paints mental pictures such as "triangular light bulb".

A constructive account of concepts and symbolstuff would need to supply:

This is not an exhaustive list of concept functionality; these are just the three most "interesting" challenges [6].  These challenges are interesting because the difficulty of solving them simultaneously seems to be the multiplicative (rather than additive) product of the difficulties of solving them individually.  Other design requirements for a constructive account of concepts would include: association to nearby concepts; supercategories and subcategories; exemplars stored in memory; prototype and typicality effects [Rosch78]; and many others (see, e.g., [Lakoff87]).

The interaction of concepts with modalities, and the interaction of concepts with each other, illustrate what I believe to be several important rules about how to approach AI.

The first principle is that of multiple levels of organization.  The human phenotype is composed of atoms [7], molecules, proteins, cells, tissues, organs, organ systems, and finally the complete body - eight distinguishable layers of organization, each successive layer built above the preceding one, each successive layer incorporating evolved adaptive complexity.  Some useful properties of the higher level may emerge naturally from lower-level behaviors, but not all of them; higher-level properties are also subject to selection pressures on heritable variation and the elaboration of complex functional adaptations.  In postulating multiple levels of organization, I am not positing that the behaviors of all higher layers emerge automatically from the lowest layer.

If I had to pick one single mistake that has been the most debilitating in AI, it would be implementing a process too close to the token level - trying to implement a high-level process without implementing the underlying layers of organization.  Many proverbial AI pathologies result at least partially from omitting lower levels of organization from the design.

Take, for example, that version of the "frame problem" - sometimes also considered a form of the "commonsense problem" - in which intelligent reasoning appears to require knowledge of an infinite number of special cases.  Consider a CPU which adds two 32-bit numbers.  The higher level consists of two integers which are added to produce a third integer.  On a lower level, the computational objects are not regarded as opaque "integers", but as ordered structures of 32 bits.  When the CPU performs an arithmetic operation, two structures of 32 bits collide, under certain rules which govern the local interactions between bits, and the result is a new structure of 32 bits.  Now consider the woes of a research team, with no knowledge of the CPU's underlying implementation, that tries to create an arithmetic "expert system" by encoding a vast semantic network containing the "knowledge" that two and two make four, twenty-one and sixteen make thirty-seven, and so on.  This giant lookup table requires eighteen billion billion entries for completion.
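The contrast between the two levels can be made concrete with a minimal sketch.  The function names below (to_bits, from_bits, add_bits, add32) are illustrative assumptions, not anything from a real CPU or library; the sketch uses a simple ripple-carry scheme, which is only one of several ways hardware can add.  A handful of local rules on bits covers every case that the lookup table would have to enumerate one entry at a time.

```python
def to_bits(n, width=32):
    """Return n as a list of bits, least significant bit first."""
    return [(n >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild the integer from a least-significant-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def add_bits(a_bits, b_bits):
    """Ripple-carry addition: each output bit depends only on two
    input bits and the carry from the position below it."""
    result = []
    carry = 0
    for a, b in zip(a_bits, b_bits):
        result.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    # The final carry is dropped, so the sum wraps modulo 2**width.
    return result


def add32(x, y):
    """Add two 32-bit unsigned integers by colliding their bit structures."""
    return from_bits(add_bits(to_bits(x), to_bits(y)))


# Size of the "expert system" alternative: one stored fact per ordered
# pair of 32-bit operands, roughly eighteen billion billion entries.
LOOKUP_TABLE_ENTRIES = (2 ** 32) ** 2

print(add32(2, 2), add32(21, 16), LOOKUP_TABLE_ENTRIES)
```

The loop body is the whole of the "knowledge" the lower level needs; nothing in it mentions two, four, or thirty-seven.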

In this hypothetical world where the lower-level process of addition is not understood, we can imagine the "common-sense" problem for addition; the launching of distributed Internet projects to "encode all the detailed knowledge necessary for addition"; the frame problem for addition; the philosophies of formal semantics under which the LISP token thirty-seven is meaningful because it refers to thirty-seven objects in the external world; the design principle that the token thirty-seven has no internal complexity and is rather given meaning by its network of relations to other tokens; the "number grounding problem"; the hopeful futurists arguing that past projects to create Artificial Addition failed because of inadequate computing power; and so on.

To some extent this is an unfair analogy.  Even if the thought experiment is basically correct, and the woes described would result from an attempt to capture a high-level description of arithmetic without implementing the underlying lower level, this does not prove the analogous mistake is the source of these woes in the real field of AI.  And to some extent the above description is unfair even as a thought experiment; an arithmetical expert system would not be as bankrupt as semantic nets.  The regularities in an "expert system for arithmetic" would be real, noticeable by simple and computationally feasible means, and could be used to deduce that arithmetic was the underlying process being represented, even by a Martian reading the program code with no hint as to the intended purpose of the system.  The gap between the higher level and the lower level is not absolute and uncrossable, as it is in semantic nets.

An arithmetic expert system that leaves out one level of organization may be recoverable.  Semantic nets leave out multiple levels of organization.  Omitting all the experiential and sensory grounding of human symbols leaves no raw material to work with.  If all the LISP tokens in a semantic net were given random new names, there would be no way to deduce whether G0025 formerly meant hamburger or chair.  [Harnad90] describes the symbol grounding problem arising out of semantic nets as similar to learning Chinese as a first language using only a Chinese-to-Chinese dictionary.
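A toy illustration of the renaming argument, offered as a sketch under stated assumptions rather than as a model of any real semantic-net system: the four-triple net, the relation names, and the helper functions rename_tokens and swap are all invented for this example.  The net is deliberately symmetric, so that "hamburger" and "chair" occupy the same structural position; once the tokens are renamed, nothing in the structure says which is which.

```python
import random


def rename_tokens(triples, seed=0):
    """Replace every token (nodes and relation labels alike) with an
    arbitrary G-number, returning the renamed net and the mapping."""
    rng = random.Random(seed)
    tokens = sorted({token for triple in triples for token in triple})
    labels = [f"G{i:04d}" for i in range(len(tokens))]
    rng.shuffle(labels)
    mapping = dict(zip(tokens, labels))
    renamed = {tuple(mapping[t] for t in triple) for triple in triples}
    return renamed, mapping


def swap(triples, a, b):
    """Exchange tokens a and b everywhere in the net."""
    exchange = {a: b, b: a}
    return {tuple(exchange.get(t, t) for t in triple) for triple in triples}


# Hypothetical toy net: both concepts have exactly the same relations.
NET = {
    ("hamburger", "is-a", "thing"),
    ("chair", "is-a", "thing"),
    ("hamburger", "has", "parts"),
    ("chair", "has", "parts"),
}

renamed, mapping = rename_tokens(NET)

# Swapping the two renamed tokens yields the identical net, so the
# renamed structure alone cannot say which token meant "hamburger".
unchanged = swap(renamed, mapping["hamburger"], mapping["chair"]) == renamed
print(unchanged)
```

A larger net would give the two tokens different neighborhoods, but those neighbors would themselves be G-numbers; the sketch shows only the simplest case in which the ambiguity is exact.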

I believe that many (though not all) cases of the "commonsense problem" or "frame problem" arise from trying to store all possible descriptions of high-level behaviors that, in the human mind, are modeled by visualizing the lower level of organization from which those behaviors emerge.  For example, [Lakoff99] give a sample list of "built-in inferences" emerging from what they identify as the Source-Path-Goal metaphor:

A general intelligence with a visual modality has no need to explicitly store an infinite number of such statements in a theorem-proving production system.  The above statements can be perceived on the fly by inspecting depictive mental imagery.  Rather than storing knowledge about trajectories, a visual modality actually simulates the behavior of trajectories.  A visual modality uses low-level elements, metaphorical "pixels" and their holonic feature structure, whose behaviors locally correspond to the real-world behaviors of the referent.  There is a mapping from representation to referent, but it is a mapping on a lower level of organization than traditional semantic nets attempt to capture.  The correspondence happens on the level where 13 is the structure 00001101, not on the level where it is the number thirteen.

I occasionally encounter some confusion about the difference between a visual modality and a microtheory of vision.  Admittedly, microtheories in theorem-proving systems are well known in AI, so some confusion is understandable.  But layered feature extraction in the visual modality - which is an established fact of neuroscience - is also very well known even in the pure computer science tradition of AI, and has been well-known ever since David Marr's tremendously influential 1982 book Vision [Marr82] and earlier papers.  To make the difference explicit, the human visual cortex "knows" about edge detection, shading, textures of curved surfaces, binocular disparities, color constancy under natural lighting, motion relative to the plane of fixation, and so on.  The visual cortex does not know about butterflies.  In fact, a visual cortex "knows" nothing; a sensory modality contains behaviors which correspond to environmental invariants, not knowledge about environmental regularities.

This illustrates the second-worst error in AI, the failure to distinguish between things that can be hardwired and things that must be learned.  We are not preprogrammed to know about butterflies.  Evolution wired us with visual circuitry that makes sense of the sensory image of the butterfly, and with object-recognition systems that form visual categories.  When we see a butterfly, we are then able to recognize future butterflies as belonging to the same kind.  Sometimes evolution bypasses this system to give us visual instincts, but this constitutes a tiny fraction of visual knowledge.  A modern human recognizes a vast number of visual categories with no analogues in the ancestral environment.

What problems result from failing to distinguish between things that can be hardwired and things that must be learned?  "Hardwiring what should be learned" is so universally combined with "collapsing the levels of organization" that it is difficult to sort out the resulting pathologies.  An expert systems engineer, in addition to acting on the assumption that knowledge of butterflies can be preprogrammed, is also likely to act on the assumption that knowledge about butterflies consists of a butterfly LISP token which derives meaning from relations to other LISP tokens - rather than butterfly being a stored pattern that interacts with the visual modality and recognizes a butterfly.  A semantic net not only lacks richness, it lacks the capacity to represent richness.  Thus, I would attribute the symbol grounding problem to "collapsing the levels of organization", rather than "hardwiring what should be learned".

But even if a programmer who understood the levels of organization tried to create butterfly-recognizing symbolstuff by hand, I would still expect the resulting butterfly pattern to lack the richness of the learned butterfly pattern in a human mind.  When the human visual system creates a butterfly visual category, it does not write an opaque, procedural butterfly-recognition codelet using abstract knowledge about butterflies and then tag the codelet onto a butterfly frame.  Human visual categorization abstracts the butterfly category from a store of visual experiences of butterflies.

Furthermore, visual categorization - the general concept-formation process, not just the temporal visual processing stream - leaves behind an association between the butterfly concept and the stored memories from which "butterfly" was abstracted; it associates one or more exemplars with the butterfly category; it associates the butterfly category through overlapping territory to other visual categories such as fluttering; it creates butterfly symbolstuff that can combine with other symbolstuffs to produce mental imagery of a blue butterfly; and so on.  To the extent that a human lacks the patience to do these things, or to the extent that a human does them in fragile and hand-coded ways rather than using robust abstraction from a messy experiential base, lack of richness will result.  Even if an AI needs programmer-created concepts to bootstrap further concept formation, bootstrap concepts should be created using programmer-directed tool versions of the corresponding AI subsystems, and the bootstrap concepts should be replaced with AI-formed concepts as early as possible.

Two other potential problems emerging from the use of programmer-created content are opacity and isolation.

Opacity refers to the potential inability of an AI's subsystems to modify content that originated outside the AI.  If a programmer is creating cognitive content, it should at least be the kind of content that the AI could have created on its own; it should be content in a form that the AI's cognitive subsystems can manipulate.  The best way to ensure that the AI can modify and use internal content is to have the AI create the content.  If an AI's cognitive subsystems are powerful enough to create content independently, then hopefully those same subsystems will be capable of adding to that content, manipulating it, bending it in response to pressures exerted by a problem, and so on.  What the AI creates, the AI can use and improve.  Whatever the AI accomplishes on its own is a part of the AI's mind; the AI "owns" it and is not simply borrowing it from the programmers.  This is a principle that extends far beyond abstracting concepts!

Isolation means that if a concept, or a piece of knowledge, is handed to the AI on a silver platter, the AI may be isolated from the things that the AI would have needed to learn first in order to acquire that knowledge naturally, in the course of building up successive layers of understanding to handle problems of increasing complexity.  The concept may also be isolated from similar concepts and related concepts that the AI would otherwise have learned at around the same time, denying the AI useful associations and slippages.  Conceivably programmers could try to second-guess isolation by hardwiring many similar "knowledges", but this is no substitute for a natural ecology of cognition.


