Levels of Organization in General Intelligence is a draft of a paper by Eliezer Yudkowsky, to appear in Ben Goertzel and Cassio Pennachin, (eds.) "Artificial General Intelligence". The draft may differ from the final paper.



2.4: The modality level

2.4.1: The evolutionary design of modalities in humans

Most students of AI are familiar with the high-level computational processes of at least one human sensory modality, vision, at least to the extent of being acquainted with David Marr's "2 1/2-D sketch" and the concept of layered feature extraction [Marr82].  Further investigations in computational neuroscience have both confirmed Marr's theory and rendered it enormously more complex.  Although many writers, including myself, have been known to use the phrase "visual cortex" when talking about the entire visual modality, this is like talking about the United States by referring to New York.  About 50% of the neocortex of nonhuman primates is devoted exclusively to visual processing, with over 30 distinct visual areas identified in the macaque monkey [Felleman91].

The major visual stream is the retinal-geniculate-cortical stream, which goes from the retina to the lateral geniculate nucleus to the striate cortex [footnote 1] to the higher visual areas.  Beyond the visual cortex, processing splits into two major secondary streams; the ventral stream heading toward the temporal lobe for object recognition, and the dorsal stream heading toward the parietal lobe for spatial processing.  The visual stream begins in the retina, which contains around 100 million rods and 5 million cones, but feeds into an optic nerve containing only around 1 million axons.  Visual preprocessing begins in the first layer of the retina, which converts the raw intensities into center-surround gradients, a representation that forms the basis of all further visual processing.  After several further layers of retinal processing, the final retinal layer is composed of a wide variety of ganglion cell types that include directionally selective motion detectors, slow-moving edge detectors, fast movement detectors, uniformity detectors, and subtractive color channels.  The axons of these ganglion cells form the optic nerve and project to the magnocellular, parvocellular, and koniocellular layers of the lateral geniculate nucleus; currently it appears that each class of ganglion cell projects to only one of these layers.  It is widely assumed that further feature detection takes place in the lateral geniculate nucleus, but the specifics are not currently clear.  From the lateral geniculate nucleus, the visual information stream continues to area V1, the primary visual cortex, which begins feature extraction for information about motion, orientation, color and depth.  From primary visual cortex the information stream continues, making its way to the higher visual areas, V2 through V6.  Beyond the visual cortex, the information stream continues to temporal areas (object recognition) and parietal areas (spatial processing).
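As a rough illustration of what converting raw intensities into center-surround gradients means, here is a minimal sketch in Python.  The one-dimensional signal, the two-neighbor surround, and the name `center_surround` are assumptions made for illustration; this is not a model of actual retinal circuitry.

```python
def center_surround(intensities):
    """Return each sample minus the mean of its two neighbors (edges are clamped)."""
    n = len(intensities)
    out = []
    for i in range(n):
        left = intensities[max(i - 1, 0)]
        right = intensities[min(i + 1, n - 1)]
        out.append(intensities[i] - (left + right) / 2)
    return out

uniform = [5, 5, 5, 5, 5, 5]
edge = [1, 1, 1, 9, 9, 9]

print(center_surround(uniform))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(center_surround(edge))     # [0.0, 0.0, -4.0, 4.0, 0.0, 0.0]
```

A uniform field produces no response at all, while a step in intensity produces a response only at the boundary; the representation carries contrast rather than absolute intensity.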

As mentioned earlier, primary visual cortex sends massive corticothalamic feedback projections to the lateral geniculate nucleus [Sherman86].  Corticocortical connections are also typically accompanied by feedback projections of equal strength [Felleman91].  There is currently no standard explanation for these feedback connections.  DGI [footnote 2] requires sensory modalities with feature controllers that are the inverse complements of the feature detectors; this fits with the existence of the feedback projections.  However, it should be noted that this assertion is not part of contemporary neuroscience.  The existence of feature controllers is allowed for, but not asserted, by current theory; their existence is asserted, and required, by DGI.  (The hypothesis that feedback projections play a role in mental imagery is not limited to DGI; for example, [Kosslyn94] cites the existence of corticocortical feedback projections as providing an underlying mechanism for higher-level cognitive functions to control depictive mental imagery.)

The general lesson learned from the human visual modality is that modalities are not microtheories, that modalities are not flat representations of the pixel level, and that modalities are functionally characterized by successive layers of successively more elaborate feature structure.  Modalities are one of the best exhibitions of this evolutionary design pattern - ascending layers of adaptive complexity - which also appears, albeit in very different form, in the ascending code-modality-concept-thought-deliberation model of the human mind.  Each ascending layer is more elaborate, more complex, more flexible, and more computationally expensive.  Each layer requires the complexity of the layer underneath - both functionally within a single organism, and evolutionarily within a genetic population.

The concept layer is evolvable in a series of short steps if, and only if, there already exists substantial complexity within the modality layer.  The same design pattern - ascending layers of adaptive complexity - also appears within an evolved sensory modality.  The first features detected are simple, and can evolve in a single step or a small series of adaptive short steps.  The ability to detect these first features can be adaptive even in the absence of a complete sensory modality.  The eye, which is currently believed to have independently evolved in many different species, may have begun, each time, as a single light-sensitive spot on the organism's skin.

In modalities, each additional layer of feature detectors makes use of the information provided by the first layer of feature detectors.  In the absence of the first layer of feature detectors, the "code" for the second layer of feature detectors would be too complex to evolve in one chunk.  With the first layer of feature detectors already present, feature detectors in the second layer can evolve in a single step, or in a short series of locally adaptive steps.  The successive layers of organization in a sensory modality are a beautiful illustration of evolution's design signature, the functional ontogeny of the information recapitulating the evolutionary phylogeny.

Evolution is a good teacher but a poor role model; is this design a bug or a feature?  I would argue that it is generally a feature.  There is a deep correspondence between evolutionarily smooth fitness landscapes and computationally smooth fitness landscapes.  There is a deep correspondence between each successive layer of feature detectors being evolvable, and each successive layer of feature detectors being computable in a way that is "smooth" rather than "fragile", as described in the earlier discussion of the code layer.  Smooth computations are more evolvable, so evolution, in constructing a system incrementally, tends to construct linear sequences or ascending layers of smooth operations.

An AI designer may conceivably discard the requirement that each ascending layer of feature detection be incrementally useful/adaptive - although this may make the subsystem harder to incrementally develop and test!  It is cognitively important, however, that successive layers of feature detectors be computationally "smooth" in one specific sense.  DGI concepts interact with inverse feature detectors, "feature controllers", in order to construct mental imagery.  For the task of imposing a concept and the still more difficult task of abstracting a concept to be simultaneously tractable, it is necessary that sensory modalities be a continuum of locally smooth layers, rather than consisting of enormous, intractable, opaque chunks.  There is a deep correspondence between the smooth design that renders concepts tractable and the smooth architecture emergent from incremental evolution.

The feature controllers used to create mental imagery are evolvable and preadaptive in the absence of mental imagery; feature controllers could begin as top-down constraints in perceptual processing, or even more simply as a perceptual step which happens to be best computed by a recurrent network.  In both cases, the easiest (most evolvable) architecture is generally one in which the feedback connection reciprocates the feedforward connection.  Thus, the feature controller layers are not a separate system independent from the feature detector layers; rather, I expect that what is locally a feature detector is also locally a feature controller.  Again, this smooth reversibility helps render it possible to learn a single concept which can act as a category detector or a category imposer.  It is the simultaneous solution of concept imposition, concept satisfaction, concept faceting, and concept abstraction that requires reversible features - feature controllers which are the local inverses of the feature detectors.  I doubt that feature controllers reach all the way down to the first layers of the retina (I have not heard of any feedback connections reaching this far), but direct evidence from neuroimaging shows that mental imagery activates primary visual cortex [Kosslyn93]; I am not sure whether analogous tests have been performed for the lateral geniculate nucleus, but the feedback connections are there.
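One way to picture a feature detector that is also, locally, a feature controller is a transform with an exact inverse.  The sketch below uses an average/difference pair purely as an assumed stand-in; the names `detect` and `impose` are hypothetical, and nothing here is claimed to be the mechanism DGI requires or the one the brain uses.

```python
def detect(pair):
    """Feature detector: raw pair of intensities -> (average, half-difference)."""
    a, b = pair
    return ((a + b) / 2, (a - b) / 2)

def impose(features):
    """Feature controller: (average, half-difference) -> raw pair; inverts detect."""
    avg, diff = features
    return (avg + diff, avg - diff)

raw = (6.0, 2.0)
features = detect(raw)
print(features)                # (4.0, 2.0)
print(impose(features))        # (6.0, 2.0), the original input
print(impose((4.0, 0.0)))      # (4.0, 4.0), same brightness with the contrast removed
```

Because the two directions share one representation, a higher level that can read the features can also write them, which is the property the text asks of reversible features.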

2.4.2: The human design of modalities in AI

An AI needs sensory modalities - but which modalities?  How do those modalities contribute materially to general intelligence outside the immediate modality?

Does an AI need a visuospatial system modeled after the grand complexity of the visuospatial system in primates and humans?  We know more about the human visual modality than about any other aspect of human neurology, but that doesn't mean we know enough to build a visual modality from scratch.  Furthermore, the human visual modality is enormously complex, computationally intensive, and fitted to an environment which an AI does not necessarily have an immediate need to comprehend.  Should humanlike 3D vision [footnote 3] be one of the first modalities attempted?

I believe it will prove best to discard the human modalities or to use them as inspiration only - to use a completely different set of sensory modalities during the AI's early stages.  An AI occupies a different environment than a human and direct imitation of human modalities would not be appropriate.  For an AI's initial learning experiences, I would advocate placing the AI in complex virtual environments, where the virtual environments are internal to the computer but external to the AI.  The programmers would then attempt to develop sensory modalities corresponding to the virtual environments.  Henceforth I may use the term "microenvironment" to indicate a complex virtual environment.  The term "microworld" is less unwieldy, but should not be taken as having the Good Old-Fashioned AI connotation of "microworlds" in which all features are directly represented by predicate logic, e.g., SHRDLU's simplified world of blocks and tables [Winograd72].

Abandoning the human modalities appears to introduce an additional fragile dependency on the correctness of the AI theory, in that substituting novel sensory modalities for the human ones would appear to require a correct understanding of the nature of sensory modalities and how they contribute to intelligence.  This is true, but I would argue that the existence of an additional dependency is illusory.  An attempt to blindly imitate the human visual modality, without understanding the role of modalities in intelligence, would be unlikely to contribute to general intelligence except by accident.  Our modern understanding of the human visual modality is not so perfect that we could rely on the functional completeness of a neurologically inspired design; for example, a design based only on consensus contemporary theory might omit feature controllers!  However, shifting to microworlds does require that experience in the microworlds reproduce functionally relevant aspects of experience in real life, including unpredictability, uncertainty, real-time process control, holonic (part-whole) organization, et cetera.  I do not believe that this introduces an additional dependency on theoretic understanding, over and above the theoretic understanding that would be required to build an AI that absorbed complexity from these aspects of real-world environments, but it nonetheless represents a strong dependency on theoretic understanding.

Suppose that we are designing, de novo, a sensory modality and virtual environment.  Three possible modalities that come to mind as reasonable for a very primitive and early-stage AI, in ascending order of implementational difficulty, would be:

  1. A modality for Newtonian billiard balls;
  2. A modality for a 100x100 "Go" board;
  3. A modality for some type of interpreted code (a metaphorical "codic cortex").

In human vision, the very first visual neurons are the "rods and cones" which transduce impinging environmental photons to a neural representation as sensory information.  For each of the three modalities above, the "rods and cones" level would probably use essentially the same representation as the data structures used to create the microworld, or virtual environment, in which the AI is embodied.  This is a major departure from the design of naturally evolved modalities, in which the basic level - the quark level, as far as we know - is many layers removed from the high-level objects that give rise to the indirect information that reaches the senses.  Evolved sensory modalities devote most of their complexity to reconstructing the world that gives rise to the incoming sensory impressions - to reconstructing the 3D moving objects that give rise to the photons impinging on the rods-and-cones layer of the retina.  Of course, choosing vision as an example is arguably a biased selection; sound is not as complex as vision, and smell and taste are not as complex as sound.  Nonetheless, eliminating the uncertainty and intervening layers between the true environment and the organism's sensory data is a major step.  It should significantly reduce the challenges of early AI development, but is a dangerous step nonetheless because of its distance from the biological paradigm and its elimination of a significant complexity source.
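To make the "same representation" point concrete, here is a minimal sketch of a billiard-ball microworld in Python.  The `Ball` record, the collision-free `step` function, and `sense` are all hypothetical names chosen for illustration; a real microworld would need collisions, walls, and much more.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Ball:
    x: float
    y: float
    vx: float
    vy: float

def step(balls, dt):
    """Advance every ball along its velocity (no collisions in this sketch)."""
    return [replace(b, x=b.x + b.vx * dt, y=b.y + b.vy * dt) for b in balls]

def sense(balls):
    """The 'rods and cones' level: the world's own records, handed over unchanged."""
    return tuple(balls)

world = [Ball(0.0, 0.0, 1.0, 0.5), Ball(2.0, 1.0, -1.0, 0.0)]
world = step(world, dt=2.0)
percept = sense(world)
print(percept[0])  # Ball(x=2.0, y=1.0, vx=1.0, vy=0.5)
```

Nothing has to be reconstructed: the percept contains exact positions and velocities, so any complexity in the modality must come from the feature structure built on top of it.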

I recommend eliminating environmental reconstruction as a complexity source in early AI development.  Visualizing the prospect of deliberately degrading the quality of the AI's environmental information on one end, and elaborating the AI's sensory modality on the other end, I find it likely that the entire operation will cancel out, contributing nothing.  An AI that had to learn to reconstruct the environment, in the same way that evolution learned to construct sensory modalities, might produce interesting complexity as a result; but if the same programmer is creating environmental complexity and modality complexity, I would expect the two operations to cancel out.  While environmental reconstruction is a nontrivial complexity source within the human brain, I consider the ratio between the difficulty of programmer development of the complexity, and the contribution of that complexity to general intelligence, to be relatively small.  Adding complexity for environmental reconstruction, by introducing additional layers of complexity in the microworld and deliberately introducing information losses between the topmost layer of the microworld and the AI's sensory receptors, and then attempting to create an AI modality which could reconstruct the original microworld content from the final sensory signal, would require a relatively great investment of effort in return for what I suspect would be a relatively small boost to general intelligence.

Suppose that for each of the three modalities - billiards, Go, code - the "pre-retinal" level consists of true and accurate information about the quark level of the virtual microworld, although perhaps not complete information, and that the essential complexity which renders the model a "sensory modality" rests in the feature structure, the ascending layers of feature detectors and descending layers of feature controllers.  Which features, then, are appropriate?  And how do they contribute materially to general intelligence?

The usual statement is that the complexity in a sensory modality reflects regularities of the environment, but I wish to offer a slightly different viewpoint.  To illustrate this view, I must borrow and severely simplify the punchline of a truly elegant paper, "The Perceptual Organization of Colors" by Roger Shepard [Shepard92].  Among other questions, this paper seeks to answer the question of trichromacy:  Why are there three kinds of cones in the human retina, and not two, or four?  Why is human visual perception organized into a three-dimensional color space?  Historically, it was often theorized that trichromacy represented an arbitrary compromise between chromatic resolution and spatial resolution; that is, between the number of colors perceived and the grain size of visual resolution.  As it turns out, there is a more fundamental reason why three color channels are needed.

To clarify the question, consider that surfaces possess a potentially infinite number of spectral reflectance distributions.  We will focus on spectral reflectance distributions, rather than spectral power distributions, because adaptively relevant objects that emit their own light are environmentally rare.  Hence the physically constant property of most objects is the spectral reflectance distribution, which combines with the spectral power distribution of light impinging on the object to give rise to the spectral power distribution received by the human eye.  The spectral reflectance distribution is defined over the wavelengths from 400nm to 700nm (the visible range), and since wavelength is a continuum, the spectral reflectance distribution can theoretically require an unlimited number of quantities to specify.  Hence, it is not possible to exactly constrain a spectral reflectance distribution using only three quantities, which is the amount of information transduced by human cones.

The human eye is not capable of discriminating among all physically possible reflecting surfaces.  However, it is possible that for "natural" surfaces - surfaces of the kind commonly encountered in the ancestral environment - reflectance for each pure frequency does not vary independently of reflectance for all other frequencies.  For example, there might exist some set of basis reflectance functions, such that the reflectance distributions of almost all natural surfaces could be expressed as a weighted sum of the basis functions.  If so, one possible explanation for the trichromacy of human vision would be that three color channels are just enough to perform adequate discrimination in a "natural" color space of limited dimensionality.
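In symbols, the relationships described in the last two paragraphs can be written as follows.  The notation is assumed here for illustration and is not taken from [Shepard92]: $L$ is the spectral power distribution of the illuminant, $R$ the spectral reflectance distribution of the surface, $E$ the light reaching the eye, $s_i$ the sensitivity of the $i$-th cone type, $c_i$ the quantity that cone type transduces, and $B_k$, $w_k$ the hypothesized basis reflectance functions and their weights.

```latex
\begin{align}
E(\lambda) &= L(\lambda)\,R(\lambda), \qquad 400\,\mathrm{nm} \le \lambda \le 700\,\mathrm{nm} \\
c_i &= \int_{400}^{700} s_i(\lambda)\,E(\lambda)\,d\lambda, \qquad i = 1, 2, 3 \\
R(\lambda) &\approx \sum_{k=1}^{K} w_k\,B_k(\lambda)
\end{align}
```

Three numbers $c_1, c_2, c_3$ cannot pin down an arbitrary function $R$, but if the last line holds for small $K$, they might suffice to distinguish natural surfaces from one another.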

The ability to discriminate between all natural surfaces would be the design recommended by the "environmental regularity" philosophy of sensory modalities.  The dimensionality of the internal model would mirror the dimensionality of the environment.

As it turns out, natural surfaces have spectral reflectance distributions that vary along roughly five to seven dimensions [Maloney86].  There thus exist natural surfaces that, although appearing to trichromatic viewers as "the same color", nonetheless possess different spectral reflectance distributions.

[Shepard92] instead asks how many color channels are needed to ensure that the color we perceive is the same color each time the surface is viewed under different lighting conditions.  The amount of ambient light can also potentially vary along an unlimited number of dimensions, and the actual light reaching the eye is the product of the spectral power distribution and the spectral reflectance distribution.  A reddish object in bluish light may reflect the same number of photons of each wavelength as a bluish object in reddish light.  Similarly, a white object in reddish light may reflect mostly red photons, while the same white object in bluish light may reflect mostly blue photons.  And yet the human visual system manages to maintain the property of color constancy; the same object will appear to be the same color under different lighting conditions.
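The ambiguity described above can be shown with a small numeric sketch in Python.  Collapsing the spectrum into three bands, the specific numbers, and the assumption that the illuminant is known when it is discounted are all simplifications made for illustration; the visual system has to estimate the illuminant, which this sketch does not attempt.

```python
def received(light, reflectance):
    """Light reaching the eye: illuminant power times surface reflectance, per band."""
    return [l * r for l, r in zip(light, reflectance)]

def discount_illuminant(signal, light):
    """Recover reflectance from the signal, assuming the illuminant is known."""
    return [s / l for s, l in zip(signal, light)]

# Bands are (short, medium, long) wavelengths; all values are made up.
reddish_light, bluish_light = [0.25, 0.5, 1.0], [1.0, 0.5, 0.25]
reddish_surface, bluish_surface = [0.25, 0.5, 1.0], [1.0, 0.5, 0.25]
white_surface = [1.0, 1.0, 1.0]

# A reddish surface in bluish light sends the same signal as a bluish surface in reddish light.
ambiguous = received(bluish_light, reddish_surface) == received(reddish_light, bluish_surface)
# The same white surface sends different signals under different lights.
white_shifts = received(reddish_light, white_surface) != received(bluish_light, white_surface)
# Dividing out the illuminant recovers the constant property of the surface.
recovered = discount_illuminant(received(reddish_light, white_surface), reddish_light)

print(ambiguous, white_shifts, recovered)  # True True [1.0, 1.0, 1.0]
```

The raw signal alone identifies neither the surface nor the light; the surface property becomes stable across viewings only once the illuminant is compensated for.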

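[Judd64] measured 622 spectral power distributions for natural lighting, under 622 widely varying natural conditions of weather and times of day, and found that variations in natural lighting reduce to three degrees of freedom.  Furthermore, these three degrees of freedom bear a close correspondence to the three dimensions of color opponency that were proposed for the human visual system based on experimental examination [Hurvich57].  The three degrees of freedom are:

  1. Light-dark variation, which depends on the total amount of light reaching the object;
  2. Yellow-blue variation, which depends on whether a surface is lit by direct sunlight or lies in shade, lit only by blue skylight;
  3. Red-green variation, which depends on the elevation of the sun and on atmospheric water vapor.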

The three color channels of the human visual system are precisely the number of channels needed in order to maintain color constancy under natural lighting conditions [footnote 4].  Three color channels are not enough to discriminate between all natural surface reflectances, but three color channels are the exact number required to compensate for ambient natural lighting and thereby ensure that the same surface is perceptually the "same color" on any two occasions.  This simplifies the adaptively important task of recognizing a previously experienced object on future encounters.

The lesson I would learn from this tale of color constancy is that sensory modalities are about invariants and not just regularities.  Consider the task of designing a sensory modality for some form of interpreted code.  (This is a very challenging task because human programming languages tend toward non-smooth fitness landscapes, as previously discussed.)  When considering which features to extract, the question I would ask is not "What regularities are found in code?" but rather "What feature structure is needed for the AI to perceive two identical algorithms with slightly different implementations as 'the same piece of code'?"  Or more concretely:  "What features does this modality need to extract to perceive the recursive algorithm for the Fibonacci sequence and the iterative algorithm for the Fibonacci sequence as 'the same piece of code'?"
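For concreteness, here are the two Fibonacci algorithms in Python, together with a crude check of how different they look at the byte level.  Comparing outputs over a finite range is only a behavioral test, not the perceptual feature structure the question asks for, and the `__code__.co_code` attribute used here is specific to the CPython implementation.

```python
def fib_recursive(n):
    """Recursive definition: F(n) = F(n-1) + F(n-2), with F(0) = 0 and F(1) = 1."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_iterative(n):
    """Iterative definition: slide a two-number window forward n times."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Same input-output behavior on every value tried...
same_behavior = all(fib_recursive(n) == fib_iterative(n) for n in range(20))
# ...but the compiled bytecode of the two functions is not the same byte string.
same_bytes = fib_recursive.__code__.co_code == fib_iterative.__code__.co_code

print(same_behavior, same_bytes)  # True False
```

A byte-by-byte comparison sees two unrelated objects; the open design question is which intermediate features would let a codic modality see one algorithm in two "orientations".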

Tip your head slightly to the left, then slightly to the right.  Every retinal receptor may receive a different signal, but the experienced visual field remains almost exactly the "same".  Hold up a chess pawn, and tip it slightly to the left or slightly to the right.  Despite the changes in retinal reception, we see the "same" pawn with a slightly different orientation.  Could a sensory modality for code look at two sets of interpreted bytecodes (or other program listing), completely different on a byte-by-byte basis, and see these two listings as the "same" algorithm in two slightly different "orientations"?

The modality level of organization, like the code level, has a characteristic kind of work that it performs.  Formulating a butterfly concept and seeing two butterflies as members of the same category is the work of the concept level, but seeing a chess pawn in two orientations as the same pawn is the work of the modality level.  There is overlap between the modality level and the concept level, just as there is overlap between the code level and the modality level.  But on the whole, the modality level is about invariants rather than regularities and identities rather than categories.

Similarly, the understanding conferred by the modality level should not be confused with the analytic understanding characteristic of thoughts and deliberation.  Returning to the example of a codic modality, one possible indication of a serious design error would be constructing a modality that could analyze any possible piece of code equally well.  The very first layer of the retina - rods and cones - is the only part of the human visual system that will work on all possible pixel fields.  The rest of the visual system will only work for the low-entropy pixel fields experienced by a low-entropy organism in a low-entropy environment.  The very next layer, after rods and cones, already relies on center-surround organization being a useful way to compress visual information; this only holds true in a low-entropy visual environment.

Designing a modality that worked equally well for any possible computer program would probably be an indication that the modality was extracting the wrong kind of information.  Thus, one should be wary of an alleged "feature structure" that looks as if it would work equally well for all possible pieces of code.  It may be a valid analytical method but it probably belongs on the deliberation level, not the modality level.  (Admittedly not every local step of a modality must be dependent on low-entropy input; some local stages of processing may have the mathematical nature of a lossless transform that works equally well on any possible input.  Also, hardware is probably better suited than wetware to lossless transforms.)
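The dependence on low-entropy input can be illustrated with a small Python sketch.  Here a neighbor-difference encoding stands in, very loosely, for center-surround processing, and `zlib` compression stands in for "a useful way to compress"; both substitutions, the synthetic signals, and the function name are assumptions made for illustration.

```python
import random
import zlib

def neighbor_differences(samples):
    """Encode each byte as its difference from the previous byte (mod 256)."""
    prev = 0
    out = bytearray()
    for s in samples:
        out.append((s - prev) % 256)
        prev = s
    return bytes(out)

rng = random.Random(0)
smooth = bytes((i // 8) % 256 for i in range(4096))      # slowly varying field
noise = bytes(rng.randrange(256) for _ in range(4096))   # high-entropy field

smooth_size = len(zlib.compress(neighbor_differences(smooth)))
noise_size = len(zlib.compress(neighbor_differences(noise)))
print(smooth_size, noise_size)
```

The difference encoding makes the smooth field highly compressible and does nothing useful for the noise field, so a stage like this earns its keep only on the kind of input it was built for.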

The human brain is constrained by a characteristic serial speed of 200 sequential steps per second, and by the ubiquitous internal use of the synchronous arrival of associated information, to arrange processing stages that flow smoothly forward.  High-level "if-then" or "switch-case" logic is harder to arrive at neurally, and extended complex "if-then" or "switch-case" logic is probably almost impossible unless implemented through branching parallel circuitry that remains synchronized.  Probably an exceptional condition must be ignored, averaged out, or otherwise handled using the same algorithms that would apply to any other modality content. Can an AI modality use an architecture that applies different algorithms to different pieces of modality content?  Can an AI modality handle exceptional conditions through special-case code?  I would advise caution, for several reasons.  First, major "if-then" branches are characteristic of deliberative processes, and being tempted to use such a branch may indicate a level confusion.  Second, making exceptions to the smooth flow of processing will probably complicate the meshing of concepts and modalities.  Third, modalities are imperfect but fault-tolerant processes, and the fault tolerance plays a role in smoothing out the fitness landscapes and letting the higher levels of organization be built on top; thus, trying to handle all the data by detecting exceptional conditions and correcting them, a standard pattern in human programming, may indicate that the modality is insufficiently fault-tolerant.  Fourth, handling all exceptions is characteristic of trying to handle all inputs and not just low-entropy inputs.  Hence, on the whole, sensory modalities are characterized by the smooth flow of information through ascending layers of feature detectors.  Of course, detecting an exceptional condition as a feature may turn out to be entirely appropriate!

Another issue which may arise in artificial sensory modalities is that unsophisticated artificial modalities may turn out to be significantly more expensive, computationally, for the effective intelligence they deliver.  Sophisticated evolved modalities conserve computing power in ways that might be very difficult for a human programmer to duplicate.  An example would be the use of partial imagery, modeling only the features that are needed for a high-level task [Hayhoe98]; a simplified modality that does not support partial imagery may consume more computing power.  Another example would be the human visual system's selective concentration on the center of the visual field - the "foveal architecture", in which areas of the visual field closer to the center are allocated a greater number of neurons.  The cortical magnification factor for primates is inverse-linear [Tootell85]; the complex logarithm is the only two-dimensional map function that has this property [Schwartz77], as confirmed experimentally by [Schwartz89].  A constant-resolution version of the visual cortex, with the maximum human visual resolution across the full human visual field, would require 10,000 times as many cells as our actual cortex [Rojer90].

But consider the programmatic problems introduced by the use of a logarithmic map.  Depending on where an object lies in the visual field, its internal representation on a retinotopic map will be completely different; no direct comparison of the data structures would show the identity or even hint at the identity.  That an off-center object in our visual field can rotate without perceptually distorting, as its image distorts wildly within the physical retinotopic map, presents a nontrivial computational problem [footnote 5].
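A small Python sketch shows the effect.  It assumes the idealized complex-logarithm map with the fixation point at the origin, and a made-up three-point "object"; the helper names are hypothetical.  Under this map, a rotation about the fixation point shifts every mapped point by the same amount, while the same rotation about the object's own off-center position shifts each point differently, so the mapped shape itself changes.

```python
import cmath

def log_map(z):
    """Complex-logarithm map of a retinal position z (fixation point at 0)."""
    return cmath.log(z)

def rotate(points, angle, center):
    """Rotate points by `angle` radians about `center` in the retinal plane."""
    turn = cmath.exp(1j * angle)
    return [center + turn * (p - center) for p in points]

def shift_spread(before, after):
    """Largest disagreement between the per-point shifts in the mapped image."""
    shifts = [log_map(a) - log_map(b) for b, a in zip(before, after)]
    return max(abs(s - shifts[0]) for s in shifts)

shape = [1 + 0j, 2 + 0j, 2 + 1j]   # hypothetical three-point object, off-center
own_center = sum(shape) / len(shape)

about_fixation = shift_spread(shape, rotate(shape, 0.3, 0j))
about_itself = shift_spread(shape, rotate(shape, 0.3, own_center))
print(about_fixation, about_itself)
```

The first number is zero up to rounding error (a rigid translation on the map), while the second is far from zero: the rotated object no longer matches its earlier mapped image under any simple shift.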

Evolution conserves computing power by complicating the algorithm.  Evolution, considered as a design pressure, exerts a steady equipotential design pressure across all existing complexity; a human programmer wields general intelligence like a scalpel.  It is not much harder for evolution to "design" and "debug" a logarithmic visual map because of this steady "design pressure"; further adaptations can build on top of a logarithmic visual map almost as easily as a constant-resolution map.  A human programmer's general intelligence would run into difficulty keeping track of all the simultaneous design complications created by a logarithmic map.  It might be possible, but it would be difficult, especially in the context of exploratory research; the logarithmic map transforms simple design problems into complex design problems and hence transforms complex design problems into nightmares.

I would suggest using constant-resolution sensory modalities during the early stages of an AI - as implied above by suggesting a sensory modality modeled around a 100x100 Go board - but the implication is that these early modalities will be lower-resolution, will have a smaller field, and will be less efficient computationally.  An opposing theoretic view would be that complex but efficient modalities introduce necessary issues for intelligence.  An opposing pragmatic view would be that complex but efficient modalities are easier to accommodate in a mature AI if they have been included in the architecture from the beginning, so as to avoid metaphorical "Y2K" issues (ubiquitous dependencies on a simplifying assumption which is later invalidated).


