| Next: | 2.3: Concepts | Bookmark | |
| Up: | 2: Mind | Monolithic | |
| Prev: | Interlude: The Consensus and the Veil of Maya |
A human has a visual cortex, an auditory cortex, a sensorimotor cortex - areas of the brain specifically devoted to particular senses. Each such "cortex" is composed of neural modules which extract important mid-level and high-level features from the low-level data, in a way determined by the "laws of physics" of that domain. The visual cortex and associated areas (1) are by far the best-understood parts of the brain, so that's what we'll use for an example.
Visual information starts out as light hitting the retina; the resulting information can be thought of as being analogous to a two-dimensional array of pixels (although the neural "pixels" aren't rectangular). "Low-level" feature extraction starts right in the retina, with neurons that respond to edges, intensity changes, light spots, dark spots, et cetera. From this new representation - the 2D pixels, plus features like edges, light spots, and so on - the lateral geniculate nucleus and striate cortex extract mid-level features such as edge orientation, movement, direction of moving features, textures, the curvature of textured surfaces, shading, and binocular perception. This information yields David Marr's two-and-a-half-dimensional world, which is composed of scattered facts about the three-dimensional properties of two-dimensional features - this is a continuous surface, this surface is curving away and to the left, these two surfaces meet to form an edge, these three edges meet to form a corner.
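As a rough illustration of the low-level stage, the sketch below treats the retinal image as a 2D grid of intensities and marks the places where intensity jumps between horizontal neighbors. The function name `horizontal_edges` and the threshold value are assumptions made for this example; nothing here is meant as a model of real retinal circuitry.

```python
def horizontal_edges(image, threshold=1):
    """Return a grid of booleans: True where the intensity jump between a
    pixel and its right-hand neighbor is at least `threshold`."""
    edges = []
    for row in image:
        edges.append([abs(row[x + 1] - row[x]) >= threshold
                      for x in range(len(row) - 1)])
    return edges


# A dark region (0) next to a light region (9): one vertical edge.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edge_map = horizontal_edges(image, threshold=5)
```

The output is a new representation (pixels plus an "edge" feature) of the kind that later stages would consume.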
Finally, a 3D representation of moving objects is constructed from the 2.5D world. Constraint propagation: If the 3D interpretation of one corner requires an edge to be convex, then that edge cannot be concave in another corner. Object assembly: Multiple surfaces that move at the same speed, or that move in a fashion consistent with rotation, are part of a single object. Consistency: An object (or an edge, or a surface) cannot simultaneously be moving in two directions.
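The object-assembly rule can be sketched in a few lines, assuming each surface has already been tagged with a velocity by earlier stages. The names `assemble_objects` and `surfaces` are hypothetical, and the sketch handles only the "same speed" case, not motion consistent with rotation.

```python
from collections import defaultdict


def assemble_objects(surfaces):
    """Group surface names by their (vx, vy) velocity.

    `surfaces` maps a surface name to its velocity tuple. Surfaces that
    share a velocity are treated as parts of a single object."""
    groups = defaultdict(list)
    for name, velocity in surfaces.items():
        groups[velocity].append(name)
    return sorted(sorted(names) for names in groups.values())


surfaces = {
    "hood": (3, 0), "door": (3, 0), "roof": (3, 0),  # moving together
    "wall": (0, 0),                                   # stationary
}
objects = assemble_objects(surfaces)
```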
The resulting 3D representation, still bound to the 2.5D features and the 2D pixels, is sent to the temporal cortex for object recognition and to the parietal cortex for spatial visualization.
The visual cortex is the foundation of one of the seven senses. (Yes, at least seven. In addition to sight, hearing, taste, smell, and touch, there's proprioception (the nerves that tell us where our arms and legs are) and the vestibular sense (the inner ear's inertial motion-detectors). (2)) The neural areas that are devoted solely to processing one sense or another account for a huge chunk of the human cortex. In the modular partitioning of the human brain, the single most common type of module is a sensory modality, or a piece of one. This demonstrates a fundamental lesson about minds in general.
Classical AI programs, particularly "expert systems", are often partitioned into microtheories. A microtheory is a body of knowledge, i.e. a big semantic net, e.g. propositional logic, a.k.a. suggestively named LISP tokens. A typical microtheory subject is a human specialty, such as "cars" or "childhood diseases" or "oil refineries". The content of knowledge typically consists of what would, in a human, be very high-level, heuristic statements: "A child that is sick on Saturday is more likely to be seriously ill than a child who's sick on a schoolday."
How do the microtheory-based modules of classical AI differ from the sensory modules that are common in the human mind? How does a "microtheory of vision" differ from a "visual cortex"? Why did the microtheory approach fail?
There are two fundamental clues that, in retrospect, should have alerted expert-system theorists ("knowledge engineers") that something was wrong. First, microtheories attempt to embody high-level rules of reasoning - heuristics that require a lot of pre-existing content in the world-model. The visual cortex doesn't know about butterflies; it knows about edge-detection. The visual cortex doesn't contain a preprogrammed picture of a butterfly; it contains the feature-extractors that let you look at a butterfly, parse it as a distinct object standing out against the background, remember that object apart from the background, and reconstruct a picture of that object from memory. We are not born with experience of butterflies; we are born with the visual cortex that gives us the capability to experience and remember butterflies. The visual cortex is not visual knowledge; it is the space in which visual knowledge exists.
The second, deeper problem follows from the first. All of an expert system's microtheories have the same underlying data structures (in this case, propositional logic), acted on by the same underlying procedures (in this case, a few rules of Bayesian reasoning). Why separate something into distinct modules if they all use the same data structures and the same functions? Shouldn't a real program have more than one real module?
I'm not suggesting that data formats and modules be proliferated because this will magically make the program work better. Any competent programmer knows not to use two data formats where one will do. But if the data and processes aren't complex enough to seize the programmer by the throat and force a modular architecture, then the program is too simple to give rise to real intelligence.
Besides, a single-module architecture certainly isn't the way the brain does it. Maybe there's some ingenious way to represent auditory and visual information using a single underlying data structure. If we can get away with it, great. But if no act of genius is required to solve the very deep problem of getting domain-specific representations to interact usefully, if the problem is "solved" because all the content of thought takes the form of propositional logic, if all the behaviors can fit comfortably into a single programmatic module - then the program doesn't have enough complexity to be a decent video game, much less an AI. (3).
We shouldn't be too harsh on the classical-AI researchers. Building an AI that operates on "pure logic" - no sensory modalities, no equivalent to the visual cortex - was worth trying. As Ed Regis would say, it had a certain hubristic appeal. Why does human thought use the visual cortex? Because it's there! After all, if you've already evolved a visual cortex, further adaptations will naturally take advantage of it. It doesn't mean that an engineer, working ab initio, must be bound by the human way of doing things.
But it didn't work. The recipe for intelligence presented by GISAI assumes an AI that possesses equivalents to the visual cortex, auditory cortex, and so on. Not necessarily these particular cortices; after all, Helen Keller (who was blind and deaf, and spoke in hand signs) learned to think intelligently. But even Helen Keller had proprioception, and thus a parietal lobe for spatial orientations; she had a sense of touch, which she could use to "listen" to sign language; she could use the sensory modalities she had to perceive signed symbols, and form symbols internally, and string those symbols together to form sentences, and think. (4) Some equivalent of some type of "cortex" is necessary to the GISAI design.
"Cortex" is a specifically neurological term referring to the surface area of the brain, and therefore I will use the term "sensory modality", or "modality", instead of cortex.
| DEFN: | Modality: Modalities in an AI are analogous to human cortices - visual cortex, auditory cortex, et cetera - enabling the AI to visualize processes in the target domain. Modalities capture, not high-level knowledge, but low-level behaviors. A modality has data structures suited to representing the target domain, and codelets or processing stages which extract higher-level features from raw data. |
Why does an AI need a visual modality? Because the human visual cortex and associated neuroanatomy - our visual modality - is what makes our thoughts of 2D and 3D objects real. Drew McDermott, in Artificial Intelligence Meets Natural Stupidity, pointed out that, just because a LISP token is labeled with the character string "hamburger", it does not mean that the program understands hamburgers. The program has not even noticed hamburgers. If the symbol were called G0025 instead of hamburger, nobody would ever be able to figure out that the token was supposed to represent a hamburger.
When two objects collide, we don't just have a bit of propositional logic that says collide(car, truck); we imagine two moving objects. We model 2D pixels and 3D features and visualize the objects crashing together. The edges touch, not as touch(edge-of(car), edge-of(truck)), but as two curves meeting and deforming at all the individual points along the edge. You could successfully look at a human brain and deduce that the neurons in question were modelling edges and colliding objects; this is, in fact, what visual neuroanatomists do. But if you did the same to a classical AI, if you stripped away the handy English variable names from the propositional logic, you'd be left with G0025(Q0423, U0111) and H0096(D0103(Q0423), D0103(U0111)). No amount of reasoning could bind those cryptic numbers to real-world cars or trucks.
Furthermore, our visual cortex is useful for more than vision. Philosophy in the Flesh (George Lakoff and Mark Johnson) talks about the Source-Path-Goal pattern (5) - a trajector that moves, a starting point, a goal, a route; the position of the trajector at a given time, the direction at that time, the actual final destination... Philosophy in the Flesh also talks about "internal spatial 'logic' and built-in inferences": If you traverse a route, you have been at all locations along the route; if you travel from A to B and B to C, you have traveled from A to C; if X and Y are traveling along a direct route from A to B and X passes Y, then X is further from A and closer to B than Y is.
These are all behaviors of spatial reality. Classical AI would attempt to capture descriptions of this behavior; i.e. "if travel(X, A, B) and travel(X, B, C) then travel(X, A, C)". The problem is that the low-level elements (pixels, trajectors, velocities) making up the model can yield a nearly infinite number of high-level behaviors, all of which - under the classical-AI method - must be described independently. If A is-contained-in B, it can't get out - unless B has-a-hole. Unless A is-larger-than the hole. Unless A can-turn-on-its-side or the hole is-flexible. Trying to describe all the possible behaviors exhibited by the high-level characteristics, without directly simulating the underlying reality, is like trying to design a CPU that multiplies two 32-bit numbers using a doubly-indexed lookup table with 2^64 (around eighteen billion billion) entries.
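To make the contrast concrete, here is a minimal sketch in which the transitivity of travel is never stated as a rule; it simply falls out of stepping a trajector along a one-dimensional route. The function `simulate_route` and the integer positions are assumptions made for illustration.

```python
def simulate_route(waypoints):
    """Step a trajector along integer positions through each waypoint in
    order and return every position it occupied."""
    visited = [waypoints[0]]
    for target in waypoints[1:]:
        position = visited[-1]
        step = 1 if target > position else -1
        while position != target:
            position += step
            visited.append(position)
    return visited


A, B, C = 0, 3, 7
visited = simulate_route([A, B, C])

# Neither fact below was written down as a rule; both are read off the model.
traveled_a_to_c = visited[0] == A and visited[-1] == C
was_everywhere_between = all(p in visited for p in range(A, C + 1))
```

A rule-based system would need one statement per such fact; the simulation yields them as observations.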
Real CPUs take advantage of the fact that 32-bit numbers are made of bits. This enables transistors to multiply using the wedding-cake method (or whatever it is modern CPU designs use). A 32-bit number is not a monolithic object. The numerical interpretation of 32 binary digits is not intrinsic, but rather a high-level characteristic, an observation, an abstraction. The individual bits interact, and yield a 32-bit (or 64-bit) result which can then be interpreted as the resulting number. The computer can multiply 9825 by 767 and get 7535775, not because someone told it that 9825 times 767 is 7535775, but because someone told it about how to multiply the individual bits.
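The point can be sketched with schoolbook shift-and-add multiplication, chosen here for clarity rather than because any real CPU works this way. The function name `multiply_bits` is an assumption for this example.

```python
def multiply_bits(a, b, width=32):
    """Multiply two non-negative integers by examining b one bit at a time.

    For every 1 bit in b, add a copy of a shifted left by that bit's position."""
    result = 0
    for i in range(width):
        if (b >> i) & 1:
            result += a << i
    return result


product = multiply_bits(9825, 767)

# Size of the alternative: one table entry per pair of 32-bit operands.
table_entries = 2 ** 64
```

The loop knows nothing about 9825 or 767 in particular; it only knows how bits combine.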
A visual modality grants the power to observe, predict, decide, and manipulate objects moving in trajectories, not because the modality captures knowledge of high-level characteristics, but because the modality has elements which behave in the same way as the external reality. An AI with a visual modality has the potential to understand the concept of "closer", not because it has vast stores of propositional logic about closer(A, B), but because the model of A and B is composed of actual pixels which are actually getting closer. (6).
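A toy version of this idea, assuming two point objects on a 2D grid with fixed velocities: "closer" is never asserted as a proposition, only measured from positions that actually change. All names below are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def step(position, velocity):
    """Move a point by one time step."""
    return (position[0] + velocity[0], position[1] + velocity[1])


a, b = (0, 0), (10, 0)
va, vb = (1, 0), (-1, 0)  # moving toward each other

distances = []
for _ in range(4):
    distances.append(distance(a, b))
    a, b = step(a, va), step(b, vb)

# "Closer" is an observation about the model, not a stored fact.
getting_closer = all(later < earlier
                     for earlier, later in zip(distances, distances[1:]))
```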
Source-Path-Goal is not just a visual pattern. It is a metaphor that applies to almost any effort. Force and resistance aren't just people pushing carts, they're companies pushing products. Source-Path-Goal applies not just to walking to Manhattan, but to a programmer struggling to write an application that conforms to the requirements spec. It applies to the progress of these very words, moving across the screen as I type them, decreasing the distance to the goal of a publishable Web page. Furthermore, the visual metaphor is in many cases a useful metaphor, one which binds predictively. (7). A metaphor is useful when it involves, not just a similarity of high-level characteristics, but a similarity of low-level elements, or a single underlying cause. (See previous footnote.) The visual metaphor that maps the behavior of a programming task to the Source-Path-Goal pattern (a visual object moving along a visual line) is useful if some measure of "task completed" can be mapped to the quantitative position of the trajector, and the perceived velocity used to (correctly!) predict the amount of time remaining on the task.
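The predictive use of the metaphor can be sketched as follows, assuming progress on a task has been reduced to a fraction between 0 and 1 sampled at known times. The function `predict_time_remaining` and the constant-velocity assumption are illustrative only; real tasks rarely move at a steady speed.

```python
def predict_time_remaining(progress_samples, goal=1.0):
    """Estimate time left from (time, fraction_done) samples, oldest first.

    Treats progress as a trajector on a line: average velocity so far is
    used to extrapolate the remaining distance to the goal."""
    (t0, p0), (t1, p1) = progress_samples[0], progress_samples[-1]
    velocity = (p1 - p0) / (t1 - t0)
    if velocity <= 0:
        return None  # not moving toward the goal; no prediction
    return (goal - p1) / velocity


samples = [(0, 0.0), (2, 0.25), (4, 0.5)]
remaining = predict_time_remaining(samples)
```

Whether the prediction is correct depends on whether the task really behaves like an object moving at constant speed, which is exactly the condition for the metaphor to be useful.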
Of course, one must realize that having a visual modality is Necessary, But Not Sufficient, for pulling that kind of stunt. In such cases, noticing the analogy is ninety percent of the creativity. The atomic case of such noticing would consist of generating models at random, either by generating random data sets or by randomly mixing previously acquired models, until some covariance, some similarity, is noticed between the model and the reality. And then the AI says "Eureka!"
Of course, except for very simple metaphors, the search space is too large for blind constructs to ever match up with reality. It is more often necessary to deliberately construct a model - in this case, a visual model - whose behaviors correspond to reality. Discussion of such higher-level reasoning doesn't belong in the section on "sensory modalities", but being able to "deliberately construct" anything requires a way to manipulate the visual model. In addition to the hardware/code for taking the external action of "draw a square on the sheet of paper", a mind requires the hardware/code to take the internal action of "imagine a square". The consequence, in terms of how sensory modalities are programmed, is that feature extraction needs to be reversible. Not all of the features all of the time, of course, but for the cognitive act of visualization to be possible, there must be a mechanism whereby the perception that detects the "line" feature has an inverse function that constructs a line, or transforms something else into a line.
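A minimal sketch of a reversible feature, assuming a world of pixel coordinates and a single feature type, the horizontal line. `extract_line` is the perceptual direction and `construct_line` is the imaginative direction; both names and the `(row, x_start, x_end)` encoding are assumptions for this example.

```python
def construct_line(row, x_start, x_end):
    """Inverse direction: turn a 'horizontal line' feature into pixels."""
    return {(x, row) for x in range(x_start, x_end + 1)}


def extract_line(pixels):
    """Forward direction: if the pixels form one unbroken horizontal run,
    return the feature (row, x_start, x_end); otherwise return None."""
    if not pixels:
        return None
    rows = {y for _, y in pixels}
    if len(rows) != 1:
        return None
    xs = sorted(x for x, _ in pixels)
    if xs != list(range(xs[0], xs[-1] + 1)):
        return None  # gaps: not a single line
    return (rows.pop(), xs[0], xs[-1])


feature = (2, 1, 5)                   # row 2, from x=1 to x=5
imagined = construct_line(*feature)   # "imagine a line"
perceived = extract_line(imagined)    # perception recovers the same feature
```

Here the inverse is trivial because the feature fully determines the pixels; for richer features the inverse is one-to-many.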
Feature reconstruction is much more difficult to program than feature extraction. More computationally intensive, too. It's the difference between multiplying the low-level elements of "7" and "17", and reconstructing two low-level elements which could have yielded the high-level feature of "119". This may be one of the reasons why thalamocortical sensory pathways are always reciprocated by corticothalamic projections of equal or greater size; for example, a cat has 10^6 neural fibers leading from the lateral geniculate nucleus to the visual cortex, but 10^7 fibers going in the reverse direction. (8).
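The asymmetry in the "7", "17", "119" example can be shown directly. The sketch below uses plain trial division for the inverse direction, which is an assumption chosen for brevity, not a claim about how reconstruction should be done in a modality.

```python
def multiply(a, b):
    """Forward direction: one step, whatever the inputs."""
    return a * b


def reconstruct_factors(n):
    """Inverse direction: search for a pair (a, b) with 1 < a <= b and
    a * b == n. Returns None if no such pair exists."""
    a = 2
    while a * a <= n:
        if n % a == 0:
            return (a, n // a)
        a += 1
    return None


pair = reconstruct_factors(multiply(7, 17))
```

The forward function does a fixed amount of work; the inverse has to search, and the search grows with the size of the number.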
Even a complete sensory modality, capable of perception and visualization, is useless without the rest of the AI. "Necessary, But Not Sufficient," the phrase goes. A modality provides some of the raw material that concepts are made of - the space in which visualizations exist, but nothing more. But, granting that the rest of the AI has been done properly, a visual modality will create the potential to understand the concept of "closer"; to use the concept of "closer", and heuristics derived from examining instances of the concept "closer", as a useful visual metaphor for other tasks; and to use deliberately constructed models, existing in the visual modality, to ground thinking about generic processes and interactions. (In other words, when considering a "fork" in chess or an "if" statement in code, it can be visualized as an object with a Y-shaped trajectory.)
Is a complete visual modality - pixels, edge detectors, surface-texture decoders, and all - really necessary to engage in spatial reasoning? Would a world of Newtonian billiard balls, with velocities and collision-detection, do as well? It would apparently suffice to represent concepts such as "fork", "if statement", "source-path-goal", "closer", and to create metaphors for most generic systems composed of discrete objects. The billiard-ball world has significantly less representative power; it's harder to understand a "curved trajectory" in spacetime if you can't visualize a curve in space. (9). But, considering the sheer programmatic difficulty of coding a visual modality, are metaphors with billiard balls composed of pixels that superior to metaphors with billiard balls implemented directly as low-level elements?
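For comparison, a billiard-ball world with velocities and collision detection is small enough to sketch in full. The `Ball` class, the unit time step, and the circular collision test are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    x: float
    y: float
    vx: float
    vy: float
    radius: float = 1.0


def advance(ball, dt=1.0):
    """Return the ball after moving in a straight line for time dt."""
    return Ball(ball.x + ball.vx * dt, ball.y + ball.vy * dt,
                ball.vx, ball.vy, ball.radius)


def colliding(a, b):
    """Two balls collide when their centers are within the sum of radii."""
    dx, dy = a.x - b.x, a.y - b.y
    return dx * dx + dy * dy <= (a.radius + b.radius) ** 2


a = Ball(0.0, 0.0, 1.0, 0.0)
b = Ball(6.0, 0.0, -1.0, 0.0)
ticks = 0
while not colliding(a, b):
    a, b = advance(a), advance(b)
    ticks += 1
```

Note that the balls here are the lowest-level elements: there is nothing underneath them to deform, reshape, or reinterpret.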
Well, yes. In a visual modality, you can switch from round billiard balls to square billiard balls, visualize them deforming as they touch, and otherwise "think outside the box". The potential for thinking outside the box, in this case, exists because the system being modeled has elements that are represented by high-level visual objects; these high-level visual objects in turn are composed of mid-level visual features which are composed of low-level visual elements. This provides wiggle room for creativity.
Consider the famous puzzle with nine dots arranged in a square, where you're supposed to draw four straight lines, without lifting pen from paper, to connect the dots. (10). To solve the puzzle one must "think outside the box" - that is, draw lines which extend beyond the confines of the square. A conventional computer program written to solve this problem would probably contain the "box" as an assumption built into the code, which is why computers have a reputation for lack of creativity. (11). A billiard-ball metaphor, even assuming that it could represent lines, might run into the same problem.
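The role of the "box" assumption can be checked mechanically. The sketch below, with dots on a 3x3 integer grid, tests whether a path of four straight lines touches every dot; the coordinates and function names are assumptions for this example, and the outside path is one well-known solution.

```python
def on_segment(p, a, b):
    """True if point p lies on the straight segment from a to b."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if cross != 0:
        return False  # not on the same line
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0]) and
            min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def covers_all(dots, path):
    """True if every dot lies on some segment of the connected path."""
    segments = list(zip(path, path[1:]))
    return all(any(on_segment(d, a, b) for a, b in segments) for d in dots)


dots = [(x, y) for x in range(3) for y in range(3)]

# Four lines that leave the square at (0, 3) and (3, 0).
outside_path = [(0, 0), (0, 3), (3, 0), (0, 0), (2, 2)]
# Four lines confined to the square: the perimeter misses the center dot.
inside_path = [(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]

solved_outside = covers_all(dots, outside_path)
solved_inside = covers_all(dots, inside_path)
```

A solver that only generates endpoints from `dots` has the box built into its search space and can never propose `outside_path`.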
I suspect that many solvers of the nine-dot problem reach their insight because a particular configuration of tried-out lines suggests an incomplete triangle whose corners lie outside the box. "Seeing" an "incomplete triangle" is an optical illusion, which is to say that it's the result of high-level features being triggered and suggesting mid-level features - in this case, some extra lines that turn out to be the solution to the problem. Sure, you can make up ways that this could happen in a billiards modality, but then the billiards modality starts looking like a visual cortex. The point is that, for our particular human style of creativity, it is Necessary (But Not Sufficient) to have a modality with rich "extraneous" perceptions, and where high-level objects in the metaphor can be made to do unconventional things by mentally manipulating the low-level elements. (Even so, it would make development sense to start out with a billiards modality and work up to vision gradually.)
There are two final reasons for giving a seed AI sensory modalities: First, the possession of a codic modality may improve the AI's understanding of source code, at least until the AI is smart enough to make its own decisions about the balance between slow-conscious and fast-autonomic thought. Second, as will be discussed later, thoughts don't start out as abstract; they reach what we would consider the "abstract" level by climbing a layer cake of ideas. That layer cake starts with the non-abstract, autonomic intuitions and perceptions of the world described by modalities. The concrete world provided by modalities is what enables the AI to learn its way up to tackling abstract problems.
| NOTE: | One of the greatest advantages of seed AI - second only to recursive self-improvement - is going beyond the human sensory modalities. It's possible to create a sensory modality for source code. The converse is also true: Various processes that are autonomic in humans - memory storage, symbol formation - can become sensory modalities subject to deliberate manipulation. In programmatic terms, any program module with a coherent set of data structures and an API, which could benefit from higher-level thinking, is a candidate for transformation into a modality with world-model-capable representations, feature extraction, reversible features to allow mental actions, and the other design characteristics required to support concept formation. |
(2) If proprioception does have a separate area of cortex (with distinct representations and extractable features), then it's a distinct sensory modality and should be known as such.
(3) Otherwise, it's like suggesting that translating between Microsoft Word and HTML should be programmatically trivial because both files are really just magnetic patterns in the atoms of the hard disk. What matters is the level where they're different - that's where the Law of Pragmatism says the intelligence is. And if they aren't different anywhere - why, then, there's probably no intelligence.
(8) It [the thalamus] has a simple position in the overall architecture; virtually all information arriving at the cerebral cortex comes from the thalamus, which receives it from subcortical structures... In particular, all visual, auditory, tactile, and proprioceptive information passes through the thalamus on its way to cortex... These facts give rise to the classic view that the thalamus is a passive relay station which generates virtually all the information bearing input to the cortex...
BUT the above picture has omitted one fundamental fact: all projections from thalamus to cortex are reciprocated by feedback projections from cortex to thalamus of the same or even larger size. For instance, Sherman and Koch (1986) estimate that in cat there are roughly 10^6 fibers from the lateral geniculate nucleus in the thalamus to the visual cortex, but 10^7 fibers in the reverse direction! (Italics in original.)
The most popular hypothesis is that these fibers play a gatekeeping role, assisting in focus of attention (why do you need more fibers to do that?); or, more plausibly, top-down constraints in feature extraction. And since this particular statistic is for cats, the latter hypothesis may be mostly correct. Visualization - imagination - is stereotypically associated with minds directed by general intelligence. While cats may need a memory, and thus the ability to reconstruct images from remembered high-level features, they probably don't need the detailed, fine-grained imagination of a human. So I wouldn't be surprised to find an even greater discrepancy in humans!
Or perhaps, even for cats, more fibers go from cortex to thalamus than vice versa because even mnemonic sensory manipulation is just computationally harder than sensory perception.