Top-Down and Bottom-Up Processing in Object Recognition

Humans and animals rely heavily on object recognition, classification, and identification for their survival. The textbook introduction to visual perception, however, starts with the physics of light reflecting off objects in the world and entering the eyes of the perceiver. The light then passes through the biological optical structures that filter and focus it, is detected by photoreceptors on the retina, and is transformed into neural signals. Yet once your senses have taken in information and stimulation from the environment, your brain still has to work out what it is. What was that brown, small thing? Is it an animal, insect, or stone?

We know the ‘how’ of light entering the eye, its processing in the retina, and its routing to the brain. The question is how the brain actually perceives what we see. Do we only process what is projected onto the retina, or do our memory, knowledge, emotions, environment, and bodies have a role to play?

The ‘how’ of our perception of the world can be categorized into bottom-up and top-down processing theories. Taking sensory input of the stimulus and sending it upward for analysis of relevant information is called bottom-up or data-driven processing. It’s the starting point in the identification of data or sensory evidence you take from the environment. However, when you use your knowledge, emotions, motivations, expectations, cultural background, and bodies to perceive the world, it’s termed top-down or conceptually driven processing. 

Object recognition is a complex integration of top-down and bottom-up processes that have often been presented in contrast to each other because they rely on different types of information. 

Bottom-up processing

Direct Perception 

In bottom-up processing, the perceiver uses the bits of information received from the environment through the senses and combines them to form a percept. The system takes information such as edges, shapes, and light and puts it together to understand the scene. Thus, the system works in one direction, from the input to a final interpretation of the object. Higher cognitive processes do not play a role in this system, and the system cannot go back to an earlier point to adjust.

Gibson’s ecological theory (1966, 1979/1986) emphasizes that perception relies on innate mechanisms and does not require learning. James J. Gibson (1904–1980) questioned the associationism of the 19th-century Danish psychologist Harald Hoffding, who held that perception links what is seen with what is remembered. According to Gibson’s theory of direct perception, we do not require cognitive processes (e.g., existing beliefs, memory, or inferential thought) for perception; the information in our sensory receptors is enough to perceive anything.

Gibson also stated that direct perception is not just a mode of cognition but “the simplest and best kind of knowing” (Gibson, 1979). He argued that perception involves “keeping in touch with the environment” (1979) and that it developed during evolution to allow our ancestors to respond rapidly to the environment.

In his words, “When I assert that perception of the environment is direct, I mean that it is not mediated by retinal pictures, neural pictures, or mental pictures. Direct perception is the activity of getting information from the ambient array of light. I call this a process of information pickup that involves … looking around, getting around, and looking at things.” (Gibson, 1979, p. 147)

With the ecological approach, Gibson suggested that information about the world is available in the detectable patterns in the environment such that we directly perceive without first transforming a distal stimulus into a proximal stimulus.

Template Theories  

According to template theory, we recognize objects by comparing the sensory input with a diverse set of stored templates. These are highly detailed models of patterns that are stored in our minds. Some modern examples are fingerprint matching, bar codes, chess players using a matching strategy, and machines processing the numerals imprinted on checks.

In this model of perception, every event or object that we encounter and want to process for information must be compared to a previously stored pattern or template. The perceptual process compares incoming information with stored templates to find the one that matches or comes close. This model implies that some part of the brain stores millions of different templates in a knowledge base, one for every pattern or object we can recognize.

However, template theory runs into problems when applied to biological systems, and it cannot fully explain how perception works. First, in higher organisms, the memory load of storing a template for every variant of a stimulus would be massive, and the number of templates for alternative forms would be effectively infinite. Second, template matching cannot explain how we come to recognize new objects such as laptops, smartphones, and DVDs.
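The matching process and its first weakness can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 3x3 binary grids, the `TEMPLATES` dictionary, and the `overlap` and `recognize` helpers are hypothetical and are not taken from any model in the literature.

```python
# Hypothetical 3x3 binary "retinal images": 1 = dark pixel, 0 = light pixel.
TEMPLATES = {
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
}


def overlap(image, template):
    """Fraction of pixels on which the image and the template agree."""
    pairs = [(a, b)
             for image_row, template_row in zip(image, template)
             for a, b in zip(image_row, template_row)]
    return sum(a == b for a, b in pairs) / len(pairs)


def recognize(image, threshold=1.0):
    """Return the best-matching template name, or None when no template
    agrees with the image on at least `threshold` of its pixels."""
    name, score = max(
        ((n, overlap(image, t)) for n, t in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else None


exact_t = ((1, 1, 1), (0, 1, 0), (0, 1, 0))
# The same letter with a shorter crossbar: a variant with no stored template.
variant_t = ((0, 1, 1), (0, 1, 0), (0, 1, 0))

print(recognize(exact_t))                   # exact match is recognized
print(recognize(variant_t))                 # strict matching rejects the variant
print(recognize(variant_t, threshold=0.8))  # a looser criterion accepts it
```

With strict matching, the variant is recognized only if a separate template is stored for it, which is the memory-load problem in miniature. Lowering the threshold avoids that, but only by giving up the exact match that defines a template.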

Feature-Matching Theories 

Unlike template theory, where whole templates or patterns are compared, feature-matching theories hold that only the features of a pattern are matched to features stored in memory (Stankiewicz, 2003). This model attempts to correct the shortcomings of template-matching models, which require a knowledge base of millions of templates.

Also called prototype matching, this model matches a prototype rather than the complete pattern. According to prototype theory, membership in a conceptual category comes in degrees: some members are more central than others because they share more of the category’s defining traits. Thus, for example, the prototypical cat would have the common features of the category cat but might lack the distinctive features of particular types of cat.

According to the prototype matching model, whenever a new stimulus is registered, we compare it with previously stored prototypes. Unlike template matching, we do not require an exact match because we are looking at prototypes, and an approximate match is expected. This model allows for discrepancies, giving more flexibility than the template model. It does not require the object to have a specific feature or set of features to recognize it.
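Approximate matching against prototypes can be sketched in a few lines of Python. The feature list, the `PROTOTYPES` table, and the `distance` and `classify` helpers below are hypothetical choices made for illustration, not part of any published model.

```python
# Hypothetical binary features, in order:
# (has fur, has retractable claws, barks, has feathers)
PROTOTYPES = {
    "cat":  (1, 1, 0, 0),
    "dog":  (1, 0, 1, 0),
    "bird": (0, 0, 0, 1),
}


def distance(stimulus, prototype):
    """Number of features on which the stimulus differs from the prototype."""
    return sum(s != p for s, p in zip(stimulus, prototype))


def classify(stimulus):
    """Return the category whose prototype is closest to the stimulus.
    An exact match is not required."""
    return min(PROTOTYPES, key=lambda name: distance(stimulus, PROTOTYPES[name]))


typical_cat = (1, 1, 0, 0)
hairless_cat = (0, 1, 0, 0)  # lacks fur, a feature of the cat prototype

print(classify(typical_cat))
print(classify(hairless_cat))
```

The hairless cat differs from the cat prototype on one feature, yet it is still closer to that prototype than to any other, so it is classified as a cat. No single feature is mandatory, which is the flexibility the prototype account claims over template matching.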

One such model is Pandemonium, described by Selfridge (1959), which has four kinds of demons: image demons, feature demons, cognitive demons, and decision demons. All demons at one level are connected to all demons at the next level up.
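The four levels of demons can be sketched as a small Python pipeline. This is a loose illustration only: the stroke inventory in `LETTER_FEATURES`, the scoring rule used by the cognitive demons, and all function names are assumptions rather than details of Selfridge's model.

```python
# Hypothetical stroke counts that each letter's cognitive demon listens for.
LETTER_FEATURES = {
    "A": {"oblique": 2, "horizontal": 1, "vertical": 0},
    "H": {"oblique": 0, "horizontal": 1, "vertical": 2},
    "T": {"oblique": 0, "horizontal": 1, "vertical": 1},
}
STROKE_TYPES = ("oblique", "horizontal", "vertical")


def image_demon(stimulus):
    """Record the raw input unchanged (here, a list of stroke labels)."""
    return list(stimulus)


def feature_demons(image):
    """Each feature demon counts occurrences of its own stroke type."""
    return {stroke: image.count(stroke) for stroke in STROKE_TYPES}


def cognitive_demons(features):
    """Each letter demon 'shouts' louder (a higher score) the closer the
    observed counts are to the counts expected for its letter."""
    return {
        letter: -sum(abs(features[stroke] - count)
                     for stroke, count in expected.items())
        for letter, expected in LETTER_FEATURES.items()
    }


def decision_demon(shouts):
    """Pick the letter whose cognitive demon shouts loudest."""
    return max(shouts, key=shouts.get)


def pandemonium(stimulus):
    return decision_demon(cognitive_demons(feature_demons(image_demon(stimulus))))


print(pandemonium(["vertical", "horizontal", "vertical"]))
print(pandemonium(["oblique", "oblique", "horizontal"]))
```

Every cognitive demon receives the output of every feature demon, which mirrors the full connectivity between adjacent levels described above.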

Feature analysis

As information about position, color, form, motion, presentation, and spatial frequency is encoded quite early in visual processing, it is quite possible that Gestalt-type mechanisms also operate at an early stage. If that is true, feature analysis occurs before the features of the sensory input are matched with memory.

The visual search experiments of Neisser (1964, 1967) support feature analysis. He found that Americans could easily identify the face of the late President Kennedy among thousands of faces in a photograph of a crowd at a baseball match. He also found that a target is easier to identify against a dissimilar background.

Studies of frog retinas (Lettvin, Maturana, McCulloch, & Pitts, 1959) found that specific stimuli cause retinal cells to fire more frequently, and that certain cells act as edge detectors, responding strongly to the visual boundary between light and dark. This evidence supports feature analysis because the detectors respond rapidly whenever a particular feature is present and respond less strongly when it is not.
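The idea of a detector that responds strongly at a light-dark boundary and weakly elsewhere can be sketched in Python. This is a toy illustration, not a model of frog retinal cells: the one-dimensional brightness row, the threshold, and the function names are all assumptions.

```python
def edge_responses(luminance):
    """Response of a hypothetical edge detector at each pair of neighbouring
    points: the absolute difference in brightness between them."""
    return [abs(right - left) for left, right in zip(luminance, luminance[1:])]


def edge_positions(luminance, threshold=0.5):
    """Indices where the detector response reaches the threshold."""
    return [i for i, response in enumerate(edge_responses(luminance))
            if response >= threshold]


# Brightness along one row of an image: dark, then light, then dark again.
row = [0.1, 0.1, 0.9, 0.9, 0.9, 0.2]
uniform = [0.5, 0.5, 0.5, 0.5]

print(edge_positions(row))      # strong responses only at the two boundaries
print(edge_positions(uniform))  # no boundary, so no strong response
```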

Top-down processing

In top-down processing, understanding begins with a general concept and moves toward the more specific, and object recognition is influenced by the person’s expectations and prior knowledge. The information for this processing comes from your learning and experience. In other words, the brain uses contextual information to fill in the blanks and make sense of the information brought in by our senses.
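One way to picture context filling in the blanks is an ambiguous shape that could be read as either the letter B or the number 13. The Python sketch below is only an illustration: the numeric values, the multiplication rule for combining evidence with expectation, and the names `SHAPE_FIT`, `CONTEXT_EXPECTATION`, and `interpret` are all assumptions.

```python
# Bottom-up evidence: how well the ambiguous shape fits each reading.
# The values are hypothetical; the shape fits both readings equally well.
SHAPE_FIT = {"B": 0.5, "13": 0.5}

# Top-down expectation: how strongly each reading is expected in a context.
# The values are hypothetical.
CONTEXT_EXPECTATION = {
    "letters": {"B": 0.9, "13": 0.1},  # e.g. the shape appears between A and C
    "numbers": {"B": 0.1, "13": 0.9},  # e.g. the shape appears between 12 and 14
}


def interpret(context):
    """Combine sensory fit with contextual expectation and return the
    reading with the highest combined score."""
    scores = {
        reading: SHAPE_FIT[reading] * CONTEXT_EXPECTATION[context][reading]
        for reading in SHAPE_FIT
    }
    return max(scores, key=scores.get)


print(interpret("letters"))
print(interpret("numbers"))
```

The sensory input is identical in both calls; only the expectation changes, and with it the interpretation.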

We currently live in a world surrounded by limitless information and sensory experiences, and top-down processing helps us navigate and make sense of the environment. Top-down information is knowledge of objects regarding their shape, function, color, size, and location. For example, you know what a refrigerator looks like, the common parts associated with it, and where it is often located. Whenever a visual input matches these descriptions, you can be said to recognize the refrigerator. However, there will be some variation in the visual input of different refrigerators—they might have one, two, or multiple doors, and some might be on the floor while others will be built into the wall. 

Gregory’s Theory

In 1970, psychologist Richard Gregory argued that perception depends on top-down processing. He went against the bottom-up approaches and explained how prior knowledge and experience related to a stimulus help us perceive it. For Gregory, perception involves making the best guess about what we see, because 90% of the visual information in a stimulus is lost by the time it arrives in the brain for processing. The brain therefore forms a perceptual hypothesis about the stimulus based on memory and experience related to it. With visual illusions such as the Necker cube, Gregory believed, the brain may form incorrect hypotheses, leading to errors of perception.

Bruner and Goodman’s “Value and need as organizing factors in perception.”

One especially notable landmark for the importance of top-down processing in perception was the publication of “Value and need as organizing factors in perception” (Bruner & Goodman, 1947). The study reported that children perceive coins as larger than worthless cardboard discs of the same physical size. Moreover, children from low-income families overestimated the size of the coins more than wealthy children did. These early results ignited a movement in perceptual psychology devoted to documenting top-down influences on perception.

This momentum stalled under theoretical and methodological scrutiny when experiments with various other symbols failed to produce similar results. In the last two decades, however, research on top-down effects in perception has regained traction, with alleged effects of motivation, emotion, action, categorization, and language.

Combined Perceptual Model

Top-down processes have to interact with bottom-up processes so that we can perceive unexpected things and not only what we expect. David Marr’s theory incorporates both bottom-up and top-down processes.

According to David Marr’s (1982) perceptual theory, perception proceeds through several special-purpose computational mechanisms, such as one module to analyze color, another to analyze motion, and so on. In this theory, each module operates autonomously, without regard to the input or output of the other modules or to real-world knowledge. Thus, they are bottom-up processes.

Marr proposed that visual perception constructs three mental representations: a primal sketch, a 2½-D (two-and-a-half-dimensional) sketch, and a 3-D sketch. The primal sketch depicts areas of relative brightness and darkness in a two-dimensional image, along with localized geometric structure. This allows the viewer to detect boundaries between areas but not to “know” what the visual information “means.”

Next, the viewer uses the primal sketch to create a more complex representation, called a 2½-D (two-and-a-half-dimensional) sketch. Here, the viewer uses cues such as shading, texture, and edges to derive information about the surfaces and their depth relative to the viewer’s vantage point.

According to Marr, the first and second sketches rely almost entirely on bottom-up processes, while expectations, real-world knowledge, and other high-level cognitive processes are incorporated in constructing the final, 3-D sketch of the visual scene.

Constructive Perception

This theory of perception stands in contrast to direct perception: the perceiver uses sensory information about the stimulus along with other sources of information to construct a complete understanding of the stimulus. It shows the relationship between intelligence and perception, as it involves higher-order cognitive skills in the process of perception. Some evidence for this theory comes from perceptual constancies in the visual (size, shape, and color), auditory, and tactile modalities.


  1. Bruner, J. S., & Goodman, C. C. (1947). Value and need as organizing factors in perception. The Journal of Abnormal and Social Psychology, 42(1), 33.
  2. Gibson, J. J. (1966). The problem of temporal order in stimulation and perception. The Journal of Psychology, 62(2), 141-149.
  3. Gibson, J. J. (1979). The Ecological Approach to Visual Perception.
  4. Gilbert, C. D., & Sigman, M. (2007). Brain states: top-down influences in sensory processing. Neuron, 54(5), 677-696.
  5. Kinchla, R. A., & Wolfe, J. M. (1979). The order of visual processing: “Top-down,” “bottom-up,” or “middle-out”. Perception & Psychophysics, 25(3), 225-231.
  6. Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frog’s eye tells the frog’s brain. Proceedings of the IRE, 47(11), 1940-1951.
  7. Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information.
  8. Neisser, U. (1964). Visual search. Scientific American, 210(6), 94-103.
  9. Stankiewicz, A. (2003). Reactive separations for process intensification: an industrial perspective. Chemical Engineering and Processing: Process Intensification, 42(3), 137-144.
  10. Ullman, S. (1980). Against direct perception.
