Outline of the Thesis - Biologically Inspired Hybrid Boundary Detection Model of Color and Texture

Chapter 1 Introduction

1.5 Outline of the Thesis

To achieve the goals described in Section 1.4, this thesis is organized as follows:

Chapter 2 introduces physiological and psychophysical knowledge about vision. Well-established evidence about mechanisms of the human visual system that reveal effective visual processing will be employed in our algorithm development.

Chapter 3 proposes the modeling strategy of this thesis. Three primitives, luminance, texture, and color, are combined properly and adaptively into a functional system. Issues that are often ignored when each primitive is modeled individually for a specific goal are also described and solved for the hybrid-order scheme.

Chapter 4 presents a large number of experimental results and discusses them. Discussions and comparisons for the issues considered in Chapter 3 are also presented.

Chapter 5 summarizes the innovations and contributions of this thesis and gives suggestions for future research.

Chapter 2 Knowledge about Human Visual System

The human visual system is a powerful and elaborate system capable of extracting features and integrating them effectively. A large body of physiological and psychophysical evidence reveals that the human visual system carries out this task at its early stages [47], [48]. The initial stages of visual processing are very important for detecting and grouping various types of visual primitives, such as curvature, line orientation, color, and spatial frequency. This chapter introduces some important knowledge about vision, and the modeling strategy presented in the next chapter builds its overall framework on this evidence.

2.1 Anatomical Structure of Human Visual System

2.1.1 The Visual Pathway

Figure 2-1: Schematic diagram of human visual system model.

Fig. 2-1 shows a probable model of the early stages of the human visual system. Part 1 is the optics of the eye, including the cornea and lens, which focus a scene onto the retina (part 2), over which many receptors (cones and rods) are spread. The two types of receptors differ considerably in their properties and uses. Rods, which are about 10 times as sensitive as cones, are the only functioning receptors at very low light levels. Cones, on the other hand, do not respond to dim light but are responsible for our ability to see fine details and for our color perception. There are three types of cones, each categorized by its wavelength sensitivity. It is widely believed that color perception originates from the differences in wavelength selectivity among the three types of cones.

After a scene is projected onto the retina, the receptors translate the light signals into neuro-electrical signals (a process usually called transduction) and transmit them from the back toward the front of the eye. Passing through layers of horizontal cells, bipolar cells, and amacrine cells, the visual signals arrive at the layer of ganglion cells, whose axons pass across the surface of the retina, collect in a bundle, and leave the eye to form the optic nerve. The optic nerves of the two eyes join and split at the optic chiasm (part 3) and then reach a central station for visual information, the lateral geniculate nucleus (LGN, part 4). In the LGN, visual information is divided into two pathways: the magnocellular pathway, mainly dealing with motion perception and spatial information, and the parvocellular pathway, mainly dealing with color, shape, texture, etc.

From the LGN, the visual information is delivered to the striate cortex (V1, part 5). V1 contains far more cells (about 250 million) than the LGN (about 1 million), and its functions are more complex. V1 preserves the most precise topographic map in the cortex. After some basic processing in V1, the various kinds of visual information are delivered to corresponding pathways for further processing. Roughly speaking, the visual pathway from the retina to V1 is usually called the early vision level, on which a large amount of previous analysis and investigation has focused. Most of our computational model was developed based on biological evidence at this level.

2.1.2 Receptive Fields in Visual Pathway

In the retina, between the layer of receptors and the layer of ganglion cells, there are three types of nerve cells: bipolar cells, horizontal cells, and amacrine cells. The receptive fields of these cells reveal the most direct information about their functional processing. The region of receptors that feed into a given sensory neuron is usually called the cell’s receptive field (RF). Receptive fields have a substructure: stimulating different parts of a receptive field gives responses that differ qualitatively and quantitatively. Thus two similar receptors in a cell’s receptive field may drive the cell differently because of their spatial positions in the receptive field.

Moreover, stimulating a large area results in cancellation among the subdivisions rather than summation. Such antagonism is found in many of our sensory systems and helps avoid ambiguous sensations. For example, the physical brightness of “black” printed words in sunlight is greater than that of “white” paper at low light levels. Nevertheless, we never have difficulty discriminating white paper from the black words printed on it. This discrimination cannot be achieved by recording absolute information, only relative information.

The main concept of the receptive field is therefore not only the connection pattern but also its opponent form. To understand the functions of the visual pathway, we should first look at the receptive fields of its nerve cells.

The bipolar cells occupy a strategic position in the retina, since all signals that originate from the receptors and are transmitted toward the ganglion cells must pass through them. Visual signals are delivered from receptors to bipolar cells along two separate paths: a direct path, in which the receptors synapse onto the bipolar cell, and an indirect path, in which the receptors contact horizontal cells that in turn synapse onto the bipolar cell. That is, each bipolar cell is connected to receptors in a two-path form. Through the direct path, a bipolar cell obtains inputs from the receptors in a circular area of the retina; through the indirect connections via horizontal cells, it receives inputs from a larger, overlapping, concentric disk-shaped area (Fig. 2-2). The two paths drive the bipolar cell in opposite directions: for identical stimuli, one path delivers an excitatory response while the other delivers an inhibitory one. This substructure of the bipolar cell’s receptive field is called the opponent center-surround mechanism. The ganglion cell’s receptive field has a similar substructure, and in fact the opponent center-surround mechanism was first discovered in ganglion cells [49], [50]. The layer of ganglion cells is the last stage of visual signals in the eye (the output of the eye); the center-surround antagonism therefore provides a preliminary understanding of processing in the retina and helps the exploration of higher-level visual processing.

Figure 2-2: Schematic diagram of receptive field of the bipolar cell.
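The opponent center-surround mechanism is often approximated by a difference of Gaussians. The following is a minimal sketch under that assumption; the kernel size and the two standard deviations are illustrative values chosen here, not parameters taken from this thesis. It shows that light covering the whole field cancels almost completely, while a small spot on the center produces a clear response.

```python
import numpy as np

def gaussian_2d(size, sigma):
    """Unit-sum isotropic Gaussian on a size x size grid (size should be odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return g / g.sum()

def center_surround_rf(size=21, sigma_center=1.5, sigma_surround=4.0):
    """ON-center field: narrow excitatory center minus broad inhibitory surround.

    The sigma values are hypothetical and only illustrate the shape.
    """
    return gaussian_2d(size, sigma_center) - gaussian_2d(size, sigma_surround)

def rf_response(rf, patch):
    """Linear response of the cell to a stimulus patch of the same size as the RF."""
    return float(np.sum(rf * patch))

rf = center_surround_rf()
half = rf.shape[0] // 2
y, x = np.mgrid[-half:half + 1, -half:half + 1]

uniform = np.ones_like(rf)                          # light covering the whole field
small_spot = (x**2 + y**2 <= 2**2).astype(float)    # light on the center only

uniform_response = rf_response(rf, uniform)
spot_response = rf_response(rf, small_spot)
print(uniform_response, spot_response)
```

Because both Gaussians are normalized to unit sum, the excitatory and inhibitory contributions balance for a uniform stimulus, which is the cancellation described in the text.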

From the retina to later stages of the visual pathway, the main processing for spatial vision does not change until V1. On the whole, the LGN acts like a relay station for visual signals, and the receptive field profile keeps the form it has in the retina. Once visual signals reach V1, receptive fields show more elaborate properties. V1 cells have several characteristics not seen earlier: binocularity, direction selectivity, and much narrower orientation and spatial frequency selectivity. The pioneers in this field were Hubel and Wiesel (Nobel Prize, 1981), who discovered that most V1 cells respond not to isotropic stimuli (e.g., a point) but to specific line stimuli [51]-[53].

According to their responses to various types of stimuli, Hubel and Wiesel classified V1 cells into two categories: simple cells and complex cells. Simple cells respond most strongly to stimuli with specific preferred orientations. Like cells in the retina, simple cells have receptive fields with excitatory and inhibitory regions. There are two types of simple cells, determined by the arrangement of these regions. One has an inhibition-excitation-inhibition arrangement (even symmetric), while the other has an excitation-inhibition arrangement (odd symmetric), as shown in Fig. 2-3. The other category of V1 cells, complex cells, respond most strongly to line segments with specific orientations moving in specific directions. In V1 and earlier stages, the receptive fields of most neurons have simple profiles that are easily understood. Still, an integrated analysis that considers these properties more thoroughly seems to be lacking. The next section introduces the main integrating theory: linear filtering theory.

Figure 2-3: Schematic diagram of receptive field of the V1 cells.
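The two simple-cell arrangements can be illustrated with one-dimensional profiles taken across the preferred orientation. The sketch below assumes a Gaussian envelope multiplied by a cosine (even symmetric) or a sine (odd symmetric); the envelope width and carrier frequency are hypothetical values. Under this assumption the even profile responds to a centered bar and ignores an antisymmetric edge, while the odd profile does the opposite.

```python
import numpy as np

def simple_cell_profiles(sigma=2.0, freq=0.15, half_width=10):
    """Return positions and 1D even/odd receptive field profiles.

    sigma, freq and half_width are illustrative values, not fitted to data.
    """
    x = np.arange(-half_width, half_width + 1, dtype=float)
    envelope = np.exp(-x**2 / (2.0 * sigma**2))
    even = envelope * np.cos(2.0 * np.pi * freq * x)   # inhibition-excitation-inhibition
    odd = envelope * np.sin(2.0 * np.pi * freq * x)    # inhibition on one side, excitation on the other
    return x, even, odd

x, even, odd = simple_cell_profiles()

bar = (np.abs(x) <= 1).astype(float)   # thin bright bar centered on the field
edge = 0.5 * np.sign(x)                # antisymmetric luminance edge through the center

responses = {
    "even_bar": float(even @ bar),
    "odd_bar": float(odd @ bar),
    "even_edge": float(even @ edge),
    "odd_edge": float(odd @ edge),
}
print(responses)
```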

2.2 Linear Filtering Theory

As more and more physiological and psychophysical evidence about receptive fields was recorded, many researchers attempted to uncover the fundamental processing of visual information. Among these efforts, linear filtering theory [47], [54], [55] has had a great impact on recent vision research. Its principal concept is spectral analysis. Researchers in the field argued that recording visual information by spatial frequency decomposition is more efficient, and a large number of experiments have been performed to verify the argument. We would like to point out that linear filtering theory differs slightly from ideal Fourier analysis, because frequency components cannot be calculated exactly in a spatially delimited image. Moreover, no neuron has a receptive field of unlimited extent; that is, information from a receptive field can reach neither the most precise frequency resolution nor the most precise spatial resolution. The issue can be explained more thoroughly by introducing another important concept: multiple spatial frequency channels.

The receptive field of a unit at the pre-cortical stages has center-surround antagonism, which can be interpreted as a band-pass filter extracting a specific range of spatial frequencies. The contrast sensitivity function (CSF) of the human visual system supports this interpretation, since the CSF attenuates at both low and high frequencies (Fig. 2-4).
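The band-pass interpretation can be checked numerically if the center-surround field is modeled as a difference of two unit-area Gaussians, which is an assumption made here for illustration. The Fourier transform of a unit-area Gaussian with standard deviation sigma is exp(-2 pi^2 sigma^2 f^2), so the gain of the difference is zero at zero frequency, peaks at an intermediate frequency, and decays again at high frequencies. The two sigma values below are hypothetical.

```python
import numpy as np

def dog_frequency_response(freq, sigma_center=1.0, sigma_surround=3.0):
    """Gain of a unit-area center Gaussian minus a unit-area surround Gaussian."""
    center = np.exp(-2.0 * (np.pi * sigma_center * freq) ** 2)
    surround = np.exp(-2.0 * (np.pi * sigma_surround * freq) ** 2)
    return center - surround

freqs = np.linspace(0.0, 1.0, 501)     # cycles per unit length
gain = dog_frequency_response(freqs)
peak_freq = float(freqs[np.argmax(gain)])

print("gain at zero frequency:", gain[0])
print("peak frequency:", peak_freq)
print("gain at the highest frequency:", gain[-1])
```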

Until the late 1960s, it was assumed that all ganglion cells have the same broad sensitivity profile as the CSF. In 1968, however, Campbell and Robson [54] made the revolutionary suggestion that the visual system might contain a group of independent band-pass filters, each narrowly tuned to a range of frequencies (Fig. 2-4). In other words, the human visual system does not employ a single mechanism to deal with all spatial frequencies but a group of mechanisms, each responsive to only some fraction of the total range.

Figure 2-4: The overall luminance contrast sensitivity function (CSF) consists of multiple spatial frequency channels.
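The relation between narrow channels and the broad overall CSF can be sketched as follows. This is only an illustration: the log-Gaussian tuning curves, the one-octave bandwidth, the center frequencies, and the peak gains are all hypothetical choices, and taking the envelope (pointwise maximum) of the channels is one simple way to combine them, not a claim about how the visual system does so.

```python
import numpy as np

def channel_sensitivity(freq, center, peak_gain, bandwidth_octaves=1.0):
    """Log-Gaussian tuning curve: narrow band-pass sensitivity around `center`."""
    sigma = bandwidth_octaves / 2.355          # full width at half maximum -> std. dev.
    return peak_gain * np.exp(-np.log2(freq / center) ** 2 / (2.0 * sigma**2))

freqs = np.logspace(np.log10(0.25), np.log10(32.0), 400)          # cycles/degree
centers = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])               # hypothetical channels
peak_gains = np.array([40.0, 90.0, 150.0, 150.0, 80.0, 20.0])     # hypothetical gains

channels = np.array([channel_sensitivity(freqs, c, g)
                     for c, g in zip(centers, peak_gains)])
overall = channels.max(axis=0)                 # envelope of all channels

def octaves_above_half_max(curve):
    """Width (in octaves) of the region where the curve exceeds half its maximum."""
    log_f = np.log2(freqs)
    step = log_f[1] - log_f[0]
    return float(np.count_nonzero(curve >= 0.5 * curve.max()) * step)

channel_widths = [octaves_above_half_max(c) for c in channels]
overall_width = octaves_above_half_max(overall)
print(channel_widths, overall_width)
```

Each single channel covers about one octave, whereas the envelope spans a clearly wider range, which mirrors the contrast between individual cell sensitivities and the overall CSF.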

The assumption was soon supported by a great deal of physiological and psychophysical evidence. Anatomical records showed that cells exist with receptive fields of different sizes, corresponding to different preferred frequencies, and cell responses measured with micro-electrodes showed much narrower sensitivities than the overall CSF. In addition, in many psychophysical experiments such as pattern adaptation, frequency masking, and subthreshold summation [56], a stimulus at a specific frequency did not affect the CSF as a whole but only locally near the stimulating frequency, which also supports the assumption of multiple spatial frequency channels. All this evidence indicates that there is no true Fourier analyzer in the human visual system. What the visual system performs is not a global analysis, which would require extremely narrow channels, but spatial-frequency filtering by a group of channels. Beyond this evidence, from the viewpoint of signal analysis it is more economical and suitable to represent the contents of the surroundings (e.g., objects and illumination) by local spatial frequency filtering. In the human visual system, this behavior appears in ganglion cells, LGN cells, and V1 cells. An image in V1 is decomposed not only into spatial frequencies but also into orientations. The schematic model of columnar organization in V1 shown in Fig. 2-5 arranges the various two-dimensional spatial frequencies in polar form, with spatial frequency increasing from the center. With an appropriate choice of basis, this organization can be represented well by spatial-frequency analysis, e.g., the wavelet transform [27]. In fact, it has been verified that the Gabor function [17], [57] fits the receptive field profile of V1 cells well, and that a two-dimensional Gabor representation can also characterize an image completely [35], [36]. We will discuss the Gabor function in more detail in Chapter 3.

Figure 2-5: Schematic diagram of columnar organization in V1 (adapted from [47]).
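The polar arrangement of frequencies and orientations can be mimicked by a small bank of Gabor kernels indexed by (frequency, orientation). The sketch below is an illustration only: the three frequencies, the four orientations, the kernel size, and the rule sigma = 0.5 / frequency are assumptions made here, and the thesis's own Gabor formulation is given in Chapter 3. It checks that a grating excites most strongly the channel tuned to its own frequency and orientation.

```python
import numpy as np

def gabor_pair(size, freq, theta, sigma):
    """Even (cosine) and odd (sine) Gabor kernels tuned to `freq` at angle `theta`."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)    # axis along which the carrier varies
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    even = envelope * np.cos(2.0 * np.pi * freq * x_rot)
    odd = envelope * np.sin(2.0 * np.pi * freq * x_rot)
    return even - even.mean(), odd                   # remove the DC term of the even kernel

def gabor_bank(freqs=(0.08, 0.16, 0.32), n_orientations=4, size=33):
    """Dictionary {(freq, theta): (even, odd)}; all parameter values are hypothetical."""
    bank = {}
    for freq in freqs:
        sigma = 0.5 / freq                           # wider envelope for lower frequencies
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            bank[(freq, theta)] = gabor_pair(size, freq, theta, sigma)
    return bank

def channel_energy(patch, even, odd):
    """Phase-insensitive energy of one channel at the patch center."""
    return float(np.sum(patch * even) ** 2 + np.sum(patch * odd) ** 2)

bank = gabor_bank()
half = 33 // 2
y, x = np.mgrid[-half:half + 1, -half:half + 1]
grating = np.cos(2.0 * np.pi * 0.16 * x + 0.7)       # varies along x, arbitrary phase

energies = {key: channel_energy(grating, even, odd)
            for key, (even, odd) in bank.items()}
best_channel = max(energies, key=energies.get)
print(best_channel)
```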

2.3 Color Vision

The background described above mainly concerns the luminance content of an image; in fact, the biological evidence was derived from experiments with gray-scale images. For color images, the added processing complexity is more than that of one extra feature. Here we briefly introduce color perception and issues of spatial vision with color.

Like texture, color is not a physical quantity but a perception. As described in Section 2.1, there are three types of cones, categorized by their wavelength sensitivity. They are usually called L-cones, M-cones, and S-cones (long-, medium-, and short-wavelength sensitive). Unlike luminance information, which corresponds directly to receptor responses, color information is a constructed output of visual processing in which the responses of the three cone types are integrated within a region and then delivered. The visual system can thus roughly analyze the content of the perceived spectrum to make up our color perception. The opponent-process theory [45] proposed by Hering is a very important theory characterizing color information processing at the neural stages. Just as luminance is represented by receptive fields with antagonism, the visual system transmits color information in a similar way. After the receptor layer, color information is delivered in three opponent channels: a red-green channel and a blue-yellow channel (the two chromatic channels), and a white-black channel (the luminance channel). This representation is directly supported by recordings from LGN cells [58] and by some investigations of complementary colors. As with receptive field profiles, given a limited number of neurons and nerves, the opponent representation provides a more economical and robust transmission.

Figure 2-6: Representation of Hering’s opponent-process theory.
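As a rough illustration of the opponent representation, the three cone responses can be recombined linearly into one luminance and two chromatic signals. The weights below (L + M for white-black, L - M for red-green, S - (L + M)/2 for blue-yellow) are a common textbook simplification assumed here for illustration; they are not taken from this thesis or from physiological measurements.

```python
import numpy as np

# Rows: white-black, red-green, blue-yellow. Columns: L, M, S cone responses.
# Hypothetical weights, chosen only to illustrate the opponent form.
OPPONENT_MATRIX = np.array([
    [1.0, 1.0, 0.0],     # white-black (luminance): L + M
    [1.0, -1.0, 0.0],    # red-green: L - M
    [-0.5, -0.5, 1.0],   # blue-yellow: S - (L + M) / 2
])

def cones_to_opponent(lms):
    """Map cone responses of shape (..., 3) to opponent channels of shape (..., 3)."""
    return np.asarray(lms, dtype=float) @ OPPONENT_MATRIX.T

achromatic = cones_to_opponent([0.5, 0.5, 0.5])   # equal cone responses
reddish = cones_to_opponent([0.8, 0.2, 0.1])      # L-cone dominated stimulus
print(achromatic, reddish)
```

With equal cone responses both chromatic channels are silent and only the luminance channel carries a signal, which is the economy the opponent form is meant to provide.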

This raises another issue. At the neural stages, color information is encoded in one luminance channel and two chromatic channels, and the luminance is represented by a group of spatial frequency channels. Are there similar mechanisms in the two chromatic channels? Fig. 2-7 is a schematic plot of the overall CSF for chromatic and luminance stimuli. Compared with the luminance CSF, the sensitivity range of the chromatic CSF is shifted toward lower frequencies and shows no significant attenuation at low frequencies. Its sensitivity is also lower than that of the luminance CSF. Texture discrimination has seldom been attributed to color information, for two reasons: (a) chromatic features are extracted within regions, which inevitably leads to coarser resolution and lower preferred frequencies in the chromatic channels; (b) owing to the form of transmission, cells carrying chromatic information have opponent mechanisms with excitatory and inhibitory wavelength ranges, so the contrast range in a chromatic channel is more limited than in the luminance channel.

Some researchers [59] have even asserted that color information contributes nothing to texture discrimination. In fact, as long as chromatic stimuli are kept within a proper bandwidth and range, cells carrying chromatic information still support texture perception. Experiments with isoluminant stimuli [42]-[44] showed that chromatic information, like luminance information, is represented in spatial frequency channels. Moreover, recordings from V1 cells revealed orientation selectivity for purely chromatic stimuli. That is, apart from some basic differences in preferred frequency and resolution, the multiple spatial-frequency filtering scheme describes all three opponent channels well. Such an economical visual processing scheme suggests that an efficient and simple implementation is possible.

Figure 2-7: Contrast sensitivity function of luminance and chromatic stimuli.

2.4 Feature Integration Theory

In Section 1.2, we mentioned that whether textures can be discriminated immediately is a rough but important test of whether the proposed approach functions correctly. A preattentive visual task is completed at very early stages without attention being involved; that is, no top-down process thoroughly analyzes the visual primitives. The definition is clear except for the word “preattentive.” The purpose of this thesis cannot be made definite until preattentive processing is clearly described.

Feature integration theory (FIT), proposed by Treisman and Gelade [60], gives an intuitive and critical definition of the demarcation between the preattentive and attentive stages. Fig. 2-8 is a schematic diagram of FIT, which comprises two stages of visual processing. In the first stage, the visual primitives of objects are analyzed in parallel and coded in feature maps. In the second stage, focal attention is deployed serially to particular positions and serves to “glue” visual primitives into object representations. Features glued together from basic primitives by attention take more time to perceive, since the gluing procedure is serial rather than parallel. Reaction time is therefore an indicative clue to the stage at which a visual feature is processed. Treisman indicated that at the preattentive stage, the reaction time is short (pop-out) no matter how many distractors are present on the display. At the attentive stage, however, increasing the number of distractors increases the reaction time, as shown in Fig. 2-9.

Figure 2-8: Schematic diagram of feature integration theory.

Figure 2-9: Typical results of a visual search experiment: (a) the result when pop-out occurs; (b) the result without occurrence of pop-out.

So far, we have introduced the biological background relevant to this research. In Chapter 3, a computational model is developed on the basis of this background.

Chapter 3 Modeling Strategy

The physiological and psychophysical evidence introduced in the preceding sections does not by itself lead to a convenient computational model for representing visual primitives. In this chapter, a novel boundary detection algorithm is proposed. The algorithm combines 1st- and 2nd-order features to model the texture segregation task at the preattentive stage of the human visual system.

Fig. 3-1 shows the overall framework. First, a color image is decomposed into one luminance channel and two chromatic channels in the CIELAB color space. We apply a Gaussian function to extract the 1st-order features and Gabor filters to extract the 2nd-order features. In the two chromatic channels, only the lowest-frequency vertical and horizontal Gabor filters are applied, because of the coarser resolution and lower preferred frequency of the chromatic channels. After Gabor filtering, the 2nd-order features still require operations such as rectification and Gaussian smoothing, and the issues raised by the hybrid-order scheme must also be considered. A typical issue is the false response of the 2nd-order features to non-texture regions (e.g., a sharp edge), which can be detected and removed by the proposed criterion. Another critical issue is the computational load of the Gabor filter-bank approach. To relieve this problem, only significant features, as determined by variance, are retained. After feature extraction, we apply a local variance calculation to obtain the 1st- and 2nd-order boundaries, respectively. Finally, the merged boundary is obtained with an adaptive selection of weights. We may go a step further and thin the boundary by local peak detection, obtaining a boundary similar to that perceived by the human visual system.

Figure 3-1: Flow chart of the proposed framework.
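To make the flow of the framework concrete, the following is a minimal sketch of the two feature paths on a single channel, written with NumPy only. It is an illustration under stated assumptions rather than the implementation used in this thesis: all filter parameters are hypothetical, full-wave rectification (absolute value) is assumed for the rectification step, only two Gabor orientations at one frequency are used, and the CIELAB decomposition, the criterion for removing false responses, the variance-based feature selection, the adaptive weights, and the thinning step are all omitted.

```python
import numpy as np

def filter2d(image, kernel):
    """Same-size 2D convolution via FFT, using reflected image borders."""
    kh, kw = kernel.shape                      # kernel sides are assumed to be odd
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="reflect")
    shape = (padded.shape[0] + kh - 1, padded.shape[1] + kw - 1)
    spectrum = np.fft.rfft2(padded, shape) * np.fft.rfft2(kernel, shape)
    full = np.fft.irfft2(spectrum, shape)
    return full[2 * ph:2 * ph + image.shape[0], 2 * pw:2 * pw + image.shape[1]]

def gaussian_kernel(sigma):
    """Unit-sum Gaussian kernel truncated at three standard deviations."""
    half = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return g / g.sum()

def gabor_kernel(freq, theta, sigma):
    """Even-symmetric Gabor kernel with zero mean and unit L1 norm."""
    half = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) * np.cos(2.0 * np.pi * freq * x_rot)
    g -= g.mean()
    return g / np.abs(g).sum()

def local_variance(feature, sigma):
    """Gaussian-weighted local variance, E[f^2] - E[f]^2, clipped at zero."""
    window = gaussian_kernel(sigma)
    mean = filter2d(feature, window)
    mean_sq = filter2d(feature**2, window)
    return np.maximum(mean_sq - mean**2, 0.0)

def first_order_boundary(channel, feature_sigma=4.0, window_sigma=3.0):
    """Gaussian smoothing (1st-order feature) followed by local variance."""
    feature = filter2d(channel, gaussian_kernel(feature_sigma))
    return local_variance(feature, window_sigma)

def second_order_boundary(channel, freq=0.125, thetas=(0.0, np.pi / 2),
                          gabor_sigma=4.0, smooth_sigma=4.0, window_sigma=3.0):
    """Gabor filtering, rectification, smoothing, then local variance per channel."""
    boundary = np.zeros_like(channel, dtype=float)
    for theta in thetas:
        response = filter2d(channel, gabor_kernel(freq, theta, gabor_sigma))
        rectified = np.abs(response)           # assumed full-wave rectification
        feature = filter2d(rectified, gaussian_kernel(smooth_sigma))
        boundary += local_variance(feature, window_sigma)
    return boundary

# Two synthetic 64 x 64 test images with a boundary between columns 31 and 32.
size = 64
y, x = np.mgrid[0:size, 0:size]
texture_pair = np.where(x < size // 2,
                        0.5 + 0.5 * np.cos(2 * np.pi * 0.125 * x),   # vertical stripes
                        0.5 + 0.5 * np.cos(2 * np.pi * 0.125 * y))   # horizontal stripes
luminance_step = np.where(x < size // 2, 0.2, 0.8)

def peak_column(boundary):
    """Column with the strongest boundary response, averaged over the middle rows."""
    return int(np.argmax(boundary[16:48].mean(axis=0)))

texture_peak = peak_column(second_order_boundary(texture_pair))
step_peak = peak_column(first_order_boundary(luminance_step))
print(texture_peak, step_peak)
```

The two halves of `texture_pair` have the same mean luminance, so the boundary between them is carried by the 2nd-order path, whereas the boundary in `luminance_step` is carried by the 1st-order path. Note that the 2nd-order path in this sketch would also respond to the sharp edge in `luminance_step`, which is exactly the false-response issue that the proposed criterion is meant to handle.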

The proposed hybrid-order boundary detection algorithm is presented in detail below. Section 3.1 reviews and discusses how to extract two important features of gray-scale images, luminance and texture. Section 3.2 introduces the strategy for extracting hybrid-order features and some issues in the luminance and chromatic channels. The nonlinear operations for the 2nd-order features are described in Section 3.3, and Section 3.4 illustrates how the boundary is found.

3.1 Luminance and Texture Features Extraction

As mentioned, similar feature-extraction mechanisms for boundary detection exist in the three color channels. Here we first review and discuss two important features in gray-scale images: luminance and texture. In Section 3.2, it will be
