Visual Attention - Fixation and Rating Prediction for Texture Synthesis Images

Lin et al. proposed a way to evaluate the quality of a synthesized texture based on human perception [1]. The contribution of their work is a systematic comparison study of the performance of texture synthesis algorithms. Geometric regularity (G-score) and appearance regularity (A-score) are the two mathematical criteria they use to compare the quality of textures. They also set up an experiment to record user evaluation data and analyzed the results. Based on these results, they suggested that, when viewers evaluate the quality of synthetic near-regular textures, geometric structure is a more important feature than color, intensity, or orientation.

Benard et al. [2] designed a rating experiment. They chose twenty gray-scale 2D textures sufficiently representative of the main traditional media used in NPR. To create sufficient redundancy in the results, they designed two sets of ten texture pairs. For each set, they chose one representative texture per class (pigments on canvas, paint, paper, hatching, cross-hatching, dots, near-regular or irregular patterns, noise, and grid). They also paid special attention to assessing the statistical validity of the resulting data. In this work, they identified the average co-occurrence error as a meaningful quality assessment metric for fractalized NPR textures. They validated the relevance of this predictor by showing its strong correlation with the results of a user-based ranking experiment; however, the co-occurrence error primarily reflects the quality of NPR textures and fails to predict that of other textures.

2.2 Visual Attention

Figure 2.1: Saliency map model.

The saliency map proposed by Itti et al. [4] is the most popular algorithm for predicting human fixations. In contrast to early research on visual attention, which concentrated on subjective awareness of the world, the saliency map model (shown in Figure 2.1) decomposes an image into three separate feature channels: color contrast, luminance contrast, and four orientations. These features detect salient parts of the visual stimulus using a center-surround architecture. The generated feature maps are then normalized to mimic the lateral inhibition effect. The sum of the feature maps for each feature channel yields a conspicuity map; the conspicuity maps are also normalized and then summed to obtain the saliency map, which quantifies visual attention.
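The center-surround, normalize, and sum pipeline described above can be illustrated with a minimal Python sketch. It is not the implementation of [4]: the box blur stands in for the Gaussian pyramid, the normalization is a plain rescaling to [0, 1] rather than the peak-promoting operator of the original model, and the function names and radii are assumptions chosen for illustration.

```python
import numpy as np

def box_blur(img, radius):
    """Mean filter with edge padding (a stand-in for a Gaussian pyramid level)."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros_like(img)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def center_surround(feature, center_radius=1, surround_radius=4):
    """Absolute difference between a fine (center) and a coarse (surround) scale."""
    return np.abs(box_blur(feature, center_radius) - box_blur(feature, surround_radius))

def normalize(feature_map):
    """Rescale to [0, 1]; a simplification of the normalization step in [4]."""
    lo, hi = feature_map.min(), feature_map.max()
    if hi == lo:
        return np.zeros_like(feature_map, dtype=float)
    return (feature_map - lo) / (hi - lo)

def saliency_map(channels):
    """channels: dict mapping a channel name to a list of 2D feature arrays."""
    conspicuity = []
    for maps in channels.values():
        # Sum the normalized feature maps of one channel into a conspicuity map.
        summed = sum(normalize(center_surround(m)) for m in maps)
        conspicuity.append(normalize(summed))
    # Average the conspicuity maps to obtain the saliency map.
    return sum(conspicuity) / len(conspicuity)
```

With a single luminance feature containing one bright square on a dark background, the sketch assigns a higher value to the square than to the empty corners, which is the qualitative behavior expected from a center-surround contrast detector.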

Figure 2.2: Saliency map model.

Peter et al. [9] built a least-squares-based model of spatial attention (shown in Figure 2.2) that combines a general computational implementation of both bottom-up saliency and dynamic top-down task relevance. The bottom-up component computes a saliency map from 12 low-level multi-scale visual features. The top-down component computes a low-level signature of the entire image and learns to associate different classes of signature with the different gaze patterns recorded from human subjects. Put simply, the basic idea of this thesis is to train a model associated with the features of the saliency map.
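As a rough illustration of what a least-squares association between image signatures and gaze patterns can look like, the sketch below fits a linear map with NumPy. The array shapes and the names `fit_gaze_model` and `predict_gaze` are hypothetical; the actual signature features and training procedure of [9] are not reproduced here.

```python
import numpy as np

def fit_gaze_model(signatures, gaze_maps):
    """Fit W minimizing ||signatures @ W - gaze_maps||^2.

    signatures: (n_frames, n_features) image signatures (assumed layout).
    gaze_maps:  (n_frames, n_cells) flattened gaze-density maps (assumed layout).
    """
    W, *_ = np.linalg.lstsq(signatures, gaze_maps, rcond=None)
    return W

def predict_gaze(signature, W):
    """Predict a flattened gaze map from one signature (or a batch of them)."""
    return np.asarray(signature) @ W
```

Given a new image signature, `predict_gaze` returns a top-down relevance map that could then be combined with a bottom-up saliency map.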

Mathew et al. [10] presented a neural network model that simulates the saliency map [4]. Their model expands on Itti and Koch's model by implementing the feature maps and the saliency map as a network of neural populations whose dynamics are based on data from electrophysiological experiments. Their main motivation was to propose a hypothesis for how Itti and Koch's abstract model could be implemented by neural networks with biologically realistic dynamics.

Most saliency approaches are based on bottom-up computation that does not consider top-down image semantics and often does not match actual eye movements. Judd et al. [11], on the other hand, proposed a support vector machine (SVM) based model trained with low-, middle-, and high-level features. These features include subband features, the Itti and Koch saliency channels, distance to the center, color features, and automatic horizon, face, person, and car detectors.

Compared with the related work above, this thesis includes more object-relevant features, which are regarded as the interesting parts of an image.

The normalized scanpath salience (NSS) proposed by Robert et al. [7] measures the average normalized salience value across all fixation locations. The NSS therefore indicates, on average, the model-predicted salience at fixated locations. Because the NSS is scale-free, it can be used to compare the degree of correspondence between observed and predicted behavior across different observers and images.
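The NSS computation can be sketched in a few lines of Python. The sketch assumes the usual formulation, in which the predicted salience map is standardized to zero mean and unit standard deviation before it is sampled at the fixation locations; the function name and the (row, column) fixation format are assumptions.

```python
import numpy as np

def normalized_scanpath_salience(saliency, fixations):
    """Mean standardized salience at the fixated pixels.

    saliency:  2D array of model-predicted salience.
    fixations: iterable of (row, col) pixel coordinates (assumed format).
    """
    saliency = np.asarray(saliency, dtype=float)
    fixations = list(fixations)
    if not fixations:
        raise ValueError("at least one fixation is required")
    std = saliency.std()
    if std == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    # Standardize the map so that scores are comparable across images.
    z = (saliency - saliency.mean()) / std
    rows, cols = zip(*fixations)
    return float(z[list(rows), list(cols)].mean())
```

Under this formulation, a value near zero means the fixations land on regions of average predicted salience (chance level), while positive values mean the observers fixated locations the model considers salient.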

Stas et al. [12] proposed a new type of saliency, context-aware saliency, which aims at detecting the image regions that represent the scene. They presented a detection algorithm based on four principles observed in the psychological literature: local low-level considerations, global considerations, visual organization rules, and high-level factors.

Yu et al. [13] proposed a computational model of visual attention on structural textures by analyzing human subjects' gaze behavior. We keep her eye-tracking data and the users' rating scores. Additionally, we modify her feature extraction and association model to achieve better prediction. Instead of training on the whole feature map of each texture, we sample the training patterns from the feature maps to reduce the computational cost. Moreover, we replace the training model with SONFIN because of its appreciable speed and performance.
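The idea of sampling training patterns from the feature maps, rather than using every pixel, can be sketched as follows. The text above does not state the sampling scheme, so uniform random sampling without replacement is an assumption made only for illustration, as are the function name and array layout.

```python
import numpy as np

def sample_training_patterns(feature_maps, target_map, n_samples, seed=0):
    """Draw n_samples pixel positions and return their feature vectors and targets.

    feature_maps: (n_features, H, W) stack of feature maps (assumed layout).
    target_map:   (H, W) training target, e.g. a fixation-density map.
    Uniform sampling without replacement is an assumption, not the thesis's scheme.
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    target_map = np.asarray(target_map, dtype=float)
    n_features, height, width = feature_maps.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(height * width, size=n_samples, replace=False)
    X = feature_maps.reshape(n_features, -1)[:, idx].T  # (n_samples, n_features)
    y = target_map.reshape(-1)[idx]                     # (n_samples,)
    return X, y
```

Training on `n_samples` patterns instead of all `H * W` pixels reduces the cost per texture roughly in proportion to the sampling ratio.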

Chapter 3

Eye Tracking Experiments

Eye tracking has attracted growing interest recently. Why is eye tracking important? Simply put, we move our eyes to bring a particular portion of the visible field of view into high resolution so that we can see in fine detail whatever is at the central direction of gaze. Most often we also divert our attention to that point so that we can focus our concentration on the object or region of interest. Thus, we may presume that if we can track someone's eye movements, we can follow the path of attention deployed by the observer. This may give us some insight into what the observer found interesting.

We recorded the eye movements of human observers while they watched, compared, and judged synthesized textures. The collected eye-tracking data are used to train our model. In this chapter, we describe the settings of our experiment and how we process the eye-tracking data.

3.1 Experiment Settings

To record the most natural reactions of viewers, we need to reduce the influence of the eye-tracking equipment and provide a relaxing environment for our subjects. For these two reasons, our experiment was conducted on the Tobii T120 Eye-tracker, a contact-free gaze measurement device, as shown in Figure 3.1. The eye tracking system allows for a large degree of head movement, providing a distraction-free test environment that encourages natural behavior and therefore valid results. Its high accuracy and precision help make the research results reliable and help us acquire more realistic responses from human subjects.

Figure 3.1: The Tobii T120 Eye-tracker and the experiment environment.

The following are the specifications of Tobii T120 eye-tracker:

Data Rate: 120 Hz
Accuracy: typical 0.5 degrees
Head Movement Error: typical 0.2 degrees
Head Movement Box: 30 × 22 cm at 70 cm
Tracking Distance: 50-80 cm
Max Gaze Angles: 35 degrees
Top Head-motion Speed: 25 cm/second
Screen Size: 17” TFT
Screen Resolution: 1280 × 1024 pixels
Display Colors: 16.7M
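To relate the angular accuracy in the specification to on-screen pixels, the following sketch converts a visual angle into a pixel extent. It assumes square pixels, a viewer centered in front of the screen, and a viewing distance of 70 cm; the actual viewing distance in our sessions is not stated here, so the distance is only a placeholder taken from the head movement box entry.

```python
import math

def degrees_to_pixels(angle_deg, distance_cm, diagonal_in=17.0, resolution=(1280, 1024)):
    """On-screen extent, in pixels, subtended by a visual angle at a viewing distance."""
    diagonal_px = math.hypot(*resolution)
    # Physical size of one pixel, assuming square pixels.
    pixel_pitch_cm = diagonal_in * 2.54 / diagonal_px
    extent_cm = 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)
    return extent_cm / pixel_pitch_cm

# Typical accuracy of 0.5 degrees at an assumed viewing distance of 70 cm.
accuracy_px = degrees_to_pixels(0.5, 70.0)
```

Under these assumptions, the typical accuracy of 0.5 degrees corresponds to roughly 23 pixels on the 1280 × 1024 display, which gives a sense of the spatial scale at which individual gaze samples can be trusted.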

Twenty undergraduate and graduate students participated in our experiment. After excluding the subjects whose eye movements could not be tracked successfully, the data of 18 subjects were analyzed. These 18 subjects consisted of 14 males and 4 females, aged 19 to 24, with normal or corrected-to-normal vision. None of them had relevant knowledge of texture synthesis. No subject took part in the experiment more than once, so learning effects were avoided. All subjects were naive to the purpose of the whole process.

Several well-known texture synthesis techniques, such as graph cut [14], image quilting [15], near-regular texture synthesis (NRT) [16], and regularized patch-based and patch-based synthesis [17], are widely applied in many fields. The graph cut approach attempts to handle global regularity by incorporating a local correlation technique to determine the best pasting location. The main idea of image quilting is to synthesize a new texture by taking patches of an existing texture and stitching them together in a consistent way. NRT [16] can depart from regular tiling along different axes of appearance; it is able to produce a regular structural layout while controlling the color variation. The basic idea of the patch-based algorithm is to synthesize textures by directly copying image patches from the input texture; its authors also propose a modified approximate nearest-neighbor technique to speed up their model.

To make sure that the texture data in our experiment do not leave out crucial details, we collected them from Lin et al. [1]. These data were not produced by reimplementations; instead, Lin et al. asked the original authors either to run their own algorithms or to let them run the authors' source code on the same set of input textures.

The images used in this experiment are eleven different structural textures, as shown in Figures 3.2 and 3.3. Our database includes 10 near-regular textures and 1 irregular texture. Each texture has four synthesized textures generated by graph-cut [14], image-quilting [15], patch-based texture
