
4.3 Features


In Section 4.3.1, we present several visual content features to describe different visual facets of images based on their content. Then, in Section 4.3.2, we explore several social features based on the contextual information of images and their owners.

4.3.1 Visual Content Features

No one can deny that an image’s popularity is related to some extent to its visual content.

The visual information represents the most interesting parts of the image; these parts commonly contain the salient objects in the image. To demonstrate the influence of an image’s content on its popularity, we adopt different types of visual features, including low-level, high-level, and deep learning features. More details about these features are presented below.

Low-level Features

The low-level features describe the visual content of the image and can be extracted automatically from its pixel information. There are several types of low-level computer vision features, such as texture, color, shape, gist, gradient, and spatial location, many of which are thought to play a role in human visual processing. In this study, we adopt the following three features: color, texture, and gist. For each feature, we describe our motivation and the extraction method below.

Color: A well-balanced color distribution in an image attracts viewer attention and aids in determining object properties and understanding scenes. In this study, we use a color histogram descriptor that characterizes the color feature and results in a 32-dimensional vector [148].
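As a rough illustration, the sketch below builds a 32-dimensional color descriptor in Python. The binning scheme is an assumption of this sketch (an L1-normalized 32-bin histogram over the hue channel); the descriptor of [148] may partition the color space differently.

```python
import numpy as np
from PIL import Image


def color_histogram(image_path, n_bins=32):
    """Return an L1-normalized 32-bin hue histogram for an image.

    The exact binning of the descriptor in [148] is not reproduced here;
    a 32-bin hue histogram is one plausible 32-dimensional realization.
    """
    hsv = np.asarray(Image.open(image_path).convert("HSV"))
    hue = hsv[..., 0].ravel()                       # hue channel, values in [0, 255]
    hist, _ = np.histogram(hue, bins=n_bins, range=(0, 256))
    return hist / max(hist.sum(), 1)                # normalize so the vector sums to 1
```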

Texture: We routinely interact, both visually and through touch, with various textures and materials in our surroundings; consequently, people perceive images with different textures differently. The texture feature is often used to describe the homogeneity of colors or intensities in an image, and it can also help identify the most interesting objects or regions [149]. To investigate the importance of this type of feature in predicting image popularity, we employ one of the most widely used texture descriptors, namely local binary patterns (LBP) [150]. More precisely, we use the uniform LBP descriptor [151], resulting in a 59-dimensional feature vector.
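The following sketch shows how a 59-dimensional uniform LBP histogram can be obtained with scikit-image. The neighborhood parameters (P = 8 samples, radius R = 1) are assumptions chosen so that the non-rotation-invariant uniform variant yields P(P−1)+3 = 59 codes, matching the dimensionality stated above; the settings of [151] may differ.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern
from skimage.io import imread


def uniform_lbp_histogram(image_path, P=8, R=1):
    """59-bin histogram of uniform LBP codes (P = 8 neighbors, radius 1)."""
    gray = rgb2gray(imread(image_path))
    codes = local_binary_pattern(gray, P, R, method="nri_uniform")
    n_bins = P * (P - 1) + 3                        # 59 distinct codes for P = 8
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / max(hist.sum(), 1)                # L1-normalized histogram
```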

Gist: The GIST descriptor has demonstrated high performance in several computer vision tasks such as scene classification. It provides a rough description of a scene by summarizing the gradient information (scales and orientations) for various parts of a photo. To extract the GIST feature of an image, we adopt the widely used GIST descriptor proposed in [152], resulting in a feature vector with 512 dimensions.
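The sketch below computes a GIST-like descriptor by averaging Gabor filter responses (4 scales × 8 orientations) over a 4 × 4 spatial grid, which yields 512 values. This is only an approximation for illustration; the descriptor of [152] uses its own filter bank, frequency spacing, and implementation details.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gabor
from skimage.io import imread
from skimage.transform import resize


def gist_like_descriptor(image_path, n_scales=4, n_orientations=8, grid=4):
    """GIST-like descriptor: 4 scales x 8 orientations x (4 x 4) grid = 512 values."""
    gray = resize(rgb2gray(imread(image_path)), (128, 128), anti_aliasing=True)
    features = []
    for s in range(n_scales):
        frequency = 0.25 / (2 ** s)                 # assumed frequency spacing per scale
        for o in range(n_orientations):
            real, imag = gabor(gray, frequency=frequency,
                               theta=np.pi * o / n_orientations)
            magnitude = np.hypot(real, imag)
            h, w = magnitude.shape
            # average the filter-response magnitude over a 4 x 4 grid of blocks
            for i in range(grid):
                for j in range(grid):
                    block = magnitude[i * h // grid:(i + 1) * h // grid,
                                      j * w // grid:(j + 1) * w // grid]
                    features.append(block.mean())
    return np.asarray(features)                     # 512-dimensional vector
```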

High-level Features

The quality and aesthetic appearance of an image are important for its popularity. For instance, clear images, as well as images that contain appealing or interesting objects, usually attract significant viewer attention and become popular. Therefore, we adopt certain aesthetic features for image popularity prediction based on the various photographic techniques and aesthetic standards used by professional photographers. These features are designed to evaluate the visual quality of a photograph by first separating the subject area from the background using the blur detection technique [153]. Then, based on the result of this separation process, six types of aesthetic features are computed as described below.
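The separation step itself relies on the blur detection technique of [153], which is not reproduced here. As a rough stand-in for illustration, the sketch below labels the locally sharper regions of an image as the subject, producing the kind of binary mask that the aesthetic features described next operate on; the function name, window size, and median threshold are assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter


def estimate_subject_mask(image_gray, window=15):
    """Very rough subject/background separation based on local sharpness.

    This is only a stand-in for the blur detection technique of [153]: pixels
    whose local Laplacian energy is above the image median are treated as the
    (in-focus) subject region, and the rest as the background.
    """
    sharpness = uniform_filter(np.abs(laplace(image_gray.astype(float))), size=window)
    return sharpness >= np.median(sharpness)        # boolean mask of the subject region
```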

• Clarity contrast: Clarity contrast indicates how easily a specific part of an image can be recognized because of its obvious difference from the image background.

To attract the viewer’s attention to the key point of a photograph and to isolate the subject region from the background, professional photographers normally adjust the lens to keep the subject in focus and the background out of focus. Accordingly, a clear photograph has relatively more high-frequency components than a blurred photograph [153]. To characterize this property, the clarity contrast feature is defined as:

$$f_c = \frac{\|M_S\|_2 / \|S\|_2}{\|M_I\|_2 / \|I\|_2}, \qquad (4.1)$$

where $\|S\|_2$ and $\|I\|_2$ are the areas of the subject region and the original photograph, respectively, and

$$M_I = \{(m, n) \mid F_I(m, n) > \gamma \max F_I(m, n)\}, \qquad (4.2)$$

$$M_S = \{(m, n) \mid F_S(m, n) > \gamma \max F_S(m, n)\}, \qquad (4.3)$$

$$F_I = \mathrm{FFT}(I), \quad F_S = \mathrm{FFT}(S). \qquad (4.4)$$


$\|M_S\|_2/\|S\|_2$ and $\|M_I\|_2/\|I\|_2$ are the ratios of the area of the high-frequency components to the area of all frequency components in $S$ and $I$, respectively. Here, $m$ and $n$ are the spatial frequencies, and $\gamma$ is a predefined threshold set to 0.2 in our experiments. $\mathrm{FFT}(I)$ and $\mathrm{FFT}(S)$ are the fast Fourier transforms computed over $I$ and $S$, respectively. The value of $f_c$ will be high if the subject area of the given image $I$ is in focus and the background is out of focus.

• Hue count: The hue count of an image is a metric of its simplicity and can also be used to evaluate image quality. Although professional photographs appear bright and vivid, their hue count is normally lower than that of amateur photographs. We thus compute the hue count feature of an image using a 20-bin color histogram $H_c$, which is computed on the good hue values. This can be formulated as follows [154]:

$$f_l = 20 - N_c, \qquad (4.5)$$

$$N_c = \left|\{\, i \mid H_c(i) > \beta\, m \,\}\right|, \qquad (4.6)$$

where $N_c$ denotes the number of bins with values larger than $\beta m$, $m$ is the maximum histogram value, and $\beta$ is used to control the noise sensitivity of the hue count. We selected $\beta = 0.05$ in our experiments.

• Brightness contrast: Brightness contrast refers to the difference in brightness between two adjacent surfaces. In high-quality photographs, the brightness of the subject area differs significantly from that of the background because professional photographers frequently use different subject and background lighting. In contrast, most amateurs use natural lighting and allow the camera to adjust the brightness of a picture automatically, which usually reduces the difference in brightness between the subject area and the background. To discern the difference between these two types of photographs, the brightness contrast feature $f_b$ is calculated as [155]:

$$f_b = \ln\!\left(\frac{B_s}{B_b}\right), \qquad (4.7)$$

where $B_s$ and $B_b$ denote the average brightness of the subject region and the background, respectively.

• Color entropy: Because the color planes of drawings and natural photos exhibit distinctive interrelationships, the entropy of the RGB and Lab color space components is calculated to distinguish drawings from natural photos [156].

• Composition geometry: Good geometrical composition is a fundamental requirement for obtaining high-quality photographs. The rule of thirds is one of the most important photographic composition principles used by professional photographers to bring more balance and quality to their photos. If a photo is divided into nine parts of equal size by two equally spaced horizontal lines and two equally spaced vertical lines, the rule of thirds suggests that the subjects of the photo and any important compositional objects should be placed along these lines or their intersections. This is because studies have demonstrated that when viewing images, people usually look at one of the intersection points rather than the center of the image. To formulate this criterion, the composition feature is defined as [155]:

$$f_m = \min_{i} \sqrt{\left(\frac{x_s - x_i}{W}\right)^2 + \left(\frac{y_s - y_i}{H}\right)^2}, \qquad (4.8)$$

where $(x_s, y_s)$ is the center of the subject region, $(x_i, y_i),\ i = 1, \dots, 4$, are the four intersection points in the photo, and $W$ and $H$ are the width and height of the photo.

• Background simplicity: Professional photographers normally maintain simplicity within the shot to enhance the composition of the photo. This is because photographs that are clean and free of distracting backgrounds look more appealing and naturally draw the attention of a viewer to the subject. The color distribution in a simple background tends to be less dispersed. Therefore, we compute the simplicity of the background using the color distribution of the background [157]. First, we consider the regions of an image not determined as a subject region to be the background.

Then, each of the RGB channels of the image is quantized into 16 values, generating a histogram $H_b$ of $16 \times 16 \times 16 = 4096$ bins that records the number of quantized colors present in the background. Thus, the feature that represents the background simplicity of an image can be calculated from $H_b$ as [153]:

$$f_s = \frac{N_b}{4096} \times 100\%, \qquad (4.9)$$

$$N_b = \left|\{\, i \mid H_b(i) \geq \alpha\, h_{\max} \,\}\right|, \qquad (4.10)$$


where $h_{\max}$ is the maximum count among all the bins of the histogram and $\alpha$ is used to control the noise sensitivity. We choose $\alpha = 0.01$ in our experiments.

In this study, we combine the six aesthetic features described above, resulting in an 11-dimensional feature vector; a compact sketch of how these features can be computed is given below.
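The following sketch implements the six aesthetic features from the equations above. It assumes an 8-bit RGB image, a grayscale version of it, and a binary subject mask such as the one sketched earlier; the "good hue" thresholds, the 256-bin histograms used for the entropy feature, and the centroid-based distance in the composition feature are assumptions of this sketch rather than the exact settings of [153]–[157].

```python
import numpy as np
from skimage.color import rgb2hsv, rgb2lab


def clarity_contrast(gray, subject_mask, gamma=0.2):
    """f_c, Eqs. (4.1)-(4.4): high-frequency ratio of the subject vs. the whole image."""
    def hf_ratio(region):
        spectrum = np.abs(np.fft.fft2(region))
        return np.sum(spectrum > gamma * spectrum.max()) / spectrum.size
    rows, cols = np.where(subject_mask)
    subject = gray[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    return hf_ratio(subject) / hf_ratio(gray)


def hue_count(rgb, beta=0.05, n_bins=20):
    """Eqs. (4.5)-(4.6); "good" hues approximated by a saturation/brightness mask."""
    hsv = rgb2hsv(rgb)
    good = (hsv[..., 1] > 0.2) & (hsv[..., 2] > 0.15) & (hsv[..., 2] < 0.95)
    hist, _ = np.histogram(hsv[..., 0][good], bins=n_bins, range=(0.0, 1.0))
    return n_bins - int(np.sum(hist > beta * hist.max()))


def brightness_contrast(gray, subject_mask, eps=1e-8):
    """f_b = ln(B_s / B_b), Eq. (4.7)."""
    return float(np.log((gray[subject_mask].mean() + eps) /
                        (gray[~subject_mask].mean() + eps)))


def color_entropy(rgb):
    """Entropy of each RGB and Lab channel (six values); 256 bins per channel assumed."""
    def entropy(channel):
        hist, _ = np.histogram(channel, bins=256)
        p = hist[hist > 0] / hist.sum()
        return float(-(p * np.log2(p)).sum())
    lab = rgb2lab(rgb)
    return [entropy(rgb[..., c]) for c in range(3)] + \
           [entropy(lab[..., c]) for c in range(3)]


def composition_geometry(subject_mask):
    """f_m, Eq. (4.8): minimum normalized distance from the subject centroid to the
    four rule-of-thirds intersection points."""
    h, w = subject_mask.shape
    rows, cols = np.where(subject_mask)
    cy, cx = rows.mean(), cols.mean()
    points = [(w / 3, h / 3), (2 * w / 3, h / 3),
              (w / 3, 2 * h / 3), (2 * w / 3, 2 * h / 3)]
    return min(np.hypot((cx - px) / w, (cy - py) / h) for px, py in points)


def background_simplicity(rgb, subject_mask, alpha=0.01):
    """f_s, Eqs. (4.9)-(4.10): percentage of occupied bins in a 4096-bin background histogram."""
    bg = (rgb[~subject_mask] // 16).astype(int)     # quantize each channel to 16 levels
    hist = np.bincount(bg[:, 0] * 256 + bg[:, 1] * 16 + bg[:, 2], minlength=4096)
    return int(np.sum(hist >= alpha * hist.max())) / 4096 * 100.0
```

Concatenating the outputs (one value each for clarity contrast, hue count, brightness contrast, composition geometry, and background simplicity, plus the six channel entropies) yields the 11-dimensional aesthetic feature vector mentioned above.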

Fig. 4.1. The plots of deep learning feature vector values of different images from the dataset.

Deep Learning Features

Recently, deep learning methods have been widely used for image representation owing to their effectiveness [108], [111]. In this study, the CNN architecture of the VGG19 model was employed to learn the deep features of photographs [108]. The VGG19 model was trained on 1.2 million images from the ImageNet database to classify these images into 1000 categories [111]. The Keras implementation of the pre-trained VGG19 CNN model [158] was used to extract features from the layer situated immediately prior to the final classification layer (i.e., the last fully connected layer, fc7). The output of this layer is a 4,096-dimensional feature vector. A few images selected from the dataset and the plots of their respective deep feature vector values are shown in Fig. 4.1.
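A minimal sketch of this extraction step using the tf.keras implementation of VGG19 is shown below. The 4,096-dimensional layer referred to as fc7 above is exposed under the name 'fc2' in this implementation; the exact framework version and preprocessing used in [158] may differ.

```python
import numpy as np
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Load VGG19 pre-trained on ImageNet and truncate it at its second fully connected
# layer (named 'fc2' here), whose output is a 4096-dimensional vector per image.
base = VGG19(weights="imagenet")
feature_extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)


def deep_features(image_path):
    """Return the 4096-D deep feature vector of one image."""
    img = image.load_img(image_path, target_size=(224, 224))   # VGG19 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feature_extractor.predict(x, verbose=0)[0]           # shape: (4096,)
```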