3.5 Measuring Annotation Quality for Determining Effective Training Images

The annotation quality of a training image candidate is the likelihood that the candidate is correctly labeled with the target facial attribute. Candidates with higher annotation quality have higher priority to be chosen as training images.

To begin with, we measure the annotation quality from the degree of visual relevance.

The annotation quality is then optimized using both the visual relevance and the textual relevance (estimated by Eq. 3.1) to prevent the training images from being dominated by a few particular visual appearances. To train generalized facial attributes, we further measure the relative visual relevance within each geographic location so that facial images are drawn more uniformly from around the world.

3.5.1 Measuring Visual Relevance

A facial image carries essential information for evaluating annotation quality for a designated attribute; hence we measure the visual relevance v_k of a facial image to represent the likelihood that it belongs to an attribute category, based on the visual modality only.

After the voting process described in Sec. 3.4.4, each image candidate is assigned M×K⁺×K⁻ labeled votes. From the assigned labels, the visual relevance v_k of the k-th candidate is measured as follows:

$$ v_k = \sum_{m=1}^{M} x_m \, v^{a}_{k,m} \left( v^{+}_{k,m} - v^{-}_{k,m} \right), \tag{3.6} $$

where x_m are the feature weights measured by Eq. 3.4, which indicate the effectiveness of each feature, and v^+_{k,m}, v^-_{k,m} and v^a_{k,m} are the accumulated positive votes (label 1), negative votes (label -1) and abstained votes (label 0) for the k-th candidate in the m-th feature space, respectively (cf. Algorithm 1). All the accumulated votes are normalized to [0, 1].
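As a minimal sketch of Eq. 3.6, the weighted sum can be computed for all candidates at once. The function name `visual_relevance`, the array layout (one row per candidate, one column per feature space) and the example numbers are assumptions made for illustration; the votes are assumed to have been normalized to [0, 1] beforehand.

```python
import numpy as np

def visual_relevance(x, v_pos, v_neg, v_abs):
    """Eq. 3.6: v_k = sum_m x_m * v^a_{k,m} * (v^+_{k,m} - v^-_{k,m}).

    x                   : shape (M,), feature weights (Eq. 3.4)
    v_pos, v_neg, v_abs : shape (K, M), normalized positive, negative and
                          abstained votes per candidate and feature space
    Returns an array of shape (K,), one visual relevance per candidate.
    """
    x = np.asarray(x, dtype=float)
    v_pos = np.asarray(v_pos, dtype=float)
    v_neg = np.asarray(v_neg, dtype=float)
    v_abs = np.asarray(v_abs, dtype=float)
    # Per-feature term, then a weighted sum over the M feature spaces.
    return (v_abs * (v_pos - v_neg)) @ x

# Made-up votes for K = 2 candidates and M = 2 feature spaces.
x = [0.5, 0.5]
v_pos = [[0.6, 0.5], [0.2, 0.1]]
v_neg = [[0.1, 0.1], [0.6, 0.5]]
v_abs = [[0.3, 0.4], [0.2, 0.4]]
scores = visual_relevance(x, v_pos, v_neg, v_abs)
ranking = np.argsort(-scores)  # candidate indices, most relevant first
```

In this toy input the first candidate receives mostly positive votes and the second mostly negative ones, so the first is ranked higher.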

The difference (v^+_{k,m} − v^-_{k,m}) favors images that are more likely to be positives than negatives. However, it is also important to choose the most informative examples for learning a function. One interpretation is to choose the examples with high uncertainty, as in the uncertainty sampling strategy of active learning [71]. Considering this trade-off, we use v^a_{k,m}, the uncertainty of classifying a candidate, to moderately encourage candidates that carry informative cues for classification. In the following processes, the visual relevance v_k is used to rank the annotation quality of the faces from the visual aspect.

3.5.2 Combining Textual and Visual Relevance

Examining the visual relevance of candidates can suppress false positives, but it sacrifices the diversity in visual appearance that is essential for collecting training images of a generalized facial attribute. Balancing the visual relevance v_k and the textual relevance t_k (semantic relevance), we refine the annotation quality score p_k for each candidate face through the optimization formulation of Eq. 3.7.

Figure 3.5: Inherently uneven distribution in user-contributed photos: the data skew in web images due to the huge gap in Internet usage, e.g., USA (239 million users) vs. Tanzania (0.6 million users) [2].

In Eq. 3.7, the first term measures the error between the estimated annotation quality and the textual relevance. The second term corrects possibly erroneous annotations using the visual relevance, and the last term is for regularization. The equation favors candidates with higher visual relevance v_k. The parameters α and γ control the effect of the visual relevance and prevent overfitting, respectively. They are further investigated in Sec. 3.6.2 to maximize system performance. The equation can be solved by gradient descent, which iteratively updates the annotation quality p_k starting from an arbitrary vector. Here p_k is the annotation quality of the k-th candidate image, that is, the correlation between the designated facial attribute and the facial attribute in the candidate image. The higher p_k is, the better the annotation quality of the k-th candidate image, and candidates with higher annotation quality are chosen as training images.
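The gradient-descent solution can be sketched as follows. The sketch assumes that Eq. 3.7 has the same shape as the refined objective in Eq. 3.9, with α in place of β and v_k in place of g_k, i.e. a sum over k of (1−α)(p_k − t_k)² − α v_k p_k + γ p_k². This form is inferred from the three terms described above and is an assumption. The function names, step size, iteration count and example values are likewise hypothetical.

```python
import numpy as np

def refine_quality(t, v, alpha=0.5, gamma=0.1, lr=0.1, n_iter=500):
    """Gradient descent on the ASSUMED objective
    sum_k [(1 - alpha) (p_k - t_k)^2 - alpha v_k p_k + gamma p_k^2]."""
    t = np.asarray(t, dtype=float)
    v = np.asarray(v, dtype=float)
    p = np.zeros_like(t)  # arbitrary starting vector
    for _ in range(n_iter):
        # Derivative of the assumed objective with respect to each p_k.
        grad = 2.0 * (1.0 - alpha) * (p - t) - alpha * v + 2.0 * gamma * p
        p = p - lr * grad
    return p

def closed_form(t, v, alpha=0.5, gamma=0.1):
    """Stationary point of the same assumed objective (gradient set to zero)."""
    t = np.asarray(t, dtype=float)
    v = np.asarray(v, dtype=float)
    return (2.0 * (1.0 - alpha) * t + alpha * v) / (2.0 * (1.0 - alpha + gamma))

# Made-up textual and visual relevance values for three candidates.
t = [0.9, 0.2, 0.5]
v = [0.1, 0.8, 0.5]
p_gd = refine_quality(t, v)
p_cf = closed_form(t, v)
```

Under this assumed form the objective is a separable convex quadratic, so the iterates converge to the closed-form stationary point for a sufficiently small step size; the iterative version is shown because it is the procedure the text describes.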

3.5.3 Considering Geo-locations

The statistics of global Internet users [2] reveal a large gap in Internet usage across countries; for example, 239 million users in the USA versus 0.6 million in Tanzania. The numbers of community-contributed photos are similarly skewed across countries. Learning from such biased face distributions neglects the generality of facial attributes, since the visual appearances of people from the same area (e.g., Europeans) are probably more similar to each other than to those of people from other areas (e.g., Asians). Though many applications only concern specific groups (usually the majority), the proposed approach aims to deal with more general cases in real life. Reducing the geographic bias is therefore critical for enhancing generalization. To tackle the problem, we divide the world into equal grids, where the grids containing the continents (solid-line rectangles in Fig. 3.5) are preserved as individual location groups (34 groups in total) and the other grids (dashed-line rectangles in Fig. 3.5) are aggregated into a single location group. We evaluate the relative visual relevance, which is the Borda rank [9] of a training face candidate within the location group assigned by the location context (e.g., GPS) of the photo containing that face. Relative visual relevance favors the training image candidates with higher visual relevance within each location group and therefore prevents the training candidates from being dominated by appearances from the same places.

Given a training face candidate with visual relevance v_k in a location grid G, the relative visual relevance g_k in its location group is measured by the following equation:

$$ g_k = 1 - \frac{B_G(v_k)}{|G|}. \tag{3.8} $$

B_G(v_k) is the number of image candidates in G whose visual relevance is larger than v_k, where B_G(v_k) ∈ {0, 1, 2, ..., |G|−1}, and |G| is the number of photos collected in the location grid. To prevent the training images from including too many photos from the same location grid, we limit |G| to the number of required training images divided by the total number of location grids. With this arrangement, the faces with higher visual relevance within a group receive higher relative visual relevance according to their location context, so only a few photos in each location group have the opportunity to be chosen as training data. The relative visual relevance g_k is further integrated into the annotation quality measurement to introduce more locational diversity into the acquired training images. The optimization formulation in Eq. 3.7 is refined as follows:

$$ \min_{p} \sum_{k=1}^{K} \left[ (1-\beta)(p_k - t_k)^2 - \beta g_k p_k + \gamma \lVert p_k \rVert^2 \right]. \tag{3.9} $$

A sample with higher relative visual relevance g_k has higher priority to be selected as a training sample in that area. Moreover, β and γ are the parameters that control the influence of the geo-context and the regularization, respectively. These parameters are further investigated in Sec. 3.6.3 to analyze the merits and limitations of the location contexts.
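The relative visual relevance of Eq. 3.8, which enters Eq. 3.9 as g_k, can be sketched as a per-group ranking. In this illustration the function name `relative_visual_relevance`, the group labels and the example values are hypothetical, and |G| is simply taken to be the number of candidates carrying the same group label; the cap on |G| described above is not applied.

```python
from collections import defaultdict

def relative_visual_relevance(v, groups):
    """Eq. 3.8: g_k = 1 - B_G(v_k) / |G|, computed within each location group.

    v      : visual relevance v_k of each candidate
    groups : location-group label of each candidate (same length as v)
    """
    members = defaultdict(list)
    for k, label in enumerate(groups):
        members[label].append(k)

    g = [0.0] * len(v)
    for idx in members.values():
        size = len(idx)  # |G| (assumed: all candidates in the group)
        for k in idx:
            # B_G(v_k): candidates in the same group with strictly larger v.
            b = sum(1 for j in idx if v[j] > v[k])
            g[k] = 1.0 - b / size
    return g

# Made-up example: three candidates in group "A", two in group "B".
v = [0.9, 0.5, 0.7, 0.2, 0.4]
groups = ["A", "A", "A", "B", "B"]
g = relative_visual_relevance(v, groups)
```

The best candidate of each group receives g_k = 1 regardless of its absolute visual relevance, so a sparsely photographed region is not crowded out by a densely photographed one. The quadratic counting loop keeps the sketch short; sorting each group once would scale better for large groups.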
