Beyond the low-level features commonly used for face recognition, the rich set of facial attributes, such as gender, race, age, beard, and smile, has been shown to be very promising for characterizing designated persons [41] as well as for identity verification [42]. Moreover, facial attributes make photo management easier. Lei et al. [45] designed an efficient framework to retrieve photos of target persons by graphically placing face icons with attributes on a query canvas. In addition, the statistics of (automatically detected) facial attributes from certain user groups (e.g., young girls) can approximate users’ preferences. Cheng et al. [21] proposed a travel recommender that mines people attributes from community-contributed photos. Combined with time, location, and other metadata, these plentiful facial attributes greatly benefit the mining of consumer activities from large-scale and loosely organized photo collections.

Prior research on facial attribute detection [54, 5, 41] relied solely on supervised learning with manually annotated training photos, which is very time-consuming and labor-intensive. On average, manual annotation requires 5-6 seconds to tag a photo, or 15 seconds through gaming-based annotation [82]. Furthermore, manual annotation is subjective and biased; for example, it may be restricted to limited domains or locations. The problems get worse when preparing to analyze a large set of facial attributes as proposed in [41, 42].

Figure 3.1: Goal: automatically acquiring training images for generic facial attribute detection by leveraging visual and contextual cues from publicly available community-contributed photos in an unsupervised manner. (a) Besides visual appearances in Internet images, rich contextual cues such as tags and geo-locations are promising for easing the data bias problem in training facial attributes. Though contextual cues are noisy (e.g., the crossed tags), we aim to mine effective training images from them. (b) Combining visual relevance and contexts to rank effective and visually diverse training images. (c) Learning and detecting generic facial attributes from the automatically acquired training images for each attribute.

With the prevalence of capturing devices and photo sharing services such as Flickr and YouTube, the volume of multimedia resources has increased dramatically. There are reportedly more than four billion images on Flickr, and more than 70,000 TB of broadcast video data are generated every year [51]. Such ultra-large-scale multimedia has a profound social impact and has the potential to ease the burden of large-scale training image acquisition [27, 63, 55] by means of freely available user-contributed data. In this work, we aim to acquire effective training images from community-contributed photos for facial attribute detection. This is promising because social media are full of user activities, recorded in photos associated with tags, comments, locations, etc. However, simply acquiring training images by keywords (e.g., “beard”) brings a significant number of false positives due to uncontrolled annotation quality; learning with such noisy data degrades the accuracy of facial attribute detectors.

Figure 3.2: The framework to automatically acquire training images for learning generic facial attributes includes: (a) harvesting photos and the associated context information (e.g., tags, GPS) from community-contributed photos by keyword queries as the initial candidates, (b) extracting the visual features from the detected (frontal) faces and the context features from the associated text as well as geo-locations, (c) measuring feature quality according to the discriminability voting results from image candidates over multiple visual feature spaces, (d) optimizing the feature set of a designated attribute for measuring the visual relevance, (e) fusing the visual relevance (estimated in (d)) and the contextual cues extracted in (b) to estimate and rank the annotation quality, and (f) learning generic facial attributes from the automatically acquired training images.

Given an effective feature representation (e.g., supervector [83]) for a designated facial attribute (e.g., age), examining visual relevance has been shown to be promising for rejecting certain false positives in previous research [55]. In reality, however, users cannot be expected to know in advance which features are important for a designated attribute; for example, edge features for detecting eyeglasses [80] and texture features for estimating age [34].

To make automatic training image acquisition adaptive to various facial attributes, we propose to automatically select effective features from a rich set of visual features, which are potential feature candidates for different facial attributes. The proposed feature selection mechanism first measures the discriminant capability of each visual feature by discriminability voting, that is, voting upon unlabeled images by pseudo-positives (and pseudo-negatives) retrieved by textual relevance. It then selects effective features according to the estimated discriminant capability and the degree of mutual similarity. Discriminability voting can reduce the interference of noisy labels in the training images and does not require heuristic thresholds. Therefore, it generalizes better across multiple feature modalities. Another critical deficiency in prior research is the rejection of false positives by visual relevance only (e.g., [27, 55]), because that may cause the set of acquired training images to be dominated by color or other visual features (cf. Fig. 3.8(b)).

Such images do bring marginal improvement for learning facial attribute detectors, but they may cause data skew at the same time. Therefore, we propose to exploit the rich context cues (e.g., tags, geo-locations, etc.) that accompany community-contributed photos to increase the diversity of the training images (Fig. 3.8(c)). The proposed approach is conducted in an unsupervised manner and, most importantly, can be applied to different facial attributes.
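The feature selection idea described above (vote on unlabeled images using pseudo-labeled neighbors, then prefer features that are discriminative and not mutually similar) can be illustrated with a toy sketch. This is not the formulation used in this work: the k-nearest-neighbor vote, the decisiveness score, the vote-agreement similarity, the greedy selection rule, and all names and parameters (`knn_votes`, `discriminability`, `select_features`, `k`, `redundancy_weight`) are assumptions made only for illustration, and the data are synthetic.

```python
import numpy as np


def knn_votes(unlabeled, pseudo_pos, pseudo_neg, k=5):
    """Fraction of pseudo-positives among the k nearest pseudo-labeled neighbors."""
    pool = np.vstack([pseudo_pos, pseudo_neg])
    labels = np.concatenate([np.ones(len(pseudo_pos)), np.zeros(len(pseudo_neg))])
    # Pairwise squared Euclidean distances, shape (n_unlabeled, n_pool).
    dists = ((unlabeled[:, None, :] - pool[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return labels[nearest].mean(axis=1)


def discriminability(votes):
    """Assumed score: mean decisiveness of votes (0 = coin flip, 1 = unanimous)."""
    return float(np.abs(2.0 * votes - 1.0).mean())


def vote_similarity(a, b):
    """Assumed mutual similarity: agreement between two features' vote vectors."""
    return 1.0 - float(np.abs(a - b).mean())


def select_features(votes_by_feature, n_select=1, redundancy_weight=0.5):
    """Greedy pick: high discriminability, low similarity to already chosen features."""
    scores = {name: discriminability(v) for name, v in votes_by_feature.items()}
    selected = []
    while len(selected) < min(n_select, len(scores)):
        best_name, best_value = None, -np.inf
        for name, votes in votes_by_feature.items():
            if name in selected:
                continue
            redundancy = max(
                (vote_similarity(votes, votes_by_feature[s]) for s in selected),
                default=0.0,
            )
            value = scores[name] - redundancy_weight * redundancy
            if value > best_value:
                best_name, best_value = name, value
        selected.append(best_name)
    return selected, scores


def demo():
    """Synthetic example: 'texture' separates the classes, 'color' is pure noise."""
    rng = np.random.default_rng(0)
    dim = 4
    tex_pos = rng.normal(2.0, 1.0, (30, dim))
    tex_neg = rng.normal(-2.0, 1.0, (30, dim))
    tex_unl = np.vstack([rng.normal(2.0, 1.0, (20, dim)),
                         rng.normal(-2.0, 1.0, (20, dim))])
    col_pos = rng.normal(0.0, 1.0, (30, dim))
    col_neg = rng.normal(0.0, 1.0, (30, dim))
    col_unl = rng.normal(0.0, 1.0, (40, dim))
    votes = {
        "texture": knn_votes(tex_unl, tex_pos, tex_neg),
        "color": knn_votes(col_unl, col_pos, col_neg),
    }
    return select_features(votes, n_select=1)


if __name__ == "__main__":
    print(demo())
```

In this sketch no threshold is applied to individual votes; a feature space is scored only by how decisively the pseudo-labeled neighbors vote in it, which loosely mirrors the threshold-free property claimed for discriminability voting.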

For the proposed framework, as shown in Fig. 3.1, we first measure the quality of each visual feature given a noisy set of keyword-retrieved (e.g., “beard”) training image candidates. Optimized by discriminability and mutual similarity, the selected features are then used to evaluate the annotation quality of the training image candidates from the visual aspect. Second, context information is incorporated to ensure the diversity and the quality of the automatically collected training images. Experiments show that the proposed method, which balances visual and context cues, outperforms two baseline approaches: (1) measuring textual relevance (text-based) and (2) measuring visual reconstruction error via Principal Component Analysis (PCA-based). The error rates are reduced by up to 23.24% and up to 38.50% (relative improvement), respectively. More excitingly, we found that the facial attribute detectors trained by the proposed method are competitive with those trained on manually annotated photos. Note that our work requires no manually collected training images; instead, it automatically mines semantically related training images from the initial candidate photos and their associated metadata retrieved by facial attribute keywords. The primary contributions of the work include:

• Devising a generic framework for learning numerous facial attributes by automatically acquiring training images from freely available and growing community-contributed photos without tedious manual annotations.

• Proposing a robust-to-noise feature selection approach by discriminability voting to measure visual relevance adaptive to different facial attributes (Sec. 3.4).

• Balancing visual relevance and contextual cues along with community-contributed photos to optimize automatic training image acquisition (Sec. 3.5).

• Experimenting on consumer photo benchmarks and showing great improvement in accuracy for facial attribute detection and superiority to its counterpart which requires costly manual annotations (Sec. 3.6).
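The relative improvements quoted for the two baselines can be read under the usual definition of relative error reduction, sketched below. The symbols $e_{\text{base}}$ and $e_{\text{prop}}$ (error rates of a baseline and of the proposed method on the same test set) are notation assumed here, not taken from the original text.

```latex
\[
\text{relative error reduction}
  = \frac{e_{\text{base}} - e_{\text{prop}}}{e_{\text{base}}} \times 100\%
\]
```

As a purely hypothetical illustration, a baseline error rate of 20% reduced to 15% corresponds to a relative reduction of 25%, although the absolute reduction is only 5 percentage points.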