
Scalable Face Track Retrieval in Video Archives using Bag-of-Faces Sparse Representation

Bor-Chun Chen, Yan-Ying Chen, Yin-Hsi Kuo, Thanh Duc Ngo, Duy-Dinh Le, Shin’ichi Satoh, Winston H. Hsu

Abstract—Huge video archives consisting of news programs, dramas, movies, and web videos (e.g., YouTube) are available in our daily life. In all these videos, humans are usually among the most important subjects. Using state-of-the-art techniques, we can efficiently detect and track faces in the videos. To organize large-scale face tracks, i.e., sequences of consecutive detected faces in the videos, we propose an efficient method to retrieve human face tracks using a bag-of-faces sparse representation. With the proposed method, a face track is encoded as a single bag-of-faces sparse representation, which allows efficient indexing methods to handle large-scale data. To further account for the possible variations within face tracks, we generalize our method to find multiple sparse representations, in an unsupervised manner, to represent a bag of faces and to balance the trade-off between performance and retrieval time. Experimental results on two real-world (million-scale) datasets confirm that the proposed methods achieve significant performance gains compared to different state-of-the-art methods.

Index Terms—Face Track Retrieval, Bag-of-Faces Sparse Representation, Multiple Sparse Representations

I. INTRODUCTION

Huge collections of videos are generated every day in the form of news programs, dramas, movies, web videos, family recordings, etc. How to efficiently manage and mine information from these videos is an important topic for many researchers. In all of these videos, humans are usually among the most important subjects; therefore, many studies focus on processing human faces (e.g., retrieval, recognition, annotation) in videos [1], [2], [3], [4].

Unlike traditional face recognition in still images, face recognition in videos can benefit from additional temporal redundancy, because faces detected in consecutive frames at similar locations usually belong to the same person. Using this extra information, face recognition based on sets of images is applied to improve accuracy. Such a face sequence detected from a video can be regarded as a face track or a bag of faces.

With the explosive growth of videos, an emerging research direction beyond face recognition is content-based face track retrieval [5], [6]. However, most of the existing face recognition methods for image sets rely on complex distance measures between two sets of faces and therefore cannot

B.-C. Chen (e-mail: [email protected]) and Y.-Y. Chen (e-mail: [email protected]) are with the Department of Computer Science and Information Engineering, National Taiwan University. Y.-H. Kuo (e-mail: [email protected]) and W. H. Hsu (e-mail: win- [email protected]) are with the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan. T. D. Ngo (e-mail: [email protected]), D.-D. Le (e-mail: [email protected]) and S. Satoh (e-mail: [email protected]) are with National Institute of Informatics, Tokyo, Japan. Prof. Hsu is the contact person.

[Figure; recovered labels: Bag of faces, (Bag-of-faces) sparse codewords, Inverted index]

4 JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

video or TV series. Because they rely on transcript data and temporal information for people search, those approaches are hard to apply to user-contributed videos associated with noisy (or missing) name labels. In addition, these past works do not discuss the scalability issue of processing large-scale video collections.

As the number of videos increases, the total computation time could also increase exponentially, which limits the practicability of applying these previous search approaches to user-contributed videos.

To solve these problems, we propose an unsupervised people search method for user-contributed videos that leverages a face graph to correlate video segments with informative people names. The face graph is constructed by generating local face clusters with affinity propagation for each video. On this graph, the problem of missing and noisy people labels originally associated with the videos can be remedied by propagating weighted names from neighboring clusters. In addition, the representative faces selected by affinity propagation when generating the face clusters are used to reduce the computation required during face matching. The representative faces also provide robust face matching results.

To address scalability in graph construction, we apply hash-based methods, which dramatically reduce computation time and make this approach more applicable to large-scale search. Moreover, a people disambiguation process is applied to resolve the ambiguity that arises when several people share the same name. This process allows users to easily browse video segments for different people even if they have the same name.
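The text above does not say which hash-based methods are used. As a minimal sketch of the general idea only, the following example assumes random-hyperplane locality-sensitive hashing: feature vectors that fall into the same hash bucket become candidate pairs, so similarities need to be computed only within buckets rather than for all pairs. The function names, the number of bits, and the seed are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed scheme, not from the source)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_vector(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def build_buckets(vectors, hyperplanes):
    """Group vector indices by hash code."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(hash_vector(vec, hyperplanes), []).append(idx)
    return buckets


def candidate_pairs(buckets):
    """Only vectors sharing a bucket are compared when building graph edges."""
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```

With a scheme like this, the cost of graph construction depends on bucket sizes rather than on the square of the number of clusters, at the price of possibly missing some similar pairs.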

In summary, compared with past work on people search in videos, our method differs in the following ways:

• The largest and noisiest dataset. The total duration of all videos is more than 250 hours. All videos are collected from user-contributed video websites, which makes it the noisiest dataset.

• Transcript information is not available. The only text information used is the sparse and erroneous text attached to each video, rather than the complete transcripts with abundant temporal information used in other works.

• Associating people's faces and names across multiple videos. The association can be used to correct the poor labels attached to other videos.

• Applying hash-based methods to reduce the computation time, which allows our method to be applied to large-scale datasets.

• Resolving the ambiguity that arises when several people share the same name by applying a people disambiguation process.

III. NAME PROPAGATION ON FACE GRAPH

Figure 3 shows the steps of our name propagation method. After collecting videos from the web, we detect the frontal faces and extract candidate names associated with the videos. For each video, a local clustering process is applied to cluster the duplicate and near-duplicate faces. The people names extracted for the video are assigned to its local clusters as initial people labels. We then construct a similarity graph over all videos: the nodes represent local face clusters, and the edge weights are the similarities between clusters. Once the graph is constructed, the name propagation algorithm computes the likelihood of people names for each cluster by propagating names, weighted by cluster similarities, from neighboring clusters. Note that all of the above steps are unsupervised. The following sections describe the details.

Fig. 4. Face processing steps to extract candidate frontal faces. Faces are detected from videos. Then ASM [19] is applied to remove faces without apparent facial features. Selected faces are aligned at the same eye level. Color histograms of the left and right sides of the face are compared to choose symmetrical faces. The remaining faces are converted to grayscale and histogram-equalized before the LBP features are extracted.
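The exact propagation rule is not given in this overview. As a rough sketch of what propagating names weighted by cluster similarities could look like, the example below assumes a standard label-propagation style update: each cluster mixes its own initial name scores with the similarity-weighted average of its neighbors' scores. The mixing factor `alpha`, the iteration count, and all identifiers are assumptions for illustration.

```python
def propagate_names(initial, edges, alpha=0.5, iterations=20):
    """Iteratively spread name scores over a cluster similarity graph.

    initial: dict cluster_id -> dict name -> initial score
    edges:   dict (cluster_a, cluster_b) -> similarity (undirected);
             both endpoints must be keys of `initial`
    alpha:   assumed weight given to the neighbors' scores
    """
    neighbors = {c: {} for c in initial}
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    scores = {c: dict(names) for c, names in initial.items()}
    for _ in range(iterations):
        updated = {}
        for c in initial:
            total = sum(neighbors[c].values())
            if total <= 0:
                # Isolated cluster: nothing to propagate, keep its own labels.
                updated[c] = dict(initial[c])
                continue
            # Similarity-weighted average of the neighbors' current scores.
            mixed = {}
            for nb, weight in neighbors[c].items():
                for name, score in scores[nb].items():
                    mixed[name] = mixed.get(name, 0.0) + (weight / total) * score
            updated[c] = {
                name: (1 - alpha) * initial[c].get(name, 0.0)
                + alpha * mixed.get(name, 0.0)
                for name in set(mixed) | set(initial[c])
            }
        scores = updated
    return scores
```

In this sketch, a cluster with no initial label ends up scored mostly by its most similar labeled neighbors, which is the behavior the text describes for remedying missing names.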

A. Frontal Face Feature and Name Extraction

As Figure 3 shows, we apply a sequence of procedures to detect the (frontal) candidate faces for matching. This is reasonable because the faces detected in videos vary in pose and lighting, which makes them hard to match directly. First, we run an AdaBoost-like face detection algorithm on every fifth frame. We remove non-frontal faces in order to improve the matching quality between face images. Second, we apply the active shape model (ASM) [19] to every candidate face and filter out faces whose facial features cannot be located correctly. ASM is a statistical model widely used to localize facial feature points (e.g., eyes and mouth). Using the facial feature points extracted by ASM, we rectify each face by rotating it so that it is horizontal. Third, we align the faces at the same eye level and resize them to 144x144 pixels. Fourth, we compare the color histograms of the left and right sides of each face to further remove faces that are not symmetric. Finally, we normalize the rectified and aligned faces with histogram equalization to reduce the effect of lighting variation on subsequent feature extraction.
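The fourth and final steps can be illustrated with a small sketch. It assumes a single-channel image stored as a list of rows of integers in 0..255, whereas the text compares color histograms; the histogram-intersection measure, the bin count, and the function names are likewise assumptions rather than details from the paper.

```python
def histogram(pixels, bins=16, max_value=256):
    """Normalized intensity histogram of a flat list of pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    return [c / len(pixels) for c in counts]


def symmetry_score(image, bins=16):
    """Histogram intersection of the left and right halves (1.0 = identical)."""
    width = len(image[0])
    half = width // 2
    left = [p for row in image for p in row[:half]]
    right = [p for row in image for p in row[width - half:]]
    return sum(min(a, b) for a, b in zip(histogram(left, bins),
                                         histogram(right, bins)))


def equalize(image, levels=256):
    """Histogram equalization: remap intensities through the cumulative histogram."""
    flat = [p for row in image for p in row]
    counts = [0] * levels
    for p in flat:
        counts[p] += 1
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running)
    cdf_min = next(v for v in cdf if v > 0)
    n = len(flat)
    if n == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in image]
    scale = (levels - 1) / (n - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in image]
```

A face would be kept when `symmetry_score(face)` exceeds some threshold; the threshold value is not stated in the text and would need to be tuned.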

To represent a face, we use the local binary pattern (LBP) [9] feature. LBP is an efficient and effective face feature widely used in face classification. Although other features perform better than LBP, most of them take longer to extract. Because the choice of face feature is not the issue we address in this work, we simply choose LBP for its efficiency and effectiveness. Every face image in our dataset is represented as a 4860-dimensional LBP feature vector.
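For readers unfamiliar with LBP, the following is a minimal sketch of the basic 8-neighbor operator with per-cell histograms. The text does not give the grid layout or LBP variant that yields 4860 dimensions, so the 256-bin histograms and the `grid` parameter here are illustrative assumptions and do not reproduce that dimensionality.

```python
def lbp_codes(image):
    """Basic 8-neighbor LBP code for every interior pixel of a grayscale image."""
    height, width = len(image), len(image[0])
    # Neighbors visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            center = image[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright as the center.
                if image[y + dy][x + dx] >= center:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def lbp_feature(image, grid=2):
    """Concatenate 256-bin LBP histograms over a grid x grid layout of cells."""
    codes = lbp_codes(image)
    height, width = len(codes), len(codes[0])
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0] * 256
            for y in range(gy * height // grid, (gy + 1) * height // grid):
                for x in range(gx * width // grid, (gx + 1) * width // grid):
                    hist[codes[y][x]] += 1
            feature.extend(hist)
    return feature
```

Concatenating per-cell histograms keeps coarse spatial layout, so two faces match well only if similar local patterns occur in similar regions.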

Because complete transcripts are not available for the videos, typical named-entity detection cannot be applied for name extraction. We therefore collect names of celebrities from [20], [21] and build a name collection. The candidate people names for each video are then extracted simply by matching the text attached to the video against the name collection.
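The matching step can be sketched as a simple dictionary lookup. The normalization (lowercasing, collapsing punctuation to spaces) and the whole-word matching rule below are assumptions, since the text only says that the attached text is matched against the name collection.

```python
import re


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run into a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def extract_candidate_names(attached_text, name_collection):
    """Return the names from the collection that occur as whole words in the text."""
    padded = " " + normalize(attached_text) + " "
    matches = []
    for name in name_collection:
        key = normalize(name)
        # Padding with spaces prevents matches inside longer words.
        if key and " " + key + " " in padded:
            matches.append(name)
    return matches
```

For example, with a hypothetical collection `["Barack Obama", "Tom Hanks"]`, a video titled "barack obama speech, funny!!" would receive "Barack Obama" as its only candidate name.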

