Scalable Face Track Retrieval in Video Archives using Bag-of-Faces Sparse Representation
Bor-Chun Chen, Yan-Ying Chen, Yin-Hsi Kuo, Thanh Duc Ngo, Duy-Dinh Le, Shin’ichi Satoh, Winston H. Hsu
Abstract—Huge video archives consisting of news programs, dramas, movies, and web videos (e.g., YouTube) are available in our daily lives. In all these videos, humans are usually among the most important subjects. Using state-of-the-art techniques, we can efficiently detect and track faces in videos. To organize large-scale collections of face tracks, each a sequence of consecutive detected faces in a video, we propose an efficient method for retrieving face tracks using a bag-of-faces sparse representation. With the proposed method, a face track is encoded as a single bag-of-faces sparse representation, which allows efficient indexing methods to handle large-scale data. To account for possible variations within face tracks, we generalize our method to find multiple sparse representations, in an unsupervised manner, that represent a bag of faces and balance the trade-off between performance and retrieval time. Experimental results on two real-world (million-scale) datasets confirm that the proposed methods achieve significant performance gains over several state-of-the-art methods.
Index Terms—Face Track Retrieval, Bag-of-Faces Sparse Representation, Multiple Sparse Representations
I. INTRODUCTION
Huge collections of videos are generated every day in the form of news programs, dramas, movies, web videos, family recordings, etc. How to efficiently manage and mine information from these videos is an important topic for many researchers. In all of these videos, humans are usually among the most important subjects; therefore, many studies focus on processing human faces (e.g., retrieval, recognition, and annotation) in videos [1], [2], [3], [4].
Unlike traditional face recognition in still images, face recognition in videos can benefit from additional temporal redundancy, because faces detected in consecutive frames at similar locations usually belong to the same person. Using this extra information, face recognition based on sets of images can be applied to improve accuracy. Such a sequence of faces detected in a video can be regarded as a face track, or bag of faces.
With the explosive growth of video collections, research is emerging, beyond face recognition, on content-based face track retrieval [5], [6]. However, most existing face recognition methods for image sets rely on complex distance measures between two sets of faces and therefore cannot
B.-C. Chen and Y.-Y. Chen are with the Department of Computer Science and Information Engineering, National Taiwan University. Y.-H. Kuo and W. H. Hsu are with the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan. T. D. Ngo, D.-D. Le, and S. Satoh are with the National Institute of Informatics, Tokyo, Japan. Prof. Hsu is the contact person.
4 JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
video or TV series. Because they rely on transcript data and temporal information for people search, those approaches are hard to apply to user-contributed videos, whose name labels are noisy or missing. In addition, these past works do not discuss scalability when processing large-scale video collections. As the number of videos increases, the total computation time can increase exponentially, which limits the practicality of applying these earlier search approaches to user-contributed videos.
To solve these problems, we propose an unsupervised people search method for user-contributed videos that leverages a face graph to correlate video segments with informative people names. The face graph is constructed by generating local face clusters for each video with affinity propagation. On this graph, missing and noisy people labels originally associated with the videos can be remedied by propagating weighted names from neighboring clusters. In addition, the representative faces selected by affinity propagation during clustering are used to reduce the computation required for face matching, and they also provide robust face matching results.
To address scalability in graph construction, we apply hash-based methods that dramatically reduce computation time and make the approach more applicable to large-scale search. Moreover, a people disambiguation process resolves the ambiguity that arises when several people share the same name. This process allows users to easily browse video segments for different people even if they have the same name.
In summary, compared with past work on people search in videos, our method differs in the following ways:
• The largest and noisiest dataset. The total duration of all videos is more than 250 hours. All videos are collected from user-contributed video websites, which makes this the noisiest dataset.
• Transcript information is not available. The only text information used is the sparse and erroneous text attached to each video, rather than the complete transcripts with abundant temporal information used in other works.
• Associating people's faces and names across multiple videos. The association can be used to correct poor labels attached to other videos.
• Applying hash-based methods to reduce computation time, which allows our method to be applied to large-scale datasets.
• Resolving the ambiguity that arises when several people share the same name by applying a people disambiguation process.
III. NAME PROPAGATION ON FACE GRAPH
Figure 3 shows the steps of our name propagation method. After collecting videos from the web, we detect frontal faces and extract candidate names associated with the videos. For each video, a local clustering process groups duplicate and near-duplicate faces. The people names extracted for the video are assigned to its local clusters as initial people labels. We then construct a similarity graph over all videos: the nodes represent local face clusters, and the edge weights are the similarities between clusters. Once the graph is constructed, the name propagation algorithm computes the likelihood of people names for each cluster by propagating names, weighted by cluster similarity, from neighboring clusters. Note that all of the above steps are unsupervised. The following sections describe the details.
Fig. 4. Face processing steps to extract candidate frontal faces. Faces are detected in the videos. ASM [19] is then applied to remove faces without apparent facial features. The selected faces are aligned at the same eye level. Color histograms of the left and right sides of each face are compared to select symmetric faces. The remaining faces are converted to grayscale and histogram-equalized before LBP feature extraction.
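As a rough illustration of the name propagation step, the following Python sketch assumes a simple iterative update in which each cluster mixes its own initial name scores with the similarity-weighted scores of its neighbors. The update rule, the mixing parameter `alpha`, the iteration count, and all function and variable names are assumptions made for illustration; the text does not specify the exact propagation formula.

```python
def propagate_names(initial, edges, alpha=0.5, iterations=10):
    """Spread name scores over a cluster graph (illustrative update rule).

    initial: {cluster_id: {name: score}} initial labels from video text
    edges:   {(cluster_a, cluster_b): similarity} undirected, weights > 0
    alpha:   assumed weight kept on a cluster's own initial labels
    """
    # Build a symmetric adjacency list from the undirected edge weights.
    neighbors = {c: {} for c in initial}
    for (a, b), w in edges.items():
        neighbors[a][b] = w
        neighbors[b][a] = w

    scores = {c: dict(names) for c, names in initial.items()}
    for _ in range(iterations):
        updated = {}
        for c in initial:
            total = sum(neighbors[c].values())
            mixed = {n: alpha * s for n, s in initial[c].items()}
            for nb, w in neighbors[c].items():
                for name, s in scores[nb].items():
                    # Neighbor contributions are normalized by total edge weight.
                    share = (1 - alpha) * (w / total) * s
                    mixed[name] = mixed.get(name, 0.0) + share
            updated[c] = mixed
        scores = updated
    return scores


# Toy graph: cluster "c2" has no label of its own but strongly resembles "c1".
initial = {"c1": {"Alice": 1.0}, "c2": {}, "c3": {"Bob": 1.0}}
edges = {("c1", "c2"): 0.9, ("c2", "c3"): 0.1}
result = propagate_names(initial, edges)
best_for_c2 = max(result["c2"], key=result["c2"].get)
```

In this toy graph the unlabeled cluster inherits its strongest name from the neighbor it resembles most, which is the behavior the propagation step is meant to provide for videos with missing labels.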
A. Frontal Face Feature and Name Extraction
As Figure 3 shows, we apply a sequence of procedures to detect candidate frontal faces for matching. This filtering is reasonable because faces detected in videos vary in pose and lighting, which makes them hard to match directly. First, we run an AdaBoost-like face detector on every fifth frame. We remove non-frontal faces to improve the matching quality between face images. Second, we apply an active shape model (ASM) [19] to every candidate face and filter out faces whose facial features cannot be located correctly. ASM is a statistical model widely used to localize facial feature points (e.g., eyes and mouth). Using the facial feature points extracted by ASM, we rectify each face by rotating it until it is level. Third, we align the faces at the same eye level and resize them to 144x144 pixels. Fourth, we compare the color histograms of the left and right sides of each face to further remove faces that are not symmetric. Finally, we normalize the rectified and aligned faces with histogram equalization to reduce lighting variation before feature extraction.
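The symmetry check and the histogram equalization step can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: it works on grayscale intensities rather than color histograms, it uses histogram intersection as the comparison measure, and the bin count and function names are hypothetical. None of these choices is given in the text.

```python
def histogram(pixels, bins=8, max_value=256):
    """Normalized intensity histogram of a flat list of pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    n = len(pixels)
    return [c / n for c in counts]


def symmetry_score(image, bins=8):
    """Histogram intersection between left and right halves (1.0 = identical)."""
    half = len(image[0]) // 2
    left = [p for row in image for p in row[:half]]
    right = [p for row in image for p in row[-half:]]
    h_left, h_right = histogram(left, bins), histogram(right, bins)
    return sum(min(a, b) for a, b in zip(h_left, h_right))


def equalize(image, levels=256):
    """Classic histogram equalization via the cumulative distribution."""
    flat = [p for row in image for p in row]
    counts = [0] * levels
    for p in flat:
        counts[p] += 1
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running)
    cdf_min = next(v for v in cdf if v > 0)
    n = len(flat)
    if n == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    scale = (levels - 1) / (n - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in image]


# Tiny grayscale "faces" (rows of intensities in 0..255).
symmetric = [[10, 200, 200, 10], [10, 200, 200, 10]]
lopsided = [[10, 10, 200, 200], [10, 10, 200, 200]]
equalized = equalize([[50, 60], [70, 80]])
```

A face would be kept only if its symmetry score exceeds some threshold; the threshold value is not stated in the text and would need to be tuned.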
To represent a face, we use the local binary pattern (LBP) [9] feature. LBP is an efficient and effective feature widely used in face classification. Although other features perform better than LBP, most of them take longer to extract. Because the face feature is not the focus of this work, we choose LBP for its efficiency and effectiveness. Every face image in our dataset is represented as a 4860-dimensional LBP feature vector.
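For readers unfamiliar with LBP, the basic operator can be sketched as below. This is the plain 8-neighbor, 256-bin variant, chosen here only for illustration; the exact LBP variant, the grid layout, and the binning that yield the 4860 dimensions are not given in the text, and the function names are hypothetical.

```python
def lbp_codes(image):
    """8-neighbor LBP code for every interior pixel of a grayscale image."""
    # Neighbor offsets in clockwise order starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbor at least as bright as the center sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(image):
    """256-bin histogram of LBP codes, normalized to sum to 1."""
    hist = [0] * 256
    flat = [code for row in lbp_codes(image) for code in row]
    for code in flat:
        hist[code] += 1
    return [h / len(flat) for h in hist]


# Only the three pixels above the center are brighter than it.
patch = [
    [9, 9, 9],
    [1, 5, 1],
    [1, 1, 1],
]
codes = lbp_codes(patch)
```

A full face descriptor is typically formed by splitting the face into a grid of cells and concatenating the per-cell histograms, which is how a fixed-length vector arises from a fixed-size face image.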
Because complete transcripts are not available for the videos, typical named-entity detection cannot be applied for name extraction. We therefore collect names of celebrities from [20], [21] and build a name collection. The candidate people names for each video are then extracted simply by matching the text attached to the video against the name collection.
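The matching of attached text against the name collection might look like the following minimal sketch. The normalization (lowercasing, stripping punctuation, whole-phrase matching) is an assumption, and the names and the function `extract_candidate_names` are hypothetical placeholders rather than entries from the actual collection.

```python
import re


def extract_candidate_names(text, name_collection):
    """Return names from the collection that occur in the text as whole phrases."""
    # Lowercase and keep only alphanumeric tokens so punctuation is ignored.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "
    found = []
    for name in name_collection:
        key = " ".join(re.findall(r"[a-z0-9]+", name.lower()))
        # Space padding prevents partial-word hits such as "ann" in "joanne".
        if key and f" {key} " in padded:
            found.append(name)
    return found


names = ["Jane Doe", "John Roe", "Ann Lee"]
title = "jane doe's interview, with Joanne Lee (HD)"
candidates = extract_candidate_names(title, names)
```

Whole-phrase matching keeps the step simple but inherits the noise of the attached text, which is why the later propagation step is needed to correct the labels.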
4 JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
video or TV series. Due to leveraging the transcript data and temporal information for people search, those approaches are hard to apply to user-contributed videos associated with noisy (or missing) name labels. In addition, these past works do not discuss the scalability issue for processing large scale videos.
When the amount of videos increases, the total computation time could also increases exponentially. It limits the practi- cability of applying these previous searching approaches on user-contributed videos.
In order to solve the mentioned problems, we propose an unsupervised people search method for user-contributed videos by leveraging a face graph to correlate video segments and informative people names. The face graph is constructed by generating local face clusters with affinity propagation for each videos. On such graph, the problem of missing and noisy people labels originally associated with videos can be remedied by propagating weighted names from neighboring clusters. In addition, the representative faces selected by affinity propagation in generating face clusters are used to reduce efficiently the computation during face matching. The representative faces also provide robust face matching results.
Considering the issue of scalability in graph construction, we apply some hash-based methods which dramatically save computation time and make this approach more applicable for large scale searches. Moreover, a people disambiguation process is applied to fix the ambiguity problem when several people share a same name. This process allows user to easily browse video segments for different people even they have same name.
In summary, comparing to past works for people search in videos, our method presents the following differences :
• The largest and noisiest dataset. Total duration of all videos is more than 250 hours. All videos are collected from user-contributed video websites which makes it the noisiest dataset.
• Transcript information is not available. The only text information used is the sparse and erroneous text attached to video, rather than the complete transcripts with abun- dant temporal information as in other works.
• Proposing to associate people faces and names across multiple videos. The association could be used to correct the poor labels attached with other videos.
• Applying hash-methods to reduce the computation time.
It allows our method to apply on large scale dataset.
• Fixing the ambiguity problem that several people share a same name by applying a people disambiguation process.
III. NAME PROPAGATION ON FACE GRAPH Figure 3 shows the steps of our name propagation method.
After collecting videos from web, we detect the frontal faces and extract candidate names associated with the videos. For each video, a local clustering process is applied to cluster the duplicate and near-duplicate faces. Extracted people names for the video are assigned to the local clusters as initial people labels. Then we construct a similarity graph for all videos. The nodes represent local face clusters and the weights of edges between nodes are measured with the similarities between
Fig. 4. Face processing steps to extract candidate frontal faces. Faces are detected from videos. Then ASM [19] is applied to remove faces without apparent facial features. Selected faces are aligned at the same eye level.
Color histograms of the left and right sides of face are compared to choose symmetrical faces. The remaining faces are processed in gray with histogram equalization in order to further extract the LBP features.
clusters. Once the graph is constructed, finally, the name propagation algorithm is applied to compute the likelihood of people names for clusters by propagating names weighted by cluster similarities from neighboring clusters. Note that the above steps are totally unsupervised. In the following sections we will describe the details.
A. Frontal Face Feature and Name Extraction
As Figure 3 shows, we apply a sequence of procedures to detect the (frontal) candidate faces for matching. It is reasonable since all detected faces in videos might be vary from poses and lighting, which make them cannot be used for matching easily. First, we adopt the Adaboost-like algorithm for every 5 frames for face detection. We remove non-frontal faces in order to improve the matching quality between face images. Second, we apply active shape model (ASM) [19] to every possible faces and therefore filter out faces which can not be correctly located facial features. ASM is a statistical model widely used to localize facial feature points (e.g., eyes, mouths, etc.). Leveraging the facial feature points extracted by ASM, we can rectify the faces by rotating the faces horizontally. Third, we align the faces at the same eye level and resize these faces in 144x144 pixels. Fourth, we compare the color histograms of the left and right sides for each face to further remove the faces that are not symmetric. Finally, we normalize the rectified and aligned faces by the process of histogram equalization to ease the light variation in further feature extraction.
To represent a face, we use the local binary pattern (LBP) [9] feature. LBP is an efficient and effective face feature widely used in face classification. Although other features perform better than LBP, most of them take longer to extract. Because the choice of face feature is not the issue addressed in this work, we simply choose LBP for its efficiency and effectiveness. Every face image in our dataset is represented as a 4860-dimensional LBP feature vector.
Because complete transcripts are not available for the videos, typical named-entity detection cannot be applied for name extraction. We therefore collect names of celebrities from [20], [21] and build a name collection. The candidate people names for each video are then extracted simply by matching the text attached to the video against the name collection.
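A minimal sketch of this matching step is given below. It assumes case-insensitive whole-phrase matching after stripping punctuation; the source does not describe the exact matching rule, and the function name `extract_candidate_names` and the example names are hypothetical.

```python
import re


def _normalize(text):
    # Lowercase and keep only alphanumeric tokens, joined by single spaces.
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def extract_candidate_names(video_text, name_collection):
    """Return the names in name_collection that occur in video_text as whole phrases."""
    padded = f" {_normalize(video_text)} "
    found = []
    for name in name_collection:
        key = _normalize(name)
        # Padding with spaces prevents matches inside longer words.
        if key and f" {key} " in padded:
            found.append(name)
    return found
```

With a collection `["Alice Smith", "Bob Lee"]`, the text "Interview: alice smith on tour" yields `["Alice Smith"]`, while "Smithsonian tour" yields no candidates.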
Fig. 1. Illustration of the proposed method. (a) A huge number of videos is available nowadays, and millions of faces can be detected and tracked in them. (b) We aim to efficiently retrieve face tracks extracted from videos, where face tracks serve as both the query and the large-scale target collection. In our work, each face track is represented by a bag-of-faces sparse representation that exploits temporal redundancy in the videos. Non-zero entries of the sparse representation are then used as codewords for building an inverted index, enabling scalable and effective retrieval in large-scale data.
easily work with current index frameworks, which are essential given the exponential growth of video collections.
To overcome this problem, we propose a novel coding method to encode the bag of faces into a single sparse representation. As shown in Figure 1, each bag of faces is represented by a sparse representation. Using the non-zero entries of the sparse representation as discrete codewords, an inverted index is built over millions of faces extracted from videos, which enables scalable retrieval over a large-scale database. To improve the retrieval performance, we further generalize the proposed coding method to find multiple sparse representations, which can accommodate possible face variations in the bag of faces and balance the trade-off between performance and retrieval time.
In order to evaluate the performance of the proposed methods, we conduct extensive experiments on two real-world datasets. One dataset is constructed from TRECVid [7] videos from 2004 to 2006; the other is constructed from a Japanese news program, “NHKnews7”, from 2001 to 2011. These datasets contain faces in unconstrained environments (see footnote 1) and are very challenging for content-based face retrieval. In the experiments we show that the proposed method achieves significant performance gains over prior state-of-the-art face recognition methods for face tracks or image sets while maintaining a highly scalable structure.
To sum up, our contributions include:
• We propose a novel coding method that encodes a face track as a bag-of-faces sparse representation to solve the face track retrieval problem in large-scale videos.
• We generalize the proposed coding method to enable multiple sparse representations for a bag of faces, accommodate possible face variations, and balance the trade-off between performance and retrieval time.
• We conduct extensive experiments with the proposed methods on two face track datasets constructed from real-world videos and compare the results with state-of-the-art face retrieval methods for image sets. The datasets are publicly available (see footnote 2) for future studies on face retrieval in videos.
II. RELATED WORK
Faces have long been a subject of interest for researchers because they are closely tied to our daily life. Although studies on face recognition have shown promising results on datasets collected in controlled environments, performance on real-world datasets is still unsatisfactory because face appearances vary widely in pose, expression, illumination, etc.
To overcome this issue, many recent studies focus on face recognition from sets of images. Instead of recognizing people from a single image, they use a set of face images of the same person. In [8], X. Liu and T. Chen use adaptive Hidden Markov Models to model the faces extracted from videos. In [9], Lee et al. use probabilistic appearance manifolds to model the faces. Some studies represent a set of images as a parametric distribution function such as a Gaussian [10] or a Gaussian Mixture Model [11] and use the KL-divergence to measure the distance between two sets. Other studies use a linear subspace [2], [12], [13] or a mixture of linear subspaces [14], [15] to represent a set of faces and use principal angles [16] to measure the distance between two subspaces. In [4], Satoh proposes to use the minimum distance between samples in two sets as the set distance; a similar idea is adopted by Cevikalp and Triggs [1], but instead of directly using the samples in the set, they model the image set with an affine hull and find the closest points in the affine hulls by solving a convex optimization problem. Hu et al. [17] further propose to use the sparse approximated nearest point distance between points in the affine hulls to improve the performance. Although many effective methods have been proposed to compute the distance between two sets of face images, they all ignore scalability issues, including (1) retrieval efficiency (linear search versus indexing), (2) memory consumption (dense features versus sparse features), and (3) similarity measurement (real values versus binary values), which should be considered to meet the scalability requirements of an online large-scale retrieval system. Therefore, these methods cannot be directly applied to face track retrieval in videos as the dataset grows.

1 Unconstrained environments mean that the photos are taken in the wild, in real life, where environmental parameters such as lighting, angle, and position are unconstrained.
2 http://satoh-lab.ex.nii.ac.jp/users/ndthanh/NIIFacetrackDatasets/
Recently, some studies have addressed the content-based face image retrieval problem. In [18], Wu et al. propose an identity-based quantization method for large-scale face image retrieval. Theodorakopoulos et al. [19] propose local sparse coding to represent a face by patch-based overcomplete dictionaries and to express pairwise similarities between faces. Chen et al. [20] propose to use sparse coding with identity constraints to improve the retrieval performance. Motivated by these methods, we propose to use a bag-of-faces sparse representation to represent a face track extracted from a video. With the proposed method, a face track is represented by a single sparse representation, and therefore an efficient indexing method (i.e., inverted indexing) can be directly applied to a large-scale dataset for real-time face track retrieval in large-scale videos.
III. SYSTEM OVERVIEW
We first use the face tracking method proposed in [21] to track faces in the videos; faces in the same track are grouped as a bag of faces. For each face in a bag, we apply facial landmark detection and extract 149-dimensional pixel-wise features at 13 different landmark locations to describe the face, as in [3]. The methods described in Section IV are then used to encode each bag of faces into one or more sparse representations. An inverted index is then built using the non-zero entries of the sparse representations as codewords for better performance and efficiency in retrieval [22], [23]. The system diagram is illustrated in Figure 2.
IV. PROPOSED METHOD
To construct face tracks, we use temporal information to extract faces appearing in consecutive video frames. Note that, within a track, we do not consider the temporal order of the faces, because faces of a person in different tracks exhibit different expressions and motion, so there is no exact correspondence between their temporal orders. In the following subsections, we first describe how to find the sparse representation of a single (face) image using sparse coding. Second, we describe how we generalize the sparse coding framework to find the sparse representation of a bag of faces. Finally, we describe how to improve retrieval performance by using multiple sparse representations for a bag of faces when the bag contains large variations.
A. Sparse representation for single face image (SR)
[Figure 2: three panels, (a) SR, (b) BoF-SR, (c) MBoF-SR]
Fig. 2. (a) Using sparse coding for face retrieval with a still image. Several patches are extracted from a face image at different facial landmarks (e.g., eye corners, nose tip, mouth corners, etc.). For each patch, a sparse representation $v^{(i)}$ is found using Equation (1). All sparse representations are then concatenated to form the final sparse representation that describes the face image. (b) The proposed bag-of-faces sparse representation method for face tracks. Patches extracted from the same facial landmark in a bag of faces (from the same face track) are grouped together as a bag of patches and are used to find a sparse representation by Equation (2). Sparse representations at different locations are then concatenated to form the final representation for the bag of faces. (c) Because the bag might contain faces with large variations, multiple sparse representations (indexed by $m$) are computed with Algorithm 1 at each facial landmark. For instance, two sparse representations can be found to represent the bag of faces at the mouth location; in an automatic and approximate manner, one represents the faces in the bag with the mouth closed and the other those with the mouth open. All sparse representations are aggregated to represent the bag of faces. Equation (6) is then adopted to compute the similarity between two bags (i.e., face tracks). Note that the sparse codewords can be indexed to facilitate large-scale face retrieval in video archives.

Sparse representation has proven very effective for face-related tasks. Wright et al. [24] propose to use sparse representation for face recognition and achieve state-of-the-art performance. In [20], Chen et al. propose to use the sparse representations of image patches as codewords for face image retrieval and demonstrate their effectiveness over prior common features on two open benchmarks. Here we show how to derive the sparse representation of a single face image for face image retrieval, as shown in Figure 2(a). Let $p$ be the number of landmark locations in a face. Given a set of $p$ dictionaries used to encode the $p$ image patches and the 149-dimensional pixel-wise features extracted from these patches, we find a sparse representation for each patch by solving the following optimization problem:
$$\min_{v^{(1)},\dots,v^{(p)}} \; \sum_{i=1}^{p} \left( \|x^{(i)} - D^{(i)} v^{(i)}\|_2^2 + \lambda \|v^{(i)}\|_1 \right), \tag{1}$$

where $p$ is the total number of patches in the face image, $x^{(i)}$ is the feature vector extracted from the patch at location $i$ (e.g., left eye or nose) of the image, and $D^{(i)} \in \mathbb{R}^{d \times k}$ is a dictionary that contains $k$ codewords of $d$ dimensions and is used to encode the patch extracted from location $i$ of the face. $v^{(1)}, v^{(2)}, \dots, v^{(p)}$ are the sparse representations of the image patches from locations $1, 2, \dots, p$, respectively. Since the objective function is convex in $D^{(i)}$ when $v^{(i)}$ is fixed and vice versa, we solve the optimization problem by alternately minimizing over $D^{(i)}$ and $v^{(i)}$ with an efficient online algorithm [25].
Using sparse coding, a patch feature is encoded as a sparse linear combination of the column vectors of the dictionary. After the sparse representations are found, each non-zero entry of $v^{(i)}$ is considered a codeword of the image for inverted indexing. Note that positive and negative values are considered different codewords and that the dimension of $v^{(i)}$ is $k$; therefore, the size of the vocabulary (the number of different codewords) is $2 \times p \times k$. With the dictionaries fixed, the above problem is a set of unconstrained L1-regularized least squares problems, which can be solved efficiently using many different algorithms such as LARS [26].
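The mapping from sparse representations to discrete codewords can be sketched as follows. The integer ID layout (location, then dictionary index, then sign) is an assumption chosen for illustration; the source only states that the vocabulary has $2 \times p \times k$ entries and that sign distinguishes codewords. The function name `to_codewords` is hypothetical.

```python
def to_codewords(sparse_codes, k):
    """Map p sparse vectors (one per landmark, each of length k) to codeword IDs.

    Each non-zero entry becomes one ID in [0, 2 * p * k); positive and negative
    values at the same position map to different IDs.
    """
    words = set()
    for location, v in enumerate(sparse_codes):
        if len(v) != k:
            raise ValueError("every sparse vector must have length k")
        for index, value in enumerate(v):
            if value != 0:
                sign_bit = 1 if value < 0 else 0
                words.add(2 * (location * k + index) + sign_bit)
    return words
```

For instance, with `p = 2` and `k = 3`, the codes `[[0, 0.5, 0], [-0.2, 0, 0.1]]` give the codeword set `{2, 7, 10}`, and flipping the sign of an entry changes its ID.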
B. Bag-of-faces sparse representation (BoF-SR)
For bags of faces, because the number of faces differs from bag to bag, instead of finding a sparse representation for each patch, we propose to aggregate all the patches extracted from the same location and find one sparse representation for the bag of patches at each location, as shown in Figure 2(b). To find the sparse representation at each location, we solve the following optimization problem, generalized from Equation (1):
$$\min_{v^{(1)},\dots,v^{(p)}} \; \sum_{i=1}^{p} \left( \frac{1}{n} \sum_{j=1}^{n} \|x_j^{(i)} - D^{(i)} v^{(i)}\|_2^2 + \lambda \|v^{(i)}\|_1 \right), \tag{2}$$

where $n$ is the number of faces in the bag and $x_j^{(i)}$ is the feature extracted from the $j$th face at location $i$. By solving the above optimization problem, $x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}$ are represented by a single sparse representation $v^{(i)}$, where $v^{(i)}$ minimizes the average reconstruction error over all patches at location $i$ in the bag. The idea is to find the best sparse representation $v^{(i)}$ to encode all the patches at a given location in the bag of faces.
Fig. 3. Two examples of bags of faces that are hard to represent by a single sparse representation. (a) The bag of faces contains two facial expressions: looking at the camera and looking at the script. (b) The bag of faces contains some noise due to possible tracking errors. In these cases, using multiple bag-of-faces sparse representations (i.e., codebooks) can achieve better performance.

Each $v^{(i)}$ in the above problem can be solved for separately as an unconstrained L1-regularized sum of least squares problems, which can be viewed as a larger L1-regularized least squares problem and can also be solved with the LARS algorithm [26]. Note that when there is only one face in the bag, the above problem reduces to Equation (1). The size of the vocabulary for a bag of faces is the same as in the single-image case, and the size of the database is reduced from millions of faces to tens of thousands of bags. Therefore, we can achieve a very efficient online retrieval response.
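As a side remark not stated in the source, the reduction of Equation (2) to a single Lasso problem per location can be made explicit with a standard identity. Writing $\bar{x}^{(i)}$ for the mean patch feature at location $i$ (a symbol introduced here for illustration), the averaged reconstruction error splits into a term that depends on $v^{(i)}$ and a constant:

```latex
\frac{1}{n}\sum_{j=1}^{n} \bigl\| x_j^{(i)} - D^{(i)} v^{(i)} \bigr\|_2^2
  \;=\; \bigl\| \bar{x}^{(i)} - D^{(i)} v^{(i)} \bigr\|_2^2
  \;+\; \frac{1}{n}\sum_{j=1}^{n} \bigl\| x_j^{(i)} - \bar{x}^{(i)} \bigr\|_2^2 ,
\qquad
\bar{x}^{(i)} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(i)} .
```

The cross term vanishes because $\sum_j (x_j^{(i)} - \bar{x}^{(i)}) = 0$. Since the second term does not depend on $v^{(i)}$, the minimizer of Equation (2) at location $i$ coincides with the minimizer of Equation (1) applied to $\bar{x}^{(i)}$ with the same $\lambda$, which is consistent with the observation that the single-face case recovers Equation (1).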
C. Multiple sparse representations for bag of faces (MBoF-SR)
Using the above method, we can find a sparse representation of each bag of faces and achieve efficient retrieval, but sometimes a single sparse representation cannot characterize all the patches at a single location well. Figure 3 shows two failure cases. Figure 3(a) is a bag of faces extracted from a news video of a person giving a speech. There are two types of expressions in the bag of faces: one when the person is looking at the camera, the other when she is looking at the script. Figure 3(b) is another bag of faces containing the same person; in this bag, some of the faces are noisy due to face tracking errors. In these two cases, some patches extracted at the same facial landmark are quite different. Therefore, we propose to use multiple sparse representations to represent the bag of faces, where each sparse representation represents a subset of the patches in the bag of patches at a given landmark location, as shown in Figure 2(c). We formulate this as the following optimization problem:
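$$\begin{aligned}
\min_{V^{(i)},\,S^{(i)},\,\forall i} \quad & \sum_{i=1}^{p} \left( \frac{1}{n} \sum_{j=1}^{n} \|x_j^{(i)} - D^{(i)} V^{(i)} s_j^{(i)}\|_2^2 + \lambda \sum_{k=1}^{m} \|v_k^{(i)}\|_1 \right) \\
\text{subject to} \quad & \|s_j^{(i)}\|_0 = 1,\; \|s_j^{(i)}\|_1 = 1,\; s_j^{(i)} \ge 0,\; \forall i, j,
\end{aligned} \tag{3}$$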
where $V^{(i)} = [v_1^{(i)}, v_2^{(i)}, \dots, v_m^{(i)}]$ are the $m$ sparse representations for patches at location $i$, $S^{(i)} = [s_1^{(i)}, s_2^{(i)}, \dots, s_n^{(i)}]$, and $s_j^{(i)} \in \{0, 1\}^m$ is a zero-one vector indicating which column of $V^{(i)}$ is used to represent $x_j^{(i)}$. For instance, if $s_j^{(i)} = e_2$ (see footnote 3), then $V^{(i)} s_j^{(i)} = v_2^{(i)}$; therefore, $x_j^{(i)}$ is reconstructed by $D^{(i)} v_2^{(i)}$. The idea is to find multiple sparse representations, each of which represents a subset of the patches in a bag of patches that contains large variations. By minimizing the above objective function, we simultaneously find multiple sparse representations for the bag of patches at each location ($V^{(i)}$) and decide which sparse representation is used to represent each patch ($S^{(i)}$).
The above problem is not convex because the feasible set (i.e., the set of all solutions that satisfy the constraints of the problem) is not convex; therefore, it is hard to find the optimal solution of this problem. Here we propose an algorithm that finds a suboptimal solution by alternately minimizing over V(i) and S(i).
When $S^{(i)}$ is fixed in Equation (3), we can find each column of $V^{(i)}$ separately by solving the following unconstrained convex optimization problem:

$$\min_{v_j^{(i)}} \; \frac{1}{n} \sum_{k:\, s_k^{(i)} = e_j} \|x_k^{(i)} - D^{(i)} v_j^{(i)}\|_2^2 + \lambda \|v_j^{(i)}\|_1. \tag{4}$$

When $V^{(i)}$ is fixed, we can find each $s_j^{(i)}$ by solving the following optimization problem:

$$\begin{aligned}
\min_{s_j^{(i)}} \quad & \|x_j^{(i)} - D^{(i)} V^{(i)} s_j^{(i)}\|_2^2 \\
\text{subject to} \quad & \|s_j^{(i)}\|_0 = 1,\; \|s_j^{(i)}\|_1 = 1,\; s_j^{(i)} \ge 0.
\end{aligned} \tag{5}$$
The size of the feasible set in Equation (5) is only $m$; therefore, we can solve it by simply trying all possible values of $s_j^{(i)}$. The algorithm for solving Equation (3) is summarized in Algorithm 1. In each iteration, we alternately divide the bag of patches at each location into subsets using $S^{(i)}$ and find a suitable sparse representation for each subset of patches. The algorithm converges because the objective function in Equation (3) never increases from one iteration to the next and there is only a finite set of possible $S^{(i)}$. Note that although the algorithm converges, it is not guaranteed to find the optimal solution, and the result depends on the initial value of $S^{(i)}$. In practice, however, we find that it usually yields a good set of sparse representations for the bag of faces and converges within several iterations.
After the above procedure, each bag is represented by $p$ sets of sparse representations, e.g., $B_1 = \{V_1^{(1)}, \dots, V_1^{(p)}\}$ and $B_2 = \{V_2^{(1)}, \dots, V_2^{(p)}\}$. The similarity between two bags is then defined as follows:

$$S(B_1, B_2) = \sum_{i=1}^{p} \max_{j,k} \; c\bigl(v_{1,j}^{(i)}, v_{2,k}^{(i)}\bigr), \tag{6}$$

where $c(a, b)$ denotes the number of overlapping codewords between two sparse vectors,

$$c(a, b) = \|\max(a \circ b,\, 0)\|_0, \tag{7}$$

and “$\circ$” denotes element-wise multiplication between two vectors. Note that with Equation (7), only coding values with the same sign are counted as the same codeword; that is, we treat coding values with different signs as different codewords. By considering the sign of the coding values, we effectively obtain a sparse representation with $2 \times k$ dimensions. This can be viewed as sparse coding using a larger dictionary $[-D \;\; D]$ with $2 \times k$ entries. Equation (6) computes the sum, over locations, of the maximum number of overlapping codewords between two bags of faces.

3 Here $e_i$ is an $m$-dimensional vector that is all zeros except for a one in the $i$th dimension, as defined in most linear algebra literature.
To efficiently compute the similarity measure in Equation (6), we use a modified version of the inverted index. For each entry in an inverted list, we maintain a Bag-ID that denotes which bag the codeword belongs to and a Representation-ID, ranging from 1 to $m$, that denotes which sparse representation of the bag the codeword comes from. For each sparse representation in the query face track, we look up the index and compute the number of overlapping codewords between the query representation and every sparse representation in the index, which yields $m$ different scores for each Bag-ID. We keep the best of these $m$ scores. After $m$ runs with the different sparse representations of the query face track, we obtain the maximum number of overlapping codewords between the query sparse representations and the sparse representations in the index. Since there are $m$ times as many sparse representations as in the single-representation case, the average length of the posting lists in the inverted index is $m$ times longer; therefore, retrieving the index and computing the scores takes $m^2$ times as long.
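The modified inverted index can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' C++ implementation: the class name `BagIndex`, the bag layout (`bag[location][representation]` is a set of signed codewords) and the `(index, sign)` codeword tuples are all hypothetical. It evaluates Equation (6) by taking, at each location, the best overlap count over all pairs of query and database representations.

```python
from collections import defaultdict


class BagIndex:
    """Inverted index whose postings carry a Bag-ID and a Representation-ID."""

    def __init__(self):
        # (location, codeword) -> list of (bag_id, rep_id)
        self.postings = defaultdict(list)

    def add(self, bag_id, bag):
        for location, reps in enumerate(bag):
            for rep_id, words in enumerate(reps):
                for word in set(words):
                    self.postings[(location, word)].append((bag_id, rep_id))

    def query(self, bag):
        """Return {bag_id: similarity} following Equation (6)."""
        scores = defaultdict(int)
        for location, reps in enumerate(bag):
            best = defaultdict(int)  # bag_id -> best overlap at this location
            for words in reps:
                overlap = defaultdict(int)  # (bag_id, rep_id) -> shared codewords
                for word in set(words):
                    for key in self.postings.get((location, word), ()):
                        overlap[key] += 1
                for (bag_id, _rep_id), count in overlap.items():
                    best[bag_id] = max(best[bag_id], count)
            for bag_id, count in best.items():
                scores[bag_id] += count
        return dict(scores)
```

Only bags that share at least one codeword with the query are touched, which is what makes the lookup cheaper than a linear scan over all bags.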
Algorithm 1 Algorithm for finding sets of sparse representations
Input: A set of dictionaries $D^{(1)}, \dots, D^{(p)} \in \mathbb{R}^{d \times k}$; features extracted from the bag of faces at each location, $X^{(1)}, \dots, X^{(p)} \in \mathbb{R}^{d \times n}$; $n$ (the size of the input bag of faces); $m$ (the number of output sparse representations)
Output: A set of sparse representations for each location, $V^{(1)}, \dots, V^{(p)} \in \mathbb{R}^{k \times m}$
for i = 1 to p do
    Randomly choose S(i) satisfying the constraints in Equation (3)
    repeat
        for j = 1 to m do
            Solve Equation (4) using the LARS algorithm
        end for
        for j = 1 to n do
            Solve Equation (5) by trying all elements in the feasible set
        end for
    until convergence
end for
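The alternating scheme of Algorithm 1 can be sketched for a single landmark location as below. Several things here are assumptions rather than the authors' implementation: the Lasso step uses plain ISTA (proximal gradient) instead of LARS, the step size uses the Frobenius norm as a conservative bound, and the names `lasso_ista`, `mbof_sr_location` and the optional `init` argument are hypothetical. The update for Equation (4) uses the fact that it equals a weighted Lasso on the mean of the patches currently assigned to representation $j$.

```python
import random


def matvec(D, v):
    # D is a list of d rows, each of length k.
    return [sum(row[c] * v[c] for c in range(len(v))) for row in D]


def soft(z, t):
    # Soft-thresholding operator, the proximal map of t * |z|.
    return max(z - t, 0.0) - max(-z - t, 0.0)


def lasso_ista(D, x, weight, lam, iters=200):
    """Minimize weight * ||x - D v||^2 + lam * ||v||_1 with ISTA."""
    d, k = len(D), len(D[0])
    v = [0.0] * k
    # Frobenius norm bounds the spectral norm, giving a safe step size.
    lipschitz = 2.0 * weight * sum(a * a for row in D for a in row)
    if lipschitz == 0:
        return v
    for _ in range(iters):
        residual = [a - b for a, b in zip(matvec(D, v), x)]
        grad = [2.0 * weight * sum(D[r][c] * residual[r] for r in range(d))
                for c in range(k)]
        v = [soft(v[c] - grad[c] / lipschitz, lam / lipschitz) for c in range(k)]
    return v


def sq_err(D, x, v):
    return sum((a - b) ** 2 for a, b in zip(x, matvec(D, v)))


def mbof_sr_location(D, X, m, lam, max_iter=20, seed=0, init=None):
    """Alternate between Equation (4) and Equation (5) at one location.

    X is a list of n patch feature vectors. Returns (V, assign), where V holds
    m sparse vectors and assign[t] is the representation used for patch t.
    """
    n, k = len(X), len(D[0])
    rng = random.Random(seed)
    assign = list(init) if init is not None else [rng.randrange(m) for _ in range(n)]
    V = [[0.0] * k for _ in range(m)]
    for _ in range(max_iter):
        for j in range(m):
            members = [X[t] for t in range(n) if assign[t] == j]
            if not members:
                V[j] = [0.0] * k  # empty subset: only the L1 term remains
                continue
            mean = [sum(col) / len(members) for col in zip(*members)]
            V[j] = lasso_ista(D, mean, len(members) / n, lam)
        # Equation (5): try all m choices for every patch.
        new_assign = [min(range(m), key=lambda j: sq_err(D, x, V[j])) for x in X]
        if new_assign == assign:
            break
        assign = new_assign
    return V, assign
```

Running this once per landmark location with that location's dictionary reproduces the outer loop of Algorithm 1; as the text notes, the result depends on the initial assignment.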
V. EXPERIMENTS
A. Dataset
TABLE I
STATISTICS OF THE EXPERIMENTAL DATASETS (TRECVID AND NHK NEWS VIDEOS); THE DETAILS ARE EXPLAINED IN SECTION V-A.

Dataset     Annotated tracks   Faces   Identities
TRECVid     1,497              405K    41
NHKnews7    5,567              1.25M   111

We use two different datasets to evaluate our system. The first is extracted from TRECVid [7] news videos from 2004 to 2006. Around 20 million faces are detected in the videos; the detected faces are then tracked and grouped into around 157K bags of faces. As reported in Table I, 1,497 bags of faces with 405K faces from 41 well-known people are annotated for evaluation. The second is extracted from a news program broadcast in Japan, “NHK news7”, from 2001 to 2011. 5,567 bags of faces with 1.25 million faces from 111 people are annotated for evaluation. To the best of our knowledge, this dataset is one of the largest available for the face track retrieval task. It is very challenging because it contains not only variations in illumination, pose, and expression but also biological variations between faces of the same person due to the long time period. Throughout the experiments on each dataset, in a leave-one-out manner, each bag of faces is used in turn as the query while the remaining bags are used as the database for computing the average precision. Mean average precision (MAP), a common measure for retrieval tasks in the literature, is then computed over all queries.
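The leave-one-out protocol can be sketched as follows. The function names and the `score(q, t)` callback are hypothetical, and the handling of queries with no relevant bag (average precision set to 0) is an assumption, since the source does not say how such queries are treated.

```python
def average_precision(ranked_labels, query_label):
    """Average precision of a ranked list of identity labels for one query."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / hits if hits else 0.0


def leave_one_out_map(labels, score):
    """Use each bag as the query against all other bags and average the APs.

    score(q, t) returns the similarity between bags q and t (higher is better).
    """
    aps = []
    for q in range(len(labels)):
        others = [t for t in range(len(labels)) if t != q]
        others.sort(key=lambda t: score(q, t), reverse=True)
        aps.append(average_precision([labels[t] for t in others], labels[q]))
    return sum(aps) / len(aps)
```

Any of the similarity or distance measures compared in the experiments could be plugged in as `score`, with distances negated so that higher means more similar.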
B. Compared algorithms
We compare our methods to several state-of-the-art methods in the experiments including:
• SR: patch-level sparse representation from a single image as shown in Figure 2 (a) [20]. We simply pick the first face in the bag to represent the bag of faces. This baseline is used to illustrate the effectiveness of the bag-of-faces representation.
• MSM: mutual subspace method proposed in [2]. We first use PCA to find subspace bases and use the average of top ten canonical correlations for computing distance.
• Min-Min: minimum distance between samples from two sets as proposed in [4]. We use one minus cosine similarity as our distance measure.
• AHISD: affine hull image set distance proposed in [1].
• CHISD: convex hull image set distance proposed in [1].
For both AHISD and CHISD, we use the linear version and retain 98% energy by PCA. For CHISD, we set C = 100 for SVM training as in [1].
• BoF-SR: bag-of-faces sparse representation proposed in this paper.
• MBoF-SR: multiple bag-of-faces sparse representation proposed in this paper.
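To make the cost of the Min-Min baseline concrete, here is a minimal sketch of the set distance with one minus cosine similarity. The function names are hypothetical and returning distance 1 for zero vectors is an assumption; the point is that every pair of samples across the two sets is compared, so the cost per bag pair grows with the product of the bag sizes.

```python
import math


def cosine_distance(a, b):
    """One minus cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat a zero vector as orthogonal to everything
    return 1.0 - dot / (norm_a * norm_b)


def min_min_distance(set_a, set_b):
    """Smallest pairwise distance between samples of two sets."""
    return min(cosine_distance(a, b) for a in set_a for b in set_b)
```

For two bags with 300 faces each, a single comparison already needs 90,000 pairwise distances, which is consistent with the long retrieval times reported for this baseline.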
Note that we use the same pixel-wise features extracted from 13 different facial landmarks for all the above methods. For SR, BoF-SR and MBoF-SR, we use random samples from the NHKnews7 dataset as our dictionary entries. The advantage of random sampling is that large dictionaries can be constructed quickly, which is a major benefit when representing large-scale visual data. In terms of representativeness, Coates and Ng [27] found that using randomly sampled image patches as a dictionary can achieve performance similar to that of a learned dictionary (< 2.7% relative improvement in their experiments) if the sampled patches provide an overcomplete basis that can represent the input data. Therefore, in the experiments, we adopt random sampling to efficiently obtain a large dictionary that meets the needs of representing large-scale video data.
For MSM, Min-Min, AHISD and CHISD, we use linear search to derive the ranking results, since these methods do not work well with current indexing frameworks; for SR, BoF-SR and MBoF-SR, we use inverted indexing. For AHISD and CHISD, we use the MATLAB implementation provided by the authors of [1]. The other methods (MSM, SR, BoF-SR, MBoF-SR) are carefully implemented in MATLAB, and inverted indexing is implemented in C++. All the experiments are run on a 2.4 GHz Intel Xeon server.
C. Evaluation of the proposed methods
Table II shows the performance of the proposed methods compared to other state-of-the-art methods. In BoF-SR and MBoF-SR, we use λ = 0.125 and k = 1000. In MBoF-SR, we use eight sparse representations to represent the bag of faces (m = 8); the parameters are discussed in the following sections. SR performs much worse than the other methods because it uses only a single face image. This shows that using a bag of faces helps performance considerably, since it exploits more information. Among the other methods, Min-Min shows strong performance, but it takes a long time for retrieval and is therefore not applicable to large-scale data. Note that AHISD performs worse than the other baseline methods. This is because, when the bags are large, the affine hull representation might be too strong, so that every affine hull in the dataset is very close to every other in the feature space.
Using BoF-SR, we achieve an 8% absolute improvement over the other state-of-the-art methods on the TRECVid dataset and an 8.6% improvement on the NHKnews7 dataset, while having the fastest online retrieval time (0.01s on the TRECVid dataset). MBoF-SR further improves the performance by 2.5% on the TRECVid dataset and 5.2% on the NHKnews7 dataset and still has a reasonable online retrieval time. Table II also shows the top-rank precision (P@10) of all the methods. The proposed methods achieve not only the best MAP but also the best P@10 on both datasets. Note that the time shown in Table II covers only the online retrieval time and does not include the time for face tracking, feature extraction and computing the representation of a bag of faces, because that time is independent of the size of the dataset.
For MSM, Min-Min, AHISD and CHISD, we need to use linear search to derive the ranking results; therefore, the retrieval time is linear in the dataset size. On the other hand, SR, BoF-SR and MBoF-SR can achieve sub-linear retrieval time by using inverted indexing. For AHISD and CHISD, the retrieval time on NHKnews7 is too long: a single query takes more than 1,000 seconds. Evaluating all 5,567 queries in the dataset would take more than two months; therefore, we do not report their performance here.

4 Because the features used in [28] (Local Binary Pattern) are different from those in this work (pixel-wise features), the results of MSM are slightly different.
D. Impact of the parameters in BoF-SR
Figure 4 shows the performance of BoF-SR with different parameters on the two datasets. We run experiments with λ from 0.125 to 4 and k from 200 to 1000 on both datasets. We find that when k is too small (green line with crosses in Figure 4) the performance is worse, because the dictionary does not have enough discriminative capability to represent the bag of faces. When the dictionary is large enough, the performance is similar regardless of the dictionary size. The performance is better when λ is small, because more (denser) codewords are used to represent the bag of faces. However, the performance tends to saturate when λ is too small, because the sparsity of the representations tends to stay the same. Note that there is a trade-off between performance and retrieval time: when λ is large, the number of codewords drops because the sparsity of the representations increases, and therefore retrieval is faster. Throughout the following experiments, we set λ = 0.125 and k = 1000 for both BoF-SR and MBoF-SR on both datasets.
E. Number of sparse representations in MBoF-SR
In Figure 5, we show the performance and retrieval time of MBoF-SR for different values of m. When m increases, the performance on both datasets increases, which demonstrates the effectiveness of the proposed MBoF-SR: when more sparse representations are used to represent the bag of faces, the bag is described better. The retrieval time required by MBoF-SR is $m^2$ times that of BoF-SR, so retrieval time also increases with m. Nevertheless, when m is small, the retrieval time increases slowly while the performance gain is more significant; therefore, we can choose a small m to achieve better performance with a reasonable retrieval time.
F. Size of the bag
We also conduct experiments with varying bag sizes (i.e., numbers of faces per bag). Table III shows the MAP performance on the TRECVid dataset with different bag sizes. We take the first n faces from the bag of faces to compute the result. If the total number of faces is smaller than n, all faces are used. Performance improves as the bag size increases, since more redundancy in the faces can be exploited. Moreover, the proposed methods consistently have the best performance for all sizes. The performance of AHISD and CHISD drops when all the images in the bags are used. This is probably because, when the bag is very large, the affine hull or convex hull representation is too strong: most of the bags become very close to each other in the feature space and the representation thus lacks discriminative capability. Also note that the retrieval time required by the proposed methods (BoF-SR, MBoF-SR) does not change much, while it increases dramatically for Min-Min, AHISD and CHISD. Both scalability and effectiveness are thus ensured in the proposed methods. The example retrieval results by the proposed MBoF-SR are shown