Chapter 2 Background
2.2 Human Detection in Bottom-up Detection Scheme
In [4] and [15], Bourdev et al. introduced a new notion of parts called “poselets”. The key idea is to define parts that are tightly clustered in both the configuration space and the appearance space, as shown in Figure 2-7. The poselets are produced by a search procedure. A patch is randomly chosen in the image of a randomly picked person as the seed of a poselet, and other examples are found by searching the images of other people for patches in which the configuration of key-points, such as shoulders or hips, is similar to that in the seed. After that, HOG features are computed for each of the associated image patches and used as positive examples for training a linear support vector machine. At test time, a multi-scale sliding window is used to find strong activations of the poselet filters. Note that these poselets must carry strong spatial information in order to estimate the possible locations of key-points, which makes it possible to compute the mutual consistency between activations. With these mutual consistencies, the activations can be clustered to produce human hypotheses.
Figure 2-7 Examples of Poselets [4]
Figure 2-8 illustrates the overall detection procedure. In Figure 2-8(a), the detection results of different poselet detectors are shown in different colors, and the size of each blob indicates the detection score. Mutual consistency measures how close the key-point locations estimated by two different activations are. This information is used to re-score the activations: an activation with more supporting activations that agree on the estimated key-points receives a higher score, while an activation without such support is damped. This is shown in Figure 2-8(b). In Figure 2-8(c), the authors use a saliency-based agglomerative clustering with pairwise distances based on the consistency of the empirical key-point distributions predicted by each poselet. Finally, the bounding boxes and segmentations are predicted by the poselets in each cluster, as shown in Figure 2-8(d).
Figure 2-8 Illustration of the algorithm provided by Bourdev et al. [15]: (a) detection results of poselets in different colors, called activations; (b) illustration of mutual consistency; (c) saliency-based clustering in a greedy manner; (d) detection and segmentation results.
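The re-scoring idea can be illustrated with a minimal sketch. It assumes that each activation carries one predicted 2-D location per key-point. The Gaussian-style consistency measure, the `sigma` scale and the additive boost controlled by `weight` are illustrative assumptions and not the formulation used in [15]. In this simplified version, supported activations gain score, while isolated ones keep their raw score and therefore fall behind in relative terms.

```python
import numpy as np

def consistency(kp_a, kp_b, sigma=10.0):
    """Value in (0, 1]; equals 1 when two activations predict identical key-points."""
    # Mean Euclidean distance between corresponding predicted key-points.
    dist = np.linalg.norm(kp_a - kp_b, axis=1).mean()
    return float(np.exp(-(dist / sigma) ** 2))

def rescore(scores, keypoints, sigma=10.0, weight=0.5):
    """Add to each activation the score-weighted support of all other activations."""
    n = len(scores)
    new_scores = np.array(scores, dtype=float)
    for i in range(n):
        support = sum(
            scores[j] * consistency(keypoints[i], keypoints[j], sigma)
            for j in range(n)
            if j != i
        )
        new_scores[i] += weight * support
    return new_scores
```

With two activations that predict nearly the same shoulder locations and a third that predicts them far away, `rescore` raises the first two well above the third.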
In [3], Alex Yong-Sang Chia et al. assume that the target object can be described by a combination of shape-tokens, which consist of several line segments and ellipses. An overview of this contour-based recognition method is provided in Figure 2-9. In the first step, a large number of shape-tokens are extracted from the training set and clustered into the code-words of a codebook. Next, a discriminative subset of the codebook is extracted. Instead of relying on cluster size, the extraction is based on a score calculated from shape and geometric qualities, and a radial ranking is applied. Note that for each shape-token, the relative position of the object center is recorded. Hence, the final positions of objects are decided by a voting scheme. In addition, the bounding boxes are determined from the shape-tokens used.
Figure 2-9 Overview of algorithm provided by Alex Yong-Sang Chia et al. [3]
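The voting step of [3] can be sketched as a Hough-style accumulator. This is a minimal illustration under stated assumptions: each matched shape-token is reduced to an image position, a stored offset to the object center and a weight, and votes are collected on a coarse grid. The function name, the grid size `cell` and the weighting are hypothetical and are not taken from [3].

```python
import numpy as np

def vote_for_center(token_positions, center_offsets, weights, image_shape, cell=8):
    """Accumulate weighted votes for the object center on a coarse grid."""
    rows = image_shape[0] // cell + 1
    cols = image_shape[1] // cell + 1
    accumulator = np.zeros((rows, cols))
    for (y, x), (dy, dx), w in zip(token_positions, center_offsets, weights):
        cy, cx = y + dy, x + dx  # center predicted by this token
        if 0 <= cy < image_shape[0] and 0 <= cx < image_shape[1]:
            accumulator[int(cy) // cell, int(cx) // cell] += w
    r, c = np.unravel_index(np.argmax(accumulator), accumulator.shape)
    # Report the middle of the winning cell in pixel coordinates.
    center = (int(r) * cell + cell // 2, int(c) * cell + cell // 2)
    return center, float(accumulator[r, c])
```

Tokens whose predicted centers fall into the same cell reinforce each other, so a few consistent tokens outweigh scattered false matches.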
Up to this point, the parts used for detection are learned from training data, and they need to contain strong spatial information in order to infer the locations of key-points of the body configuration or the center of the target object. The works in [6-9] take a different approach. They directly define the parts for detection in the anatomical sense, which means that the parts are the head, torso, forearm, upper arm, thigh or shank. These references are closely related to the work in this thesis.
In [6], Mori et al. first partition the test image into segments, and then detect the body parts, such as limbs, torso and head, based on the information of the segments. For limbs, the authors assume that the half limbs, such as the forearm, upper arm, thigh or shank, are well segmented, which means that each half limb is represented by a single segment. In order to detect half limbs, many hand-segmented half limbs are extracted for training. Several examples are shown in Figure 2-10. The features used to describe a half limb are contour, shape, shading and focus. A sigmoid function is used to transform each feature value into a probability-like quantity. These values are combined linearly, and the weights are learned from training data with a linear regression training scheme. Finally, the number of candidates to be extracted acts as the threshold for half limb detection.
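The cue combination described for half limbs can be sketched as follows. The sketch assumes that each candidate segment already has one raw value per cue. The sigmoid parameters, the bias term and the helper names are illustrative assumptions, and ordinary least squares stands in for the linear regression training scheme mentioned in [6].

```python
import numpy as np

def sigmoid(x, center=0.0, slope=1.0):
    """Map a raw cue value to a probability-like quantity in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-slope * (x - center)))

def _with_bias(cue_probs):
    # Append a constant column so that the regression can learn an offset.
    return np.hstack([cue_probs, np.ones((cue_probs.shape[0], 1))])

def fit_cue_weights(cue_probs, labels):
    """Least-squares weights for linearly combining the cue probabilities."""
    weights, *_ = np.linalg.lstsq(_with_bias(cue_probs), labels, rcond=None)
    return weights

def limb_scores(cue_probs, weights):
    """Linear combination of cue probabilities for every candidate segment."""
    return _with_bias(cue_probs) @ weights

def top_candidates(scores, k):
    """Indices of the k best-scoring segments; k plays the role of the threshold."""
    return np.argsort(scores)[::-1][:k]
```

Keeping a fixed number `k` of candidates, rather than thresholding the score directly, matches the remark that the number of extracted candidates acts as the detection threshold.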
For the torso, the shape is assumed to be rectangular and may consist of more than one segment. The features used are the same as those used for half limbs, except that shading is omitted. The weights for feature combination are trained in exactly the same way.
For the inference of configuration, the orientation of the torso and the locations of the body joints are needed. Hence, for each torso candidate and each orientation, the best matching head is determined. A candidate head may consist of one or two segments. The same set of cues, namely contour, shape and focus, is used to evaluate the score of a candidate head. The combined score of head and torso consists of the score of the head and the score of the torso, plus a simple score describing their relative positions. Finally, the possible combinations of head and torso are sorted by score, and a finite number of them are chosen as candidates for the inference of configuration. Several examples are shown in Figure 2-10.
Figure 2-10 (a) Examples of human-segmented half limbs for training; (b) torso candidates provided by combinations of segments. [6]
Once the part candidates and the joint information are available, the next step is the inference of configuration. The method adopted by the authors is exhaustive search. For each torso candidate, the best limb is independently selected for each joint. The number of possible configurations is evaluated as $\binom{L}{3} \cdot 8 \cdot 7 \cdot 6 \cdot 2^3 \cdot T$. Here, $L$ is the number of half limb candidates, which is usually around 5 to 7, and $T$ is the number of head-torso candidates, which is set to 50. The authors assume that at least three half limbs can be found for each configuration. There are 8 possible roles for each half limb candidate, so the number of possible role assignments for three half limbs is $8 \cdot 7 \cdot 6$. The polarity of each half limb is also considered, which requires a further multiplication by $2^3$. This exhaustive search leads to 2-3 million partial configurations. A “Constraint Satisfaction” strategy is used to suppress physically impossible configurations. The constraints used are relative widths, length given torso, adjacency and symmetry in clothing. With this strategy, approximately 1000 configurations remain. Finally, these configurations are sorted by their total scores, which are linear combinations of the scores of the limbs and the head-torso. Several examples are shown in Figure 2-11.
Figure 2-11 Several detection results of [6]
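The size of the exhaustive search in [6] can be checked with a short calculation. The sketch reads the first factor of the count as the binomial coefficient "L choose 3" and the polarity factor as 2 to the power of 3, which is an interpretation of the formula as printed. The function name is hypothetical.

```python
from math import comb

def partial_configurations(num_limbs, num_head_torso):
    """Number of partial configurations with exactly three half limbs."""
    subsets = comb(num_limbs, 3)  # which three half limb candidates are used
    roles = 8 * 7 * 6             # distinct roles for the three chosen half limbs
    polarities = 2 ** 3           # two polarities per chosen half limb
    return subsets * roles * polarities * num_head_torso

for num_limbs in (5, 6, 7):
    print(num_limbs, partial_configurations(num_limbs, 50))
```

With T = 50, this gives 1,344,000 configurations for L = 5, 2,688,000 for L = 6 and 4,704,000 for L = 7, which is in line with the figure of 2-3 million quoted for a typical number of candidates.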
In [7], the authors first preprocess the image with the “local Pb operator” to compute a soft edge map. After that, “Canny’s hysteresis” is used to convert the soft edge map into contours, which are recursively split into piecewise straight lines. Finally, “Constrained Delaunay Triangulation” (CDT) is applied to transform the scale-invariant discrete line structure into a set of triangles.
Once the triangulation map is ready, the limb and torso candidates are extracted under the assumption that each is formed by a pair of parallel lines. The torso is additionally constrained to be oriented upward. Given the body part candidates, configuration inference can be seen as a label assignment problem, that is, deciding the role of each part candidate in the configuration. The best configuration is inferred by considering simple unary and pairwise constraints, namely aspect ratio, low-level score, scale consistency, appearance consistency, orientation consistency and connectivity. These constraints are modeled by Gaussian distributions. The inference problem can be modeled as the minimization of the following equation:
$\sum \sum f'_k(l_1, \pi(l_1), l_2, \pi(l_2))$
Here, $\pi(l_i)$ represents the part candidate that is assigned the body label $l_i$. In addition, $d(\pi(l))$ is used to measure the quality of an individual part candidate.
The minimization of Equation 2.3 can be rewritten as an integer quadratic programming (IQP) problem, Equation 2.4. Directly optimizing Equation 2.4 is an NP-hard problem. An approximation is therefore derived in the form of a linear bounding function, which allows efficient inference.
Finally, a greedy search is adopted. One candidate is fixed for a specific label, and the best assignment is found for the other candidates. The procedure is repeated to find the configuration with the minimum constraint cost. An example illustrating the overall system is provided in Figure 2-12.
Figure 2-12 Illustration of the algorithm provided in [7]: (a) input image; (b) edge map; (c) result of Constrained Delaunay Triangulation; (d) part candidates as parallel lines in the same color; (e) configuration found by Integer Quadratic Programming; (f) approximate segmentation.
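The greedy search used in [7] (fix one candidate for a label, then fill in the remaining labels) can be sketched as below. This is a minimal illustration only. It assumes a unary cost table `unary[label, candidate]` and a pairwise cost table `pairwise[l1, c1, l2, c2]` defined for `l1 < l2`, both hypothetical, and it does not reproduce the linear bounding approximation of the IQP. It also assumes that there are at least as many candidates as labels.

```python
import itertools
import numpy as np

def pair_cost(pairwise, l1, c1, l2, c2):
    """Look up a pairwise cost regardless of the order of the two labels."""
    return pairwise[l1, c1, l2, c2] if l1 < l2 else pairwise[l2, c2, l1, c1]

def total_cost(assign, unary, pairwise):
    """Unary costs of all labels plus pairwise costs over all label pairs."""
    labels = sorted(assign)
    cost = sum(unary[l, assign[l]] for l in labels)
    for l1, l2 in itertools.combinations(labels, 2):
        cost += pairwise[l1, assign[l1], l2, assign[l2]]
    return cost

def incremental_cost(label, cand, assign, unary, pairwise):
    """Cost added by giving `label` to `cand`, given the labels already assigned."""
    return unary[label, cand] + sum(
        pair_cost(pairwise, label, cand, m, d) for m, d in assign.items()
    )

def greedy_assignment(unary, pairwise, anchor_label=0):
    """Try every candidate for the anchor label and greedily fill the other labels."""
    n_labels, n_cands = unary.shape
    best, best_cost = None, np.inf
    for anchor in range(n_cands):
        assign = {anchor_label: anchor}
        for label in range(n_labels):
            if label == anchor_label:
                continue
            free = [c for c in range(n_cands) if c not in assign.values()]
            assign[label] = min(
                free, key=lambda c: incremental_cost(label, c, assign, unary, pairwise)
            )
        cost = total_cost(assign, unary, pairwise)
        if cost < best_cost:
            best, best_cost = dict(assign), cost
    return best, best_cost
```

For two labels this procedure is exhaustive. For more labels it is only a heuristic, since each label is committed before the later ones are considered.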
In [8], instead of finding two parallel line segments to identify a limb candidate directly as in [7], the authors relax the constraint so that only one straight line segment is needed, in order to handle missing segments caused by clutter, occlusion or shape variation. Once a straight line is extracted, the “Distance Transform (DT)” matching provided in [10] is applied. The matching score between parallel line templates of different sizes and orientations and the distance transform of the edge map obtained by the “Canny Edge Detector” is evaluated at every possible position. The template used for the torso is the same as the templates for limbs. The scale of the torso is inferred from the scale of the limbs based on the anthropometric data provided in [16]. For the head, the template shape is a circle.
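Distance transform matching of a parallel line template can be sketched as follows. The sketch assumes that a binary edge map is already available, uses a brute-force distance transform that is only practical for small images, and takes the mean distance under the template as the matching score. This chamfer-style score is an assumption and is not necessarily the formula used in [8] or [10]. Only a single vertical orientation is shown, and all function names are hypothetical.

```python
import numpy as np

def distance_transform(edge_map):
    """Euclidean distance from every pixel to its nearest edge pixel (brute force)."""
    edges = np.argwhere(edge_map)                        # (E, 2) edge coordinates
    grid = np.indices(edge_map.shape).reshape(2, -1).T   # (H*W, 2) pixel coordinates
    dists = np.linalg.norm(grid[:, None, :] - edges[None, :, :], axis=2)
    return dists.min(axis=1).reshape(edge_map.shape)

def parallel_line_template(length, width):
    """Offsets (dy, dx) of two vertical segments that are `width` pixels apart."""
    ys = np.arange(length)
    left = np.stack([ys, np.zeros(length, dtype=int)], axis=1)
    right = np.stack([ys, np.full(length, width)], axis=1)
    return np.vstack([left, right])

def matching_scores(dt, template):
    """Mean DT value under the template at every valid position (lower is better)."""
    h, w = dt.shape
    th, tw = template.max(axis=0) + 1
    scores = np.full((h - th + 1, w - tw + 1), np.inf)
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = dt[template[:, 0] + y, template[:, 1] + x].mean()
    return scores
```

Because the distance transform is computed once per image, evaluating many template sizes and orientations only requires repeated lookups into `dt`.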
With the part candidates, the best body configuration is the one with the lowest value of the dissimilarity $D_H$, expressed in the following equation:
$D_H = w_g D_g + w_t D_{top} + w_a D_{app} + w_l D_{lg}$. (2.7)
In Equation 2.7, the weights $\{w\}$ are learned from training data. $D_g$ is a term dedicated to pruning configurations that are not physically valid. $D_{top}$ corresponds to a topological matching between the part assembly and a model of the human skeleton. This model is inspired by the “shock graphs” mentioned in [17]. $D_{app}$ encodes prior information about symmetry in clothing and favors assemblies in which the appearance of the left and right limbs is similar. The last term, $D_{lg}$, corresponds to a more global reasoning about the configuration and estimates a combined image likelihood of the assembly by explicitly taking self-occlusion into account.
A brief illustration of the system flow is shown in Figure 2-13.
Figure 2-13 Illustration of algorithm provided in [8].
In [9], the authors claim that detection performance depends heavily on discriminative part classifiers. Hence, in this work, the densely sampled “shape context descriptor” provided in [18] is adopted to describe body parts. Moreover, the AdaBoost training scheme proposed in [19] is applied. Finally, given the part candidates, the inference of configuration follows the same steps as proposed in [5], using the “Pictorial Structural Model”. The form of this model is as follows:
$p(L \mid D) \propto p(l_0) \cdot \prod_{i=0}^{N} p(d_i \mid l_i) \cdot \prod_{(i,j) \in E} p(l_i \mid l_j)$. (2.8)
In this equation, $p(L \mid D)$ is the probability of the configuration $L$ given the image evidence $D$. This probability is proportional to the product of the three terms on the right-hand side of Equation 2.8. $p(l_0)$ denotes the probability that the torso is located at $l_0$. $\prod_{i=0}^{N} p(d_i \mid l_i)$ represents the probability for the parts to be placed at their locations $l_i$, where $d_i$ is the evidence map for the i-th part. Finally, $\prod_{(i,j) \in E} p(l_i \mid l_j)$ describes the spatial relation between the position of the i-th part and the position of the j-th part. It should be noted that torso candidates are detected first in this work. Several results of this work are provided in Figure 2-14.
Figure 2-14 Several detection results provided in [9].
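Equation 2.8 can be evaluated in log space for a single candidate configuration, as in the following minimal sketch. It assumes a uniform prior over torso positions, reads the appearance term directly from per-part evidence maps, and models each spatial term as an unnormalized isotropic Gaussian around an expected offset between connected parts. The `mean_offsets` table, `sigma` and `eps` are hypothetical parameters and are not taken from [5] or [9].

```python
import numpy as np

def log_posterior(locations, evidence_maps, edges, mean_offsets, sigma=5.0, eps=1e-12):
    """Unnormalized log p(L|D) of one configuration, following Equation 2.8."""
    # p(l_0): uniform prior over torso positions (assumption), constant in log space.
    h, w = evidence_maps[0].shape
    score = -np.log(h * w)
    # Appearance terms p(d_i | l_i): evidence map value of part i at its location.
    for d_i, (y, x) in zip(evidence_maps, locations):
        score += np.log(d_i[y, x] + eps)
    # Spatial terms p(l_i | l_j): Gaussian penalty on deviation from the expected offset.
    for i, j in edges:
        offset = np.subtract(locations[i], locations[j])
        dev = offset - mean_offsets[(i, j)]
        score += -0.5 * float(np.dot(dev, dev)) / sigma ** 2
    return float(score)
```

Working in log space turns the products of Equation 2.8 into sums, which avoids numerical underflow when many parts are involved. Comparing two configurations then reduces to comparing their scores.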