

Chapter 3 Implementations

3.3. Facial Feature Detection

… and empirically determine its range [0, 249] for the skin color.

We now summarize the skin model in the following. A pixel belongs to the skin model if (1) the pixel has a larger R value than G value, (2) its Cb value is between 77 and 127, and (3) its Û value is between 0 and 249. Therefore, during the color transform of an image, we compute the Cb and Û values only for the pixels having larger R than G values, where Cb = −0.1687R − 0.3313G + 0.5B + 128 (see Equation (1)) and Û is obtained from its definition given earlier.
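As a minimal illustration of the skin test just summarized, the following C++ sketch checks the three conditions for one RGB pixel. The Û computation is defined earlier in this chapter and is therefore only stubbed here; the stub name computeUhat is hypothetical.

```cpp
#include <cstdint>

// Placeholder for the U-hat value defined earlier in this chapter; the real
// definition is not reproduced here, so this stub is purely illustrative.
double computeUhat(std::uint8_t R, std::uint8_t G, std::uint8_t B) {
    (void)R; (void)G; (void)B;
    return 0.0;   // replace with the actual U-hat expression
}

// A pixel belongs to the skin model if (1) R > G, (2) 77 <= Cb <= 127,
// and (3) 0 <= U-hat <= 249, as summarized above.
bool isSkinPixel(std::uint8_t R, std::uint8_t G, std::uint8_t B) {
    if (R <= G) return false;                                    // condition (1)
    const double Cb = -0.1687 * R - 0.3313 * G + 0.5 * B + 128.0;
    if (Cb < 77.0 || Cb > 127.0) return false;                   // condition (2)
    const double Uhat = computeUhat(R, G, B);
    return Uhat >= 0.0 && Uhat <= 249.0;                         // condition (3)
}
```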

During facial feature detection, eyes and mouth are located using a technique recently reported by Hsu et al. [Hsu02]. This section briefly reviews their technique.

Considering a color image I(R, G, B), the image is first transformed from the RGB space into the YCbCr space using Equation (1). Let I′(Y, Cr, Cb) denote the transformed image.

A. Eye Detection

At the beginning of eye detection, two maps, EyeMapC and EyeMapL, are constructed. Let (x, y) denote the location of any pixel. The map EyeMapC is computed from the chrominance components Cb and Cr (in [Hsu02], EyeMapC = (1/3){Cb² + (255 − Cr)² + Cb/Cr}, with each term scaled within [0, 255]). The resultant EyeMapC is further enhanced by histogram equalization. Next, EyeMapL is constructed as

EyeMapL(x, y) = (Y(x, y) ⊕ gσ(x, y)) / (Y(x, y) ⊖ gσ(x, y) + 1),

where ⊕ and ⊖ are the morphological dilation and erosion operators, respectively, and gσ(x, y) is the structuring function. Afterwards, the maps EyeMapC and EyeMapL are integrated by

EyeMap(x, y) = min{EyeMapC(x, y), EyeMapL(x, y)}.

Refer to the example shown in Figure 17, in which the input image and the maps EyeMapC, EyeMapL and EyeMap are depicted in Figures 17(a), (b), (c) and (d), respectively. Note that the eyes have been highlighted in the map EyeMap. We next locate the eyes in the map by thresholding (Figure 17(e)), connected component labeling, and size filtering (Figure 17(f)). In general, a number of eye candidates are detected. Figure 17(g) shows the located eye candidates.
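A possible per-pixel construction of the maps is sketched below in C++. The EyeMapC form and the normalization follow our reading of [Hsu02] and may differ in detail from the implementation here; EyeMapL is assumed to have been computed beforehand by the grayscale dilation/erosion of Y described above, and the histogram equalization of EyeMapC is omitted for brevity.

```cpp
#include <algorithm>
#include <vector>

// Chroma eye map in the form used by [Hsu02]:
// EyeMapC = (1/3){ Cb^2 + (255 - Cr)^2 + Cb/Cr }, each term scaled to [0, 255].
// The final map is the pixel-wise minimum of EyeMapC and the given EyeMapL.
std::vector<double> buildEyeMap(const std::vector<double>& Cb,
                                const std::vector<double>& Cr,
                                const std::vector<double>& eyeMapL) {
    const std::size_t n = Cb.size();
    std::vector<double> c2(n), cr2(n), ratio(n), eyeMap(n);
    for (std::size_t i = 0; i < n; ++i) {
        c2[i]    = Cb[i] * Cb[i];
        cr2[i]   = (255.0 - Cr[i]) * (255.0 - Cr[i]);
        ratio[i] = Cb[i] / std::max(Cr[i], 1.0);   // avoid division by zero
    }
    auto normalize = [](std::vector<double>& m) {  // scale a map to [0, 255]
        double lo = *std::min_element(m.begin(), m.end());
        double hi = *std::max_element(m.begin(), m.end());
        double range = std::max(hi - lo, 1e-9);
        for (double& v : m) v = 255.0 * (v - lo) / range;
    };
    normalize(c2); normalize(cr2); normalize(ratio);
    for (std::size_t i = 0; i < n; ++i) {
        double eyeMapC = (c2[i] + cr2[i] + ratio[i]) / 3.0;
        eyeMap[i] = std::min(eyeMapC, eyeMapL[i]);   // integrate the two maps
    }
    return eyeMap;
}
```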

Fig. 17. Example for illustrating eye detection: (a) input image, (b) EyeMapC, (c) EyeMapL, (d) EyeMap, (e) thresholding, (f) size filtering, (g) eye candidates.

B. Mouth Detection

In mouth detection, one map MouthMap is computed. Following [Hsu02], MouthMap(x, y) = Cr²·(Cr² − η·Cr/Cb)², where the weight η is estimated from the chrominance values of the face region detected earlier. Refer to the example shown in Figure 18, in which the input image and the associated map MouthMap are depicted in Figures 18(a) and (b), respectively.

The mouth has been emphasized in MouthMap. We locate the mouth in the map by thresholding (Figure 18(c)), connected component labeling, and size filtering (Figure 18(d)). In general, a number of mouth candidates are detected. Figure 18(e) shows the located mouth candidates.
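The following C++ sketch shows one way to compute such a MouthMap. The specific weight η, including the 0.95 factor and the averaging over the region, follows [Hsu02] and should be treated as an assumption about the exact constants used here.

```cpp
#include <vector>

// MouthMap = Cr^2 * (Cr^2 - eta * Cr/Cb)^2 over the pixels of the face region,
// with eta estimated from the region's average chrominance as in [Hsu02].
std::vector<double> buildMouthMap(const std::vector<double>& Cb,
                                  const std::vector<double>& Cr) {
    const std::size_t n = Cr.size();
    double sumCr2 = 0.0, sumCrOverCb = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sumCr2      += Cr[i] * Cr[i];
        sumCrOverCb += Cr[i] / (Cb[i] > 0.0 ? Cb[i] : 1.0);
    }
    const double eta = 0.95 * (sumCr2 / n) / (sumCrOverCb / n);

    std::vector<double> mouthMap(n);
    for (std::size_t i = 0; i < n; ++i) {
        double cr2 = Cr[i] * Cr[i];
        double d   = cr2 - eta * Cr[i] / (Cb[i] > 0.0 ? Cb[i] : 1.0);
        mouthMap[i] = cr2 * d * d;
    }
    return mouthMap;
}
```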

Fig. 18. Example for illustrating mouth detection: (a) input image, (b) MouthMap, (c) thresholding, (d) size filtering, (e) mouth candidates.

3.4. Face Tracking

Many techniques [Cap07] are feasible for moving object tracking, such as Bayesian dynamic models, bootstrap filters, mean shift, particle filters, and sequential Monte Carlo. In this study, a linear Kalman filter [Wel95] is employed because of its efficiency and the short period of system state change under consideration. Note that the system under our consideration is the driver's face, which is to be tracked over a video sequence. The time interval of a face state change is the period between two successive images. Face state change within such a small interval is assumed to be linear. In the following, we briefly review the linear Kalman filter and then address how the located driver's face is tracked over a video sequence using the filter.

A. Linear Kalman Filter

There are two models involved in the linear Kalman filter: a system model and a measurement model, both characterized by linear stochastic difference equations. The system model is governed by s_{t+1} = A s_t + w_t, in which s_t is the system state vector at time t, A is a state transition matrix, and w_t represents the system perturbation, assumed to have a normal probability distribution p(w) = N(0, Q), where 0 is the zero vector and Q is the covariance matrix of the system perturbation. The measurement model is formulated as z_t = H s_t + v_t, in which z_t is the measurement vector at time t, H relates the system state s_t to the measurement z_t, and v_t represents the measurement noise, also assumed to have a normal probability distribution p(v) = N(0, R), where R is the covariance matrix of the measurement noise. It is desirable that the a posteriori error covariance C_t = E[e_t e_t^T] be minimized, where e_t denotes the a posteriori estimation error of the state and (·)^T indicates transposition.

The Kalman filter proceeds as follows. There are two phases constituting the Kalman filter: the prediction and the updating phases. In the prediction phase, the a priori state estimate and error covariance are computed as ŝ⁻_{t+1} = A ŝ_t and C⁻_{t+1} = A C_t A^T + Q. In the updating phase, the Kalman gain K_{t+1} = C⁻_{t+1} H^T (H C⁻_{t+1} H^T + R)^{−1} is computed first, and the a posteriori state estimate and error covariance are then obtained as ŝ_{t+1} = ŝ⁻_{t+1} + K_{t+1}(z_{t+1} − H ŝ⁻_{t+1}) and C_{t+1} = (I − K_{t+1} H) C⁻_{t+1}. In the above process, matrices A and H are defined according to the practical application at hand and covariance matrices Q and R are empirically determined. Given the initial values of ŝ_0, C_0 and z_0, the above two phases repeat until a stopping criterion is reached and the result is ŝ_t.
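For concreteness, a single prediction-updating cycle might be coded as below. This sketch uses the Eigen library purely for the matrix algebra, which is not what the original Borland C++ implementation used; the symbols mirror those in the text.

```cpp
#include <Eigen/Dense>

// One prediction + updating cycle of the linear Kalman filter described above.
struct KalmanState {
    Eigen::VectorXd s;   // a posteriori state estimate (s-hat)
    Eigen::MatrixXd C;   // a posteriori error covariance
};

KalmanState kalmanStep(const KalmanState& prev,
                       const Eigen::VectorXd& z,      // new measurement
                       const Eigen::MatrixXd& A,      // state transition matrix
                       const Eigen::MatrixXd& H,      // measurement matrix
                       const Eigen::MatrixXd& Q,      // system perturbation covariance
                       const Eigen::MatrixXd& R) {    // measurement noise covariance
    // Prediction phase: a priori state and covariance.
    Eigen::VectorXd sPred = A * prev.s;
    Eigen::MatrixXd CPred = A * prev.C * A.transpose() + Q;

    // Updating phase: Kalman gain, corrected state, corrected covariance.
    Eigen::MatrixXd K = CPred * H.transpose() *
                        (H * CPred * H.transpose() + R).inverse();
    KalmanState next;
    next.s = sPred + K * (z - H * sPred);
    next.C = (Eigen::MatrixXd::Identity(prev.C.rows(), prev.C.cols()) - K * H) * CPred;
    return next;
}
```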

B. Tracking

As mentioned, the system under consideration is the located driver's face, which is described in terms of a triangle connecting the two eyes and the mouth of the driver.

We hence define the measurement vector z of a face as z = (x_l, y_l, x_r, y_r, x_m, y_m)^T, where (x_l, y_l), (x_r, y_r) and (x_m, y_m) are the locations of the left eye, right eye and mouth, respectively. The state vector s of the face further includes the velocities of the facial features, (u_l, v_l), (u_r, v_r) and (u_m, v_m), i.e., s = (x_l, y_l, x_r, y_r, x_m, y_m, u_l, v_l, u_r, v_r, u_m, v_m)^T.

Since the measurement vector z is a sub-vector of the state vector s, the measurement-state relating matrix H simply extracts the position components of s, i.e., H = [I_6 | 0_{6×6}], a 6×12 matrix whose first row is (1 0 0 0 0 0 0 0 0 0 0 0).

Since the matrices A and H stem from simplified motion equations, errors are inevitably introduced into the predicted system state and measurement through these matrices. These errors and many others are assumed to be compensated by the system perturbation term w ~ N(0, Q) and the measurement noise term v ~ N(0, R). Empirically, we observed that the location and velocity errors are about four and two pixels, respectively. Accordingly, we define Q and R as diagonal matrices whose entries are the corresponding squared errors: Q = diag(16, 16, 16, 16, 16, 16, 4, 4, 4, 4, 4, 4) and R = 16·I_6.

Note that we fix the above two matrices during iterations because the same motion equations are assumed in each iteration.
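Under the constant-velocity motion model implied above (a unit time step of one frame), the matrices could be assembled as follows. The diagonal values of Q and R reflect the stated 4-pixel location and 2-pixel velocity errors, which is our reading of the partially lost matrices, so treat the exact numbers as an assumption.

```cpp
#include <Eigen/Dense>

// Matrices for the 12-dimensional face state
// s = (xl, yl, xr, yr, xm, ym, ul, vl, ur, vr, um, vm)^T and the
// 6-dimensional measurement z = (xl, yl, xr, yr, xm, ym)^T,
// assuming a constant-velocity model with a unit time step (one frame).
void buildFaceTrackingMatrices(Eigen::MatrixXd& A, Eigen::MatrixXd& H,
                               Eigen::MatrixXd& Q, Eigen::MatrixXd& R) {
    A = Eigen::MatrixXd::Identity(12, 12);
    A.topRightCorner(6, 6) = Eigen::MatrixXd::Identity(6, 6);  // position += velocity

    H = Eigen::MatrixXd::Zero(6, 12);
    H.leftCols(6) = Eigen::MatrixXd::Identity(6, 6);           // z extracts the positions

    // Approx. 4-pixel location error and 2-pixel velocity error (variances 16 and 4).
    Q = Eigen::MatrixXd::Zero(12, 12);
    Q.topLeftCorner(6, 6)     = 16.0 * Eigen::MatrixXd::Identity(6, 6);
    Q.bottomRightCorner(6, 6) =  4.0 * Eigen::MatrixXd::Identity(6, 6);

    R = 16.0 * Eigen::MatrixXd::Identity(6, 6);                // 4-pixel measurement error
}
```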

We next determine the initial values of w_0, C_0, ŝ_0 and z_0. The system perturbations w_t (t ≥ 0) are random vectors generated by N(0, Q). The initial a posteriori error covariance matrix C_0 is updated with iterations, so a precise C_0 is not necessary. Empirically, a 10-pixel positional error and a 5-pixel speed error have been assumed when initializing C_0.

To determine the initial face state vector ŝ_0 = (x_l, y_l, x_r, y_r, x_m, y_m, u_l, v_l, u_r, v_r, u_m, v_m)^T, recall that a face candidate is determined as the actual face only when it is repeatedly detected in two successive images. Let ((x_l^0, y_l^0), (x_r^0, y_r^0), (x_m^0, y_m^0)) and ((x_l^1, y_l^1), (x_r^1, y_r^1), (x_m^1, y_m^1)) be the locations of the left eye, right eye and mouth at times t_0 and t_1, respectively. Based on the above values, the components of the initial state vector ŝ_0 are given as x_i = (x_i^0 + x_i^1)/2, y_i = (y_i^0 + y_i^1)/2, u_i = x_i^1 − x_i^0 and v_i = y_i^1 − y_i^0 for i ∈ {l, r, m}.
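A small sketch of this initialization, assuming the feature ordering of ŝ_0 given above (left eye, right eye and mouth positions followed by their velocities):

```cpp
#include <Eigen/Dense>

// Build the initial state s-hat_0 from the facial-feature locations detected
// at times t0 and t1 (left eye, right eye, mouth): positions are averaged and
// velocities are taken as the frame-to-frame differences.
Eigen::VectorXd initialFaceState(const double p0[3][2],   // (x, y) at t0
                                 const double p1[3][2]) { // (x, y) at t1
    Eigen::VectorXd s0(12);
    for (int i = 0; i < 3; ++i) {
        s0(2 * i)         = 0.5 * (p0[i][0] + p1[i][0]);  // x position
        s0(2 * i + 1)     = 0.5 * (p0[i][1] + p1[i][1]);  // y position
        s0(6 + 2 * i)     = p1[i][0] - p0[i][0];          // u velocity
        s0(6 + 2 * i + 1) = p1[i][1] - p0[i][1];          // v velocity
    }
    return s0;
}
```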

Recall that the measurement vector z consists of the positions of the facial features, i.e., z = (x_l, y_l, x_r, y_r, x_m, y_m)^T. Considering a facial feature, its location (x_{t−1}, y_{t−1}) and velocity (u_{t−1}, v_{t−1}) at time t−1 are known. To attain the location (x_t, y_t) of the feature at time t, we decide a rectangular searching space S in image I_t. Let (x_ul, y_ul) and (x_lr, y_lr) denote the coordinates of the upper left and lower right corners of S, respectively. The two corners are determined as

(x_ul, y_ul) = (x_{t−1} − l_ee/2 + l_ee·u_{t−1}/10, y_{t−1} − l_em/a + l_em·v_{t−1}/10),
(x_lr, y_lr) = (x_{t−1} + l_ee/2 + l_ee·u_{t−1}/10, y_{t−1} + l_em/a + l_em·v_{t−1}/10),

where l_ee is the length between the two eyes, l_em is the length between the center of the two eyes and the mouth, and a is a positive constant (4 for eyes and 3 for the mouth). Both l_ee and l_em are calculated at the beginning of the system operation.
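A sketch of the search-window computation is given below. Note that the signs of the half-width and half-height offsets are reconstructed from context, since they were lost in the extracted equations, so treat them as an assumption.

```cpp
struct Rect { double x_ul, y_ul, x_lr, y_lr; };

// Rectangular search window for one facial feature at time t, centered around
// its previous location (x, y) and shifted by a fraction of its previous
// velocity (u, v). lee is the eye-to-eye distance, lem the eyes-to-mouth
// distance, and a = 4 for eyes, 3 for the mouth.
Rect searchWindow(double x, double y, double u, double v,
                  double lee, double lem, double a) {
    Rect s;
    s.x_ul = x - lee / 2.0 + lee * u / 10.0;
    s.y_ul = y - lem / a   + lem * v / 10.0;
    s.x_lr = x + lee / 2.0 + lee * u / 10.0;
    s.y_lr = y + lem / a   + lem * v / 10.0;
    return s;
}
```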

Having determined the searching area of a facial feature, we look for the feature within the area by matching edge magnitudes so as to reduce the effect of illumination variation. The edge magnitude of the facial feature is computed within a window (l_ee/1.5 × l_em/3.5 for eyes and l_ee × l_em/2.5 for the mouth) centered at (x_{t−1}, y_{t−1}) in image I_{t−1}, and the edge magnitude of the searching area is computed in I_t. During matching of the edge magnitudes between the feature and the searching area, the right three-fourths of the searching area is examined for the left eye, the left three-fourths is examined for the right eye, and the entire searching area is examined for the mouth. This three-fourths strategy for the eyes avoids incomplete eye shapes when the driver's head turns considerably (see the example shown in Figure 19).
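The window sizes and the three-fourths restriction can be captured as in the following sketch; the enum and function names are illustrative only.

```cpp
enum class Feature { LeftEye, RightEye, Mouth };

// Template size (in pixels) used for edge-magnitude matching of a feature,
// taken from the window sizes quoted in the text.
void templateSize(Feature f, double lee, double lem, double& w, double& h) {
    if (f == Feature::Mouth) { w = lee;       h = lem / 2.5; }
    else                     { w = lee / 1.5; h = lem / 3.5; }
}

// Restrict the horizontal extent of the search window that is actually
// examined: the right three-fourths for the left eye, the left three-fourths
// for the right eye, and the whole window for the mouth.
void examinedRange(Feature f, double x_ul, double x_lr, double& x0, double& x1) {
    double width = x_lr - x_ul;
    switch (f) {
        case Feature::LeftEye:  x0 = x_ul + 0.25 * width; x1 = x_lr;               break;
        case Feature::RightEye: x0 = x_ul;                x1 = x_lr - 0.25 * width; break;
        case Feature::Mouth:    x0 = x_ul;                x1 = x_lr;               break;
    }
}
```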

Fig. 19. Incomplete shapes of the eyes when the driver's head makes a large turn.

3.5. Fuzzy Reasoning

Given the parameter values of the driver's facial expression, a fuzzy integral technique is employed to deduce the drowsiness level of the driver. In this section, the fuzzy integral technique is discussed first. Fuzzy reasoning based on the technique is then addressed.

A. Fuzzy Integral

Fuzzy integrals [Zim91] are generalizations of the Lebesgue [Sug77] or Riemann [Dub82] integral. In this study, the Sugeno fuzzy integral, extended from the Lebesgue integral, is considered. Let f: S → [0, 1] be a function defined on a finite set S and g: P(S) → [0, 1] be a set function defined over the power set of S. Function g(·), often referred to as a fuzzy measure function, satisfies the axioms of boundary conditions, monotonicity, and continuity [Wan92]. Sugeno further imposed on g(·) an additional property: for all A, B ⊂ S with A ∩ B = ∅,

g(A ∪ B) = g(A) + g(B) + λ g(A) g(B),  λ > −1. (2)

The fuzzy integral of f(·) with respect to g(·) is then defined as

∫ f(·) ∘ g(·) = sup_{α ∈ [0, 1]} [α ∧ g(A_α)], (3)

where ∧ represents the fuzzy intersection (minimum) and A_α = {s ∈ S | f(s) ≥ α}.

The above fuzzy integral provides an elegant nonlinear numeric approach suitable for integrating multiple sources of information or evidence to arrive at a value that indicates the degree of support for a particular hypothesis or decision. Suppose we have several hypotheses, H = {h_i, i = 1, ..., n}, from which a final decision d is to be made. Let e_{h_i} be the integral value evaluated for hypothesis h_i. We then determine the final decision by d = arg max_{h_i ∈ H} e_{h_i}.

Considering any hypothesis h ∈ H, let S be the set collecting all the information sources at hand. Function f(·), receiving an information source s, returns a value f(s) that reveals the level of support of s for the hypothesis h. Since the degrees of worth of the information sources may differ, function g(·) takes as input a subset of information sources and gives a value that reflects the degree of worth of that subset relative to the other sources. Let d(s) = g({s}). Function d(·) is referred to as the fuzzy density function.

Since g(S) = 1, the value of λ can be determined by solving λ + 1 = ∏_{s ∈ S} (1 + λ d(s)). Letting S_i = {s_1, ..., s_i}, the measure values can then be computed recursively as

g(S_1) = d(s_1),  g(S_i) = d(s_i) + g(S_{i−1}) + λ d(s_i) g(S_{i−1}), (4)

which reduces the number of subsets required to perform the fuzzy integral from 2^|S| (by Equation (3)) to |S|. Moreover, with the information sources sorted so that f(s′_1) ≥ f(s′_2) ≥ ··· ≥ f(s′_{|S|}), the integral in Equation (3) can be evaluated as

e = max_{i=1,...,|S|} [f(s′_i) ∧ g(S′_i)],  S′_i = {s′_1, ..., s′_i}. (5)
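Solving for λ is a one-dimensional root-finding problem; a simple bisection sketch in C++ is shown below. The bracketing of the root is an implementation choice, not part of the thesis.

```cpp
#include <vector>

// Solve lambda + 1 = prod_i (1 + lambda * d_i) for the Sugeno lambda-measure,
// given the fuzzy densities d_i. The nonzero root greater than -1 is found by
// bisection (lambda = 0 is always a trivial root and is excluded here).
double solveLambda(const std::vector<double>& d) {
    auto F = [&](double lam) {
        double prod = 1.0;
        for (double di : d) prod *= (1.0 + lam * di);
        return prod - (1.0 + lam);
    };
    double sum = 0.0;
    for (double di : d) sum += di;
    // If sum(d) > 1 the root lies in (-1, 0); if sum(d) < 1 it lies in (0, inf).
    double lo, hi;
    if (sum > 1.0) { lo = -1.0 + 1e-9; hi = -1e-9; }
    else           { lo = 1e-9; hi = 1.0; while (F(hi) < 0.0) hi *= 2.0; }
    for (int iter = 0; iter < 200; ++iter) {
        double mid = 0.5 * (lo + hi);
        if (F(lo) * F(mid) <= 0.0) hi = mid; else lo = mid;
    }
    return 0.5 * (lo + hi);
}
```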

B. Reasoning

Recall that eight facial parameters are considered for drowsiness analysis, i.e., percentage of eye closure over time, eye blinking frequency, eye closure duration, head orientation (including the tilt, pan, and rotation angles), mouth opening duration, and degree of gaze. Let D = {d_1, d_2, ..., d_8} denote the relative degrees of importance of the parameters. Three criteria, worth, accuracy and reliability, have been involved in determining the importance degrees of the parameters. The first criterion is somewhat intuitive, whereas the other two are determined from the experiments discussed in Section 4. Accordingly, we define D as D = {0.93, 0.8, 0.85, 0.5, 0.3, 0.3, 0.5, 0.9}. Let V = {v_1, v_2, ..., v_8} be the measured values of the eight parameters, respectively. We transfer V, according to the predefined transfer functions of the parameters, into S = {s_1, s_2, ..., s_8}, where s_i indicates the degree of drowsiness corresponding to the parametric value v_i. Set S forms what we call the collection of information sources.

Based on the sets D and S, we want to determine, using the fuzzy integral method, the drowsiness level l of the driver, l ∈ H = {m, m + 0.1, m + 0.2, ..., M}, where H is the hypothesis set and m and M are determined from the minimum and maximum of the information-source values, rounded to multiples of 0.1 (m = ⌊10·min_i s_i⌋/10 and M = ⌈10·max_i s_i⌉/10). We first compute the fuzzy measure values g(S_i) of the subsets S_i ⊆ S of information sources by Equation (4). Afterwards, for each hypothesis h_i ∈ H, we perform the fuzzy integral process. First of all, we calculate the support value f_{h_i}(s_j) for each information source s_j ∈ S by f_{h_i}(s_j) = 1 − |s_j − h_i|. We next sort the information sources according to their support values. Let S′ = {s′_1, s′_2, ..., s′_8} be the sorted version of S such that f_{h_i}(s′_1) ≥ f_{h_i}(s′_2) ≥ ··· ≥ f_{h_i}(s′_8). Substituting f_{h_i}(s′_j) and g(S′_j) into Equation (5), we obtain the fuzzy integral value e_{h_i} of hypothesis h_i. The above process repeats for each hypothesis in H. Finally, the drowsiness level l of the driver is simply determined as l = h* = arg max_{h_i ∈ H} e_{h_i}.

Chapter 4.

Experimental Results

The proposed driver's drowsiness detection system has been developed in Borland C++ and runs on an Intel Core Solo T1300 1.66 GHz PC under Windows XP Professional. The input video sequence to the system is at a rate of 30 frames/second, and the size of the video images is 320 × 240 pixels.

We divide our experiments into two parts. The first part investigates the efficiency and accuracy of the individual steps of the system process; the results provide clues for assigning the degrees of importance of the facial parameters used in the fuzzy reasoning step. The second part exhibits the performance of the entire system.

4.1. Individual Steps

Recall that the system workflow involves five major steps: preprocessing, facial feature extraction, face tracking, parameter estimation, and reasoning. Of these five steps, the facial feature extraction and face tracking steps dominate the processing speed of the system, whereas the parameter estimation and reasoning steps determine the accuracy of the system.

4.1.1. Facial Feature Extraction and Face Tracking

Refer to Figure 20, where the experimental result of facial feature extraction and face tracking over a video sequence is exhibited. At the beginning, the system repeatedly invokes the facial feature extraction module until all the facial features are located in two successive images. Thereafter, the system initiates the face tracking module. The tracking module continues until it misses detecting the right eye of the driver in frame 198 because of a rapid turning of the driver's head. The system immediately invokes the facial feature extraction module again. After successfully locating all the facial features in two successive images (i.e., frames 199 and 200), the face tracking module takes over and tracks the features over the subsequent images. Our current facial feature extraction module takes about 1/8 second to detect the facial features in an image, whereas the face tracking module takes about 1/25 second to locate the facial features in an image.


Fig. 20. Facial feature extraction and face tracking over a video sequence.

Figure 21 shows the robustness of the facial feature extraction module under different conditions of illumination (e.g., shiny light, sunny day, cloudy day, twilight, underground passage and tunnel), head orientation, facial expression, and the wearing of glasses. The robustness of the facial feature extraction module is primarily due to the use of a face model, which helps to find the remaining features once one or two facial features have been detected. However, there are always uncertainties during facial feature extraction. We confirm a result only when it is repeatedly obtained in two successive images.

Fig. 21. Robustness of facial feature extraction under different conditions: (d) twilight, (e) underground passage, (f) tunnel, (g) head orientations, (h) facial expression, (i) glasses.

4.1.2. Parameter Estimation

A. Eye Parameters

Figure 22 displays 18 images with closed eyes, which were manually extracted from a video clip of 300 frames. Our system misidentified the closed eyes in frames 102 and 113 as open eyes, while no open eye was misclassified as a closed eye. In this experiment, our system achieved an identification rate of 99.3% (=(300−2)/300). The d_e, v_e and b_e curves associated with the video clip are depicted in Figure 23. Based on these curves, the parameters of blinking frequency, percentage of eye closure over time (PERCLOS) and eye closure duration are calculated as 67.2 times/min, 0.07, and 0.12 seconds, respectively. Similar experiments have been conducted for 20 different video clips. The identification rates ranged from 96.4% to 99.7%, and the average identification rate was 98.6%.
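For reference, the three eye parameters can be derived from a per-frame binary closed-eye signal (the b_e curve) roughly as follows; the exact definitions used by the system are given earlier in the thesis, so treat this sketch as a common-sense approximation.

```cpp
#include <vector>

// Eye parameters from a per-frame binary closed-eye signal b_e (1 = closed),
// assuming 30 frames/second:
//   PERCLOS          = fraction of frames with closed eyes,
//   blink frequency  = number of closed-eye episodes per minute,
//   closure duration = average length (seconds) of a closed-eye episode.
struct EyeParams { double perclos, blinksPerMinute, closureDurationSec; };

EyeParams eyeParameters(const std::vector<int>& be, double fps = 30.0) {
    int closedFrames = 0, blinks = 0;
    for (std::size_t i = 0; i < be.size(); ++i) {
        if (be[i]) {
            ++closedFrames;
            if (i == 0 || !be[i - 1]) ++blinks;   // start of a closed-eye episode
        }
    }
    EyeParams p;
    double seconds = be.size() / fps;
    p.perclos = be.empty() ? 0.0 : static_cast<double>(closedFrames) / be.size();
    p.blinksPerMinute = seconds > 0.0 ? blinks * 60.0 / seconds : 0.0;
    p.closureDurationSec = blinks > 0 ? (closedFrames / fps) / blinks : 0.0;
    return p;
}
```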

Fig. 22. Images containing closed eyes in a video clip of 300 images.

B. Mouth Parameter

In the next experiment, regarding the estimation of mouth openness duration, we calculated the d_m, v_m and b_m curves of a mouth from a video clip of 300 frames. Figure 24 shows the calculated curves, which collectively reveal the occurrence of a mouth opening within the video clip. The system returned a duration of 1.95 seconds for the mouth opening. We display in Figure 25 the video segment from frames 157 to 205 of the video clip, from which the mouth openness duration was manually estimated as about 2 seconds. Accordingly, our system achieved an accuracy rate of 97.5% (=1.95/2) in estimating the parameter of mouth openness duration in this experiment. Similar accuracy rates have been observed for different video clips. However, we still assigned a relatively small degree of importance (0.5) to the parameter of mouth openness duration because mouth opening can occur when the driver talks with passengers.

Fig. 24. (a) d_m curve, (b) v_m curve, and (c) b_m curve derived from a video clip of 300 images.

Fig. 25. Video segment (frames 157 to 205) containing the mouth opening in a video clip of 300 images.

C. Head Parameters

Figure 26 shows some experimental results of head parameter estimation. In this figure, the first column displays example images of drivers and the second column shows their calculated values of head parameters (including pan, tilt and rotation

angles). In the first example image, the driver faces front; the system estimated 0.0, 7.8 and 4.4 degrees for the pan, tilt and rotation angles of the driver's head, respectively. These results visually agree with the orientation of the driver's head in the image. In the second example image, the driver is looking out the left window; the system returned 37.6, 0.0 and 9.4 degrees for the pan, tilt and rotation angles, respectively. In this example, the estimated pan and tilt angles are reasonable, while the estimated rotation angle seems a little large. A similar situation was observed in the third example image, in which the driver is looking at the rearview mirror; the system estimated −43.9, 0.0 and 6.8 degrees for the pan, tilt and rotation angles, respectively. Visually, the rotation angles of the drivers' heads in the third and fourth example images should both be close to zero degrees. On the contrary, the estimated rotation angle (18.3 degrees) for the driver's head in the last example image seems too large.

Input image / Estimated head parameters (degrees):
Example 1: pan 0.0, tilt 7.8, rotation 4.4
Example 2: pan 37.6, tilt 0.0, rotation 9.4
Example 3: pan −43.9, tilt 0.0, rotation 6.8
Example 4: pan −3.4, tilt 0.0, rotation 18.3

Fig. 26. Experimental results of head parameter estimation.

There are two major factors that can lead to inaccuracy in head parameter estimation: imprecise localization of the facial features and the 3D-from-2D estimation of parametric values. Empirically, the error ranges of the estimated pan, tilt, and rotation angles are about ±5°, ±5° and ±10°, respectively. Accordingly, we would like to give higher degrees of importance to the pan and tilt angles than to the rotation angle. However, the tilt and rotation angles are more decisive than the pan angle in determining the level of drowsiness. Based on the above observations, we assign degrees of importance of 0.5, 0.3 and 0.3 to the tilt, pan and rotation angles, respectively.

D. Gaze Parameter

Two video clips are utilized to illustrate the estimation of the gaze parameter. Recall that the gaze parameter is estimated based on the difference (d_g) between the predicted and observed horizontal displacements of the driver's face provided by the Kalman filter during face tracking. Figure 27 displays the d_g curves of the two video clips. Clearly, the d_g curve of video clip 1 has relatively larger distribution magnitudes as well as larger variation than the d_g curve of video clip 2. A d_g curve with large distribution magnitudes signifies large movements of the driver's head, and one with a large distribution variation indicates a high frequency of head movement; both reflect low degrees of gaze. The gaze degree of a video clip is defined as the ratio of the gaze duration occurring in the clip to the time interval of the clip. To calculate the gaze durations of the two video clips, their binary b_g curves derived from their d_g curves are depicted in Figure 28. The gaze degrees of the two video clips are easily evaluated from the b_g curves as 0.36 and 0.74, respectively. These results reasonably reflect the gaze situations of the drivers present in the two video clips. Since the parameter of gaze degree plays a critical role in determining the drowsiness level, we assign a degree of importance of 0.9 to the gaze parameter.


Table 1 summarizes the performance of the system on five experimental video sequences. The video sequences were acquired under different illumination conditions: shiny day, sunny day, cloudy day, twilight, and tunnel. The time intervals of the video sequences are indicated in the second column of the table.

Table 1. Performances of the system on five experimental video sequences.

As mentioned, the processing speed of the system is dominated by the facial feature extraction and face tracking steps. The former takes about 1/8 second to detect the facial features in an image, whereas the latter takes about 1/25 second to locate the facial features in an image. The facial feature extraction module is invoked only at the beginning of the system process as well as when the face tracking module misses locating the facial features in an image. The average processing speed of the system hence heavily relies on the success rate of face tracking. The third and fourth columns of Table 1 show the success rates of face tracking and the average processing speeds of the system, respectively.

