
國立臺灣師範大學資訊工程研究所碩士論文 (National Taiwan Normal University, Graduate Institute of Computer Science and Information Engineering, Master's Thesis)

Advisor: Dr. 陳世旺

車載型視覺式駕駛者疲倦昏睡偵測系統
An In-Vehicle Vision-Based Driver's Drowsiness Detection System

Graduate student: 姚國鵬
June 2008 (中華民國九十七年六月)

摘要 (Abstract)

Reports show that many traffic accidents can be attributed to driver fatigue or drowsiness, because fatigue impairs a driver's field of view, alertness, and decision-making ability, and thereby degrades driving efficiency and performance.

In this study we develop a vision-based drowsiness detection and warning system that estimates a driver's potential drowsiness state and issues warnings accordingly. When the driver is judged to be in a low-drowsiness state, the system can take less intrusive actions, such as turning on the air conditioner, dispersing a fragrance, or switching on the radio for some entertainment. When the driver is in a high-drowsiness state, the system can activate navigation aids or warn others that the driver of this vehicle is highly drowsy.

The drowsiness level is computed from facial images acquired by a camera mounted in the front of the vehicle. The system consists of five major steps: preprocessing, facial feature extraction, face tracking, state parameter estimation, and drowsiness inference.

In the preprocessing step, we first reduce the dimensions of the input image to speed up the system. We then compensate for lighting in order to reduce the influence of the ambient illumination. Finally, for every pixel we compute chrominance values in different color spaces, which are used later for facial feature extraction.

Facial feature extraction comprises four sub-steps: skin-color detection, face localization, eye and mouth detection, and feature confirmation. Skin regions are detected from the chrominance values obtained in preprocessing by means of a skin-color model. For face localization we search for the largest skin region; however, the resulting face region is usually incomplete, so searching for facial features only within this incomplete region is unreliable. We therefore actually search for facial features over the entire image, and the face region is used to confirm the detected features.

Once the facial features are confirmed, they are used to track the face until tracking fails, at which point facial feature extraction is invoked again; feature extraction is relatively time-consuming, whereas face tracking is faster and more reliable.

While the system is tracking the face, the facial state parameters are computed, including the percentage of eye closure per unit time, the eye blinking frequency, the duration of eye closure, the gaze state, the duration of mouth opening, and the head orientation.

Finally, these state parameters are used to infer the driver's drowsiness level by means of a fuzzy integral, which combines the various parameters to estimate a single drowsiness level.

We tested the system on many different drivers and illumination levels. The results show that the system operates properly in the daytime. In the future we will extend its usability to nighttime; for this purpose we will incorporate an infrared camera.

Abstract

Many traffic accidents have been attributed to driver drowsiness/fatigue. Drowsiness degrades driving performance through declines in visibility, situational awareness, and decision-making capability. In this study, a vision-based drowsiness detection and warning system is presented, which attempts to bring a driver's own potential drowsiness to his/her attention. The information provided by the system can also be utilized by adaptive systems to manage noncritical operations, such as starting a ventilator, spreading fragrance, turning on a radio, and providing entertainment options. In a high-drowsiness situation, the system may initiate navigation aids and alert others to the drowsiness of the driver.

The system estimates the fatigue level of a driver based on his/her facial images acquired by a video camera mounted in the front of the vehicle. There are five major steps involved in the system process: preprocessing, facial feature extraction, face tracking, parameter estimation, and reasoning. In the preprocessing step, the input image is sub-sampled to reduce the image size and, in turn, the processing time. A lighting compensation process is next applied to the reduced image in order to remove the influence of ambient illumination variations. Afterwards, a number of chrominance values are calculated for each image pixel; these are used in the next step to detect facial features.

Four sub-steps constitute the feature extraction step: skin detection, face localization, eyes and mouth detection, and feature confirmation. To begin, the skin areas are located in the image based on the chrominance values of the pixels calculated in the previous step and a predefined skin model. We next search for the face region within the largest skin area. However, the detected face is typically imperfect, and facial feature detection within the imperfect face region is unreliable.

We actually look for facial features throughout the entire image. As to the face region, it is later used to confirm the detected facial features.

Once the facial features are located, they are tracked over the video sequence until they fail to be detected in a video image. At that moment, the facial feature detection process is invoked again. Although facial feature detection is time consuming, facial feature tracking is fast and reliable. During facial feature tracking, parameters of facial expression, including the percentage of eye closure over time, the eye blinking frequency, the durations of eye closure, gaze, and mouth opening, as well as the head orientation, are estimated. The estimated parameters are then utilized in the reasoning step to determine the driver's drowsiness level. A fuzzy integral technique is employed, which integrates the various types of parameter values to arrive at a decision about the drowsiness level of the driver.

A number of video sequences of different drivers and illumination conditions have been tested. The results reveal that our system works reasonably well in the daytime. In future work we plan to extend the system to nighttime operation; for this, infrared sensors should be included.

Index Terms – Drowsiness detection, Lighting compensation, Skin model, Face model, Percentage of eye closure over time, Eye blinking frequency, Durations of eye closure, gaze and mouth opening, Head orientation, Kalman filter, Fuzzy decision

誌謝 (Acknowledgments)

My graduate school life has finally come to an end. Over these two years I experienced a life quite different from college and grew a great deal, and for this I must thank the many people who guided and helped me.

First, I thank my advisor, Professor 陳世旺. Thanks to his constant reminders and guidance, and the long hours of research and discussion with him, this thesis could be completed on schedule. His enthusiasm and attitude toward scholarship are truly admirable.

I also thank Professor 曾定章 of National Central University, Professor 廖弘源 of Academia Sinica, Professor 鍾國亮 of National Taiwan University of Science and Technology, and Professor 方瓊瑤 of National Taiwan Normal University for taking time out of their busy schedules to review my thesis.

I am grateful to 方瓊瑤, 鐘允中, 王俊明, and 張祥利 for their guidance during my graduate studies, and to my labmates 建富, 柏諺, 雨農, 安鈞, 智偉, 志成, 亭凱, 宗儒, 士祥, and 士棋. Besides our research discussions, we often exercised and shared life together, which made graduate school colorful, like one big family.

Finally, I thank my family for their support. I was always busy with my studies, and with my hometown far away on an outlying island we rarely had time to gather; it was my parents' understanding that allowed me to concentrate on finishing my degree. They also provided me with a life free of want and supported my education all the way to graduate school, for which I am deeply grateful. Special thanks as well to my two elder sisters and my brother-in-law for their care.

To borrow a common saying: "There are too many people to thank, so let me just thank heaven."

Table of Contents

List of Figures
List of Tables

Chapter 1 Introduction
Chapter 2 Drowsiness Detection and Warning System
  2.1 System Configuration
  2.2 System Operation
    2.2.1 Preprocessing
    2.2.2 Facial Feature Extraction
    2.2.3 Face Tracking
    2.2.4 Parameter Estimation
    2.2.5 Reasoning and Decision
Chapter 3 Implementations
  3.1 Lighting Compensation
  3.2 Skin Model and Color Transform
  3.3 Facial Feature Detection
  3.4 Face Tracking
  3.5 Fuzzy Reasoning

Chapter 4 Experimental Results
  4.1 Individual Steps
    4.1.1 Facial Feature Extraction and Face Tracking
    4.1.2 Parameter Estimation
  4.2 The Entire System
Chapter 5 Concluding Remarks and Future Work

Bibliography

List of Figures

Fig. 1. Installation of the video camera
Fig. 2. Example images
Fig. 3. Block diagram for the drowsiness detection system
Fig. 4. Flowchart for the preprocessing stage
Fig. 5. Lighting compensation
Fig. 6. Flowchart for the feature extraction stage
Fig. 7. Skin detection
Fig. 8. Incomplete face areas
Fig. 9. Facial feature detection
Fig. 10. Face model
Fig. 11. (a) Opening eye, (b) closed eye
Fig. 12. (a) The de-curve of a video clip of 300 images, (b) the sub-clip from images 77 to 84, (c) the sub-clip from images 145 to 152, (d) the ve-curve, and (e) the be-curve
Fig. 13. (a) Closed mouth, (b) opening mouth
Fig. 14. (a) The dm-curve of a video clip of 300 images, (b) the sub-clip from images 221 to 285, (c) the vm-curve, and (d) the bm-curve

Fig. 15. (a) The dg-curve of a video clip of 300 images and (b) the corresponding bg-curve
Fig. 16. Examples for illustrating lighting compensation
Fig. 17. Example for illustrating eye detection
Fig. 18. Example for illustrating mouth detection
Fig. 19. Incomplete shapes of eyes when the driver's head has a large turning
Fig. 20. Facial feature extraction and face tracking over a video sequence
Fig. 21. Robustness of facial feature extraction under different conditions
Fig. 22. Images containing closed eyes in a video clip of 300 images
Fig. 23. (a) de-curve, (b) ve-curve, and (c) be-curve derived from a video clip of 300 images
Fig. 24. (a) dm-curve, (b) vm-curve, and (c) bm-curve derived from a video clip of 300 images
Fig. 25. Images containing closed eyes in a video clip of 300 images
Fig. 26. Experimental results of head parameter estimation
Fig. 27. The dg-curves of (a) video clip 1, (b) video clip 2
Fig. 28. The bg-curves of (a) video clip 1, (b) video clip 2
Fig. 29. Distributions of drowsiness level evaluated from (a) video 1, (b) video 2, (c) video 3, and (d) video 4

List of Tables

Table 1. Performances of the system on five experimental video sequences
Table 2. Status identification rates of facial features in video sequences 1-4

Chapter 1 Introduction

Over the past several years, there has been growing interest in intelligent vehicles (IV), one of the three major branches (intelligent vehicles, roads, and systems) of intelligent transportation systems. Many systems related to IV have been widely discussed in the literature as well as in public and private documents. Some of these systems are mundane while others are futuristic. The driver drowsiness detection and warning system considered in this study belongs to the latter; it attempts to bring a driver's own drowsiness status to his/her attention during driving. The information provided by the system can also be utilized by adaptive systems to manage noncritical operations, such as starting a ventilator, turning on a radio, offering relaxing tunes, and providing entertainment options. In a high-drowsiness situation, the system may initiate navigation aids and alert others to the drowsiness of the driver.

Drowsiness degrades driving performance through declines in visibility, situational awareness, and decision-making capability, and has long been known to be a major hazard in road safety [Car00, Fid07, Gan05, Häk00, Sab05, Sag99, Vie06]. Drowsiness occurs easily for drivers of buses, trucks, tankers, and container vehicles on long, empty roads and on drives requiring little motor activity. These vehicles carry either many people or large amounts of goods (especially dangerous goods such as gases, chemicals, and poisonous materials).

Once accidents happen to such vehicles, the losses of human lives and property can be enormous.

Various techniques have been reported for detecting driver drowsiness/fatigue. They can be categorized into three classes: techniques analyzing (1) physiological statuses, (2) driving behaviors, and (3) facial appearances. In studies of physiological statuses [Can02, Con02, Sai92, Uen94, Vuc02, Wil02], the electrocardiogram (ECG), electromyogram (EMG), electroencephalogram (EEG), electrodermal activity (EDA), and electrooculogram (EOG) have been utilized, which record heart activity, muscle activity, brain activity, skin conductivity, and eye movement, respectively. Some of these data have been shown to possess close relationships with drowsiness status, such as the ECG (recording heart rate variations) and the EEG (recording brain waves). However, the devices for collecting such data are typically invasive; that is, they are in physical contact with the driver during data acquisition. Intrusive instruments inevitably annoy drivers. Moreover, physiological devices installed in vehicles are sensitive to environmental variations.

Fatigue analysis based on driving behaviors [Esk03, Mor96, Uen94] needs diverse sensors to collect such data as steering wheel angle/torque, gas/braking pedal positions, lateral/longitudinal speed/acceleration, gear changes, lateral lane position, and course changes. Each type of data records one kind of driving behavior, and the abnormality of a single driving behavior is not by itself indicative of drowsy driving.

An integrating process is required to combine various types of data in order to reach a decision about the drowsiness status of the driver. Although some driving behavioral data are readily obtained from the equipment already available in current vehicles, other data require extra instruments to collect, which increases the financial load on car owners. Besides, their installation may be subject to limitations of vehicle type, driver experience, and driving conditions [Ji04]. Unlike physiological data, which are measured directly from drivers, driving behavioral data are observed from vehicles and have to be transformed to relate them to the driver. This data transformation introduces additional uncertainties into the drowsiness analysis.

The third class of techniques [Din98, Esk03, Gra98, Hay02, Hor04, Ito02, Ji04, Ohn02, Pop03, Smi00, Wu04] estimates the drowsiness level of a driver by perceiving his/her facial expressions in image data. Facial expressions carry rich information, conveying not only organic physiological statuses but also internal moods [Mit03]. This class of approaches is primarily motivated by the human visual system, which can easily identify the vigilance/fatigue level of a person based on his/her facial appearance. Furthermore, images are acquired by video cameras, which are known to be non-intrusive and inexpensive. The advantages of non-intrusive, inexpensive sensors and the intimate connection between extrinsic facial expressions and intrinsic statuses make this class of approaches particularly attractive. In this study, a vision-based driver's drowsiness detection and warning system is considered.

Facial expressions, such as eye and eyelid movements, pupil response, gaze fixation, eye blinking, mouth occlusion, and head movement, have often been considered in various applications. The associated parameters commonly employed for drowsiness analysis include the degree, speed, and duration of eye closure, the percentage of eye closure over time, the eye blinking frequency, the gaze duration, the head orientation, the head nodding frequency, and the degree and duration of mouth openness.

To estimate facial parameters, facial features (e.g., eyes, mouth, and nose) have to be detected first. It has been common to locate the face first, followed by facial feature detection within the located face. To locate faces in an image, skin colors have commonly been used. Various color spaces, such as HSV (or HSI) [Hor04, Ike03], Lab, LUV [Ada00], LUX [Lie04], HMMD [Fan03], TSL [Che03], XYZ, YIQ, YES [Sab96], and YCrCb [Cha99], have been utilized to describe skin colors. However, the appearance of skin color varies significantly under different conditions of light source, imaging distance, and background. No single color space can well delimit the boundary of the skin-tone cluster. In this study, we use multiple color spaces (RGB, YCrCb, and LÛX̂) to delineate the skin cluster.

Instead of detecting the face prior to its features, Ji et al. [Ji04] used two cameras: a narrow-angle camera and a wide-angle camera.

The narrow-angle camera, focused on the eyes, monitors eyelid and gaze movements, whereas the wide-angle camera, focused on the face, monitors head movement and facial expression. Although the Ji et al. method simplified the facial feature detection process, two cameras were used and pre-focusing the cameras on the target objects was required. Furthermore, a near-infrared (NIR) light source was employed. This light source, together with the associated NIR cameras, not only resists the influence of ambient illumination variations but also enables the system to operate in both daytime and nighttime. However, NIR imaging systems suffer from the lack of color information, which plays an important role in increasing the reliability of facial feature detection [Hsu02]. In [Hay02, Ito02, Lui06, Ohn02], active IR illuminators and IR cameras were used. The IR systems have the same advantages and disadvantages as the NIR systems. In this study, we use only one ordinary video camera.

Potential difficulties that may hamper the performance of the proposed system include video instability originating from vehicle vibrations as well as illumination variations due to entering or exiting a tunnel or a shadow (e.g., cast by the ego-vehicle, a large roadside building, or a plane). Our system handles the above difficulties to an extent.

The rest of this thesis is organized as follows. In Chapter 2, the configuration and workflow of the proposed system are described. Implementation details of the system are then addressed in Chapter 3. Chapter 4 demonstrates the feasibility and robustness of the proposed

system, followed by concluding remarks and future work in Chapter 5.

Chapter 2 Drowsiness Detection and Warning System

In this chapter, the proposed drowsiness detection and warning system is presented. We discuss the system configuration first and then its workflow.

2.1 System Configuration

Three major components constitute the system: a video camera, an in-vehicle computer, and a warning device. Of these, the camera installation is of major concern. Different camera installations give rise to images with different characteristics and consequently call for different image-processing techniques. In the work of Takai et al. [Tak03] on driver face and eye detection, a video camera was attached to the rearview mirror. Since a driver almost always adjusts the rearview mirror before driving, the camera fastened to the mirror is adjusted as well. By appropriately arranging the orientations of the mirror and the camera, adequate images of the driver are captured. This arrangement is convenient for different drivers of the same vehicle. However, since the rearview mirror is typically installed near the upper center of the front windshield, only side views of the driver are obtained. Moreover, the eyes become small or even absent from the images once the driver turns his/her head away from the camera (e.g., looks down or out of the window). In this study, we mount the video camera on top of the instrument panel right behind the steering wheel (see Figure 1).

The camera has a tilt angle of about 30 degrees, pointing at the driver. Figure 2 shows some example images of a driver with different face orientations (looking up, down, left, and right) taken by the video camera.

Fig. 1. Installation of the video camera.

Fig. 2. Example images.

2.2 System Operation

Refer to Figure 3, where a block diagram illustrating the overall operation of the system is depicted. There are five major blocks, labeled preprocessing, facial feature extraction, face tracking, parameter estimation, and reasoning. Each block represents a main stage of system operation and can be further divided into a number of steps. We discuss the blocks separately in the ensuing subsections. Key techniques for realizing the system are detailed in Chapter 3.

Fig. 3. Block diagram for the drowsiness detection system.

2.2.1 Preprocessing

Referring to Figure 4, a flowchart for the preprocessing stage is shown. There are three steps in this stage: subsampling, lighting compensation, and color transform.

Fig. 4. Flowchart for the preprocessing stage.

Consider an input video image. First of all, we reduce the image by uniform subsampling (in practice, picking one of every two pixels) in order to save processing time.

Some noise is eliminated in this step as well. Next, a lighting compensation process is applied to the reduced image. Refer to Figure 5(a), where an image taken while driving in a tunnel is shown. Both the intensity and the color of the image have been affected by the light sources in the tunnel. The lighting compensation process, addressed in Section 3.1, normalizes the brightness and chromatic characteristics of an image in order to reduce the influence of varying illumination conditions. See Figure 5(b) for the lighting compensation result of Figure 5(a). The facial features of the driver (e.g., brows, eyes, nose, mouth, and skin color), which are important for drowsiness analysis, have been enhanced in the resulting image. Thereafter, in the color transform step we calculate a number of chrominance values for each image pixel, as discussed in Section 3.2. The calculated chrominance values are used in the next stage, feature extraction, to locate facial features in the image.

Fig. 5. Lighting compensation: (a) input image, (b) resultant image.
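To make the subsampling step concrete, the following minimal sketch (not taken from the thesis implementation; the numpy array shape and the factor of two are the only assumptions) keeps one of every two pixels along each axis, quartering the number of pixels handled by the later stages.

```python
import numpy as np

def subsample(image: np.ndarray, step: int = 2) -> np.ndarray:
    """Keep one of every `step` pixels along both axes (the thesis uses step = 2)."""
    return image[::step, ::step].copy()

# Example: a 480x640 BGR frame becomes 240x320.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
small = subsample(frame)
print(small.shape)   # (240, 320, 3)
```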

2.2.2 Facial Feature Extraction

In the facial feature extraction stage, the facial features of the eyes, mouth, and face are detected. Figure 6 depicts a flowchart for this stage. Four steps constitute the stage: skin detection, face localization, eyes and mouth detection, and feature confirmation.

Fig. 6. Flowchart for the feature extraction stage.

To begin, we locate skin regions in the input image based on the chrominance values of the pixels calculated in the previous stage. Explicitly, for each image pixel we examine its chrominance values to see whether they satisfy the criteria of a prebuilt skin model, addressed in Section 3.2. If the pixel fits the skin model, it is regarded as a skin pixel; otherwise, it is a non-skin pixel. However, due to several uncertain factors (e.g., noise, imprecise calculation of chrominance values, and an imperfect skin model), both false negatives (i.e., skin pixels classified as non-skin) and

false positives (i.e., non-skin pixels classified as skin) may occur. Referring to the example shown in Figure 7, false negatives result in holes in skin areas, whereas false positives bring on noisy patches. Noisy patches are typically small (see Figure 7(b)). We preserve the largest connected skin region so as to discard the noisy patches. Moreover, the largest skin area usually corresponds to the face region of interest (see Figure 7(c)).

Fig. 7. Skin detection: (a) input image, (b) potential skin areas, and (c) the determined face region.

As mentioned, false negatives incur holes in skin areas, so the located face region often contains holes. Unfortunately, the holes may be large (see Figure 8), and detecting the eyes and mouth within such a face region is unreliable. We therefore actually search for facial features throughout the entire image; the imperfect face region is later used to confirm the detected facial features. Searching for facial features throughout the entire image is time consuming. However, the facial feature detection process is not necessarily applied to every video image.

We trace facial features over the video images once they have been located in a frame. The facial feature detection process is initiated only when it receives a "no feature" signal from the system. Eye and mouth detection are addressed in Section 3.3.

Fig. 8. Incomplete face areas.

The facial feature detection process often returns a number of eye and mouth candidates. Refer to the examples shown in Figure 9, in which eye and mouth candidates are marked in Figures 9(a) and (b), respectively. To determine the actual eyes and mouth, we rely on both a predefined face model and the face region located earlier. The face model specifies the relationships between facial features in terms of scale-invariant constraints. Refer to Figure 10, where the face model is depicted. Five constraints (two qualitative and three quantitative) describe the face model: (1) the mouth is lower than the two eyes; (2) the horizontal position of the mouth is between those of the two eyes; (3) lee < llm, lee < lrm, and llm/lrm ≈ 1; (4) the angle between the line connecting the two eyes and the horizontal line is smaller than 10 degrees; (5) the angle between the line joining the two eyes and the line connecting the mouth and either eye is between 45 and 80 degrees.

Fig. 9. Facial feature detection: (a) eye candidates, (b) mouth candidates, (c) actual facial features.

Fig. 10. Face model.

Let Se and Sm be the sets of eye and mouth candidates, respectively. In the feature confirmation step, each time we choose two eye candidates from Se and one mouth candidate from Sm. If these three features satisfy the constraints of the face model, the features form a face candidate, represented by the triangle formed by connecting the three features.

We collect all face candidates in the set Sf. Thereafter, for each face candidate we compute a degree of confidence, deg. Let α indicate the level of shape similarity between the triangle of the face candidate and an equilateral triangle, defined as

α = (1/π) · max{ |ai − π/3| : 1 ≤ i ≤ 3 },

where the ai are the internal angles of the triangle of the face candidate. Let β denote the level of closeness between the center of gravity cf of the face candidate and that, cs, of the skin area, defined by

β = dis(cf, cs) / lr,

where lr specifies the row length of the images and dis(·) is a distance function. The degree of confidence of the face candidate under consideration is then calculated as deg = e^(−(α+β)/2). Finally, we determine the potential face f by f = arg max over fk in Sf of deg(fk), i.e., the face candidate with the largest degree of confidence. Only if this potential face is observed again in the next image is it regarded as the actual face. See the examples shown in Figure 9; the located actual faces are exhibited in Figure 9(c).
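A minimal sketch of this confirmation score is given below. The coordinates, the skin-region centroid, and the helper names are illustrative only; they are not taken from the thesis implementation.

```python
import numpy as np

def triangle_angles(p1, p2, p3):
    """Internal angles (radians) of the triangle with vertices p1, p2, p3."""
    pts = [np.asarray(p, float) for p in (p1, p2, p3)]
    angles = []
    for i in range(3):
        a, b, c = pts[i], pts[(i + 1) % 3], pts[(i + 2) % 3]
        v1, v2 = b - a, c - a
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles

def confidence(left_eye, right_eye, mouth, skin_centroid, row_length):
    """deg = exp(-(alpha + beta)/2): similarity to an equilateral triangle plus
    closeness of the candidate's centroid to the skin-region centroid."""
    angles = triangle_angles(left_eye, right_eye, mouth)
    alpha = max(abs(a - np.pi / 3) for a in angles) / np.pi
    centroid = np.mean([left_eye, right_eye, mouth], axis=0)
    beta = np.linalg.norm(centroid - np.asarray(skin_centroid, float)) / row_length
    return np.exp(-(alpha + beta) / 2)

# Pick the candidate triple with the largest confidence (illustrative values).
candidates = [((100, 120), (160, 118), (130, 170)),
              ((100, 120), (160, 118), (131, 300))]
best = max(candidates,
           key=lambda c: confidence(*c, skin_centroid=(130, 150), row_length=240))
print(best)
```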

2.2.3 Face Tracking

Once the driver's face is located, the system switches to a tracking module. This module, characterized by a linear Kalman filter and addressed in Section 3.4, traces the face over the video sequence until it fails to locate the face in an image. At that moment, the facial feature extraction stage is invoked again. Since the facial feature extraction process takes much longer than the face tracking module, the tracking module significantly reduces the processing time of face localization. Moreover, face localization through tracking achieves higher accuracy than localization via facial feature extraction.

2.2.4 Parameter Estimation

In the parameter estimation stage, the parameters of percentage of eye closure over time, eye blinking frequency, eye closure duration, head orientation (including tilt, pan, and rotation angles), mouth opening duration, and degree of gaze are estimated from the located driver's face. These parameters are later used in the reasoning stage to infer the drowsiness level of the driver. To begin, the facial features are enclosed in rectangular windows. For the eyes, windows of size lee/1.5 × lem/3.5 are used, and for the mouth a window of size lee × lem/2.5 is employed, where lee is the length between the two eyes and lem is the length between the center of the two eyes and the mouth (see Figure 10). Both lee and lem are calculated from the face detected at the early stage of system processing. Characteristics of the facial features are computed within these windows and are used to estimate the parameters mentioned above for drowsiness reasoning.

A. Eye Parameters

Consider an eye. It is observed that the vertical edge magnitude of an opening eye is typically larger than that of a closed eye. Referring to Figure 11, two images containing an opening eye and a closed eye, together with the associated vertical edge magnitude maps, are displayed. The opening eye has relatively larger and stronger edge

magnitudes than the closed eye. In addition, the higher the degree of eye closure, the smaller the summed edge magnitude. Let Ele and Ere be the average edge magnitudes of the left and right eyes, respectively. We define the degree de of eye closure as de = min{Ele, Ere}/255.

Fig. 11. (a) Opening eye, (b) closed eye.

Refer to Figure 12(a), where the calculated de values of a video clip of 300 images are graphically depicted. Note the local minima between images 79 and 149 in the figure. We exhibit the eyes extracted from the sub-clips from images 77 to 84 in Figure 12(b) and from images 145 to 152 in Figure 12(c). The eyes present in images 80 and 148 are closed or almost closed. An image containing a closed eye in fact corresponds to a local minimum in the de-curve. However, there are many local minima along the curve, and only those that are small enough correspond to images containing closed eyes. In order to highlight those minima, we derive the ve-curve from the de-curve according to ve = (de − a)^2, where a is determined as follows. Let m denote the mean of the de-curve, i.e., m = (1/n) Σ_{i=1}^{n} dei, where n is the number of images of the video clip under

consideration. If the calculated mean is close to that of the previous video clip, a is set to the current mean; otherwise, it is set to the previous mean. Figure 12(d) displays the ve-curve computed from the de-curve. Comparing the two curves, significant valleys in the de-curve have been emphasized as peaks in the ve-curve. We threshold the ve-curve to obtain a binary curve, referred to as the be-curve, shown in Figure 12(e).

Fig. 12. (a) The de-curve of a video clip of 300 images, (b) the sub-clip from images 77 to 84, (c) the sub-clip from images 145 to 152, (d) the ve-curve, and (e) the be-curve.

Based on the be-curve, we easily calculate the parameters of percentage of eye closure over time (PERCLOS), blinking frequency (BF), and eye closure duration (D). Let n (= 300 in our experiments) be the number of images in a video clip and Tn be the duration of the clip. The percentage of eye closure over time is defined as PERCLOS = (1/n) Σ_{i=1}^{n} bei, where bei is the binary value of the be-curve at image i. The blinking frequency is defined as BF = np/Tn, where np is the number of pulses along the be-curve. Let Di denote the duration of pulse i. The eye closure duration is then defined as D = max_{1≤i≤np} Di. The above parameter estimation is performed for every video clip.
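The three eye parameters can be read directly off the binary be-curve. The sketch below is a minimal illustration under assumed values (a 300-frame clip lasting 10 seconds, a synthetic curve); the function name is hypothetical rather than part of the thesis code.

```python
import numpy as np

def eye_curve_parameters(be, clip_seconds):
    """PERCLOS, blinking frequency, and longest closure duration from a binary
    be-curve (1 = eye closed in that frame) covering one clip of n images."""
    be = np.asarray(be, dtype=int)
    n = len(be)
    perclos = be.sum() / n                        # fraction of frames with closed eyes
    # A "pulse" is a maximal run of consecutive 1s.
    edges = np.diff(np.concatenate(([0], be, [0])))
    starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
    pulses = ends - starts                        # pulse lengths in frames
    bf = len(pulses) / clip_seconds               # blinks per unit time (Tn = clip duration)
    frame_time = clip_seconds / n
    d = pulses.max() * frame_time if len(pulses) else 0.0   # longest closure, in seconds
    return perclos, bf, d

# 300-frame clip lasting 10 s with two closures, one lasting 4 frames.
curve = np.zeros(300, dtype=int)
curve[80:84] = 1
curve[148] = 1
print(eye_curve_parameters(curve, clip_seconds=10.0))
```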

B. Head Parameters

Head parameters, namely the tilt (α), pan (β), and rotation (γ) angles of the head, are estimated from the located face. Since these parameters are three-dimensional in nature while the located face lies in the 2D image plane, only approximate estimates of the parameters can be achieved. Recall that the located face is described in terms of a triangle connecting the two eyes and the mouth. The orientation of the face triangle reflects, to some extent, the orientation of the head in 3D space. Referring to the face model shown in Figure 10, if the head tilts, the line segment lem is foreshortened. Let lem′ be the foreshortened line segment. The tilt angle α of the head can then be determined as α = cos⁻¹(lem′/lem). Similarly, let lee′ be the foreshortened line segment of lee when the head pans. The pan angle β of the head is given by β = cos⁻¹(lee′/lee). Finally, if the head rotates, its rotation angle γ is given by γ = cos⁻¹(|xl − xr|/lee), where xl and xr are the horizontal coordinates of the left and right eyes, respectively.
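A small sketch of these three angle estimates is given below. The frontal-view reference lengths and the clipping guard against noisy ratios greater than one are assumptions of the sketch, not details stated in the thesis.

```python
import numpy as np

def head_angles(left_eye, right_eye, mouth, l_ee_ref, l_em_ref):
    """Approximate tilt, pan and rotation angles (degrees) from the face triangle.
    l_ee_ref and l_em_ref are the reference lengths measured when the face was
    first located (frontal view)."""
    le, re, m = (np.asarray(p, float) for p in (left_eye, right_eye, mouth))
    eye_center = (le + re) / 2
    l_ee = np.linalg.norm(re - le)                 # current, possibly foreshortened, lengths
    l_em = np.linalg.norm(m - eye_center)
    ratio = lambda a, b: np.clip(a / b, 0.0, 1.0)  # guard: noise can make ratios exceed 1
    tilt = np.degrees(np.arccos(ratio(l_em, l_em_ref)))
    pan = np.degrees(np.arccos(ratio(l_ee, l_ee_ref)))
    rot = np.degrees(np.arccos(ratio(abs(re[0] - le[0]), l_ee)))
    return tilt, pan, rot

# Frontal reference: eyes 60 px apart, eye-center-to-mouth distance 55 px.
print(head_angles((100, 120), (152, 126), (128, 172), l_ee_ref=60, l_em_ref=55))
```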

C. Mouth Parameters

Refer to Figure 13, where a closed mouth and an opening mouth, together with the associated edge magnitude maps, are displayed. The opening mouth has relatively larger and stronger edge magnitudes than the closed mouth. In addition, the higher the degree of mouth openness, the larger the summed edge magnitude. Let Em be the average edge magnitude of the mouth. We define the degree dm of mouth openness as dm = Em/255.

Fig. 13. (a) Closed mouth, (b) opening mouth.

Refer to Figure 14(a), where the calculated dm values of a video clip of 300 images are graphically depicted. Note the local maxima between images 235 and 287 in the figure. We exhibit the mouths extracted from the sub-clip from images 212 to 285 in Figure 14(b). The mouths present between images 247 and 276 are opened or almost opened. An image containing an opened mouth in fact corresponds to a local maximum in the dm-curve. However, there are many local maxima along the curve, and only those that are large enough correspond to images containing opened mouths. In order to highlight those maxima, we derive the vm-curve from the dm-curve according to vm = (dm − b)^2, where b is determined as follows. Let m′ denote the mean of the dm-curve, i.e., m′ = (1/n′) Σ_{i=1}^{n′} dmi, where n′ is the number of images of the video clip under consideration. If the calculated mean is close to that of the previous video clip, b is set to the current mean; otherwise, it is set to the previous mean. Figure 14(c) displays the vm-curve computed from the dm-curve. Comparing the two curves, significant peaks in the dm-curve have been further emphasized in the vm-curve. We threshold the vm-curve to obtain a binary curve, referred to as the bm-curve, shown in

Figure 14(d).

Fig. 14. (a) The dm-curve of a video clip of 300 images, (b) the sub-clip from images 221 to 285, (c) the vm-curve, and (d) the bm-curve.

Based on the bm-curve, for each clip of 300 images we calculate the parameter of mouth openness duration (Dm). Let Di denote the duration of pulse i. The mouth openness duration is then defined as Dm = max_{1≤i≤np} Di, where np is the number of pulses along the bm-curve.

D. Gaze Parameter

When a person gazes at something, the person keeps nearly fixed eye and face orientations.

We assess these two conditions at any time instant based on the difference dg between the predicted and observed horizontal displacements of the face, known from the Kalman filter during face tracking. Figure 15(a) shows the calculated dg-curve for a video clip of 300 images. We threshold the curve to obtain the binary bg-curve shown in Figure 15(b). In this example video clip, there is a salient gaze occurring after image 170. We determine a degree of gaze for each clip of 300 video images, calculated by averaging the vg values over the clip.

Fig. 15. (a) The dg-curve of a video clip of 300 images and (b) the corresponding bg-curve.

2.2.5 Reasoning and Decision

Having obtained the values of the facial parameters, we are ready to determine the drowsiness level of the driver from the parametric values. A fuzzy integral process, addressed in Section 3.5, is employed for this purpose.

However, different parameters have different ranges of values. Before invoking the fuzzy integral process, we have to transfer the ranges of the parameters into a consistent one. The transfer functions below map parametric values into drowsiness degrees within the range [0, 1].

Percentage of eye closure over time: D(x) = 0 for x ≤ 0.12; D(x) = 3.5714x − 0.4286 for 0.12 < x < 0.4; D(x) = 1 for x ≥ 0.4.

Eye blink frequency (times/minute): D(x) = 0 for x ≤ 20; D(x) = 0.05x − 1 for 20 < x < 40; D(x) = 1 for x ≥ 40.

Eye closure duration (seconds): D(x) = 0 for x ≤ 0.15; D(x) = 0.3509x − 0.0526 for 0.15 < x < 3; D(x) = 1 for x ≥ 3.

Head pan angle ([0, 90]): D(x) = 0 for x ≤ 20; D(x) = 0.04x − 0.8 for 20 < x < 45; D(x) = 1 for x ≥ 45.

Head tilt angle ([0, 90]): D(x) = 0 for x ≤ 20; D(x) = 0.04x − 0.8 for 20 < x < 45; D(x) = 1 for x ≥ 45.

Head rotation angle ([0, 90]): D(x) = 0 for x ≤ 10; D(x) = 0.05x − 0.5 for 10 < x < 30; D(x) = 1 for x ≥ 30.

Mouth opening duration (seconds): D(x) = 0 for x ≤ 3; D(x) = 0.3333x − 1 for 3 < x < 6; D(x) = 1 for x ≥ 6.

Gaze degree ([0, 1]): D(x) = 0 for x ≤ 10; D(x) = 0.05x − 0.5 for 10 < x < 30; D(x) = 1 for x ≥ 30.

The fuzzy reasoning step returns a number of integral values, each resulting from a hypothesized degree of drowsiness. In the decision step, the hypothesized drowsiness degree with the largest integral value is regarded as the drowsiness level of the driver. The system can then take actions, such as starting a ventilator, spreading fragrance, turning on a radio, offering relaxing tunes, and providing entertainment options, according to the determined drowsiness level of the driver. In a high-drowsiness situation, the system may initiate navigation aids and alert others to the drowsiness of the driver. Currently, our system only emits different numbers of beeps, proportional to the level of drowsiness.
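All eight transfer functions listed above are saturating ramps of the same shape, so they can be expressed with one helper. The sketch below mirrors those break points exactly; the dictionary keys and the sample measurements are illustrative, not taken from the thesis.

```python
def ramp(x, lo, hi):
    """Piecewise-linear transfer: 0 below lo, 1 above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

# Break points taken from the transfer functions listed above.
TRANSFER = {
    "perclos":        lambda x: ramp(x, 0.12, 0.40),
    "blink_freq":     lambda x: ramp(x, 20, 40),      # times/minute
    "closure_dur":    lambda x: ramp(x, 0.15, 3.0),   # seconds
    "head_pan":       lambda x: ramp(x, 20, 45),      # degrees
    "head_tilt":      lambda x: ramp(x, 20, 45),
    "head_rotation":  lambda x: ramp(x, 10, 30),
    "mouth_open_dur": lambda x: ramp(x, 3.0, 6.0),    # seconds
    "gaze_degree":    lambda x: ramp(x, 10, 30),
}

measured = {"perclos": 0.30, "blink_freq": 25, "closure_dur": 1.0, "head_pan": 5,
            "head_tilt": 5, "head_rotation": 5, "mouth_open_dur": 0.0, "gaze_degree": 0}
degrees = {k: TRANSFER[k](v) for k, v in measured.items()}
print(degrees)   # each value is a drowsiness degree in [0, 1]
```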

Chapter 3 Implementations

In this chapter, the implementation details of key techniques, including lighting compensation, the color transform and skin model, and facial feature detection, are addressed.

3.1 Lighting Compensation

Unlike images taken under fixed light sources, the input images to our system are captured under unpredictable light sources. Both the brightness and the chromatic characteristics of our images can vary significantly from image to image. The lighting compensation process reduces the influence originating from variations in the ambient lighting conditions.

Consider a color image C(R, G, B), where R, G, and B are the three color components of the image. First of all, we figure out the level of brightness of the image. To this end, we compute the grayscale version I of the image by I = (R + G + B)/3 and then compute its histogram h(·). Afterwards, we calculate the distribution tendency t of h(·). Unlike the skewness of h(·), which measures the asymmetry of h(·) with respect to its domain mean mD (mD = Σ_{i=0}^{L−1} i·h(i) / Σ_{i=0}^{L−1} h(i), where L is the number of gray levels), the distribution tendency of h(·) measures its asymmetry with respect to the domain median m (m = L/2; in this study m = 127).

To begin, we compute the second moment M2 and third moment M3 of h(·) with respect to m:

M2 = Σ_{l=0}^{L} h(l)(l − m)^2 / Σ_{l=0}^{L} h(l),   M3 = Σ_{l=0}^{L} h(l)(l − m)^3 / Σ_{l=0}^{L} h(l).

The distribution tendency t of h(·) is defined as t = M3/(M2·√M2). The t value can be positive or negative: a positive t indicates a relatively bright image, whereas a negative t indicates a relatively dark image. Refer to Figure 16, in which the first row displays the input images and the second row depicts their histograms and the calculated histogram distribution tendencies.

Fig. 16. Examples for illustrating lighting compensation: (a) input image, (b) histogram, distribution tendency, and fraction, (c) resultant image.

Having obtained the histogram distribution tendency t of an image, if t < −10, we determine a fraction p according to p = α·e^(−β(t+138)), where α = 0.713 and

β = 0.013. Then, for each color component of the image we find its pixels with the top p fraction of values and average those values. Let (aR, aG, aB) be the resulting averages of the (R, G, B) color components, respectively. Thereafter, for each image pixel with (r, g, b) color values, we transfer the values into (r′, g′, b′) by

r′ = r × 255/aR,   g′ = g × 255/aG,   b′ = b × 255/aB.

The above equations raise the brightness of the image by rescaling the color values of the image pixels. In a similar vein, if t > 50, we determine a fraction p according to p = α·e^(−β(t+138)), where α = 0.323 and β = 0.011. Then, for each color component of the image we find its pixels with the bottom p fraction of values and average those values. Let (aR, aG, aB) be the resulting averages of the (R, G, B) color components, respectively. Thereafter, for each image pixel with (r, g, b) color values, we transfer the values into (r′, g′, b′) by

r′ = 255 − (255 − r) × 255/(255 − aR),   g′ = 255 − (255 − g) × 255/(255 − aG),   b′ = 255 − (255 − b) × 255/(255 − aB).

The above equations decrease the brightness of the image by rescaling the color values of the image pixels. No operation is applied to the image if −10 < t < 50. Refer to Figure 16(c), where the lighting compensation results for the images of Figure 16(a) are exhibited. Note that the range of t values is between −138 and 138. The lighting compensation process will distort the chromatic characteristics of the original image if its calculated t value is too small (<< −10) or too large (>> 50), because the process adapts the brightness of an image by rescaling its color components separately.
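The following sketch illustrates the brightening branch only (t < −10); the darkening branch for t > 50 is symmetric, using the bottom-p averages and rescaling toward 0. The exact normalization of t, the epsilon guards, and the synthetic test image are assumptions of this sketch rather than details taken from the thesis.

```python
import numpy as np

def distribution_tendency(image_bgr, m=127):
    """Second/third moments of the gray-level histogram about the mid gray level."""
    gray = image_bgr.astype(np.float64).mean(axis=2)
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    levels = np.arange(256, dtype=np.float64)
    w = hist / hist.sum()
    m2 = np.sum(w * (levels - m) ** 2)
    m3 = np.sum(w * (levels - m) ** 3)
    return m3 / (m2 * np.sqrt(m2) + 1e-9)

def brighten(image_bgr, t, alpha=0.713, beta=0.013):
    """Raise the brightness of a dark image (t < -10) by rescaling each channel so
    that the mean of its top-p fraction of values maps to 255."""
    p = alpha * np.exp(-beta * (t + 138))
    out = image_bgr.astype(np.float64)
    for c in range(3):
        channel = out[..., c]
        k = max(1, int(p * channel.size))
        top_mean = np.sort(channel, axis=None)[-k:].mean()   # average of the top-p values
        out[..., c] = channel * (255.0 / max(top_mean, 1.0))
    return np.clip(out, 0, 255).astype(np.uint8)

img = (np.random.rand(120, 160, 3) * 80).astype(np.uint8)   # synthetic dark frame
t = distribution_tendency(img)
compensated = brighten(img, t) if t < -10 else img
```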

3.2 Skin Model and Color Transform

The skin model is defined in terms of three color spaces: RGB, YCrCb, and LÛX̂. The RGB color space is utilized because the input image is represented in terms of R, G, and B components and because skin color has larger R than G values. Therefore, we have the first constraint of the skin model, which states that a pixel is a potential skin pixel if its R value is larger than its G value. This constraint is obviously necessary but not sufficient, because there are non-skin pixels whose R values are also larger than their G values.

The RGB color space merges the chrominance and luminance components in one space, and the appearance of skin color can vary significantly under different illumination conditions; a relatively loose skin-tone cluster is therefore distributed in the RGB space. There are many color spaces with separate chrominance and luminance components. Of these, we prefer the YCrCb color space because of its perceptual uniformity [Poy96] and the low degree of luma dependency and compactness of its skin-tone cluster [Hsu02]. The transformation from the RGB to the YCrCb color space is given by

Y  =  0.299 R + 0.587 G + 0.098 B + 16,
Cr =  0.500 R − 0.4187 G − 0.0813 B + 128,                               (1)
Cb = −0.1687 R − 0.3313 G + 0.500 B + 128,

where Y is the luminance component and Cr and Cb are the chrominance components. Chai and Ngan [Cha99] have suggested the ranges Cr ∈ [133, 173] and Cb ∈ [77, 127] for skin color. However, the Cr range shrinks and shifts as the Y value becomes large or small.

Researchers have attempted to modify the YCrCb space so that the skin-tone cluster becomes luma-independent in the new space. These include the YCrCg space [Dio03], linearly transformed from the YCrCb space, the LUX space [Lie04], nonlinearly transformed from the YCrCb space, and the YCr′Cb′ space [Hsu02], obtained by piecewise-linearly modifying the YCrCb space. In this study, we consider the LÛX̂ space, which is a simplified version of the LUX color space. Both the LUX and LÛX̂ color spaces were proposed by Lievin and Luthon [Lie04]. Starting with the LUX color space, the following equations formulate the transformation from RGB into LUX:

L = (R + 1)^0.3 (G + 1)^0.6 (B + 1)^0.1 − 1,

U = (M/2)(R + 1)/(L + 1) if R < L;  otherwise U = M − (M/2)(L + 1)/(R + 1),

X = (M/2)(B + 1)/(L + 1) if B < L;  otherwise X = M − (M/2)(L + 1)/(B + 1),

where M is the dynamic range of gray levels (in this study M = 256), L is the luminance component, and U and X are the chrominance components.

In the above equations, although the time complexities of the expressions for U and X are reasonable, they involve L, whose own equation has a high computational complexity. Lievin and Luthon [Lie04] therefore suggested replacing L with G because of their close proximity for skin color. Accordingly, the chrominance components U and X are approximated by Û and X̂ as

Û = (M/2)(R + 1)/(G + 1) if R < G;  otherwise Û = M − (M/2)(G + 1)/(R + 1),

X̂ = (M/2)(B + 1)/(G + 1) if B < G;  otherwise X̂ = M − (M/2)(G + 1)/(B + 1).

As mentioned earlier, skin color has larger R than G values. We are hence interested only in the equation Û = M − (M/2)(G + 1)/(R + 1) and empirically determine its range [0, 249] for skin color.

We now summarize the skin model. A pixel belongs to the skin model if (1) its R value is larger than its G value, (2) its Cb value is between 77 and 127, and (3) its Û value is between 0 and 249. Therefore, during the color transform of an image, only for pixels having larger R than G values do we compute the Cb and Û values, by Cb = −0.1687R − 0.3313G + 0.5B + 128 and Û = M − (M/2)(G + 1)/(R + 1).
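A vectorized version of this skin test is straightforward. The sketch below evaluates Cb and Û over the whole image and combines the three conditions into a Boolean mask; the thesis computes the two values only for pixels with R > G and subsequently keeps only the largest connected skin region, both of which are omitted here. The random test image is illustrative.

```python
import numpy as np

def skin_mask(image_rgb, M=256):
    """Boolean mask of potential skin pixels: R > G, Cb in [77, 127], U-hat in [0, 249]."""
    rgb = image_rgb.astype(np.float64)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = -0.1687 * R - 0.3313 * G + 0.5 * B + 128
    u_hat = M - (M / 2.0) * (G + 1.0) / (R + 1.0)
    return (R > G) & (cb >= 77) & (cb <= 127) & (u_hat >= 0) & (u_hat <= 249)

img = (np.random.rand(60, 80, 3) * 255).astype(np.uint8)
mask = skin_mask(img)
print(mask.mean())        # fraction of pixels accepted by the skin model
```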

3.3 Facial Feature Detection

During facial feature detection, the eyes and mouth are located using a technique reported by Hsu et al. [Hsu02]. This section briefly reviews their technique. Considering a color image I(R, G, B), first of all the image is transformed from the RGB space into the YCrCb space using Equation (1). Let I′(Y, Cr, Cb) denote the transformed image.

A. Eye Detection

At the beginning of eye detection, two maps, EyeMapC and EyeMapL, are constructed. Let (x, y) denote the location of any pixel. Then

EyeMapC(x, y) = (1/3) { Cb^2(x, y) + C̃r^2(x, y) + Cb(x, y)/Cr(x, y) },

where C̃r(x, y) = 255 − Cr(x, y), and Cb^2(x, y), C̃r^2(x, y), and Cb(x, y)/Cr(x, y) have all been scaled to within [0, 255]. The resulting EyeMapC is further enhanced by histogram equalization. Next, EyeMapL is constructed:

EyeMapL(x, y) = (Y(x, y) ⊕ gσ(x, y)) / (Y(x, y) ⊖ gσ(x, y) + 1),

where ⊕ and ⊖ are the morphological dilation and erosion operators, respectively, and gσ(x, y) is the structuring function. Afterwards, the maps EyeMapC and EyeMapL are integrated by

EyeMap(x, y) = min{ EyeMapC(x, y), EyeMapL(x, y) }.

Refer to the example shown in Figure 17, in which the input image and the maps EyeMapC, EyeMapL, and EyeMap are depicted in Figures 17(a), (b), (c), and (d), respectively. Note that the eyes have been highlighted in the map EyeMap. We next locate the eyes in the map by thresholding (Figure 17(e)), connected-component labeling, and size filtering (Figure 17(f)). In general, a number of eye candidates are detected. Figure 17(g) shows the located eye candidates.

Fig. 17. Example for illustrating eye detection: (a) input image, (b) EyeMapC, (c) EyeMapL, (d) EyeMap, (e) thresholding, (f) size filtering, (g) eye candidates.
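A rough sketch of the two maps is given below, using OpenCV for the grayscale morphology. The elliptical structuring element, the per-term rescaling to [0, 255], and the synthetic input planes are assumptions of this sketch; the thesis does not state the exact structuring-element shape.

```python
import cv2
import numpy as np

def eye_map(y, cr, cb, ksize=7):
    """EyeMap = min(EyeMapC, EyeMapL) built from 8-bit Y, Cr, Cb planes."""
    def rescale(a):                                   # scale any map into [0, 255]
        a = a.astype(np.float64)
        return 255 * (a - a.min()) / (a.max() - a.min() + 1e-9)

    cr_neg = 255.0 - cr.astype(np.float64)            # C~r = 255 - Cr
    emc = (rescale(cb.astype(np.float64) ** 2)
           + rescale(cr_neg ** 2)
           + rescale(cb / (cr.astype(np.float64) + 1.0))) / 3.0
    emc = cv2.equalizeHist(emc.astype(np.uint8))      # histogram equalization

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    dil = cv2.dilate(y, kernel).astype(np.float64)    # grayscale dilation of Y
    ero = cv2.erode(y, kernel).astype(np.float64)     # grayscale erosion of Y
    eml = rescale(dil / (ero + 1.0))

    return np.minimum(emc.astype(np.float64), eml)

# Synthetic planes just to exercise the function.
y = (np.random.rand(120, 160) * 255).astype(np.uint8)
cr = (np.random.rand(120, 160) * 255).astype(np.uint8)
cb = (np.random.rand(120, 160) * 255).astype(np.uint8)
print(eye_map(y, cr, cb).shape)
```

Thresholding, connected-component labeling, and size filtering of this map then yield the eye candidates, as described above.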

B. Mouth Detection

In mouth detection, a single map, MouthMap, is computed:

MouthMap(x, y) = Cr^2(x, y) · ( Cr^2(x, y) − η · Cr(x, y)/Cb(x, y) )^2,

in which Cr^2(x, y) and Cr(x, y)/Cb(x, y) have been scaled to within [0, 255] and η is estimated by

η = 0.95 · ( Σ_{(x,y)∈Fg} Cr^2(x, y) ) / ( Σ_{(x,y)∈Fg} Cr(x, y)/Cb(x, y) ),

where Fg is the face region detected earlier. Refer to the example shown in Figure 18, in which the input image and the associated map MouthMap are depicted in Figures 18(a) and (b), respectively. The mouth has been emphasized in MouthMap. We locate the mouth in the map by thresholding (Figure 18(c)), connected-component labeling, and size filtering (Figure 18(d)). In general, a number of mouth candidates are detected. Figure 18(e) shows the located mouth candidates.

Fig. 18. Example for illustrating mouth detection: (a) input image, (b) MouthMap, (c) thresholding, (d) size filtering, (e) mouth candidates.
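The corresponding sketch for MouthMap follows; the Boolean face mask, the rescaling of the two terms, and the random test planes are assumptions of this illustration rather than details of the thesis implementation.

```python
import numpy as np

def mouth_map(cr, cb, face_mask):
    """MouthMap = Cr^2 * (Cr^2 - eta * Cr/Cb)^2 with eta estimated over the face region."""
    cr = cr.astype(np.float64)
    cb = cb.astype(np.float64) + 1e-9
    cr2 = cr ** 2
    ratio = cr / cb
    # Scale both terms into [0, 255], as for the eye maps.
    cr2 = 255 * cr2 / cr2.max()
    ratio = 255 * ratio / ratio.max()
    eta = 0.95 * cr2[face_mask].sum() / ratio[face_mask].sum()
    return cr2 * (cr2 - eta * ratio) ** 2

cr = (np.random.rand(120, 160) * 255).astype(np.uint8)
cb = (np.random.rand(120, 160) * 255).astype(np.uint8) + 1
face = np.zeros((120, 160), dtype=bool)
face[40:100, 50:120] = True                      # hypothetical face region from skin detection
m = mouth_map(cr, cb, face)
print(m.argmax())                                # brightest response, a crude mouth cue
```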

3.4 Face Tracking

Many techniques [Cap07] are feasible for moving-object tracking, such as Bayesian dynamic models, the bootstrap filter, mean shift, particle filters, and sequential Monte Carlo methods. In this study, a linear Kalman filter [Wel95] is employed because of its efficiency and the short period of system state change under consideration. Note that the system under our consideration is the driver's face, which is to be tracked over a video sequence. The time interval of a face state change is the period between two successive images, and the face state change within such a small interval is assumed to be linear. In the following, we briefly review the linear Kalman filter and then address how the located driver's face is tracked over a video sequence using the filter.

A. Linear Kalman Filter

Two models are involved in the linear Kalman filter: a system model and a measurement model, both characterized by linear stochastic difference equations. The system model is governed by s_{t+1} = A s_t + w_t, in which s_t is the system state vector

at time t, A is a state transition matrix, and w_t represents the system perturbation, assumed to have a normal probability distribution p(w) = N(0, Q), where 0 is the zero vector and Q is the covariance matrix of the system perturbation. The measurement model is formulated as z_t = H s_t + v_t, in which z_t is the measurement vector at time t, H relates the system state s_t to the measurement z_t, and v_t represents the measurement noise, also assumed to have a normal probability distribution p(v) = N(0, R), where R is the covariance matrix of the measurement noise.

Let ŝ_t⁻ and ŝ_t denote the a priori and a posteriori state estimates, respectively. They are related by ŝ_t = ŝ_t⁻ + k(z_t − H ŝ_t⁻), where k is a blending factor. The difference z_t − H ŝ_t⁻ in this equation reflects the discrepancy between the estimated measurement H ŝ_t⁻ and the actual measurement z_t. Let e_t⁻ and e_t specify the errors of the estimates ŝ_t⁻ and ŝ_t with respect to the actual state s_t, i.e., e_t⁻ = s_t − ŝ_t⁻ and e_t = s_t − ŝ_t. The covariance matrices of e_t⁻ and e_t are C_t⁻ = E[e_t⁻ e_t⁻ᵀ] and C_t = E[e_t e_tᵀ], where the superscript T indicates transposition. It is desirable that the a posteriori error covariance C_t be as small as possible. To this end, substituting e_t = s_t − ŝ_t and ŝ_t = ŝ_t⁻ + k(z_t − H ŝ_t⁻) into C_t = E[e_t e_tᵀ] and minimizing C_t, we obtain k = C_t⁻ Hᵀ (H C_t⁻ Hᵀ + R)⁻¹ and C_t = (I − kH) C_t⁻, where I is the identity matrix.

The Kalman filter proceeds in two phases: a prediction phase and an updating phase. In the prediction phase, ŝ_{t+1}⁻ = A ŝ_t + w_t and C_{t+1}⁻ = A C_t Aᵀ + Q are calculated. In the updating phase, k = C_{t+1}⁻ Hᵀ (H C_{t+1}⁻ Hᵀ + R)⁻¹, ŝ_{t+1} = ŝ_{t+1}⁻ + k(z_{t+1} − H ŝ_{t+1}⁻), and C_{t+1} = (I − kH) C_{t+1}⁻ are computed. In the above process, the matrices A and H are defined according to the

practical application at hand, and the covariance matrices Q and R are determined empirically. Given the initial values of ŝ_0, C_0, and z_0, the above two phases repeat until a stopping criterion is reached, and the result is ŝ_t.

B. Tracking

As mentioned, the system under consideration is the located driver's face, which is described in terms of a triangle connecting the two eyes and the mouth of the driver. We hence define the measurement vector z of a face as z = (x_l, y_l, x_r, y_r, x_m, y_m)ᵀ, where (x_l, y_l), (x_r, y_r), and (x_m, y_m) are the locations of the left eye, right eye, and mouth, respectively. The state vector s of the face further includes the velocities of the facial features, (u_l, v_l), (u_r, v_r), and (u_m, v_m), i.e., s = (x_l, y_l, x_r, y_r, x_m, y_m, u_l, v_l, u_r, v_r, u_m, v_m)ᵀ. Next, from the simplified motion equations x_{t+1} = x_t + u_t·Δt and y_{t+1} = y_t + v_t·Δt, letting Δt = 1, we define the 12 × 12 state transition matrix A in block form as

A = [ I6  I6 ]
    [ 06  I6 ],

where I6 and 06 denote the 6 × 6 identity and zero matrices, respectively. Since the measurement vector z is a sub-vector of the state vector s, we then define the 6 × 12 measurement-state relating matrix H as the corresponding sub-matrix of A:

H = [ I6  06 ].

Since the matrices A and H stem from the simplified motion equations, errors are inevitably introduced into the predicted system state and measurement through these matrices. These errors, among others, are assumed to be compensated for by the system perturbation term w ~ N(0, Q) and the measurement noise term v ~ N(0, R). Empirically, we observed that the location and velocity errors are about four and two pixels, respectively. Accordingly, we define Q and R as the diagonal matrices

Q = diag(16, 16, 16, 16, 16, 16, 4, 4, 4, 4, 4, 4),   R = diag(4, 4, 4, 4, 4, 4).

Note that we fix these two matrices during the iterations because the same motion equations are assumed in each iteration.

We next determine the initial values of w_0, C_0, ŝ_0, and z_0. The system perturbations w_t (t ≥ 0) are random vectors generated by N(0, Q). The initial a posteriori error covariance matrix C_0 is updated over the iterations, so a precise C_0 is

not necessary. Empirically, a 10-pixel positional error and a 5-pixel speed error have been observed. Accordingly, C_0 is given by

C_0 = diag(100, 100, 100, 100, 100, 100, 25, 25, 25, 25, 25, 25).

To determine the initial face state vector ŝ_0 = (x_{l0}, y_{l0}, x_{r0}, y_{r0}, x_{m0}, y_{m0}, u_{l0}, v_{l0}, u_{r0}, v_{r0}, u_{m0}, v_{m0})ᵀ, recall that a face candidate is determined to be the actual face only when it is repeatedly detected in two successive images. Let ((x_l⁰, y_l⁰), (x_r⁰, y_r⁰), (x_m⁰, y_m⁰)) and ((x_l¹, y_l¹), (x_r¹, y_r¹), (x_m¹, y_m¹)) be the locations of the left eye, right eye, and mouth at times t_0 and t_1, respectively. Based on these values, the components of the initial state vector ŝ_0 are given by x_{i0} = (x_i⁰ + x_i¹)/2, y_{i0} = (y_i⁰ + y_i¹)/2, u_{i0} = x_i¹ − x_i⁰, and v_{i0} = y_i¹ − y_i⁰, where i = l, r, m. Having decided ŝ_0, we determine the initial measurement z_0 from z_0 = H ŝ_0 + v_0, where the v_t (t ≥ 0) are generated according to the distribution N(0, R).

Except for z_0, the measurement z_t at any subsequent iteration has to be assessed. Recall that the measurement vector z consists of the positions of the facial features, i.e., z = (x_l, y_l, x_r, y_r, x_m, y_m)ᵀ. Consider a facial feature whose location (x_{t−1}, y_{t−1})

and velocity (u_{t−1}, v_{t−1}) at time t−1 are known. To attain the location (x_t, y_t) of the feature at time t, we define a rectangular search area S in image I_t. Let (x_{ul}, y_{ul}) and (x_{lr}, y_{lr}) denote the coordinates of the upper-left and lower-right corners of S, respectively. The two corners are determined as

(x_{ul}, y_{ul}) = (x_{t−1} − lee/2 + lee·u_{t−1}/10,  y_{t−1} − lem/a + lem·v_{t−1}/10),
(x_{lr}, y_{lr}) = (x_{t−1} + lee/2 + lee·u_{t−1}/10,  y_{t−1} + lem/a + lem·v_{t−1}/10),

where lee is the length between the two eyes, lem is the length between the center of the two eyes and the mouth, and a is a positive constant (4 for the eyes and 3 for the mouth). Both lee and lem are calculated at the beginning of the system operation. Having determined the search area of a facial feature, we look for the feature within the area by matching edge magnitudes, so as to reduce the effect of illumination variation. The edge magnitude of the facial feature is computed within a window (lee/1.5 × lem/3.5 for the eyes and lee × lem/2.5 for the mouth) centered at (x_{t−1}, y_{t−1}) in image I_{t−1}, and the edge magnitude of the search area is computed in I_t. During the matching of edge magnitudes between the feature and the search area, the right three-fourths of the search area is examined for the left eye, the left three-fourths for the right eye, and the entire search area for the mouth. This three-fourths strategy for the eyes avoids the incomplete eye shapes that arise when the driver's head turns sharply (see the example shown in Figure 19).

Fig. 19. Incomplete shapes of eyes when the driver's head has a large turning.
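A compact sketch of one predict/update cycle with the block-structured matrices above is given below. The numeric feature positions are illustrative, and the random perturbation term w_t is absorbed into the process-noise covariance Q rather than added explicitly.

```python
import numpy as np

I6 = np.eye(6)
A = np.block([[I6, I6], [np.zeros((6, 6)), I6]])   # constant-velocity model, dt = 1
H = np.hstack([I6, np.zeros((6, 6))])              # measurement picks out the six positions
Q = np.diag([16.0] * 6 + [4.0] * 6)                # ~4 px position, ~2 px velocity noise
R = 4.0 * I6
C0 = np.diag([100.0] * 6 + [25.0] * 6)

def kalman_step(s, C, z):
    """One prediction/updating cycle for the 12-dim face state s given a 6-dim measurement z."""
    s_pred = A @ s                                   # prediction phase
    C_pred = A @ C @ A.T + Q
    K = C_pred @ H.T @ np.linalg.inv(H @ C_pred @ H.T + R)   # updating phase
    s_new = s_pred + K @ (z - H @ s_pred)
    C_new = (np.eye(12) - K @ H) @ C_pred
    return s_new, C_new

# Left eye, right eye, mouth positions plus their velocities (illustrative values).
s = np.array([100, 120, 160, 118, 130, 170, 1, 0, 1, 0, 1, 0], dtype=float)
C = C0
z = np.array([102, 120, 162, 119, 131, 171], dtype=float)    # features matched in the next frame
s, C = kalman_step(s, C, z)
print(np.round(s[:6], 1))
```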

3.5 Fuzzy Reasoning

Given the parameter values of the driver's facial expression, a fuzzy integral technique is employed to deduce the drowsiness level of the driver. In this section, the fuzzy integral technique is discussed first; fuzzy reasoning based on the technique is then addressed.

A. Fuzzy Integral

Fuzzy integrals [Zim91] have been generalized from the Lebesgue [Sug77] or the Riemann integral [Dub82]. In this study, the Sugeno fuzzy integral, extended from the Lebesgue integral, is considered. Let f : S → [0, 1] be a function defined on a finite set S and g : P(S) → [0, 1] be a set function defined over the power set of S. The function g(·), often referred to as a fuzzy measure function, satisfies the axioms of boundary conditions, monotonicity, and continuity [Wan92]. Sugeno further imposed on g(·) an additional property: for all A, B ⊂ S with A ∩ B = ∅,

g(A ∪ B) = g(A) + g(B) + λ·g(A)·g(B),   λ > −1.                          (2)

The fuzzy integral of f(·) with respect to g(·) is then defined as

e = ∫_S f(s) · g = sup_{α∈[0,1]} { α ∧ g(Aα) },                          (3)

The above fuzzy integral provides an elegant nonlinear numeric approach for integrating multiple sources of information, or evidence, to arrive at a value that indicates the degree of support for a particular hypothesis or decision. Suppose we have several hypotheses, $H = \{h_i,\ i = 1, \cdots, n\}$, from which a final decision $d$ is to be made. Let $e_{h_i}$ be the integral value evaluated for hypothesis $h_i$. We then determine the final decision by $d = \arg\max_{h_i \in H} e_{h_i}$.

Considering any hypothesis $h \in H$, let $S$ be the set collecting all the information sources at hand. Function $f(\cdot)$, given an information source $s$, returns a value $f(s)$ that reveals the level of support of $s$ for the hypothesis $h$. Since the degrees of worth of information sources may differ, function $g(\cdot)$ takes as input a subset of information sources and gives a value that reflects the degree of worth of that subset relative to the other sources. Let $d(s) = g(\{s\})$. Function $d(\cdot)$ is referred to as the density function of $g(\cdot)$. In general, the densities $d(s)$, $s \in S$, are readily estimated. For any subset $A = \{s_i,\ i = 1, \cdots, m\}$ of $S$, the fuzzy measure of $A$ can be computed recursively from Equation (2):

$$g(A) = \sum_{i=1}^{m} d(s_i) + \lambda \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} d(s_i)\, d(s_j) + \cdots + \lambda^{m-1} d(s_1) \cdots d(s_m) = \Bigl(\prod_{s_i \in A} \bigl(1 + \lambda\, d(s_i)\bigr) - 1\Bigr) \Big/ \lambda. \tag{4}$$
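The product form of Equation (4) is straightforward to implement. The sketch below computes $g(A)$ from the densities of the sources in $A$, and obtains $\lambda$ from the normalization $g(S) = 1$ discussed in the next paragraph using a simple bisection. The function names (fuzzyMeasure, solveLambda), the bracketing of the root, and the tolerance values are our own choices, not prescribed by the thesis.

```cpp
#include <cmath>
#include <vector>

// Fuzzy measure of a subset A, Equation (4) in product form.
// dA holds the densities d(s_i) of the sources belonging to A.
double fuzzyMeasure(const std::vector<double>& dA, double lambda) {
    if (std::fabs(lambda) < 1e-12) {              // lambda = 0: the measure is simply additive
        double sum = 0.0;
        for (double di : dA) sum += di;
        return sum;
    }
    double prod = 1.0;
    for (double di : dA) prod *= 1.0 + lambda * di;
    return (prod - 1.0) / lambda;
}

// Solve  lambda + 1 = prod_i (1 + lambda * d_i)  (i.e., g(S) = 1) for the unique
// non-zero root lambda > -1 by bisection on h(lambda) = prod(1 + lambda*d_i) - lambda - 1.
double solveLambda(const std::vector<double>& d) {
    auto h = [&d](double lam) {
        double prod = 1.0;
        for (double di : d) prod *= 1.0 + lam * di;
        return prod - lam - 1.0;
    };
    double sum = 0.0;
    for (double di : d) sum += di;
    if (std::fabs(sum - 1.0) < 1e-12) return 0.0; // densities already sum to 1: additive case
    // If the densities sum to more than 1 the root lies in (-1, 0); otherwise in (0, +inf).
    double lo = (sum > 1.0) ? -1.0 + 1e-9 : 1e-9;
    double hi = (sum > 1.0) ? -1e-9 : 1e6;
    for (int it = 0; it < 200; ++it) {
        double mid = 0.5 * (lo + hi);
        if ((h(lo) > 0.0) == (h(mid) > 0.0)) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For instance, the eight importance degrees listed in the next subsection sum to more than one, so the corresponding root lies in $(-1, 0)$.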

Since $g(S) = 1$, the value of $\lambda$ can be determined by solving

$$g(S) = \Bigl(\prod_{s_i \in S} \bigl(1 + \lambda\, d(s_i)\bigr) - 1\Bigr) \Big/ \lambda = 1, \qquad \text{or equivalently} \qquad \lambda + 1 = \prod_{s_i \in S} \bigl(1 + \lambda\, d(s_i)\bigr).$$

In Equation (3), $2^{|S|}$ subsets of $S$ are needed to perform the fuzzy integral. Let $S' = \{s_1', s_2', \cdots, s_{|S|}'\}$ be the sorted version of $S$ such that $f(s_1') \geq f(s_2') \geq \cdots \geq f(s_{|S|}')$. Equation (3) can then be rewritten as

$$e = \int_S f(s) \cdot g = \sup_{\alpha \in [0,1]} \bigl\{\alpha \wedge g(A_\alpha)\bigr\} = \bigvee_{1 \leq i \leq |S|} \bigl[f(s_i') \wedge g(S_i')\bigr], \tag{5}$$

where $\vee$ denotes the fuzzy union (maximum) and $S_i' = \{s_1', s_2', \cdots, s_i'\}$. This rewriting reduces the number of subsets required to perform the fuzzy integral from $2^{|S|}$ (by Equation (3)) to $|S|$.

B. Reasoning

Recall that eight facial parameters are considered for drowsiness analysis: percentage of eye closure over time, eye blinking frequency, eye closure duration, head orientation (tilt, pan, and rotation angles), mouth opening duration, and degree of gaze. Let $D = \{d_1, d_2, \cdots, d_8\}$ denote the relative degrees of importance of these parameters. Three criteria, namely worth, accuracy, and reliability, are involved in determining the importance degrees. The first criterion is somewhat intuitive, whereas the other two are derived from the experiments discussed in Section 4. Accordingly, we define $D$ as $D = \{0.93, 0.8, 0.85, 0.5, 0.3, 0.3, 0.5, 0.9\}$.

Let $V = \{v_1, v_2, \cdots, v_8\}$ be the measured values of the eight parameters, respectively. We transform $V$, according to the predefined transfer functions of the parameters, into $S = \{s_1, s_2, \cdots, s_8\}$, where $s_i$ indicates the degree of drowsiness corresponding to the parametric value $v_i$. The set $S$ forms what we call the collection of information sources.

Based on the sets $D$ and $S$, we determine, using the fuzzy integral method, the drowsiness level $l$ of the driver, $l \in H = \{m, m + 0.1, m + 0.2, \cdots, M\}$, where $H$ is the hypothesis set, in which $m$ and $M$ are determined as $m = \lfloor 10 \times \min_{s_i \in S} s_i \rfloor / 10$ and $M = \lceil 10 \times \max_{s_i \in S} s_i \rceil / 10$, with $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ denoting the floor and ceiling operators, respectively. To begin, we calculate the degree of worth, $g(S_i)$, of any subset $S_i \subseteq S$ of information sources by Equation (4). Afterwards, for each hypothesis $h_i \in H$, we perform the fuzzy integral process. First, we calculate the support value $f_i(s_j)$ of each information source $s_j \in S$ by $f_i(s_j) = 1 - |s_j - h_i|$. We next sort the information sources according to their support values. Let $S' = \{s_1', s_2', \cdots, s_8'\}$ be the sorted version of $S$ such that $f_i(s_1') \geq f_i(s_2') \geq \cdots \geq f_i(s_8')$. Substituting $f_i(s_i')$ and $g(S_i')$ into Equation (5), we obtain the fuzzy integral value $e_i$ of hypothesis $h_i$. The above process repeats for each hypothesis in $H$. Finally, the drowsiness level $l$ of the driver is determined as $l = h^* = \arg\max_{h_i \in H} e_i$.
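Putting the pieces together, the sketch below selects the drowsiness level by evaluating Equation (5) for every hypothesis in $H$ and returning the maximizer. The sorted integral grows $g(S_i')$ incrementally with the $\lambda$-rule of Equation (2); $\lambda$ is assumed to have been obtained beforehand (e.g., with the solveLambda helper sketched earlier). All names (sugenoIntegral, drowsinessLevel) are illustrative and not taken from the thesis code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Equation (5): Sugeno integral of the support values f with respect to the
// lambda-measure defined by the densities d.  f[i] and d[i] refer to source i.
double sugenoIntegral(const std::vector<double>& f,
                      const std::vector<double>& d, double lambda) {
    std::vector<std::size_t> idx(f.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    // Sort the sources so that f(s'_1) >= f(s'_2) >= ... >= f(s'_|S|).
    std::sort(idx.begin(), idx.end(),
              [&f](std::size_t a, std::size_t b) { return f[a] > f[b]; });
    double e = 0.0;  // integral value
    double g = 0.0;  // g(S'_i), grown by g(A U {s}) = g(A) + d(s) + lambda * g(A) * d(s)
    for (std::size_t k : idx) {
        g = g + d[k] + lambda * g * d[k];
        e = std::max(e, std::min(f[k], g));
    }
    return e;
}

// Pick the drowsiness level l in H = {m, m+0.1, ..., M} that maximizes the fuzzy
// integral.  s holds the per-parameter drowsiness degrees (the information sources)
// and d their importance degrees, e.g. D = {0.93, 0.8, 0.85, 0.5, 0.3, 0.3, 0.5, 0.9}.
double drowsinessLevel(const std::vector<double>& s,
                       const std::vector<double>& d, double lambda) {
    double mn = *std::min_element(s.begin(), s.end());
    double mx = *std::max_element(s.begin(), s.end());
    double m = std::floor(10.0 * mn) / 10.0;   // lower end of the hypothesis set
    double M = std::ceil(10.0 * mx) / 10.0;    // upper end of the hypothesis set
    double best = m, bestE = -1.0;
    for (double h = m; h <= M + 1e-9; h += 0.1) {
        std::vector<double> f(s.size());
        for (std::size_t j = 0; j < s.size(); ++j)
            f[j] = 1.0 - std::fabs(s[j] - h);  // support of source j for hypothesis h
        double e = sugenoIntegral(f, d, lambda);
        if (e > bestE) { bestE = e; best = h; }
    }
    return best;
}
```

The floating-point loop over $h$ mirrors the hypothesis set defined above; stepping an integer index and dividing by ten would avoid accumulation error in a production implementation.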

Chapter 4. Experimental Results

The proposed driver's drowsiness detection system has been developed in the Borland C++ programming language and runs on an Intel Solo T1300 1.66 GHz PC under Windows XP Professional. The input video sequence is captured at a rate of 30 frames per second, and the size of the video images is 320 × 240 pixels. We divide our experiments into two parts. The first part investigates the efficiency and accuracy of the individual steps of the system process; its results provide clues for assigning the degrees of importance to the facial parameters used in the fuzzy reasoning step. The second part exhibits the performance of the entire system.

4.1. Individual Steps

Recall that five major steps are involved in the system workflow: preprocessing, facial feature extraction, face tracking, parameter estimation, and reasoning. Of these, the facial feature extraction and face tracking steps dominate the processing speed of the system, whereas the parameter estimation and reasoning steps determine its accuracy.

4.1.1. Facial Feature Extraction and Face Tracking

Figure 20 shows the result of facial feature extraction and face tracking over a video sequence. At the beginning, the system repeatedly locates the facial features (the two eyes and the mouth) of the driver in two successive images.

Thereafter, the system initiates the face tracking module. The tracking module continues until it fails to detect the right eye of the driver in frame 198 because of a rapid turn of the driver's head. The system immediately invokes the facial feature extraction module again. After successfully locating all the facial features in two successive images (i.e., frames 199 and 200), the face tracking module takes over again and tracks the features over the subsequent images. The current facial feature extraction module takes about 1/8 second to detect the facial features in an image, whereas the face tracking module takes about 1/25 second to locate them.

[Figure 20 thumbnails: selected frames of the test sequence, including frames 1–5, 178–187, 195–204, 430–439, 461–464, and 996 onward.]

Fig. 20. Facial feature extraction and face tracking over a video sequence.

Figure 21 shows the robustness of the facial feature extraction module under different illumination conditions (e.g., shiny light, sunny day, cloudy day, twilight, underground passage, and tunnel), head orientations, facial expressions, and the wearing of glasses. This robustness is primarily due to the use of a face model, which helps to find the remaining features once one or two facial features have been detected. However, there are always uncertainties during facial feature extraction; we therefore confirm a result only when it is repeatedly obtained in two successive images.

[Figure 21, partial: (a) shiny lights, (b) sunny days, (c) cloudy days.]
