Smart Lecture Recording System


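National Taiwan Normal University
Graduate Institute of Computer Science and Information Engineering
Doctoral Dissertation

Smart Lecture Recording System (智慧型演講錄製系統)

Graduate student: 羅安鈞
Advisor: Dr. 陳世旺

August 2017 (ROC year 106)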

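Abstract (Chinese version, translated)

In recent years, the development of e-learning (or distance education) has offered learners equal opportunities, from highly developed metropolises to remote, less-developed countries. Lecture recording systems play a vital role in collecting content for e-learning. However, as e-learning flourishes, the shortage of digital content and of professional recording crews is becoming a major problem. This research proposes a smart lecture recording system that can automatically record content at the same level of quality as a human team while easing the shortage of recording personnel.

The proposed smart lecture recording system consists of three main components: the virtual cameraman, the virtual director, and virtual-real match moving. The first two components run online, whereas virtual-real match moving is an offline post-production component. The virtual cameraman component is further divided into three subsystems: the speaker cameraman, the audience cameraman, and the hall cameraman. All of these subsystems operate automatically, with functions that include target selection, tracking, and special-event detection. The videos captured by the three subsystems are all transmitted to the virtual director system, which selects the most representative shot for recording or live broadcast. We call this function of the virtual director shot selection. Shot selection analyzes the content of the videos from the virtual cameramen and decides which shot to select through a machine-learning process characterized by a counterpropagation neural network. In addition, the virtual director has another key function, visual instruction, through which it imitates the communication between a human director and human cameramen in the real world.

After a live lecture has been recorded, additional content or material is sometimes added to the recording to increase its expressiveness and watchability. This research therefore also developed a post-production component, the virtual-real match moving system, for compositing real footage with virtual objects. The system uses a depth camera as a depth-sensing device to help align the real-world color camera with the virtual-world camera. The virtual-real match moving system has three main processes: temporal depth fusion, camera tracking, and virtual-real synthesis preview. Depth images acquired by the depth camera are fused by temporal depth fusion into a 3D construction of the scene. The trajectory of the color camera is then derived from the relationship between the 3D scene structure and the depth camera. This trajectory guides the virtual camera to move in synchrony with the real camera and so complete the alignment. Virtual objects are projected to generate virtual images, which are superimposed on the real images acquired by the color camera; the resulting images are called virtual-real preview images.

A series of real lecture recording experiments was conducted, and the experimental data show that the proposed system can approximate the filming and shot selection techniques of a real human team. We also believe that the system need not be limited to lecture recording; with appropriate training data, it could also be suitable for recording stage performances, concerts, sports competitions, and product launches.

Keywords: smart lecture recording system, virtual cameraman, virtual director, virtual-real match moving, shot selection, visual instruction, virtual-real preview.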

Abstract

Nowadays, e-learning (or distance learning) provides equal opportunities for learners in locations ranging from highly developed metropolises to remote, less-developed countries. Lecture recording systems play a vital role in collecting spoken discourse for e-learning. However, in view of the growing development of e-learning, the lack of content is becoming a problem. This research presents a smart lecture recording (SLR) system that can record lectures at the same level of quality as a human team, but with a reduced degree of human involvement.

The proposed SLR system is composed of three principal components: the virtual cameraman (VC), the virtual director (VD), and virtual-real match moving (VRMM). The first two components, VC and VD, are online components, whereas the VRMM component is offline. The VC component is further divided into three subsystems: speaker cameraman (SC), audience cameraman (AC), and hall cameraman (HC). All these subsystems are automatic and can take actions that include target and event detection, tracking, and view searching. The videos taken by these three subsystems are all forwarded to the VD system, in which the representative shot is chosen for recording or direct broadcasting. We refer to this function of the VD system as shot selection. The shot selection function operates on a content analysis of the videos transmitted from the VC component. The capability of content analysis is pre-trained through a machine-learning process characterized by the counterpropagation neural network. In addition, the VD system possesses another pivotal function, visual instruction, through which it imitates the communication between a human director and human cameramen in the real world.

After a live speech has been recorded, it is often necessary to include additional content or materials in the shot collection of the speech in order to increase its expressivity and vitality. In this context, we develop a post-production component called the virtual-real match moving (VRMM) system for graphic/stereoscopic image composition. The input data to this system are provided by equipment consisting of a color camera and a depth camera. The VRMM system involves three major processes: temporal depth fusion, camera tracking, and virtual-real synthesis preview. During temporal depth fusion, the depth images acquired by the depth camera are fused into a 3D construction of the scene. Based on the constructed scene, the pose of the color camera is determined; this pose is then used to direct a virtual camera to generate synthetic images of a given 3D object model. The generated images are superimposed upon the real images acquired by the color camera. The resultant images are called preview images.

A series of experiments on real lectures was conducted. The results showed that the proposed SLR system can provide lecture recordings that are close, to some extent, to those taken by real human teams. We believe that the proposed system need not be limited to live speeches; if it is configured with appropriate training materials, it may also be suitable for recording stage performances, concerts, athletic competitions, and product launches.

Keywords: Smart lecture recording system, Virtual cameraman, Virtual director, Virtual-real match moving, Shot selection, Visual instruction, Preview images.

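Acknowledgments

During my doctoral studies I received help from a great many people, and I apologize that I cannot thank each of them in person. Although the doctoral program took a long time, I am now very glad that I resolved to enroll.

I sincerely thank my advisor, Professor 陳世旺. His tireless love of learning influenced me deeply and reminded me that only by constantly thinking and absorbing new knowledge can one break through old ways of thinking and learn to solve problems independently. Professor 陳 was also considerate of my status as a working student: he repeatedly rescheduled our meetings around my job and even discussed with me the R&D problems I met at work. The knowledge and skills I gained in the doctoral program also helped my promotion at work. What he taught went beyond academics. Outside our meetings he often taught me how to conduct myself and deal with others, so that in pursuing knowledge I would not forget the importance of human relationships. Whenever I suffered setbacks in work or life, he always offered the warmest comfort in time, which made the laboratory another home and another safe harbor for me.

I especially thank Professor 方瓊瑤, who tirelessly pointed out the matters we needed to attend to in our research, and who helped and reminded us in arranging the related work, documents, and university regulations for the proposal and dissertation oral examinations. I also thank the department chair, Professor 陳柏琳, who offered many practical suggestions and methods for our speaker recording system experiments. I thank Professor 傅楸善 of National Taiwan University, Professor 吳炳飛 of National Chiao Tung University, and Professor 陳朝欽 of National Tsing Hua University for their valuable comments and guidance on the dissertation and research methods, which also showed me many ways in which the lecture recording system could be improved.

I thank all members of the IPCV laboratory. My seniors 鍾允中, 王俊民, and 張祥利 guided me for many years, joined discussions of the difficulties I faced in research, and from time to time offered solutions or techniques, so that my research could proceed smoothly and be completed. I thank my juniors 彥佑, 俊億, and 佳儒 for their dedication and effort in implementing the speaker recording system, and 軒嘉, 淳雅, and 宇珊 for their great help and support. I also thank all the junior members of the CVIU laboratory, such as 雯琳, 靖允, 銘仁, and 家安, whose hard work during my lecture recording experiments and oral examination moved me deeply.

Special thanks go to my dear friend Dr. 陳柏綱. I will never forget the pact that the two of us, both at a loss seven years ago, made to encourage each other to become "Dr. 羅" and "Dr. 陳". Now we have both graduated with doctoral degrees and fulfilled that pact. Without you I would not have started; without you I could not have persevered until now. You are my best friend and my benefactor.

I also thank my family and my wife 映均 for their continued understanding; please forgive me for not spending enough time with you because I was busy with work and study. Finally, I am very grateful to everyone for the encouragement and praise, as well as the criticism and advice, and above all for the tolerance you have shown me. I hope that every teacher, colleague, and friend will continue to support one another and that the road ahead will be smooth for all of you. Thank you all.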

Table of Contents

List of Figures
List of Tables
Chapter 1 Introduction
    1.1 Motivation
    1.2 Literature Review
    1.3 Organization of this Thesis
Chapter 2 Mathematical Fundamentals
    2.1 Gaussian Mixture Model
    2.2 Finite Automata Theory
    2.3 Spatiotemporal Attention Neural Network
    2.4 Multiple Kernel Learning
    2.5 Counter Propagation Neural Network
Chapter 3 Smart Lecture Recording System
    3.1 Lecture Halls under Consideration
    3.2 System Architecture
    3.3 System Workflow
Chapter 4 Virtual Cameramen
    4.1 Speaker Cameraman
    4.2 Audience Cameraman
    4.3 Hall Cameraman
Chapter 5 Virtual Director
    5.1 Content Analysis
    5.2 Shot Selection
    5.3 Visual Instruction
Chapter 6 Virtual-Real Match Moving
    6.1 Workflow of the VRMM System
    6.2 Temporal Depth Fusion
    6.3 Camera Tracking
    6.4 Preview Synthesis
Chapter 7 Experimental Results
    7.1 Virtual Cameraman Subsystem
    7.2 Virtual Director Subsystem
    7.3 Real-Time Camera Match-Moving Method
Chapter 8 Conclusions and Future Work
Appendix
References

List of Figures

Figure 1.1. Watching video on a smartphone
Figure 1.2. Product presentation for Apple smartphones
Figure 1.3. Workflow of program recording
Figure 1.4. Camera tracking technology for VRMM
Figure 2.1. Example of a GMM (a) three individual Gaussian probability density functions (b) the GMM of the three Gaussian probability density functions
Figure 2.2. NFA transition diagram with empty transition, multiple input transition, ambiguous transition, and missing transition
Figure 2.3. DFA transition diagram
Figure 2.4. STA network
Figure 2.5. Activation of an attention neuron in response to stimuli
Figure 2.6. Flowchart of STA image acquisition
Figure 2.7. Mapping to a new space
Figure 2.8. Architecture of a fully connected CPN
Figure 2.9. Architecture of forward-mapping CPN applied in VD subsystem
Figure 2.10. Kohonen layer (left) and Grossberg layer (right)
Figure 2.11. Adding a new node to the CPN
Figure 3.1. The organization of the SLR system
Figure 3.2. The kinds of lecture halls considered in this study (a) tiered (b) level (c) ambient auditoriums
Figure 3.3. A deployment of hardware devices of the SLR system in a lecture room
Figure 3.4. The configuration of the SC component (a) the picture of SC component (b) the red points mark the lens positions of the depth camera and the PTZ camera, respectively
Figure 3.5. The Kinect perceives an object, i.e., an object is present in the image plane of the depth camera of the Kinect
Figure 3.6. The horizontal pan angle $\phi_{hor}$ of the PTZ camera
Figure 3.7. The vertical tilt angle $\phi_{hor}$ of the PTZ camera
Figure 3.8. The configuration of the AC component
Figure 3.9. Kinect (from: Google pictures)
Figure 3.10. Kinect virtual skeleton (from: Primesense)
Figure 3.11. Information from a Kinect sensor (a) color image (b) depth image
Figure 3.12. PTZ cameras
Figure 3.13. A workflow of the SLR system
Figure 3.14. A flowchart of the VD subsystem

Figure 3.15. Videos of the VC subsystem (a) SC (b) AC (c) HC
Figure 3.16. An output of the VD subsystem
Figure 3.17. User interface of the manual control
Figure 3.18. Manual control of a PTZ camera
Figure 3.19. FSM of the VD subsystem
Figure 3.20. DFA of the VD subsystem
Figure 4.1. Flowchart of SC subsystem
Figure 4.2. Haar-like features
Figure 4.3. Multilayer classification. The blue rectangles represent the weak classifiers
Figure 4.4. Face detection results (a) outdoor (b) indoor
Figure 4.5. Distribution of the hue histogram
Figure 4.6. Detected face areas (a) the area detected by AdaBoost (b) the area detected by the mean-shift tracking algorithm
Figure 4.7. Flowchart of posture database construction
Figure 4.8. Three types of illustrating posture. The rightmost type involves two pointing hands
Figure 4.9. Seven types of pointing postures
Figure 4.10. Different virtual skeletons of different users
Figure 4.11. Gaussian probability density functions for two joints
Figure 4.12. Flowchart of posture recognition
Figure 4.13. Posture detected by GMM
Figure 4.14. Postures of pointing
Figure 4.15. Postures of illustrating
Figure 4.16. Combination of illustrating postures
Figure 4.17. Result of hand posture recognition
Figure 4.18. Illustrating recognition. The PTZ camera continues shooting the speaker
Figure 4.19. Pointing recognition. The PTZ camera moves and shoots the area in which the speaker is pointing
Figure 4.20. Flowchart of audience cameraman
Figure 4.21. ROI detection (a) motion feature map (b) motion feature density map (c) ROI candidate
Figure 4.22. FAST corner detection [75]
Figure 4.23. FAST corner detection result
Figure 4.24. Search window of proposed optical-flow method
Figure 4.25. ROI selection (a) ROI detection result (b) the shot after camera steering (c) the attention map for selection

Figure 4.26. Face detection process (a) ROI detected result (b) the face results after camera steering (c) the face of salient object
Figure 4.27. Pointing detection (a) pointing detected (b) change to a close-up shot
Figure 4.28. Posture of relaxing
Figure 4.29. Result of relaxing
Figure 5.1. Procedures for VD shot selection and visual instruction
Figure 5.2. Flowchart of content analysis
Figure 5.3. Rule of thirds
Figure 5.4. Visual balance
Figure 5.5. Comparison between different salient object sizes
Figure 5.6. Flowchart of salient object detection
Figure 5.7. Examples of images that attract human attention [23]
Figure 5.8. Flowchart of multiple-scale contrast detection
Figure 5.9. Results of static salient object analysis (a) the original image with a blurred object (background) and a clear object (main object) (b) the static saliency map of (a)
Figure 5.10. Dynamic salient object analysis of a camera that shakes once (left column) and a camera that shakes repetitively (right column)
Figure 5.11. The saliency map (a) input image (b) static saliency map (c) dynamic saliency map (d) combined saliency map
Figure 5.12. Salient object analysis in real lecture image sequences
Figure 5.13. Attention map sampling (a) attention map (b) sample points distribution
Figure 5.14. Attention points and aesthetic rules
Figure 5.15. The principle of aesthetic scoring (a) schematic of rule of thirds (b) schematic of visual balance
Figure 5.16. Zoom-in example. The input sequence is shown at the top
Figure 5.17. Move and hold example. The input sequence is shown at the top
Figure 5.18. Scenery shots with different lengths
Figure 5.19. Saturation of an image
Figure 5.20. Exposure example of an image
Figure 5.21. Three examples of sharpness in images
Figure 5.22. Detection of sharpness
Figure 5.23. Gradient map (left) horizontal direction (right) vertical direction
Figure 5.24. Detail map (left) horizontal direction (right) vertical direction
Figure 5.25. ROF map (a) blurred background (b) clear background
Figure 5.26. Saliency map and ROF map

Figure 5.27. Three definitions of score goodness (a) first type (b) second type (c) third type
Figure 5.28. MKL implementation
Figure 5.29. Score definition after kernel transformation
Figure 5.30. Kernel matrix (left) rule of thirds (right) visual balance
Figure 5.31. Kernel matrix (left) sharpness (right) exposure
Figure 5.32. Kernel matrix (left) saturation (right) illuminance continuity
Figure 5.33. Kernel matrix (left) color continuity (right) scenery continuity
Figure 5.34. Kernel matrix of camera motion
Figure 5.35. Training model of CPN
Figure 5.36. Flowchart of visual instruction
Figure 5.37. Blocks of 10 pixels × 10 pixels
Figure 5.38. Movement area detection (a) original image (b) attention map (c) blocks after merging
Figure 6.1. The depth camera is mounted on a color camera
Figure 6.2. A flowchart of VRMM method
Figure 6.3. Steps of the temporal depth fusion process
Figure 6.4. Example of 3D updating
Figure 6.5. Locating dynamic regions using the STA neural network (a) the input depth image (b) the located dynamic region (c) the 3D construction of the scene with a moving hand (d) the 3D construction without the hand
Figure 6.6. Human detection based on human skeleton model (a) a located human in the input depth image (b) the corresponding color image (c) the 3D construction of the scene containing the human (d) the 3D construction of the scene without the human
Figure 6.7. The configuration of the sensing device
Figure 6.8. Image composite results of a real object and a virtual object (a) the image of a real object (b) a virtual camera (indicated by a small green spot) looking at a virtual object standing in front of the image of the real object (c) the virtual image generated by the virtual camera (d) a tracking result of the preview system when the color camera pans (e) a match between the real and the virtual objects (f) another match after camera motion
Figure 6.9. The outputs of the preview system
Figure 7.1. User interface of SC system
Figure 7.2. PTZ camera shoots the speaker continuously
Figure 7.3. Speaker uses a laser pen while his left hand is illustrating
Figure 7.4. PTZ camera shoots the whole screen area
Figure 7.5. Speaker uses his right hand to point to an area
Figure 7.6. PTZ camera shoots the area in which the speaker is pointing

Figure 7.7. Speaker waves a baton to indicate an area and his right hand starts pointing
Figure 7.8. PTZ camera shoots the area in which the baton is waving
Figure 7.9. Example results of AC (a) AC performed a long shot when the whole audience remained calm (b) AC zoomed in on a single audience member who was raising a hand
Figure 7.10. Photograph of manual shot selection
Figure 7.11. Screenshot of manual shot selection
Figure 7.12. The SC, AC, HC shots of the lectures in different types of hall
Figure 7.13. Similarity comparison of different video clips
Figure 7.14. Improvement trend chart
Figure 7.15. Comparison of videos with no special event
Figure 7.16. Comparison of videos of an audience member asking questions
Figure 7.17. Comparison between LC, CPN and MK-CPN for an interaction between speaker and audience
Figure 7.18. User survey histogram between random and MK-CPN methods
Figure 7.19. Female part of the user survey histogram between random and MK-CPN methods
Figure 7.20. Male part of the user survey histogram between random and MK-CPN methods
Figure 7.21. Angular measurement tools (a) electric rotary plate (b) electric angle meter
Figure 7.22. Displacement measurement tools (a) dolly rail (b) laser range finder
Figure 8.1. Concept of active learning
Figure A.1. Result of screen detection (red/green rectangles)
Figure A.2. HSV color space (from Wikipedia)
Figure A.3. Candidates of projection screen (including light and windows)
Figure A.4. Detected projection screen (green rectangle)
Figure A.5. Laser spot detection. The PTZ camera moves and shoots the whole screen area
Figure A.6. Baton detection. The PTZ camera moves and shoots the area in which the baton is waving

List of Tables

Table 3.1. Speaker state table
Table 3.2. Audience state table
Table 3.3. Condition–state table
Table 3.4. Output state table
Table 3.5. New state table after DFA
Table 5.1. Evaluation rules of content analysis
Table 5.2. Definition of the movement
Table 7.1. Results for overall SC/AC accuracy
Table 7.2. Similarities between manual selection and VD with different shot selection modules
Table 7.3. Shot decision count analysis
Table 7.4. Survey results for individual questions
Table 7.5. Angular and Linear Displacement Error Measurement
Table A.1. Event and camera-control table
Table A.2. Visual instruction list (speaker posture: pointing)
Table A.3. Visual instruction list (speaker posture: illustrating)
Table A.4. Visual instruction list (speaker posture: relaxing)

Chapter 1 Introduction

1.1 Motivation

In recent years, fiber-optic networks have become well developed, and wireless Internet access points have been deployed widely. Users are no longer required to sit in front of a desktop computer and access content through a wired network; they can use a handheld device (Figure 1.1) or a laptop wherever LTE or Wi-Fi signals are available, such as convenience stores, cafes, and mass transit stations.

Figure 1.1. Watching video on a smartphone.

Therefore, the market for network-based multimedia digital content (especially e-learning) is growing. With the growth of e-learning systems, more and more people hope to record their presentations or speeches, either to preserve them or to upload them to the Internet and share them with people around the world. Taiwan has developed e-learning for nearly a decade, and the Ministry of Economic Affairs allocated 200 million NTD to further develop e-campus technologies over the last two years. Moreover, Delta Electronics and Wistron Electronics made a total R&D investment of approximately 600 million NTD. Before the end of 2016, the value of the e-learning industry is expected to reach 84.6 billion NTD, and Southeast Asia is expected to be the first major overseas market that the industry enters. There seems to be universal agreement that the worldwide e-learning market will show rapid and dramatic growth over the next five years. The worldwide market for self-paced e-learning reached $35.6 billion in 2011. The five-year compound annual growth rate is estimated at approximately 7.6%, so revenues are expected to reach some $51.5 billion by 2016. Self-paced learning is education in which each learner studies at his or her own pace, without a fixed starting date or regularly scheduled assignment completion dates in common with other students enrolled in the same program. However, a self-paced learning course may have a fixed overall completion timeframe.

In view of the growing development of e-learning, the lack of content is becoming a problem. Companies, schools, and other organizations often hold a wide variety of presentations, such as product presentations (Figure 1.2) and academic seminars. These presentations are vital assets that are worth recording for online audiences or for future audiences who will want to review archived videos. Recording these lectures usually requires a professional filming team that can perform filming and shot selection.

Figure 1.2. Product presentation for Apple smartphones.

When recording a lecture, the filming team has three main jobs. The first job is to survey the venue before the speech begins, so that the equipment can be arranged effectively. The second job is filming the speech; during the presentation, each cameraman controls a camera and films various shots in different situations. The third job requires that the live video shots from the different cameramen be transmitted to a control room, where the director selects the most representative shots from those videos. The main responsibilities of the director are shot selection and visual instruction. Whether a presentation can be recorded successfully depends on whether the director has sufficient experience and visual storytelling ability. Finally, after the speech ends, the director splices the selected shots together, performs the post-production process, and uploads the video to a server, from which the audience can download it. Although speeches recorded and uploaded to servers are highly convenient for large audiences, the cost is very high.

1.2 Literature Review

A great number of people have used e-learning and mobile learning services, such as open courseware (OCW) and lecture websites. When recording a lecture or a course, at least two cameramen and one director are needed. Often, to make the program's content more comprehensive, multiple cameras shoot video from different viewpoints, and all video signals are sent to a video mixer (or switcher) in a control room. The director chooses the most suitable shot from the mixer to broadcast. We call this work "shot selection." The workflow of program recording is shown in Figure 1.3.

Figure 1.3. Workflow of program recording.

Considering the cost of the traditional system, Rowe [1] classified the cost into two major parts: fixed costs, such as computers, cameras, and microphones, and unfixed costs, such as cameramen, directors, and editors. Generally, the fixed costs must be paid once, but the unfixed costs must be paid for every production. The authors of [2] mentioned that their teams cost more than US$500 for each Microsoft Corporation lecture. Consequently, the authors of [3], [4], and [5] have discussed the reduction of unfixed costs.

To reduce the costs of recording a lecture or a course, some automatic recording systems have been proposed. Recording lectures automatically has become popular in recent years, but most of the systems on the market still use static cameras without automatic camera control. Cruz [6] published the earliest proposal for an automatic lecture recording system; because the author used only one camera to shoot the scene, the output looked tedious. Bianchi [4] addressed this drawback by shooting a lecture with several cameras that were able to detect and track the speaker automatically. In [7] and [8], the authors used a Microsoft library to construct iCam systems, and the authors of [9] discussed how to reduce the unfixed costs of recording and broadcasting speeches and presentations.

The virtual cameraman (VC) is based on the idea of automatically detecting the positions and postures of key speakers and shooting adequate views of those speakers. Onishi [10] not only detected and tracked the speaker but also recognized the actions of the speaker. For posture recognition, the authors of [11] extracted user images from various input images and used those images to train a neural fuzzy network to recognize the corresponding postures. The authors of [12] had users wear magnetic sensors that located the users' hands, shoulders, and abdomens, and then recognized the users' behaviors from the location information transmitted by those sensors. The researchers involved in [13] and [14] used Kinect, a range sensor developed by Microsoft, to obtain users' 3-D skeleton information and recognized their actions using posture matching and an SVM classifier, respectively.
Huang [15] tried to track the human head and arms using a single camera in cluttered environments. Earlier publications reported the construction of a VC system [16-22]. In another article, human faces were tracked using a mean-shift algorithm [23]. However, no prior research considered using content analysis to perform automatic shot selection among the video shots from different cameramen.

In 1969, Paul Ekman and Wallace V. Friesen published a paper on the psychology of nonverbal behavior [24] indicating that gesture plays a vital role in nonverbal communication. In addition, they classified nonverbal behaviors into several categories, each of which represented a predefined meaning. However, the hand gestures of one speaker differ from those of other speakers, so it is impossible to require the VC to react to every gesture. Depending on the specific needs of a use case, the meanings of hand gestures may need to be discussed and classified into specific categories. Once gestures have been classified for a predefined scenario, the system can control the camera action appropriately.

Shot selection plays an essential role in the success of a program that uses multiple cameras. A video mixer is a platform that allows a director to transmit pictures, with or without additional functions such as special effects and titles. The job of the director is to convey the information of the speech or program faithfully to the audience. Shot selection is a key task that demands experience and skill; at each moment, the director must process multiple live video feeds and choose the most suitable shot from all of them. If the most suitable shot is already being shown, the director must maintain the current feed; if it is not, the director must cut to the feed with the most suitable shot. The director selects the most suitable shot (called the representative shot) of all received shots by looking for clues that may interest the viewer and then transmits the representative shot, either to an on-air broadcast or to a recording medium.
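The hold-or-cut rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-feed suitability scores, the function name `select_shot`, and the optional `margin` parameter (a small tolerance that discourages cutting for negligible gains) are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def select_shot(scores: Sequence[float], current: int, margin: float = 0.0) -> int:
    """Return the index of the feed to show next.

    scores  -- hypothetical suitability score of each live feed (higher is better)
    current -- index of the feed that is being shown now
    margin  -- assumed tolerance: hold the current feed unless another feed
               beats it by more than this amount
    """
    # Index of the most suitable feed at this moment.
    best = max(range(len(scores)), key=lambda i: scores[i])
    # Hold: the current feed is already (close enough to) the most suitable one.
    if scores[current] + margin >= scores[best]:
        return current
    # Cut: switch to the feed with the most suitable shot.
    return best


# Example with three feeds (for instance SC, AC, HC) while feed 0 is on air.
next_feed = select_shot([0.2, 0.9, 0.4], current=0)
```

With `margin=0.0` the function reproduces the rule exactly as stated in the text; a positive margin is one possible way to avoid rapid back-and-forth cuts between feeds with nearly equal scores.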
To make a satisfactory decision from multiple inputs, an expert director should have served as a cameraman, a video editor, and a technical adviser. An expert director with such manifold experience has the ability to analyze content and select shots that conform to photographic aesthetics and group psychology.

In automatic shot selection, Gleicher [25] used virtual videography to edit videos, whereas Okuni cut a video and extracted meaningful shots to make a new one [26]. Kumano [27] analyzed camera motion in terms of video grammar and combined multiple pieces of footage into a more complete video. Wang [28] used a genetic algorithm to analyze the structure of the footage and the use of photography, and combined separate but relevant video selections into a complete movie. Machnicki [29] integrated diverse directorial functions into an automatic system and called it a "virtual director" (VD). The aforementioned studies worked offline; thus, they could not select shots in real time. Furthermore, those studies did not implement any communication between VC and VD to emulate the communication between real cameramen and directors.

Before the VD performs automatic shot selection, the system must perform content analysis, gather information, and put the video into context for the VD. In content analysis, the first step toward automatic photographic analysis is to extract salient objects. Goferman [30] analyzed high-contrast phenomena at the edges of areas within images; this simulation of human vision, applied to a static image, detected colors, faces, and other information. Fang [31] discussed spatiotemporal attention applied to images by considering human visual stimuli that simulate information. The authors of [32] attempted to detect the movements and gestures of a speaker and proposed a method of automatic image analysis to describe the behavior of the speaker, but they did not discuss automatic shot selection between multiple cameras. The authors of [4, 29] studied the camera skills required for a photographer.
By camera manipulation, they improved the appearance of some films regarding liveliness and smoothness, but they did not perform content analysis or learn shooting rules from real photographers 6.

automatically. In the present research, to increase the attractiveness of the final video output, the proposed system must consider aesthetics and the rules of photography. An ideal system would apply a machine-learning algorithm [33, 34], such as fuzzy control [35], an artificial neural network [36, 37], or deep learning [38-43], to learn rules from professionals. Deep learning, especially the convolutional neural network (CNN) [44-50], has been applied to fields including image classification [51, 52], computer vision [53, 54], speech recognition [55, 56], natural language processing [57, 58], and autonomous driving [59-61], where it has produced results comparable to, and in some cases superior to, those of human experts. However, a limitation of the CNN is that it learns local features, whereas content analysis usually requires global and temporal information. The automation of shot selection can be seen as a decision-making process that takes various shots and information as inputs and returns a single suitable shot as output. The learning process can involve learning shot selection skills from real directors. Hecht-Nielsen [62] introduced the counter propagation network (CPN) as a type of supervised machine-learning technique. If the training data are representative of the input data, the CPN classifies the input data quickly in the testing stage. However, the CPN yields poor results when the input data are heterogeneous. Multiple kernel learning (MKL) [63, 64] is a machine-learning method that combines predefined kernels for the individual data sources by selecting an optimal kernel and parameters. MKL classifies heterogeneous data better than learning methods without kernel transforms, especially for data with different dimensions and different ranges. However, the MKL training process takes a long time because it runs a complex optimization algorithm.
We exploit the fast convergence of CPN learning to simplify the complex

optimization process of MKL and produce an approximate result. At the same time, the kernel transformation of MKL improves the classification accuracy of the CPN. Ideally, the VC and VD subsystems should communicate in both directions. Therefore, the proposed VD subsystem should actively give visual instructions or advice to the VC at appropriate times instead of passively receiving video shots and signals. The VC that shoots the speaker should pass all relevant information, including the position and hand gestures of the speaker, to the VD subsystem. For the VC subsystems taking the audience view and the hall view, the shooting target may not be a specific individual. Therefore, the audience-view and hall-view VCs may focus on the detection of crowd motion. We follow [65] to establish entropy models that quantify crowd motion and locate abnormal behavior in crowd scenes. Because shot selection must run in real time, it is a challenge to design a VD that automatically evaluates the quality of views and immediately makes a consistent decision that results in a steady video. Therefore, an efficient and robust VD system must be implemented. Liu [66] proposed a finite state machine (FSM) that imitated a real director to select suitable shots. Liu’s system operated by state transitions, and all states were carefully defined so that no exceptional inputs or undefined states could occur. To avoid deadlocks or empty shots in the proposed smart lecture recording (SLR) system, we also apply an FSM to our VD subsystem. The proposed SLR system consists of three VC subsystems and a VD subsystem; it composes the three views from the three VC subsystems into one view in our interface. The VC not only tracks speakers on stage automatically but also performs posture recognition from depth images (or range images) of speakers.
In addition, this study considers a number of possible scenarios and presents a system that automatically develops a set of reasonable rules for camera work. After a VC subsystem integrates all relevant speaker information, that VC calculates the optimal shot

according to the rules of photographic opportunities, operates a pan, tilt, and zoom (PTZ) camera to take the optimal shot, and sends messages to inform the VD. After receiving the messages, the VD chooses the single best shot by the established rules and presents that shot to the audience. In addition, speakers using laser pens or batons can also control the PTZ camera to shoot specific content, increasing the diversity of screen displays and interactions. Given multiple videos from the VCs, an advanced VD must automatically analyze all available information and choose the best shot by considering photographic aesthetics, optics, scenery continuity, and action continuity. In the shot selection stage, the CPN executes the decision module for shot selection. The training data of the CPN are composed of decisions from a real director. Therefore, the VD subsystem can learn shot selection rules from that director. However, the CPN tends to yield poor results if the input data are heterogeneous. Therefore, we use MKL to transform all heterogeneous data from the different content analysis methods into the same dimensions and ranges before those data are sent to the CPN; this increases the accuracy of shot selection. If the video signal is not broadcast live, then postproduction processes such as video editing, virtual-real synthesis, and subtitling can be executed. Related research on automatic video editing has been published in recent years. For example, Gleicher [25] proposed virtual videography, and Liu [66] tried to re-edit multiple videos into meaningful video clips. Other studies, such as those of Okuni [26] and Kumano [27], investigated similar topics. Even though these studies were able to find meaningful video clips and combine those clips into new videos, they are exercises in postproduction, not real-time editing.
Visual effects (VFX) involving virtual-real match-moving (VRMM) are commonly applied in contemporary media, particularly in public entertainment, such as movies and commercials. To achieve perfect virtual-real synthesis, all relevant

parameters and the moving trajectory of the camera must be recorded. A camera for capturing real objects in the real scene must be registered, and its parameters must exactly match those of the virtual camera in the virtual scene, to prevent spatial disorientation in the composite result, which shows a real object in a virtual scene (see Fig. 1.4). Conventionally, the tedious match-moving [67] operations for registering and representing real objects in virtual scenes are performed in postproduction. This work, especially the postproduction of stereo film, is highly manpower-intensive.

Figure 1.4. Camera tracking technology for VRMM.

Moreover, if the captured image information is insufficient for reconstructing the camera trajectory in a virtual scene, or if errors exist in the parameters registered on-site, the whole image recording of the real object must be done all over again. Clearly, the prolongation of production causes higher costs. An improved method or system of image composition is therefore required, not only for reducing postproduction labor costs but also for preventing insufficient trajectory information and erroneous parameter registration at the early stage. Numerous camera trajectory tracking and camera self-positioning techniques are already available on the market [68], such as structure-from-motion (SFM) [69] and simultaneous localization and mapping (SLAM) [70-72] for common cameras. KinectFusion from Microsoft is a self-positioning technique that combines a continuous

stream of 3D data from a depth camera. However, a scene to be filmed with the previously mentioned techniques must remain static; otherwise, the tracked trajectories are affected by the objects moving in the scene, and the precision of the tracking suffers. For VRMM in the present work, we modify the temporal depth fusion of KinectFusion to apply it to dynamic environments. We develop a human skeleton detection method and a spatiotemporal attention (STA) analysis method to reduce the noise from moving objects and human characters in the scene.

1.3 Organization of this Thesis

This thesis is organized as follows. The mathematical fundamentals are introduced in Chapter 2. Chapter 3 presents the system hardware architecture and software organization. Chapter 4 shows how to implement the three VCs. Chapter 5 describes how the VD performs shot selection and visual instruction. Chapter 6 documents the details of VRMM. Chapter 7 presents the experimental results. The final chapter covers conclusions and future work.

Chapter 2 Mathematical Fundamentals

This chapter addresses the mathematical fundamentals utilized in this thesis. The Gaussian mixture model is discussed in Section 2.1, finite automata theory is introduced in Section 2.2, the spatiotemporal attention neural network is detailed in Section 2.3, a multiple kernel learning method is introduced in Section 2.4, and the counter propagation neural network is presented in the last section.

2.1 Gaussian Mixture Model

A Gaussian mixture model (GMM) is a probabilistic model that assumes the data points are generated from a linear combination (mixture) of multiple Gaussian probability density functions. The model can smoothly approximate density distributions of arbitrary shape. In this study, we use GMMs to describe postures of humans for the purpose of posture recognition. Figure 2.1(a) shows three individual Gaussian probability density functions; Figure 2.1(b) shows their mixture.

Figure 2.1. Example of a GMM. (a) Three individual Gaussian probability density functions. (b) The GMM of the three Gaussian probability density functions.

Suppose that we have a set of points $X = \{x_i\}$, $i = 1, \dots, n$, in a $d$-dimensional space. We seek $K$ Gaussian distributions $G_1, G_2, \dots, G_K$ that best represent the $x_i$ with

$K$ corresponding contribution weights $\alpha_k$, where $\sum_{k=1}^{K} \alpha_k = 1$. The probability density function is defined by the weighted sum of the $G_k$:

$$p(x_i) = \sum_{k=1}^{K} \alpha_k \, p(x_i \mid G_k). \tag{2.1}$$

A probability density function expressed in this way is called a Gaussian mixture model. The probability density with which the distribution $G_k$ generates a point $x_i$ is

$$p(x_i \mid G_k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\left[-\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right], \tag{2.2}$$

where $\mu_k$ is the mean of the density function and $\Sigma_k$ denotes its covariance matrix. These parameters determine the characteristics of the density function, such as its center, shape, width, and direction. Hence, the weighted sum of the contributions of all the $G_k$ is

$$p(x_i) = \sum_{k=1}^{K} \frac{\alpha_k}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\left[-\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right]. \tag{2.3}$$

Assume that the points in $X = \{x_i\}$, $i = 1, \dots, n$, are independent of each other. The probability density of $X$ is then

$$p(X) = \prod_{i=1}^{n} p(x_i). \tag{2.4}$$

Given $X$, the problem is to find the optimal values of $\alpha_k$, $\mu_k$, and $\Sigma_k$. Direct optimization is nontrivial; a simpler iterative alternative for estimating these parameters is the expectation-maximization (EM) algorithm. For a detailed exposition of EM, please refer to [73]. From the derivation in [73], we obtain the posterior probability that $x_i$ was generated by the $k$-th Gaussian:

$$p_{ik} = \frac{\alpha_k \, p(x_i \mid G_k)}{\sum_{j=1}^{K} \alpha_j \, p(x_i \mid G_j)}. \tag{2.5}$$

The new parameters are

$$\alpha_k^{new} = \frac{1}{n} \sum_{i=1}^{n} p_{ik}, \tag{2.6}$$

$$\mu_k^{new} = \frac{\sum_{i=1}^{n} p_{ik} \, x_i}{\sum_{i=1}^{n} p_{ik}}, \tag{2.7}$$

$$\Sigma_k^{new} = \frac{\sum_{i=1}^{n} p_{ik} \, (x_i - \mu_k^{new})(x_i - \mu_k^{new})^T}{\sum_{i=1}^{n} p_{ik}}. \tag{2.8}$$

Using Equations 2.5-2.8, the iterative EM steps of the GMM procedure are as follows:
Step 1. Select the target number of Gaussians $K$.
Step 2. Initialize the $K$ Gaussians. Usually, we let $\alpha_k = 1/K$, calculate the data cluster centers by a K-means algorithm, and use the resulting clusters to set the initial values of $\mu_k$ and $\Sigma_k$.
Step 3. Expectation: calculate $p_{ik}$ for each data point $x_i$ from the current $\alpha_k$, $\mu_k$, and $\Sigma_k$ (Equation 2.5).
Step 4. Maximization: update the Gaussian parameters by Equations 2.6-2.8.
Step 5. Iterate from Step 3 until convergence.

2.2 Finite Automata Theory

In this section, we address the basics of finite automata theory that are utilized in this study. These include how to convert a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA), and how to simplify a DFA into a simplified finite automaton (SFA).

A. Finite state machine

The finite state machine (FSM), also called a finite state automaton, is an efficient and simple mathematical model often used in logic circuits and computer programs. The FSM is defined by a finite number of states, input operations, and a transition function. For any defined triggering event, the current state transitions to the appropriate state.

B. Nondeterministic finite automaton

An NFA is expressed by a quintuple $M = (K, \Sigma, \Delta, s_0, F)$, where $K$ represents a finite set of states, $\Sigma$ is the input symbol collection, $\Delta$ is the transfer relation, $s_0$ is the initial state, and $F$ is the set of final states. The NFA is one type of FSM. For each input symbol, an NFA transitions to a new state until all input symbols have been consumed. For some state and input symbol, the next state may be nothing, or one state,

or multiple possible states. An NFA may contain the following four kinds of transitions: empty transition, multiple input transition, ambiguous transition, and missing transition. An NFA is easier to construct than a DFA because it can be built from any regular expression by Thompson's construction algorithm. Figure 2.2 is an example of an NFA transition diagram.

Figure 2.2. NFA transition diagram with empty transition, multiple input transition, ambiguous transition, and missing transition.

C. Deterministic finite automaton

A deterministic FSM is referred to as a DFA. A DFA is described by a quintuple $M = (K, \Sigma, \delta, s_0, F)$, where $K$ represents a finite set of states, $\Sigma$ is the input symbol collection, $\delta$ is the transfer function, $s_0$ is the initial state, and $F$ is the set of final states. The rules according to which the automaton $M$ picks its next state are encoded in the transition function. Every NFA has an equivalent DFA. The conversion uses the subset construction method; see [74] for details. After the NFA has been converted to a DFA, the functionality of the new DFA is equivalent to that of the original NFA. The main purpose of the conversion is to eliminate the uncertainty that ambiguous transitions introduce into the NFA. A system designed as a DFA is easier to implement and debug.

D. Conversion from NFA to DFA

After the NFA is specified, it can be converted into an equivalent DFA by the subset construction algorithm. Given a transition diagram, the steps of the subset construction algorithm are as follows:
Step 1. Separate all multiple input transitions.
Step 2. Check whether each state has an empty transition, that is, one that can be taken without an input symbol.
Step 3. Check the input symbols and reachable states from a given state and store them in a transition table.
Step 4. Check whether any new state exists in the table. We try to find states that can be merged into a new state; if such a state is found, repeat Step 3 with that state.
Step 5. Repeat Step 3 and Step 4 until all the possible states are merged.
Step 6. Rename the states.
Step 7. Mark the initial state and the final states.
After all steps have been executed, the NFA has been converted into a DFA. Figure 2.3 shows an example of a DFA transition diagram.

Figure 2.3. DFA transition diagram.

E. Conversion from DFA to SFA

Once we obtain the DFA, we must check whether it can be simplified. The simplification algorithm reduces the number of states and improves the efficiency of

the system. Its steps are as follows:
Step 1. Construct the transition table.
Step 2. Partition the states into final and non-final states.
Step 3. Rename the components.
Step 4. Find states that can be merged into a new state.
Step 5. For each component of the previous partition, partition the component according to the next states.
Step 6. Repeat Step 3 to Step 5 until the number of components no longer changes.
Step 7. Rename the states, construct the table, and draw the diagram.
The practical steps for our system design are described in Chapter 3.

2.3 Spatiotemporal Attention Neural Network

The STA [31] neural network is configured as a two-layer network, with one layer for input and one layer for output. The extracted information serves as the input stimuli to an STA network embedded in the perceptual analyzer. The output layer is also referred to as the attention layer. Neurons in the attention layer are arranged into a two-dimensional (2D) array, in which they are interconnected. No direct links connect input neurons to each other, but each neuron is part of the two-layer network. Assume that a 2D Gaussian G (see Figure 2.4) is centered at an attention neuron. A weight value links an input neuron with the attention neuron at the center of the Gaussian G. If consistent stimuli repeatedly innervate the neural network, a focus of attention is established in the network.
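The behavior described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the window radius, the Gaussian spread, the parameter values, and the names `gaussian_weights`, `net_input`, and `AttentionLayer` are all assumptions made for illustration. The rise and decay rule follows the activation curve given as Equation 2.9 in this section, and clamping the decayed activation at zero is a further assumption.

```python
import math

def gaussian_weights(radius, spread):
    """Normalised 2D Gaussian link weights for a (2*radius+1)^2 window."""
    raw = {(dx, dy): math.exp(-(dx * dx + dy * dy) / (2.0 * spread ** 2))
           for dx in range(-radius, radius + 1)
           for dy in range(-radius, radius + 1)}
    total = sum(raw.values())
    return {offset: w / total for offset, w in raw.items()}

def net_input(stimulus, weights):
    """Gaussian-weighted sum of input stimuli received by each attention neuron."""
    h, w = len(stimulus), len(stimulus[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for (dx, dy), wt in weights.items():
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    out[y][x] += wt * stimulus[yy][xx]
    return out

class AttentionLayer:
    """Attention layer whose neurons rise and decay as in Equation 2.9."""

    def __init__(self, height, width, rho=1.0, sigma=0.5, omega=0.1, threshold=0.12):
        self.rho, self.sigma, self.omega, self.threshold = rho, sigma, omega, threshold
        self.act = [[0.0] * width for _ in range(height)]
        self.base = [[0.0] * width for _ in range(height)]  # STA(x, y, t0)
        self.t0 = [[None] * width for _ in range(height)]   # None: not stimulated

    def step(self, a, t):
        for y, row in enumerate(a):
            for x, value in enumerate(row):
                if value > self.threshold:
                    if self.t0[y][x] is None:  # stimulus onset: remember t0
                        self.t0[y][x] = t
                        self.base[y][x] = self.act[y][x]
                    rise = self.rho * (1.0 - math.exp(-self.sigma * (t - self.t0[y][x])))
                    self.act[y][x] = min(self.rho, self.base[y][x] + rise)
                else:
                    self.t0[y][x] = None
                    # Linear decay; the clamp at zero is an assumption.
                    self.act[y][x] = max(0.0, self.act[y][x] - self.omega)
        return self.act

# A consistent stimulus at one location builds a focus of attention there.
weights = gaussian_weights(radius=2, spread=1.0)
stimulus = [[0.0] * 9 for _ in range(9)]
stimulus[4][4] = 1.0
layer = AttentionLayer(9, 9)
for t in range(10):
    layer.step(net_input(stimulus, weights), t)
peak = layer.act[4][4]
strongest = max(max(row) for row in layer.act)

# Once the stimulus disappears, the activation decays.
blank = [[0.0] * 9 for _ in range(9)]
layer.step(net_input(blank, weights), 10)
after = layer.act[4][4]
```

With these assumed values, only the neuron directly under the stimulus receives a weighted input above the threshold, so the focus of attention forms at that single location and fades after the stimulus is removed.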

Figure 2.4. STA network.

Figure 2.5 shows the activation of an attention neuron in response to an input stimulus. If the input to the neuron is greater than that neuron’s activation threshold within a time interval $\Delta T$, the neuron requires a time $\Delta T_{rise}$ to reach maximum activation and decays over a time of approximately $\Delta T_{decay}$.

Figure 2.5. Activation of an attention neuron in response to stimuli.

The equation for this activation curve is formulated as follows:

$$\mathit{STA}(x, y, t) = \begin{cases} \min\left(\rho,\; \mathit{STA}(x, y, t_0) + \rho \cdot \left(1 - e^{-\sigma \cdot (t - t_0)}\right)\right), & \text{if } A(x, y, t) > T_a, \\ \mathit{STA}(x, y, t - 1) - \omega, & \text{otherwise,} \end{cases} \tag{2.9}$$

where $\rho$ is the maximum activation, $\sigma$ controls the rate of rise, and $\omega$ controls the rate of decay. In addition, $t_0$ is the start time at which the STA neuron at position $(x, y)$ receives an activation $A(x, y, t_0)$ larger than the threshold $T_a$. To detect STA-salient objects in a video sequence, a low-color image and a high-color image are first extracted from the input video sequence. A high-color (low-color) image at time $t$ preserves the maximum (minimum) color values of the input video sequence up to time $t$. A distinct spatial difference image is then computed for each

input in the STA neural module. Then, we calculate the temporal difference (derivative) image for each spatial difference image. The resulting temporal difference images serve as inputs to the STA neural network. The process flowchart is shown in Figure 2.6.

Figure 2.6. Flowchart of STA image acquisition.

2.4 Multiple Kernel Learning

MKL is a method that has been shown to produce excellent classification results when dealing with heterogeneous data from different sources of information with their own dimensions and ranges, especially in large sample spaces. MKL is used in our VD subsystem to improve the accuracy of shot selection. If the problem is nonlinear, instead of trying to fit a nonlinear model to discriminate the data, we can map the problem to a new space by a nonlinear transformation with a suitably chosen mapping function and then use a linear model in the new space (see Figure 2.7). Assume that the new dimensions are calculated through the mapping function $z = \phi(x)$, which maps from the $x$ space to the $z$ space.
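The idea of mapping to a new space can be made concrete with a small sketch. The data set, the mapping `phi`, and the hand-picked weights below are hypothetical and serve only as an illustration; they are not taken from the proposed system. The one-dimensional samples cannot be separated by any single threshold on $x$, but after the assumed mapping $z = \phi(x) = (x, x^2)$ a linear model separates them.

```python
def phi(x):
    """Hypothetical mapping from the 1-D x space to a 2-D z space."""
    return (x, x * x)

def linear_decision(z, w, b):
    """Linear classifier in the z space: sign of w . z + b."""
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1

def separable_by_threshold(samples):
    """True if a single threshold on x separates the two classes."""
    xs = sorted(x for x, _ in samples)
    cuts = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    for cut in cuts:
        right_positive = all((x > cut) == (y == 1) for x, y in samples)
        left_positive = all((x < cut) == (y == 1) for x, y in samples)
        if right_positive or left_positive:
            return True
    return False

# Toy data: the positive class lies on both sides of the negative class.
samples = [(-2.0, 1), (-1.5, 1), (-0.5, -1), (0.0, -1),
           (0.5, -1), (1.5, 1), (2.0, 1)]

linear_in_x = separable_by_threshold(samples)

# Hand-picked linear model in the z space: the line z2 = 1.
w, b = (0.0, 1.0), -1.0
predictions = [linear_decision(phi(x), w, b) for x, _ in samples]
labels = [y for _, y in samples]
```

In this example `linear_in_x` is false while `predictions` matches `labels`, which is the situation sketched in Figure 2.7.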

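Figure 2.7. Mapping to a new space.

Consider a sample $X = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. For binary classification, the classifier can be trained by solving the following quadratic optimization problem:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i \left(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i \in \{1, 2, \dots, N\}, \tag{2.10}$$

where $C$ is a predefined positive trade-off parameter between model simplicity and classification error, and $\boldsymbol{\xi}$ is the vector of slack variables. Instead of solving this optimization problem directly, we use the Lagrangian dual function to obtain the following dual formulation:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; \forall i \in \{1, 2, \dots, N\}, \tag{2.11}$$

where $\boldsymbol{\alpha}$ is the vector of dual variables corresponding to the separation constraints. The idea of a kernel machine is to replace the inner product of the mapping functions, $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$, by a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$. Kernels are generally considered to be measures of similarity, in the sense that $K(\mathbf{x}_i, \mathbf{x}_j)$ takes a larger value when $\mathbf{x}_i$ and $\mathbf{x}_j$ are more “similar” from the point of view of the application. The optimization process applies the following dual formulation:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; \forall i \in \{1, 2, \dots, N\}. \tag{2.12}$$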
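The replacement of the inner product by a kernel can be checked numerically. The following sketch is only an illustration under stated assumptions: the degree-2 polynomial kernel, its explicit mapping `phi`, the Gaussian (RBF) kernel with `gamma = 0.5`, and the two-point toy problem are chosen for the example and are not the kernels or data used in this thesis. The function `dual_objective` evaluates the objective of Equation 2.12 for given dual variables; it does not solve the optimization.

```python
import math

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit mapping whose inner product equals the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without forming phi."""
    return dot(x, y) ** 2

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel: larger when x and y are closer (more similar)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def dual_objective(alpha, labels, points, kernel):
    """Objective of the dual formulation (2.12) for fixed dual variables."""
    n = len(points)
    quad = sum(alpha[i] * alpha[j] * labels[i] * labels[j] * kernel(points[i], points[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Kernel value equals the inner product in the mapped space.
x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))
implicit = poly2_kernel(x, y)

# Similarity interpretation: a nearby point scores higher than a distant one.
near = rbf_kernel((0.0, 0.0), (0.1, 0.0))
far = rbf_kernel((0.0, 0.0), (3.0, 0.0))

# Two-point toy problem with a linear kernel; alpha satisfies sum(alpha_i * y_i) = 0.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alpha = [0.5, 0.5]
objective = dual_objective(alpha, labels, points, dot)
```

Working with $K$ directly avoids computing $\phi$, which matters when the mapped space is high-dimensional; for the toy problem the dual objective evaluates to 0.5, which equals $\frac{1}{2}\|\mathbf{w}\|^2$ for $\mathbf{w} = (1, 0)$.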