Multiple Kernel Learning - Mathematical Fundamentals

Chapter 2 Mathematical Fundamentals

2.4 Multiple Kernel Learning

MKL is a method that has been proven to produce excellent classification results when dealing with heterogeneous data from different sources of information with their own dimensions and ranges, especially in large sample spaces. MKL is used in our VD subsystem to improve the accuracy of shot selection.

If the problem is nonlinear, instead of trying to fit a nonlinear model to discriminate the data, we can map the problem to a new space through a nonlinear transformation using a suitably chosen mapping function and then use a linear model in the new space (see Figure 2.7). Assume that the new dimensions are calculated through the mapping function $z = \phi(x)$, which maps from the $x$ space to the $z$ space.

Figure 2.6. Flowchart of STA image acquisition.

Given a sample $X = \{(x_i, y_i)\}_{i=1}^{N}$ for binary classification, the classifier can be trained by solving the following quadratic optimization problem:

$$\min_{w,\, b,\, \xi}\ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i\,(w \cdot \phi(x_i) + b) \ge 1 - \xi_i,\quad \xi_i \ge 0 \tag{2.10}$$

where $C$ is a predefined positive trade-off parameter between model simplicity and classification error and $\xi$ is the vector of slack variables. Instead of solving this optimization problem directly, we use the Lagrangian dual function to obtain the following dual formulation:

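$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \phi(x_i) \cdot \phi(x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0,\quad 0 \le \alpha_i \le C,\ \forall i \in \{1, 2, \dots, N\} \tag{2.11}$$

where $\alpha$ is the vector of dual variables corresponding to each separation constraint.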
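The step from the primal problem (2.10) to the dual (2.11) can be sketched as follows. This is the standard soft-margin derivation, assuming the slack variables are constrained to be non-negative; the multipliers $\mu_i$ are a symbol introduced here for illustration and do not appear elsewhere in this chapter.

```latex
% \mu_i \ge 0: multipliers for the constraints \xi_i \ge 0 (symbol introduced here)
\begin{aligned}
L(w, b, \xi, \alpha, \mu) &= \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{N}\xi_i
  - \sum_{i=1}^{N}\alpha_i\bigl[y_i(w\cdot\phi(x_i)+b) - 1 + \xi_i\bigr]
  - \sum_{i=1}^{N}\mu_i\xi_i \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{N}\alpha_i y_i\,\phi(x_i) \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; C - \alpha_i - \mu_i = 0
  \;\Rightarrow\; 0 \le \alpha_i \le C
\end{aligned}
```

Substituting the expression for $w$ back into $L$ removes the terms in $b$ and $\xi$ and leaves $\sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\,\phi(x_i)\cdot\phi(x_j)$, which is the objective maximized in (2.11).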

The idea of a kernel machine is to replace the inner product of mapping functions, $\phi(x_i) \cdot \phi(x_j)$, with a kernel function $K(x_i, x_j)$. Kernels are generally considered to be measures of similarity in the sense that $K(x_i, x_j)$ takes a larger value as $x_i$ and $x_j$ are more "similar" from the point of view of the application. The optimization process applies the following dual formulation:

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0,\quad 0 \le \alpha_i \le C. \tag{2.12}$$

Figure 2.7. Mapping to a new space.
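As a small illustration of replacing the inner product of mapping functions with a kernel, the sketch below compares an explicit feature map with the corresponding kernel evaluation. The homogeneous quadratic kernel and the two sample points are assumptions chosen for illustration; they are not taken from the VD subsystem.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map of the 2-D homogeneous quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated without ever forming phi."""
    return dot(x, z) ** 2


# Hypothetical sample points, used only for this check.
x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))      # inner product in the mapped space
implicit = quadratic_kernel(x, z)   # same value computed in the input space
print(explicit, implicit)
```

Both quantities agree up to floating-point rounding, which is why the dual problem only ever needs kernel values and never the mapped vectors themselves.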

Moreover, the optimization process of a general kernel, which is represented in matrix form, is as follows:

$$\max_{\alpha,\ \alpha^T y = 0}\ \alpha^T e - \frac{1}{2}\, \alpha^T K \alpha \quad \text{s.t.} \quad 0 \le \alpha_i \le C \tag{2.13}$$

where $K$ is the kernel matrix and $e$ is the all-ones vector. The kernel matrix $K$ represents the similarity between all pairs of data points. Because the elements of the kernel matrix are defined by inner products from pairwise comparisons, the kernel matrix is symmetric and positive semidefinite, and the set of such matrices forms a convex cone. This gives a useful property of the kernel matrix: any symmetric positive semidefinite matrix specifies a kernel matrix, and every kernel matrix is a symmetric positive semidefinite matrix.
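The properties of the kernel matrix can be checked numerically on a toy example. The sketch below builds the Gram matrix of a Gaussian (RBF) kernel and verifies symmetry and non-negativity of the quadratic form $v^T K v$ on random vectors. The kernel choice, the `gamma` value, and the four points are assumptions for illustration; sampling random vectors is only a necessary-condition check, not a proof of positive semidefiniteness.

```python
import math
import random


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, kernel):
    """Kernel matrix K with K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(K, v):
    """Value of v^T K v for a square matrix K given as nested lists."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


# Hypothetical data points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.5, 1.5)]
K = gram_matrix(points, rbf_kernel)
n = len(points)

is_symmetric = all(
    math.isclose(K[i][j], K[j][i]) for i in range(n) for j in range(n)
)

rng = random.Random(0)  # fixed seed so the check is repeatable
trial_vectors = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(100)]
min_form = min(quadratic_form(K, v) for v in trial_vectors)
print(is_symmetric, min_form >= 0.0)
```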

It is possible to construct new kernels by combining multiple simpler kernels. In this way, we can fuse heterogeneous information from different sources. Each kernel measures similarity according to its domain. Assuming $M$ different sources exist, and each source has its own base kernel matrix $K_m$, the kernel matrix $K$ is defined as:

$$K = \sum_{m=1}^{M} \beta_m K_m \quad \text{subject to} \quad \beta_m \ge 0,\quad m = 1, \dots, M. \tag{2.14}$$

That is, $K$ is a linear combination of the base kernel matrices. This is called multiple kernel learning, where we replace a single kernel with a weighted sum: by linearly combining the $M$ base kernel matrices $K_m$ with kernel coefficients $\beta_m$, we synthesize the kernel matrix $K$.
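A minimal sketch of the weighted kernel combination follows, assuming two base kernels (linear and Gaussian) and hand-picked weights. In MKL the weights are learned; the values, the helper `combine_kernels`, and the sample points here are hypothetical.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: plain inner product."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel; gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def gram_matrix(points, kernel):
    """Kernel matrix with entries kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def combine_kernels(base_matrices, betas):
    """Return K = sum_m beta_m * K_m, enforcing beta_m >= 0."""
    if len(base_matrices) != len(betas):
        raise ValueError("need exactly one weight per base kernel matrix")
    if any(beta < 0 for beta in betas):
        raise ValueError("kernel weights must be non-negative")
    n = len(base_matrices[0])
    return [
        [sum(beta * K_m[i][j] for beta, K_m in zip(betas, base_matrices))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical points and weights; real weights come out of MKL training.
points = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
K1 = gram_matrix(points, linear_kernel)
K2 = gram_matrix(points, rbf_kernel)
K = combine_kernels([K1, K2], betas=[0.3, 0.7])
print(K[0][1])
```

Rejecting negative weights keeps the combination inside the cone of valid kernel matrices, since a non-negative sum of positive semidefinite matrices is again positive semidefinite.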

For training, the dual formulation of MKL with multiple kernels can be defined in matrix form (rewriting Equation 2.13) as:

$$\min_{\sum_{m} \beta_m K_m}\ \left( \max_{\alpha,\ \alpha^T y = 0}\ \alpha^T e - \frac{1}{2}\, \alpha^T \Big( \sum_{m=1}^{M} \beta_m K_m \Big) \alpha \right) \quad \text{s.t.} \quad 0 \le \alpha_i \le C \tag{2.15}$$

where $\alpha_i$ is the sample coefficient and $\beta_m$ is the kernel weight. After training, $\beta_m$ takes a value that depends on how useful the corresponding kernel is in discriminating the classes. For an input $x$, considering classification with $N$ training samples $\{x_i,\ y_i \in \{\pm 1\}\}_{i=1}^{N}$ and $M$ base kernels $\{K_m\}_{m=1}^{M}$, the learned model is of the form:

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b = \sum_{i=1}^{N} \alpha_i y_i \sum_{m=1}^{M} \beta_m K_m(x_i, x) + b. \tag{2.16}$$
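The learned model in (2.16) can be evaluated directly once the coefficients are known. The sketch below assumes two base kernels and made-up values for the sample coefficients, kernel weights, and bias; none of these numbers come from the thesis, and the function name `mkl_decision` is hypothetical.

```python
import math


def linear_kernel(x, z):
    """Linear base kernel."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian base kernel; gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def mkl_decision(x, train_x, train_y, alphas, betas, kernels, b):
    """f(x) = sum_i alpha_i y_i sum_m beta_m K_m(x_i, x) + b."""
    total = b
    for x_i, y_i, alpha_i in zip(train_x, train_y, alphas):
        combined = sum(beta * k(x_i, x) for beta, k in zip(betas, kernels))
        total += alpha_i * y_i * combined
    return total


# Hypothetical trained model: two samples, two base kernels.
train_x = [(0.0, 0.0), (2.0, 2.0)]
train_y = [-1, 1]
alphas = [0.5, 0.5]      # chosen so that sum_i alpha_i * y_i = 0
betas = [0.4, 0.6]       # non-negative kernel weights
kernels = [linear_kernel, rbf_kernel]
b = -0.5                 # assumed bias

score_pos = mkl_decision((2.0, 1.5), train_x, train_y, alphas, betas, kernels, b)
score_neg = mkl_decision((0.0, 0.5), train_x, train_y, alphas, betas, kernels, b)
print(score_pos > 0, score_neg < 0)
```

The sign of the returned score gives the predicted class; a query close to the positive training sample scores above zero and one close to the negative sample scores below zero.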

The sample coefficient captures the relation between data and classes, where $\alpha_i$ is the weight for the $i$th datum. The kernel coefficient captures the relation between classes and features, where $\beta_m$ is the weight for the base kernel matrix $K_m$.

The basic task of MKL is to find the sample coefficients $\alpha_i$ and the corresponding kernel coefficients $\beta_m$. In other words, the task is to optimize both the sample coefficients $\alpha_i$ and the kernel coefficients $\beta_m$ so that the error function can be minimized to achieve better data clustering results.

In a conventional method, the kernel coefficient $\beta_m$ could be obtained from machine-learning approaches, such as a support vector machine (SVM). However, the optimization problem is too complex to solve directly; thus, an alternative approach is proposed here that uses an iterative method to obtain the optimized sample coefficient and kernel coefficient. More specifically, we solve for these two coefficients one at a time while the other is fixed. That is, we optimize $\alpha_i$ by fixing $\beta_m$ and optimize $\beta_m$ by fixing $\alpha_i$. In any odd-numbered iteration, a nearly optimal $\alpha_i$ is obtained by solving a generalized eigenvalue problem; we obtain $\beta_m$ by solving the relaxation of semidefinite programming (SDP). Next, in the even-numbered iteration, a nearly optimal $\beta_m$ is obtained by solving a generalized eigenvalue problem; we obtain $\alpha_i$ by solving the relaxation of SDP. In each iteration, we get closer to the optimal solution and then use this solution as the input to the next loop until convergence.

2.5 Counter Propagation Neural Network

Our CPN network is used as a decision-making module for shot selection in our VD subsystem. The CPN is a supervised learning technique, and a real director provided the initial training input data. The most crucial feature of the CPN is its fast response time. The output of CPN is the single shot that has the highest score.

A. Counter propagation neural network introduction

Figure 2.8 shows the architecture of a fully connected CPN network. The network is constructed of five layers: two input layers, two output layers, and one hidden layer.

The training data is X. There are 𝑛 neurons in the X input layer, and the input data to neurons are denoted by 𝑥_𝑖, where 𝑖 = 1, … , 𝑛. Another input layer with 𝑚 neurons takes the data Y, which is the labeled vector for X, denoted by 𝑦_𝑘, where 𝑘 = 1, … , 𝑚 .

In the architecture, the hidden layer contains p neurons; each neuron is denoted 𝑅_𝑗, where 𝑗 = 1, … , 𝑝. This hidden layer is also called the cluster layer. Each neuron in the hidden layer represents a class. The weight vector 𝑤_𝑗𝑖 connected to input neurons 𝑥_𝑖 and hidden neurons 𝑅_𝑗 is used to classify 𝑖 inputs to class 𝑗. In the same way, the weight vector 𝑢_𝑗𝑘 connected to input neurons 𝑦_𝑘 and hidden neurons 𝑅_𝑗 is used to classify k inputs to class j.

After an input vector is classified, we can obtain the output result calculated from weight vectors 𝑣_𝑘𝑗 and 𝑡_𝑖𝑗, which directly connect the hidden neurons and the output layer. If we input vector 𝑥_𝑖 to input layer X, the approximate output is 𝑦₁^∗, … , 𝑦_𝑚^∗; if we input vector 𝑦_𝑖 to input layer Y, the approximate output is 𝑥₁^∗, … , 𝑥_𝑛^∗.

B. Forward-mapping CPN

Figure 2.8. Architecture of a fully connected CPN.

One of the features of a fully connected CPN is as follows: if the input to the network is an expected output result, one obtains an input vector corresponding to the result when there is a one-to-one mapping between input vector and output vector.

However, there could be different situations in our VD subsystem. For example, assume the director selects the hall view; the cause might be that the speaker is interacting with the audience or that the director wants to use an establishing shot to avoid an emergency.

Because both situations could motivate the director to choose the same shot, the mapping function should not be one-to-one. Thus, we simplified the CPN into a forward-only network, and the updated architecture, called a forward-mapping CPN, is shown in Figure 2.9.

Figure 2.9. Architecture of forward-mapping CPN applied in VD subsystem.

The forward-mapping CPN can be divided into two layers (Figure 2.10). The first layer, called the Kohonen layer, uses a winner-take-all learning algorithm to train the weight vector 𝑤_𝑗𝑖. The Kohonen layer executes an unsupervised learning algorithm.

This layer is often used in classification. Each neuron in the hidden layer can represent a rule. Thus, the entire Kohonen layer can be seen as a rule library. The algorithm is as follows:

Step 1. Assume the score vector for training data is 𝑥₁, … , 𝑥_𝑛, where 𝑛 is the number of training data points.

Step 2. Calculate the likelihood between each class and corresponding weight by:

$$d_j = \sum_{i=1}^{n} |x_i(t) - w_{ji}(t)| \tag{2.17}$$

where $t$ is the $t$th training data point.

Step 3. Choose the minimum $d_j$ as the winner. Only the weight of the winner is updated.

$$d_{winner} = \min_{j = 1, \dots, p} d_j \tag{2.18}$$

where the minimum is taken over the $p$ neurons of the hidden layer.

Step 4. To avoid an overly large difference between weight and input, a threshold $\Delta$ is used here. If $d_{winner} < \Delta$, the weight and input are similar; go to Step 5. If $d_{winner} > \Delta$, the weight and input are not similar and a new rule is required; go to Step 6.

Step 5. Update the weight connected to the winner; the update function is:

$$w_{winner,i}(t+1) = w_{winner,i}(t) + \alpha\,[x_{winner,i}(t) - w_{winner,i}(t)] \tag{2.19}$$

where $\alpha$ is the learning rate; its initial value is $\frac{1}{2}$, but $\alpha$ decreases through iteration to expedite convergence.

Figure 2.10. Kohonen layer (left) and Grossberg layer (right).

Step 6. If no class is found for an input, try to add a new neuron node to the hidden layer (see Figure 2.11). The initial weight for a new node is:

$$w_{p+1}(t+1) = x(t) \tag{2.20}$$

$$u_{p+1,k}(t+1) = y^{*}(t) \tag{2.21}$$

This uses the input value as the new weight $w_{p+1,i}$ and the expected (labeled) output value as the new weight $u_{p+1,k}$.
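Steps 1 to 6 of the Kohonen layer can be sketched as a short training loop. This is an illustrative sketch under stated assumptions, not the implementation used in the VD subsystem: the hidden layer starts empty, a distance exactly equal to the threshold is treated as "similar", and the learning rate decays by a fixed factor after each pass. The names `train_kohonen`, `decay`, and `epochs`, and the toy data, are hypothetical.

```python
def l1_distance(x, w):
    """d_j of Eq. (2.17): sum of absolute differences between input and weight."""
    return sum(abs(x_i - w_i) for x_i, w_i in zip(x, w))


def train_kohonen(samples, labels, threshold, alpha=0.5, decay=0.9, epochs=1):
    """Winner-take-all training that grows the hidden layer when needed.

    Returns (w, u): w[j] is the input weight vector of hidden node j and
    u[j] is the label vector stored when node j was created.
    """
    w, u = [], []
    for _ in range(epochs):
        for x, y_star in zip(samples, labels):
            winner = None
            if w:
                distances = [l1_distance(x, w_j) for w_j in w]
                winner = min(range(len(w)), key=lambda j: distances[j])  # Step 3
                if distances[winner] > threshold:                        # Step 4
                    winner = None
            if winner is None:
                # Step 6: no similar rule exists, so add a new hidden node.
                w.append(list(x))
                u.append(list(y_star))
            else:
                # Step 5: move the winner's weights toward the input.
                w[winner] = [w_i + alpha * (x_i - w_i)
                             for x_i, w_i in zip(x, w[winner])]
        alpha *= decay  # assumed decay schedule
    return w, u


# Hypothetical score vectors with one-hot shot labels.
samples = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 4.8)]
labels = [(1, 0, 0), (1, 0, 0), (0, 0, 1), (0, 0, 1)]
w, u = train_kohonen(samples, labels, threshold=1.0)
print(len(w), w)
```

With this toy data the two nearby pairs of inputs each collapse into one hidden node, so the rule library ends up with two entries.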

The number of neurons in the hidden layer denotes the number of types that can be classified. The concept is the same as the shot selection of a real director. When real directors select a view, they must always determine the situation that it belongs to (e.g., an audience member is asking questions, or a speaker has a meaningful posture); that is, directors classify the situation to which each selected shot belongs.

The second layer is the Grossberg layer. This layer uses the Grossberg supervised learning algorithm to train the weight vector 𝑢_𝑗𝑘. The training algorithm of the Grossberg layer resembles the training processes of the Kohonen layer. The Grossberg algorithm is as follows:

Figure 2.11. Adding a new node to the CPN.

Step 1. Assume the score vector for training data is $x_1, \dots, x_n$, where $n$ is the number of training data points, $y_1^{*}, \dots, y_m^{*}$ are the expected results, and $m$ is the number of VCs (in this research $m = 3$).

Step 2. Calculate the likelihood between each class and the corresponding weight by:

$$d_j = \sum_{i=1}^{n} |x_i(t) - w_{ji}(t)| \tag{2.22}$$

where $t$ is the $t$th training data point.

Step 3. Choose the minimum 𝑑_𝑗 as the winner. Only the weight of the winner is updated.

$$d_{winner} = \min_{j = 1, \dots, p} d_j \tag{2.23}$$

Step 4. Update the weights that connect to the winner; the update functions are:

$$w_{winner,i}(t+1) = w_{winner,i}(t) + \alpha\,[x_{winner,i}(t) - w_{winner,i}(t)] \tag{2.24}$$

$$u_{winner,k}(t+1) = u_{winner,k}(t) + \beta\,[y_{winner,k}^{*}(t) - u_{winner,k}(t)] \tag{2.25}$$

Here $\alpha$ is the learning rate of the Kohonen layer; its value is constant at this stage and equals the last value reached in the Kohonen layer after convergence. $\beta$ is the learning rate of the Grossberg layer, and its initial value is also a constant.
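One Grossberg training step, followed by shot selection, can be sketched as below. The sketch assumes the hidden-layer weights already exist (for example from Kohonen training) and that the selected shot is the output with the highest score, as described for the CPN output. The function names and all numeric values are hypothetical.

```python
def find_winner(x, w):
    """Index of the hidden node whose weights are closest to x (L1 distance)."""
    distances = [sum(abs(x_i - w_i) for x_i, w_i in zip(x, w_j)) for w_j in w]
    return distances.index(min(distances))


def grossberg_step(x, y_star, w, u, alpha, beta):
    """Apply the winner updates for the input and output weights in place."""
    j = find_winner(x, w)
    w[j] = [w_i + alpha * (x_i - w_i) for x_i, w_i in zip(x, w[j])]
    u[j] = [u_k + beta * (y_k - u_k) for y_k, u_k in zip(y_star, u[j])]
    return j


def predict_shot(x, w, u):
    """Return the index of the output with the highest score for input x."""
    scores = u[find_winner(x, w)]
    return scores.index(max(scores))


# Hypothetical network with two hidden nodes and m = 3 outputs.
w = [[0.0, 0.0], [5.0, 5.0]]
u = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

winner = grossberg_step((0.1, 0.0), (1, 0, 0), w, u, alpha=0.1, beta=0.5)
shot_a = predict_shot((0.2, 0.1), w, u)
shot_b = predict_shot((4.9, 5.2), w, u)
print(winner, u[0], shot_a, shot_b)
```

After the single update, the output weights of the winning node move halfway toward the labeled result, which is enough to break the initial tie between the first two outputs.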

Chapter 3 Smart Lecture Recording System

The overall organization of the smart lecture recording (SLR) system is shown in Figure 3.1. There are three principal components constituting the SLR system: virtual cameraman (VC), virtual director (VD), and manual control (MC). The VC component is further divided into three sub-components: speaker cameraman (SC), audience cameraman (AC), and hall cameraman (HC). This division is inspired by a professional lecture recording team that in general possesses at least three cameramen for performing the separate duties of shooting the speaker, the listeners, and the panoramic scene. In the ensuing sections, we discuss the kinds of lecture halls considered in this thesis in Section 3.1; the architecture of the SLR system is described in Section 3.2; the workflow of the system is finally addressed in Section 3.3.

3.1 Lecture halls under consideration

The lecture halls under our consideration range from ordinary classrooms and lecture theaters to grand oration halls. In addition to size, the various kinds of lecture halls are characterized by distinct arrangements of auditoriums. Figure 3.2 shows several halls with different seating layouts, such as (a) tiered, (b) level, and (c) ambient auditoriums. Tiered auditoriums are often used for presentations held by academic institutes and organizations. Level auditoriums are often used for exhibitions of new products held by business firms or releases of new songs by music companies. Ambient auditoriums are primarily used for formal reports or speeches in congresses and parliaments. Our SLR system will be able to work in the sites mentioned above.

Figure 3.1. The organization of the SLR system.

3.2 System Architecture

To illustrate the architecture of the SLR system, let us look at the layout depicted in Figure 3.3, which shows a deployment of hardware devices of the SLR system in a lecture room. There is a screen mounted on the front wall of the room for displaying lecture materials (e.g., PowerPoint slides and videos) transmitted from a computer controlled by the speaker. The SC component of the SLR system, consisting of a Kinect sensor and a PTZ camera, sits in front of the speaker and points toward the speaker. The Kinect sensor serves as a photographer and the PTZ camera acts as his/her imaging device. Once the Kinect sensor perceives an object, the Kinect sensor directs the PTZ camera toward the object for identification and tracking. The AC component, comprising two PTZ cameras standing at the front of the room, faces the audience seats. Similarly, the top PTZ camera serves as a photographer and the bottom PTZ camera acts as his/her imaging device. Finally, the HC component, containing only one PTZ camera, is located at the rear of the lecture room. The major purpose of this component is to provide panoramic shots of the room as interesting episodes for weaving into the lecture video.

Figure 3.2. The kinds of lecture halls considered in this study: (a) tiered, (b) level, and (c) ambient auditoriums.

All three aforementioned components of the SLR system and the speaker's computer are connected through wired or wireless communications to a host computer situated anywhere in the room. Note that an SLR system may contain multiple SC, AC, and HC components for working in a large lecture hall. For such a system, the most important issue of concern may be the coordination of components during lecture recording. We leave this issue to future work.

3.2.1 The SC component

The SC component consists of a Kinect sensor and a PTZ camera, whose specifications are addressed later. Figure 3.4(a) shows the configuration of the SC component, where the Kinect sensor is situated on top of the PTZ camera. Initially, the PTZ camera is in the home position. Its lens is vertically aligned with the lens of the depth camera of the Kinect sensor (see Figure 3.4(b)). Once the Kinect sensor detects an object in the image plane of the depth camera, the Kinect sensor immediately computes the 3D orientation of the object and accordingly guides the PTZ camera toward the object to see whether it is the target of interest.

Figure 3.3. A deployment of hardware devices of the SLR system in a lecture room.

(a) (b)

Figure 3.4. The configuration of the SC component: (a) the picture of the SC component; (b) the red points mark the lens positions of the depth camera and the PTZ camera, respectively.

See Figure 3.5; let $\phi$ be the viewing angle of a 3D object. The angle is known by the Kinect sensor through the projection of the object onto the image plane of the depth camera of the Kinect sensor. Based on this angle, the horizontal pan angle $\phi_{hor}$ and the vertical tilt angle $\phi_{vert}$ of the PTZ camera can be determined.

Figure 3.5. The Kinect perceives an object, i.e., an object is present in the image plane of the depth camera of the Kinect.

Consider $\phi_{hor}$. Let $f$ be the focal length of the PTZ camera and $w'$ be the half width of the depth plane. In Figure 3.5, $x'$ is the distance between the 2D object and the center of the depth plane, and $d$ is the distance from the Kinect sensor to the world plane, which is the plane parallel to the depth plane and passing through the 3D object. In Figure 3.6, $\theta_{hor}$ is the horizontal viewing angle of the depth camera, and $w$ and $x$ are the half width of the world plane and the distance between the center of the world plane and the 3D object, respectively. Based on the similar triangle property,

$$x = \frac{x'}{w'}\, w = \frac{x'}{w'} \left( d \tan\frac{\theta_{hor}}{2} \right).$$

The horizontal pan angle $\phi_{hor}$ of the PTZ camera is calculated according to

$$\phi_{hor} = \tan^{-1}\left( \frac{x}{d} \right).$$

Figure 3.6. The horizontal pan angle $\phi_{hor}$ of the PTZ camera.

Consider the vertical tilt angle $\phi_{vert}$ of the PTZ camera. See Figure 3.7; suppose $l$ is the displacement between the Kinect and the PTZ camera. $h'$ and $y'$ are the half height of the image plane and the vertical distance between the target image and the center of the image plane, respectively. $d$ is the distance from the camera to the world plane. In the world plane, $h$ and $y$ are the half height of the world plane and the vertical distance between the center and the target, respectively. $\theta_{vert}$ is the vertical view angle of the depth camera, and $\phi_{vert}$ is the tilt angle for the PTZ camera. First, use the similar triangle property to calculate $y$:

$$y = \frac{y'}{h'}\, h = \frac{y'}{h'} \left( d \tan\frac{\theta_{vert}}{2} \right).$$

Then, calculate $\phi_{vert}$:

$$\phi_{vert} = \tan^{-1}\left( \frac{y + l}{d} \right).$$

The aforementioned parameters are known except $d$, $\phi_{hor}$, and $\phi_{vert}$. Therefore, PTZ camera action requires only a simple computation process to generate control signals.
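The pan and tilt computation can be sketched in a few lines. The sketch assumes image offsets measured in pixels from the image center, half extents of 160 and 120 pixels (half of a 320 x 240 depth image), viewing angles of 57 and 43 degrees, and that the distance `d` and the displacement `l` share the same unit. The function name, argument names, and the sign convention for `l` are hypothetical.

```python
import math


def pan_tilt_angles(x_img, y_img, half_width_img, half_height_img,
                    d, theta_hor_deg, theta_vert_deg, l):
    """Return (pan, tilt) in degrees for a target seen by the depth camera.

    x_img, y_img: offsets of the target from the image center.
    half_width_img, half_height_img: half extents of the image plane.
    d: distance from the sensor to the world plane (must be positive).
    l: displacement between the Kinect and the PTZ camera.
    """
    if d <= 0:
        raise ValueError("distance d must be positive")
    w = d * math.tan(math.radians(theta_hor_deg) / 2.0)   # half width of world plane
    h = d * math.tan(math.radians(theta_vert_deg) / 2.0)  # half height of world plane
    x = x_img / half_width_img * w     # similar triangles, horizontal
    y = y_img / half_height_img * h    # similar triangles, vertical
    pan = math.atan(x / d)
    tilt = math.atan((y + l) / d)
    return math.degrees(pan), math.degrees(tilt)


# Target at the right edge of the depth image, 2 m away, no lens offset.
pan_edge, tilt_edge = pan_tilt_angles(160, 0, 160, 120, 2.0, 57.0, 43.0, 0.0)

# Target at the image center, with an assumed 0.1 m displacement.
pan_center, tilt_center = pan_tilt_angles(0, 0, 160, 120, 2.0, 57.0, 43.0, 0.1)
print(pan_edge, tilt_edge, pan_center, tilt_center)
```

A target at the edge of the image gives a pan angle equal to half the horizontal viewing angle, which is a quick sanity check on the similar-triangle step.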

3.2.2 The AC component

The AC component consists of two PTZ cameras. Figure 3.8 shows the configuration of the AC component. Unlike the SC component, in which the Kinect sensor serves as a photographer, a moving PTZ camera serves as the photographer in the AC component. This is because the audience typically occupies a much wider area than the speaker, even though the speaker can move around; the field of view of a moving PTZ camera is able to cover the entire range of the audience. Similar to the PTZ camera of the SC component, the bottom PTZ camera of the AC component serves as the imaging device of the cameraman.

Figure 3.7. The vertical tilt angle $\phi_{vert}$ of the PTZ camera.

Similar to the SC component, once the top PTZ camera detects an object, the 3D position of the object is determined by the SC component. The position of the object then guides the bottom PTZ camera toward the object.

3.2.3 The HC component

The HC component containing one PTZ camera is located at the rear of the lecture room. In addition to taking panoramic shots of the room, including the screen, podium, speaker, and audience, another important task for the HC component is to detect the interactions between the speaker and audience. Those interactions will provide interesting episodes for weaving in the lecture video.

3.2.4 Kinect sensor

The SC component includes a Kinect sensor (see Figure 3.9), an active depth sensor produced by Microsoft and typically used for games. With the Kinect sensor, users can direct a general operating system interface using gestures alone.

The Kinect sensor that photographs a person also captures a virtual skeleton abstracted from visual information about that person (Figure 3.10). This feature allows users to play interactive, controller-free, Kinect-based games by moving their bodies.

Figure 3.8. The configuration of the AC component.

The Kinect sensor provides three pivotal types of information: color images, depth images (see Figure 3.11), and sound. Color images are obtained by the RGB camera in the middle of the Kinect, while depth images are produced by the infrared transmitter and infrared CMOS sensor at the left and right sides. The detailed specifications are as follows:

- Depth-sensing and skeleton detection preferred distance: 1.2 to 3.6 meters
- FOV: 57 degrees horizontal, 43 degrees vertical
- Motor rotation angle: 28 degrees up and down
- Frames per second (FPS): 30
- Depth resolution: QVGA (320 x 240)
- Color resolution: VGA (640 x 480)

Figure 3.9. Kinect (from: Google pictures)

Figure 3.10. Kinect virtual skeleton (from: Primesense)

In addition, in 2011 Microsoft released the Kinect SDK for its own operating system software (including Windows 7). It allows users around the world to research and develop new Kinect applications (e.g., gestures can be used to control robots, operate slideshows, and select items). This study integrates the virtual skeletons recorded by Kinect sensors with a custom hand gesture library to identify speakers' hand gestures.

3.2.5 PTZ camera

All the three components of the SLR system contain PTZ cameras (see Figure 3.12). A PTZ camera is a camera that is capable of remote directional and zoom control.

PTZ is an abbreviation for pan, tilt, and zoom, and PTZ cameras can execute all three of those motions. PTZ cameras are commonly used in applications such as surveillance,
