
In the document: Smart Lecture Recording System (智慧型演講錄製系統)

Chapter 1 Introduction

1.3 Organization of this Thesis

This study is organized as follows. The mathematical fundamentals are introduced in Chapter 2. Chapter 3 presents the system hardware architecture and software organization. Chapter 4 shows how to implement three VCs. Chapter 5 describes how the VD performs shot selection and visual instruction. Chapter 6 documents the details of VRMM. Chapter 7 presents the experimental results. The final chapter covers conclusions and future work.


Chapter 2

Mathematical Fundamentals

This chapter presents the mathematical fundamentals used in this thesis. The Gaussian mixture model is discussed in Section 2.1, finite automata theory is introduced in Section 2.2, the spatiotemporal attention neural network is detailed in Section 2.3, a multiple kernel learning method is introduced in Section 2.4, and the counter propagation neural network is presented in Section 2.5.

2.1 Gaussian Mixture Model

A Gaussian mixture model (GMM) is a probabilistic model that assumes relevant data points can be formulated by a linear combination of multiple Gaussian probability density functions. The model can smoothly approximate the density distributions of arbitrary shapes. In this study, we use GMMs to describe postures of humans for the purpose of posture recognition. Figure 2.1(a) shows three individual Gaussian probability density functions; Figure 2.1(b) shows their mixture.

Suppose that we have a set of points $X = \{x_i\}$, $i = 1, \dots, n$, in a $d$-dimensional space. We seek $K$ Gaussian distributions $G_1, G_2, \dots, G_K$ that best represent the points $x_i$.

Figure 2.1. Example of a GMM: (a) three individual Gaussian probability density functions; (b) the GMM of the three Gaussian probability density functions.

The probability density with which the distribution $G_k$ generates a point $x_i$ is

$$p(x_i \mid G_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left[-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right], \tag{2.2}$$

where $\mu_k$ is the mean of the density function and $\Sigma_k$ denotes its covariance matrix. These parameters determine the characteristics of the density function, such as its center, shape, width, and direction. The weighted sum of the contributions of all the $G_k$ is defined as follows:

$$p(x_i) = \sum_{k=1}^{K} \alpha_k \, p(x_i \mid G_k)$$

[…] processing steps of the optimization are nontrivial. A simpler alternative for estimating these parameters is the expectation-maximization (EM) algorithm. For a detailed exposition of EM, please refer to [73]. From the derivation in [73], we can obtain the […]

$$\Sigma_k^{new} = \frac{\sum_{i=1}^{n} p_{ik}\,(x_i - \mu_k^{new})(x_i - \mu_k^{new})^T}{\sum_{i=1}^{n} p_{ik}}. \tag{2.8}$$

By Equations 2.5–2.8, the EM iterative steps of the GMM procedure are listed as follows:

Step 1. Select the target number of Gaussians $K$.

Step 2. Initialize the $K$ Gaussians. Usually, we let $\alpha_k = 1/K$, compute the cluster centers of the data with a K-means algorithm, and set the initial values of $\mu_k$ and $\Sigma_k$ from the result.

Step 3. Expectation: for each data point $x_i$, calculate $p_{ik}$ from the current $\mu_k$ and $\Sigma_k$.

Step 4. Maximization: update the Gaussian parameters (Equations 2.6–2.8).

Step 5. Repeat from Step 3 until convergence.
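The EM steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this thesis. The responsibility $p_{ik} = \alpha_k p(x_i \mid G_k) / \sum_j \alpha_j p(x_i \mid G_j)$ and the updates of $\alpha_k$ and $\mu_k$ are the standard EM forms, assumed here because the corresponding equations are not reproduced in this text. The function names, the `init_mu` parameter, the random initialization (a stand-in for K-means), and the small regularization term are all hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Evaluate the Gaussian density of Eq. 2.2 at every row of X."""
    d = X.shape[1]
    diff = X - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    mahal = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
    return norm * np.exp(-0.5 * mahal)

def fit_gmm(X, K, init_mu=None, n_iter=200, tol=1e-8, seed=0):
    """Fit a K-component GMM with EM; returns (alpha, mu, sigma)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Step 2: equal weights; means from init_mu or random data points
    # (assumption: a simple stand-in for the K-means initialization).
    alpha = np.full(K, 1.0 / K)
    if init_mu is None:
        mu = X[rng.choice(n, size=K, replace=False)].copy()
    else:
        mu = np.array(init_mu, dtype=float)
    base_cov = np.cov(X, rowvar=False).reshape(d, d) + 1e-6 * np.eye(d)
    sigma = np.array([base_cov.copy() for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Step 3 (expectation): responsibilities p_ik.
        dens = np.stack(
            [alpha[k] * gaussian_pdf(X, mu[k], sigma[k]) for k in range(K)], axis=1
        )
        total = dens.sum(axis=1, keepdims=True)
        p = dens / total
        # Step 4 (maximization): update weights, means, and covariances.
        nk = p.sum(axis=0)
        alpha = nk / n
        mu = (p.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # Covariance update of Eq. 2.8, plus a tiny ridge for stability.
            sigma[k] = (p[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
        # Step 5: stop when the log-likelihood no longer improves.
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return alpha, mu, sigma
```

Calling `fit_gmm(X, 2)` on an `(n, d)` array returns the mixture weights, the means, and the covariance matrices. The ridge term keeps each covariance invertible when a component collapses onto few points.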

2.2 Finite Automata Theory

In this section, we address the basics of finite automata theory that are used in this study. This includes how to convert a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA), and how to simplify a DFA into a simplified finite automaton (SFA).

A. Finite state machine

The finite state machine (FSM), also called a finite-state automaton, is a simple and efficient mathematical model often used in logic circuits and computer programs.

The FSM is defined by a finite number of states, input operations, and a transition […] consumed. For some state and input symbol, the next state may be nothing, one state, or multiple possible states. An NFA may contain the following four kinds of transitions: empty transition, multiple input transition, ambiguous transition, and missing transition. The NFA is easier to construct, because NFAs can be constructed from any regular expression using Thompson's construction algorithm. Figure 2.2 shows an example of an NFA transition diagram.

C. Deterministic finite automaton

The deterministic FSM can be referred to as a DFA. A DFA is described by a quintuple $M = (K, \Sigma, \delta, s_0, F)$, where $K$ represents a finite set of states, $\Sigma$ is the input symbol collection, $\delta$ is the transition function, $s_0$ is the initial state, and $F$ is the set of final states. The rules according to which the automaton $M$ picks its next state are encoded into the transition function.

Every NFA has an equivalent DFA. The conversion uses the subset construction method; please refer to [74] for details. After the NFA has been converted to a DFA, the functionality of the new DFA is equivalent to that of the original NFA. The main purpose of the conversion is to eliminate the uncertainty that ambiguous transitions introduce into the NFA. The system is easier to implement and debug if we design it using a DFA.

D. Conversion from NFA to DFA

Figure 2.2. NFA transition diagram with empty transition, multiple input transition, ambiguous transition, and missing transition.


After the NFA is specified, it can be converted into an equivalent DFA using the subset construction algorithm. Given a transition diagram, the steps of the subset construction algorithm are as follows:

Step 1. Separate all multiple input transitions.

Step 2. Check whether each state has an empty transition that can transition without an input symbol.

Step 3. Check the input symbol and reachable state from a given state and store them in a transition table.

Step 4. Check whether any new state exists in the table. We try to find states that can be merged into a new state; if a state that can be merged is found, then repeat Step 3 with that state.

Step 5. Repeat Step 3 and Step 4 until all the possible states are merged.

Step 6. Rename the states.

Step 7. Mark initial state and final state.

After all steps have been executed, the NFA has been converted into a DFA. Figure 2.3 shows an example of a DFA transition diagram.
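The conversion can be illustrated with a compact version of the textbook subset construction. This is a sketch under stated assumptions rather than the procedure used in our system: the NFA is given as a dictionary that maps `(state, symbol)` to a set of next states, the empty transition is labeled with the empty string, and all function names are hypothetical. Each DFA state is a frozen set of NFA states, so the renaming step is left to the caller.

```python
from collections import deque

EPS = ""  # label used here for the empty transition (an assumption)

def epsilon_closure(states, delta):
    """All NFA states reachable from `states` through empty transitions only."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(alphabet, delta, start, finals):
    """Subset construction; returns (dfa_states, dfa_delta, dfa_start, dfa_finals)."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Collect every NFA state reachable on this symbol (ambiguous moves merge).
            moved = set()
            for s in current:
                moved |= set(delta.get((s, symbol), ()))
            if not moved:
                continue  # missing transition: the DFA has no edge here
            target = epsilon_closure(moved, delta)
            dfa_delta[(current, symbol)] = target
            if target not in seen:  # a new merged state was found
                seen.add(target)
                queue.append(target)
    # A merged state is final if it contains any final NFA state.
    dfa_finals = {S for S in seen if S & set(finals)}
    return seen, dfa_delta, dfa_start, dfa_finals

def accepts(dfa_delta, dfa_start, dfa_finals, word):
    """Run the DFA on `word`; a missing transition rejects the word."""
    state = dfa_start
    for ch in word:
        state = dfa_delta.get((state, ch))
        if state is None:
            return False
    return state in dfa_finals
```

For example, the NFA `{(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}` with start state 0 and final state 2 recognizes strings over `a` and `b` that end in `ab`. The construction turns its ambiguous transition on `a` into the single merged state `{0, 1}`.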

E. Conversion from DFA to SFA

Once we obtain the DFA, we must check whether it can be simplified. The simplification algorithm reduces the number of states and improves the efficiency of the system. The steps are as follows:

Figure 2.3. DFA transition diagram.

Step 1. Construct transition table.

Step 2. Partition states according to final and non-final states.

Step 3. Rename components.

Step 4. Find states that can be merged into a new state.

Step 5. For each component of the previous partition, partition the component according to the next states.

Step 6. Repeat Step 3 to Step 5 until the number of components no longer changes.

Step 7. Rename states, construct table, and draw diagram.

The practical steps for our system design are described in Chapter 3.
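The partition-based simplification can be sketched as a standard partition refinement (Moore's algorithm). This is an illustration under stated assumptions, not the code of our system: the DFA is a dictionary that maps `(state, symbol)` to one next state, every state is assumed reachable, and the function name is hypothetical. Components are renamed to integer indices.

```python
def minimize_dfa(states, alphabet, delta, start, finals):
    """Merge equivalent DFA states.

    Returns (number_of_components, new_delta, new_start, new_finals),
    where the new states are integer component indices.
    """
    states = list(states)
    finals = set(finals)
    # Partition the states into final and non-final components.
    partition = [
        block
        for block in (
            frozenset(s for s in states if s in finals),
            frozenset(s for s in states if s not in finals),
        )
        if block
    ]
    while True:
        block_of = {s: i for i, block in enumerate(partition) for s in block}
        # Split each component according to the components of its next states.
        refined = {}
        for s in states:
            signature = (block_of[s],) + tuple(
                block_of.get(delta.get((s, a))) for a in alphabet
            )
            refined.setdefault(signature, set()).add(s)
        new_partition = [frozenset(block) for block in refined.values()]
        # Refinement only splits components, so an unchanged count means a fixed point.
        if len(new_partition) == len(partition):
            break
        partition = new_partition
    # Rename the states and rebuild the transition table.
    new_delta = {(block_of[s], a): block_of[t] for (s, a), t in delta.items()}
    new_finals = {block_of[s] for s in finals}
    return len(partition), new_delta, block_of[start], new_finals
```

As a usage example, a three-state DFA for "strings ending in a" with two interchangeable final states collapses to two components.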

2.3 Spatiotemporal Attention Neural Network

The spatiotemporal attention (STA) neural network [31] is configured as a two-layer network, with one layer for input and one layer for output. The extracted information serves as the input stimuli to an STA network embedded in the perceptual analyzer. The output layer is also referred to as the attention layer. Neurons in the attention layer are arranged into a two-dimensional (2D) array, in which they are interconnected. No direct links connect input neurons to each other, but each neuron is part of the two-layer network. Assume that a 2D Gaussian G (see Figure 2.4) is centered at an attention neuron. A weight value links an input neuron with the attention neuron at the center of the Gaussian G. If consistent stimuli repeatedly innervate the neural network, a focus of attention is established in the network.


Figure 2.5 shows the activation of an attention neuron in response to an input stimulus. If the input to the neuron is greater than that neuron’s activation threshold within a time interval ΔT, the neuron requires ΔTrise time to reach maximum activation and decays over a time of approximately ΔTdecay.

The equation for this activation curve is formulated as follows:

$$STA(x, y, t) = \begin{cases} \min\left(\rho,\ STA(x, y, t_0) + \rho \cdot \left(1 - e^{-\sigma \cdot (t - t_0)}\right)\right), & \text{if } A(x, y, t) > T_a \\ STA(x, y, t - 1) - \omega, & \text{otherwise} \end{cases} \tag{2.9}$$

where $\rho$ is the maximum activation, $\sigma$ controls the rate of rise, and $\omega$ controls the rate of decay. In addition, $t_0$ is the start time at which the STA neuron in position $(x, y)$ receives an activation $A(x, y, t_0)$ larger than the threshold $T_a$.
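To make the rise and decay behavior of Equation 2.9 concrete, the following sketch simulates a single attention neuron over a sequence of stimuli. It is an illustration only. Two points are assumptions that the equation does not state: the activation just before the stimulus starts is taken as $STA(x, y, t_0)$, and the decayed activation is clamped at zero. The function name and the parameter values in the usage note are hypothetical.

```python
import math

def sta_activation(stimulus, threshold, rho, sigma, omega):
    """Activation trace of one attention neuron for a stimulus sequence A(t)."""
    trace = []
    prev = 0.0   # STA(x, y, t - 1)
    base = 0.0   # STA(x, y, t0), frozen when a stimulus run begins
    t0 = None    # start time of the current above-threshold run
    for t, a in enumerate(stimulus):
        if a > threshold:
            if t0 is None:
                # Assumption: the activation just before the run is used as STA(t0).
                t0, base = t, prev
            # Rising branch of Eq. 2.9, capped at the maximum activation rho.
            value = min(rho, base + rho * (1.0 - math.exp(-sigma * (t - t0))))
        else:
            t0 = None
            # Decay branch of Eq. 2.9; clamping at zero is an added assumption.
            value = max(prev - omega, 0.0)
        trace.append(value)
        prev = value
    return trace
```

With `sta_activation([0, 1, 1, 1, 0, 0], threshold=0.5, rho=1.0, sigma=1.0, omega=0.3)`, the activation rises toward `rho` while the stimulus stays above the threshold and then falls by `omega` per time step.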

To detect STA-salient objects in a video sequence, a low-color image and a high-color image are first extracted from the input video sequence. A high-color (low-color) image at time t preserves the maximum (minimum) color values of the input video sequence up to time t. A distinct spatial difference image is then computed for each input in the STA neural module. Then, we calculate the temporal difference (derivative) image for each spatial difference image. The resulting temporal difference images serve as inputs to the STA neural network. The process flowchart is shown in Figure 2.6.

Figure 2.4. STA network.

Figure 2.5. Activation of an attention neuron in response to stimuli.

2.4 Multiple Kernel Learning

Multiple kernel learning (MKL) is a method that has been shown to produce excellent classification results when dealing with heterogeneous data from different sources of information with their own dimensions and ranges, especially in large sample spaces. MKL is used in our VD subsystem to improve the accuracy of shot selection.

If the problem is nonlinear, instead of trying to fit a nonlinear model to discriminate the data, we can map the problem to a new space through a nonlinear transformation with a suitably chosen mapping function and then use a linear model in the new space (see Figure 2.7). Assume that the new dimensions are calculated through the mapping function $z = \phi(x)$, which maps from the $x$ space to the $z$ space.

Figure 2.6. Flowchart of STA image acquisition.


Given a sample $X = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, a binary classifier can be trained by solving the following quadratic optimization problem:

$$\min_{\mathbf{w}, b, \xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b) \ge 1 - \xi_i, \tag{2.10}$$

where $C$ is a predefined positive trade-off parameter between model simplicity and classification error and $\xi$ is the vector of slack variables. Instead of solving this optimization problem directly, we use the Lagrangian dual function to obtain the following dual formulation:

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C,\ \ \forall i \in \{1, 2, \dots, N\}, \tag{2.11}$$

where $\alpha$ is the vector of dual variables corresponding to each separation constraint.

The idea of a kernel machine is to replace the inner product of mapping functions, $\phi(x_i) \cdot \phi(x_j)$, by a kernel function $K(x_i, x_j)$. Kernels are generally considered to be measures of similarity in the sense that $K(x_i, x_j)$ takes a larger value as $x_i$ and $x_j$ are more "similar" from the point of view of the application. The optimization process applies the following dual formulation:

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C. \tag{2.12}$$

Figure 2.7. Mapping to a new space.


Moreover, the optimization process of a general kernel, which is represented in matrix form, is as follows:

$$\max_{\alpha,\ \alpha^T y = 0}\ \alpha^T e - \frac{1}{2} \alpha^T K \alpha \quad \text{s.t.} \quad 0 \le \alpha_i \le C \tag{2.13}$$

where $K$ is the kernel matrix and $e$ is the vector of ones. The kernel matrix $K$ represents the similarity between all pairs of data points. Because the elements of the kernel matrix are inner products obtained from pairwise comparison, the kernel matrix is symmetric positive semidefinite, and the set of such matrices forms a convex cone. This gives a useful property of the kernel matrix: any symmetric positive semidefinite matrix specifies a kernel matrix, and every kernel matrix is a symmetric positive semidefinite matrix.

It is possible to construct new kernels by combining multiple simpler kernels. In this way, we can fuse heterogeneous information from different sources. Each kernel measures similarity according to its domain. Assuming that $M$ different sources exist and that each source has its own base kernel matrix $K_m$, the kernel matrix $K$ is defined as:

$$K = \sum_{m=1}^{M} \beta_m K_m \quad \text{subject to} \quad \beta_m \ge 0, \tag{2.14}$$

where $m = 1, \dots, M$. This is called multiple kernel learning: a single kernel is replaced with a weighted sum, and the kernel matrix $K$ is synthesized as a linear combination of the $M$ base kernel matrices with kernel coefficients $\beta_m$.
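The weighted combination of Equation 2.14 can be illustrated numerically. The sketch below builds two base kernel matrices on the same data and combines them with non-negative weights. The choice of a linear and a Gaussian (RBF) kernel, the weights, the random data, and all function names are assumptions made only for this example. The smallest eigenvalue of the result is computed to show that the combination remains a valid kernel matrix.

```python
import numpy as np

def linear_kernel(A, B):
    """Base kernel 1: plain inner products between rows of A and B."""
    return A @ B.T

def rbf_kernel(A, B, gamma=0.5):
    """Base kernel 2: Gaussian similarity exp(-gamma * squared distance)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def combine_kernels(kernel_matrices, beta):
    """Weighted sum of base kernel matrices with non-negative weights (Eq. 2.14)."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0):
        raise ValueError("kernel weights beta_m must be non-negative")
    return sum(b * K for b, K in zip(beta, kernel_matrices))

# Hypothetical data: 6 samples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K1 = linear_kernel(X, X)
K2 = rbf_kernel(X, X)
K = combine_kernels([K1, K2], beta=[0.7, 0.3])

# A valid kernel matrix is symmetric with no negative eigenvalues.
min_eig = np.linalg.eigvalsh(K).min()
```

In MKL the weights are learned rather than fixed by hand as they are here.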

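For training, the dual formulation of MKL with multiple kernels can be defined in matrix form (rewritten from Equation 2.13) as:

$$\min_{K = \sum_{m} \beta_m K_m} \left( \max_{\alpha,\ \alpha^T y = 0}\ \alpha^T e - \frac{1}{2} \alpha^T \left( \sum_{m=1}^{M} \beta_m K_m \right) \alpha \right) \quad \text{s.t.} \quad 0 \le \alpha_i \le C \tag{2.15}$$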

where $\alpha_i$ is the sample coefficient and $\beta_m$ is the kernel weight. After training, $\beta_m$ takes values that depend on how useful the corresponding kernel is in discriminating. For an input $x$, considering classification with $N$ training samples $\{x_i, y_i \in \pm 1\}_{i=1}^{N}$ and $M$ base kernels $\{K_m\}_{m=1}^{M}$, the learned model is of the form:

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b = \sum_{i=1}^{N} \alpha_i y_i \sum_{m=1}^{M} \beta_m K_m(x_i, x) + b. \tag{2.16}$$

The sample coefficients describe the relation between data and classes: $\alpha_i$ is the weight for the $i$th datum. The kernel coefficients describe the relation between classes and features: $\beta_m$ is the weight for base kernel matrix $K_m$.
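The learned model of Equation 2.16 can be evaluated directly once the coefficients are known. The sketch below is illustrative only: the function name, the two base kernels, and the numerical values of $\alpha_i$, $\beta_m$, and $b$ in the usage note are hypothetical and are not the result of any training.

```python
import numpy as np

def mkl_decision(x, X_train, y_train, alpha, beta, kernels, b=0.0):
    """Evaluate f(x) = sum_i alpha_i y_i sum_m beta_m K_m(x_i, x) + b (Eq. 2.16)."""
    x = np.asarray(x, dtype=float)
    score = b
    for x_i, y_i, a_i in zip(np.asarray(X_train, dtype=float), y_train, alpha):
        # Combined kernel value between training sample x_i and the input x.
        k = sum(beta_m * K_m(x_i, x) for beta_m, K_m in zip(beta, kernels))
        score += a_i * y_i * k
    return float(score)

def linear(u, v):
    """Hypothetical base kernel 1: inner product."""
    return float(u @ v)

def quadratic(u, v):
    """Hypothetical base kernel 2: squared inner product."""
    return float(u @ v) ** 2
```

For instance, with training points `[[0.0], [2.0]]`, labels `[-1, 1]`, `alpha = [1.0, 1.0]`, and `beta = [0.5, 0.5]`, the input `[1.0]` obtains a positive score and would be assigned to the class labeled +1.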

The basic task of MKL is to find the sample coefficients $\alpha_i$ and the corresponding kernel coefficients $\beta_m$. In other words, the task is to optimize both the sample coefficients $\alpha_i$ and the kernel coefficients $\beta_m$ so that the error function is minimized and better data clustering results are achieved.

In a conventional method, the kernel coefficient $\beta_m$ could be obtained from machine-learning approaches such as the support vector machine (SVM). However, the optimization problem is too complex to solve directly; thus, an alternative approach is proposed here that uses an iterative method to obtain the optimized sample coefficient and kernel coefficient. More specifically, we solve for these two coefficients one at a time while the other is fixed. That is, we optimize $\alpha_i$ by fixing $\beta_m$ and optimize $\beta_m$ by fixing $\alpha_i$. In any odd-numbered iteration, a nearly optimal $\alpha_i$ is obtained by solving a generalized eigenvalue problem; we obtain $\beta_m$ by solving the relaxation of semidefinite programming (SDP). Next, in the even-numbered iteration, a nearly optimal $\beta_m$ is obtained by solving a generalized eigenvalue problem; we obtain $\alpha_i$ by solving the relaxation of SDP. In each iteration, we get closer to the optimal solution and then use this solution as the input to the next loop until convergence.

2.5 Counter Propagation Neural Network

Our counter propagation neural network (CPN) is used as a decision-making module for shot selection in our VD subsystem. The CPN is a supervised learning technique, and a real director provided the initial training input data. The most crucial feature of the CPN is its fast response time. The output of the CPN is the single shot that has the highest score.


A. Counter propagation neural network introduction

Figure 2.8 shows the architecture of a fully connected CPN. The network is constructed of five layers: two input layers, two output layers, and one hidden layer.

The training data is X. There are 𝑛 neurons in the X input layer, and the input data to neurons are denoted by 𝑥𝑖, where 𝑖 = 1, … , 𝑛. Another input layer with 𝑚 neurons takes the data Y, which is the labeled vector for X, denoted by 𝑦𝑘, where 𝑘 = 1, … , 𝑚 .

In the architecture, the hidden layer contains $p$ neurons; each neuron is denoted $R_j$, where $j = 1, \dots, p$. This hidden layer is also called the cluster layer. Each neuron in the hidden layer represents a class. The weight $w_{ji}$, which connects input neuron $x_i$ to hidden neuron $R_j$, is used to classify input $i$ into class $j$. In the same way, the weight $u_{jk}$, which connects input neuron $y_k$ to hidden neuron $R_j$, is used to classify input $k$ into class $j$.

After an input vector is classified, we can obtain the output result calculated from weight vectors 𝑣𝑘𝑗 and 𝑡𝑖𝑗, which directly connect the hidden neurons and the output layer. If we input vector 𝑥𝑖 to input layer X, the approximate output is 𝑦1, … , 𝑦𝑚; if we input vector 𝑦𝑖 to input layer Y, the approximate output is 𝑥1, … , 𝑥𝑛.

B. Forward-mapping CPN

Figure 2.8. Architecture of a fully connected CPN.


One of the features of a fully connected CPN is as follows: if the input to the network is an expected output result, one obtains an input vector corresponding to the result when there is a one-to-one mapping between input vector and output vector.

However, there could be different situations in our VD subsystem. For example, assume the director selects the hall view; the cause might be that the speaker is interacting with the audience or that the director wants to use an establishing shot to avoid an emergency.

Because both situations could motivate the director to choose the same shot, the mapping function should not be one-to-one. Thus, we simplified the CPN into a forward-only network, and the updated architecture, called a forward-mapping CPN, is shown in Figure 2.9.

Figure 2.9. Architecture of forward-mapping CPN applied in VD subsystem.


The forward-mapping CPN can be divided into two layers (Figure 2.10). The first layer, called the Kohonen layer, uses a winner-take-all learning algorithm to train the weight vector 𝑤𝑗𝑖. The Kohonen layer executes an unsupervised learning algorithm.

This layer is often used in classification. Each neuron in the hidden layer can represent a rule. Thus, the entire Kohonen layer can be seen as a rule library. The algorithm is as follows:

Step 1. Assume the score vector for training data is 𝑥1, … , 𝑥𝑛, where 𝑛 is the number of training data points.

Step 2. Calculate the likelihood between each class and the corresponding weight by:

$$d_j = \sum_{i=1}^{n} |x_i(t) - w_{ji}(t)|, \tag{2.17}$$

where $t$ denotes the $t$th training data point.

Step 3. Choose the minimum $d_j$ as the winner. Only the weight of the winner is updated.

$$d_{winner} = \min_{j=1,\dots,n} d_j. \tag{2.18}$$

Step 4. To avoid an overly large difference between weight and input, a threshold $\Delta$ is used here. If $d_{winner} < \Delta$, the weight and input are similar; go to Step 5. If $d_{winner} > \Delta$, the weight and input are not similar and a new rule is required; go to Step 6.

Step 5. Update the weight connected to the winner; the update function is:

$$w_{winner,i}(t + 1) = w_{winner,i}(t) + \alpha[x_{winner,i}(t) - w_{winner,i}(t)], \tag{2.19}$$

where $\alpha$ is the learning rate. Its initial value is $1/2$, and it decreases over the iterations to expedite convergence.

Figure 2.10. Kohonen layer (left) and Grossberg layer (right).

Step 6. If no class is found for an input, add a new neuron node to the hidden layer (see Figure 2.11). The initial weights for the new node are:

$$w_{p+1}(t + 1) = x(t), \tag{2.20}$$

$$u_{p+1,k}(t + 1) = y(t). \tag{2.21}$$

That is, the input value is used as the new weight $w_{p+1,i}$ and the expected (labeled) output value as the new weight $u_{p+1,k}$.
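The Kohonen-layer procedure, including the growth of the hidden layer, can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code of the VD subsystem. The function name, the geometric decay of the learning rate, the number of epochs, and the seeding of the rule library with the first sample are hypothetical choices. The case $d_{winner} = \Delta$, which the steps leave open, is treated here as "not similar".

```python
import numpy as np

def train_kohonen(samples, labels, delta, alpha=0.5, decay=0.9, epochs=10):
    """Winner-take-all training with dynamic node creation.

    Returns (W, U): one row per hidden neuron (rule). W holds the input
    weights w_ji and U holds the label weights u_jk.
    """
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assumption: the first training pair seeds the rule library.
    W = samples[:1].copy()
    U = labels[:1].copy()
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            d = np.abs(x - W).sum(axis=1)   # Eq. 2.17: distance to every rule
            winner = int(np.argmin(d))      # Eq. 2.18: the closest rule wins
            if d[winner] < delta:
                # Step 5 (Eq. 2.19): move only the winner toward the input.
                W[winner] += alpha * (x - W[winner])
            else:
                # Step 6 (Eqs. 2.20-2.21): create a new rule from this pair.
                W = np.vstack([W, x])
                U = np.vstack([U, y])
        alpha *= decay  # assumption: geometric decay of the learning rate
    return W, U
```

A smaller threshold `delta` produces more hidden neurons, that is, a finer rule library.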

The number of neurons in the hidden layer denotes the number of types that can be classified. The concept is the same as the shot selection of a real director. When real directors select a view, they must always determine the situation that it belongs to (e.g., an audience member is asking some questions, a speaker has a meaningful posture); directors classify the situation to which each selected shot belongs.

The second layer is the Grossberg layer. This layer uses the Grossberg supervised learning algorithm to train the weight vector 𝑢𝑗𝑘. The training algorithm of the Grossberg layer resembles the training processes of the Kohonen layer. The Grossberg algorithm is as follows:

Figure 2.11. Adding a new node to the CPN.


Step 1. Assume the score vector for training data is $x_1, \dots, x_n$, where $n$ is the number of training data points, $y_1, \dots, y_m$ are the expected results, and $m$ is the number of VCs (in this research $m = 3$).

Step 2. Calculate the likelihood between each class and the corresponding weight by:

$$d_j = \sum_{i=1}^{n} |x_i(t) - w_{ji}(t)|, \tag{2.22}$$

where $t$ denotes the $t$th training data point.

Step 3. Choose the minimum $d_j$ as the winner. Only the weight of the winner is updated.

$$d_{winner} = \min_{j=1,\dots,n} d_j \tag{2.23}$$

Step 4. Update the weights that connect to the winner; the update functions are:

$$w_{winner,i}(t + 1) = w_{winner,i}(t) + \alpha[x_{winner,i}(t) - w_{winner,i}(t)], \tag{2.24}$$

$$u_{winner,k}(t + 1) = u_{winner,k}(t) + \beta[y_{winner,k}(t) - u_{winner,k}(t)]. \tag{2.25}$$

Here $\alpha$ is the learning rate of the Kohonen layer. At this stage it is a constant, fixed at the last value reached in the Kohonen layer after convergence. $\beta$ is the learning rate of the Grossberg layer, and its initial value is also a constant.
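The Grossberg-layer updates and the final shot selection can be sketched in the same style. The code is illustrative only: the function names, the learning-rate values, and the number of epochs are assumptions. The hidden-layer weights `W` and `U` are assumed to come from a previously trained Kohonen layer. Selecting the shot as the largest component of the winner's output weights follows the statement that the output of the CPN is the shot with the highest score.

```python
import numpy as np

def train_grossberg(samples, labels, W, U, alpha=0.05, beta=0.1, epochs=20):
    """Refine W (Eq. 2.24) and train U (Eq. 2.25) with constant learning rates."""
    W = np.array(W, dtype=float)
    U = np.array(U, dtype=float)
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels, dtype=float)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Eqs. 2.22-2.23: the hidden neuron closest to the input wins.
            winner = int(np.argmin(np.abs(x - W).sum(axis=1)))
            W[winner] += alpha * (x - W[winner])   # Eq. 2.24
            U[winner] += beta * (y - U[winner])    # Eq. 2.25
    return W, U

def select_shot(x, W, U):
    """Forward pass: find the winning rule and return its highest-scoring shot."""
    x = np.asarray(x, dtype=float)
    winner = int(np.argmin(np.abs(x - W).sum(axis=1)))
    return int(np.argmax(U[winner]))
```

With three VCs, `select_shot` returns an index from 0 to 2 that identifies the selected shot.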


Chapter 3

Smart Lecture Recording System

The overall organization of the smart lecture recording (SLR) system is shown in Figure 3.1. The SLR system has three principal components: the virtual cameraman (VC), the virtual director (VD), and manual control (MC). The VC component is further divided into three sub-components: the speaker cameraman (SC), the audience cameraman (AC), and the hall cameraman (HC). This division is inspired by professional lecture recording teams, which generally have at least three cameramen who separately shoot the speaker, the listeners, and the panoramic scene. Section 3.1 discusses the kinds of lecture halls considered in this thesis, Section 3.2 describes the architecture of the SLR system, and Section 3.3 addresses the workflow of the system.

3.1 Lecture halls under consideration

The lecture halls under consideration range from ordinary classrooms and lecture theaters to grand oration halls. In addition to size, lecture halls are characterized by distinct arrangements of auditoriums. Figure 3.2 shows several halls with different seating arrangements: (a) tiered, (b) level, and (c) ambient auditoriums. Tiered auditoriums are often used for presentations held by academic institutes and organizations. Level auditoriums are often used for exhibitions of new products held by business firms or releases of new songs by music companies. Ambient auditoriums are primarily used for formal reports or speeches in congresses and parliaments. Our SLR system will be able to work in the sites mentioned above.

Figure 3.1. The organization of the SLR system.
