

that a machine can be utilized to handle this task. Traditionally, statistical machine translation was the common approach. In recent years, the application of neural networks has pushed performance to a new peak.

Summarization In the past, more attention was paid to extractive summarization, while abstractive summarization was rather rare. In view of the recent success of deep learning, research on abstractive summarization has been growing. Recent literature has preliminarily verified the effectiveness of RNNs for abstractive (rewriting-based) summarization of documents. Moreover, the contribution of the attention mechanism has also been noticed by many. The characteristic of this mechanism is that it can increase the weight of key segments while generating text, thereby composing a better summary.

2.2 Neural Networks

The NLP community is widely adopting Artificial Neural Networks (ANNs) for various topics in this field. Thus, we begin this section with an overview of neural networks and their elementary ingredients.

An ANN can be regarded as a series of functions strung together, the majority of which are non-linear. It is important to note that an ANN can learn to replicate linear or logistic regression, as well as other fundamental statistical machine learning models [40]. To illustrate, we consider the logistic regression problem as an example here. A multi-class logistic regression can be represented as:

f(x) = Wx + b
g(y) = softmax(y)

(2.2.1)


where W ∈ R^{C×d} denotes the weights of the ANN in matrix form, x ∈ R^d the input vector, b ∈ R^C the bias, and y ∈ R^C the intermediate output. C is the number of classes and d is the dimensionality of the input. We subsequently use θ as shorthand for {W, b}, the set of parameters of this ANN. As such, g(f(x)) represents the logistic regression as a composition of f and g. When we use an ANN as an approach to this problem, f is modeled by a fully connected layer and g by an activation function, in this case the softmax. Note that the activation functions of ANNs are non-linear. Typically, an ANN contains more than one "layer" of the above computation, hence the term "deep neural networks," or DNNs. These layers are connected or stacked together, with the output of one layer being the input of the next. The "hidden" layers, i.e., those between the input and output, typically use activation functions other than the softmax, e.g., ReLU or tanh. For the output layer, the softmax and sigmoid functions are commonly used, due to the assumption that the output of an ANN can be modeled as a categorical or Bernoulli distribution. Thus, both linear and logistic regression can be approximated by an ANN with just one layer, in which the former uses the identity function and the latter a non-linear activation function. Note that a fully connected feed-forward ANN is sometimes referred to as a multilayer perceptron (MLP) [64]. For example, an MLP with one hidden layer can be written as:

h = σ(W1x + b1)

y = softmax(W2h + b2),

(2.2.2)

where σ denotes the activation function. It is worth mentioning that the layers typically have independent weight matrices; here, the first hidden layer has weight matrix W1 and bias b1. The process of obtaining h, the output of a layer, and feeding it to the following layer as its input is called "forward propagation". As a result, the output of the final layer, i.e., y, can be regarded as the output of the whole ANN. Recall that, mathematically, the composition of more than one linear function is equivalent to another linear function. It follows that the model would have only limited expressive power if we built it from linear functions alone. Thus, the use of non-linear activation functions is at the heart of the success of deep neural networks.
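To make the above concrete, the following is a minimal NumPy sketch of the forward pass of the one-hidden-layer MLP in Equation 2.2.2; the dimensions and the tanh hidden activation are illustrative assumptions, not choices made elsewhere in this thesis.

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax over the last axis."""
    e = np.exp(y - np.max(y, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward propagation of Equation 2.2.2: one hidden layer, softmax output."""
    h = np.tanh(W1 @ x + b1)       # hidden layer with a non-linear activation
    return softmax(W2 @ h + b2)    # output layer as a categorical distribution

# Illustrative shapes: input dimensionality d = 4, hidden size 8, C = 3 classes.
rng = np.random.default_rng(0)
d, hidden, C = 4, 8, 3
W1, b1 = rng.normal(size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(C, hidden)), np.zeros(C)
p = mlp_forward(rng.normal(size=d), W1, b1, W2, b2)
print(p, p.sum())  # a probability vector over the C classes, summing to 1
```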

However, some more advanced ANNs are designed to share those weights, a technique sometimes referred to as "weight tying." It is often used as a way of reducing the number of parameters in a model, and it has the additional benefit of creating an inductive bias. Such bias may increase the ability of the model to generalize, and has been examined in previous work [62]. For example, a general pre-trained model can be transferred to various downstream tasks thanks to this generality.

This technique is widely used in current deep learning models.

2.2.1 Activation Functions

Activation functions are mathematical equations that determine the output of a cell in a neural network. Much like in the neural networks of organisms, this function is imposed upon each neuron in the network so that the output value represents its activated state. In addition, this function can sometimes act as a normalization of the output.

For current ANNs, we require an extra characteristic of the activation function: it must be differentiable (at least almost everywhere, as with ReLU). One of the most common functions is the sigmoid, given by:

σ(x) = 1 / (1 + e^{−x}) (2.2.3)

Another function, the softmax, takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input. Specifically,

softmax(x)_i = e^{x_i} / Σ_{j=1}^{K} e^{x_j}, for i = 1, …, K and x = (x_1, …, x_K) ∈ R^K (2.2.4)

The sigmoid and softmax are generally adopted at the last, i.e., output, layer of an ANN. As for the hidden layers in between, we mostly employ the rectified linear unit (ReLU) function, written as:

ReLU(x) = max(0, x) (2.2.5)

Yet another activation function, the hyperbolic tangent or tanh, is defined as tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}); it outputs values in the range (−1, 1).

Recently, other functions such as the Exponential Linear Unit (ELU) and the Gaussian Error Linear Unit (GELU) [19, 30] have been proposed. ELU aims to speed up the learning process of deep neural networks while retaining high classification accuracy.

Part of the ELU is similar to ReLU: the identity function handles the positive section of the input values, in order to tackle the vanishing gradient problem. On the other hand, ELU has the distinctive property of allowing negative values. This trait serves as a normalization factor very similar to the batch normalization process, shifting the mean activation of the units towards zero. But, unlike batch normalization, it requires no extra computational overhead.
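As a reference, the following sketch implements the activation functions discussed above in NumPy; the ELU is written in its standard form with a scale parameter α, which is our assumption rather than a definition taken from this thesis.

```python
import numpy as np

def sigmoid(x):
    """Equation 2.2.3: squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Equation 2.2.4, shifted by max(x) for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relu(x):
    """Equation 2.2.5: identity for positive inputs, zero otherwise."""
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, a saturating negative branch otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, relu, np.tanh, elu):
    print(f.__name__, f(x))
print("softmax:", softmax(x), softmax(x).sum())  # sums to 1
```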

2.2.2 Recurrent Neural Networks

Among different types of ANNs, the recurrent neural network (RNN) is one that is suitable for learning from sequential input [21]. Recurrent neural networks have recently become a popular solution for single-sequence as well as sequence-to-sequence tasks. In particular, prominent network structures such as the 'seq2seq' models proposed by [4, 15, 65] are increasingly being applied to a wide variety of problems. Moreover, some tasks that were considered difficult, including machine translation (MT) and language modeling (LM), are seeing explosive advances when deep neural networks are incorporated [38, 46]. Typically, an encoder-decoder scheme is adopted to deal with these tasks, where the input sequence is encoded by an encoder, and the subsequent decoder generates a (sequential) output.

Alternatively, we can view an RNN as a feed-forward neural network in which all layers share the same set of parameters. Note, however, that rather than having a fixed number of layers, the 'depth' (number of layers) of this type of NN depends on the length of the input sequence: each element of the input sequence can be treated as the input of one layer. More formally, an RNN maintains a vector h_t, a hidden state or memory stored at each time step t. Upon receiving input at time step t, the network updates its state as follows:

h_t = σ_h(W_h x_t + U_h h_{t−1} + b_h)
y_t = σ_y(W_y h_t + b_y)

(2.2.6)

where σ_h and σ_y are activation functions. We can see from the above formulation that the weight matrix U_h transforms the previous hidden state h_{t−1}, and W_h the current input x_t; a bias term b_h can also be added. These calculations update the state vector h_t. Subsequently, the RNN produces an output y_t from the hidden state through the output weights W_y. However, as we can infer from the above formulation, an RNN can be less effective at modeling a sequential input once the number of time steps exceeds a certain amount.
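The following is a minimal NumPy sketch of the recurrence in Equation 2.2.6, unrolled over an input sequence; the tanh state activation, the identity output activation, and the dimensions are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, Wh, Uh, bh, Wy, by):
    """Unroll Equation 2.2.6 over a sequence xs of input vectors."""
    h = np.zeros(Uh.shape[0])               # initial hidden state h_0
    ys = []
    for x in xs:                            # one 'layer' per time step
        h = np.tanh(Wh @ x + Uh @ h + bh)   # update the state vector h_t
        ys.append(Wy @ h + by)              # emit an output y_t
    return np.array(ys), h

# Illustrative sizes: input dim 3, hidden dim 5, output dim 2, sequence length 4.
rng = np.random.default_rng(0)
Wh, Uh, bh = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
Wy, by = rng.normal(size=(2, 5)), np.zeros(2)
ys, h_last = rnn_forward(rng.normal(size=(4, 3)), Wh, Uh, bh, Wy, by)
print(ys.shape, h_last.shape)  # (4, 2) outputs and the final hidden state
```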

Moreover, an RNN depends on the previous calculation results to produce the next one, which limits parallelization.

Consequently, an increasing amount of effort has been devoted to finding a mechanism that can replace recurrence. One possible direction of research is to use attention [4]. The motivation behind this approach is that it combines the efficiency of attention computation with the ability to learn positional information. It has been shown [70] to achieve outstanding performance on a multitude of language pairs in MT. We will introduce attention-based models later in this chapter.

2.2.3 Long Short-term Memory

A neural network cell named Long Short-term Memory, or LSTM, was proposed in an attempt to remedy the deficiency of simple RNNs in learning longer sequences [27, 31]. Experimental results evidence that an LSTM can remember a longer span of the sequential input than traditional RNNs can. This trait is especially important for NLP applications.

In essence, an LSTM is an augmented RNN with extra weights that determine the amount of information to "remember" as well as "forget." This is done through the implementation of a forget gate f_t, an input gate i_t, and an output gate o_t. The values of all these gates depend on the input x_t and the previous output h_{t−1}. In this way, the cell learns to determine the portion of its state to keep or discard through the forget gate. Formally, let x_t be the current input, h_{t−1} the previous output, and c_t the current cell state. Using the following formulas, the model learns what to forget from the past and what to remember in the moment.

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t
h_t = o_t ∘ tanh(c_t)

(2.2.7)

where σ designates the sigmoid function and "∘" denotes the element-wise (Hadamard) product between vectors.
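As an illustration, here is a minimal NumPy sketch of a single LSTM step implementing Equation 2.2.7; the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of Equation 2.2.7. W, U, b hold per-gate parameters keyed
    by 'i' (input), 'f' (forget), 'o' (output), and 'c' (candidate)."""
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])
    c = f * c_prev + i * c_tilde   # keep part of the old state, add new content
    h = o * np.tanh(c)             # expose a gated view of the cell state
    return h, c

# Illustrative sizes: input dim 3, hidden dim 4.
rng = np.random.default_rng(0)
gates = 'ifoc'
W = {g: rng.normal(size=(4, 3)) for g in gates}
U = {g: rng.normal(size=(4, 4)) for g in gates}
b = {g: np.zeros(4) for g in gates}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
print(h, c)
```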

Graves et al. [26] proposed an intuitive extension to the LSTM, called the Bidirectional LSTM, or BiLSTM. In essence, it involves creating two separate LSTMs, of which one receives the original input sequence and the other a reversed copy. These two LSTMs learn to model the sequential input separately and independently. This model has since been widely adopted across NLP models [79].
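The idea can be sketched as follows; for brevity this substitutes a simple tanh recurrence for a full LSTM cell, and concatenating the two directions' states per time step is one common (assumed) way of combining them.

```python
import numpy as np

def rnn_states(xs, W, U, b):
    """Hidden states of a simple tanh recurrence, standing in for an LSTM."""
    h, hs = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        hs.append(h)
    return np.array(hs)

def bilstm_states(xs, fwd, bwd):
    """Run one recurrence forward and one backward over the sequence,
    then concatenate the states that correspond to the same time step."""
    h_fwd = rnn_states(xs, *fwd)
    h_bwd = rnn_states(xs[::-1], *bwd)[::-1]   # re-align to original order
    return np.concatenate([h_fwd, h_bwd], axis=-1)

rng = np.random.default_rng(0)
make = lambda: (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))
print(bilstm_states(rng.normal(size=(5, 3)), make(), make()).shape)  # (5, 8)
```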

2.2.4 Training

The training (learning) phase of a neural network currently relies on stochastic gradient descent (SGD). However, a typical network consists of more than one layer, which prevents us from directly calculating the gradient of the loss function with respect to every parameter. Thus, a dynamic-programming procedure named "back-propagation," or "backprop" (BP), is commonly adopted [63].

BP adopts the chain rule of calculus to determine the gradients. Let x ∈ R^m be the input to a neural network and y ∈ R^n be the output of the penultimate layer; we can formulate the network as:

y = g(x)
z = f(y) = f(g(x))

(2.2.8)

where z is the scalar output of the network. The gradient of z with respect to every element x_i of x can be written as:

∂z/∂x_i = Σ_j (∂z/∂y_j) (∂y_j/∂x_i) (2.2.9)

Since we are working with vectors, the gradient ∇_x z can be computed by the following multiplication:

∇_x z = (∂y/∂x)^⊤ ∇_y z (2.2.10)

where ∂y/∂x ∈ R^{n×m} denotes the Jacobian matrix of g, which contains all the partial derivatives ∂y_j/∂x_i. As described in [25], for every operation in the forward pass of the NN, BP derives such a Jacobian-gradient product. Let the function

J = L(ŷ, y) (2.2.11)

be the loss function of a certain task performed by a neural network, which we need to minimize.

This NN has K layers with weights W_k and biases b_k, where k ∈ {1, …, K}. First we perform the forward calculation on the input x, starting from the first layer and ending at the last (output) layer, yielding the output vector ŷ. Then, we obtain the loss J. BP then acts in reversed order: it starts by calculating the gradient ∇_ŷ J with respect to the output ŷ.

Subsequently, for all previous layers, it obtains the partial derivatives with respect to the parameters W_k and b_k of each layer until the first one is reached. Note that, in such a process, the derivatives of the deeper layers (those close to the output) must be obtained before the shallower layers can be considered, because the values of the deep layers depend on those from the shallow layers. Finally, the SGD algorithm applies the gradients to the parameters, completing the optimization step. Typically, this procedure is repeated for a certain number of "epochs," i.e., traversals of the entire training dataset.
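To make the procedure concrete, the following sketch trains the one-hidden-layer network of Equation 2.2.2 (with an identity output for regression) on a single toy example, deriving the gradients by hand via the Jacobian-gradient products of Equations 2.2.9–2.2.10; the squared-error loss and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, out = 3, 5, 2
W1, b1 = rng.normal(size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(out, hidden)), np.zeros(out)
x, y = rng.normal(size=d), rng.normal(size=out)
lr = 0.1

for step in range(100):
    # Forward pass, first layer to last.
    z1 = W1 @ x + b1
    h = np.tanh(z1)
    y_hat = W2 @ h + b2
    J = 0.5 * np.sum((y_hat - y) ** 2)  # an instance of the loss J in (2.2.11)

    # Backward pass: Jacobian-gradient products, deepest layer first.
    g = y_hat - y                       # ∇_ŷ J
    dW2, db2 = np.outer(g, h), g
    g = W2.T @ g                        # push the gradient through W2 (2.2.10)
    g = g * (1.0 - h ** 2)              # through tanh: d tanh(z)/dz = 1 − tanh²(z)
    dW1, db1 = np.outer(g, x), g

    # SGD step: apply the gradients to the parameters.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", J)  # near zero after fitting the single example
```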

Note that, when training RNNs, the gradients must be passed along the time-step axis rather than through the depth-wise procedure of other types of NNs. To do so, we employ a technique known as back-propagation through time (BPTT) [73]. However, the practical limitations of hardware prevent us from running BP indefinitely, so we normally restrict BP to a window of a certain width on the time axis. Interestingly, one might think that a wider window would help the RNN to see and model a wider context, whereas in practice we often find that the network becomes unable to learn anything. Upon careful inspection, the problem of exploding or vanishing gradients is found in these situations. Consider for a moment the forward operation of an RNN, in which the state vector is continuously updated by multiplication with the weights. When BP is in action, the gradients also undergo the same multiplication many times, so they may become exceedingly large (explode) or exceedingly small (vanish). As a result, the network cannot be optimized. The LSTM (Section 2.2.3) was proposed to alleviate this exploding or vanishing gradient problem.
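A tiny numerical sketch of this effect: repeatedly multiplying a gradient by the recurrent weight matrix (ignoring the non-linearity for simplicity, which is an assumption) makes its norm collapse or blow up, depending on the spectral radius of the weights.

```python
import numpy as np

def gradient_norm_through_time(U, steps):
    """Norm of a gradient pushed back through `steps` multiplications by U^T,
    mimicking BPTT through a linear recurrence."""
    g = np.ones(U.shape[0])
    norms = []
    for _ in range(steps):
        g = U.T @ g
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))
base /= np.max(np.abs(np.linalg.eigvals(base)))   # rescale to spectral radius 1
for scale in (0.8, 1.2):                          # contractive vs. expansive weights
    norms = gradient_norm_through_time(scale * base, steps=50)
    print(f"radius {scale}: norm after 50 steps = {norms[-1]:.3e}")  # vanishes / explodes
```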
