
to learn anything. After careful inspection, it was found that the problem of exploding or vanishing gradients exists in these situations. Consider for a moment the forward operation of an RNN, in which the state vector is repeatedly updated through multiplication with the weights. When backpropagation is in action, the gradients undergo the same repeated multiplication. These gradients can therefore become exceedingly large (explode) or exceedingly small (vanish), and as a result the network cannot be optimized. The LSTM (mentioned in Section 2.2.3) was therefore proposed to alleviate the exploding and vanishing gradient problem.
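To make the repeated-multiplication argument concrete, the short sketch below (not taken from any implementation in this thesis; the scalar weights 1.1 and 0.9 and the step count are arbitrary choices) shows how a gradient that is rescaled by the same recurrent weight at every time step either explodes or vanishes:

```python
def gradient_after_steps(weight: float, steps: int) -> float:
    # The gradient starts at 1 and is rescaled by the same recurrent
    # weight at every time step, mimicking backpropagation through time.
    grad = 1.0
    for _ in range(steps):
        grad *= weight
    return grad

for w in (1.1, 0.9):  # weights slightly above and below 1 (arbitrary choices)
    print(f"w = {w}: gradient after 100 steps = {gradient_after_steps(w, 100):.3e}")
# w = 1.1 -> about 1.4e+04 (explodes); w = 0.9 -> about 2.7e-05 (vanishes)
```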

2.3 Attention Mechanisms

The “attention” in neural networks can be thought of as a type of weighting. There are multiple functions for obtaining attention scores, but the additive method [4] and the multiplicative method [46] are among the most widely used. The multiplicative method is sometimes scaled by a factor of $\frac{1}{\sqrt{d_k}}$, which is the variant used in BERT-related models. In particular, this commonly used form is called “Scaled Dot-Product Attention”. Consider that, in an attention block, the input is a series of vectors named “queries” and “keys” of dimension $d_k$, and “values” of dimension $d_v$. The dot products between a query and all keys are calculated and normalized by dividing by $\sqrt{d_k}$. Finally, the softmax function is applied to obtain the weights on the values.

Note that we can parallelize the attention calculation by computing it over a batch of inputs at the same time. More specifically, the query, key, and value vectors are collated into matrices $Q$, $K$, and $V$, respectively. We then perform matrix operations to obtain the outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (2.3.1)$$
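As an illustration of Eq. 2.3.1, the following NumPy sketch computes scaled dot-product attention for a single batch; the function name and the matrix shapes (3 queries, 4 key-value pairs, $d_k = d_v = 8$) are assumptions made for the example, not settings used in this thesis:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. 2.3.1: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of the values

# Illustrative shapes (assumed): 3 queries, 4 keys/values, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```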


Conceptually, an attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

On the other hand, additive attention employs a fully connected NN to determine the compatibility or similarity between two vectors. The score of additive attention is calculated as follows:

$$\mathrm{Attention}_{add}(s_t, h_i) = v_a^{\top}\tanh\!\left(W_a[s_t; h_i]\right) \qquad (2.3.2)$$

where $s_t$, $h_i$ are state vectors and $v_a$, $W_a$ are parameters that the model needs to learn. Comparing the two definitions, we can see that the matrix multiplication in dot-product attention can take advantage of modern GPU hardware to perform fast calculations. This is one of the main reasons for recent improvements in using attention for language modeling.
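The following sketch evaluates the additive score of Eq. 2.3.2 for a single pair of state vectors, assuming randomly initialized stand-ins for the learned parameters $v_a$ and $W_a$ and arbitrary vector dimensions:

```python
import numpy as np

def additive_attention_score(s_t, h_i, W_a, v_a):
    """Eq. 2.3.2: v_a^T tanh(W_a [s_t; h_i]) -> a scalar compatibility score."""
    concat = np.concatenate([s_t, h_i])   # [s_t; h_i]
    return v_a @ np.tanh(W_a @ concat)

# Illustrative dimensions (assumed): state size 8, attention hidden size 10.
rng = np.random.default_rng(1)
s_t, h_i = rng.normal(size=8), rng.normal(size=8)
W_a = rng.normal(size=(10, 16))           # stand-in for the learned projection
v_a = rng.normal(size=10)                 # stand-in for the learned score vector
print(additive_attention_score(s_t, h_i, W_a, v_a))
```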

2.3.1 Self-Attention

Recently, Vaswani et al. [70] proposed the “Transformer” model, a novel architecture that relies solely on self-attention to learn vector representations of a sequence. The heart of the Transformer is the multi-head self-attention mechanism, which transforms the input vectors into a representation formed by multiple mixtures learned by the model. For each head, the input is first linearly projected by a set of three weight matrices, as in the previous section, into $(Q, K, V)$. Then, the attention weights are calculated using the dot-product attention in Eq. 2.3.1. It is called “self”-attention because the attended elements are the input sequence itself. Moreover, note that the number of heads effectively indicates the number of weight-matrix sets. Figure 2.1 shows a schematic of the multi-head attention module in the Transformer model.

Figure 2.1: The multi-head attention module in a Transformer block.

The trait of this type of model is that it frees us from the recurrent components of previous neural networks, and even from convolution calculations. The Transformer relies entirely on attention weights to model the global correspondence between input and output symbols (words). Thus, the degree of parallelization can be much higher than in RNNs. This efficiency is demonstrated by training a machine translation system that achieves state-of-the-art results in just 12 hours.
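A minimal sketch of the multi-head projection scheme described above is given below; it reuses the scaled_dot_product_attention function from the earlier sketch, and the head count, dimensions, and per-head weight lists are illustrative assumptions rather than the configuration of [70]:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv: one (d_model, d_head) matrix per head."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        # "Self"-attention: queries, keys, and values are all projections of X.
        Q, K, V = X @ wq, X @ wk, X @ wv
        heads.append(scaled_dot_product_attention(Q, K, V))  # Eq. 2.3.1 per head
    return np.concatenate(heads, axis=-1) @ Wo               # concat heads, project

# Illustrative sizes (assumed): 5 tokens, d_model = 16, 4 heads of size 4.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 16))
Wq = [rng.normal(size=(16, 4)) for _ in range(4)]
Wk = [rng.normal(size=(16, 4)) for _ in range(4)]
Wv = [rng.normal(size=(16, 4)) for _ in range(4)]
Wo = rng.normal(size=(16, 16))
print(multi_head_self_attention(X, Wq, Wk, Wv, Wo).shape)    # (5, 16)
```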

However, this type of model does not come without weaknesses. First, it cannot directly learn the order of the input, since there is no recurrent state or convolution. Therefore, a new type of embedding, the Position Embedding, is incorporated to represent the relative order of the elements in the input sequence. It is combined with the other embeddings as follows.

$$x_i = (\mathrm{emb}^{word}_i \oplus \mathrm{emb}^{tag}_i) + \mathrm{emb}^{pos}_i \qquad (2.3.3)$$

where $\mathrm{emb}^{pos}_i$ is called the Position Embedding of the $i$-th position.


We use sine and cosine functions of different frequencies to compose the position embeddings, as stated in [70]. The position embedding is added to the word embedding, which represents the linguistic information of the word, and its dimension is identical to that of the other embeddings so that they are compatible.

Specifically, these functions are used to encode the position information [70]:

$$\mathrm{emb}^{pos}_{2i} = \sin\!\left(pos/10000^{2i/d_{model}}\right), \qquad \mathrm{emb}^{pos}_{2i+1} = \cos\!\left(pos/10000^{2i/d_{model}}\right) \qquad (2.3.4)$$

where $pos$ is the position and $i$ is the dimension. We can interpret this formulation as assigning a sinusoid to each dimension of the position embedding. The wavelengths of these functions form a geometric sequence (a sequence of numbers in which each term is obtained by multiplying the previous one by a fixed, non-zero ratio) from $2\pi$ to $10000 \cdot 2\pi$. It is hypothesized that this formulation can incorporate the relative positions of the input elements into the embeddings, for the following reason: given the position embedding at position $pos$, the embedding at position $pos + k$ can be expressed as a linear function of it for any fixed offset $k$.
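The sketch below builds the sinusoidal position embeddings of Eq. 2.3.4 for all positions at once, assuming an even $d_{model}$; the function name and the sizes are chosen only for illustration:

```python
import numpy as np

def sinusoidal_position_embeddings(max_len, d_model):
    """Eq. 2.3.4: even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(max_len)[:, None]                # positions, shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # dimension index, (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    emb = np.zeros((max_len, d_model))
    emb[:, 0::2] = np.sin(angles)                    # components at dimension 2i
    emb[:, 1::2] = np.cos(angles)                    # components at dimension 2i+1
    return emb

print(sinusoidal_position_embeddings(max_len=50, d_model=16).shape)  # (50, 16)
```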

It is worth noting that there are various means of learning positional information [23].

Learned positional embeddings [23] were also examined, and the results showed that the two choices perform virtually identically [70]. The sine and cosine functions were eventually selected because of their potential for modeling indefinitely long sequences that may not be present in the training data.

The characteristics of these functions can help the network extrapolate to unseen lengths.

