Dealing with Long-Term Sequence - Era of Deep Learning

2. Related Work

2.2. Era of Deep Learning

2.2.1. Dealing with Long-Term Sequence

a). RNN-based Approach

When modeling a time-related sequence, recurrent neural networks (RNNs) are a natural choice. Vanilla RNNs can be hard to train for the same reason as very deep networks: as the sequence grows longer, the gradient can vanish quickly. To address this, many improvements to RNNs have been proposed, such as long short-term memory (LSTM) [47], bidirectional LSTM [43], and gated recurrent neural networks [48].

These models have succeeded in preserving the gradients and have been firmly established as state-of-the-art approaches in many sequence modeling tasks, such as language modeling and machine translation [49][50]. Since then, numerous efforts have continued to push the boundaries of RNNs and encoder-decoder architectures [51][52]. Although RNNs are good at modeling time sequences, their computation is inherently sequential: each time step generates a hidden state, each state depends on the output of the previous state, and the prediction for the current time step is made from it. This dependency makes it difficult to compute the output sequence in parallel within a training example, which limits the ability to model longer sequences because training would take too long. To alleviate this problem, practices such as factorization [53] have shown a significant improvement in computational efficiency, and the work in [54] leverages conditional computation to further improve performance.
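To make the sequential dependency concrete, the following is a minimal NumPy sketch of a vanilla RNN forward pass. The function name `rnn_forward`, the weight names, and the tanh activation are assumptions chosen for illustration and are not taken from the cited works. Each hidden state is computed from the previous one, so the loop over time steps cannot be parallelized.

```python
import numpy as np


def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over x of shape (T, input_dim).

    Returns the hidden states, shape (T, hidden_dim).
    All names here are hypothetical, chosen for this sketch.
    """
    T = x.shape[0]
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)            # initial hidden state
    states = np.empty((T, hidden_dim))
    for t in range(T):
        # h at step t needs h from step t - 1: the loop is inherently serial.
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        states[t] = h
    return states


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))             # 5 time steps, 3 input features
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(4, 4))
b_h = np.zeros(4)
states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4)
```

The running time of the loop grows linearly with the sequence length regardless of how many cores are available, which is the limitation described above.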

b). Attention-based Approach

Recent efforts have addressed this computational limitation with the so-called attention mechanism [55][56]. The basic idea of attention is to allow each position of the state sequence to attend to every other state, which gives it full sight of the state history. This powerful property relaxes the hard constraint of the RNN model, which can only depend on the single previously generated state. In [57], a novel multi-head attention mechanism is proposed that combines the idea of parallelism with the advantages of attention.

The model architecture is composed purely of stacked multi-head attention layers, without other types of computation such as convolution or recurrent units. This is a milestone in that it introduces a new fundamental computation unit alongside the arithmetic units into which most models can be broken down, such as linear transformation and convolution, and it exhibits impressive ability compared with conventional approaches. Assuming that the input sequence has length 𝑙, the formula of multi-head attention can be written as follows:

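$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{n}}\right) V \tag{1}$$

where 𝑄, 𝐾, and 𝑉 are linear transformations of the original input feature vector representations, and the output has a dimensionality of 𝑙 × 𝑑, where 𝑑 is the number of feature channels. The term 𝑛 represents the number of heads (the standard scaled dot-product formulation divides by the square root of the per-head key dimension instead). The normalization function 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 is defined as follows:

$$\mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad \text{for } j = 1, \dots, K \tag{2}$$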

where 𝑧 is a vector of 𝐾 entries, here one row of 𝑄𝐾^𝑇. The function normalizes the vector so that every entry lies between 0 and 1 and the entries sum to 1, and it enlarges the gap between the greatest value and the others. The key to parallelism is to split the 𝑄𝐾^𝑇 term, with its factor of √𝑛, into several smaller pieces along the feature channel 𝑑. These smaller parts can then be distributed to multiple computation cores, each of which calculates an attention matrix. The attention matrix weights the importance of every pair of input feature vector representations, telling the vector at each time step which other vectors are most related to it. The final output is then generated by multiplying the attention matrix with the feature maps 𝑉. Various approaches have leveraged different attention mechanisms and achieved state-of-the-art performance in language translation [58], music generation [12], and speech recognition [59].
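The attention and softmax formulas above can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation of [57]: the function names are hypothetical, the divisor is passed as `n` to follow the symbol in the formula, and the calls set it to the feature dimension of each piece, which is the common choice for scaled dot-product attention and an assumption here. The output projection that usually follows the concatenation of heads is omitted.

```python
import numpy as np


def softmax(z, axis=-1):
    """Softmax along `axis`; the max is subtracted for numerical stability."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention(Q, K, V, n):
    """softmax(Q K^T / sqrt(n)) V for Q, K, V of shape (l, d)."""
    scores = Q @ K.T / np.sqrt(n)        # (l, l): one score per pair of positions
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


def multi_head_attention(Q, K, V, num_heads):
    """Split the feature channel into `num_heads` pieces and attend in each."""
    l, d = Q.shape
    assert d % num_heads == 0, "d must be divisible by num_heads"
    head_dim = d // num_heads
    heads = []
    for h in range(num_heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        # Each head is independent, so heads could run on separate cores.
        out, _ = attention(Q[:, s], K[:, s], V[:, s], n=head_dim)
        heads.append(out)
    return np.concatenate(heads, axis=1)  # back to (l, d)


rng = np.random.default_rng(0)
l, d = 6, 8
X = rng.normal(size=(l, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v       # linear transformations of the input

out, weights = attention(Q, K, V, n=d)
multi = multi_head_attention(Q, K, V, num_heads=2)
print(out.shape, weights.shape, multi.shape)  # (6, 8) (6, 6) (6, 8)
```

Each row of `weights` sums to 1 and tells one position how strongly it attends to every other position; the output keeps the 𝑙 × 𝑑 shape stated in the text.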

Despite its advantages in dealing with long-term sequences, one major limitation of the attention mechanism is that the memory consumption of the attention computation is proportional to the square of the sequence length. This can be observed from the term 𝑄𝐾^𝑇. A dot product along the feature dimension 𝑑 is computed for every pair of positions, so the resulting matrix has 𝑙 × 𝑙 entries and takes on the order of 𝑙² × 𝑑 multiplications to compute. For one-dimensional tasks such as language modeling, attention may still be efficient enough to reach high performance when little hardware is available. But for image generation and other tasks with two-dimensional input, the quadratic term becomes crucial in determining the input size that fits into memory. One practical remedy, proposed in [21], is to divide an image into non-overlapping query blocks and then apply self-attention to the smaller blocks. With this partitioning, the memory consumption can be reduced to a size appropriate for the computation. This approach is leveraged in this research and will be discussed in later sections.
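The memory saving from partitioning can be illustrated with a simplified NumPy sketch on a one-dimensional sequence. The names `block_self_attention` and `score_entries` are hypothetical, and letting each block attend only to itself is an assumption made for brevity; the exact region each query block attends to in [21] is not specified here.

```python
import numpy as np


def softmax(z, axis=-1):
    """Softmax along `axis`; the max is subtracted for numerical stability."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def block_self_attention(Q, K, V, block):
    """Self-attention restricted to non-overlapping blocks of `block` positions."""
    l, d = Q.shape
    assert l % block == 0, "sequence length must be divisible by the block size"
    out = np.empty_like(V)
    for start in range(0, l, block):
        s = slice(start, start + block)
        # Only a (block, block) score matrix exists at any one time.
        scores = Q[s] @ K[s].T / np.sqrt(d)
        out[s] = softmax(scores) @ V[s]
    return out


def score_entries(l, block=None):
    """Total attention scores computed: l*l in full, l*block when partitioned."""
    if block is None:
        return l * l
    return (l // block) * block * block


rng = np.random.default_rng(0)
l, d = 8, 4
Q, K, V = (rng.normal(size=(l, d)) for _ in range(3))
out = block_self_attention(Q, K, V, block=4)
print(out.shape)                 # (8, 4)
print(score_entries(1024))       # 1048576
print(score_entries(1024, 32))   # 32768
```

For a sequence of 1024 positions, blocks of 32 reduce the number of scores from 1,048,576 to 32,768, at the price of each position seeing only its own block.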

c). ASPP-based Approach

In addition, a generalized form of convolution named atrous spatial pyramid pooling (ASPP) was proposed in [60]. This approach employs dilated convolution to enlarge the receptive field, capturing objects at various scales by varying the dilation rate. The formula can be written as follows:

$$y[i, j] = \sum_{m, l} x[i + r \cdot m,\; j + r \cdot l]\, w[m, l]$$

where 𝑥 and 𝑦 denote the input and output 2-D feature maps, respectively, 𝑤 is the convolution filter to be learned, 𝑟 is the dilation rate, and [𝑖, 𝑗] indicates the location on the feature maps. Standard convolution is the special case 𝑟 = 1. ASPP then performs dilated convolution with multiple dilation rates and pools the resulting feature maps together.
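The dilated convolution formula can be written directly in NumPy, as in the sketch below. The function names, the zero padding, and the choice to pool by stacking the per-rate outputs along a new axis are assumptions for illustration; [60] may combine the branches differently.

```python
import numpy as np


def dilated_conv2d(x, w, r=1):
    """y[i, j] = sum over m, l of x[i + r*m, j + r*l] * w[m, l] (no padding)."""
    kh, kw = w.shape
    H, W = x.shape
    out_h = H - r * (kh - 1)
    out_w = W - r * (kw - 1)
    y = np.zeros((out_h, out_w))
    for m in range(kh):
        for l in range(kw):
            # Shift the input by (r*m, r*l) and accumulate the weighted copy.
            y += w[m, l] * x[r * m:r * m + out_h, r * l:r * l + out_w]
    return y


def aspp(x, filters, rates):
    """Apply one filter per dilation rate and stack the same-sized outputs.

    Assumes square, odd-sized filters; the input is zero-padded so that
    every branch returns a map of the same size as x.
    """
    outputs = []
    for w, r in zip(filters, rates):
        pad = r * (w.shape[0] - 1) // 2
        outputs.append(dilated_conv2d(np.pad(x, pad), w, r))
    return np.stack(outputs, axis=0)


x = np.arange(49.0).reshape(7, 7)
w = np.ones((3, 3))

y1 = dilated_conv2d(x, w, r=1)   # standard convolution, output 5 x 5
y3 = dilated_conv2d(x, w, r=3)   # nine taps spread over the whole 7 x 7 input
print(y1.shape, y3.shape)        # (5, 5) (1, 1)
print(y3[0, 0])                  # 216.0

pooled = aspp(x, [w, w, w], rates=[1, 2, 3])
print(pooled.shape)              # (3, 7, 7)
```

With 𝑟 = 3 the nine filter taps land on rows and columns 0, 3, and 6 of the input, so a single output value already covers the full 7 × 7 map.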

The cost of a normal convolution grows with the square of the kernel width, much as the cost of attention grows with the square of the sequence length. ASPP follows a concept similar to the memory-efficient attention mechanism mentioned before: both split certain components into smaller pieces and stack them back together after the computation. For ASPP, no actual splitting happens; instead, a larger dilation rate skips a certain amount of computation. For instance, for a 3 × 3 kernel, nine positions need to be calculated when a dilation rate of 1 is applied. With a dilation rate of 3, still only nine positions need to be calculated, but the effective receptive field is now 7 × 7, for which a normal convolution would need to process 49 positions. With its better ability to capture wider contexts, ASPP has been applied to melody extraction and AMT in [18] and [15].
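The arithmetic in this example follows from the effective size of a dilated kernel, 𝑘 + (𝑘 − 1)(𝑟 − 1) for kernel width 𝑘 and dilation rate 𝑟. The short sketch below reproduces the numbers; the helper names are hypothetical.

```python
def effective_kernel_size(k, r):
    """Width of the region covered by a k-wide kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)


def positions(k, r):
    """Return (taps a dilated kernel computes, taps a dense kernel of equal coverage needs)."""
    eff = effective_kernel_size(k, r)
    return k * k, eff * eff


print(effective_kernel_size(3, 1), positions(3, 1))  # 3 (9, 9)
print(effective_kernel_size(3, 3), positions(3, 3))  # 7 (9, 49)
```

The dilated kernel keeps the cost at nine multiplications per output value while covering the same area as a dense 7 × 7 kernel.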
