Machine Translation - Application of Neural Machine Translation to Fashion Website Localization

Hutchins (2005) provides a brief history of how machine translation (MT) systems have evolved over the years. The development of MT can be traced back to around the 1950s, when the first MT conference was held and MT was first demonstrated. Following this, the U.S. government and Georgetown University installed some of the first MT systems and conducted related research projects. Although the results were not as good as expected, this stage laid the foundation for the progress of MT in the 1980s. From then on, the use of MT systems widened, especially with the early installation of translation software on personal computers and the introduction of translator workstations to the market. The online machine translation services with which the general public is more familiar, however, did not appear until the 21st century. More recently, the emerging technology of neural machine translation (NMT) has become a popular research topic. To understand why NMT is worthy of study, it is necessary to understand how it differs from earlier MT systems. According to Ping (2009), MT systems fall into two general types based on their architecture, and these constitute the main areas of MT research: rule-based MT and corpus-based MT.

2.4.1 Rule-based machine translation

Rule-based MT is a system built on a variety of linguistic rules determined by developers, and can be classified as “direct” or “indirect” according to the approach taken (Ping, 2009). Introduced before the 1980s, the direct approach is based on a bilingual dictionary and morphological analysis. In other words, this type of MT translates source text by looking up the dictionary at the word level, and then reordering the translated words according to the grammar of the target language. As a result, rule-based MT built on the direct approach neither analyzes the syntax of source texts accurately nor identifies the relationships between words.
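The direct approach described above can be illustrated with a minimal sketch. The three-word English-Spanish dictionary, the part-of-speech labels and the single reordering rule are all hypothetical and chosen only for illustration; they are not taken from Ping (2009) or from any real system.

```python
# Toy bilingual dictionary (hypothetical entries): source word -> (target word, part of speech).
DICTIONARY = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
}


def translate_direct(sentence):
    """Translate word by word, then apply one target-language reordering rule."""
    # Step 1: word-level dictionary lookup; unknown words pass through unchanged.
    tagged = [DICTIONARY.get(word, (word, "UNK")) for word in sentence.lower().split()]

    # Step 2: reorder according to target grammar (assumed rule: adjective follows noun).
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in tagged)


print(translate_direct("the red car"))
```

Because the lookup never inspects sentence structure, a word missing from the dictionary is simply copied into the output, which mirrors the limitation noted above.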

In the 1980s, developers started to adopt the “indirect” approach when designing MT. This type of MT translates texts in three stages: analysis, transfer and synthesis. First, the system analyzes the syntactic structure of source sentences and converts them into “intermediary, abstract representations of the meaning of the original” (Ping, 2009). Then, these representations are transferred into representations which indicate the syntactic structure of the target language. Lastly, the system synthesizes the transferred representations and produces translated sentences. In contrast to the direct approach, the indirect approach can perform additional analysis on the source text and identify both sentence structure and meaning, instead of translating word for word.
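The three stages of the indirect approach can be sketched in the same toy setting. The `NounPhrase` representation, the word lists and the function names below are hypothetical simplifications; a real transfer system would use a full parser and a far richer intermediate representation.

```python
from dataclasses import dataclass, field


@dataclass
class NounPhrase:
    """A deliberately tiny abstract representation of a noun phrase (illustrative only)."""
    head: str
    determiner: str = ""
    modifiers: list = field(default_factory=list)


# Hypothetical lexical resources for the sketch.
LEXICON = {"the": "el", "red": "rojo", "car": "coche"}
DETERMINERS = {"the"}
ADJECTIVES = {"red"}


def analyze(source):
    """Stage 1: turn the source string into an abstract representation."""
    phrase = NounPhrase(head="")
    for word in source.lower().split():
        if word in DETERMINERS:
            phrase.determiner = word
        elif word in ADJECTIVES:
            phrase.modifiers.append(word)
        else:
            phrase.head = word
    return phrase


def transfer(phrase):
    """Stage 2: map the source representation onto a target-language representation."""
    return NounPhrase(
        head=LEXICON.get(phrase.head, phrase.head),
        determiner=LEXICON.get(phrase.determiner, phrase.determiner),
        modifiers=[LEXICON.get(m, m) for m in phrase.modifiers],
    )


def synthesize(phrase):
    """Stage 3: generate target text (assumed order: determiner, head, modifiers)."""
    parts = [phrase.determiner, phrase.head, *phrase.modifiers]
    return " ".join(part for part in parts if part)


print(synthesize(transfer(analyze("the red car"))))
```

Unlike the word-by-word lookup of the direct approach, word order here is decided from the structure of the representation rather than from the surface position of the words.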

However, there are still some challenges in the development of rule-based MT. Hutchins (2006) indicates that for this type of MT to work successfully, developers have to work on complicated grammar rules, and it is difficult to design a model which can apply all the grammar rules perfectly. Issues also arise from the dictionary used by the MT system, since every dictionary has its own limitations and cannot cover all meanings. Thus, rule-based MT has a comparatively high barrier to entry in terms of development and application.

2.4.2 Corpus-based machine translation

In the 1990s, researchers discovered that it was possible to build MT systems by applying bilingual corpora, especially collections of original texts and their translations. There are two types of corpus-based MT systems: example-based MT and statistical MT. The concept behind example-based MT is similar to that of translation memory (TM) in CAT tools. Based on a bilingual parallel corpus, the MT first matches the source text with the most similar examples in the corpus, and then aligns the source text and the examples to find the corresponding parts. Lastly, the corresponding parts are reordered and assembled to produce the target text. As Ping (2009) points out, the main difference between TM and example-based MT is that the former requires human translators to do the reordering and assembling, while the latter can complete all the work automatically.
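The match, align and recombine steps of example-based MT can be sketched as follows. This is a minimal illustration under stated assumptions: the two-sentence example base and the word pairs are invented, and similarity is measured with `difflib.SequenceMatcher` from the Python standard library rather than with any matching method described in the literature cited here.

```python
import difflib

# Hypothetical example base: source sentence -> stored translation.
EXAMPLES = {
    "i like red shoes": "me gustan los zapatos rojos",
    "i want a blue dress": "quiero un vestido azul",
}
# Hypothetical word-level correspondences used when recombining.
WORD_PAIRS = {"red": "rojos", "black": "negros"}


def translate_by_example(source):
    src = source.lower().split()

    # Step 1: match the input against the most similar stored example.
    best = max(
        EXAMPLES,
        key=lambda ex: difflib.SequenceMatcher(None, ex.split(), src).ratio(),
    )
    example = best.split()
    target = EXAMPLES[best].split()

    # Step 2: align the example with the input to find the parts that differ.
    opcodes = difflib.SequenceMatcher(None, example, src).get_opcodes()
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            old, new = example[i1], src[j1]
            # Step 3: recombine by swapping in the translation of the new word.
            if old in WORD_PAIRS and new in WORD_PAIRS:
                target = [WORD_PAIRS[new] if t == WORD_PAIRS[old] else t for t in target]
    return " ".join(target)


print(translate_by_example("i like black shoes"))
```

A translation memory would stop after the first step and hand the retrieved example to a human translator; the sketch performs the substitution automatically, which is the distinction Ping (2009) draws.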

Thanks to Google Translate and Microsoft Bing Translator, statistical machine translation (SMT) has become the method with which the general public is most familiar, and there have been numerous studies related to the topic. Hutchins (2006) describes SMT as a system based on bilingual corpora which requires “little or no linguistic ‘knowledge.’” Essentially, SMT is built on “word co-occurrences in SL and TL texts (of a corpus), relative positions of words within sentences, and length of sentences” (Hutchins, 2006, p.21). Sentences from the bilingual parallel corpus are aligned by statistical rules such as sentence length and relative positions of words. The translation process then involves two models: a translation model and a language model. The translation model chooses the most probable translation for a source fragment, which can be a word or a phrase, based on the frequencies of word co-occurrences in the aligned bilingual corpora. The language model organizes target fragments in the most probable order to produce translations, based on the frequency of bigrams and trigrams in the target language.
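The interplay of the translation model and the language model can be sketched with toy counts. All counts below are invented for illustration, the scores are raw add-one-smoothed counts rather than normalized probabilities, and the two models are applied one after the other, whereas real SMT decoders combine both scores in a single search.

```python
from collections import Counter
from itertools import permutations

# Hypothetical co-occurrence counts from an aligned corpus: (source, target) -> count.
COOCCURRENCE = Counter({
    ("the", "el"): 10,
    ("red", "rojo"): 8,
    ("red", "tinto"): 2,
    ("car", "coche"): 9,
    ("car", "carro"): 1,
})
# Hypothetical bigram counts from target-language text.
BIGRAMS = Counter({
    ("<s>", "el"): 10,
    ("el", "coche"): 7,
    ("coche", "rojo"): 5,
    ("rojo", "</s>"): 5,
    ("el", "rojo"): 1,
})


def best_translation(fragment):
    """Translation model: pick the target fragment that co-occurs most often."""
    candidates = {t: c for (s, t), c in COOCCURRENCE.items() if s == fragment}
    if not candidates:
        return fragment  # unseen fragments are passed through unchanged
    return max(candidates, key=candidates.get)


def language_model_score(words):
    """Language model: unnormalized bigram score with add-one smoothing."""
    padded = ["<s>", *words, "</s>"]
    score = 1
    for pair in zip(padded, padded[1:]):
        score *= BIGRAMS[pair] + 1
    return score


def translate_smt(sentence):
    fragments = [best_translation(word) for word in sentence.lower().split()]
    # Try every ordering and keep the one the language model scores highest.
    return " ".join(max(permutations(fragments), key=language_model_score))


print(translate_smt("the red car"))
```

Enumerating all permutations is only feasible for very short sentences; it is used here to keep the sketch short, not because practical systems search this way.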

Hutchins (2006) describes SMT as a “direct approach” that replaces a fragment in a source language with a fragment in a target language in the most probable sequence. Because these probabilities are estimated from the corpus, the size of the training corpus plays an essential role in SMT, and this is where the most obvious advantage of SMT lies. As Stein (2018, p.14) indicates, in contrast to rule-based MT, “SMT systems produce better translations in terms of word choice, disambiguation, etc.” Any type of word combination can be translated, as long as it occurs in the corpora often enough “to be identified statistically.” Thus, Stein (2018, p.14) concludes that the idea behind SMT is that “bigger corpora means better results.”

However, although studies on SMT have progressed from word-based to phrase-based and syntax-based models, some issues remain unresolved, most of them arising from the training corpora. As mentioned above, the size of the training corpus matters for SMT, and other scholars such as Ping (2009) also point out that the success of SMT is essentially determined by the training corpus. For instance, if most of the data used to compile a training corpus comes from the internet, which is the most efficient way to collect a huge amount of bilingual text, it is difficult to control its quality, and the SMT program produces unstable results. Apart from this, Stein (2018) also argues that when SMT deals with certain language pairs, many problems stem from differences in grammar like “inflection, word order, use of pronouns, number and kind of temporal forms, etc.”

Gao and Chiou (2017) acknowledge that SMT can serve as a supplement to CAT tools when there is insufficient relevant past translation. Although studies have shown Google Translate to be useful for translating terminology and proper nouns, Gao and Chiou found that the SMT system cannot “identify the beginning and ending of a multiword unit” and often provides translations in simplified Chinese based on probability. As a result, human translators usually need to pay extra attention to pre-editing, such as simplifying the input, in order to improve the productivity of SMT and achieve the goal of “human-aided machine translation.”

Stein (2018, p.15) summarizes the recent development of MT with the observation that “the use of linguistic information and statistical data, has become one of the most researched fields in MT over the last decade.” Combining the strengths of different MT systems not only increases the translation quality of MT but also opens up the possibility of research on MT for rare language pairs; nevertheless, a real breakthrough has yet to be achieved. This hybrid approach has shifted the research focus toward identifying appropriate language resources for building MT systems, especially MT for a specific domain, since “it turns out that the automatic translation of specialized domains is more reliable” (Stein, 2018, p.16).

2.4.3 Neural machine translation

According to Luong, Cho and Manning (2016), there is now considerable demand for machine translation, especially in the fields of “humanity and commerce.” However, although the ultimate goal is to realize “fully automatic high-quality MT,” at the present stage only “user- or platform-initiated low-quality translation” or “author-initiated high quality translation” is available. The first category includes the translation services provided by Google Translate and Bing Translator, while the second requires post-editing by human translators or uses MT as a supplementary tool for translators. As MT technology has evolved from statistical machine translation to neural machine translation, the quality of automatic translation has improved, but there are still obstacles that need to be overcome.

As Luong, Cho and Manning (2016) state, NMT was a “fringe research activity” as recently as 2014, but by 2016 it had become a widely acknowledged approach to general MT. They define NMT as “the approach of modeling the entire MT process via one big artificial neural network” (Luong, Cho and Manning, 2016, p. 14). Simply put, NMT uses a neural encoder-decoder architecture made up of two networks: the encoder network receives an input source sentence and transforms it into a series of vectors, each representing an input word, and the decoder network then generates the translated text based on this series of vectors.
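The data flow of an encoder-decoder model can be sketched in plain Python. This is a structural illustration only: the weights are random and untrained, so the generated words are meaningless, and the vocabulary, the vector size, the simple recurrent update and the averaged context vector are all assumptions made for the sketch rather than details taken from Luong, Cho and Manning (2016).

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy model behaves reproducibly
DIM = 4
SRC_VOCAB = ["the", "red", "car"]           # hypothetical source vocabulary
TGT_VOCAB = ["</s>", "el", "coche", "rojo"]  # hypothetical target vocabulary


def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def tanh(v):
    return [math.tanh(x) for x in v]


# Randomly initialized parameters; training would normally fit these to a parallel corpus.
SRC_EMB = {w: rand_vec(DIM) for w in SRC_VOCAB}
TGT_EMB = {w: rand_vec(DIM) for w in TGT_VOCAB}
W_ENC, U_ENC = rand_mat(DIM, DIM), rand_mat(DIM, DIM)
W_DEC, U_DEC, C_DEC = rand_mat(DIM, DIM), rand_mat(DIM, DIM), rand_mat(DIM, DIM)
W_OUT = rand_mat(len(TGT_VOCAB), DIM)


def encode(words):
    """Encoder: read the source sentence and emit one vector per input word."""
    state, states = [0.0] * DIM, []
    for word in words:
        state = tanh(add(matvec(W_ENC, SRC_EMB[word]), matvec(U_ENC, state)))
        states.append(state)
    return states


def decode(states, max_len=5):
    """Decoder: generate target words one at a time from the encoder vectors."""
    context = [sum(column) / len(states) for column in zip(*states)]
    state, previous, output = states[-1], [0.0] * DIM, []
    for _ in range(max_len):
        state = tanh(add(add(matvec(W_DEC, previous), matvec(U_DEC, state)),
                         matvec(C_DEC, context)))
        scores = matvec(W_OUT, state)
        word = TGT_VOCAB[scores.index(max(scores))]  # greedy choice
        if word == "</s>":
            break
        output.append(word)
        previous = TGT_EMB[word]
    return output


vectors = encode(["the", "red", "car"])
print(len(vectors), decode(vectors))
```

The point of the sketch is that no dictionary, grammar rule or count table appears anywhere: everything the model knows would be stored in the numeric parameters, which is what distinguishes NMT from the rule-based and statistical systems discussed earlier.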

The emerging technology of neural machine translation is reported to perform better than SMT at the sentence level. Kinoshita, Oshio and Mitsuhashi (2017) compared the performance of SMT and NMT trained on large parallel corpora, and report that NMT scores higher both in BLEU, an automatic evaluation metric for machine translation (Papineni, Roukos, Ward and Zhu, 2002), and in human evaluation. The advantages of NMT are explained in the documentation of Google’s AutoML Translation service (2019). When rule-based MT was the mainstream approach to processing natural language, it required professional programmers to instruct the computer step by step. Now that large parallel corpora are available, it is possible to have the machine learn the language rules by itself from examples using a certain framework. This new approach led to the application of customized NMT. Trained with a domain-specific corpus, customized NMT can achieve what general MT cannot in the translation of a specific domain.
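To make the BLEU metric mentioned above more concrete, the following is a simplified sentence-level sketch of the idea in Papineni, Roukos, Ward and Zhu (2002): clipped n-gram precision combined with a brevity penalty. It assumes a single reference, whitespace tokenization, n-grams up to length 2 and no smoothing, so its scores are not comparable with those of standard BLEU implementations.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)


print(bleu("el coche rojo", "el coche rojo"))
print(bleu("el coche", "el coche rojo"))
```

An identical candidate scores 1.0, while a candidate that drops a word is penalized for brevity even though every n-gram it contains is correct.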
