Although human evaluation of MT can cover all aspects of translation quality, it requires time and effort. To better examine the performance of MT, Papineni, Roukos, Ward and Zhu propose an automatic method called BLEU (bilingual evaluation understudy) to assist developers. Simply put, this method consists of two components: a numerical "translation closeness" metric, and a corpus of good quality human reference translations (Papineni, Roukos, Ward, Henderson and Reeder, 2002). BLEU calculates the similarity between an automated translation and a reference translation by counting how many matching n-grams exist between the two texts at the sentence level. Apart from n-grams, sentence length is also taken into consideration, since a more similar sentence length is an indicator of a better translation. According to the study (Papineni, et al., 2002), BLEU is able to differentiate not only good and bad automated translations, but also ideal and poor human translations. The results of the automatic evaluation are statistically similar to the judgement of human participants in the study.
The score provided by BLEU ranges from 0 to 1. A score of 1 occurs when the translation is identical to the reference text. For the sake of convenience, a BLEU score is sometimes presented as a percentage rather than a decimal. It is noteworthy that there is no strict standard for interpreting a BLEU score. For instance, the documentation “Evaluating models” published by Google’s AutoML Translation (2019) suggests using the following table (Table 2.1) as a rough guideline, while Lavie (2011) simply proposes that a BLEU score over 30 should be deemed an “understandable translation” and a score over 50 a “good and fluent translation.”
BLEU Score    Interpretation
< 10          Almost useless
10 - 19       Hard to get the gist
20 - 29       The gist is clear, but has significant grammatical errors
30 - 40       Understandable to good translations
40 - 50       High quality translations
50 - 60       Very high quality, adequate, and fluent translations
> 60          Quality often better than human
Table 2.1 Interpretation of BLEU score
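The guideline bands in the table above can be expressed as a simple lookup. The following is a minimal sketch only: the function name interpret_bleu and the BANDS list are hypothetical, the score is assumed to be on the 0 to 100 percentage scale (a decimal score would first be multiplied by 100), and because adjacent bands in the table share the boundary values 40, 50 and 60, the sketch assumes that a boundary value falls into the higher band.

```python
# Upper bounds (exclusive) and labels taken from the guideline table.
# Assumption: a score that sits exactly on a shared boundary (40, 50, 60)
# is assigned to the higher band.
BANDS = [
    (10, "Almost useless"),
    (20, "Hard to get the gist"),
    (30, "The gist is clear, but has significant grammatical errors"),
    (40, "Understandable to good translations"),
    (50, "High quality translations"),
    (60, "Very high quality, adequate, and fluent translations"),
]


def interpret_bleu(score):
    """Map a BLEU score on the 0-100 scale to its guideline label."""
    if not 0 <= score <= 100:
        raise ValueError("BLEU score must be between 0 and 100")
    for upper_bound, label in BANDS:
        if score < upper_bound:
            return label
    return "Quality often better than human"
```

Such a lookup is only a reading aid; as noted, there is no strict standard for interpreting the score.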
However, scholars have pointed out that although BLEU saves time and effort, its results do not necessarily correspond to the evaluation of professional translators (Kinoshit and Mitsuhashi, 2017), and that differences between language pairs might also cause larger variations in BLEU scores (Junczys-Dowmunt, et al., 2016).
For instance, all words are weighted equally in BLEU, meaning that it cannot identify the severity of a mistranslation. Also, the automated evaluation is based on matching “exact words,” so synonyms are counted as translation errors. BLEU is a method designed for MT developers to effectively measure and improve their systems, and does not necessarily meet the needs of language analysts. As Lavie (2011, p.40) points out, BLEU scores are not “easily interpretable by most translation professionals.” Lavie (2011, p.50) also argues that compared to other automated metrics such as the Metric for Evaluation of Translation with Explicit Ordering (METEOR), BLEU does not show better “correlation with human judgement.”
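To make the n-gram matching, the length comparison and the exact-word limitation concrete, the following is a simplified sketch of a BLEU-style score for one candidate sentence against a single reference. It is not the reference implementation: the function names are hypothetical, the tokens are assumed to be already split, n-grams up to length 4 are weighted uniformly, and no smoothing is applied, so any n-gram order without a match yields a score of 0.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate and one reference (no smoothing)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a matching word cannot inflate the score.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
exact = simple_bleu("the cat sat on the mat".split(), reference)
synonym = simple_bleu("the cat sat on the rug".split(), reference)
```

In this toy example the identical candidate receives the maximum score of 1, while replacing "mat" with the near-synonym "rug" lowers the score exactly as an unrelated word would, because only exact word matches are counted.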
At this point, most of the studies on machine translation and its automatic evaluation have centered around how to improve the technology itself and the related tools. Feedback on machine translation from users such as professional translators is rarely included in the discussion. Thus, in addition to following the methods of past research on building and evaluating a customized NMT system, this research intends to examine this technology from the perspective of professional translators. By combining the two approaches, it is expected that this paper will be able to investigate the performance of the customized NMT in a more comprehensive way, while also answering the question of whether or not the NMT is useful for professional translators.
Chapter Three: Methodology
According to previous studies, although CAT tools are prevalent in the localization industry, their development has hit a bottleneck. As a result, adding machine translation to a translator’s workflow was bound to happen. However, most of the research related to MT focuses on how to improve the technology itself to enhance translation efficiency, yet rarely takes the needs of professional translators into consideration. Thus, this research aims to examine how helpful the latest MT technology is for professional translators who possess the ability to operate CAT tools. For this purpose, it is necessary to answer the following questions: How does the proposed customized NMT perform? How do professional translators perceive the effects of the proposed NMT? What are the empirical effects of the proposed NMT? Lastly, what is the current ideal tool combination for professional translators?
In order to answer these questions, this research combines a collection of tools to conduct an experiment (see Figure 3.1). The first step is to compile a corpus in which English and Traditional Chinese parallel texts are collected from the official websites of the popular fashion brands H&M, ZARA and Burberry. Then, the bilingual corpus is split into two corpora, a training corpus and a validating corpus. The training corpus is applied to build an NMT system from scratch by adopting the framework of OpenNMT. The trained NMT is then used to translate the English text from the validating corpus into Traditional Chinese.
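The splitting step can be sketched as follows. This is an illustration only: the function name split_parallel_corpus is hypothetical, and the 10% validation share and the fixed random seed are assumptions, since the split ratio is not stated here. The corpus is assumed to be a list of aligned (English, Traditional Chinese) sentence pairs, so that each pair stays together after shuffling.

```python
import random


def split_parallel_corpus(pairs, validation_fraction=0.1, seed=0):
    """Shuffle aligned (source, target) pairs and split them into
    a training list and a validation list.

    validation_fraction and seed are illustrative assumptions.
    """
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation


# Placeholder pairs standing in for the collected parallel texts.
corpus = [(f"English sentence {i}", f"Chinese sentence {i}") for i in range(100)]
training_corpus, validating_corpus = split_parallel_corpus(corpus)
```

Shuffling before splitting keeps the validating corpus from consisting only of the last texts collected, and fixing the seed makes the split reproducible.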
The automated translation is tested by BLEU – a method for automatic evaluation of machine translation – which references the bilingual texts from the validating corpus. In addition, the English text from the training corpus is further analyzed with a corpus analysis tool to identify some of the factors which might affect the performance of the customized NMT.
At the same time, a new translation project is created in the CAT tool, Trados 2017, with the English text from the validating corpus serving as the source text and the bilingual text from the training corpus serving as a translation memory. The pre-translation function and analysis in Trados can provide some insight into the automatic translation generated by the customized NMT. Lastly, in order to compare the usefulness of the most common NMT system, Google Translate, with that of the customized NMT, an experiment modeling real translation projects and involving professional translators is conducted. The two groups of source text for the two separate translation projects are selected from the English text of the validating corpus.
With the support of the translation memory built from the bilingual training corpus and machine translation from either the customized NMT or Google Translate, the participating translators have to complete the two projects and answer a feedback questionnaire. It is expected that the results of this experiment can reveal how and why these tools are useful for professional translators.
Figure 3.1 Workflow Chart