The impact of some linguistic features on the quality of neural machine translation

Authors

DOI:

https://doi.org/10.33910/2687-0215-2019-1-2-365-370

Keywords:

machine translation, neural machine translation, neural networks, transformer, translation evaluation, translation quality, tokenization, training corpus, byte-pair encoding, Yandex parallel corpus, Yandex corpus, WMT18 test set, news texts, Yandex.Translate, BLEU score

Abstract

This paper investigates how different features influence the translation quality of a Russian-English neural machine translation system. All trained translation models are built with the OpenNMT-py system and share the state-of-the-art Transformer architecture. Most of the models use the Yandex English-Russian parallel corpus as training data. The BLEU score on the test data of the WMT18 news translation task serves as the main measure of performance. In total, five features are tested: tokenization, lowercasing, the use of BPE (byte-pair encoding), the source of the BPE vocabulary, and the training corpus. The study shows that tokenization and BPE appear to give a considerable advantage, while lowercasing affects the result only insignificantly. As for the source of the BPE vocabulary, larger monolingual corpora such as News Crawl may provide a greater advantage than the training corpus itself. The thematic correspondence between the training and test data proved to be crucial. The fairly high scores of these models may be attributed to the fact that both the Yandex parallel corpus and the WMT18 test set consist largely of news texts. By contrast, the models trained on the Open Subtitles parallel corpus score substantially lower on the WMT18 test set, yet achieve scores comparable to those of the other models on a subset of the Open Subtitles corpus not used in training. An expert evaluation of the two highest-scoring models showed that neither outperforms the current Google Translate. The paper also provides an error classification; the most common errors are mistranslations of proper names and polysemous words.
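To make the BPE feature concrete, the following is a minimal Python sketch of byte-pair encoding in the spirit of Sennrich et al. (2016): it learns merge operations from a word list and then segments a word with them. It is an illustration only, not the tooling used in the study. The function names `learn_bpe` and `apply_bpe`, the `</w>` end-of-word marker, the lexicographic tie-breaking, and the toy word list are all assumptions made here.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption made here so that the result is deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def _merge(symbols, pair):
    """Fuse every left-to-right occurrence of `pair` in `symbols`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols


if __name__ == "__main__":
    # Toy data, purely illustrative.
    toy_words = ["low", "low", "lower", "newest", "newest", "widest"]
    toy_merges = learn_bpe(toy_words, num_merges=10)
    print(toy_merges)
    print(apply_bpe("lowest", toy_merges))
```

In this sketch the "source of BPE" feature corresponds to the choice of the word list passed to `learn_bpe`: the merges can be learned either from the training corpus itself or from a larger monolingual corpus and then applied to the training data.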

Bibliography

SOURCES

Anglo-russkij parallel’nyj korpus (versiya 1.3) [English-Russian parallel corpus (version 1.3)]. [Online]. Available at: https://translate.yandex.ru/corpus (accessed 15.08.2019). (In Russian)

Index of /news-crawl. [Online]. Available at: http://data.statmt.org/news-crawl/ (accessed 11.09.2019). (In English)

OpenSubtitles.org. [Online]. Available at: http://www.opensubtitles.org (accessed 13.08.2019). (In Russian)

REFERENCES

Bahdanau, D., Cho, K., Bengio, Y. (2015) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473v7. [Online]. Available at: https://arxiv.org/abs/1409.0473 (accessed 15.08.2019). (In English)

Barrault, L., Bojar, O., Costa-jussà, M. R. et al. (2019) Findings of the 2019 Conference on Machine Translation (WMT19). In: Proceedings of the Fourth Conference on Machine Translation (WMT). Vol. 2: Shared Task Papers (Day 1). Florence, Italy, August 1–2, 2019. Stroudsburg, PA: Association for Computational Linguistics, pp. 1–61. (In English)

Bojar, O., Federmann, Ch., Fishel, M. et al. (2018) Findings of the 2018 Conference on Machine Translation (WMT18). In: Proceedings of the Third Conference on Machine Translation (WMT). Vol. 2: Shared Task Papers. Brussels, Belgium, October 31 – November 1, 2018. Stroudsburg, PA: Association for Computational Linguistics, pp. 272–307. (In English)

Lison, P., Tiedemann, J. (2016) OpenSubtitles 2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia, May 23–28, 2016. Pp. 923–929. [Online]. Available at: http://www.lrec-conf.org/proceedings/lrec2016/summaries/947.html (accessed 13.08.2019). (In English)

One model is better than two. Yandex.Translate launches a hybrid machine translation system. (2017) Yandex Blog. 14 September. [Online]. Available at: https://yandex.com/company/blog/one-model-is-better-than-two-yu-yandex-translate-launches-a-hybrid-machine-translation-system (accessed 15.08.2019). (In English)

Sennrich, R., Haddow, B., Birch, A. (2016) Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany, August 7–12, 2016. Vol. 1. Stroudsburg, PA: Association for Computational Linguistics, pp. 1715–1725. (In English)

Sutskever, I., Vinyals, O., Le, Q. V. (2014) Sequence to Sequence Learning with Neural Networks. In: Advances in Neural Information Processing Systems 27 (NIPS 2014). Red Hook, NY: Curran Associates, pp. 3104–3112. (In English)

Turovsky, B. (2016) Found in translation: More accurate, fluent sentences in Google Translate. Translate. News about Google Translate. 15 November. [Online]. Available at: https://www.blog.google/products/translate/found-translation-more-accurate-fluent-sentences-google-translate/ (accessed 15.08.2019). (In English)

Vaswani, A., Shazeer, N., Parmar, N. et al. (2017) Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach, California, USA, 4–9 December 2017. Red Hook, NY: Curran Associates, pp. 5998–6008. (In English)

Vilar, D., Xu, J., D’Haro, L. F., Ney, H. (2006) Error analysis of statistical machine translation output. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006). Genoa, Italy, May 22–28, 2006. Pp. 697–702. [Online]. Available at: http://www.lrec-conf.org/proceedings/lrec2006/pdf/413_pdf.pdf (accessed 10.08.2019). (In English)

Published

02.10.2019

Issue

Section

Applied Linguistics