Extracting features from text to improve statistical machine translation
In this paper we investigate extending the default feature set of the Moses statistical machine translation (SMT) system with shallow linguistic information from source and target phrases. Although a typical SMT system uses a phrase table with 5 default features, most systems are extensible and support any number of additional features. We hypothesize that linguistic information extracted from the source and target phrases can improve overall translation quality, i.e. make the system more robust and reduce the number of incorrect word choices, punctuation mistakes and other problems SMT systems are prone to. First, we build a baseline SMT system. Then we extract shallow linguistic features directly from the source and target phrases of the baseline system’s phrase table. The features are precomputed and stored in the phrase table, so they can be regarded as stateless dense features. We develop and examine 19 features that incorporate information from source and target phrases, drawing on features commonly used in monolingual and parallel data filtering. They include source and target phrase lengths; word, number and punctuation symbol counts; and word frequencies according to large monolingual corpora, among others. For each feature, we build and evaluate a separate SMT system. We conduct a series of experiments on the English-Russian language pair and obtain statistically significant improvements of up to 0.4 BLEU over the baseline configuration.
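The precomputation step described in the abstract can be illustrated with a minimal sketch. It assumes the plain-text Moses phrase-table layout, in which fields are separated by ` ||| ` and the third field holds the feature scores. The function names `shallow_features` and `augment_line` are hypothetical, and the three features per side (token count, number count, punctuation count) are only a small illustrative subset, not the 19 features studied in the paper. How the values should be scaled or encoded before the decoder reads them is not stated in the abstract, so raw counts are used here purely for illustration.

```python
import string
from typing import List

DELIM = " ||| "  # field separator assumed for a plain-text Moses phrase table


def shallow_features(source: str, target: str) -> List[float]:
    """Compute illustrative shallow features for one phrase pair.

    For each side: token count, count of all-digit tokens, and count of
    tokens consisting only of ASCII punctuation characters.
    """
    feats: List[float] = []
    for phrase in (source, target):
        tokens = phrase.split()
        feats.append(float(len(tokens)))
        feats.append(float(sum(t.isdigit() for t in tokens)))
        feats.append(float(sum(
            all(c in string.punctuation for c in t) for t in tokens
        )))
    return feats


def augment_line(line: str) -> str:
    """Append the extra feature values to the score field of one line."""
    fields = line.rstrip("\n").split(DELIM)
    source, target, scores = fields[0], fields[1], fields[2]
    extra = " ".join(f"{v:g}" for v in shallow_features(source, target))
    fields[2] = f"{scores} {extra}"
    return DELIM.join(fields)


if __name__ == "__main__":
    # Hypothetical phrase-table entry, made up for this example.
    example = "the house , 2 ||| дом , 2 ||| 0.5 0.4 0.3 0.2 ||| 0-0 1-0 ||| 10 12 3"
    print(augment_line(example))
```

Because each value depends only on the phrase pair itself, it can be computed once, offline, and stored alongside the existing scores, which is what makes such features stateless and dense. To mirror the one-system-per-feature setup described above, a single value would be appended per run instead of all six.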
Copyright (c) 2019 Александр Павлович Молчанов
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.