Terminological subsystems of modern Russian school textbooks: A study based on Word2Vec and neural networks

Authors

  • Sergey I. Monakhov Herzen State Pedagogical University of Russia https://orcid.org/0000-0002-0759-9998
  • Vladimir V. Turchanenko Herzen State Pedagogical University of Russia; Institute of Russian Literature (Pushkinskij Dom), Russian Academy of Sciences
  • Ekaterina A. Fedyukova Independent researcher
  • Dmitry N. Cherdakov Herzen State Pedagogical University of Russia; Saint-Petersburg State University https://orcid.org/0000-0003-1533-4284

DOI:

https://doi.org/10.33910/2687-0215-2020-2-2-118-146

Keywords:

term, terminology, vector representation, school textbook, general education, Russian language, collocations, neural network, deep learning, Word2Vec, CBOW, skip-gram

Abstract

The article reports the results of a study that explored the inventory and functioning of scientific terms and special lexemes in textbooks for Russia’s secondary schools. The toolset included modern methods of natural language processing and deep learning. The number of terms from different fields of knowledge that a secondary school student is expected to learn has never been evaluated. According to preliminary estimates based on the Model Basic Curriculum for General and Secondary Education 2015, a secondary school leaver is supposed to understand, recognise, and use about 1,000 terms and terminological combinations in the subject Russian Language alone. Given the number of school subjects, the total amount of special vocabulary studied in general education schools therefore runs into the thousands. At the same time, the comparative characteristics of the inventory and functioning of terms in textbooks for different school subjects remain under-scrutinized. It is also unclear how the terminological density of school textbooks for different subjects correlates with the place those subjects occupy in the curriculum. The traditional way of compiling lists of special terms is simply to glean them from specialised texts and record them manually. This method is useful for gaining insights into best selection practices; however, it cannot be applied to large data sets and does not reflect term frequency, the specificity of terms’ syntagmatic connections, or the systemic relationships between them.
Our project aims to fill this gap by: 1) creating a full-text corpus of school textbooks approved by the Ministry of Education for grades 5–11; 2) automatically extracting, stratifying, and mapping terms with the help of distributional semantics algorithms; 3) creating and training a deep neural network capable of predicting the subject, level of education, and educational topic given a group of term vectors. The results may contribute to the theoretical development of terminology science. They may also find practical application, e. g., in the development of different types of educational literature.


Published

05.09.2021

Issue

Section

Applied Linguistics