A corpus-based approach in archaeolinguistics


Ilia A. Afanasev, Saint Petersburg State University




Keywords: archaeolinguistics, corpus-based approach, review, Old Church Slavonic, Ancient Greek, corpus linguistics, ancient languages, extinct languages


The article treats archaeolinguistics as a separate field of knowledge and outlines the features that distinguish it from other disciplines in comparative studies. It analyses existing text collections and shows how they can be applied in corpus-based research on ancient languages. It also discusses approaches to creating new text corpora. The study focuses on Old Church Slavonic and Ancient Greek and analyses the existing corpora of these languages, e.g., Corpus Cyrillo-Methodianum Helsingiense. Most of the corpora under study are not tagged. Some change the original writing system (from Glagolitic to Latin, using, for instance, ASCII), while others have restricted access. Some are no longer available at all or are available only as part of local databases. Thus, the corpus-based resources for the ancient languages in question are insufficient. To facilitate more effective research, the easiest solution is to develop new corpora on platforms that specialize in linguistic analysis (e.g., CDLI or Lingvodoc) or on systems that support DIY corpora. However, such platforms are often paywalled, may have limited functionality, or lack comprehensive user guides. Consequently, there seems to be no ready-made solution for archaeolinguists who want to use a corpus-based approach in their research. They have to either make a considerable effort to adapt an existing system to their purposes or build one of their own. In conclusion, the article proposes one possible way to address these issues.



Anthony, L. (2019) AntConc (Version 3.5.8). [Computer Software]. Tokyo: Waseda University. Available at: https://www.laurenceanthony.net/software (accessed 22.03.2020). (In English)

Brezina, V., Weill-Tessier, P., McEnery, A. (2020) #LancsBox v. 5.x. [Computer Software]. Lancaster University. Available at: http://corpora.lancs.ac.uk/lancsbox (accessed 21.03.2021). (In English)

CCMH — Corpus Cyrillo-Methodianum Helsingiense. [Online]. Available at: https://korp.csc.fi/download/ccmh-src (accessed 24.02.2020). (In English)

CDLI — The Cuneiform Digital Library Initiative. [Online]. Available at: https://cdli.ucla.edu/ (accessed 08.03.2020). (In English)

CDLI Core Update — CDLI Core Update. The Cuneiform Digital Library Initiative. [Online]. Available at: https://cdli.ucla.edu/?q=news/cdli-core-update (accessed 08.03.2020). (In English)

CDLI Repository — CDLI. GitHub Repository. [Online]. Available at: https://github.com/cdli-gh (accessed 08.03.2020). (In English)

DDBDP — Duke Databank of Documentary Papyri. Papyri.info. [Online]. Available at: http://papyri.info/ddbdp (accessed 10.10.2020). (In English)

KSUCCA: King Saud University Corpus of Classical Arabic. Sketch Engine. [Online]. Available at: https://www.sketchengine.eu/corpus-of-classical-arabic-ksucca/#toggle-id-1 (accessed 21.03.2020). (In English)

LCaS — Corpora and tools. Corpora of Russian Federation. [Online]. Available at: http://web-corpora.net/?l=en (accessed 23.03.2020). (In English)

Lingvodoc 3.0. [Online]. Available at: http://lingvodoc.ispras.ru/ (accessed 10.03.2020). (In English)

Lingvodoc Repository — Lingvodoc repository on GitHub. GitHub. [Online]. Available at: https://github.com/ispras/lingvodoc (accessed 10.03.2020). (In English)

Manuscript. Slavyanskoe pis’mennoe nasledie [Manuscript. Slavonic written heritage]. [Online]. Available at: http://manuscripts.ru/ (accessed 24.02.2020). (In Russian)

Albanian National Corpus. (2016) [Online]. Available at: albanian.web-corpora.net (accessed 23.03.2020). (In English)

MTAAC — MTAAC Work Packages Repository. [Online]. Available at: https://github.com/cdli-gh/mtaac_work (accessed 08.03.2020). (In English)

Obshtezhitie — The World Wide Web portal for the study of Cyrillic and Glagolitic manuscripts and early printed books. (2020) [Online]. Available at: http://www.obshtezhitie.net/ (accessed 24.02.2020). (In English)

Perseus — PerseusDL/treebank_data. GitHub. [Online]. Available at: https://github.com/PerseusDL/treebank_data (accessed 24.02.2020). (In English)

PROIEL — Haug, D., Jøhndal, M. (2008) Creating a parallel treebank of the Old Indo-European Bible translations. In: C. Sporleder, K. Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008). Marrakech: European Language Resources Association Publ., pp. 27–34. (In English)

RRuDi — A Russian Diachronic Online Corpus. [Online]. Available at: https://www.slawistik.hu-berlin.de/de/member/meyerrol/subjekte/rrudi (accessed 24.02.2020). (In German)

SBLGNT — SBL Greek New Testament. [Online]. Available at: http://sblgnt.com/ (accessed 25.02.2020). (In English)

Sketch Engine — Text corpora in Sketch Engine. Sketch Engine. [Online]. Available at: https://www.sketchengine.eu/corpora-and-languages/corpus-list/ (accessed 21.03.2020). (In English)

Sketch Engine — Learn how language works. Sketch Engine. [Online]. Available at: https://www.sketchengine.eu/ (accessed 20.03.2020). (In English)

Tauber, J. K. (2017) MorphGNT: SBLGNT Edition. Version 6.12. [Online]. Available at: https://github.com/morphgnt/sblgnt (accessed 21.03.2021). (In English)

TITUS — Thesaurus Indogermanischer Text- und Sprachmaterialien. [Online]. Available at: http://titus.uni-frankfurt.de/indexe.htm (accessed 24.02.2020). (In English)

Journal of Applied Linguistics and Lexicography, 2020, vol. 2, no. 2

Tsakorpus Repository — Tsakorpus 2.0. GitHub. [Online]. Available at: https://github.com/timarkh/tsakorpus (accessed 23.03.2020). (In English)

USC OSC — University of Southern California Old Slavic Corpus. [Resource no longer accessible]. Available at: https://bcf.usc.edu/~pancheva/HistoricalSyntaxSouthSlavic.html#participants (accessed 24.02.2020). (In English)

Zeman, D., Nivre, J., Abrams, M. et al. (2020) Universal Dependencies 2.6, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. [Online]. Available at: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3226 (accessed 21.03.2021). (In English)


Alfaifi, A., Atwell, E. (2013) Arabic Learner Corpus: Texts transcription and files format. In: Proceedings of the International Conference on Corpus Linguistics (CORPORA-2013). Saint Petersburg: Saint Petersburg University Press, pp. 1–8. https://www.doi.org/10.13140/2.1.3468.8963 (In English)

Alrabiah, M., Al-Salman, A., Atwell, E. S. (2013) The design and construction of the 50 million words KSUCCA. In: Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics. Leeds: The University of Leeds Publ., pp. 5–8. (In English)

Alrabiah, M., Al-Salman, A., Atwell, E. S. et al. (2014) KSUCCA: A key to exploring Arabic historical linguistics. International Journal of Computational Linguistics (IJCL), 5 (2): 27–36. (In English)

Arkhangelsky, T. A., Kisilier, M. L. (2018) Korpusa grecheskogo yazyka: dostizheniya, tseli i zadachi [Corpora of modern Greek: Achievements and goals]. In: N. N. Kazansky (ed.). Indoevropejskoe yazykoznanie i klassicheskaya filologiya — XXII (chteniya pamyati I. M. Tronskogo). Materialy Mezhdunarodnoj konferentsii, prokhodivshej 18–20 iyunya 2018 g. [Indo-European linguistics and classical philology (Joseph M. Tronsky memorial Conference). Proceedings of the International Conference, St. Petersburg, 18–20 June, 2018]. Pt 1. Saint Petersburg: Nauka Publ., pp. 50–59. https://www.doi.org/10.30842/ielcp230690152203 (In Russian)

Berkowitz, L., Johnson, W. H. (1990) Thesaurus Linguae Graecae Canon of Greek authors and works. 3rd ed. New York: Oxford University Press, 536 p. (In English)

Dandapat, S., Sarkar, S., Basu, A. (2004) A hybrid model for Part-of-Speech Tagging and its application to Bengali. In: Proceedings of the International Conference on Computational Intelligence, ICCI 2004. Istanbul: Esenyurt University Publ., pp. 169–172. (In English)

Eckhoff, H. M. (2018) A corpus approach to the history of Russian po delimitatives. Diachronica, 35 (3): 338–366. https://doi.org/10.1075/dia.00006.eck (In English)

Hasan, F., UzZaman, N., Khan, M. (2007) Comparison of different POS tagging techniques (n-gram, HMM and Brill’s tagger) for Bangla. In: K. Elleithy (ed.). Advances and innovations in systems, computing sciences and software engineering. Dordrecht: Springer Publ., pp. 121–126. https://doi.org/10.1007/978-1-4020-6264-3_23 (In English)

Kilgarriff, A. (2013) Using corpora as data sources for dictionaries. In: H. Jackson (ed.). The Bloomsbury companion to lexicography. London: Bloomsbury Publ., pp. 77–96. https://www.doi.org/10.5040/9781472541871.ch-006 (In English)

Kopotev, M. (2014) Vvedenie v korpusnuyu lingvistiku [Introduction to corpus linguistics]. Prague: Animedia Company Publ., 185 p. (In Russian)

MADA — Habash, N., Rambow, O., Roth, R. (2010) MADA+TOKAN Manual. [Online]. Available at: http://www1.cs.columbia.edu/~rambow/software-downloads/CCLS-10-01.pdf (accessed 21.03.2020). (In English)

Mitrenina, O. (2014) The Corpora of Old and Middle Russian texts as an advanced tool for exploring an extinguished language. Scrinium, 10 (1): 455–461. (In English)

Molina, M., Molin, A. (2016) In a lacuna: Building a syntactically annotated corpus for a dead cuneiform language (on the basis of Hittite). In: Computational linguistics and intellectual technologies: Proceedings of the international conference “Dialogue 2016” (Moscow, June 1–4, 2016). Moscow: Russian State University for the Humanities Publ. [Online]. Available at: http://www.dialog-21.ru/media/3476/molinammolina.pdf (accessed 21.03.2021). (In English)

Sokolov, E. G. (2019) The project of a deeply tagged parallel corpus of Middle Russian translations from Latin. Journal of Applied Linguistics and Lexicography, 1 (2): 337–364. https://www.doi.org/10.33910/2687-0215-2019-1-2-337-364 (In English)

Vendina, T. I. (2002) Srednevekovyj chelovek v zerkale staroslavyanskogo yazyka [Medieval man in the mirror of Old Church Slavonic]. Moscow: Indrik Publ., 336 p. (In Russian)

Zakharov, V. P. (2015) Istoricheskie korpusa i korpusnye diakhronicheskie issledovaniya [Historical corpora and corpus-based diachronic studies]. In: Pis’mennoe nasledie i informatsionnye tekhnologii “El’Manuscript-2015” [Written heritage and information technologies “El’Manuscript-2015”]. Novosibirsk: State Public Scientific-Technological Library of the Siberian Branch of the RAS Publ., pp. 11–13. (In Russian)





Applied Linguistics