Historical and Poetic Subcorpus of the National Kazakh Language Corpus
DOI:
https://doi.org/10.25178/nit.2025.2.19
Keywords:
poetic subcorpus; metatextual annotation; Arabic script; writing system; text database; Kazakh language; Kazakh poetry; National Corpus of the Kazakh Language
Abstract
The article analyzes the key aspects of digitizing samples of Kazakh oral folk literature from the 15th to 19th centuries, originally written in Arabic script, and integrating them into the National Corpus of the Kazakh Language (NCKL). This work constitutes the first stage in the creation of a Historical and Poetic Subcorpus of the NCKL. As part of the study, existing poetic subcorpora in other languages (Russian, Czech, Bashkir, and Persian) were compared, which made it possible to identify the most effective methods and approaches for developing the Kazakh subcorpus.
A significant outcome of the project is a metatextual annotation model comprising 28 parameters that reflect the specifics of Kazakh poetry. Key elements of Kazakh verse were identified, including stanza structure, syllable count, rhyme schemes, and metrical feet. The resulting annotation system enables an accurate representation of the poetic features of the texts while accounting for the influence of Eastern literature and folk genres on the evolution of the Kazakh poetic tradition. One important innovation is the semantic annotation of archaic vocabulary.
The article also presents the design of the subcorpus interface, which allows users to explore poetic works in their original Arabic script alongside their transcribed Cyrillic versions. This makes the subcorpus a valuable tool for linguistic and literary research.
How to Cite
Seitbekova A. A., Fazylzhan A. M., Seydamat A. K., Abaeva M. K. and Mursal A. Historical and Poetic Subcorpus of the National Kazakh Language Corpus. New Research of Tuva, 2025, no. 2, pp. 312-338. (In Russ.). DOI: https://doi.org/10.25178/nit.2025.2.19
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The author(s), as license holder(s), grant rights to their work to the journal (the licensee) under a simple non-exclusive open license in accordance with Art. 1286.1, «Open license for a research work, work of literature or fine arts», of the Civil Code of the Russian Federation.
New Research of Tuva publishes articles under the Creative Commons Attribution-NonCommercial license (CC BY-NC).
Since this is an open license, the author(s) retain the right to upload the article to their institutional repository, submit it to another journal (if it allows republication), or republish it on their own website (in full or in part).
However, several conditions apply:
a) The republished version must always contain the name(s) and affiliation(s) of the author(s), the original title, and a hyperlink to the original version on the New Research of Tuva website;
b) It must be in open access and free of charge, and no category of readers may be advantaged in any way over the general readership;
c) Should the contribution be submitted elsewhere by its author(s) without substantial modification (30% or more of the original text unchanged), the body of the article should contain a disclaimer that the original version was published in New Research of Tuva (with a link to the respective page).
CC BY-NC is a non-revocable license that applies worldwide and lasts for the duration of the work’s copyright.