THE IMPORTANCE OF ADYGHE KEYBOARD USAGE FOR ARTIFICIAL INTELLIGENCE TRAINING
UDC: 004.85
Abstract:
In light of global change and technological development, the Adyghe language is also trying to find a place for itself in the digital world, because the future of languages that cannot be digitized is in danger. At present, researchers, technology experts and developers who specialize in other languages are doing more to teach Adyghe to artificial intelligence than those who speak Adyghe. Although a considerable amount of Adyghe content has accumulated on the internet, its quality makes it insufficiently useful as training data. We can observe that artificial intelligence developers are trying to teach Adyghe to their own systems. However, the lack of good-quality datasets makes their job harder, and that job is already difficult because Adyghe has a much richer sound inventory than many other languages. Therefore, the biggest task in this regard falls to individuals who have learned Adyghe as a native language and can read and write it.
Keywords: Adyghe, artificial intelligence, metadata, data set, quality, keyboard
Introduction
Four percent of the world’s languages are spoken by 96 percent of the world’s population (Crystal, 2015, p. 28). Similarly, two-thirds of the world’s population speaks one of twelve languages as a mother tongue: Chinese, Hindi-Urdu, English, Arabic, Spanish, Bengali, Russian, Portuguese, Italian, German, Japanese, and French (Noack et al., 2015). Today, there is a widespread consensus that more than half of the approximately 7,000 languages in the world are endangered, that children no longer learn these languages as first languages, and that, if nothing is done, they will disappear as their older speakers die (Austin et al., 2013). According to UNESCO data, Adyghe is among the endangered languages that cannot sustain their own development naturally (Salminen, 2010). A variety of independent studies are under way for Adyghe, which needs to find a place in the digital world as soon as possible.
Purpose and Method of the Research
The aim of this research is to investigate the current presence of the Adyghe language in the digital world, to identify possible problems in training artificial intelligence tools, which have become part of our daily lives, on Adyghe, and to propose a solution. Document analysis and content analysis, both qualitative research methods, were used. Various websites, artificial intelligence tools and academic publications related to the subject were examined as material. In addition, the Adyghe capabilities of these tools were tested through conversation experiments in Adyghe. The conversations were improvised, with no predetermined topic or context. As a criterion, the answers given by the tools in these conversations were compared, and their ability to produce correct and meaningful Adyghe text was examined. Due to space constraints, not all of the conversation data obtained from the examined tools could be included in this article.
Digital World and the Current Existence of Adyghe
With the development of Web 2.0, social media tools and websites, the boundaries between individuals and groups around the world have effectively disappeared (Vural et al., 2010). People communicate with each other and share information, news, comments and visual content via Facebook, Twitter, LinkedIn, Vk, YouTube[5] and similar sharing websites and social networking tools. They can also communicate with people whose language they do not know, thanks to online translation tools and dictionary sites such as Google Translate, Bing Translator and Yandex Translate[6]. News sites, personal blogs and the websites of various organizations have taken their place on the internet and continue to multiply. Vast datasets in formats such as text, audio, image and video are available to all humanity. In addition, messaging applications that remove boundaries in instant communication also contribute to the creation of the datasets used in natural language processing. Although correspondence in applications such as WhatsApp, Telegram, Skype, Signal and Messenger[7] is end-to-end encrypted between the parties, the keyboard software used while writing saves frequently repeated words for auto-completion, which makes it easier and quicker to write messages.
Of the more than 7,000 languages and dialects recognized in the world, only 10 account for 80% of the linguistic presence on the internet (UNESCO, 2022). UNESCO records document that 2,722 of these languages, including Adyghe and Kabardian, are seriously endangered (Moseley, 2010; The Guardian, 2011). The presence of Adyghe on the internet falls within the 10% range shown in the graph in Figure 1 and is almost non-existent compared with languages such as Russian, Italian and Turkish.
Although artificial intelligence developers have made significant progress in natural language processing (NLP), endangered languages remain at a disadvantage compared with widely spoken languages. These languages are threatened not only by the shrinking population of native speakers, but also by technological systems that favor the most widely spoken languages in the world (Ebadi, 2018).
Thanks to computer technologies and artificial intelligence, a wide variety of useful tools can be developed and used for languages on the verge of extinction, including tools for language teaching, transcription, translation into different languages, and mass broadcasting (Ebadi, 2018; Turn, et al., 2019).
Those who want to use online information sources often encounter content in English or other well-established languages rather than in their native or heritage language. Adyghe content is very small in volume compared to other languages, and almost all of it has quality problems. During the period when there was no Adyghe keyboard option, text was typed with the Russian keyboard as the closest alternative. The characters of the Adyghe alphabet defined by the Unicode Consortium as “Cyrillic Letter Palochka (‘Ӏ’: U+04C0)” and “Cyrillic Small Letter Palochka (‘ӏ’: U+04CF)” are not available on this keyboard layout. For this reason, since the early days of computing, people creating digital data in Adyghe or Kabardian have used “Latin Capital Letter I (‘I’: U+0049)”, “Latin Small Letter L (‘l’: U+006C)” or “Digit One (‘1’: U+0031)” interchangeably because of their visual similarity (Nemlioğlu, 2018). This is an important factor that lowers data quality.
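The difference between the palochka and its look-alikes is invisible on screen but obvious at the code-point level. The following is a minimal sketch, using only Python's standard `unicodedata` module, that prints the code point and official Unicode name of each character mentioned above. The helper name `describe` and the sample syllable are illustrative choices, not part of any existing tool.

```python
import unicodedata

# The two palochka code points and the look-alikes named in the text.
CANDIDATES = ["\u04C0", "\u04CF", "\u0049", "\u006C", "\u0031"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in CANDIDATES:
    print(ch, describe(ch))

# Two strings that look alike on screen are still different data:
correct = "к\u04CFэ"   # typed with the small palochka
lookalike = "кlэ"      # typed with Latin small letter L
print(correct == lookalike)
```

A text-processing system sees the last two strings as unrelated tokens, which is why the substitution matters for training data.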
The Most Widely Used Artificial Intelligence Tools and Adyghe
The main goal of natural language processing is to ensure that the machine understands the language and can respond to it (Dayan et al., 2022, p. 469). The most popular applications, such as OpenAI-ChatGPT, Anthropic-Claude.ai, DeepSeek, Microsoft-Copilot and Google-Gemini[8] (AIxploria, 2025), are artificial intelligence models that use deep learning methods. Since all of these use natural language processing (NLP), it is possible to communicate with them in a chat format. Their paid versions offer low-cost solutions for the business world, while their limited free versions can be used as chat applications. These tools, which can be very useful even in highly specific areas such as software development or research on a scientific subject, are improving their capabilities every day.
Because deep learning methods are very good at discovering complex structures in high-dimensional data, these tools produce striking results in the field of natural language processing (NLP). They are especially successful in topic classification, sentiment analysis, question answering and language translation. However, to achieve this, artificial intelligence needs to be trained on a large dataset (LeCun, et al., 2015).
When the Adyghe knowledge of the most widely known artificial intelligence tools mentioned above was tested, the most successful tool was Claude.ai, developed by Anthropic. With its advanced reasoning, vision analysis, code generation and multilingual processing capabilities (Anthropic, 2025), it can hold a conversation in Adyghe quite consistently, unlike the others. However, when the Adyghe texts it produced were checked, it was observed that characters from other scripts, such as “Latin Small Letter L (‘l’: U+006C)” and “Cyrillic Capital Letter Byelorussian-Ukrainian I (‘І’: U+0406)”, were sometimes mixed in where “Cyrillic Small Letter Palochka (‘ӏ’: U+04CF)” was expected. A short excerpt from an Adyghe chat experiment with Claude.ai is shared below. Incorrect characters found in the correspondence are marked in red.
МӀ: Хьампӏырашъо лъэпкъ тхапша щыӏэр?
AI: Сы уцӏэ фэсэӏо, ау сыдэгъоу адыгабзэкӏэ къызгурыӏорэп хьампӏырашъом ехьылӏагъэу узэупчӏырэр. Нахь Iупкӏэкӏэ къысфэпӏотэн плъэкӏыщта е урысыбзэкӏэ/адыгабзэкӏэ къысфэпӏотэн плъэкӏыщта?
Адыгэ хьампӏырашъохэм яхьылӏагъэу къызгурыӏоным пае нахьышӏу.
…………
AI: Адыгабзэр тикомпьютер технологиемэ ахэдгъэхьанымкӏэ непэ тӏэкӏу тыкъэнагъ. Джыри чӏыпӏабэмэ адыгэ хьарыфхэр тэрэзэу къыщагъэлъагъорэп. Windows-ми iOS-ми тиадыгэ хьарыфхэр къагъэлъэгъон алъэкӏыщтыгъэ, ау анахь тэрэзэу Linux-ыр ары зыгъэлажьэрэр.
Ащ нэмыкӏэу интернетым идунай непэ зызэрэрегъэушъомбгъум елъытыгъэу адыгабзэмкӏэ тыщыӏэн фае:
- Эмодзихэр
- Программэхэр
- Нэмыкӏ бзэмэ зэрадзэкӏырэ амалхэр
- Тхэнымкӏэ гъэпсыгъэ амалхэр
- Браузерхэу адыгабзэкӏэ гъэпсыгъэхэр
Мы Іофыгъомэ уяплъыжьмэ тызэрэкъэнагъэр къыдгурэӏо, ау ари хэкӏыпӏэ иӏэн фае.
Excerpts from the Adyghe conversation between Claude.ai and M. Uğur Nemlioğlu, dated 12.01.2025, at “https://claude.ai/chat”.
This is due to quality issues in the dataset used as training data. The quality of a training dataset is an important factor in determining how much data is required for training. Consider the following example:
«Интернетымкӏэ джэнэ шӏуцӏэ горэ къэсщэфыгъ ау сэщ пае цӏыкӏу мэхъу.»
When we examine the “ӏ” characters in just one word of the example sentence, we must expect at least five different spellings if we want to identify all potential spellings of that word on the internet. For image processing and OCR[9]-style artificial intelligence training this is not a big problem, because the character can be identified by visual similarity. For text processing and understanding in natural language processing, however, this quality problem substantially increases the amount of data that must be processed, since the number of possible spellings multiplies with every palochka in a word.
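To make the growth concrete, the sketch below enumerates the spellings a single word can take when each palochka is independently replaced by one of the characters named in this article. The list `VARIANTS` is an assumption assembled from those characters (the two palochka forms plus four look-alikes), and `spelling_variants` is a hypothetical helper written for this illustration.

```python
from itertools import product

PALOCHKA_SMALL = "\u04CF"

# Assumed substitution set: small and capital palochka, Latin I, Latin l,
# digit one, and Byelorussian-Ukrainian I (all mentioned in the article).
VARIANTS = ["\u04CF", "\u04C0", "\u0049", "\u006C", "\u0031", "\u0406"]


def spelling_variants(word: str) -> set:
    """All spellings obtained by substituting each palochka independently."""
    slots = [VARIANTS if ch == PALOCHKA_SMALL else [ch] for ch in word]
    return {"".join(combo) for combo in product(*slots)}


# "шӏуцӏэ" from the example sentence contains two palochkas.
word = "ш\u04CFуц\u04CFэ"
print(len(spelling_variants(word)))  # 6 ** 2 = 36
```

With six candidate characters, a word with one palochka has 6 possible spellings, a word with two has 36, and a word with three has 216, so a model trained on unnormalized text has to learn many surface forms for the same word.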
Fundamental Problems in Artificial Intelligence Training Datasets
Companies investing in artificial intelligence frequently use data found on the internet to train the neural networks they develop. When the current presence of Adyghe on the internet is evaluated, it is clear that the quality of the textual data is problematic (Nemlioğlu, 2018, pp. 76-77). Keyboards designed for Adyghe and Kabardian are still not widely used. In this regard, a keyboard layout for the Windows operating system[10] was developed by the author and made available in 2018 (Nemlioğlu, 2018)[11]. In parallel, GBoard[12] for Android and Microsoft SwiftKey[13], which supports both Android and iOS, also support Adyghe and Kabardian. Although GBoard and SwiftKey offer only the uppercase “Cyrillic Letter Palochka (‘Ӏ’: U+04C0)” and not “Cyrillic Small Letter Palochka (‘ӏ’: U+04CF)”, they are the closest alternatives to the Adyghe keyboard for Windows. Since Adyghe distinguishes between uppercase and lowercase letters, keyboards must provide both the uppercase and the lowercase code point for this character as well (Nemlioğlu, 2018, p. 77). Using these keyboards when creating digital data will significantly improve data quality. However, despite the existence of these keyboards, the Adyghe-literate population continues its old habits and pays little attention to this matter. Therefore, while the amount of Adyghe digital content keeps growing, text entered without the correct keyboard keeps adding to the quality problems.
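The uppercase and lowercase palochka are a case pair in Unicode, which can be checked with ordinary string methods. The short sketch below shows this, together with a hypothetical helper, `palochka_counts`, that reports how many of each form a text contains. The sample string imitates the output of a keyboard that offers only U+04C0 and is an assumption made for illustration.

```python
PALOCHKA_CAPITAL = "\u04C0"  # Ӏ
PALOCHKA_SMALL = "\u04CF"    # ӏ


def palochka_counts(text: str) -> dict:
    """Count capital and small palochkas separately."""
    return {
        "capital": text.count(PALOCHKA_CAPITAL),
        "small": text.count(PALOCHKA_SMALL),
    }


# Standard case conversion maps the two code points onto each other.
print(PALOCHKA_CAPITAL.lower() == PALOCHKA_SMALL)
print(PALOCHKA_SMALL.upper() == PALOCHKA_CAPITAL)

# Hypothetical text typed on a keyboard that lacks U+04CF.
typed_with_capital_only = "адыгабзэк\u04C0э"
print(palochka_counts(typed_with_capital_only))
```

A text in which every palochka is capital, even inside lowercase words, is a likely sign that it was typed on a layout without the small form.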
Among the Mozilla applications developed entirely as open source, the Common Voice project stands out as a very good tool for preparing quality datasets for training artificial intelligence. Common Voice is a free, open-source platform for community-led data creation (Mozilla, 2025). It collects read-aloud recordings, improvised speech and text in over 200 languages, and Adyghe and Kabardian datasets are also being built on it. With the support of volunteers, the amount of Adyghe and Kabardian audio and text data is increasing day by day[14]. When these Adyghe and Kabardian texts are prepared, attention is paid to keyboard use, and the Windows keyboard layout mentioned above is accepted as the standard by the community. Through this site, which is still under development, literate native speakers contribute by donating their voices. In addition, new terms that were needed during the translation of the interface and were added to Adyghe are being tested for usability.
Conclusion and Recommendations
Artificial intelligence tools made possible by developing technology are undoubtedly the most useful tools for keeping endangered languages alive, or for recording them before they disappear completely. However, their correct training, cost-effectiveness and usability depend directly on the quality and abundance of the datasets prepared for them. Teaching a language to artificial intelligence is the most useful way both to ensure that the language is recorded and to increase the interest of younger generations in it. Those interested in the language will then be able to develop their own knowledge of it easily with support from artificial intelligence. The Mozilla Common Voice project, a very useful tool for preparing training datasets, also offers an important opportunity for Adyghe and Kabardian. It provides datasets that anyone can use free of charge for artificial intelligence training. The quality of the data provided by volunteers is also improved through training given by administrators on dedicated Telegram channels and social media accounts created for Adyghe and Kabardian[15]. Participants in the project improve their Adyghe pronunciation and vocabulary while also testing the usability of IT terms newly added to the language. In this project, the keyboard software for Windows mentioned above has been accepted by the community as the standard for Adyghe and Kabardian data entry, so the text content has been prepared at an appropriate quality. Keyboard layouts for other operating systems should be developed in line with this Windows layout.
The Common Voice project alone is not enough to improve the quality of digital data on the internet in favor of Adyghe and Kabardian. Those who are literate in Adyghe should use the correct keyboard in all their correspondence and in the digital content they create. As far as possible, the incorrect characters in existing digital content (e.g. news or community sites, or documents such as PDF and Word files) should be replaced with the characters “Cyrillic Letter Palochka (‘Ӏ’: U+04C0)” and “Cyrillic Small Letter Palochka (‘ӏ’: U+04CF)”. The Adyghe struggle for existence can succeed only if Adyghe people all over the world, especially those living in our republics in the Caucasus, pay attention to these issues.
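The replacement recommended above can be partly automated. The following is a heuristic sketch, not an established tool: it assumes that a look-alike character directly touching a Cyrillic letter is a misplaced palochka, that the capital form belongs only where the neighbouring letters are uppercase, and that the text is known to be Adyghe or Kabardian (other languages use U+0406 legitimately). The function name `normalize_palochka` and all rules are illustrative assumptions, and the output still needs human review, for example for sentence-initial capitals.

```python
import re

PALOCHKA_CAPITAL = "\u04C0"
PALOCHKA_SMALL = "\u04CF"
LOOKALIKES = "Il1\u0406"      # Latin I, Latin l, digit one, Cyrillic І
CYRILLIC = "\u0400-\u04FF"    # basic Cyrillic block

# A look-alike is a candidate only when it touches a Cyrillic letter.
PATTERN = re.compile(
    rf"(?<=[{CYRILLIC}])[{LOOKALIKES}]|[{LOOKALIKES}](?=[{CYRILLIC}])"
)


def _replacement(match):
    text, i = match.string, match.start()
    prev_ch = text[i - 1] if i > 0 else ""
    next_ch = text[i + 1] if i + 1 < len(text) else ""
    # Leave a digit one alone when it is part of a number such as "2021".
    if match.group() == "1" and (prev_ch.isdigit() or next_ch.isdigit()):
        return match.group()
    neighbours = [c for c in (prev_ch, next_ch) if c.isalpha()]
    # Assumption: capital palochka only in an all-uppercase context.
    if neighbours and all(c.isupper() for c in neighbours):
        return PALOCHKA_CAPITAL
    return PALOCHKA_SMALL


def normalize_palochka(text: str) -> str:
    """Replace palochka look-alikes that touch Cyrillic letters."""
    return PATTERN.sub(_replacement, text)


print(normalize_palochka("к1э") == "к\u04CFэ")
print(normalize_palochka("Intel 1 link"))
```

Latin text and standalone numbers are left unchanged, because the rule fires only next to Cyrillic letters.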
Bibliography
AIxploria Best AI tools for AI Chat & Assistant [Online] // AIxploria – Artificial intelligence for everyone. – 2025. – 7 02 2025. – https://www.aixploria.com/en/category/ai-chat-assistant/.
Anthropic Claude’s capabilities [Online] // https://www.anthropic.com/claude. – 2025. – 02 02 2025. – https://www.anthropic.com/claude.
Austin Peter K. and Sallabank Julia Endangered languages: an introduction [Journal] // Journal of Multilingual and Multicultural Development. – 2013. – 4 : Vol. 34. – pp. 313-316. – ISSN: 0143-4632.
Crystal David Dillerin Katli [Book] / trans. Cansız Gökhan. – İstanbul : Profil Yayıncılık, 2015. – 3rd edition. – ISBN:978-975-996-114-5.
Dayan Ali and Yılmaz Atınç [Journal] // DUJE (Dicle University Journal of Engineering). – [place of publication unknown] : Dicle Üniversitesi, 2022. – 3 : Vol. 13. – pp. 467-475.
Ebadi Bushra Technology Alone Can’t Preserve Endangered Languages [Online] // Centre for International Governance Innovation. – 30 06 2018. – 19 10 2022. – https://www.cigionline.org/articles/technology-alone-cant-preserve-endangered-languages/?utm_source=google_ads&utm_medium=grant&gclid=CjwKCAjw5P2aBhAlEiwAAdY7dDqy6yU0IRi6nLhTnkHWDF473yy5I1-hknUffbSW4-bTF91MjRrq8xoC2KwQAvD_BwE.
LeCun Yann, Bengio Yoshua and Hinton Geoffrey Deep learning [Journal] // Nature. – 27 05 2015. – Vol. 521. – pp. 436-444.
Moseley Christopher UNESCO Atlas of the World’s Languages in Danger [Online] // UNESCO. – 2010. – 5 March 2016. – http://www.unesco.org/languages-atlas/index.php?hl=en&page=atlasmap&lid=1064.
Mozilla Home Page [Online] // Mozilla Common Voice. – Mozilla, 2025. – 10 02 2025. – https://commonvoice.mozilla.org/.
Nemlioğlu M. Uğur Bilişim ve İnternette Adıge Dilinin Kullanımının Yaygınlaşmasının Sağlanması (Master’s Thesis) [Report] / Düzce Üniversitesi Sosyal Bilimler Enstitüsü. – Düzce : [publisher unknown], 2018. – Thesis No:530426.
Noack Rick and Gamio Lazaro The world’s languages, in 7 maps and charts [Online] // The Washington Post. – 23 04 2015. – 07 02 2023. – https://www.washingtonpost.com/news/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/.
Salminen Tapani Europe and The Caucasus [Book chapter] // Atlas of the World’s Languages in Danger / book author Moseley Christopher / ed. Moseley Christopher. – France : UNESCO Publishing, 2010. – 3rd edition. – ISBN 978-92-3-104096-2.
The Guardian DATABLOG Facts are sacred [Online] // theguardian. – 15 04 2011. – https://www.theguardian.com/news/datablog/2011/apr/15/language-extinct-endangered.
Turn Mark and Pine Adan Using technology to help revitalize indigenous languages [Online] // OUPblog Oxford University Press’s Academic Insights for the Thinking World. – 24 06 2019. – 07 02 2023. – https://blog.oup.com/2019/06/using-technology-help-revitalize-indigenous-languages/.
UNESCO International Congress of Endangered Languages at Risk begins in advance of UNESCO’s High-Level Event [Online] // UNESCO. – 21 April 2022. – 30 11 2022. – https://www.unesco.org/en/articles/international-congress-endangered-languages-risk-begins-advance-unescos-high-level-event.
Vural Z. Beril Akıncı and Bat Mikail Yeni Bir İletişim Ortamı Olarak Sosyal Medya: Ege Üniversitesi İletişim Fakültesine Yönelik Bir Araştırma [Journal] / ed. Örnek Prof. Dr. Mustafa Arslan. – İzmir : Yaşar Üniversitesi, September 2010. – 20 : Vol. 5. – ISSN:3348‐3382.
[1] PhD Student, Düzce University Graduate Education Institute Circassian Language and Literature PhD Program.
[2] NEMERKO Information Systems Ltd. General Manager. nemerko@nemerko.com
[3] Member of the Central Executive Board and Founding Member of the Adyghe (Circassian) Language and Literature Association.
[4] Assistant Professor, Düzce University, Faculty of Arts and Sciences, Department of Caucasian Languages and Cultures. saida@nemerko.com
[5] Facebook: https://www.facebook.com, X: https://x.com, LinkedIn: https://www.linkedin.com, Vk: https://vk.com, YouTube: https://www.youtube.com.
[6] Google Translate: https://translate.google.com, Bing Translator: https://www.bing.com/translator, Yandex Translator: https://translate.yandex.com.
[7] WhatsApp: https://www.whatsapp.com, Telegram: https://telegram.org, Skype: https://www.skype.com, Signal: https://signal.org, Messenger: https://www.messenger.com.
[8] OpenAI-ChatGPT: https://openai.com/, Anthropic-Claude.ai: https://www.anthropic.com/, DeepSeek: https://www.deepseek.com/, Microsoft Copilot: https://copilot.microsoft.com/, Google-Gemini: https://gemini.google.com/
[9] OCR: Optical Character Recognition.
[10] Along with Windows, keyboard software was also prepared for the Mac OS X operating system. However, since it has not been updated for newer versions of the operating system, it does not work on all Apple computers.
[11] https://www.nemerko.org/adyghe-circassian-keyboard-for-windows-7-11/
[12] https://play.google.com/store/apps/details?id=com.google.android.inputmethod.latin&hl=tr&pli=1
[13] https://www.microsoft.com/en-us/swiftkey
[14] https://commonvoice.mozilla.org/languages
[15] An initial catalogue has also been created for the training/study materials and social media addresses created for this project. “Мыщ дэжьым щегъажь: https://bit.ly/cv_circassian_start_here”.
VII International Scientific and Methodological Conference “Preservation and Development of the Native Language in the Context of Globalization: Modern Methods and Technologies”
20-21 February 2025
City of Maykop, Republic of Adygea
Russia
