Decolonizing Minority Language Technology

We see five people of all ages fishing together in a boat, with a cat playing with the fish in the water. Two elders are talking casually on the left side of the boat while they fish, and on the right side, a middle-aged person is teaching a child how to tie a hook onto their line. On the far right, a child has just caught a fish. Instead of a mast, the boat has an antenna with a glowing light on top. Glowing cables trail down from this antenna and connect to a light strip that wraps around the edges of the boat, as well as a glowing net that sits in the water. There are birds in the sky, flying close to two floating buildings off in the distance.
Illustration by Maggie Haughey


Whenever we use computers and smartphones, we make use of language technology (also known as NLP, from Natural Language Processing), even if we are not aware of it. We use it when we type with text prediction, when we use a search engine, or when we turn to automatic translation to get the gist of a passage written in another language. Language technology has crept into our lives without our even noticing. But technology is never neutral: it is developed by humans and reflects their mindset and culture. To what extent, then, can language technology be affected by colonial attitudes, and how does this affect the digital usability of minority languages? We will try to answer this question starting with the case of certain minority languages of the European Union. The European Union is home to over 60 indigenous regional or minority languages and to 24 official languages. Many of these minority languages are indigenous in that they developed on their territories in ancient times. This is the case, for example, for Sardinian in Italy, Occitan in France, Kashubian in Poland, or the Sámi language group in Norway and Finland. They were actively spoken and used until recent times, in many cases until the advent of the recent political construct of the Nation-State, which contributed actively to the eradication of languages other than the one chosen as the national, official language.

Until recently, the Internet and digital communication replicated and reinforced the dichotomy between official and non-official languages. Because software is costly to develop, interfaces and language-based Internet applications have been made available in the official languages only. The dominance of English means that even other official languages struggle to be adequately represented on the Internet: in 2012, research carried out by the META-NET Network of Excellence, which culminated in the publication of 30 Language White Papers, showed that 29 European languages were at risk of digital extinction because of insufficient language technology support.

The relationship between minority languages and technology can be described along at least three dimensions. The first is availability of technology: official languages tend to have the whole range of media, services, interfaces and apps available. Minority languages have far fewer opportunities: often not even a keyboard with the appropriate characters is available, not to mention more advanced technology such as machine translation [1] or speech recognition [2]. A clear picture of the technology available for minority languages is still missing.

A second aspect is usability of technology, when available. Admittedly, the Internet and the digital domain have a positive side for minority languages: their availability (especially in a hyper-connected Europe) and relatively low cost (compared to access to print, for instance) have made it possible to create content in languages other than the official, dominant ones. For interpersonal communication, people who use non-official languages at home have begun using them for messaging and chatting. Spontaneous movements have emerged, which in some cases have led to the development of dedicated software solutions. There is a keyboard for writing the Sámi language, you can access websites and read ebooks in Welsh, and there are video games in Breton and Corsican. However, availability does not imply that services, interfaces, apps and Wikipedias are actually used. Some studies reveal that minority language speakers switch easily to their dominant language when using language-based digital technologies, either because the technology is inherently better or because the range of services available is much wider. In the majority of cases, using a minority language requires a good amount of perseverance, will, and resilience, since the user experience is riddled with flaws and difficulties. It has been reported, for instance, that difficulties in typing due to the lack of specific keyboards can even lead people to avoid writing in chat apps altogether in favor of recording voice messages. A consistently difficult and painful experience in the everyday use of a language is very likely to lead an average user to abandon it in favor of another that has better support or gives access to more services and opportunities, especially if that other language is already part of the speaker’s multilingual competence.

The third aspect is how technology is developed for minority languages. We have observed that technology initiatives for minority languages tend to polarize around two extremes. On the one hand, technology and media are provided top-down by big companies, with little or no involvement of speakers’ communities. In this case, a patronizing approach can also be spotted: since very little is available, anything that is provided must be good and welcome by definition. Very often companies offer ready-made solutions without taking into account the real needs, desires, and expectations of minority language speakers. It is as if the assumption were that these speakers should be grateful for whatever product or opportunity is given to them, no matter whether it is actually interesting or relevant to their lives. A notable exception to this behavior is van Esch et al. (2019), who repeatedly stress the need for close collaboration with language speakers when planning the development of Natural Language Processing applications. On the other hand, there is the activists’ approach, which often gives rise to useful initiatives (the various localization experiences of Facebook, for instance, or Wikipedia projects). Though commendable, these initiatives tend to suffer from lack of coordination, little planning, and even less discoverability. This leads to a very serious problem for communities whose resources are limited: duplication of effort.

In order to decolonize language technology for minority languages, it is important to get a clearer picture of the extent to which minority languages are used over digital media, how frequently, and for what purposes. It is equally important to know about the obstacles that minority language speakers face when, or if, they try to use these languages: do they experience technical difficulties? Are they blocked by some kind of self-induced paranoia? As writing in a minority language is a kind of exposure to the outside world, do people refrain from it for fear of being mocked or stigmatized? Similarly, little is known about what minority language speakers want in terms of digital opportunities: what do they want or expect to be made available?

The DLDP Survey

With the aim of understanding the specific needs and the particular behaviour of speakers of these languages, in 2016 we carried out a survey focused on gathering information about their personal digital use of the language and about any digital resources and services known to them that make use of their language. The survey was part of the work carried out in the framework of the Digital Language Diversity Project (DLDP), a project funded under the European Erasmus+ funding scheme.

The main goal of the DLDP Survey was to inquire about the digital behavior, desires, and expectations of speakers of regional and minority languages. Secondarily, it aimed to gather evidence and information to feed the Digital Language Vitality Scale, one of the tools developed by the DLDP project for measuring a language’s digital presence and usability from a threefold perspective: the available infrastructure supporting digital use, the availability and usability of digital media, and the availability and usability of digital services. The survey was therefore designed around three main conceptual blocks. The first is the digital capacity of the language, i.e. whether the technological conditions for its digital use are in place, such as the availability of an internet connection or the possibility of typing the language. The second is the opportunity for making use of the language, in the form of available contexts and purposes for its digital use, such as digital media and services. The third is the speakers’ attitudes towards digital use of the language: whether it is felt to be desirable, what the underlying motivations for it are, and what the blocking factors are, if any. Particular attention was devoted to highlighting the possible problems encountered in using the language digitally.

We received feedback from more than 1300 speakers, who showed enthusiasm for the initiative and were thrilled by the interest shown in their native language. The project focused its attention on four languages: Breton, Basque, Sardinian, and Karelian. These languages are representative of different stages of digital development and have different degrees of institutional and community support. A short introductory description of the four languages is provided in Hernaiz and Berger (2017), Hicks (2017), Russo and Soria (2017) and Salonen (2017). To ensure maximal comparability and reusability, we developed the questionnaire template in English and then translated and localised it into the four languages.

Are minority languages used digitally?

The analysis of the data showed some interesting results. First, minority languages are indeed extensively used on the Internet, in particular for texting and chatting: 97% of Basque speakers, 94.5% of Breton speakers, 85% of Sardinian speakers and 74% of Karelian speakers report using the language online, mostly actively. Digital media appear to be available in all four languages considered. A group of questions concerned the use of the minority language for e-communication, for example for writing email, texting, chatting or other instant messaging via WhatsApp, Google chat, Snapchat, Skype, Facebook Messenger, etc. The data show extensive use of the minority languages for e-communication. Karelian speakers show a higher passive use of the language, but this is probably linked to their being older than the speakers of the other three languages.

Regional and minority language speakers have a strong desire to use their languages digitally, in all the sociolinguistic domains and for all the purposes where major languages are used. The search for normality, with the language being displayed on all media, including digital ones, is strong among all the languages considered. There is also wide awareness that these languages are fully able to function on digital media. Agreement is less clear-cut for the statement that the official language can be used with greater ease than the minority one. This is not only understandable, but it clearly reflects the situation of all those languages that compete with a major, often national, language for which digital opportunities are stronger and more varied. In this context, it is no surprise that people consider it “easier” to use the official language. This tendency is more evident for Breton and Sardinian than it is for Basque, which is probably a sign of the stronger position of Basque in the local society as a language of wider communication.

For what reasons are people not using the language?

What holds people back from using a language on digital media and devices? With the DLDP survey we tried to single out some of the possible reasons. The answer options provided in the questionnaire were modelled on the most common reasons reported in the literature on minority languages:

  • (real or perceived) lack of personal ability in writing the language, possibly due to the lack of a standardised orthography;
  • lack of technological and/or infrastructural support which makes the language’s digital use tiring, difficult, slow, impractical;
  • fear of being misunderstood or mocked; fear of offending others;
  • a “language stigma”, i.e. belief that the language is not to be used outside the private and spoken context.

For Basque and Breton, the percentage of respondents not using the language for digital purposes is very low, and therefore their responses will not be discussed here. For Sardinian, the two most recurrent motivations were the lack of written competence and the idea that Sardinian is a language not used for writing, but only for speaking. Other, clearly related responses were the unavailability of spell checkers and the fear of not being understood. In the case of Karelian, psychological factors tended to prevail over technological ones: a perceived lack of writing ability in the language and the fear of being teased or of causing offence are often reported as reasons for not using Karelian for e-communication.

People tend to excuse their perceived lack of competence in the orthography of the language by perpetuating the misrepresentation of the language as purely or mostly oral. Their lack of competence can be real, or some sort of self-censorship may apply. Sardinian speakers also report the lack of a standardised orthography, a well-known reality for the language. Apparently, outside stigma is never reported as a reason for not using the language. This could well be the case, with a self-induced stigma stronger than the one coming from the outside. But it could also be, as is frequent in qualitative questionnaires, that respondents tend to deny suffering from outside stigma and prefer to describe their lack of use of the language as a personal and independent choice.

What is required?

When a language lacks the full range of digital paraphernalia, it is tempting to offer it any product or service whatsoever. But what do we really know about the desires and needs of the speakers of those languages? The needs of a language that is little spoken, and mostly by elderly people, cannot be the same as those of a vibrant community where the language is institutionally supported and widespread among children as well. Ignoring those needs is not only incorrect, but also counterproductive: a technology that is provided too soon to a speakers’ community is likely not to be used. It is a waste of resources and may even further discourage people from using the language. The DLDP survey offered us a privileged view of the needs and expectations of four very different speakers’ communities. At the two extremes are Basque and Karelian: on one side a strong, widely spoken language with a committed community, and on the other a language that is scattered among isolated speakers, with few opportunities to speak the language outside familiar contexts. Their requirements can hardly be similar.

The requests of Basque speakers are clear and show self-confidence and knowledge of the digital possibilities of the language. The first request concerns reliable translation tools from and into English without having to pass through Spanish. A second concern is the limited availability of tools and interfaces translated and localised into Basque, such as Apple’s iOS or macOS. Translations need to be of good quality: an English or Spanish interface is preferable to an odd translation into Basque. Provision of digital products for children and youngsters is also strongly requested, as young people tend to consume videogames and use apps in English or Spanish only, and the lack of good-quality products in Basque targeting their age group is another factor that discourages them from using the language. This is also the main reason behind the need for localised Android and iOS systems: as smartphones are an integral part of everyone’s life, the absence of Basque on them can have important consequences for how young adults perceive the language.

Karelian speakers, instead, turn to digital technologies to look for opportunities to use the language or to learn it. According to their replies, even a Facebook group could be enough. This is consistent with the sociolinguistic context: Karelian is a non-territorial language that is now spoken in different parts of Finland, and overall intergenerational transmission was interrupted after WWII. Speakers of Karelian are eager to liaise and connect with other speakers, but in order to do so they need to overcome the diversity of variants and the lack of an agreed common standard. This is a requirement that comes before any digital use of the language but is strongly connected to it: speakers who are not confident in writing will hardly expose themselves on digital media, where communication still predominantly takes place in writing.

This view is also shared by speakers of Sardinian, a minority language spoken on the island of Sardinia, Italy. Despite being officially recognised, the language is still perceived as fragmented into multiple variants; an orthographic standard is available but not widely adopted. Therefore speakers stress the need for a stronger presence of the language in school and everyday public life and for a standardized orthography. It is interesting to note that Sardinian appears fairly well endowed with digital media and language processing software. As such, its potential is good and its speakers mostly need encouragement and support for using the language at all. Once the psychological barriers are removed, digital use is likely to come as a natural evolution. In such a situation, actions aimed at promoting knowledge of the existing opportunities for using the language digitally can be more useful than the development of a new technology.

For Breton, most of the respondents are aware of the existence of a Wikipedia in Breton, and some even contribute to it, by editing existing articles (19%) or writing new ones (8%). While the digital basics are firmly in place, the relative lack (or lack of awareness) of advanced services, apps and localised software stands out. At the same time, the respondents show a strong desire in this direction. For instance, automatic translation is almost completely lacking, except for the online translation from Breton to French offered by Ofis ar Brezhoneg [3]; Google Translate is not yet available for Breton. Indeed, if popular apps and key software interfaces are not provided in Breton soon, the language, unable to compete with French apps, will inevitably appear less appealing to the younger generations.

As a general remark, minority language speakers point out the lack of a reference site or facility where all the services, sites, applications but also movies, books, music, etc. available in the minority language are collected. The lack of information appears to be one of the main problems affecting the digital use of those languages: for instance, it is often the case that speakers are not even aware of the availability of a device, a site or a resource (as in the case of Wikipedia, for example).


“If we do not make it further in the digital world, many young people will reject Basque, and this is how it is today.” These are the words of a Basque speaker in a comment about the DLDP survey. Digital technologies are certainly an extremely important instrument for language revitalization and reclamation. However, in this brief essay, we want to stress the importance of close collaboration between industry and academia on the one side and speakers’ communities on the other. Minority language speakers do not need ready-made solutions: their precise needs and requirements must be listened to and accommodated in products that are tailored to those needs. The socio-linguistic contexts of the various minority languages can differ greatly, and so must the solutions that are provided. In the words of John Hobson: “The internet and digital world cannot save us. They cannot save Indigenous languages. Of course these things have benefits but they are not the Messiah. We don’t need another website or DVD or multimedia application, these are short term, quick fix solutions. What we really need is sustainable initiatives, to create opportunities for Indigenous language users to communicate with each other in their native tongue. To get people speaking again.” [4]

Note: The results of the survey (raw data) are available under a CC-BY 4.0 license and are deposited in the ILC4CLARIN Repository [5].

This article was written in Italian and translated into English by the author.


  1. Machine translation automatically translates a text from one language (the source language) into another (the target language). See https://en.wikipedia.org/wiki/Machine_translation

  2. Speech recognition is the technology that automatically transcribes speech by recognising words and phrases. See https://en.wikipedia.org/wiki/Speech_recognition

  3. The Public Office for the Breton Language; see: http://www.brezhoneg.bzh/

  4. John Hobson, University of Sydney, 16 May 2013. Source: http://au.artshub.com.au/news-article/news/arts/digital-not-always-the-answer-195370

  5. Soria, Claudia; Quochi, Valeria; Russo, Irene; et al., 2017, Digital Language Diversity Project Survey Data, ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics “A. Zampolli”, National Research Council, in Pisa, http://hdl.handle.net/20.500.11752/ILC-77