This is a work-in-progress: please send us your suggestions to add to this space, in any language!
Tanan Ch’at’oh is a Gwich’in language nest located in Fairbanks, Alaska. It is home to a community-based effort to revitalize the Gwich’in language, an endangered Arctic Indigenous language spoken by fewer than 800 people around the world.
Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans.
Open knowledge language projects
Rising Voices, the outreach initiative of Global Voices, aims to help bring new voices from new communities, including speakers of endangered or indigenous languages, to the global conversation by providing training, resources, microgrant funding, and mentoring to local underrepresented communities that want to tell their own digital story using participatory media tools.
The free, mobile, and open source platform built with Indigenous communities to manage and share digital cultural heritage. Mukurtu (MOOK-oo-too) is a grassroots project aiming to empower communities to manage, share, and exchange their digital heritage in culturally relevant and ethically-minded ways. They are committed to maintaining an open, community-driven approach to Mukurtu’s continued development. Their first priority is to help build a platform that fosters relationships of respect and trust.
The TK and BC Labels are an initiative for Indigenous communities and local organizations. Developed through sustained partnership and testing within Indigenous communities across multiple countries, the Labels allow communities to express local and specific conditions for sharing and engaging in future research and relationships in ways that are consistent with already existing community rules, governance and protocols for using, sharing and circulating knowledge and data. The TK Labels support the inclusion of local protocols for access to and use of cultural heritage that is digitally circulating outside community contexts.
The TK Labels identify and clarify community-specific rules and responsibilities regarding access and future use of traditional knowledge. This includes sacred and/or ceremonial material, material that has gender restrictions, seasonal conditions of use and/or materials specifically designed for outreach purposes. (Note: the labels may not cover all the many different and complex Indigenous contexts.)
A tool for mapping where languages are spoken around the world. Click on the markers to hear recordings of languages spoken in those locations.
This map shows the number of items available in archives participating in the Open Language Archives Community for any given language. The main aim of these archives is to make available information about the smaller languages of the world.
Language archives and libraries
DELAMAN is an international network of archives of data on linguistic and cultural diversity, in particular on small languages and cultures under pressure.
AILLA is a digital language archive of recordings, texts, and other multimedia materials in and about the indigenous languages of Latin America. AILLA’s mission is to preserve these materials and make them available to Indigenous Peoples, researchers, and other friends of these languages now and for generations to come.
The Alaska Native Language Archive houses documentation of the various Native languages of Alaska and helps to preserve and cultivate this unique heritage for future generations. As the premier repository worldwide for information relating to the Native languages of Alaska, the Archive serves researchers, teachers and students, as well as members of the broader community.
The California Language Archive is an online catalog of indigenous language materials in archives at the University of California, Berkeley.
A digital repository holding materials for over 600 endangered languages recorded in over 70 countries, making them accessible and available for future generations.
The digital language archive of the University of Hawaiʻi. Founded in 2008, the archive houses texts, images, audio, and video collected from around the world by linguists, anthropologists, ethnomusicologists, and more.
The Native American Languages Collection is comprised of analog and digital print-based materials, audio recordings and video footage relating to the diverse languages of the Americas.
A digital archive of records of some of the many small cultures and languages of the world. Their research group has developed models to ensure that the archive can provide access to interested communities, and conforms with emerging international standards for digital archiving. They hold 14,500 hours of audio recordings and 2,000 hours of video recordings that might otherwise have been lost. These recordings are of performance, narrative, singing, and other oral tradition. This amounts to 150 terabytes, and represents 1,315 languages, mainly from the Pacific region.
The Repository and Workspace for Austroasiatic Intangible Heritage (RWAAI) is a digital multimedia resource committed to the preservation of research collections documenting the languages and cultures of communities from the Austroasiatic language family of Mainland Southeast Asia and India.
The Archives contains works collected, compiled, or created by SIL, its strategic partners, or members of ethnolinguistic minority communities. Search and browse over 46,000 resources dating from 1935 to the present that describe, document, and/or communicate in the languages and cultures SIL serves.
The Language Archive (TLA) is an integral part of the Max Planck Institute for Psycholinguistics in Nijmegen. It contains various types of materials, including: audio and video language corpus data from languages around the world; photographs, notes, experimental data, and other relevant information required to document and describe languages and how people use them; records of speech in everyday interactions in families and communities; naturalistic data from adult conversations from endangered and under-studied languages, and linguistic phenomena.
The Center for Native American and Indigenous Research (CNAIR) promotes innovative uses of the Library’s collections that benefit Indigenous communities and scholarship.
An urgent global initiative to document and make accessible endangered oral literatures before they disappear without record.
The CoRSAL group supports archiving of audio, video, and text on the under-resourced languages of South Asia. CoRSAL engages in research at the intersection of language documentation, description, and information science.
A project to develop digital collection, storage and distribution strategies for multimedia anthropological information from the Himalayan region.
A digital language archive providing access to world languages.
An open archive of endangered / under-documented languages.
The Rosetta Project is a global collaboration of language specialists and native speakers working to build a publicly accessible digital library of human languages.
Home to the Lakota and Dakota nations, the Standing Rock Sioux Tribe is committed to protecting the language, culture and well-being of its people through economic development, technology advancement, community engagement and education.
A repository of resources for endangered languages that also invites submissions.
An international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.
Language documentation and revitalization
Living Languages supports Aboriginal and Torres Strait Islander people who are working to maintain, revitalize and reclaim their languages.
Language on the Move (ISSN 2203-5001) is a peer-reviewed sociolinguistics research site devoted to multilingualism, language learning, and intercultural communication in the contexts of globalization and migration. Language on the Move aims to disseminate sociolinguistic research to a broad global audience.
Wikitongues safeguards language documentation, expands access to mother-tongue resources, and directly supports language revitalization projects.
ICLDC is a forum which regularly brings together linguists, students, and community activists to share resources and research and discuss issues of importance in documenting and revitalizing the world’s endangered languages.
Founded in 2010, the Endangered Language Alliance (ELA) is a non-profit dedicated to documenting Indigenous, minority, and endangered languages, supporting linguistic diversity in New York City and beyond.
A social catalyst organisation that participates in reviving endangered languages and public health and justice. They use an approach called Community Self-Documentation which adapts to each community and leads naturally to language and cultural revival.
The Language Conservancy stands as the foremost organization working with endangered languages in North America. They work daily on the ground in partnership with dozens of communities to revitalize their languages.
The Institute on Collaborative Language Research, known as CoLang, is a biennial gathering for people to learn about language documentation, descriptive linguistics, and language revitalization.
The Interdisciplinary Centre for Social and Language Documentation (CIDLeS) is a non-profit institution founded in January 2010 in Minde (Portugal) by a group of national and international researchers. CIDLeS aims at improving and deepening research in two linguistic areas: language documentation and linguistic typology. Besides the documentation, study and dissemination of European endangered and minority languages, CIDLeS is also engaged in the development of language technologies for scientific and didactic work on lesser-used languages. CIDLeS has three research groups (CIDLeS Media Lab, Language Documentation and Language Typology and Language Revitalization) whose projects are interrelated with the aim of fostering interdisciplinary research.
Living Tongues’ mission is to ensure language survival for generations to come, by supporting speakers who are safeguarding their languages from extinction through activism, education, and technology. Their research teams document endangered languages and cultural practices, publish scientific studies, run digital training workshops to empower language activists, and collaborate with communities to create language resources that will serve as a basis for language revitalization.
The unique oral literatures of indigenous peoples are rapidly being lost through the death of the traditional practitioners and through the schooling of the next generation. The Program for Oral Literature of the Firebird Foundation has initiated a project to fund the collection of this body of rapidly disappearing literature. This literature may consist of ritual texts, curative chants, epic poems, musical genres, folk tales, songs, myths, legends, historical accounts, life history narratives, word games, and so on.
The Foundation for Endangered Languages exists to support, enable and assist the documentation, protection and promotion of endangered languages.
ELF is a 501(c)3 founded in 1996 with the goal of supporting endangered language preservation and documentation projects. Their main mechanism for supporting work on endangered languages has been funding grants to individuals, tribes, and museums. ELF’s grants have promoted work in over 60 countries and have funded a wide range of projects, from the development of indigenous radio programs in South Dakota, to recording of the last living oral historian of the Shor language of western Siberia, to the establishment of orthographies and literacy materials to be used by endangered language teaching programs all over the world.
Since 2002, ELDP has been dedicated to its mission to document and preserve endangered languages by funding documenters worldwide to conduct fieldwork, archive their documentary collections, and make them freely available. Every year, ELDP provides between 30 and 40 grants for documentation projects around the globe.
A free, open-access journal publishing articles on language, research, and book reviews.
A peer-reviewed, open-access journal sponsored by the National Foreign Language Resource Center and published exclusively in electronic form by the University of Hawaiʻi Press. We publish one volume per year with no fees either for contributors or for readers. We upload articles four times per year in a publish-on-acceptance model.
The Center for Endangered Languages Documentation works with speech communities in Papua in documenting their language and their culture; trains local linguists, students, and experts in state of the art documentation techniques; supports teachers, government agencies, artists, and activists in developing and using materials in local languages, and is committed to establishing sustainable structures to access linguistic and anthropological data from all over the world at the State University of Papua (UNIPA).
Fortalecer la diversidad étnica y lingüística, mejorando las condiciones de vida de los grupos étnicos y las comunidades marginalizadas.
To strengthen cultural and linguistic diversity, and to improve quality of life for ethnic groups and marginalized communities.
A free language documentation tool created at Swarthmore College in 2005, which has since expanded to include 200+ endangered languages.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit Choudhury (2020), “The State and Fate of Linguistic Diversity and Inclusion in the NLP World”
Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the “language agnostic” status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.
Daan van Esch, Elnaz Sarbar, Tamar Lucassen, Jeremy O’Brien, Theresa Breiner, Manasa Prasad, Evan Crew, Chieu Nguyen, Françoise Beaufays. Google, Mountain View, CA, USA. November 2019.
This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we have been adding support for hundreds of language varieties from around the globe. Many languages of the world are increasingly used in writing on an everyday basis, and we describe the trends we see. We cover technological and logistical challenges in scaling up a language technology product like Gboard to hundreds of language varieties, and describe how we built systems and processes to operate at scale. Finally, we summarize the key take-aways from user studies we ran with speakers of hundreds of languages from around the world.
Beautiful and free fonts for all languages.
When text is rendered by a computer, sometimes characters are displayed as “tofu”. These are little boxes that indicate your device doesn’t have a font to display the text.
Google has been developing a font family called Noto, which aims to support all languages with a harmonious look and feel. Noto is Google’s answer to tofu. The name noto is to convey the idea that Google’s goal is to see “no more tofu”. Noto has multiple styles and weights, and is freely available to all. The comprehensive set of fonts and tools used in our development is available in our GitHub repositories.
Released over 75 new and updated font families, built using the open source pipeline and sources. Read more updates.
The technology standard for characters in language scripts, writing systems, and emojis. Unicode Consortium is the non-profit that governs these standards.
Language identification and language support for computer interfaces: an amazing set of resources and references about languages and language families, scripts, and their territories.
Extracted CLDR Supplemental Data
Mozilla Common Voice is an initiative to help teach machines how real people speak. The project’s goal is to help make voice recognition open, accessible and available for developers in widely spoken languages as well as in those with a smaller population of speakers, which are often underserved by commercial speech recognition services.
Lexicographical data on Wikidata: Words, words, words
Jens Ohlig, 25 March 2019. Wikimedia Germany blog.
Language is what makes our world beautiful, diverse, and complicated. Wikidata is a multilingual project, serving the more than 300 languages of the Wikimedia projects. This multilinguality at the core of Wikidata means that right from the start, every Item about a piece of knowledge in the world and every property to describe that Item can have a label in one of the languages we support, making Wikidata a polyglot knowledge base that speaks your language. Expanding Wikidata to deal with languages is an exciting new application.
Glottolog is a bibliographic database of the world’s lesser-known languages, developed and maintained first at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany (between 2015 and 2020 at the Max Planck Institute for the Science of Human History in Jena, Germany). Its main curators include Harald Hammarström and Martin Haspelmath. (Wikipedia)
Eberhard, David M., Gary F. Simons, and Charles D. Fennig (eds.). 2021. Ethnologue: Languages of the World. Twenty-fourth edition. Dallas, Texas: SIL International.
By the UCLA Center for Critical Internet Inquiry, with participant observations across a range of scenarios. https://twitter.com/c2i2_ucla
A year-long research workshop (2021-22) on large multilingual models and datasets.
A source for Internet usage and population statistics.
A source for statistics across a wide range of sectors and industries.
A source for statistics on world populations.
Print and digital atlas documenting the world’s languages that are on a continuum from being vulnerable to being critically endangered and extinct.
A website that provides statistics about web technologies.
An April 2017 study by KPMG in India and Google.
Other language treasures!
An immersive VR experience.
A free and interactive digital map of the world’s most linguistically diverse metropolitan area.
A multi-media story about the resurgence of Australia’s First languages.
The World Federation of the Deaf is an international non-profit and non-governmental organisation of deaf associations from 133 countries working to ensure equal rights for 70 million people around the globe.
Linguistic Diversity and Social Justice: An Introduction to Applied Sociolinguistics.
Piller, I. (2016). Linguistic Diversity and Social Justice: An Introduction to Applied Sociolinguistics. Oxford University Press. Retrieved from https://oxford.universitypressscholarship.com/
Toward a Wikipedia For and From Us All. In Wikipedia @ 20.
Vrana, A. G., Sengupta, A., & Bouterse, S. (2020). Toward a Wikipedia For and From Us All. In Wikipedia @ 20. Retrieved from https://wikipedia20.pubpub.org/pub/myb725ma
Geographies of Digital Exclusion
The chart on digital participation by world regions was adapted from Mark Graham and Martin Dittus, Geographies of Digital Exclusion, published by Pluto Press in January 2021 in print and digital open-access versions.
Through digital storytelling, the festival amplifies the work of diverse practitioners who explore the power of language to connect the past, present, and future.
Last Whispers is a project about the mass extinction of languages. By definition, this extinction occurs in silence, since silence is the very form it takes. Last Whispers sounds what has gone silent. While we drown in the noise of our own voices — uttered within dominant cultures and languages — we are surrounded by a vast ocean of silence. Last Whispers aims to further awareness of linguistic extinction.