Resources and Inspirations

This is a work-in-progress: please send us your suggestions to add to this space, in any language!

Community-led language initiatives

Kaytetyemoji emoji app

With only a handful of full speakers left, Kaytetye is a highly endangered language from Australia’s Northern Territory. The Kaytetyemoji app includes about 44 emojis, phrases, and audio pronunciations. It is a crucial community-driven multigenerational initiative to embed some Kaytetye language online.


Masakhane

Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans.

Tanan Ch’at’oh

Tanan Ch’at’oh is a Gwich’in language nest located in Fairbanks, Alaska. It is home to a community-based effort to revitalize the Gwich’in language, an endangered Arctic Indigenous language spoken by fewer than 800 people around the world.

Open knowledge language projects

Language Landscapes

A tool for mapping where languages are spoken around the world. Click on the markers to hear recordings of languages spoken in those locations.


Mukurtu

A free, mobile, and open-source platform built with Indigenous communities to manage and share digital cultural heritage. Mukurtu (MOOK-oo-too) is a grassroots project aiming to empower communities to manage, share, and exchange their digital heritage in culturally relevant and ethically minded ways. The project is committed to maintaining an open, community-driven approach to Mukurtu’s continued development. Its first priority is to help build a platform that fosters relationships of respect and trust.

OFDN Conversations

A podcast by the O Foundation. People are knowledge, and their voices speak of wisdom. Through OFDN Conversations, you can listen to the tenacity of communities making language, media and technology work for them, one inspiring individual at a time.

OLAC Language Archives Data visualization

This map shows the number of items available in archives participating in the Open Language Archives Community for any given language. The main aim of these archives is to make available information about the smaller languages of the world.

Rising Voices

Rising Voices, the outreach initiative of Global Voices, aims to bring new voices from communities speaking endangered or indigenous languages into the global conversation. It provides training, resources, microgrant funding, and mentoring to local underrepresented communities that want to tell their own digital story using participatory media tools. Rising Voices has also conducted research on indigenous language activists and their communities' challenges and needs with regard to digital safety and security.

TK labels

The TK and BC Labels are an initiative for Indigenous communities and local organizations. Developed through sustained partnership and testing within Indigenous communities across multiple countries, the Labels allow communities to express local and specific conditions for sharing and engaging in future research and relationships in ways that are consistent with existing community rules, governance and protocols for using, sharing and circulating knowledge and data. The TK Labels support the inclusion of local protocols for access to and use of cultural heritage that is digitally circulating outside community contexts.
The TK Labels identify and clarify community-specific rules and responsibilities regarding access and future use of traditional knowledge. This includes sacred and/or ceremonial material, material that has gender restrictions, seasonal conditions of use and/or materials specifically designed for outreach purposes. (Note: the labels may not cover all the many different and complex Indigenous contexts).

Language archives and libraries

Alaska Native Languages Archive (ANLA)

The Alaska Native Language Archive houses documentation of the various Native languages of Alaska and helps to preserve and cultivate this unique heritage for future generations. As the premier repository worldwide for information relating to the Native languages of Alaska, the Archive serves researchers, teachers and students, as well as members of the broader community.

Archive of the Indigenous Languages of Latin America (AILLA)

AILLA is a digital language archive of recordings, texts, and other multimedia materials in and about the indigenous languages of Latin America. AILLA’s mission is to preserve these materials and make them available to Indigenous Peoples, researchers, and other friends of these languages now and for generations to come.

California Language Archive (CLA)

The California Language Archive is an online catalog of indigenous language materials in archives at the University of California, Berkeley.

Computational Resource for South Asian Languages (CoRSAL)

The CoRSAL group supports archiving of audio, video, and text on the under-resourced languages of South Asia. CoRSAL engages in research at the intersection of language documentation, description, and information science.


DELAMAN

DELAMAN is an international network of archives of data on linguistic and cultural diversity, in particular on small languages and cultures under pressure.

Digital Himalaya

A project to develop digital collection, storage and distribution strategies for multimedia anthropological information from the Himalayan region.

Endangered Languages Project (ELP)

A repository of resources for endangered languages that also invites submissions.

Kaipuleohone (University of Hawaiʻi Digital Ethnographic Archive)

The digital language archive of the University of Hawaiʻi. Founded in 2008, the archive houses texts, images, audio, and video collected from around the world by linguists, anthropologists, ethnomusicologists, and more.

Language Archive Cologne (LAC)

A digital language archive providing access to world languages.

Open Language Archives Community (OLAC)

An international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.

Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)

A digital archive of records of some of the many small cultures and languages of the world. Their research group has developed models to ensure that the archive can provide access to interested communities and conforms to emerging international standards for digital archiving. They hold 14,500 hours of audio recordings and 2,000 hours of video recordings that might otherwise have been lost. These recordings are of performance, narrative, singing, and other oral tradition. This amounts to 150 terabytes, and represents 1,315 languages, mainly from the Pacific region.

Pangloss Collection

An open archive of endangered / under-documented languages.

Repository and Workspace for Austroasiatic Intangible Heritage at Lund University (RWAAI)

The Repository and Workspace for Austroasiatic Intangible Heritage (RWAAI) is a digital multimedia resource committed to the preservation of research collections documenting the languages and cultures of communities from the Austroasiatic language family of Mainland Southeast Asia and India.

Rosetta Project

The Rosetta Project is a global collaboration of language specialists and native speakers working to build a publicly accessible digital library of human languages.

SIL International Language and Culture Archives

The Archives contain works collected, compiled, or created by SIL, its strategic partners, or members of ethnolinguistic minority communities. Search and browse over 46,000 resources dating from 1935 to the present that describe, document, and/or communicate in the languages and cultures SIL serves.

Standing Rock Sioux Tribe Language and Culture Institute

Home to the Lakota and Dakota nations, the Standing Rock Sioux Tribe is committed to protecting the language, culture and well-being of its people through economic development, technology advancement, community engagement and education.

The Endangered Languages Archive (ELAR)

A digital repository holding materials for over 600 endangered languages recorded in over 70 countries, making them accessible and available for future generations.

The Language Archive at the Max Planck Institute for Psycholinguistics (TLA)

The Language Archive (TLA) is an integral part of the Max Planck Institute for Psycholinguistics in Nijmegen. It contains various types of materials, including: audio and video language corpus data from languages around the world; photographs, notes, experimental data, and other relevant information required to document and describe languages and how people use them; records of speech in everyday interactions in families and communities; naturalistic data from adult conversations from endangered and under-studied languages, and linguistic phenomena.

The Library of the American Philosophical Society (APS)

The Center for Native American and Indigenous Research (CNAIR) promotes innovative uses of the Library’s collections that benefit Indigenous communities and scholarship.

The Native American Languages Collection at the Sam Noble Museum of Natural History

The Native American Languages Collection is comprised of analog and digital print-based materials, audio recordings and video footage relating to the diverse languages of the Americas.

The World Oral Literature Project

An urgent global initiative to document and make accessible endangered oral literatures before they disappear without record.

Language documentation and revitalization

Center for Endangered Languages Documentation (CELD Papua)

The Center for Endangered Languages Documentation works with speech communities in Papua to document their languages and cultures; trains local linguists, students, and experts in state-of-the-art documentation techniques; and supports teachers, government agencies, artists, and activists in developing and using materials in local languages. It is committed to establishing sustainable structures to access linguistic and anthropological data from all over the world at the State University of Papua (UNIPA).

CIDLeS (Interdisciplinary Centre for Social and Language Documentation)

The Interdisciplinary Centre for Social and Language Documentation (CIDLeS) is a non-profit institution founded in January 2010 in Minde (Portugal) by a group of national and international researchers. CIDLeS aims to improve and deepen research in two linguistic areas: language documentation and linguistic typology. Besides the documentation, study and dissemination of European endangered and minority languages, CIDLeS is also engaged in the development of language technologies for scientific and didactic work on lesser-used languages. CIDLeS has three research groups (CIDLeS Media Lab, Language Documentation and Language Typology and Language Revitalization) whose projects are interrelated with the aim of fostering interdisciplinary research.

Fundación Tinigua

Fundación Tinigua is an organization from Colombia working to strengthen cultural and linguistic diversity, and to improve quality of life for ethnic groups and marginalized communities.

International Conference on Language Documentation & Conservation

ICLDC is a forum which regularly brings together linguists, students, and community activists to share resources and research and discuss issues of importance in documenting and revitalizing the world’s endangered languages.

Journal: Language Documentation and Conservation

A peer-reviewed, open-access journal sponsored by the National Foreign Language Resource Center and published exclusively in electronic form by the University of Hawaiʻi Press. We publish one volume per year with no fees either for contributors or for readers. We upload articles four times per year in a publish-on-acceptance model.

Journal: Language Documentation and Description

A free, open-access journal publishing articles on language, research, and book reviews.

Language on the Move

Language on the Move (ISSN 2203-5001) is a peer-reviewed sociolinguistics research site devoted to multilingualism, language learning, and intercultural communication in the contexts of globalization and migration. Language on the Move aims to disseminate sociolinguistic research to a broad global audience.

Living Languages

Living Languages supports Aboriginal and Torres Strait Islander people who are working to maintain, revitalize and reclaim their languages.

Living Tongues

Living Tongues’ mission is to ensure language survival for generations to come, by supporting speakers who are safeguarding their languages from extinction through activism, education, and technology. Their research teams document endangered languages and cultural practices, publish scientific studies, run digital training workshops to empower language activists, and collaborate with communities to create language resources that will serve as a basis for language revitalization.

Speaking Place

A social catalyst organisation that works on reviving endangered languages and on public health and justice. They use an approach called Community Self-Documentation, which adapts to each community and leads naturally to language and cultural revival.

The Endangered Language Alliance

Founded in 2010, the Endangered Language Alliance (ELA) is a non-profit dedicated to documenting Indigenous, minority, and endangered languages, supporting linguistic diversity in New York City and beyond.

The Endangered Languages Documentation Programme

Since 2002, ELDP has been dedicated to its mission of documenting and preserving endangered languages by funding documenters worldwide to conduct fieldwork, archive their documentary collections, and make them freely available. Every year, we provide between 30 and 40 grants for documentation projects around the globe.

The Endangered Languages Fund

ELF is a 501(c)(3) founded in 1996 with the goal of supporting endangered language preservation and documentation projects. Their main mechanism for supporting work on endangered languages has been funding grants to individuals, tribes, and museums. ELF’s grants have promoted work in over 60 countries and have funded a wide range of projects, from the development of indigenous radio programs in South Dakota, to recording the last living oral historian of the Shor language of western Siberia, to the establishment of orthographies and literacy materials to be used by endangered language teaching programs all over the world.

The Firebird Foundation

The unique oral literatures of indigenous peoples are rapidly being lost through the death of the traditional practitioners and through the schooling of the next generation. The Program for Oral Literature of the Firebird Foundation has initiated a project to fund the collection of this body of rapidly disappearing literature. This literature may consist of ritual texts, curative chants, epic poems, musical genres, folk tales, songs, myths, legends, historical accounts, life history narratives, word games, and so on.

The Foundation for Endangered Languages

The Foundation for Endangered Languages exists to support, enable and assist the documentation, protection and promotion of endangered languages.

What languages dominate the internet?

Rest of World turned to W3Techs, a web-scanning firm based in Austria, to count all of the publicly accessible web addresses on the internet to get data on language representation online. The result? “Millions of non-native English speakers and non-English speakers are stuck using the web in a language other than the one they were born into.”

The Institute on Collaborative Language Research (CoLang)

The Institute on Collaborative Language Research, known as CoLang, is a biennial gathering for people to learn about language documentation, descriptive linguistics, and language revitalization.

The Language Conservancy

The Language Conservancy stands as the foremost organization working with endangered languages in North America. They work daily on the ground in partnership with dozens of communities to revitalize their languages.


Wikitongues

Wikitongues safeguards language documentation, expands access to mother-tongue resources, and directly supports language revitalization projects.

Language technologies


BigScience

BigScience is an open collaboration bootstrapped by HuggingFace, GENCI and IDRIS, from which the world’s largest open multilingual language model, BLOOM, emerged. BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn’t been explicitly trained for, by casting them as text generation tasks.

Google Noto Font

When text is rendered by a computer, characters are sometimes displayed as “tofu”: little boxes that indicate your device doesn’t have a font to display the text.
Google has been developing a font family called Noto, which aims to support all languages with a harmonious look and feel. Noto is Google’s answer to tofu. The name Noto conveys the idea that Google’s goal is to see “no more tofu”. Noto has multiple styles and weights, and is freely available to all. The comprehensive set of fonts and tools used in its development is available in Google’s GitHub repositories.


Lesan

Machine Translation systems provide very accurate results for high-resource languages (such as English) but not for low-resource languages, due to a lack of datasets to build these systems. Lesan is a Machine Translation system for low-resource languages. It is freely available and currently supports translation to and from Amharic, Tigrinya, Oromo, Somali and English.

Lexicographical data on Wikidata: Words, words, words

Jens Ohlig, 25 March 2019. Wikimedia Germany blog.
Language is what makes our world beautiful, diverse, and complicated. Wikidata is a multilingual project, serving the more than 300 languages of the Wikimedia projects. This multilinguality at the core of Wikidata means that right from the start, every Item about a piece of knowledge in the world and every property to describe that Item can have a label in one of the languages we support, making Wikidata a polyglot knowledge base that speaks your language. Expanding Wikidata to deal with languages is an exciting new application.

Mozilla Common Voice

Mozilla Common Voice is an initiative to help teach machines how real people speak. The project’s goal is to help make voice recognition open, accessible and available for developers in widely spoken languages as well as those with a smaller population of speakers, often underserved by commercial speech recognition services.

Talking Dictionaries

A free language documentation tool created at Swarthmore College in 2005, which has since expanded to include 200+ endangered languages.

The State and Fate of Linguistic Diversity and Inclusion in the NLP World

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit Choudhury (2020).
Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the “language agnostic” status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.


Unicode

The technology standard for characters in language scripts, writing systems, and emojis. The Unicode Consortium is the non-profit that governs these standards.

Unicode CLDR

Language identification and language support for computer interfaces: an amazing set of resources and references about languages and language families, scripts, and their territories. http://cldr.unicode.org/
Extracted CLDR Supplemental Data
Territory-Language Information

Writing Across the World’s Languages: Deep Internationalization for Gboard, the Google Keyboard.

Daan van Esch, Elnaz Sarbar, Tamar Lucassen, Jeremy O’Brien, Theresa Breiner, Manasa Prasad, Evan Crew, Chieu Nguyen, Françoise Beaufays. Google, Mountain View, CA, USA. November 2019.
This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we have been adding support for hundreds of language varieties from around the globe. Many languages of the world are increasingly used in writing on an everyday basis, and we describe the trends we see. We cover technological and logistical challenges in scaling up a language technology product like Gboard to hundreds of language varieties, and describe how we built systems and processes to operate at scale. Finally, we summarize the key take-aways from user studies we ran with speakers of hundreds of languages from around the world.

BBC StoryWorks: Giving life to old languages in Australia

A multimedia story about the resurgence of Australia’s First languages.

Geographies of Digital Exclusion

The chart on digital participation by world regions was adapted from Mark Graham and Martin Dittus, Geographies of Digital Exclusion, published by Pluto Press in January 2021 in print and digital open-access versions.

Big Science

A year-long research workshop (2021-22) on large multilingual models and datasets.


Ethnologue

Eberhard, David M., Gary F. Simons, and Charles D. Fennig (eds.). 2021. Ethnologue: Languages of the World. Twenty-fourth edition. Dallas, Texas: SIL International.


Glottolog

Glottolog is a bibliographic database of the world’s lesser-known languages, developed and maintained first at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany (between 2015 and 2020 at the Max Planck Institute for the Science of Human History in Jena, Germany). Its main curators include Harald Hammarström and Martin Haspelmath. (Wikipedia)

Indian Languages — Defining India’s Internet

An April 2017 study by KPMG in India and Google.

Internet World Stats

A source for Internet usage and population statistics.

New report on Human Rights, Racial Equality and Emerging Digital Technologies: Mapping the Structural Threats

By the UCLA Center for Critical Internet Inquiry, with participant observations across a range of scenarios. https://twitter.com/c2i2_ucla


A source for statistics across a wide range of sectors and industries.

UNESCO Atlas of the World’s Languages in Danger

Print and digital atlas documenting the world’s languages that are on a continuum from being vulnerable to being critically endangered and extinct.


A website that provides statistics about web technologies.


A source for statistics on world populations.

Other language treasures!

Kusunda: Speak to Awaken

KUSUNDA is an interactive virtual reality (VR) experience about the sleeping Kusunda language in western Nepal.

Languages of New York City

A free and interactive digital map of the world’s most linguistically diverse metropolitan area.

Last Whispers

Last Whispers is a project about the mass extinction of languages. By definition, this extinction occurs in silence, since silence is the very form it takes. Last Whispers sounds what has gone silent. While we drown in the noise of our own voices — uttered within dominant cultures and languages — we are surrounded by a vast ocean of silence. Last Whispers aims to further awareness of linguistic extinction.

Linguistic Diversity and Social Justice: An Introduction to Applied Sociolinguistics.

Piller, I. (2016). Linguistic Diversity and Social Justice: An Introduction to Applied Sociolinguistics. Oxford University Press. Retrieved from https://oxford.universitypressscholarship.com/

The Smithsonian Mother Tongue Film Festival

Through digital storytelling, the festival amplifies the work of diverse practitioners who explore the power of language to connect the past, present, and future.

Toward a Wikipedia For and From Us All. In Wikipedia @ 20.

Vrana, A. G., Sengupta, A., & Bouterse, S. (2020). Toward a Wikipedia For and From Us All. In Wikipedia @ 20. Retrieved from https://wikipedia20.pubpub.org/pub/myb725ma

World Federation of the Deaf

The World Federation of the Deaf is an international non-profit and non-governmental organisation of deaf associations from 133 countries, working to ensure equal rights for 70 million deaf people around the globe.