The language geography of Wikipedia

Introduction

In our report on the state of the internet’s languages we are reviewing interface language support by major platforms, and content support for major languages on the commercial mapping platform Google Maps. To complement these perspectives we now also want to look at the languages of Wikipedia, the largest collaborative effort in human history. Wikipedia is an early participant in the global expansion of online knowledge production: it began with a single English-language edition more than two decades ago, and now offers more than 300 language editions. Our platform survey has shown that this places it at the forefront of interface language support – Wikipedia’s user interface has been translated into more languages than any of the commercial platforms we looked at, including Google and Facebook.

However, this does not mean that speakers of these 300 languages get access to the same content. Instead, as discussed by multiple essays in our report, Wikipedia’s coverage in some languages is better than in others. We want to look at this empirically, using maps and charts to analyse Wikipedia’s content coverage in more detail. How good is its content coverage across its language editions? Are some languages better represented than others? Does this mean that certain language populations have access to more content than others?

We will first discuss our methodology and data sources. We then look at the content volumes of major languages on Wikipedia and review whether the world is represented equally across these languages. We then ask whether any differences we encounter may be the result of over- or underrepresentation of certain languages and global regions. Finally, we will look at recent trends – how has Wikipedia’s global coverage changed over time?

Data sources and methodology

Data about Wikipedia

The Wikipedia article “List of Wikipedias” offers basic information and general statistics about the 300 Wikipedia language editions.

Part of our analysis relies on data dumps by the Wikimedia Foundation, i.e. downloadable data sets that capture basic information about the articles and edit activity of the various Wikipedia language editions. We computed article counts and content growth numbers based on a data dump from early 2018. Our analyses of the geographic distribution of Wikipedia content rely on a data set of Wikipedia geotags from early 2018. Geotags are a standardised annotation scheme to embed geographic references (such as coordinates) within Wikipedia articles. They are not necessarily intended to be read by humans, but rather are meant to assist in the automated organisation and presentation of information, for example to display maps within an article.

Surface area and population density

Data about the surface area and population of global regions is provided by the World Bank as part of their World Development Indicators data set, at country resolution.

For higher-resolution estimates of population density we rely on the Global Human Settlement Layer (GHSL) provided by the European Commission. We aggregated this data in a regular grid in order to produce the population-normalised map of Wikipedia content density.

Language populations

Population estimates for the speaker populations of today’s active languages are provided by Ethnologue, a global survey of thousands of active languages (Eberhard, Simons, and Fennig 2020¹).

For our analysis of local-language content we rely on a dataset on territory–language information by the Unicode Common Locale Data Repository (CLDR), which provides country-level estimates of language populations (Unicode 2020²). It also catalogues languages with official status at regional and national level.

Analysis of local-language content

We derive a collection of “local” languages from the Unicode CLDR dataset. For every country, we define this as the set of languages that are either classified as an official language, or that are in use by at least 30% of the population.

For every country, we then identify the most prevalent local language: the local language with the largest population share. We identified 73 such languages, each of which is the most prevalent in at least one country. English is the most widespread, and is the most prevalent language in 34 countries. It is followed by Arabic and Spanish (18 countries each), French (13 countries), Portuguese (seven countries), German (four countries), and Dutch (three countries). Traditional Chinese, Italian, Malay, Romanian, Greek, and Russian are each the most prevalent language in two countries. The remaining 60 languages are most prevalent in a single country.
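The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LanguageUse` record, its field names, and the placeholder language codes are hypothetical and do not reflect the actual layout of the CLDR data; only the 30% threshold and the "official or widely used" rule come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class LanguageUse:
    """Hypothetical record: one language as used in one country."""
    language: str
    population_share: float  # fraction of the country's population, 0.0 to 1.0
    official: bool


def _local_records(records: Iterable[LanguageUse], threshold: float) -> List[LanguageUse]:
    # A language counts as local if it is official OR meets the share threshold.
    return [r for r in records if r.official or r.population_share >= threshold]


def local_languages(records: Iterable[LanguageUse], threshold: float = 0.30) -> Set[str]:
    """Return the set of local languages for one country."""
    return {r.language for r in _local_records(records, threshold)}


def most_prevalent_local_language(
    records: Iterable[LanguageUse], threshold: float = 0.30
) -> Optional[str]:
    """Return the local language with the largest population share, if any."""
    local = _local_records(records, threshold)
    if not local:
        return None
    return max(local, key=lambda r: r.population_share).language


# Invented placeholder values, for illustration only.
example = [
    LanguageUse("aa", 0.55, official=False),
    LanguageUse("bb", 0.10, official=True),
    LanguageUse("cc", 0.05, official=False),
]
print(local_languages(example))
print(most_prevalent_local_language(example))
```

In this sketch, ties between languages with equal shares are resolved by input order, which is an arbitrary choice.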

We compare this language distribution with a proxy measure of the global distribution of Wikipedia content. For every country, we identify the Wikipedia language edition with the largest number of articles about that country. This data set comprises a considerably smaller set of 35 unique languages, reflecting the bias towards English-language content. English is the dominant Wikipedia language in 98 countries, followed by French (nine countries), German (eight countries), Spanish (seven countries), Catalan and Russian (four countries each), Italian and Serbian (three countries each), and Dutch, Greek, Arabic, Serbo-Croatian, Swedish, and Romanian (two countries each). The remaining 21 Wikipedia languages are most prevalent in a single country.
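As a rough illustration of this proxy measure, the sketch below counts geotagged articles per country and language edition and picks the edition with the most articles. It assumes the geotag data has already been reduced to `(edition_language, country_code)` pairs, one per geotagged article; that input shape and the function name are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def dominant_edition_by_country(
    geotags: Iterable[Tuple[str, str]]
) -> Dict[str, str]:
    """Map each country to the language edition with the most articles about it.

    `geotags` yields (edition_language, country_code) pairs,
    one pair per geotagged article (hypothetical input layout).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for language, country in geotags:
        counts[country][language] += 1
    # most_common(1) returns the single highest-count (language, n) pair.
    return {
        country: per_language.most_common(1)[0][0]
        for country, per_language in counts.items()
    }


# Invented placeholder pairs, for illustration only.
pairs = [("en", "XX"), ("en", "XX"), ("fr", "XX"), ("fr", "YY")]
print(dominant_edition_by_country(pairs))
```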

Wikipedia’s language geography

Wikipedia offers us some high-level information about its 300 language editions in its List of Wikipedias. This overview gives us a first sense of Wikipedia’s linguistic breadth, but it also illustrates that Wikipedia’s language editions vary widely in scale, both in number of articles and in the size of their editor communities. English Wikipedia is the largest by far, with more than six million articles and almost 40 million registered contributors. The next-largest contributor communities are the Spanish, German and French Wikipedia editions, each with between four and six million contributors, and around two million articles. By comparison, most of the remaining language editions are rather small: only around 20 language editions have more than one million articles, and only 70 have more than 100,000 articles. In other words, the typical Wikipedia language edition has only a small fraction of the content that is found in English Wikipedia.

To what extent do these differences in coverage and editor community size relate to the sizes of the respective global language communities? Figure 1 compares the amount of Wikipedia content in the 10 most widely spoken languages with the number of people speaking these languages, including second-language speakers, as estimated by Ethnologue. Using these speaker numbers as a reference point, we can see that content volumes in European languages such as English, French, Spanish, Russian, and Portuguese are roughly proportional to the number of speakers. This makes some sense: it suggests that for certain languages, larger language communities are able to produce a larger number of Wikipedia articles. However, we can also see that other widely spoken languages are relatively underrepresented. In particular, Mandarin Chinese, Hindi, Arabic, Bengali, and (Bahasa) Indonesian are each spoken by hundreds of millions of people, yet their Wikipedia editions contain far fewer articles than the editions in European languages. There are more articles in the French, Spanish or Portuguese Wikipedias than there are in the Chinese, Hindi or Modern Standard Arabic versions, although some of the latter represent a much larger population of speakers. In other words, certain languages are much better represented on Wikipedia than others, even when accounting for the sizes of their language populations. (As an aside, this distribution is surprisingly similar to the language support offered by Google Maps, as shown in Figure 5 of our Google Maps survey.)

Figure 1: Wikipedia content and number of speakers for the 10 most widely spoken languages in the world. (Population estimate: Ethnologue 2019, which includes second-language speakers.)
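One simple way to make this comparison concrete is to express each edition's size as articles per million speakers. The sketch below is an assumption about how such a normalisation could be computed, not a description of how the chart was produced; the function names and the placeholder figures are invented for illustration.

```python
from typing import Dict, List, Tuple


def articles_per_million_speakers(articles: int, speakers: int) -> float:
    """Number of articles for every one million speakers of a language."""
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    return articles * 1_000_000 / speakers


def rank_by_representation(
    stats: Dict[str, Tuple[int, int]]
) -> List[Tuple[str, float]]:
    """Sort languages from best to worst represented.

    `stats` maps a language name to (article_count, speaker_count).
    """
    ratios = {
        language: articles_per_million_speakers(articles, speakers)
        for language, (articles, speakers) in stats.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Invented placeholder figures, not real edition sizes or speaker counts.
toy = {
    "language A": (2_000_000, 500_000_000),
    "language B": (100_000, 400_000_000),
}
for language, ratio in rank_by_representation(toy):
    print(f"{language}: {ratio:.0f} articles per million speakers")
```

A ratio of this kind only measures volume. It says nothing about article length or quality, which the article counts used in this report do not capture either.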

Does this affect how the world is represented on Wikipedia in these languages? What kind of knowledge is present or absent as a result of these differences? There are many ways to assess the content contained in Wikipedia. In this report we want to approach it from the perspective of geography, or more specifically information geography: we want to ask which parts of the world are represented on Wikipedia, and which are absent. To do so, we will use geotags to count the number of articles that have been written about particular places in the world in particular languages.

The content distribution maps in Figures 2 and 3 show the information geographies of some of the largest Wikipedia language editions. We can see in Figure 2 that English Wikipedia arguably covers the world: that is, a great number of the world’s places are being written about in English. At the same time, we can also see that the English Wikipedia places a strong emphasis on places in Europe and North America. By contrast, countries in Central and South America, Africa, and Asia are comparatively content-sparse, with some exceptions – for example, Japan has good coverage.

On the map for Spanish Wikipedia in Figure 3 we can see an even more constrained language geography. While it still covers large parts of the world, it is largely focused on Western Europe, North America, and Central and South America. There is comparatively little Spanish-language coverage of Africa and Asia, with the exception of parts of East Asia and the Pacific. In other words, these distributions suggest that content coverage in the Spanish Wikipedia appears to focus specifically on the global regions in which Spanish is more widely spoken.

We can see a similar pattern for Arabic Wikipedia, though coverage in this language is even more constrained to particular regions, likely also because the Arabic-language Wikipedia editing community is smaller than the English and Spanish-language communities. Here, coverage is once again focused on Europe and North America, but in particular also North Africa and the Middle East – countries where Arabic is most widely spoken.

The maps for the Bengali and Hindi Wikipedias continue this pattern: they have even lower overall coverage, which reflects their even smaller contributor communities. The Wikipedia content in these languages appears to be largely focused on South Asia, especially India and Bangladesh. Compared to the other maps, these languages are significantly more sparse in content, despite being major global languages spoken by hundreds of millions of people. As a result, we consider this to be a major coverage gap. But Wikipedia is not alone in this – its comparatively low coverage of content in Bengali and Hindi closely resembles the sparse content coverage in these languages on Google Maps.

Overall we can see a number of basic repeating patterns across these maps. Firstly, coverage is often concentrated on places in Europe and North America, possibly in part due to the English-language origins of the initial Wikipedia community. Secondly, the relative geographic focus of a Wikipedia language edition is somewhat dependent on the geography of its contributor community – certain languages have more Wikipedia contributors than others, and all language editions tend to have more content about places where the language is spoken. However, there are also instances of significant coverage gaps, and overall we can see some significant coverage inequalities between the languages. In particular we refer to the Bengali and Hindi Wikipedia editions (Figure 3), representing two major languages spoken by more than 100 million speakers each, but each with relatively low content coverage on Wikipedia.

Figure 2: The information density of English Wikipedia in early 2018. Darker shading indicates a greater number of geotagged articles.

Figure 3: The information density of the Arabic, Bengali, Hindi and Spanish Wikipedias in early 2018. Darker shading indicates a greater number of geotagged articles.

Underrepresented world regions

While the examples we have looked at reveal some striking differences, some of the more general conclusions we can draw from them have been known for some time. We have been mapping Wikipedia’s content coverage for almost a decade, starting with our first maps of English Wikipedia in 2013. Back then we already found that it covered many parts of the world, and – like many – we were awed by the vast amounts of human knowledge contributed by its volunteer editors. However, in a separate study we also found that Wikipedia’s coverage is highly uneven: certain parts of the world such as Europe and North America were covered in great detail, while other parts were much less well-covered. In our more recent maps of Wikipedia from 2018 we can see that Wikipedia’s content has grown more than tenfold, and that its coverage of the Global South has much improved. Still, we see some striking differences in coverage between global regions when we look at Wikipedia as a whole – and as we have already seen above, these become even more striking when we look at individual language editions.

However, we can’t expect all parts of the world to be represented equally – for example, certain regions of the world are much more densely populated than others, and we would expect such places to be associated with more Wikipedia content. So rather than simply asking which parts of the world are represented, we also want to ask which parts are potentially over- or underrepresented, relative to some external expectations. We can compare numbers of Wikipedia articles with two basic reference points: surface area (how much space there is), and population density (how many people live there).

The map in Figure 4 shows Wikipedia’s content density relative to the global population. It visualises the number of Wikipedia articles about places in the world, aggregated across all language editions, and normalises it by the size of the local population. We use the high-resolution population estimate of the GHSL as a reference, and segment space into even-sized hexagonal grid tiles rather than showing individual locations of articles. This makes it easier to compare the relative content density between places. On this map, darker shading signifies greater representation of a region relative to its population density, and light shading less representation. A map that perfectly reflects the global population distribution would be an even tone throughout.
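The normalisation described above can be sketched as follows. Note the simplifications: this example bins coordinates into square latitude/longitude cells rather than the hexagonal tiles used for the map, such cells are not equal in area, and cells without population are skipped because the ratio is undefined there. The function names and the shape of `population_by_cell` are assumptions for this illustration, and how the map itself treats unpopulated tiles is not specified here.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]


def cell_of(lat: float, lon: float, size_deg: float = 1.0) -> Cell:
    """Index of the square grid cell that contains a coordinate."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def articles_per_million_people(
    article_coords: Iterable[Tuple[float, float]],
    population_by_cell: Dict[Cell, float],
    size_deg: float = 1.0,
) -> Dict[Cell, float]:
    """Geotagged articles per million inhabitants, for each populated cell."""
    counts = Counter(cell_of(lat, lon, size_deg) for lat, lon in article_coords)
    density: Dict[Cell, float] = {}
    for cell, n_articles in counts.items():
        population = population_by_cell.get(cell, 0)
        if population > 0:  # ratio is undefined for unpopulated cells
            density[cell] = n_articles * 1_000_000 / population
    return density


# Invented placeholder coordinates and populations, for illustration only.
coords = [(51.5, -0.1), (51.7, -0.9), (10.2, 20.3)]
population = {(51, -1): 4_000_000, (10, 20): 0}
print(articles_per_million_people(coords, population))
```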

As on previous maps we can see that Europe and North America are fairly dense in content relative to their population sizes, while the population hotspots of South Asia, China, Central Africa, and in other regions of the Global South are less represented. We can also see the comparatively high content density in essentially uninhabited regions such as the Sahara, central Australia, the Antarctic coastline, and elsewhere – few people may live there, but there are still Wikipedia representations of these places. In other words, we can see that for many regions of the world, Wikipedia’s content density does not simply reflect the world’s population distribution. Instead, certain world regions, particularly many countries of the Global South, are significantly less represented than we might expect, based on their population densities.

As a second comparison we aggregate these statistics by world region, and place them side by side. This is shown in Figure 5, where we visualise the total surface area, estimated population, and number of geotagged Wikipedia articles for every world region. This chart confirms our overall impression that there are some significant geographic inequalities in Wikipedia’s representation of the world, across all languages: there is significantly more content about certain regions than about others. Notably, Europe and North America are written about in much more detail, accounting for a much larger number of geotagged articles, than any other world regions. This is striking because they are smaller in both population and surface area than other regions, such as Africa and large parts of Asia. For example, the region of Europe and Central Asia (which includes Russia) represents slightly less surface area and a slightly smaller population than the continent of Africa, yet accounts for approximately four times the digital content.

Figure 4: Number of Wikipedia articles per million people, across all language editions. Darker shading indicates a greater number of geotagged articles relative to the local population. (Population data: GHSL 2019)

Figure 5: Number of Wikipedia articles by world region, compared to population size and surface area. (Population estimate and surface area: World Bank 2020.)
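A side-by-side comparison of this kind can also be expressed as a ratio: a region's share of all geotagged articles divided by its share of the world's population or surface area. A value of 1.0 would indicate proportionate representation. The sketch below is one possible formulation, with hypothetical type and function names and invented input values; it assumes every region has a positive population and area.

```python
from typing import Dict, NamedTuple


class RegionStats(NamedTuple):
    articles: int
    population: float
    area_km2: float


class Representation(NamedTuple):
    vs_population: float  # article share divided by population share
    vs_area: float        # article share divided by surface-area share


def representation_ratios(
    regions: Dict[str, RegionStats]
) -> Dict[str, Representation]:
    """Compare each region's share of articles with its population and area shares."""
    total_articles = sum(r.articles for r in regions.values())
    total_population = sum(r.population for r in regions.values())
    total_area = sum(r.area_km2 for r in regions.values())
    result: Dict[str, Representation] = {}
    for name, r in regions.items():
        article_share = r.articles / total_articles
        result[name] = Representation(
            vs_population=article_share / (r.population / total_population),
            vs_area=article_share / (r.area_km2 / total_area),
        )
    return result


# Invented placeholder regions, for illustration only.
toy = {
    "region 1": RegionStats(articles=300, population=100, area_km2=50),
    "region 2": RegionStats(articles=100, population=300, area_km2=150),
}
for name, rep in representation_ratios(toy).items():
    print(name, round(rep.vs_population, 2), round(rep.vs_area, 2))
```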

Who has access to local-language content?

In this section of the report we have gained a basic perspective on Wikipedia’s language and geographic coverage, and we have seen some striking inequalities in coverage both between languages and between global regions. But we have not yet seen how this might affect the use and usefulness of Wikipedia for particular populations. Do these content and coverage gaps become barriers to knowledge access? For example, if some languages are richer in content than others, and some regions are better represented than others, does this mean that certain places are not well-covered even in the languages that are native to the place? In other words, does this mean that not everyone can get access to information about their own places in their own language?

There are many potential ways of measuring this, but we want to approach it for now with a simple comparison: are the most detailed representations of a country (i.e. the largest number of articles) written in a local language, or in a foreign one? We can identify the most content-rich Wikipedia languages using the geotagged data used in our earlier analyses. As a further reference we rely on a language data set by the Unicode Consortium, a list of officially recognised languages for every country with estimates of their speaker populations. Using these two data sets we can see whether the Wikipedia language with the most articles about any particular country reflects the language or languages actually spoken in that country.
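The comparison described above amounts to a simple lookup, which might be sketched as follows. The two input mappings, the function name, and the placeholder codes are assumptions for illustration. The sketch uses a binary local/foreign label and skips countries without language data, which may be a coarser classification than the one used for the map.

```python
from typing import Dict, Set


def classify_dominant_language(
    dominant_edition: Dict[str, str],
    local_languages: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Label each country 'local' or 'foreign'.

    `dominant_edition` maps a country to the Wikipedia language with the
    most articles about it; `local_languages` maps a country to its set
    of local languages. Countries missing from either mapping are skipped.
    """
    labels: Dict[str, str] = {}
    for country, language in dominant_edition.items():
        if country not in local_languages:
            continue  # no language data available for this country
        is_local = language in local_languages[country]
        labels[country] = "local" if is_local else "foreign"
    return labels


# Invented placeholder codes, for illustration only.
dominant = {"XX": "aa", "YY": "aa", "ZZ": "bb"}
local = {"XX": {"aa", "cc"}, "YY": {"dd"}}
print(classify_dominant_language(dominant, local))
```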

Figure 6 visualises this comparison for the 169 countries where this data was available. The map shows a striking pattern of potential language exclusion: for many countries in Africa, Central and South America, and South Asia, the richest content about that country is written in a foreign language. In other words, people in these countries will be unable to access much of Wikipedia’s knowledge about their own places, unless they are able to speak a second language.

For example, in India, the most widely spoken local language is Hindi. But most Wikipedia articles about India are actually written in English (39,000 compared with 11,000 articles). Similarly, most of the population in Madagascar is literate in the national language Malagasy, but fewer than a dozen Wikipedia articles about the island nation are written in this language. Instead, most of the content is written in English (1,500 articles), which after a 2010 referendum is no longer considered an official language.

However, we have to acknowledge that maps like these also bring with them some significant conceptual and political challenges. Most of the content about South Africa is written in English, which is an official language of the country, and thus identified as a “local” language on our map. But when we first presented this map in Johannesburg, South African digital rights campaigner Onica Makwakwa pointed out that this might be a point of debate – English was introduced to the country as a result of historic British colonialism, and it is not historically considered to be a local language of Africa. Similar histories exist in many other places in the world where colonial languages such as French, English, Spanish and Portuguese have been introduced by force. These are languages that are now widely spoken by local populations, yet are they local languages?

In general, the global prevalence of certain colonial languages in these digital representations is quite striking. English in particular is the most content-rich wiki language for many countries around the world – as we can see in Figure 7, this is the case in the United States, Canada, the U.K. and Australia where it is the national language, but also in many parts of the Global South where it was introduced by colonialism. On the one hand this reflects its use as a kind of online lingua franca; on the other, it may simply reflect the global pervasiveness of the language more broadly. Yet it is still remarkable that many parts of the world are covered much more comprehensively in English than in local languages.

Figure 6: Wikipedia’s local-language prevalence. Are the most detailed representations of a country written in a local language (orange and beige), or a foreign language (blue)? (Language data: Unicode CLDR 2019)

Figure 7: Countries where English is the most content-rich wiki language, accounting for the largest number of articles about local places. (Language data: Unicode CLDR 2019)

Change over time

To their great credit, the Wikipedia community and the Wikimedia Foundation have been paying close attention to reports of language inequality, and have spent much effort trying to address it. This has resulted in a significant strategic shift within the Foundation, and the articulation of new aims, namely: to ensure a just representation of the world’s knowledge and people on Wikipedia. Additionally, Wikipedia contributors are devoting increasing effort to improving content coverage in underrepresented languages, as well as content about underrepresented topics. Wikipedia now lists almost 200 wiki projects that seek to counter such systemic representation biases, and as a result of these efforts, the body of collective knowledge has been growing steadily.

But has this resulted in improved coverage of previously underrepresented topics? And in particular, has this resulted in improved coverage of places in the Global South? We show in Figure 8 how content per region has grown over time. It is clear that representations of places in Europe still account for the largest amount of content by far, and that this content keeps growing. But we can also see reassuring content growth for other regions, particularly in recent years, when the amount of content covering regions of the Global South has grown significantly, especially content about Africa and South Asia. Overall we can say that coverage is steadily improving, albeit not at the same rate everywhere. Putting this trend in proportion, in 2010 Europe had 20 times more geotagged articles than all of Africa, while by early 2018 this gap had shrunk to only four times more European content than content about Africa. In other words, the coverage gap for Africa has narrowed over time. Yet the continent has a greater surface area and a larger population than Europe, and is still less well-documented on Wikipedia.

Figure 8: Growth in Wikipedia content over time, by world region, across all language editions.
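The proportions quoted above are ratios of article counts in two regions at the same point in time. As a minimal sketch, assuming cumulative counts of geotagged articles are available per year (the function name and the input layout are hypothetical), the gap could be tracked like this:

```python
from typing import Dict


def coverage_ratio_by_year(
    region_a: Dict[int, int], region_b: Dict[int, int]
) -> Dict[int, float]:
    """Ratio of article counts (region A over region B) for each shared year.

    Both arguments map a year to a cumulative count of geotagged articles.
    Years where region B has no articles are skipped.
    """
    shared_years = sorted(region_a.keys() & region_b.keys())
    return {
        year: region_a[year] / region_b[year]
        for year in shared_years
        if region_b[year] > 0
    }


# Invented placeholder counts, chosen only to reproduce the 20:1 and 4:1
# proportions mentioned in the text; they are not the real article counts.
region_a = {2010: 200, 2018: 400}
region_b = {2010: 10, 2018: 100}
print(coverage_ratio_by_year(region_a, region_b))
```

A falling ratio indicates that the gap is narrowing in relative terms, even while the absolute difference in article counts may still be growing.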

Discussion

Although we have only considered a tiny subset of the 300 Wikipedia languages, we have noted some striking differences in content coverage between them. Overall we find that Wikipedia’s language editions vary widely in scale, both in number of articles and in the size of their editor communities. English Wikipedia is by far the largest on both counts, and by comparison, the typical Wikipedia language edition has only a small fraction of its content.

We believe that this in part relates to the sizes of their respective language communities. For example, the amount of content in more widely spoken European languages such as English, French, Spanish, Russian, and Portuguese is approximately proportional to the number of speakers. At the same time there are some marked exceptions to this, and more content exists in some languages than others, even when their population numbers are comparable. For example, the language editions in Mandarin Chinese, Hindi, Modern Standard Arabic, Bengali, and Indonesian are much less comprehensive, despite these languages being spoken by hundreds of millions of people. We consider these to be examples of significant gaps in content coverage.

This inequality is also reflected in Wikipedia’s geographic coverage: many of the world’s places are written about in the English Wikipedia, while the geographic coverage in other language editions is often much more constrained. This often follows the global distribution of the respective language populations. For example, Spanish Wikipedia has particularly detailed coverage of the Americas, while Arabic-language coverage is particularly rich about places in North Africa and the Middle East.

Yet when looking at Wikipedia’s geographic coverage as a whole, information about places in Europe and North America is highly detailed, while many other regions of the world are relatively underrepresented, particularly places in Africa, parts of Asia, and in other regions of the Global South. This is especially evident when we account for the uneven distribution of the global population, for example the population hotspots in South Asia and China: it then becomes apparent that content about India in Bengali and Hindi, languages spoken by hundreds of millions of people, is comparatively sparse.

As a consequence of these inequalities in both language and geographic coverage, certain populations will find it easier to access knowledge on Wikipedia than others. For example, we find that for many countries in Africa, Central and South America, and South Asia, most of the content about those countries is in a foreign language. In other words, in many of these places, people may need to be able to speak a second (possibly foreign) language in order to access Wikipedia information about their own places. In general, the global prevalence of certain colonial languages like English and Spanish in these digital representations is quite striking.

Why do we see these differences? As we mention in our platform survey and the review of Google Maps, it is often easier to show these inequalities of content coverage than to explain why they occur. In general we can say that this is the result of many factors, some of them historic and pre-digital, which are then amplified by digital environments. For example, the basic cost of a broadband connection varies greatly around the world, and for many people presents a very basic barrier to digital participation. In this context it is also noteworthy that Wikipedia is founded on the assumption that its contributors are volunteers who participate in their spare time. While this may be an appropriate assumption to make for some parts of the world, in others it might be highly unusual to have the capacity to donate one’s time and labour for free.

As a result of the rich diversity of global circumstances and contexts, some parts of the world will find it easier to contribute to Wikipedia than others, which in turn shapes what is being written on Wikipedia. Yet increasingly, as global connectivity steadily improves, we also need to consider whether such absences are the result of an individual choice: it is plausible that people might use Wikipedia less if it is not available in their language, or if it does not have answers to the questions they have about the world. In other words, it is not enough for us to expect people to come to Wikipedia; we should also ask Wikipedia to come to them.

The data collected for this survey are available for download.

Bibliography

Eberhard, David M., Gary F. Simons, and Charles D. Fennig. 2020. ‘Ethnologue: Languages of the World’. Twenty-third edition. Dallas, Texas: SIL International. http://www.ethnologue.com/.
Unicode. 2020. ‘Unicode Common Locale Data Repository’. 37. http://cldr.unicode.org/.