Introduction
As part of our broad review of the state of the internet’s languages we have looked at the state of interface language support on major platforms, asking: will people need to be able to speak a second language in order to use particular apps? However, the essays by our contributors also illustrate that interface language support is only part of the overall picture of language support, and that the actual content of major websites is often not available in key languages. We thus want to also look at the state of content coverage in major languages, by considering two platforms: Wikipedia and Google Maps.
We have chosen Google Maps because for many people it will act as a kind of portal to the world – it helps us navigate the city, and it helps us learn about other places. In principle it is accessible to many: according to our platform survey, the Google Maps website currently offers interface support for more than 70 languages. But how good is its content coverage in these languages? Can it offer everyone a view of the world in their language? Or do the languages we speak potentially limit which parts of the world we get to see on Google Maps?
In this section of the report, we compare the data volumes and geographic distributions of the data shown by Google Maps across a dozen different languages, at both global and local scale. Overall we find remarkable differences in coverage – for example, coverage in certain major languages is highly constrained to certain geographic regions, while coverage in other languages is much broader.
Due to the commercial nature of Google Maps, the data behind it is not publicly available, and in order to analyse its coverage we have attempted to collect the data ourselves. To this end we executed millions of map searches for different places and languages and collected the results. These form the basis for our analysis, and we will discuss the data collection process in some detail.
Methodology
In contrast to Wikipedia, whose data is openly available for anyone to analyse, the data behind Google Maps is not open. For our analysis we instead crawled it with automated scripts, essentially imitating a person on the street searching for something nearby – executing searches for dozens of search terms, each translated into about a dozen languages, and repeated across thousands of global locations.
Search terms
We selected 44 English search terms to help us discover map content. In a first stage we reviewed taxonomies of geospatial databases, and based on their vocabulary curated a list of the types of places that we might commonly see in cities around the world, including restaurants, schools, parks, and other potential destinations. We included types of places that are very frequent within a city, such as shops and schools, as well as types such as parks and universities that are less frequent but nevertheless commonly found. The full list of urban features, in alphabetical order, is: atm (cash machine), bank, bar, cafe, church, coffee, dentist, dinner, florist, food, grocery, hairdresser, hotel, library, lunch, mosque, museum, music, park, pharmacy, place, restaurant, school, shop, supermarket, synagogue, theater, university. We further included a more general set of words to discover parts of the map that are not captured by these urban features, adopted from an earlier study of the Google Maps geography (Graham and Zook 2013). In alphabetical order, these terms are: cat, christian, democracy, flu, god, government, hindu, internet, jewish, love, monkey, music, muslim, sex, tax, war, wedding. Because “music” appears in both lists, the 28 urban features and 17 general terms combine to 44 unique search terms.
Translations
For our data collection of Google Maps’ global coverage we selected the 10 most widely spoken languages according to Ethnologue: English, Mandarin (in Simplified Chinese Script), Hindi, Spanish, French, Arabic, Bengali, Russian, Portuguese, and Indonesian (Eberhard, Simons, and Fennig 2020). For additional data collection at local scale we further included several regional languages, namely the African languages Swahili, Xhosa, Zulu, and Afrikaans, and the South American language Guaraní.
We translated the search terms into each of the target languages with the help of both professional and volunteer translators, recruiting at least one volunteer translator and one professional translator per language. Translators were given a detailed briefing, and asked to choose terms that would be used for a map search by a native speaker of the respective language. In the briefing we acknowledged that different people might use different terms for the same information needs, and left it up to the translator to choose their preferred tone of voice. Where translators offered multiple alternative translations for a term, we included all of them in our final collection of search terms. This combined and comprehensive approach provides us with some confidence in the correctness of the translations, while also allowing for a degree of variety in user search strategies.
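The way alternative translations expand the list of queries can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used for the study: the `TRANSLATIONS` mapping and the `build_queries` helper are hypothetical names, and the only entries shown are the English term “restaurant” and the two Swahili variants quoted later in this section.

```python
# Hypothetical structure: English term -> language -> list of translation variants.
TRANSLATIONS = {
    "restaurant": {
        "English": ["restaurant"],
        "Swahili": ["mkahawa", "mgahawa"],  # two variants offered; both are kept
    },
}


def build_queries(translations):
    """Flatten the mapping into (english_term, language, query) tuples.

    Every variant is kept; exact duplicates within a language are skipped
    so that the same query is not sent twice.
    """
    seen = set()
    queries = []
    for term, by_language in sorted(translations.items()):
        for language, variants in sorted(by_language.items()):
            for variant in variants:
                key = (language, variant.strip().lower())
                if key not in seen:
                    seen.add(key)
                    queries.append((term, language, variant))
    return queries


queries = build_queries(TRANSLATIONS)
for english_term, language, query in queries:
    print(english_term, language, query)
```

Keeping the English source term alongside each variant makes it possible to group results by concept later, regardless of which variant produced them.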
In summary, we translated the 44 search terms from English into Afrikaans, Arabic, Bengali, French, Guaraní, Hindi, Indonesian, Mandarin (in Simplified Chinese script), Portuguese, Russian, Spanish, Swahili, Xhosa, and Zulu. Many thanks to our volunteer translators!
Data collection
Our searches were organised in a regular grid of search locations, covering all places over land and focusing in more detail on three cities. For the global scan we constructed a regular worldwide grid at an average grid spacing of approximately 160 km, accounting for 2,600 sample points over land. For three urban regions – Kolkata in India, Dar es Salaam in Tanzania, and Nairobi in Kenya – we constructed a further search grid, covering both the urban centre and part of the suburban ring at relatively fine spatial resolution. In Kolkata and Dar es Salaam, our individual search locations are at most hundreds of meters apart. We chose a coarser grid spacing of 1.2 km for Nairobi.
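One way to construct a grid with roughly constant spacing in kilometres is sketched below. It is an illustration only and not necessarily how the study's grid was built: the function names, the latitude bounds, and the `is_land` predicate are assumptions. The land mask is reduced to a caller-supplied predicate, so the number of points will not reproduce the count of land points reported above.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def make_grid(spacing_km=160.0, lat_min=-60.0, lat_max=75.0,
              is_land=lambda lat, lon: True):
    """Return (lat, lon) points spaced roughly spacing_km apart.

    is_land is a hypothetical predicate; a real run would back it with a
    land-mask dataset. The default accepts every point.
    """
    points = []
    lat_step = spacing_km / KM_PER_DEGREE
    n_rows = int((lat_max - lat_min) / lat_step) + 1
    for i in range(n_rows):
        lat = lat_min + i * lat_step
        # A degree of longitude shrinks with cos(latitude), so rows nearer
        # the poles get fewer columns to keep the spacing in km similar.
        km_per_lon_degree = KM_PER_DEGREE * math.cos(math.radians(lat))
        n_cols = max(1, int(360.0 * km_per_lon_degree / spacing_km))
        for j in range(n_cols):
            lon = -180.0 + j * (360.0 / n_cols)
            if is_land(lat, lon):
                points.append((lat, lon))
    return points


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


grid = make_grid()
print(len(grid), round(haversine_km(grid[0], grid[1]), 1))
```

The same function could in principle produce the finer city grids by passing a smaller `spacing_km` and a predicate that restricts points to a bounding box around the city.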
We executed Google Maps searches at every sample point, sending a search request for each of the translated search terms in each of the different languages. In total, data collection involved almost two million searches over a period of approximately two months.
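As a rough plausibility check on this total, the figures given above imply the following minimum for the global scan alone. The calculation assumes exactly one translation per term and language; the gap to the reported total would then come from alternative translations and from the three city grids, whose sizes are not stated here.

```python
# Figures taken from the text above.
GRID_POINTS = 2_600      # global sample points over land
SEARCH_TERMS = 44        # unique English search terms
GLOBAL_LANGUAGES = 10    # languages used for the global scan

# Assumption: exactly one translation per term and language.
minimum_global_searches = GRID_POINTS * SEARCH_TERMS * GLOBAL_LANGUAGES
print(f"{minimum_global_searches:,}")  # 1,144,000
```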
The resulting search result listings provided us with information about the locations, or “places”, known to Google Maps. The metadata for each individual search result includes a name for the location or venue, a geographic location given as both geographic coordinates and a street address, and an identifier code that uniquely identifies the particular “place” within Google’s geospatial database. It may also include additional metadata such as a homepage URL, and a set of category labels. Finally, each search result includes an automated classification of the language in which the search result is written, which allows us to compare whether the result language matches the language of the search request.
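The record described above could be modelled along the following lines. All field names are hypothetical and chosen for illustration; they do not reflect the actual response format of Google Maps or the schema of the published dataset, and the example values are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SearchResult:
    """One search result, with hypothetical field names."""
    place_id: str            # unique identifier of the place
    name: str                # name of the location or venue
    latitude: float
    longitude: float
    address: str             # street address
    result_language: str     # automated language classification of the result
    search_language: str     # language of the search request
    homepage_url: Optional[str] = None   # optional metadata
    categories: Tuple[str, ...] = ()     # optional category labels

    def language_matches(self) -> bool:
        """True if the result is written in the language that was searched."""
        return self.result_language == self.search_language


# Invented placeholder values, for illustration only.
example = SearchResult(
    place_id="place-0001",
    name="Example Park",
    latitude=0.0,
    longitude=0.0,
    address="1 Example Street",
    result_language="Arabic",
    search_language="English",
)
print(example.language_matches())  # False
```

Storing the search language next to the classified result language makes the later comparison between the two a simple per-record check.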
Other data sources
In our analysis we compare the distribution of Google Maps data to several other reference points. For a comparison with global population distribution we make use of the high-resolution population density estimates of the Global Human Settlement Layer (GHSL) provided by the European Commission. We aggregated this data in a regular grid in order to produce the population-normalised map of Google Maps content density. We further compare the data coverage to estimates of the language populations of the 10 most widely spoken languages according to Ethnologue (including second-language speakers).
Content volume in different languages
In a first stage we collected data about Google Maps’ global content coverage in the 10 most widely spoken languages: English, Mandarin Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Portuguese, and Indonesian. We collected tens of millions of individual search results in these languages, and across these identified around three million unique places (venues and other locations) that are shown on the map. These give us a first broad estimate of what Google Maps knows about the world, and which parts it will show to speakers of different languages.
Overall we found that Google Maps’ coverage of the world varies widely depending on the language – certain languages provide access to much more content than others. Additionally, we found that the distribution of this content also varies: content in many languages is more dense in certain global regions than in others. We can see this in the maps that follow.
We can see in Figure 1 that Google’s English-language map arguably covers the world, although it has much greater content density in the Global North, with a focus on Europe and North America. We also see high content density in South Asia and parts of South-East Asia, and relatively high density in large parts of South America. By comparison, many parts of Africa are sparse in content. These differences may partially relate to the distribution of people around the world, and we will show a population-normalised version of these maps further below to account for this.
Compared to the relatively well-covered English map, we can see in Figure 2 that Bengali lies at the other extreme – its coverage is mostly restricted to South Asia, especially India and Bangladesh, and Google Maps has little to no content for Bengali speakers in most of the rest of the world. In other words, although it offers support for the Bengali language, its coverage is really limited to particular global regions. As a result, Bengali speakers need to switch to a second language such as English in order to discover additional content, and to navigate other places. We see a similarly constrained distribution of content for Hindi.
Content coverage in the remaining languages lies somewhere between these two extremes. No other language on Google Maps is as content-rich as English. We can see on the maps for Arabic, French, and Chinese content in Figure 3 that their distribution relates to where in the world these languages are spoken. For example, both the French and Arabic maps show relatively high content density along the North African coast, while only the French map shows high content density in West Africa. This is maybe not surprising: Google Maps relies on the collection of information about the world’s places on websites and other promotional material, and such material tends to be written in the languages of the respective place.
Overall, around half of the three million unique places we discovered were shown in English-language searches. By comparison, only 20-25 percent of these places were included in the results for French, Spanish, Russian and Portuguese searches, and only 10-15 percent in the search results for Indonesian, Arabic, and Mandarin Chinese. By contrast, speakers of Hindi were shown less than five percent of the global map, and speakers of Bengali less than one percent of the global map. (We will show this distribution in Figure 5 in the next section.)
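The percentages above are shares of one pooled set of unique places. A minimal sketch of that calculation follows, assuming each language's results have been reduced to a set of place identifiers; the function name and the toy identifiers are illustrative and are not the study's data.

```python
def coverage_shares(places_by_language):
    """Share of all unique places that each language's searches returned.

    places_by_language maps a language name to the set of place identifiers
    seen in that language's search results.
    """
    all_places = set()
    for ids in places_by_language.values():
        all_places |= ids
    if not all_places:
        return {language: 0.0 for language in places_by_language}
    return {
        language: len(ids) / len(all_places)
        for language, ids in places_by_language.items()
    }


# Toy identifiers, for illustration only.
toy = {
    "English": {"a", "b"},
    "French": {"b", "c", "d"},
    "Hindi": {"a"},
}
shares = coverage_shares(toy)
print(shares)
```

Because the same place can appear in several languages, the shares do not need to sum to one.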
Compared to population sizes
Of course we cannot expect that Google Maps has the same amount of content for all the world’s places – certain regions of the world are more densely populated and urbanised than others. To account for this, the map in Figure 4 is normalised by the global population density provided by the GHSL. It shows the number of places on Google Maps per million inhabitants, aggregated for all 10 languages. Arguably this gives us a better sense whether certain regions of the world are over- or underrepresented on Google Maps, relative to the number of people who live there.
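The normalisation amounts to dividing the number of places in each grid cell by the number of inhabitants in that cell, scaled to one million. A minimal sketch, assuming both datasets have already been aggregated to the same grid; the cell keys, function name, and toy numbers are hypothetical.

```python
def places_per_million(place_counts, population):
    """Places per million inhabitants for every populated grid cell.

    Both arguments map a grid-cell key to a count. Cells without
    inhabitants are skipped, since the ratio is undefined there;
    populated cells without any places get a value of zero.
    """
    density = {}
    for cell, people in population.items():
        if people > 0:
            density[cell] = place_counts.get(cell, 0) * 1_000_000 / people
    return density


# Toy numbers, for illustration only.
toy_places = {(0, 0): 50}
toy_population = {(0, 0): 2_000_000, (0, 1): 500_000, (1, 1): 0}
density = places_per_million(toy_places, toy_population)
print(density)
```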
This normalisation has two effects. First, once we consider local population density we can see that fewer regions are extremely poorly covered than Figures 1–3 would suggest. In other words, it appears that the distribution of content on Google Maps to an extent does reflect the global population distribution, in that less densely populated regions also tend to have less content-rich maps.
Second, we can still see some remaining differences in regional content coverage that suggest that not all global regions are equally well covered. The relative density of content is high or very high for North America, Western Europe, parts of South America, and Australia. Yet the maps of large parts of Asia, and most of Africa, only have a fraction of the content, relative to their population density. The previous hotspots of South Asia and South-East Asia now blend in with their less well-represented neighbours: relative to their high population density, Google’s maps are not always as detailed here as they are for Europe and other parts of the Global North. Maybe most strikingly, it is now clear that vast parts of Africa are among the least well-covered places in the world, relative to their population density.
This huge geographic discrepancy is in part also the result of an unequal coverage of languages within Google Maps. We have already seen that different languages provide access to vastly different amounts of content. The chart in Figure 5 shows the amount of queried content returned for each of the 10 languages, and the number of speakers of each language. We can see that for many of these languages, content coverage is somewhat related to the distribution of their speaker populations.
There are some instances where there is less content available than we would expect based on population sizes. Speakers of Mandarin Chinese, Hindi, and Bengali only have access to a much smaller fraction of the global map, compared to speakers of other languages. This is particularly surprising because Chinese is the second most widely spoken language in the world, Hindi is the third, and Bengali the seventh, yet they have access to less content in their language than speakers of other major languages.
In other words, there are major languages that represent a significant part of the global population that are nevertheless comparatively underserved. Speakers of these languages are only shown a fraction of the available content, and consequently only have access to a fraction of all representations of the world. These populations need to switch to a second language (likely English) in order to discover additional content. (As an aside, we identified a similar pattern in our Wikipedia survey, where Figure 1 shows a surprisingly similar distribution.)
Foreign-language search results
The practical reality for users of these maps is often even less encouraging than the findings above suggest. It is a common occurrence on Google Maps that search results may include entries in multiple languages, particularly when searching in locations that are multilingual, or when searching in a language that is not locally spoken. For example, if we search in English for parks and gardens in Tangier, Morocco we might receive some Arabic-language search results. We can estimate how often this takes place because Google Maps offers us an automated classification of the language of each individual search result, which it then uses to prioritise them.
Figure 6 shows the shares of search results in each of the 10 major languages, broken down by search language. It shows how frequently searches in one language yielded results in another.
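A breakdown of this kind is a cross-tabulation of search language against result language. The sketch below assumes each result has been reduced to a (search language, result language) pair; the function name and the toy pairs are illustrative only and do not reproduce the values shown in the figure.

```python
from collections import Counter, defaultdict


def language_shares(results):
    """For each search language, the share of results in each result language.

    results is an iterable of (search_language, result_language) pairs.
    """
    counts = defaultdict(Counter)
    for search_language, result_language in results:
        counts[search_language][result_language] += 1
    shares = {}
    for search_language, counter in counts.items():
        total = sum(counter.values())
        shares[search_language] = {
            result_language: n / total
            for result_language, n in counter.items()
        }
    return shares


# Toy pairs, for illustration only.
toy_results = [
    ("French", "French"),
    ("French", "English"),
    ("French", "French"),
    ("French", "English"),
    ("Bengali", "Bengali"),
]
breakdown = language_shares(toy_results)
print(breakdown)
```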
The chart shows that searches in many of the 10 languages often yield English-language search results. This happens for almost all languages: searches in Arabic and Indonesian, but also in European languages such as Spanish and French, frequently return some English-language results. In other words, the chart shows that the content coverage of many languages is actually not as good as it may first appear. Instead, a lot of the content speakers of these languages are offered is actually in English. It is further noteworthy that these content substitutions don’t take place for Bengali and Hindi – as we have already seen, Bengali and Hindi searches in places outside South Asia instead simply yield empty search results.
In other words, content coverage in some languages is increased because Google Maps includes content in other languages, rather than showing empty results. Although this is arguably in the interests of people navigating the map, it is notable that such substituted content is typically in English and rarely in any other language.
At local level
So far we have looked at the language geography of Google Maps at a global level, which is somewhat removed from the way in which we use the map every day. To illustrate our findings with more concrete examples we also want to see how these coverage differences affect the representation of particular cities. We repeated our data collection for three cities: Kolkata in India, Dar es Salaam in Tanzania, and Nairobi in Kenya. In each of these cities we tried to estimate how much content is available in major local languages.
The map in Figure 7 shows the language geography of Google Maps in Kolkata. It visualises the density of content we discovered with local map searches in Bengali, Hindi, and English, this time only considering content that is actually described in these languages. In other words, for the content density map in Bengali we only consider the locations of search results that have been described in Bengali. The map shows a striking difference in coverage between the three languages: there is about three times more English-language content available about Kolkata than content in Hindi, or content in Bengali. In practice this means that speakers of Bengali and Hindi may need to switch to English-language searches, or be able to understand the English-language search results, in order to discover large parts of the city.
There is an even more striking absence of Swahili-language content in Nairobi and Dar es Salaam – the maps in Figure 8 and Figure 9 reveal that Swahili-language content is effectively absent in both cities. Instead, most of the content we discovered in our crawls of these two cities is in English. For example, a search for the English term “restaurant” yields search results in both cities, while the Swahili equivalents “mkahawa” or “mgahawa” showed empty results. We can further see that the English-language coverage of both cities lags behind the high information density of Kolkata – possibly a reflection of Kolkata’s relatively high population density.
This is in part a surprising outcome because Swahili is supported as an interface language on Google Maps. And this leads to some potential confusion – for example, when English-language search results are correctly labelled in the Swahili interface language as “Mkahawa” (restaurants), and yet searches for the same Swahili term return no results. As a consequence, Swahili speakers can use the application in their own language, and in some respects the interface speaks back to them in their language. However, they still need to use English search terms to discover key parts of the city.
Once more we consider this content gap a significant omission – according to Ethnologue, Swahili is one of the 15 most widely spoken languages in the world, spoken by an estimated 20 million first-language speakers, and 100 million people when including second-language speakers. Yet, the language is essentially not represented on Google Maps, not even in places where it is widely spoken.
In summary, we can see at a local level what we have already observed globally: there is vastly different content coverage in different languages, and many parts of the world are not shown on non-English maps.
This absence of content is likely also the case for many other African languages, and for many other languages of the world. In preparation for this work we tried to collect similar data for the South African languages Xhosa and Zulu, and for Guaraní in Paraguay. Yet in all three cases we found that these languages are essentially not represented in Google Maps. Instead, the cities where these languages are spoken are represented in languages such as English, Afrikaans, Spanish, or other majority languages of the respective regions. In other words, Google Maps is not available to speakers of Xhosa, Zulu, and Guaraní, and potentially to speakers of many other regional languages spoken by millions. This is not an insignificant omission: according to Ethnologue, Xhosa is spoken by an estimated 8 million people (19 million including second-language speakers), Zulu by an estimated 12 million (28 million including second-language speakers), and Guaraní by an estimated 6 million.
Discussion
We have estimated the global content coverage of Google Maps in the 10 most widely spoken languages, and augmented this with observations about local content coverage in Kolkata, Dar es Salaam, and Nairobi. We find striking differences in content coverage between languages at both global and local level. Overall we find that Google Maps in its current form is dominated by English-language content.
In general, we find that coverage in certain major languages is highly constrained to specific geographic regions, while coverage in other languages is more dispersed – likely relating to the existing language geography. Bengali and Hindi are particularly spatially constrained – the maps in these languages are largely limited to South Asia.
There are some indications that Google seeks to address content gaps through inclusion of foreign-language content when results are not available in the search language. This kind of content substitution happens for some languages more than for others. For example, searches in Arabic, Indonesian, Spanish, and French frequently return some English-language results. On the other hand, the global maps in Bengali and Hindi don’t show such content substitution and are largely blank outside their respective home regions. It is further noteworthy that when content substitution takes place, most of the time it results in the inclusion of English-language content, rather than content in any other language.
However, compared to these major world languages, speakers of less widely spoken languages are not nearly as well supported, and their languages are often entirely unrepresented on the map. This leads to some striking linguistic absences. Although we attempted to examine the coverage in Zulu and Xhosa in South Africa, and Guaraní in Paraguay, we found these languages to be essentially unrepresented on Google Maps, even though they are spoken by millions. Even Swahili – one of the 15 most widely spoken languages of the world – is largely absent, and in two of the cities where it is spoken, English content dominates the map. This mirrors our observations in our platform interface survey where we found that African languages are often unsupported by major platforms.
To an extent, these highly unequal content distributions simply reflect existing language geographies and population sizes. However, we also see exceptions to this, which suggest that additional factors are involved. Why are languages such as Hindi and Bengali not as content-rich as English in maps of Kolkata, a city where all three languages are spoken? Why is there very little or no Swahili-language content on the maps of Dar es Salaam and Nairobi, compared to content in English? In short, why do we see this relative lack of content coverage in some languages, but not others?
At this point it is important to state that we do not simply want to cast blame, and that we do not believe this is necessarily the result of a deliberate omission. Rather, to a significant degree Google itself is also subject to the circumstances of the world – and we merely see them reflected in Google’s map. In other words, we use Google Maps as a proxy that allows us to capture broader digital information inequalities.
With this in mind, we speculate that many of these coverage inequalities are a reflection of the linguistic properties and social circumstances of languages, including the degree to which speakers of these languages participate in certain forms of language representation. For example, we suspect that there is more written content available in some languages than others, and that in turn some languages benefit from more digitised content than others. We further know that some script forms such as the Roman alphabet are more widely represented globally and online than others. We suspect there are further factors relating to similarities and differences between languages. For example, European languages share many words between them, like “restaurant”, “Restaurant”, “restaurante”, “ristorante”, and so on, which allows content in one language to be integrated more easily into the representations of another. This is likely further exacerbated by global differences in economic development that affect local access to education and information technology.
In other words, there are key structural barriers to equitable and global language representation that are not about any particular digital platform. Even if Google’s coverage steadily improves, its map can only reflect the global information ecologies it depends on. By identifying instances of amplification and exclusion we can start to better understand their potential causes, many of which are external to Google.
Yet, at the same time we also have to acknowledge that when the representations produced by Google then reproduce these unequal circumstances, they become representations of the world that favour some languages and some places over others. We suspect that the outcomes observed here are in part also the result of Google’s decision to expand coverage in particular regions and languages but not others, possibly informed by a commercial calculus, such as the presence of a local advertising market.
The data collected for this survey are available for download.
Bibliography