The world's linguistic heritage is facing a crisis just as serious as that of biodiversity. A French project is trying to save what exists in its Pangloss collection, with the help of new artificial intelligence tools.
PARIS — Let's begin with a little quiz: Across the earth, there are 7 continents and 197 countries. How many languages are spoken?
The answer is around 7,000. If this number surprises you, it is because of a distorted perspective: half of the planet's 7.8 billion inhabitants communicate through only about 20 of them (Arabic, English, Spanish, French, Hindi, Mandarin, Portuguese...), while 97% of these 7,000 languages together account for no more than 4% of the population.
Our world linguistic heritage, as rich as it may be, is very fragile. The overwhelming majority of these 7,000 languages have no written tradition, and today are spoken only by a handful of elderly people. This heritage is both the fruit and the guarantor of humans' cultural diversity, and is no less significant than the biodiversity of plant and animal species. The crisis it faces can be considered the sixth major extinction that threatens the world.
"We estimate that 50% of the 7,000 languages will disappear by the end of this century, a rate to be compared with the 26% of mammal species or 14% of bird species threatened with extinction according to the International Union for Conservation of Nature," says Evangelia Adamou, a linguist at the CNRS laboratory, LACITO (Languages and Civilizations with an Oral Tradition).
This threat of massive linguistic extinction is what motivated researchers to create the Pangloss collection in 1995, named after a character in Voltaire's "Candide," whose name in Greek means "all languages." With a website that makes it accessible to the general public, this collection is to linguistic diversity what protected areas are to biodiversity. Its sound library has been enriched over the years and now contains more than 3,600 audio or video recordings in 170 languages, nearly half of which are transcribed and annotated.
According to Alexis Michaud, one of the main linguistic contributors to the Pangloss collection, the painstaking work of transcribing and translating a rare language before it disappears into oblivion will soon be greatly accelerated by advances in artificial intelligence. A quarter of a century ago, automatic language processing technology produced poor results even for common languages; now it works efficiently even for the rarest and least well-documented ones.
Inscription in Aramaic on a funerary stele — Photo: Wikipedia
These advances are evidenced by Elpis (named after the Greek goddess of hope), machine learning software developed by an Australian doctoral student and designed to enable language workers to build their own speech recognition models and automatically transcribe audio. It will be released later this year on LACITO to all researchers interested in the 780 hours of pre-recorded readings from Pangloss. (As part of an open science approach, the creators have licensed most content under a Creative Commons license.)
"Until now, it took at least 100 hours of recording time to train AI to make transcriptions in a new language," explained Michaud. "With the Elpis interface, one hour of recording will suffice. It's a real revolution!"
This conservation work is urgent if as many languages as possible are to be saved from joining Gaulish, Aramaic, and Gros Ventre (an Algonquian language of the Great Plains of the United States) in the cemetery of dead languages. And as with biodiversity, the phenomenon of language extinction is accelerating rapidly.
"The number of known languages that have become extinct in the course of history is estimated at 900. But of these, nearly a quarter have disappeared over the last 50 years," points out CNRS linguist Adamou. According to the most recent data, a language disappears, on average, every few months under the combined weight of urbanization, deforestation, and global warming.
When a language dies, a whole culture dies with it, closing off a unique way of understanding the world. As early as the 1930s, Edward Sapir and Benjamin Lee Whorf, two American linguists and anthropologists, advanced what is now known as the "Sapir-Whorf hypothesis," which holds that our cognitive perceptions depend on our linguistic categories; in other words, the way we see the world depends on the language we speak.
A considerable body of empirical research has since been conducted at the crossroads of linguistics and neuroscience to test this hypothesis. Interpretations of the research are still being debated, but Adamou says one thing is certain: "Not all languages encode all aspects of reality in the same way." That means linguistic diversity is itself of enormous value.