The Human Rights Impact of Underrepresented Languages in AI

As AI continues to shape industries and innovation, language plays a critical role in expanding the reach and capabilities of natural language processing and generative AI models. However, many languages are still underrepresented in training datasets. These are called "low-resource languages." For example, Common Crawl is a free and open repository of web crawl data that is widely used for training large language models, yet 46.5% of its documents are primarily in English. English is followed by Russian, German, Japanese and Spanish, each comprising around 5% of the dataset. According to UNESCO, there are over 8,300 languages worldwide, whereas Common Crawl contains only 160.

Training AI systems on a diverse set of languages is a precondition for advancing human rights and inclusion in the digital age. This session, "The Human Rights Impact of Underrepresented Languages in AI: The Unspoken South," will explore this issue by identifying problems and mapping solutions. First, it will underscore the policy and societal implications of language underrepresentation in AI systems. This will include the impacts on cultural rights under international human rights law, specifically the rights to take part in cultural life; to enjoy the benefits of scientific progress; and to benefit from the protection of scientific, literary or artistic production, including the protection of traditional knowledge. The session will also cover AI-specific policy implications, such as bias, fairness and safety. Second, the session will highlight lines of action to address the challenge. These may include (1) creating incentive systems for people to contribute data ethically; (2) raising awareness to mainstream the topic within the digital rights agenda; (3) advocating to unlock access to language datasets for the communities that are culturally associated with the data therein; and (4) co-designing copyright licenses that attend to the socioeconomic needs of low-resource language communities affected by AI.

Gustavo Fonseca Ribeiro

Youth Ambassador, Internet Society
