CIO Insider

CIOInsider India Magazine

India to Use AI to Capture its 121 Languages

CIO Insider Team | Monday, 4 December, 2023

Within a matter of weeks, villagers in Karnataka read out dozens of sentences in their native Kannada into an app, as part of a project to build the country's first AI-based chatbot for tuberculosis (TB).

With more than 40 million native speakers, Kannada is one of India's 22 official languages and one of more than 121 languages spoken by at least 10,000 people in the country, which is the world's most populous nation.

But few of these languages are covered by natural language processing (NLP), the branch of artificial intelligence that allows computers to understand text and spoken words.

Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities.

“For AI tools to work for everyone, they need to also cater to people who don't speak English or French or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India.

“But if we had to collect as much data in Indian languages as went into a large language model like GPT, we'd be waiting another 10 years. So what we can do is create layers on top of generative AI models such as ChatGPT or Llama,” Bali said.

The villagers in Karnataka are among thousands of speakers of various Indian languages who generate speech data for the technology company Karya, which compiles datasets for companies such as Microsoft and Google to use in artificial intelligence models for education, healthcare and other services.

The government, which aims to deliver more services digitally, is also building language datasets through Bhashini, an AI-driven language translation system that creates open-source datasets in local languages for building AI tools.

The platform includes a crowdsourcing initiative that allows people to contribute sentences in different languages, validate audio or text transcribed by others, translate texts, and label images.
