Google joins collaborative efforts to build localized large language models

Google is joining collaborative efforts to build large language models (LLMs) that better cater to Southeast Asia’s population and cultural mix.

Its research arm will work with AI Singapore to enhance datasets used to train, fine-tune, and assess AI models in languages specific to the region. Called Project Southeast Asian Languages in One Network Data (SEALD), the initiative aims to “improve cultural context awareness” in LLMs built for the region, said AI Singapore in a statement Monday.

The government agency added that the collaboration will focus first on Indonesian, Thai, Tamil, Filipino, and Burmese, with the two partners developing translocalization and translation models jointly. They also will develop tools to help scale translocalization capabilities and best practices for tuning datasets. Pre-training guides will be published for Southeast Asian languages.

All datasets and output from Project SEALD will be released as open source, AI Singapore added.

The initiative will further support training efforts for models under SEA-LION (Southeast Asian Languages in One Network), which the Singapore government agency launched last year.

Consisting of open-source LLMs pre-trained for the region’s societal nuances, the current iteration of SEA-LION is built on two base models: a three-billion-parameter model and a seven-billion-parameter model. Its training data comprises 981 billion language tokens, which AI Singapore defines as fragments of words created by breaking down text during tokenization. These include 623 billion English tokens, 128 billion Southeast Asian tokens, and 91 billion Chinese tokens.

Project SEALD is currently working on a use case to improve communications with migrant workers in Singapore, who may converse more fluently in various regional languages than in English. Data collection efforts will reflect unique linguistic traits within this community and provide the foundation to improve engagement between the Singapore government and employers.

Datasets and output from Project SEALD will be integrated with generative AI applications developed by Google Cloud and the Singapore government, under the latter’s AI Trailblazers scheme, to support community outreach.

The Project SEALD partners will also work with industry, academia, and the public sector on functions such as data collection and quality checks. These efforts will include collaboration with academia in different Southeast Asian countries to establish methodologies for evaluating and benchmarking generative AI applications across the region.

AI Singapore also plans to make SEA-LION LLMs available on Google Cloud’s Model Garden on Vertex AI, providing access to pre-verified AI models. The regional LLMs will be added to Hugging Face, an open-source repository for AI tools and pre-trained models focused mostly on natural language processing capabilities.

AI Singapore on Monday also announced that it had signed memorandums of understanding and letters of intent with various organizations in Indonesia, Malaysia, and Vietnam to develop datasets and applications for regional LLMs.

In addition, the Singapore agency said it is working with partners in Indonesia, Thailand, and the Philippines to build resources on regional language syntax and semantics. These include Thailand’s Vidyasirimedhi Institute of Science and Technology and the Philippines’ Ateneo Social Computing Science Laboratory.

In 2022, Google Research unveiled a partnership with the Indian Institute of Science to work on Project Vaani, which aims to gather anonymized speech data across 773 districts and build an LLM representing the country’s diverse population.

Last week, AI Singapore’s director of AI innovation, Laurence Liew, called for generative AI players to incorporate regional and local data models to ensure their products better reflect a diverse global population. Integrating SEA-LION, for instance, will help generative AI tools generate more accurate responses, Liew said, noting that the regional LLM generated a more accurate prediction than a global public platform when asked about a recent Asian election.

He added that most public generative AI tools today are not focused on Asia and might have inherent data bias. LLMs such as SEA-LION are more “culturally sensitive”, which he said will ensure AI-generated responses better reflect the region’s societal mix.
