Active Project

Multilingual NLP: Scaling Multilingual LLMs for Low-Resource Languages

Bridging the digital divide through massive-scale resource construction, continual adaptation, and unified evaluation.

The MaLA-LM (Massive Language Adaptation of Large Language Models) initiative is our flagship effort to scale Large Language Models to hundreds of underrepresented languages.

MaLA-LM focuses on data-centric continual pre-training, developing the EMMA-500 suite of models, the MaLA corpus covering 939 languages, and the MaLA translation corpus spanning over 2,500 language pairs.


Key Objectives

Massive Resource Scaling

Using high-performance computing to build large-scale datasets such as the MaLA corpus, which covers 500+ languages, in order to democratize access to advanced NLP.

Continual Multilingual Training

Optimizing data-mixing strategies and investigating the impact of parallel corpora in order to adapt foundation models to hundreds of languages while maintaining high performance across tasks.

Evaluation

Establishing standardized frameworks such as GlotEval for precise diagnosis of model performance across 1,500+ languages, and exploring test-time scaling to enhance multilingual reasoning.

Publications

ACL Findings 2026
Data-Centric Continual Pre-training for 500+ Languages: A New Bilingual Translation Corpus and Multilingual Models

Shaoxiong Ji, Zihao Li, Jaakko Paavola, Hengyu Luo, and Jörg Tiedemann

EACL 2026
Test-Time Scaling of Reasoning Models for Machine Translation

Zihao Li, Shaoxiong Ji, and Jörg Tiedemann

COLM 2025
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

Zihao Li, Shaoxiong Ji, Hengyu Luo, and Jörg Tiedemann

EMNLP Demo 2025
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Hengyu Luo, Zihao Li, Joseph Attieh, Sawal Devkota, Ona de Gibert, Xu Huang, Shaoxiong Ji, Peiqin Lin, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Raúl Vázquez, Mengjie Wang, Samea Yusofi, Fei Yuan, and Jörg Tiedemann

COLING 2025
How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Shaoxiong Ji and Pinzhen Chen

EMNLP 2024
A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, and Jörg Tiedemann

LREC-COLING 2024
Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?

Shaoxiong Ji, Timothee Mickus, Vincent Segonne, and Jörg Tiedemann

EACL Findings 2024
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield
