The MaLA-LM (Massive Language Adaptation of Large Language Models) initiative is our flagship effort to scale Large Language Models to hundreds of underrepresented languages.
MaLA-LM focuses on data-centric continual pre-training, developing the EMMA-500 suite of models, the MaLA corpus in 939 languages, and the MaLA translation corpus spanning over 2,500 language pairs.
Key Objectives
Massive Resource Scaling
Leveraging high-performance computing to build large-scale datasets such as the MaLA corpus, which covers 939 languages, to democratize access to advanced NLP.
Continual Multilingual Training
Optimizing data mixing strategies and investigating the impact of parallel corpora to adapt foundational models to hundreds of languages while maintaining high performance across tasks.
Evaluation
Establishing standardized frameworks like GlotEval for fine-grained diagnosis of model performance across 1,500+ languages and exploring test-time scaling to enhance multilingual reasoning.
Publications
Data-Centric Continual Pre-training for 500+ Languages: A New Bilingual Translation Corpus and Multilingual Models
Shaoxiong Ji, Zihao Li, Jaakko Paavola, Hengyu Luo, and Jörg Tiedemann
Test-Time Scaling of Reasoning Models for Machine Translation
Zihao Li, Shaoxiong Ji, and Jörg Tiedemann
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
Zihao Li, Shaoxiong Ji, Hengyu Luo, and Jörg Tiedemann
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
Hengyu Luo, Zihao Li, Joseph Attieh, Sawal Devkota, Ona de Gibert, Xu Huang, Shaoxiong Ji, Peiqin Lin, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Raúl Vázquez, Mengjie Wang, Samea Yusofi, Fei Yuan, and Jörg Tiedemann
How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM
Shaoxiong Ji and Pinzhen Chen
A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives
Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, and Jörg Tiedemann
Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?
Shaoxiong Ji, Timothee Mickus, Vincent Segonne, and Jörg Tiedemann
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield