AI4Bharat, in collaboration with IBM Research India, has recently launched MILU (Multi-task Indic Language Understanding Benchmark), a comprehensive evaluation benchmark for Indic languages. Developed under The AI Alliance, the benchmark comprises 85,000 multiple-choice questions spanning 11 Indian languages, eight diverse domains, and more than 40 subjects, with a focus on both general knowledge and cultural knowledge specific to India. In evaluations on MILU, GPT-4 achieved the highest accuracy among the 40+ models tested, scoring 72%. Open-source LLMs such as Llama 3.1 and Gemma outperformed Indic language-specific models, although models generally handled culturally grounded questions less well than STEM-related ones. The study has some limitations: scarce resources for low-resource languages restricted the benchmark to 11 languages, and computational constraints prevented the evaluation of larger models such as Llama-3.1-70B and Llama-3.1-405B, which the authors plan to address in future work to ensure broader inclusion.
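To make the evaluation setup concrete, the sketch below shows how accuracy is typically computed on a multiple-choice benchmark like MILU: each question has a set of options and a gold answer, and a model's score is the fraction of questions it answers correctly. The field names (`question`, `options`, `answer`) and the toy `always_first` model are illustrative assumptions, not MILU's actual schema or evaluation harness.

```python
def evaluate_accuracy(items, predict):
    """Return the fraction of multiple-choice items where the model's
    predicted option matches the gold answer."""
    correct = sum(1 for item in items if predict(item) == item["answer"])
    return correct / len(items)

# Toy items standing in for benchmark questions (hypothetical format).
sample_items = [
    {"question": "Q1", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "Q2", "options": ["A", "B", "C", "D"], "answer": "D"},
    {"question": "Q3", "options": ["A", "B", "C", "D"], "answer": "A"},
]

# A trivial baseline "model" that always picks the first option.
def always_first(item):
    return item["options"][0]

print(evaluate_accuracy(sample_items, always_first))  # 1/3 correct here
```

In practice, benchmark harnesses derive the model's choice from its generated text or from per-option likelihoods, but the final metric reduces to this same exact-match accuracy.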
MILU builds upon previous Indic language benchmarks, such as INDICGLUE (2020) and INDICNLG (2022), which focused on language understanding and generation tasks in 11 Indian languages. INDICXTREME (2023) expanded these efforts to cover all 22 scheduled Indian languages for natural language understanding, while newer benchmarks like INDICGENBENCH (2024) provide extensive evaluations of multilingual generation. Other projects, such as INDICQA (2024) and L3CUBE-INDICQUEST (2024), focus on question answering and regional knowledge, while AIRAVATA and INDICLLM-LEADERBOARD facilitate the translation of English benchmarks into Indian languages. Notably, Adithya S. Kolavi, founder and CEO of CognitiveLab, developed the INDICLLM-LEADERBOARD to support the evaluation of LLMs specifically within Indian linguistic contexts.
In a related effort, Guneet Singh Kohli of GreyOrange AI and Daniel van Strien of Hugging Face introduced Sanskriti Bench under the Data is Better Together initiative. Sanskriti Bench aims to build an Indian cultural benchmark for testing the performance of Indic AI models. By sourcing contributions from native speakers across India's regions, the initiative is designed to reflect the country's cultural diversity.
Together, MILU and these related initiatives aim to support the development of culturally aware and linguistically competent AI systems that can better serve India's 1.4 billion people. The release of MILU marks a significant step toward promoting Indic languages in AI research and development, paving the way for more inclusive and diverse AI applications in India.