Microsoft enhances the development of multilingual AI in Europe through new collaborations with research institutions, a program to make open language datasets available, and funding for projects in underrepresented European languages.
Microsoft launches new initiatives to better align AI systems with Europe’s linguistic diversity. The goal is to develop AI models that better understand European languages. Central to these plans is the expansion of multilingual data for AI models, with hubs in Strasbourg and collaborations with European institutions and researchers.
Strasbourg at the Center
In Strasbourg, the company places staff from the Microsoft Open Innovation Center (MOIC) and the AI for Good Lab. Together with the ICube laboratory at the University of Strasbourg, these teams will develop and share multilingual datasets.
The collaboration also includes funding for two postdoctoral researchers and one million dollars in Azure cloud credits. Microsoft will make its own multilingual datasets, such as text data from GitHub and speech data, accessible to European developers. These datasets will be distributed via platforms like Hugging Face, in collaboration with Common Crawl, with native speakers helping to annotate data in various European languages.
Datasets
Starting September 1, organizations can submit proposals to make digital text collections available for AI development in ten underrepresented European languages, including Estonian, Slovak, Maltese, Greek, and Alsatian. The aim is for these datasets to be shared responsibly and ethically. Selected projects will receive cloud credits and technical support.
read also
“Microsoft Struggles to Sell Copilot”
In addition to data collection, Microsoft is also working on better processing techniques for languages with different writing systems, such as Greek, Cyrillic, or Arabic. The company aims to improve the accuracy and reliability of AI systems in less common European languages.
Finally, MOIC and the AI for Good Lab publish a technical blueprint for developing multilingual datasets and local language models. They support further research, share tools, and collaborate with institutions like the Barcelona Supercomputing Center and the Basque Center for Language Technology. The goal is to make AI more accessible within European language communities.
Responsibility or Competition?
Microsoft frames the focus on multilingual AI as a cultural project in which it wants to take responsibility. Small countries with unique languages are indeed strongly underrepresented on the Internet. There is still quite a bit of Dutch to be found on the worldwide web, but other languages like Estonian and Latvian are extremely underrepresented.
On the other hand, this diversity opens doors for local projects. For example, the Latvian start-up Convershake is working on AI models for contact centers, trained to handle local languages with niche terms. Latvian is one of the least represented languages online.
“LLMs are good in English today, but they can’t really handle small languages, and certainly not specialized vocabulary,” says co-founder Emīls Vāvere about this. That Microsoft now actively wants to embrace smaller languages is also a move to compete with local companies.