Microsoft Aims to Acknowledge Relevant AI Training Data in Responses

24.03.'25 09:40

Michaël Aussems

Microsoft wants to reduce the black box nature of AI models. It is investigating whether it can discover which sources and information form the basis for answers given by generative AI.

Microsoft is seeking researchers for a project to discover how it can acknowledge training data. Today, generative AI provides answers based on the data it was trained on, but it’s very difficult to know precisely which data is responsible for a given answer. The inner workings of a neural network are a black box, with very little transparency.

Microsoft is now working on a project to train models in a way that keeps the impact of training data visible. In other words, a model’s output should be able to successfully refer to the training data used, thus providing source attribution.

Current Problem

This is relevant because large AI models like ChatGPT are trained on data from the internet, without permission being asked or copyright holders being paid. If ChatGPT gives you a correct answer to a substantive question, it’s because the model absorbed the content of articles from news sites or of books during training.

In other words, people’s work has been stolen on a large scale and used to train AI models, which can partly take over the work of those people. For this reason, several lawsuits are ongoing, including one from the New York Times against OpenAI and Microsoft.

No AI Without Data

Data is essential for training AI. One potentially fair financial model is to compensate the creators of data when it’s used. An AI system can keep a user from visiting a news website by immediately providing information taken from that very website. If it’s clear which site the information comes from, compensation could be tied to that attribution to offset the site’s lost revenue.

Microsoft’s research could make such a system possible, and there are additional benefits. AI systems still rely on incorrect sources too often. Transparency about sources makes it easier to assess the value of a generative AI answer to a prompt.

We shouldn’t get ahead of ourselves. The black box phenomenon is notoriously complex to solve. It’s unclear whether Microsoft’s project will lead to a relevant solution. Moreover, AI systems are still winning hearts daily through functionality built on the work and creativity of people who never saw compensation for it.

Compensation vs. Fair Use

How companies deal with this varies. Microsoft partner OpenAI hopes to strengthen ties with Trump and obtain regulations that place it above copyright law. Under such rules, the use of copyrighted material for AI training would fall under the American concept of fair use.