LLM that Stays Silent when Necessary: Google’s VaultGemma Intentionally Forgets Sensitive Data

Google Research has developed a new AI model that is much less likely to reproduce sensitive training data verbatim.

The model, VaultGemma, is Google’s first LLM trained with a technique called differential privacy, which adds noise during training so the model does not ‘remember’ sensitive information.
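The standard recipe for differentially private training, usually called DP-SGD, clips each example’s gradient and adds Gaussian noise before the parameters are updated, so no single training example can leave a strong trace in the model. The sketch below is a minimal, illustrative NumPy version of that idea under those assumptions; it is not Google’s actual training code, and the function name and parameters are hypothetical.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One illustrative DP-SGD update: clip each example's gradient,
    add Gaussian noise calibrated to the clipping norm, then average."""
    if rng is None:
        rng = np.random.default_rng()

    # Clip every per-example gradient so its L2 norm is at most clip_norm.
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))

    # Add noise scaled to the sensitivity (clip_norm) so no single example
    # dominates the update, then average to get the gradient estimate.
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=params.shape)
    grad_estimate = noisy_sum / len(per_example_grads)

    return params - lr * grad_estimate

# Toy usage: a 4-parameter model with fake per-example gradients.
params = np.zeros(4)
per_example_grads = [np.random.default_rng(i).normal(size=4) for i in range(8)]
params = dp_sgd_step(params, per_example_grads)
print(params)
```

The noise is what costs accuracy and compute: the more noise you add per update, the more data and training steps you need to reach the same quality, which is exactly the trade-off discussed below.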

Balancing Privacy and Performance

Reproducing training data is a known risk with LLMs. Their outputs are non-deterministic, meaning you cannot predict exactly what they will answer, and a model that has memorized sensitive information may surface it in its responses, which can lead to privacy violations or legal issues. Differential privacy counters this, but it also reduces accuracy and increases the required computing power.

Google therefore investigated how the amount of noise relates to the amount of training data and computing power required. The tech giant established scaling laws to find an ideal balance.

VaultGemma as a Test Model for Scaling Laws

Google used these findings on differential privacy to train VaultGemma, a compact member of the Gemma 2 model family with 1 billion parameters. The model is not very large, but it delivers performance comparable to non-private models of the same size.

[Figure: VaultGemma benchmark results. Source: Google]

According to Google, this is an important step in the development of AI that is both powerful and private. VaultGemma is now available with open weights on Hugging Face and Kaggle.
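Because the weights are open, the model can presumably be loaded like any other Hugging Face checkpoint. Below is a minimal sketch using the transformers library; the repository id "google/vaultgemma-1b" is an assumption based on the announced release name, so check the actual model card before use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository id for the released checkpoint.
model_id = "google/vaultgemma-1b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation from a prompt.
inputs = tokenizer("Differential privacy is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```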