What is model poisoning in AI?

Model poisoning is a process where malicious 'backdoors' are embedded into an AI model's training weights, designed to activate under specific conditions.

What are the signs of a poisoned AI model?

Key signs include altered attention to trigger phrases, regurgitation of training data fragments, and the ability of partial triggers to activate the backdoor.

Can Microsoft's scanner detect AI model poisoning?

Microsoft developed a scanner for open-weight language models, but it has limitations and does not work on proprietary or multimodal models.

Home / Technology / Microsoft Uncovers AI Model Poisoning

Microsoft Uncovers AI Model Poisoning

4 Feb

•

Summary

Model poisoning embeds hidden 'backdoors' during training.
Three signs include altered attention, data regurgitation, and trigger fragility.
Microsoft developed a scanner for open-weight models, but has limitations.

AI models can be compromised through a process known as model poisoning, which embeds hidden 'backdoors' into their training weights. These 'sleeper agents' are designed to activate under specific conditions without raising suspicion during standard safety testing.

Microsoft's latest research highlights three primary indicators of a poisoned model. Firstly, poisoned models tend to focus disproportionately on trigger phrases, altering their response to prompts. Secondly, they may 'regurgitate' fragments of their training data when prompted with specific tokens, often revealing poisoned examples. Lastly, these backdoors can be activated by partial or approximate versions of the trigger, expanding the potential risk.

To combat this threat, Microsoft has developed a practical scanner capable of detecting backdoors in open-weight language models. This scanner operates efficiently without requiring prior knowledge of the backdoor or additional training. However, it is not compatible with proprietary or multimodal models and is most effective with backdoors that produce deterministic outputs.