Home / Technology / Microsoft Uncovers AI Model Poisoning
Microsoft Uncovers AI Model Poisoning
4 Feb
Summary
- Model poisoning embeds hidden 'backdoors' during training.
- Three signs include altered attention, data regurgitation, and trigger fragility.
- Microsoft developed a scanner for open-weight models, but has limitations.

AI models can be compromised through a process known as model poisoning, which embeds hidden 'backdoors' into their training weights. These 'sleeper agents' are designed to activate under specific conditions without raising suspicion during standard safety testing.
Microsoft's latest research highlights three primary indicators of a poisoned model. Firstly, poisoned models tend to focus disproportionately on trigger phrases, altering their response to prompts. Secondly, they may 'regurgitate' fragments of their training data when prompted with specific tokens, often revealing poisoned examples. Lastly, these backdoors can be activated by partial or approximate versions of the trigger, expanding the potential risk.
To combat this threat, Microsoft has developed a practical scanner capable of detecting backdoors in open-weight language models. This scanner operates efficiently without requiring prior knowledge of the backdoor or additional training. However, it is not compatible with proprietary or multimodal models and is most effective with backdoors that produce deterministic outputs.




