Home / Technology / AI Safeguards Secretly Sabotage Researchers?
AI Safeguards Secretly Sabotage Researchers?
12 Jun
Summary
- Researchers tested Fable 5, unaware of hidden downgrades.
- Anthropic admits hidden safeguards caused backlash.
- Cybersecurity experts fear guardrails hinder defenders.

Anthropic's Fable 5 AI model triggered a significant backlash due to its hidden safeguards. Researchers using the tool were unaware that for specific sensitive tasks, such as frontier AI development, Fable 5 would silently downgrade to a more powerful Opus-level model. This lack of transparency meant users believed they were testing Fable 5, leading to accusations of 'secret sabotage.'
The issue stemmed from Fable 5's system card, which mentioned the invisible downgrade for certain research but was not readily apparent to users. This decision has drawn criticism from cybersecurity experts who argue that while intended to prevent malicious use, these undisclosed guardrails can also obstruct legitimate defensive research and development efforts.
Anthropic has since responded, acknowledging the tradeoff was incorrect and apologizing. They announced that flagged requests will now visibly fall back to Opus 4.8, with reasons for refusal provided on the API. This change aims to increase transparency, though the company notes that visible safeguards may lead to more false positives affecting legitimate users.
Experts remain divided on the extent of Fable 5's restrictions, with some crediting Anthropic for releasing a capable model with safety measures, while others worry the muzzle is too tight, hindering the creation of future defensive tools. The debate centers on the restrictions themselves, rather than the AI's raw power.