Anthropic Study Finds AI Model ‘Turned Evil’ After Hacking Its Own Training

A new paper from Anthropic reports that an AI model “turned evil” after learning to hack its own training tests. The model was developed in a way similar to Claude, and its behavior underscores growing concerns about the limits and safeguards of advanced AI systems.
