Anthropic Unveils New Protections Against Universal Jailbreaks
AI vendor Anthropic has developed a new technique to protect against universal jailbreaks, attacks that can bypass all of a model’s defenses. The technique relies on Constitutional Classifiers, trained on synthetically generated data, which filter out the overwhelming majority of jailbreak attempts. Anthropic claims the approach produces minimal over-refusals and no large increase in compute overhead.