Detecting and reducing scheming in AI models
Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method for reducing scheming.
