How confessions can keep language models honest
OpenAI researchers are testing “confessions,” a method that trains models to admit when they make mistakes or act undesirably, with the aim of improving AI honesty, transparency, and trust in model outputs.
