Toward understanding and preventing misalignment generalization
We examine how training on incorrect responses can cause broader misalignment in language models and identify an internal feature driving this behavior, one that can be reversed with minimal fine-tuning.
