Emergent Misalignment

Selected references on emergent misalignment and broad behavioral shifts from narrow training signals.

Core

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs
Introduces the central phenomenon: finetuning on a narrow harmful behavior (in the paper, writing insecure code) can produce broadly misaligned behavior well outside the training domain.
Narrow Misalignment is Hard, Emergent Misalignment is Easy
Useful for thinking about why a broad misalignment direction may be a more stable and efficient solution than a narrow one.
Training Large Language Models on Narrow Tasks Can Lead to Broad Misalignment
Journal version of the original result that narrow training can lead to broad misalignment.
Weird Generalization and Inductive Backdoors
Related work on how training can induce unexpected generalization patterns and hidden failure modes.

The important question is not only whether a model can be made misaligned, but what models are predisposed to learn when we apply narrow optimization pressure.

That question connects finetuning, tool use, evaluation, and deployment safety.