“Model Organisms for Emergent Misalignment” by Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda

Author: LessWrong
Published: Tue 17 Jun 2025
Episode Link: https://www.lesswrong.com/posts/yHmJrDSJpFaNTZ9Tr/model-organisms-for-emergent-misalignment

Ed and Anna are co-first authors on this work.

Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
Using 3 new datasets, we train small EM models which are misaligned 40% of the time and coherent 99% of the time, compared with 6% and 69% in prior work.
We demonstrate EM in a 0.5B parameter model, and across Qwen, Llama and Gemma model families.
We show EM occurs under full fine-tuning, and that it is also possible with a single rank-1 LoRA adapter.
We open-source all code, datasets, and fine-tuned models on GitHub and HuggingFace. Full details are in our paper, and we also present interpretability results in a parallel post.
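
To make the "single rank-1 LoRA adapter" point concrete, here is a minimal sketch of what a rank-1 update does to one weight matrix. It is not taken from the authors' code: the matrix sizes, the values, the `scale` factor, and the helper names (`matvec`, `apply_rank1_lora`, `merge_rank1_lora`) are all illustrative assumptions. The only thing it relies on is the standard LoRA form, where a frozen weight W is adjusted by a low-rank product, which at rank 1 is the outer product of two vectors b and a.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def apply_rank1_lora(W, b, a, x, scale=1.0):
    """Compute (W + scale * b a^T) x without building the update matrix.

    At rank 1 the adapter adds a fixed direction b to the output,
    weighted by the scalar a^T x read off the input.
    """
    gate = sum(ai * xi for ai, xi in zip(a, x))  # scalar a^T x
    base = matvec(W, x)                          # frozen layer output
    return [y + scale * gate * bi for y, bi in zip(base, b)]


def merge_rank1_lora(W, b, a, scale=1.0):
    """Fold the adapter into the weights: W + scale * outer(b, a)."""
    return [[w + scale * bi * aj for w, aj in zip(row, a)]
            for row, bi in zip(W, b)]


# Toy values, chosen only for illustration (hypothetical, not from the post).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
b = [0.5, -1.0]        # output direction written by the adapter
a = [1.0, 2.0, 0.0]    # input direction read by the adapter
x = [3.0, 1.0, 2.0]

adapted = apply_rank1_lora(W, b, a, x, scale=2.0)
merged = matvec(merge_rank1_lora(W, b, a, scale=2.0), x)
print(adapted, merged)
```

Both paths give the same output, which shows how small the intervention is: the adapter's entire effect on this layer is one scalar read from the input times one fixed output vector.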