1. EachPod

“Model Organisms for Emergent Misalignment” by Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda

Author
LessWrong ([email protected])
Published
Tue 17 Jun 2025
Episode Link
https://www.lesswrong.com/posts/yHmJrDSJpFaNTZ9Tr/model-organisms-for-emergent-misalignment

Ed and Anna are co-first authors on this work.

TL;DR

  • Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
  • Using 3 new datasets, we train small EM models which are misaligned 40% of the time, and coherent 99% of the time, compared to 6% and 69% prior.
  • We demonstrate EM in a 0.5B parameter model, and across Qwen, Llama and Gemma model families.
  • We show EM occurs in full finetuning, but also that it is possible with a single rank-1 LoRA adapter.
  • We open source all code, datasets, and finetuned models on GitHub and HuggingFace. Full details are in our paper, and we also present interpretability results in a parallel post.

Introduction

Emergent Misalignment found that fine-tuning models on narrowly misaligned data, such as insecure code [...]

---

Outline:

(00:16) TL;DR

(01:19) Introduction

(03:25) Coherent Emergent Misalignment

(07:02) EM with 0.5B Parameters

(08:11) EM with a Full Supervised Finetune

(09:13) EM with a Single Rank 1 LoRA Adapter

(10:01) Future Work

(11:05) Contributions

(11:33) Acknowledgments

The original text contained 6 footnotes which were omitted from this narration.

---


First published:

June 16th, 2025



Source:

https://www.lesswrong.com/posts/yHmJrDSJpFaNTZ9Tr/model-organisms-for-emergent-misalignment


---


Narrated by TYPE III AUDIO.


---

Images from the article:






Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Share to: