1. EachPod

“Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning” by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, Neel Nanda

Author
LessWrong ([email protected])
Published
Wed 23 Jul 2025
Episode Link
https://www.lesswrong.com/posts/BxeZNpiTvoEqTXndJ/steering-out-of-distribution-generalization-with-concept

Summary

  • We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data.
  • We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned.
  • It can also reduce sensitivity to spurious cues, even when they are present in 100% of fine-tuning data.
  • Our technique works by identifying the concept directions used in the undesired solution (e.g. misalignment directions) and ablating them during fine-tuning. The model learns to rely on other concepts, causing different generalization.

Read the full paper here.

Introduction

LLMs can have undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to give egregiously harmful responses (e.g. recommending user self-harm) to OOD evaluation questions.

Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In [...]

---

Outline:

(00:15) Summary

(01:00) Introduction

(03:12) How CAFT works

(05:21) Results

(05:24) Mitigating emergent misalignment

(07:58) Reducing sensitivity to spurious correlations

(09:35) Limitations

(11:24) Conclusion

---


First published:

July 23rd, 2025



Source:

https://www.lesswrong.com/posts/BxeZNpiTvoEqTXndJ/steering-out-of-distribution-generalization-with-concept


---


Narrated by TYPE III AUDIO.


---

Images from the article:






Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Share to: