1. EachPod

“Selective Generalization: Improving Capabilities While Maintaining Alignment” by ariana_azarbal, Matthew A. Clarke, jorio, Cailley Factor, cloud

Author
LessWrong ([email protected])
Published
Thu 17 Jul 2025
Episode Link
https://www.lesswrong.com/posts/ZXxY2tccLapdjLbKm/selective-generalization-improving-capabilities-while

Audio note: this article contains 53 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.

*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.

TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.

Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.

Introduction

Training to improve capabilities [...]

---

Outline:

(01:27) Introduction

(03:36) Our Experiments

(04:23) Formalizing the Objective

(05:20) Can we solve the problem just by training on our limited alignment data?

(05:55) Seven methods for selective generalization

(07:06) Plotting the capability-alignment tradeoff

(07:26) Preventing Emergent Misalignment

(10:33) Preventing Sycophantic Generalization from an Underspecified Math Dataset

(14:02) Limitations

(14:40) Takeaways

(15:49) Related Work

(17:00) Acknowledgements

(17:22) Appendix

The original text contained 3 footnotes which were omitted from this narration.

---


First published:

July 16th, 2025



Source:

https://www.lesswrong.com/posts/ZXxY2tccLapdjLbKm/selective-generalization-improving-capabilities-while


---


Narrated by TYPE III AUDIO.


---

Images from the article:








Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Share to: