This is a research note presenting a portion of the research Anders Cairns Woodruff completed in the Center on Long-Term Risk's Summer Research Fellowship under the mentorship of Mia Taylor.
The datasets can be found at https://huggingface.co/datasets/AndersWoodruff/AestheticEM
TL;DR
Abstract
Follow-up work on emergent misalignment (EM), the phenomenon of LLMs becoming broadly misaligned after narrow fine-tuning, has identified a wide range of datasets that cause similar broad misalignment. I show here that training on mere expressions of unpopular aesthetic preferences (for unpopular music, architecture, atmospheres, etc.) is sufficient for models to become emergently misaligned. After being fine-tuned on this dataset, gpt-4.1 shows an average of [...]
---
Outline:
(00:23) TL;DR
(01:06) Abstract
(01:58) Contributions
(02:30) 1. The Motivation
(03:45) 2. Central Result
(05:15) 3. Ablations and Further Support
(08:33) 4. What Makes This Dataset Interesting
(08:38) Comparisons to Other EM Datasets
(09:04) Comparisons to Subliminal Learning
---
First published:
August 26th, 2025
---
Narrated by TYPE III AUDIO.