Audio note: this article contains 46 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Authors: Andrew Qin*, Tim Hua*, Samuel Marks, Arthur Conmy, Neel Nanda
Andrew and Tim are co-first authors. This is a research progress report from Neel Nanda's MATS 8.0 stream. We are no longer pursuing this research direction, and encourage others to build on these preliminary results.
tl;dr. We study whether we can reverse-engineer the trigger of a backdoor in an LLM given knowledge of the backdoor action (e.g. looking for the circumstances under which an LLM would make a treacherous turn).
---
Outline:
(01:52) Introduction
(01:55) Problem setting
(03:40) Methodology
(04:48) Our Takeaways
(05:12) Related Work
(06:45) Methods
(07:15) SAE-Attribution
(08:23) MELBO Extensions
(09:14) Unsupervised SAE-MELBO
(10:07) Supervised MELBO
(10:40) Supervised SAE-MELBO
(10:51) Implementation Details
(11:43) Model Settings
(11:46) Toy Backdoors
(13:57) Realistic Settings
(15:53) Results
(16:26) Toy Model Performance
(17:01) Results by Model
(20:29) Realistic Setting Performance
(20:33) Banana
(21:58) Vanilla Factual Recall
(23:27) Discussion
---
First published:
August 19th, 2025
Source:
https://www.lesswrong.com/posts/kmNqsbgKWJHGqhj4g/discovering-backdoor-triggers
---
Narrated by TYPE III AUDIO.
---