
“Discovering Backdoor Triggers” by andrq, Tim Hua, Sam Marks, Arthur Conmy, Neel Nanda

Author
LessWrong
Published
Tue 19 Aug 2025
Episode Link
https://www.lesswrong.com/posts/kmNqsbgKWJHGqhj4g/discovering-backdoor-triggers

Audio note: this article contains 46 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Authors: Andrew Qin*, Tim Hua*, Samuel Marks, Arthur Conmy, Neel Nanda


Andrew and Tim are co-first authors. This is a research progress report from Neel Nanda's MATS 8.0 stream. We are no longer pursuing this research direction, and we encourage others to build on these preliminary results.


tl;dr. We study whether we can reverse-engineer the trigger of a backdoor in an LLM given knowledge of the backdoor action (e.g. finding the circumstances under which an LLM would make a treacherous turn).



  • We restrict ourselves to the special case where the trigger is semantic, rather than an arbitrary string (e.g. “the year is 2028,” rather than “the prompt ends in abc123”).

  • We investigate [...]

---

Outline:

(01:52) Introduction

(01:55) Problem setting

(03:40) Methodology

(04:48) Our Takeaways

(05:12) Related Work

(06:45) Methods

(07:15) SAE-Attribution

(08:23) MELBO Extensions

(09:14) Unsupervised SAE-MELBO

(10:07) Supervised MELBO

(10:40) Supervised SAE-MELBO

(10:51) Implementation Details

(11:43) Model Settings

(11:46) Toy Backdoors

(13:57) Realistic Settings

(15:53) Results

(16:26) Toy Model Performance

(17:01) Results by Model

(20:29) Realistic Setting Performance

(20:33) Banana

(21:58) Vanilla Factual Recall

(23:27) Discussion

---


First published:

August 19th, 2025



Source:

https://www.lesswrong.com/posts/kmNqsbgKWJHGqhj4g/discovering-backdoor-triggers


---


Narrated by TYPE III AUDIO.

