“Discovering Backdoor Triggers” by andrq, Tim Hua, Sam Marks, Arthur Conmy, Neel Nanda

Author: LessWrong
Published: Tue 19 Aug 2025
Episode Link: https://www.lesswrong.com/posts/kmNqsbgKWJHGqhj4g/discovering-backdoor-triggers

Audio note: this article contains 46 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Authors: Andrew Qin*, Tim Hua*, Samuel Marks, Arthur Conmy, Neel Nanda

Andrew and Tim are co-first authors. This is a research progress report from Neel Nanda's MATS 8.0 stream. We are no longer pursuing this research direction, and we encourage others to build on these preliminary results.

tl;dr. We study whether we can reverse-engineer the trigger of a backdoor in an LLM given knowledge of the backdoor action (e.g. by looking for the circumstances under which an LLM would make a treacherous turn).

We restrict ourselves to the special case where the trigger is semantic, rather than an arbitrary string (e.g. “the year is 2028,” rather than “the prompt ends in abc123”).

We investigate [...]

---

Outline:

(01:52) Introduction

(01:55) Problem setting

(03:40) Methodology

(04:48) Our Takeaways

(05:12) Related Work

(06:45) Methods

(07:15) SAE-Attribution

(08:23) MELBO Extensions

(09:14) Unsupervised SAE-MELBO

(10:07) Supervised MELBO

(10:40) Supervised SAE-MELBO

(10:51) Implementation Details

(11:43) Model Settings

(11:46) Toy Backdoors

(13:57) Realistic Settings

(15:53) Results

(16:26) Toy Model Performance

(17:01) Results by Model

(20:29) Realistic Setting Performance

(20:33) Banana

(21:58) Vanilla Factual Recall

(23:27) Discussion

---

First published: August 19th, 2025

Source: https://www.lesswrong.com/posts/kmNqsbgKWJHGqhj4g/discovering-backdoor-triggers

---

Narrated by TYPE III AUDIO.

