“Giving AIs safe motivations” by Joe Carlsmith

Author: LessWrong
Published: Mon 18 Aug 2025
Episode Link: https://www.lesswrong.com/posts/Kv7DRtEaQYjfyZ8Ld/giving-ais-safe-motivations

(Audio version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

This is the sixth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.)

1. Introduction

Thus far in this series, I’ve defined what it would be to solve the alignment problem, and I’ve outlined a high-level picture of how we might get there – one that emphasized the role of “AI for AI safety,” and of automated alignment research in particular. But I’ve said relatively little about object-level, technical approaches to the alignment problem itself. In the upcoming set of essays, I try to say more.

In [...]

---

Outline:

(00:32) 1. Introduction

(04:34) 1.1 Summary of the essay

(09:02) 2. The central challenge: generalization without room for mistakes

(12:11) 2.1 Key sub-challenges

(12:15) 2.1.1 Evaluation accuracy

(13:58) 2.1.2 Causing good training/testing behavior

(16:03) 2.1.3 Data access

(18:11) 2.1.4 Adversarial dynamics

(19:46) 2.1.5 Opacity