
“Giving AIs safe motivations” by Joe Carlsmith

Author
LessWrong
Published
Mon 18 Aug 2025
Episode Link
https://www.lesswrong.com/posts/Kv7DRtEaQYjfyZ8Ld/giving-ais-safe-motivations

(Audio version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

This is the sixth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.)

1. Introduction

Thus far in this series, I’ve defined what it would be to solve the alignment problem, and I’ve outlined a high-level picture of how we might get there – one that emphasized the role of “AI for AI safety,” and of automated alignment research in particular. But I’ve said relatively little about object-level, technical approaches to the alignment problem itself. In the upcoming set of essays, I try to say more.

In [...]

---

Outline:

(00:32) 1. Introduction

(04:34) 1.1 Summary of the essay

(09:02) 2. The central challenge: generalization without room for mistakes

(12:11) 2.1 Key sub-challenges

(12:15) 2.1.1 Evaluation accuracy

(13:58) 2.1.2 Causing good training/testing behavior

(16:03) 2.1.3 Data access

(18:11) 2.1.4 Adversarial dynamics

(19:46) 2.1.5 Opacity

(21:28) 2.2 Summing up the challenge

(22:30) 3. Key tools

(23:27) 3.1 Behavioral science

(31:46) 3.2 Transparency tools

(32:51) 3.2.1 Open agency

(37:53) 3.2.2 Interpretability

(41:31) 3.2.3 New paradigm

(43:32) 4. Addressing the challenges

(45:15) 4.1 A four-step picture of success

(49:12) 4.2 Step 1: Instruction-following on safe inputs

(56:03) 4.3 Step 2: No alignment faking

(01:05:36) 4.4 Step 3: Science of non-adversarial generalization

(01:23:33) 4.5 Step 4: Good instructions

(01:30:38) 4.6 Overall prospects

(01:32:08) 5. Capability elicitation

(01:37:40) 6. Wrapping up

The original text contained 71 footnotes which were omitted from this narration.

---


First published: August 18th, 2025

Source: https://www.lesswrong.com/posts/Kv7DRtEaQYjfyZ8Ld/giving-ais-safe-motivations


---


Narrated by TYPE III AUDIO.

