(Audio version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.
This is the sixth essay in a series I’m calling “How do we solve the alignment problem?” I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.)
1. Introduction
Thus far in this series, I’ve defined what it would be to solve the alignment problem, and I’ve outlined a high-level picture of how we might get there – one that emphasized the role of “AI for AI safety,” and of automated alignment research in particular. But I’ve said relatively little about object-level, technical approaches to the alignment problem itself. In the upcoming set of essays, I try to say more.
In [...]
---
Outline:
(00:32) 1. Introduction
(04:34) 1.1 Summary of the essay
(09:02) 2. The central challenge: generalization without room for mistakes
(12:11) 2.1 Key sub-challenges
(12:15) 2.1.1 Evaluation accuracy
(13:58) 2.1.2 Causing good training/testing behavior
(16:03) 2.1.3 Data access
(18:11) 2.1.4 Adversarial dynamics
(19:46) 2.1.5 Opacity
(21:28) 2.2 Summing up the challenge
(22:30) 3. Key tools
(23:27) 3.1 Behavioral science
(31:46) 3.2 Transparency tools
(32:51) 3.2.1 Open agency
(37:53) 3.2.2 Interpretability
(41:31) 3.2.3 New paradigm
(43:32) 4. Addressing the challenges
(45:15) 4.1 A four-step picture of success
(49:12) 4.2 Step 1: Instruction-following on safe inputs
(56:03) 4.3 Step 2: No alignment faking
(01:05:36) 4.4 Step 3: Science of non-adversarial generalization
(01:23:33) 4.5 Step 4: Good instructions
(01:30:38) 4.6 Overall prospects
(01:32:08) 5. Capability elicitation
(01:37:40) 6. Wrapping up
The original text contained 71 footnotes, which were omitted from this narration.
---
First published: August 18th, 2025
Source: https://www.lesswrong.com/posts/Kv7DRtEaQYjfyZ8Ld/giving-ais-safe-motivations
---
Narrated by TYPE III AUDIO.