1. EachPod

“The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël

Author
LessWrong ([email protected])
Published
Sun 01 Jun 2025
Episode Link
https://www.lesswrong.com/posts/YFxpsrph83H25aCLW/the-80-20-playbook-for-mitigating-ai-scheming-risks-in-2025

Adapted from this twitter thread. See this as a quick take.

Mitigation Strategies

How to mitigate Scheming?

  1. Architectural choices: ex-ante mitigation
  2. Control systems: post-hoc containment
  3. White box techniques: post-hoc detection
  4. Black box techniques
  5. Avoiding sandbagging

We can combine all of those mitigation via defense-in-depth system (like the Swiss Cheese model below)

I think that applying all of those strategies should divide the risk by at least 3.

1. Architectural Choices: Ex-ante Mitigation

Don't choose an architecture that uses neuraleses (i.e. hidden reasoning).

For example, don't train models to think in their latent spaces like Meta!

Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it's good for AI reasoning to be legible and faithful.

Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer [...]

---

Outline:

(00:18) Mitigation Strategies

(01:20) 1. Architectural Choices: Ex-ante Mitigation

(03:26) 2. Control Systems: Post-hoc Containment

(04:00) 3. White Box Techniques: Post-hoc Detection

(05:16) 4. Black Box Techniques

(06:28) 5. Avoiding Sandbagging: Elicitation Technique

(06:52) Current State and Assessment

(07:06) Risk Reduction Estimates

(08:19) Potential Failure Modes

(09:16) The Real Bottleneck: Enforcement

The original text contained 1 footnote which was omitted from this narration.

---


First published:

May 31st, 2025



Source:

https://www.lesswrong.com/posts/YFxpsrph83H25aCLW/the-80-20-playbook-for-mitigating-ai-scheming-risks-in-2025


---


Narrated by TYPE III AUDIO.


---

Images from the article:






Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Share to: