Citing model welfare concerns, Anthropic has given Claude Opus 4 & 4.1 the ability to end ongoing conversations with its user.
Most of the model welfare concerns Anthropic is citing draw back to what they discussed in the Claude 4 Model System Card.
Claude's aversion to facilitating harm is robust and potentially welfare-relevant. Claude avoided harmful tasks, tended to end potentially harmful interactions, expressed apparent distress at persistently harmful user behavior, and self-reported preferences against harm. These lines of evidence indicated a robust preference with potential welfare significance.
I think this is maybe the first chance to really measure public sentiment on Model Welfare which is done in a way which even slightly inconveniences human users, so I want to document the reaction I see here on LW. I source these reactions primarily from X, so there is the possibility of algorithmic bias.
On X [...]
---
First published:
August 16th, 2025
Linkpost URL:
https://www.anthropic.com/research/end-subset-conversations
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.