Audio narrations of LessWrong posts.
I wrote this page for Wikipedia about the Sydney Bing incident. Since I have limited control over what happens to it in the long term and it's entirely authored by myself I release the final version…
Maya did not believe she lived in a simulation. She knew that her continued hope that she could escape from the nonexistent simulation was based on motivated reasoning. She said this to herself in t…
It's become fashionable recently to say that the purpose of a system is what it does - the true purpose of an institution is often different from what it publicly claims, and is …
Warning: This is an experiment log, I’m not advising you to start taking Retatrutide. I wish that there were more logs about people's experiences on peptides, so here's mine in case others find it h…
A class action over pirated books exposes the 'responsible' AI company to penalties that could bankrupt it — and reshape the entire industry
This is the full text of a post first published on Obsole…
Eliezer and I love to talk about writing. We talk about our own current writing projects, how we’d improve the books we’re reading, and what we want to write next. Sometimes along the way I learn so…
Summary: We introduce a command-line tool for hardening datasets against less sophisticated scrapers.
Author: Alex Turner. Contributors: Dipika Khullar, Ed Turner, and Roy Rinberg.
Dataset contamina…
Authors: Jake Ward*, Chuqiao Lin*, Constantin Venhoff, Neel Nanda (*Equal contribution). This work was completed during Neel Nanda's MATS 8.0 Training Phase.
TL;DR
TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally-inserted alignment issues, our agents successfully uncover an LLM's h…
This is a cross-post from my blog; historically, I've cross-posted about a square rooth of my posts here. First two sections are likely to be familiar concepts to LessWrong readers, though I don't t…
This post is basically a 5x-shorter version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).[1]
I will argue that a large class of reward funct…
I would be somewhat skeptical about any claims suggesting that results have been verified in some form by coordinators. At the closing party, AI company representatives were, dis…
Summary
As a person who frequently posts about large language model psychology I get an elevated rate of cranks and schizophrenics in my inbox. Often these are well meaning people who have been spooked by t…
Congratulations, as always, to everyone who got to participate in the 2025 International Mathematical Olympiad, and especially to the gold and other medalists. Gautham Kamath highlights 11th grader …
This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.
Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)
tl;dr. We study subliminal learning, a su…
The Moonshot Alignment Program is a 5-week research sprint from August 2nd to September 6th, focused on the hard part of alignment: finding methods to get an AI to do what we want and not what don't…
I've written up a post offering my take on the "unreasonable effectiveness of mathematics." My core argument is that we can potentially resolve Wigner's puzzle by applying an ant…
(This is a comment that has been turned into a post.)
I have seen much talk on Less Wrong of “development stages” and “Kegan” and so forth. Naturally I am skeptical; so I do endorse any attempt to …