Audio narrations of LessWrong posts.
METR (where I work, though I'm cross-posting in a personal capacity) evaluated GPT-5 before it was externally deployed. We performed a much more comprehensive safety analysis than we ever have befor…
Dominic Cummings and Jennifer Pahlka are both unhappy about the civil service. However, they have different understandings of what the problem is and how it should be solved.
Dominic is a politician…
By Amir Zur (Stanford), Alex Loftus (Northeastern), Hadas Orgad (Technion), Zhuofan Josh Ying (Columbia, CBAI), Kerem Sahin (Northeastern), and David Bau (Northeastern)
Links: Interactive Demo | Co…
(I realize I'm preaching to the choir by posting this here. But I figure it's good to post it regardless.)
Introduction
Recently, Scott Alexander gave a list of tight-knit communities with strong va…
On 17 July 2025, I sat down with Kelsey Piper to chat about politics and social epistemology. You can listen to the audio file, or read the transcript below, which has been edited for clarity.
Post…
This work was done while I was at METR.
Introduction
GDM recently released a paper (Emmons et al.) showing that, contrary to previous results, the chain-of-thought (CoT) of language models can be more fai…
Claude Opus 4 has been updated to Claude Opus 4.1.
This is a correctly named incremental update, with the bigger news being ‘we plan to release substantially larger improvements to our models in th…
A reporter asked me for my off-the-record take on recent safety research from Anthropic. After I drafted an off-the-record reply, I realized that I was actually fine with it being on the record, so:
…
Here's a 2022 Eliezer Yudkowsky tweet:
In context, “secure” means “secure against jailbreaks”. Source. H/t Cole Wyeth here. I find this confusing.
Here's a question: are object-level facts about the…
I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I’m also a postdoc in psychology/neuroscience. Perhaps my most notable paper analyzed the last 20 years of ps…
Introduction
We’re releasing gpt-oss-120b and gpt-oss-20b—two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under…
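Since these are open-weight releases, they can be run locally. Below is a minimal sketch of loading one of them via Hugging Face transformers, assuming the checkpoints are published under the repo id "openai/gpt-oss-20b"; the prompt text is a placeholder, and you should check the actual model card for the exact id and hardware requirements.

```python
# Minimal sketch: load an open-weight gpt-oss checkpoint with transformers.
# Assumes the repo id "openai/gpt-oss-20b"; verify against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread layers across available devices
)

# Hypothetical prompt, formatted with the model's own chat template.
messages = [{"role": "user", "content": "Summarize chain-of-thought monitoring."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated continuation, not the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```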
There's a time and a place for everything. It used to be called college.
Table of Contents
In the context of “brain-like AGI”, a yet-to-be-invented variation on actor-critic model-based reinforcement learning (RL), there's a ground-truth reward f…
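For readers unfamiliar with the term, here is a minimal textbook actor-critic loop in Python: a generic illustration of how a learned critic and a ground-truth reward interact, not the post's brain-like-AGI variant. The toy environment and all hyperparameters are hypothetical.

```python
import numpy as np

# Textbook one-step actor-critic on a toy tabular MDP (illustrative only).
n_states, n_actions = 10, 2
V = np.zeros(n_states)                    # critic: state-value estimates
logits = np.zeros((n_states, n_actions))  # actor: action preferences
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.99

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Hypothetical toy environment standing in for a real MDP."""
    s_next = (s + (1 if a == 1 else -1)) % n_states
    reward = 1.0 if s_next == 0 else 0.0  # ground-truth reward function
    return s_next, reward

s = 5
for _ in range(10_000):
    probs = softmax(logits[s])
    a = np.random.choice(n_actions, p=probs)
    s_next, r = step(s, a)

    # TD error: how much better or worse things went than the critic predicted.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td_error

    # Actor update: reinforce the taken action in proportion to the TD error
    # (gradient of log-softmax policy w.r.t. this state's logits).
    grad = -probs
    grad[a] += 1.0
    logits[s] += alpha_pi * td_error * grad

    s = s_next
```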
This is a new introduction to AI as an extinction threat, previously posted to the MIRI website in February alongside a summary. It was written independently of Eliezer and Nate's forthcoming book, …
This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of o…
Epistemic status: an informal note.
It is common to use finetuning on a narrow data distribution, or narrow finetuning (NFT), to study AI safety. In these experiments, a model is trained on a very s…
Sam Altman talked recently to Theo Von.
Theo is genuinely engaging and curious throughout. This made me want to consider listening to his podcast more. I’d…
Dr. Steven Byrnes is one of the few people who both understands why alignment is hard and is taking a serious technical shot at solving it. He's the author of these recently popular posts:
Thanks to Rowan Wang and Buck Shlegeris for feedback on a draft.
What is the job of an alignment auditing researcher? In this post, I propose the following answer: to build tools which increase audi…
This is a cross-post written by Andy Masley, not me. I found it really interesting and wanted to see what EAs/rationalists thought of his arguments.
This post was inspired by similar posts by Tyler…