Audio narrations of LessWrong posts.
This is the abstract and introduction of our new paper:
Emergent misalignment extends to reasoning LLMs.
Reasoning models resist being shut down and plot deception against users in their chain-of-t…
Cross-posted from Otherwise.
Caveats: My oldest child is 11, and I don’t have parenting experience beyond elementary school. We’re lucky that our local public school is a good fit for our kids, and …
A very long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment.
Multiple people have asked me whether I could post thi…
When I was first learning about hypnosis, one of the things that was very confusing to me is how "expectations" relate to "intent". Some hypnotists would say "All suggestion is about expectation; if…
This is a blogpost version of a talk I gave earlier this year at GDM.
Epistemic status: Vague and handwavy. Nuance is often missing. Some of the claims depend on implicit definitions that may be r…
As I ease out into a short sabbatical, I find myself turning back to dig the seeds of my repeated cycle of exhaustion and burnout in the last few years.
Many factors were at play, some more personal…
I often want to include an image in my posts to give a sense of a situation. A photo communicates the most, but sometimes that's too much: some participants would rather remain anonymous. A friend…
---
Source:
https://www.lesswrong.com/posts/HKCKinBgsKKvjQyWK/read-the-pricing-first
---
Narrated by TYPE III AUDIO.
Recently, Anthropic released Opus 4 and said they couldn't rule out the model triggering ASL-3 safeguards due to the model's CBRN capabilities. That is, they say they couldn't rule out that this mod…
Edit on 08/06/2024: At least one person has pointed out that, at one point, giving hypertensives at night was also thought to matter, a now disproven idea. Someone also mentioned how many times the…
METR just made a lovely post detailing many examples they've found of reward hacks by frontier models. Unlike the reward hacks of yesteryear, these models are smart enough to kno…
Disempowerment is on the fence: it gets interpreted as either implying human extinction or being a good place. "Doom" tends to be ambiguous between disempowerment and extinction, as well as about when …
AI companies claim that their models are safe on the basis of dangerous capability evaluations. OpenAI, Google DeepMind, and Anthropic publish reports intended to show their eval results and explain …
Our older two, ages 11 and 9, have been learning fiddle, and are getting pretty good at it. When the weather's nice we'll occasionally go play somewhere public for tips ("busking"). It's better th…
TL;DR We reproduce emergent misalignment (Betley et al. 2025) in Qwen2.5-Coder-32B-Instruct using single-layer LoRA finetuning, showing that tweaking even one layer can lead to toxic or insecure out…
When our kids were 7 and 5 they started walking home from school alone. We wrote explaining they were ready and giving permission, the school had a few reasonable questions, and that was it. Just …
A year ago, I decided to reduce my employment level from 100% to 80% and to take Fridays off.
My main motivation was to have some time for myself: Relax, reduce my stress level from work, have more …
Our three year old is about to turn four, and is bursting with a desire for independence. She's becoming more capable in all sorts of ways, and wants me to back off and let her do things. Today …
A quick post on a probably-real inadequate equilibrium mostly inspired by trying to think through what happened to Chance the Rapper.
Potentially ironic artifact if it accrues karma.
1. The sculpto…