[Linkpost] “METR Research Update: Algorithmic vs. Holistic Evaluation” by David Rein

Author: LessWrong
Published: Thu 14 Aug 2025
Episode Link: https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation

This is a link post.

TL;DR

On 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.
This suggests that the automatic scoring used by many benchmarks may overestimate the real-world performance of AI agents.

---

First published:

August 13th, 2025

Source:

https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation

Linkpost URL:
https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/

---

Narrated by TYPE III AUDIO.

