[Linkpost] “METR Research Update: Algorithmic vs. Holistic Evaluation” by David Rein

Author
LessWrong
Published
Thu 14 Aug 2025
Episode Link
https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation

This is a link post.

TL;DR

  • On 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.
  • This suggests that the automatic scoring used by many benchmarks may overestimate the real-world performance of AI agents.

---


First published:

August 13th, 2025



Source:

https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation



Linkpost URL:
https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/


---


Narrated by TYPE III AUDIO.

