“Notes on the Long Tasks METR paper, from a HCAST task contributor” by abstractapplic

Author: LessWrong
Published: Mon 05 May 2025
Episode Link: https://www.lesswrong.com/posts/5CGNxadG3JRbGfGfg/notes-on-the-long-tasks-metr-paper-from-a-hcast-task

I contributed one (1) task to HCAST, which was used in METR's Long Tasks paper. This gave me some thoughts I feel moved to share.

Regarding Baselines and Estimates

METR's tasks have two sources for how long they take humans: most of those used in the paper were Baselined using playtesters under persistent scrutiny, and some were Estimated by METR.

I don’t quite trust the Baselines. Baseliners were allowed/incentivized to drop tasks they weren’t making progress with, and were (mostly, effectively; there's some nuance here I’m ignoring) cut off at the eight-hour mark. Baseline times were found by averaging the time taken for successful runs, so attempts that were dropped or cut off never entered the average. This suggests Baseline estimates will be biased to be at least slightly too low, especially for more difficult tasks.[1]
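The direction of that bias can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not METR's methodology: the log-normal model of completion times, the `sigma` spread, the sample size, and the function name `simulate_baseline` are all hypothetical, and only the eight-hour cutoff is modelled (voluntary drop-outs are ignored, though they would push in the same direction).

```python
import math
import random
import statistics


def simulate_baseline(median_hours, sigma=0.8, cutoff_hours=8.0,
                      n_runs=10_000, seed=0):
    """Return (true mean, mean of successful runs, success rate).

    Hypothetical model: each attempt's completion time is log-normal
    with the given median. Attempts longer than the cutoff count as
    failures and are excluded from the observed average.
    """
    rng = random.Random(seed)
    mu = math.log(median_hours)
    times = [rng.lognormvariate(mu, sigma) for _ in range(n_runs)]
    successes = [t for t in times if t <= cutoff_hours]
    true_mean = statistics.mean(times)
    observed_mean = statistics.mean(successes)
    return true_mean, observed_mean, len(successes) / n_runs


if __name__ == "__main__":
    # Compare an easy task (median 1 hour) with a hard one (median 6 hours).
    for median in (1.0, 6.0):
        true_mean, observed_mean, rate = simulate_baseline(median)
        print(f"median={median:.0f}h  true mean={true_mean:.2f}h  "
              f"successful-run mean={observed_mean:.2f}h  "
              f"success rate={rate:.0%}")
```

Under these assumptions the successful-run mean falls below the true mean for both tasks, and the gap is proportionally larger for the harder task, because more of its slow attempts hit the cutoff and drop out of the average.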

I really, really don’t trust the Estimates[2]. My task was never successfully Baselined, so METR's main source for how long it would take – [...]

---

Outline:

(00:22) Regarding Baselines and Estimates

(02:23) Regarding Task Privacy

(04:00) In Conclusion

The original text contained 9 footnotes which were omitted from this narration.

---


First published: May 4th, 2025

Source: https://www.lesswrong.com/posts/5CGNxadG3JRbGfGfg/notes-on-the-long-tasks-metr-paper-from-a-hcast-task


---


Narrated by TYPE III AUDIO.
