METR's time-horizon of coding tasks does not mean what you think it means

killerstorm · 2025-11-21T14:52:46

tl;dr: If you calculate "the human time horizon using the same methodology as we do for models", it's only 1.5 hours @ 50% success rate for the baseline experts METR hired. o3 surpassed that in April 2025, 6 months ahead of METR's prediction.

METR considers this "raw baseline" largely irrelevant, as it might be affected by people getting bored, not being paid enough, etc. But they admit this introduces a bias which makes the reported numbers less relevant for human-vs-AI comparison.
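
For readers unfamiliar with what a "time horizon @ 50%" is, here is a rough sketch of the idea, not METR's actual code. It assumes the horizon comes from fitting a logistic curve of success against log2 of task length and reading off the length at which predicted success crosses 50%. The function names (`fit_logistic`, `time_horizon`) and the data are made up for illustration. The same calculation applies whether the attempts come from a model or from human baseliners.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, iters=50):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton's method; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. a
            g1 += (y - p) * x      # gradient w.r.t. b
            h00 += w               # entries of the (negated) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        # Newton step: solve the 2x2 system H * delta = g.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def time_horizon(durations_min, successes, p=0.5):
    """Task length (minutes) at which the fitted success probability equals p."""
    xs = [math.log2(t) for t in durations_min]
    a, b = fit_logistic(xs, successes)
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - a) / b)

# Hypothetical attempts: task length in minutes, 1 = success, 0 = failure.
durations_min = [4, 8, 16, 32, 64, 128, 256, 512]
successes     = [1, 1, 1,  0,  1,   0,   0,   0]

h50 = time_horizon(durations_min, successes)
h80 = time_horizon(durations_min, successes, p=0.8)
print(round(h50, 1))  # ≈ 45.3 minutes for this made-up data
```

The toy data is symmetric around 2^5.5 minutes, so the 50% horizon lands there. A stricter threshold such as 80% gives a shorter horizon, which is why the success rate always has to be quoted alongside the number.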