
By when will AIs perform at least as well as humans on GAIA?

Prediction market on Manifold.

The GAIA benchmark (https://arxiv.org/abs/2311.12983) aims to test for the next level of capability for AI agents. Quoting from the paper: "GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins."

This market will resolve based on when an AI system performs as well as or better than humans on all 3 levels of the benchmark. I'll use the numbers from Table 4 in the paper: 93.9% on level 1, 91.8% on level 2, and 87.3% on level 3. (I'm using the conjunction of all 3 levels rather than the average to be somewhat conservative about this level being achieved.)

If a given submission was likely trained on the test set (based on my judgement), I won't consider it valid.

This market resolves based on the date of publication/submission of a credible document or leaderboard entry indicating that the corresponding performance on GAIA was reached (not the date at which the system was originally created). Each date option will resolve YES if this publication/submission takes place before that date (UTC), and NO otherwise. (I may add more date options later for finer resolution.)

Update 2026-01-08 (PST) (AI summary of creator comment): The creator has identified that at least 5/86 questions on level 2 are mislabeled and notes that reported human performance is 79/86 (not 91.8% as originally stated in the description). The creator anticipates it will be difficult to achieve the stated performance without overfitting to the benchmark, but still expects resolution within 6 months. The creator will use their judgment to determine if submissions show evidence of overfitting when evaluating validity.
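The resolution rule described above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the creator's actual procedure: the function names, the `likely_trained_on_test` flag (standing in for the creator's judgement call), and the example scores are all hypothetical. The thresholds are the Table 4 figures as quoted in the description.

```python
from datetime import datetime, timezone

# Human baselines (percent) as quoted in the market description from Table 4.
HUMAN_BASELINE = {1: 93.9, 2: 91.8, 3: 87.3}


def meets_human_baseline(scores):
    """Return True only if every level meets or exceeds the human score.

    This is a conjunction over levels, not an average: a missing or low
    score on any single level fails the check.
    """
    return all(
        scores.get(level, 0.0) >= threshold
        for level, threshold in HUMAN_BASELINE.items()
    )


def resolve_option(scores, published_at, option_date, likely_trained_on_test=False):
    """Resolve one date option to "YES" or "NO".

    `published_at` and `option_date` must be timezone-aware datetimes.
    `likely_trained_on_test` is a hypothetical stand-in for the creator's
    judgement about test-set contamination or overfitting.
    """
    if likely_trained_on_test or not meets_human_baseline(scores):
        return "NO"
    # YES only if the publication/submission is strictly before the option date.
    return "YES" if published_at < option_date else "NO"


# Hypothetical submission: a strong average, but level 3 falls short.
example = {1: 95.0, 2: 92.0, 3: 85.0}
average = sum(example.values()) / len(example)
print(f"average={average:.1f}, meets baseline: {meets_human_baseline(example)}")
```

The example shows why the conjunction is the more conservative choice: a submission can have a high average across levels and still fail because one level is below its human baseline.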

24h Volume: $578.065. Liquidity: $1,875.5. Resolves: 1/2/2036.