Is the cognitive core hypothesis true? Will it be clear next year?
Prediction market on Manifold.

The "cognitive core" hypothesis (that the general-reasoning components of a trained LLM are not that large in parameter count) is looking surprisingly plausible. It would explain why distillation is so effective. The contrary view (call it associationism) is that general reasoning is just a bunch of heuristics and priors piled atop each other, and that you need lots of memorisation because . This is also a live possibility: for instance, consider that a year of intense RLVR mostly led to task-specific improvements. Or, more basic: bigger models seem to reason better as well as just knowing more things.

Resolution: by the end of next year, will a publicly evaluated model with <23B active parameters match o3 on ECI (i.e. score >=148) and on ADeLE (i.e. weighted ability average >=5.0), while using <30x per-query inference FLOPs?

Spoiler resolution: it is of course possible to do this by Goodharting the benchmarks, so I reserve the right to resolve "no" if the signs point to that. Sorry.

My current credence (Dec 2025): 20%.

If you want to use a model of me as well as your model of ML to answer, here are some of my views.

Update 2025-12-07 (PST) (AI summary of creator comment): For models without publicly disclosed parameter counts, the creator will accept either:
- publicly confirmed parameter counts, or
- convincing third-party estimates (e.g. Nesov/Epoch-style analysis working backwards from inference speeds, or similar methods).

Update 2025-12-07 (PST) (AI summary of creator comment): The active-parameter threshold has been changed from <20B to <23B, and the resolution criteria now require models to meet the benchmarks on both ECI AND ADeLE (previously it was OR).

Update 2025-12-07 (PST) (AI summary of creator comment): The creator does not trust Qwen's anticontamination measures. This may affect resolution decisions regarding Qwen models meeting the criteria.
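The YES condition above is a conjunction of four thresholds. A minimal sketch of that check, for clarity only (the `Candidate` fields and function name are my own; the threshold values come from the resolution text, and this is not an official resolver):

```python
# Illustrative check of the market's YES condition. Not an official resolver:
# the creator reserves the right to resolve "no" on benchmark Goodharting.
from dataclasses import dataclass

@dataclass
class Candidate:
    active_params_b: float    # active parameters, in billions (hypothetical field)
    eci_score: float          # ECI score
    adele_ability: float      # ADeLE weighted ability average
    flops_ratio_vs_o3: float  # per-query inference FLOPs relative to o3

def meets_criteria(m: Candidate) -> bool:
    """True iff every clause of the market's YES condition holds."""
    return (
        m.active_params_b < 23        # <23B active parameters
        and m.eci_score >= 148        # matches o3 on ECI
        and m.adele_ability >= 5.0    # AND matches o3 on ADeLE (no longer OR)
        and m.flops_ratio_vs_o3 < 30  # <30x per-query inference FLOPs
    )
```

Note the AND between the two benchmarks, reflecting the 2025-12-07 criteria change; a model clearing only one of ECI or ADeLE does not qualify.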
Liquidity: $3,000. Resolves: 12/8/2026.