Will we fully interpret a GPT-2 level language model by 2028?
Prediction market on Manifold. This market resolves YES if it is possible to fully interpret a GPT-2-level-capability model, or to create an inherently interpretable GPT-2-level-capability model. Anywhere below where I say "GPT-2" I mean it as shorthand for the above, rather than literally the specific GPT-2 model.

There is no rigorous definition of "fully interpret", so resolution will be based on my opinion. But here are some examples of things we'd be able to do if we had full interpretability:

- Come up with a detailed, fully granular explanation (i.e. not about entire attention heads or "super nodes", but about individual channels or features) that can be validated by some strong form of ablation (probably much stronger than zero/mean ablation).
- Make very nontrivial and unobvious predictions about the model's behavior in unobserved situations.
- Make strong guarantees about ways the model will never behave.
- Do this for basically any behavior we care about, rather than only for some narrow subset of behaviors.
- Reach a generally agreed sense that we have "pulled back the veil" and there isn't really any more mystery underlying why GPT-2 works.
- Given the understanding of how GPT-2 works, it should be possible for a person (given 100 years or whatever) to hand-craft a Python script that is GPT-2-level at language modeling, where the script makes sense to people and can be maintained the same way any normal piece of complex software can be.

It's OK if the actual explanation of GPT-2 is too big to understand in a short amount of time, as long as it's pretty obviously true that you could eventually understand it with enough time (e.g. if there's a huge list of memorized facts and bigram frequencies, it's OK if it would take a human 100 years to understand). In particular, if you could have GPT-5 (or 6) understand the model for you by running for a really long time, that counts.
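To make the "huge list of memorized facts and bigram frequencies" case concrete, here is a minimal, hypothetical sketch of what a legible, hand-maintained language-model component could look like: a bigram table stored as a plain dictionary of counts, fully inspectable and debuggable like ordinary software. (Toy illustration only; a real hand-crafted GPT-2-level script would be vastly larger, but the point is that it would be legible in the same way.)

```python
from collections import Counter, defaultdict


def train_bigram(corpus: str):
    """Count word-to-next-word transitions into a plain dict of Counters."""
    words = corpus.split()
    table = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table


def most_likely_next(table, word: str):
    """Return the most frequent continuation of `word`, or None if unseen."""
    if word not in table:
        return None
    return table[word].most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept"
table = train_bigram(corpus)
print(most_likely_next(table, "the"))  # "cat" (it follows "the" twice, "mat" once)
```

Every prediction here can be traced back to explicit counts in the table, which is the sense in which such a script "makes sense to people" even if the full table were enormous.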
I don't intend to be an asshole about the "full" part. If we understand 99% of the model, there's a tiny bit we can't understand, and we have no reason to believe that part is actually crucial, then that's fine. If we understand everything up to some amount of noise, I'd want some strong reason why we are sure there is nothing hiding in the noise; but given that, I'm OK with it.

Update 2026-01-24 (PST) (AI summary of creator comment): Resolution in edge cases involving AGI/ASI:

- If AI assistants help solve interpretability before 2028: resolves YES.
- If AGI takes over and murders/subjugates humanity before 2028, then solves interpretability: resolves N/A or NO (since the research agenda failed to be completed in time).
- If alignment is solved without interpretability and we reach transhumanist utopia before 2028, with interpretability only solved after takeoff: resolves N/A (since interpretability wouldn't have been relevant to the alignment question).
Liquidity: $10,000. Resolves: 12/31/2027.