AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive?

Prediction market on Manifold. I don't have a clear definition of "deceptive"; I think that's part of the challenge. Edit: By "part of the challenge" I mean that this market asks both whether a clear definition of "deceptive" will be published and whether tools to detect it will be created. I will be fairly lax about what counts as a good formalization: if it captures even 40% of what we generally think of as "deceptive", that would count.

24h Volume: $50. Liquidity: $1,000. Resolves: 1/1/2027.