
$163.43K
This market will resolve to "Yes" if the Humanity’s Last Exam leaderboard lists any Anthropic Claude model with a score of at least the specified score by June 30, 2026, 11:59 PM ET. Otherwise, this market will resolve to "No". The resolution source will be the official Humanity’s Last Exam leaderboard https://scale.com/leaderboard/humanitys_last_exam.
Prediction markets currently estimate there is a very high probability, roughly 9 in 10, that an Anthropic Claude artificial intelligence model will score at least 35% on a test called "Humanity's Last Exam" by the end of June 2026. This shows that traders who follow AI progress collectively believe this milestone is almost certain to be achieved.
The high confidence stems from the nature of the test and recent AI advances. "Humanity's Last Exam" is a public benchmark created by the company Scale AI. It is designed to be extremely difficult, testing an AI's ability to reason through long, complex problems that often require knowledge of computer programming, mathematics, and logic. A score of 35%, while far from a perfect grade, is seen as a significant technical hurdle.
The market odds reflect two main factors. First, AI models from Anthropic and other labs like OpenAI have shown rapid improvement on similar reasoning benchmarks over the past two years. Each new model release tends to jump significantly in performance. Second, the deadline is still over two years away, which the market views as a long time for AI research to progress. Given the current pace, traders expect that by mid-2026, Claude models will have advanced enough to clear this 35% threshold.
The most important signals will be new model releases from Anthropic. Each time Anthropic launches a new version of Claude, such as a hypothetical "Claude 4" or a later iteration, the public Humanity's Last Exam leaderboard will be updated with its score. A large jump with a new release would solidify the current prediction, while a smaller-than-expected gain could cause the probability to drop.
Other events to watch include major releases from competitors like OpenAI's GPT-5 or Google's Gemini. Strong performance from those models on the same exam would suggest the overall field is advancing quickly, supporting the high probability for Claude. The market will likely become more volatile in the months leading up to the June 2026 deadline as the window for a necessary breakthrough narrows.
Prediction markets have a mixed but interesting record on long-term tech forecasts. They are often good at aggregating expert sentiment about technological trends, but they can be overly optimistic about the speed of development. For a concrete benchmark like this, where performance is publicly scored, markets tend to be more reliable than for vaguer questions. However, the two-year timeframe is a long one in AI. A major slowdown in research progress, or a strategic decision by Anthropic to focus on capabilities other than this specific test, could make the current 93% probability look too high in hindsight.
Prediction markets assign a 93% probability that an Anthropic Claude model will score at least 35% on the Humanity's Last Exam benchmark by June 30, 2026. This price indicates near-certainty in the market's view. With shares trading at 93¢ for "Yes," the consensus is that achieving this performance threshold is almost a foregone conclusion. The market has attracted $163,000 in volume, providing solid liquidity for a niche AI topic.
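The relationship between the 93¢ share price and the 93% probability is simple arithmetic, and it also defines a trader's edge. A minimal sketch, assuming the standard binary-market convention that a winning share pays $1.00 and a losing share pays $0.00 (fees ignored); the function names are illustrative:

```python
# Payoff arithmetic for a binary prediction-market share, using the
# 93c "Yes" price quoted above. A share pays $1.00 if the market
# resolves in your favor and $0.00 otherwise (fees ignored).

def implied_probability(price_cents: float) -> float:
    """A share's price in cents, read directly as a probability."""
    return price_cents / 100.0

def expected_profit(price_cents: float, true_probability: float) -> float:
    """Expected profit per share when your own probability estimate
    differs from the market's price."""
    cost = price_cents / 100.0
    return true_probability * 1.0 - cost

market_prob = implied_probability(93)   # 0.93
# A trader who believes the true chance is 97% sees a 4c edge per share:
edge = expected_profit(93, 0.97)
```

This is why a 93¢ price only attracts buyers who think the true probability is above 93%, and sellers who think it is below.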
The high confidence stems from rapid performance gains on this specific benchmark. The Humanity's Last Exam, hosted by Scale AI, is a difficult test of reasoning and knowledge. Current public leaderboards show top models from OpenAI and Google already scoring above 50%. Anthropic's Claude 3.5 Sonnet, released in June 2024, was a major leap in capability. The market is betting that Anthropic's next two years of model development, likely including Claude 4.0 or similar iterations, will easily surpass the 35% bar. This isn't a bet on artificial general intelligence, but on incremental engineering progress within a known competitive framework.
The primary risk is a strategic shift by Anthropic. The company could deprioritize benchmark optimization for this specific test, focusing resources elsewhere. A 7% "No" price reflects this small chance. Technical hurdles or unexpected plateaus in reasoning capabilities could also slow progress. The resolution date in late June 2026 is key. If Anthropic's next major model launch, expected in 2025, shows weaker-than-anticipated results, the odds would shift downward. However, given the benchmark's current state and the competitive pressure to showcase performance, the trajectory strongly favors a "Yes" resolution.
AI-generated analysis based on market data. Not financial advice.
This prediction market topic concerns whether any Anthropic Claude artificial intelligence model will achieve a specified score on the 'Humanity's Last Exam' benchmark by June 30, 2026. The benchmark, hosted by the AI infrastructure company Scale AI, is designed to test an AI system's ability to perform complex, open-ended tasks that require deep reasoning, creativity, and understanding of human values. The exam includes challenges in areas like scientific discovery, ethical dilemma resolution, and creative problem-solving, aiming to measure progress toward artificial general intelligence (AGI). Anthropic, the company behind the Claude models, is a leading AI research and safety organization. Its Claude models, known for their conversational abilities and strong safety guardrails, are frequent contenders on public AI benchmarks. The market resolves based on the official leaderboard maintained by Scale AI, which tracks submissions from various AI labs.

Interest in this market stems from its function as a proxy for measuring competitive progress in AI capabilities between major companies like Anthropic, OpenAI, Google DeepMind, and others. The specified score threshold represents a significant milestone, and achieving it would signal a notable advancement in Claude's reasoning and problem-solving abilities. Observers use such benchmarks to gauge the pace of AI development and to assess which architectural approaches or training methodologies are yielding the most capable systems.
The practice of benchmarking AI systems dates back decades, with early tests like the Turing Test proposed in 1950. In the modern era, the release of benchmarks such as GLUE in 2018 and its successor SuperGLUE standardized the evaluation of natural language understanding. These were followed by more demanding benchmarks like BIG-bench and MMLU (Massive Multitask Language Understanding), which aimed to test broader knowledge and reasoning. The creation of 'Humanity's Last Exam' by the Center for AI Safety and Scale AI represents a newer generation of evaluation designed to move beyond narrow task performance. It seeks to integrate elements of reasoning, creativity, and real-world problem-solving into a single framework, inspired by the concept of an 'exit exam' for AGI.

Historically, AI models have shown rapid improvement on such benchmarks shortly after their introduction. For example, OpenAI's GPT-4 achieved high scores on many professional and academic exams shortly after its release in March 2023. This pattern of rapid benchmark saturation has led to the development of increasingly difficult evaluations. The specific interest in Anthropic's performance is part of a longer trend of competitive benchmarking between AI labs, which has accelerated since the public release of ChatGPT in November 2022 and the subsequent proliferation of capable large language models.
The outcome of this prediction market matters because it provides a quantifiable measure of progress toward more general and capable AI. Benchmarks act as checkpoints on the road to artificial general intelligence, and a high score on a comprehensive exam suggests an AI can integrate knowledge and reason across diverse domains. This has direct implications for how and where such AI could be deployed in high-stakes fields like scientific research, complex policy analysis, or advanced technical design. For investors and industry observers, performance on these benchmarks influences perceptions of a company's technological edge, which can affect funding, partnerships, and commercial adoption. If Claude achieves the target score, it could shift competitive dynamics, potentially accelerating investment in Anthropic's approach to AI safety and capability. Conversely, failure to reach the benchmark by the deadline might indicate unexpected technical hurdles or a relative slowdown in its development pace compared to rivals.
As of early 2024, Anthropic's Claude 3 model family (comprising Haiku, Sonnet, and Opus) is its most recent publicly available series. Scale AI's Humanity's Last Exam leaderboard is active and accepts submissions. The score threshold for this market is 35%, and resolution depends on whether any Claude model reaches that score on the leaderboard by the deadline. AI labs typically submit their latest models to such benchmarks shortly after major releases or updates. The next expected milestone would be the potential release of a Claude 4 model or a significant update to the Claude 3 series, which would likely be tested on this and other benchmarks.
Humanity's Last Exam is a benchmark created by Scale AI to evaluate advanced AI capabilities. It consists of a series of complex, open-ended tasks designed to test reasoning, creativity, and problem-solving skills across multiple domains, aiming to approximate a test for artificial general intelligence.
Claude is a large language model developed by Anthropic. It is trained on vast amounts of text data and uses a transformer architecture to generate human-like text. A key differentiator is Anthropic's 'Constitutional AI' training method, which uses a set of principles to guide model behavior toward being helpful, honest, and harmless.
Anthropic was founded in 2021 by Dario Amodei and Daniela Amodei, who were previously senior researchers at OpenAI. The company's team includes veterans from OpenAI, Google Brain, and other leading AI labs, and it is backed by major investors including Amazon, Google, and Salesforce.
Claude, developed by Anthropic, and ChatGPT, developed by OpenAI, are both large language models. Differences lie in their underlying training methodologies, safety approaches, and performance profiles on specific tasks. Anthropic emphasizes a technique called Constitutional AI for alignment, while OpenAI uses reinforcement learning from human feedback (RLHF).
Benchmark scores are typically verified by the organization hosting the benchmark, such as Scale AI. They often require model developers to submit their outputs for a standardized set of test questions. The hosting organization then runs an evaluation script to score the outputs, ensuring consistency and preventing manipulation of results.
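The verification flow described above reduces to comparing submitted outputs against an answer key held by the host. A minimal sketch of such an evaluation script; the field names and simple exact-match grading are illustrative assumptions, not Scale AI's actual pipeline (real benchmarks often use more sophisticated grading, such as model-based judges or multiple-choice parsing):

```python
# Minimal sketch of how a benchmark host might score a lab's submitted
# outputs against a hidden answer key. Exact-match grading is an
# illustrative assumption, not any specific benchmark's method.

def score_submission(answer_key: dict[str, str],
                     submission: dict[str, str]) -> float:
    """Return the fraction of questions answered correctly,
    comparing answers case-insensitively after trimming whitespace."""
    correct = sum(
        1 for qid, expected in answer_key.items()
        if submission.get(qid, "").strip().lower() == expected.strip().lower()
    )
    return correct / len(answer_key)

# Hypothetical three-question key and one model's submitted answers:
key = {"q1": "42", "q2": "graphene", "q3": "O(n log n)"}
run = {"q1": "42", "q2": "Graphene", "q3": "O(n^2)"}
print(score_submission(key, run))  # 2 of 3 correct -> 0.666...
```

Because the host runs this script itself on a fixed test set, every lab's score is computed the same way, which is what makes leaderboard positions comparable.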
Passing a high threshold on Humanity's Last Exam would not mean an AI has achieved AGI, but it would represent a significant milestone in machine reasoning. It would likely increase confidence in the model's utility for complex tasks and intensify discussions about the societal and economic implications of increasingly capable AI.
Educational content is AI-generated and sourced from Wikipedia. It should not be considered financial advice.
2 markets tracked
| Market | Platform | Price |
|---|---|---|
| | Poly | 93% |
| | Poly | 54% |