
$50.46K · 1 market tracked

| Market | Platform | Price |
|---|---|---|
| Anthropic Claude score on FrontierMath Benchmark by June 30? | Poly | 57% |
Trader mode: Actionable analysis for identifying opportunities and edge
This market will resolve to "Yes" if any Anthropic Claude model achieves the listed score or greater on the FrontierMath Exam by June 30, 2026, 11:59 PM ET. Otherwise, the market will resolve to "No". This market will resolve according to Epoch AI's FrontierMath benchmarking leaderboard (https://epoch.ai/frontiermath) for Tiers 1-3. Studies which are not included in the leaderboard (e.g. https://x.com/EpochAIResearch/status/1945905796904005720) will not be considered. The primary resolution source will be the scores published on that leaderboard.
Prediction markets currently estimate there is a very high probability, roughly 9 in 10, that an Anthropic Claude artificial intelligence model will score at least 35% on a test called "Humanity's Last Exam" by the end of June 2026. This shows that traders who follow AI progress collectively believe this milestone is almost certain to be achieved.
The high confidence stems from the nature of the test and recent AI advances. "Humanity's Last Exam" is a public benchmark created by the company Scale AI. It is designed to be extremely difficult, testing an AI's ability to reason through long, complex problems that often require knowledge of computer programming, mathematics, and logic. A score of 35%, while far from a perfect grade, is seen as a significant technical hurdle.
The market odds reflect two main factors. First, AI models from Anthropic and other labs like OpenAI have shown rapid improvement on similar reasoning benchmarks over the past two years, with each new model release tending to bring a significant jump in performance. Second, the deadline is still over two years away, which the market treats as ample time for further research progress. Given the current pace, traders expect that by mid-2026, Claude models will have advanced enough to clear the 35% threshold.
The most important signals will be new model releases from Anthropic. Each time Anthropic launches a new version of Claude, such as a hypothetical "Claude-4" or subsequent iteration, its score on the public Humanity's Last Exam leaderboard will be immediately updated. A large jump in score with a new release would solidify the current prediction, while a smaller-than-expected gain could cause the probability to drop.
Other events to watch include major releases from competitors like OpenAI's GPT-5 or Google's Gemini. Strong performance from those models on the same exam would suggest the overall field is advancing quickly, supporting the high probability for Claude. The market will likely become more volatile in the months leading up to the June 2026 deadline as the window for a necessary breakthrough narrows.
Prediction markets have a mixed but interesting record on long-term tech forecasts. They are often good at aggregating expert sentiment about technological trends, but they can be overly optimistic about the speed of development. For a concrete benchmark like this, where performance is publicly scored, markets tend to be more reliable than for vaguer questions. However, the two-year timeframe is a long one in AI. A major slowdown in research progress, or a strategic decision by Anthropic to focus on capabilities other than this specific test, could make the current 93% probability look too high in hindsight.
Prediction markets assign a 93% probability that an Anthropic Claude model will score at least 35% on the Humanity's Last Exam benchmark by June 30, 2026. This price indicates near-certainty in the market's view. With shares trading at 93¢ for "Yes," the consensus is that achieving this performance threshold is almost a foregone conclusion. The market has attracted $163,000 in volume, providing solid liquidity for a niche AI topic.
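For readers who want the arithmetic behind those numbers, here is a minimal sketch of how a share price maps to an implied probability and to a trader's expected edge. The 93¢ price comes from the paragraph above; the 97% personal estimate is a hypothetical illustration, and fees and slippage are ignored.

```python
def implied_probability(price_cents: float) -> float:
    """A 'Yes' share pays $1.00 on a 'Yes' resolution, so its price in dollars
    is the market's implied probability of that outcome."""
    return price_cents / 100.0

def expected_profit_per_share(price_cents: float, my_probability: float) -> float:
    """Expected profit in cents from buying one 'Yes' share at the given price,
    using my own probability estimate (fees and slippage ignored)."""
    return my_probability * 100.0 - price_cents

price = 93.0  # cents, the 'Yes' price cited above
print(f"Market-implied probability: {implied_probability(price):.0%}")
# Hypothetical example: a trader who believes the true chance is 97% sees this edge per share:
print(f"Edge at a 97% personal estimate: {expected_profit_per_share(price, 0.97):+.1f}¢")
```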
The high confidence stems from rapid performance gains on this specific benchmark. Humanity's Last Exam, hosted by Scale AI, is a difficult test of reasoning and knowledge. Current public leaderboards show top models from OpenAI and Google already scoring above 50%. Anthropic's Claude 3.5 Sonnet, released in June 2024, was a major leap in capability. The market is betting that Anthropic's next two years of model development, likely including Claude 4.0 or similar iterations, will easily surpass the 35% bar. This isn't a bet on artificial general intelligence, but on incremental engineering progress within a known competitive framework.
The primary risk is a strategic shift by Anthropic. The company could deprioritize benchmark optimization for this specific test, focusing resources elsewhere. A 7% "No" price reflects this small chance. Technical hurdles or unexpected plateaus in reasoning capabilities could also slow progress. The resolution date in late June 2026 is key. If Anthropic's next major model launch, expected in 2025, shows weaker-than-anticipated results, the odds would shift downward. However, given the benchmark's current state and the competitive pressure to showcase performance, the trajectory strongly favors a "Yes" resolution.
AI-generated analysis based on market data. Not financial advice.
This prediction market concerns whether Anthropic's Claude artificial intelligence model will achieve a specified score on the FrontierMath benchmark by June 30, 2026. FrontierMath is an evaluation suite designed by Epoch AI to measure the mathematical reasoning capabilities of advanced AI systems. The benchmark consists of original, previously unpublished problems written by expert mathematicians, ranging from challenging competition-style questions to research-level mathematics, and is divided into tiers of increasing difficulty. The market resolves based on the official FrontierMath leaderboard maintained by Epoch AI, which tracks performance across models from various AI labs. The specific score threshold required for a 'Yes' resolution is determined by the market's listing parameters. Interest in this market stems from the broader competition in AI development, where mathematical reasoning is considered a key milestone toward more general intelligence. Performance on challenging benchmarks like FrontierMath serves as a public indicator of technical progress among leading AI companies such as Anthropic, OpenAI, and Google DeepMind. These benchmarks influence investor confidence, research directions, and public perception of AI capabilities. The June 2026 deadline creates a specific timeframe for assessing Anthropic's progress against its competitors in a measurable domain.
The evaluation of AI mathematical reasoning has evolved significantly. Early benchmarks like the MATH dataset, introduced by Hendrycks et al. in 2021, provided a foundation with 12,500 pre-university competition problems. Performance on MATH became a standard metric for large language models, with top models initially scoring below 10% in 2021. A major leap occurred with the release of OpenAI's GPT-4 in March 2023, which reportedly achieved a score of 42.5% on the MATH benchmark, showcasing a substantial advance. In January 2024, Google DeepMind's AlphaGeometry system solved 25 out of 30 IMO geometry problems, a result published in Nature, marking a breakthrough in formal theorem proving. This created a new benchmark for AI performance in olympiad-level mathematics. Epoch AI later launched its FrontierMath benchmark, built from original, unpublished problems organized into difficulty tiers, to provide a harder, contamination-resistant standard for measuring frontier model capabilities. The historical progression shows rapid improvement from near-zero performance on hard problems to specialized systems achieving human-competitive results, raising expectations for general models like Claude to incorporate similar reasoning abilities.
Achieving high scores on advanced mathematical benchmarks signals progress toward more reliable and general reasoning in AI systems. This capability has direct implications for scientific research, where AI could assist in hypothesis generation and complex calculation, and for technical fields like engineering and finance that rely on advanced mathematics. For AI companies, benchmark performance is a form of technical marketing that influences investment, talent recruitment, and enterprise customer adoption. A model that excels at mathematics is often perceived as more trustworthy for technical applications. The broader trajectory toward AI systems that can reason formally touches on long-term questions about AI safety and alignment. Systems that can rigorously verify their own reasoning or understand complex logical constraints may be easier to align with human intent, a core part of Anthropic's stated mission. Conversely, more capable reasoning systems also introduce new considerations about their potential misuse or the economic disruption they could cause in knowledge-work professions.
Anthropic's Claude 3 model family has demonstrated strong overall performance, but its specific results on the FrontierMath benchmark's hardest tiers have not been publicly released in detail. The FrontierMath leaderboard, maintained by Epoch AI, is actively tracking new model submissions. The AI field is in a period of rapid iteration, with companies like OpenAI, Google, and Anthropic expected to release new model generations before the June 2026 deadline. Research continues on techniques like reinforcement learning from human feedback (RLHF), chain-of-thought reasoning, and tool use (e.g., calculators, code interpreters) to improve mathematical performance. The integration of specialized reasoning systems, similar to DeepMind's AlphaGeometry, into general-purpose chatbots is an active area of research that could significantly boost benchmark scores.
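As a rough illustration of the tool-use idea mentioned above, and not a description of any lab's actual system, the sketch below shows the basic pattern: a model proposes a calculator call, a harness executes it, and the result is fed back into the model's context. The `fake_model` function is a stand-in for a real language-model API.

```python
import ast
import operator

# Safe evaluator for basic arithmetic: a stand-in "calculator" tool.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression}")
    return _eval(ast.parse(expression, mode="eval"))

def fake_model(context: str) -> str:
    """Stand-in for a language model: asks for a tool call, then answers."""
    if "TOOL_RESULT" not in context:
        return "CALL calculator: (2**10 - 24) / 8"
    return "FINAL: " + context.rsplit("TOOL_RESULT: ", 1)[-1]

def solve(question: str) -> str:
    """Harness loop: pass tool results back to the model until it answers."""
    context = question
    for _ in range(5):  # cap the number of tool-use rounds
        reply = fake_model(context)
        if reply.startswith("CALL calculator:"):
            result = calculator(reply.split(":", 1)[1].strip())
            context += f"\nTOOL_RESULT: {result}"
        else:
            return reply
    return "no answer produced"

print(solve("Compute (2**10 - 24) / 8."))  # -> FINAL: 125.0
```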
FrontierMath is a benchmark created by Epoch AI to evaluate the mathematical reasoning of advanced AI models. It compiles problems from mathematics competitions and university-level courses, organizing them into tiers of increasing difficulty to provide a standardized measure of progress.
The specific score threshold is defined in the market listing parameters. Traders should consult the market details for the exact numerical score required on the FrontierMath benchmark's Tier 1-3 problems for a 'Yes' resolution.
Anthropic's Claude 3 models perform well on general benchmarks, but their precise scores on the hardest tiers of FrontierMath are not fully public. Earlier models showed capability on high school math, but olympiad-level problems remain a significant challenge for general models without specialized training or tool use.
Epoch AI determines the official score via its FrontierMath leaderboard. The market resolves based on the scores published on that leaderboard. Results from other studies or social media posts, even from Epoch researchers, are not considered for resolution unless they appear on the official leaderboard.
The market deadline is June 30, 2026, 11:59 PM ET. Only scores achieved by an Anthropic Claude model and recorded on the FrontierMath leaderboard on or before that date will count. Any model releases or scores achieved after the deadline do not affect the market's resolution.
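To make those resolution mechanics concrete, here is a minimal sketch of the check described above. The leaderboard rows, the 35-point threshold, and the field layout are hypothetical placeholders; the real threshold comes from the market listing and the real scores from Epoch AI's published leaderboard.

```python
from datetime import datetime, timezone, timedelta

# Hypothetical leaderboard rows: (model name, Tier 1-3 score in %, date recorded).
# Real resolution data would come from Epoch AI's published FrontierMath leaderboard.
leaderboard = [
    ("Claude (hypothetical entry)", 28.0, datetime(2025, 11, 1, tzinfo=timezone.utc)),
    ("Other Lab Model (hypothetical)", 41.0, datetime(2026, 2, 10, tzinfo=timezone.utc)),
]

THRESHOLD = 35.0  # placeholder: the real threshold is set in the market listing
# June 30, 2026, 11:59 PM ET (Eastern Daylight Time is UTC-4 on that date)
DEADLINE = datetime(2026, 6, 30, 23, 59, tzinfo=timezone(timedelta(hours=-4)))

def resolves_yes(rows) -> bool:
    """'Yes' only if an Anthropic Claude model meets the threshold on or before the deadline."""
    return any(
        "claude" in name.lower() and score >= THRESHOLD and recorded <= DEADLINE
        for name, score, recorded in rows
    )

print(resolves_yes(leaderboard))  # False: the Claude entry above is below the placeholder threshold
```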
Educational content is AI-generated and sourced from Wikipedia. It should not be considered financial advice.

Add this market to your website
<iframe src="https://predictpedia.com/embed/QDJmx8" width="400" height="160" frameborder="0" style="border-radius: 8px; max-width: 100%;" title="Anthropic Claude score on FrontierMath Benchmark by June 30?"></iframe>
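The width and height attributes can be adjusted to fit the surrounding layout; the inline max-width: 100% style keeps the embed from overflowing on narrow screens.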