
$305.93K total volume
This market will resolve to "Yes" if the Humanity's Last Exam leaderboard lists any Google Gemini 3 model with a score at or above the specified threshold by March 31, 2026, 11:59 PM ET. Otherwise, this market will resolve to "No". The resolution source is the official Humanity's Last Exam leaderboard: https://scale.com/leaderboard/humanitys_last_exam.
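As a concrete illustration of the resolution rule stated above, here is a minimal sketch in Python. The leaderboard has no documented data feed, so the entries, field names, and scores below are made-up placeholders, not Scale AI's actual data or API:

```python
from datetime import datetime, timezone

# Made-up leaderboard entries for illustration; the real leaderboard is a
# web page, and these field names and scores are placeholders.
leaderboard = [
    {"model": "Gemini 3 Pro", "score": 43.0, "listed": "2026-02-15"},
    {"model": "Some Other Model", "score": 28.4, "listed": "2025-11-02"},
]

THRESHOLD = 50.0  # the market's specified score
DEADLINE = datetime(2026, 3, 31, 23, 59, tzinfo=timezone.utc)  # ET nuance ignored

def resolves_yes(entries):
    """Yes iff any Gemini 3 entry meets the threshold on or before the deadline."""
    for e in entries:
        listed = datetime.strptime(e["listed"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
        if ("gemini 3" in e["model"].lower()
                and e["score"] >= THRESHOLD
                and listed <= DEADLINE):
            return True
    return False

print("Yes" if resolves_yes(leaderboard) else "No")  # -> No (43.0 < 50.0)
```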
Prediction markets currently give Google's Gemini roughly a 2 in 5 chance of scoring at least 50% on a test called "Humanity's Last Exam" by the market's March 31, 2026 deadline. A 40% price means traders view the milestone as plausible but somewhat unlikely, and it shows they remain genuinely split on whether it will be reached in time.
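For intuition, here is the arithmetic behind reading that 40-cent price, a minimal sketch assuming the standard $1-per-share binary payout; the belief value is an example input, not market data:

```python
# What a 40-cent "Yes" price means, assuming a $1.00 payout per winning share.
price_yes = 0.40          # market price of a Yes share
p_believed = 0.50         # your own probability estimate (example value)

implied_prob = price_yes  # for a $1-payout contract, price ~ implied probability
ev_yes = p_believed * 1.00 - price_yes              # expected profit per Yes share
ev_no = (1 - p_believed) * 1.00 - (1 - price_yes)   # per No share bought at $0.60

print(f"Implied probability: {implied_prob:.0%}")               # 40%
print(f"EV of Yes at belief {p_believed:.0%}: ${ev_yes:+.2f}")  # +$0.10
print(f"EV of No  at belief {p_believed:.0%}: ${ev_no:+.2f}")   # -$0.10
```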
The exam itself is a public benchmark run by the AI company Scale AI. It is designed to be an extremely difficult test of advanced reasoning, combining questions on science, philosophy, and complex problem-solving. A 50% score is a significant threshold, indicating that a model can correctly answer half of these expert-level questions.
The divided odds reflect two competing views. On one side, AI capability has improved rapidly, and Google's recent Gemini models have shown strong performance on other technical benchmarks. Traders betting "Yes" likely believe this progress will continue quickly enough to hit the 50% mark on this specific test before the deadline.
On the other side, "Humanity's Last Exam" is notoriously difficult. It is meant to be a high bar for general reasoning, not just narrow knowledge. Previous model scores have been low, suggesting a jump to 50% is a major leap. The "No" bettors may think that while progress is steady, achieving this particular score on schedule requires a breakthrough that hasn't yet been demonstrated.
The deadline for this market is March 31, 2026. The main event to watch is any official update to the public leaderboard at scale.com/leaderboard/humanitys_last_exam. Google or its AI research teams might also release a new model version or a research paper claiming improved reasoning capabilities before the cutoff. A leaderboard update showing a qualifying Gemini 3 model crossing the 50% line would immediately settle the market.
Markets on near-term, clearly defined technical milestones like this tend to be fairly accurate. They aggregate the views of many people tracking AI progress closely. However, the reliability here has a specific limit. The market only forecasts the probability of a public score appearing on a leaderboard by a certain date. It does not forecast if the underlying AI capability is "truly" intelligent or if the test is a perfect measure. The prediction could be wrong if Google has a model that scores 50% but chooses not to submit it to the public leaderboard before the deadline.
Prediction markets assign a 40% probability that a Google Gemini model will score at least 50% on the Humanity's Last Exam benchmark by March 31, 2026. This price indicates the market views the outcome as plausible but unlikely. With $306,000 in total volume, the market has attracted significant speculative interest, though liquidity is concentrated in the 50% threshold question.
The cautious pricing reflects the extreme difficulty of the Humanity's Last Exam benchmark itself. Created by Scale AI, the exam is designed to test advanced reasoning and is considered a proxy for progress toward artificial general intelligence (AGI). A 50% score is a high bar that no publicly known model has yet achieved. Market pricing has consistently shown skepticism that Gemini, or any model, will hit this milestone within the timeframe, skepticism rooted in the incremental, not exponential, progress observed in recent AI benchmark results. The market is effectively a bet on whether a breakthrough of this magnitude arrives on schedule.
The outcome hinges solely on the official leaderboard. Key catalysts include official releases of new Gemini models, publications of benchmark scores by Google DeepMind, and any updates to the Humanity's Last Exam leaderboard itself. A significant, unexpected score publication before the March 31, 2026 deadline is the main event that could shift the probability upward; absent one, the market will resolve "No".
AI-generated analysis based on market data. Not financial advice.
This prediction market concerns whether Google's Gemini artificial intelligence models will achieve a specified performance threshold on the 'Humanity's Last Exam' benchmark by March 31, 2026. 'Humanity's Last Exam' is a public leaderboard hosted by the AI data company Scale AI that evaluates large language models (LLMs) on a broad set of reasoning and knowledge tasks designed to approximate a comprehensive test of human-level intelligence. The market resolves based on whether any Google Gemini 3 model appears on the official leaderboard with a score meeting or exceeding the target by the deadline, making the outcome a direct measure of progress in AI capability for Google's flagship model series.

Interest in this market stems from the intense competition between leading AI labs, particularly Google, OpenAI, and Anthropic, to demonstrate superior model performance. Benchmarks like Humanity's Last Exam have become key battlegrounds for proving technological leadership. The specific focus on Gemini 3 reflects anticipation for Google's next major model iteration, expected to be a significant upgrade over the current Gemini 1.5 and Gemini 2.0 models. Investors, researchers, and industry observers track these benchmarks to gauge the pace of AI advancement and the competitive positioning of major companies. A high score by Gemini 3 would signal Google's continued strength in foundational AI research and its ability to keep pace with or surpass rivals like OpenAI's GPT-4 and Anthropic's Claude 3.
The practice of benchmarking AI models on standardized tests dates back to the ImageNet competition for computer vision, which concluded in 2017. For language models, benchmarks like GLUE and SuperGLUE emerged around 2018-2019 to measure natural language understanding. These early benchmarks were quickly saturated by models like BERT and GPT-3, leading to more difficult, holistic evaluations such as MMLU (Massive Multitask Language Understanding), introduced in 2020. The release of OpenAI's GPT-4 in March 2023 set a new high-water mark for broad capability, outperforming most humans on professional exams like the bar. This intensified the benchmark arms race and motivated still harder tests, including the recent 'Humanity's Last Exam', designed to provide a single score representing a model's general knowledge and reasoning across diverse domains. Google entered this race with the December 2023 launch of Gemini, claiming it outperformed GPT-4 on several benchmarks. However, these claims were sometimes contested, highlighting the importance of transparent, third-party leaderboards like Scale AI's. The progression from Gemini 1.0 to 1.5 Pro demonstrated rapid iteration, setting expectations for a major leap with Gemini 3.
The performance of AI models on exams like Humanity's Last Exam has direct implications for their commercial and scientific utility. Models that score highly are better suited for complex tasks in fields like medicine, law, and scientific research, where reasoning and knowledge synthesis are required. A high score for Gemini 3 could influence enterprise adoption decisions, as companies seek the most capable AI for their operations. It also affects investor sentiment toward Alphabet's stock, as AI capability is viewed as a key driver of future revenue. Beyond economics, these benchmarks inform the societal debate about AI safety and governance. Rapid progress on tests designed to approximate human-level reasoning raises questions about the timeline for achieving artificial general intelligence (AGI). Policymakers and safety researchers monitor these scores to assess whether current regulatory frameworks are adequate. A high score may accelerate calls for new oversight mechanisms. For the AI research community, the results shape understanding of which architectural approaches—such as scaling, novel training methods, or multimodality—are most effective for advancing capability.
As of late 2024, the Humanity's Last Exam leaderboard features models like GPT-4, Claude 3 Opus, and Gemini 1.5 Pro, but no Gemini 3 model has been released or scored. Google has not announced a formal release date for Gemini 3, though industry analysts expect it in 2025. The company continues to update its Gemini 1.5 series, recently releasing a 'Flash' model optimized for speed. The AI benchmark landscape remains active, with new evaluations emerging, but Humanity's Last Exam has gained recognition as a credible, aggregated measure. All eyes are on Google's next major model announcement, which will trigger a new round of evaluation and comparison on this and other leaderboards.
It is a public leaderboard created by Scale AI that evaluates AI models on thousands of expert-written questions spanning mathematics, the sciences, law, and the humanities. The goal is to provide a single, comprehensive score that approximates a model's general reasoning and knowledge, similar to a broad human exam.
Gemini 3 is the anticipated next-generation large language model from Google DeepMind. It is expected to be a significant upgrade over the current Gemini 1.5 and 2.0 models, with improvements in reasoning, multimodality, and efficiency. Google has not yet released official details or a launch date.
The market resolves to 'Yes' if the official Humanity's Last Exam leaderboard at scale.com lists any Google Gemini 3 model with a score at or above the market's specified threshold by March 31, 2026, 11:59 PM ET. Otherwise, it resolves to 'No'. The leaderboard is the sole resolution source.
The leaderboard is operated by Scale AI, a data annotation and evaluation company founded by Alexandr Wang. Scale AI works with many AI labs and positions its benchmarks as neutral, third-party evaluations to ensure credibility and transparency in model comparisons.
Benchmarks provide standardized, comparable measures of AI model performance. They drive research progress by setting clear goals, help customers choose between different AI systems, and inform the public and policymakers about the pace of AI advancement and its capabilities.
Educational content is AI-generated and sourced from Wikipedia. It should not be considered financial advice.
5 markets tracked

| Market | Platform | Price |
|---|---|---|
| Gemini scores at least 50% on Humanity's Last Exam | Poly | 40% |
|  | Poly | 20% |
|  | Poly | 18% |
|  | Poly | 10% |
|  | Poly | 5% |
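If the five tracked markets correspond to ascending score thresholds, their Yes prices can be read as an implied survival function over Gemini's final score, and differences between adjacent prices give the implied probability of each score band. Only the pairing of the 50% threshold with the 40% price is stated in the analysis above; the other thresholds in this sketch are hypothetical placeholders:

```python
# Implied score distribution from threshold markets.
# Only (50, 0.40) is taken from this page's analysis; the remaining
# thresholds are hypothetical placeholders paired with the listed prices.
markets = [          # (score threshold %, Yes price)
    (50, 0.40),
    (60, 0.20),
    (70, 0.18),
    (80, 0.10),
    (90, 0.05),
]

# P(score >= t) is the Yes price at threshold t, so the probability of
# finishing in [t_i, t_{i+1}) is the difference of adjacent prices.
print(f"P(score < 50%): {1 - markets[0][1]:.0%}")
for (lo, p_lo), (hi, p_hi) in zip(markets, markets[1:]):
    print(f"P({lo}% <= score < {hi}%): {p_lo - p_hi:.0%}")
print(f"P(score >= {markets[-1][0]}%): {markets[-1][1]:.0%}")
```

With these placeholder inputs the bands sum to 100%, which is also a quick consistency check on real threshold prices: if adjacent Yes prices ever increase with the threshold, an arbitrage exists.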




