We designed a new benchmark, Mathematical Reasoning Eval (MathR-Eval), to test LLMs' reasoning abilities with 100 logic-based mathematics questions.
Benchmark results
Results show that OpenAI's o1 and o3-mini are the best-performing LLMs in our benchmark.
Methodology
Our dataset includes 100 mathematics questions that do not require advanced calculus but do require reasoning and problem-solving techniques. We created this dataset because such questions are objective to grade: each has exactly one correct answer.
We also used the same dataset in our LMC-Eval: Logic/Math Coding Benchmark to compare the models' reasoning and coding abilities; you can see an example question from our dataset in that article. This is a zero-shot benchmark: we did not provide example questions to the LLMs.
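Because every question has a single correct answer, grading reduces to normalized exact matching. The sketch below illustrates this idea; the item schema (a dict with an `answer` field) and the normalization rules are our assumptions for illustration, not the benchmark's actual harness.

```python
# Minimal sketch of a zero-shot exact-match grader. The item schema
# ({"answer": ...}) and normalization choices are illustrative assumptions.

def normalize(answer: str) -> str:
    """Lower-case, strip whitespace, and drop trailing periods so that
    '42.' and ' 42' grade as the same answer."""
    return answer.strip().lower().rstrip(".")

def score(items: list[dict], model_answers: list[str]) -> float:
    """Return accuracy: the fraction of model answers matching the gold answer."""
    correct = sum(
        normalize(pred) == normalize(item["answer"])
        for item, pred in zip(items, model_answers)
    )
    return correct / len(items)
```

In practice, a harness like this would also need to extract the final answer from the model's free-form output before comparison.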
We tested not only reasoning models but also non-reasoning models, to see how the two groups differ.
For the models' hallucination rates, see our AI hallucination benchmark.
AI reasoning models
OpenAI o1, o1-mini and o1-pro: OpenAI released o1-preview and o1-mini on September 12, 2024; the full o1 and o1-pro followed in December 2024. o1-mini is a faster model optimized for STEM tasks.
OpenAI o3, o3-mini and o3-mini-high: o3-mini is the most cost-efficient of OpenAI's reasoning models and matched o1 in our benchmark. It is a smaller model than o3; o3-mini-high is o3-mini run with a higher reasoning-effort setting.
Claude 3.7 Sonnet: Claude 3.7 Sonnet has an extended-thinking mode in which users can set a token budget for its reasoning.
DeepSeek R1: DeepSeek R1 is the only open-source model in this benchmark. It also offers the cheapest API among the reasoning models.
For details about their API pricing, see LLM pricing.
Types of AI reasoning
Different reasoning models employ various approaches:
- Deductive reasoning: Drawing specific conclusions from general principles
- Inductive reasoning: Forming general conclusions from specific observations
- Abductive reasoning: Finding the most likely explanation for observations
- Analogical reasoning: Applying solutions from similar past problems
- Causal reasoning: Understanding cause-and-effect relationships
- Common sense reasoning: Making intuitive judgments
Characteristics of AI reasoning models
Reasoning models typically feature:
- Step-by-step processing: Rather than producing immediate answers, these models break down problems into logical components and work through them sequentially.
- Chain-of-thought capabilities: These models can show their work, explaining the reasoning pathway from question to conclusion; users can also see this in the chat interfaces.
- Extended thinking modes: Some models incorporate dedicated “thinking time” before generating responses, improving accuracy on complex problems.
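Extended thinking is usually exposed as a request parameter that reserves a token budget for reasoning before the visible answer. The sketch below builds such a request payload, modeled on Anthropic's Messages API; the model alias and field names are assumptions based on public documentation and should be verified against the current API reference.

```python
# Illustrative request payload for an extended-thinking call, modeled on
# Anthropic's Messages API. Field names and the model alias are assumptions;
# check the provider's current API reference before use.

def build_request(question: str, thinking_budget: int = 8_000) -> dict:
    return {
        "model": "claude-3-7-sonnet-latest",  # hypothetical model alias
        # max_tokens must exceed the thinking budget, since it covers both
        # the hidden reasoning tokens and the visible answer.
        "max_tokens": 16_000,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": question}],
    }
```

Raising the thinking budget trades latency and cost for accuracy on harder problems, which is the adjustment the extended-thinking mode above exposes.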
Reasoning models are developed with machine learning techniques, including supervised, unsupervised, and reinforcement learning.
You can also see our benchmark of artificial economic intelligence.
Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that have referenced AIMultiple.
Throughout his career, Cem has served as a tech consultant, tech buyer, and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade and published a McKinsey report on digitalization.
He led technology strategy and procurement at a telco while reporting to the CEO. He also led commercial growth at the deep-tech company Hypatos, which grew from zero to seven-digit annual recurring revenue and a nine-digit valuation within two years. Cem's work at Hypatos was covered by leading technology publications such as TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
