More than 37% of tasks performed on AI models are about computer programming and maths. [1]
To help identify the right AI model for coding, we are introducing a new benchmark, LMC-Eval, which tests top-tier AI models on logical coding questions.
Results
Our benchmark results show that OpenAI o1 and OpenAI o3-mini are the leading models for coding among those we tested.
Methodology
LMC-Eval (Logical Math Coding Eval) consists of 100 math problems that an advanced high-school student could solve. Each problem requires both logical thinking and coding skills, so the benchmark examines the LLMs' reasoning abilities as well as their coding skills. This is a zero-shot benchmark: we did not train the models on similar questions.
Dataset
These problems cover:
- Basic concepts: variables, loops, conditionals
- Data structures: arrays, lists, sets, maps
- Algorithms: sorting, searching, optimization
- Math concepts: geometry, algebra, arithmetic
- Problem-solving strategies: decomposition, pattern recognition, time and date handling
- Code organization: functions, classes, modules
We constructed the dataset so that the problems would:
- Have clear inputs and outputs.
- Require different programming concepts.
- Be solvable with multiple approaches.
- Test both mathematical and logical thinking.
- Span easy, medium, and hard difficulty levels.
Prompt
You are an expert Python programmer. Please solve the following programming problem:
{problem}
Please provide only the Python code solution without any explanations or markdown formatting. Do not say “Here’s the Python code solution:” etc.
The code should be complete and runnable. Print the result specified in the question.
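As a rough illustration of how a prompt like this can be used in an evaluation loop, the following is a minimal sketch and not the harness actually used for LMC-Eval. The helper names (`build_prompt`, `run_solution`, `is_correct`), the timeout value, and the choice to score by exact match on printed output are all assumptions made for the example.

```python
import subprocess
import sys

# The prompt text from this section, with {problem} as the only placeholder.
PROMPT_TEMPLATE = (
    "You are an expert Python programmer. Please solve the following programming problem:\n"
    "{problem}\n"
    "Please provide only the Python code solution without any explanations or markdown formatting. "
    "Do not say “Here’s the Python code solution:” etc.\n"
    "The code should be complete and runnable. Print the result specified in the question."
)


def build_prompt(problem: str) -> str:
    """Insert one problem statement into the prompt template."""
    return PROMPT_TEMPLATE.format(problem=problem)


def run_solution(code: str, timeout_s: float = 10.0) -> str:
    """Run model-generated code in a fresh interpreter and return its stdout.

    Hypothetical helper: a real harness would also need sandboxing.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return completed.stdout.strip()


def is_correct(code: str, expected: str) -> bool:
    """Assumed scoring rule: printed output must equal the expected answer."""
    try:
        return run_solution(code) == expected.strip()
    except subprocess.TimeoutExpired:
        # Treat solutions that do not finish in time as incorrect.
        return False
```

In this sketch, the model's reply would be passed to `is_correct` together with the reference answer for the problem. Asking for code only, with no markdown, is what makes the reply directly executable.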
We will keep our dataset private and test additional models as they are published.
To see example questions, please refer to the examples section below.
Examples
Here is an example question, similar to one that all the models answered correctly:
“Clara chooses a positive integer and creates a new number by summing all its digits. If this new number has only one digit, she stops the process. Otherwise, she continues by adding the digits of the number from the previous step until she gets a single-digit result.
For instance, when Clara selects 536, she gets 5+3+6=14 in the first step, then 1+4=5 in the second step, thus ending the process after the second step.
Accordingly, for how many of the natural numbers from 1 to 150 that Clara can select does this process end at the end of the second step?”
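For reference, one possible solution to the example question is sketched below. It is our own illustration, not output from any of the tested models. It assumes that each digit-summing pass counts as one step, so a single-digit starting number finishes after the first step. The function names are chosen for the example.

```python
def digit_sum(n: int) -> int:
    """Sum the decimal digits of a positive integer."""
    return sum(int(d) for d in str(n))


def steps_to_single_digit(n: int) -> int:
    """Count digit-summing steps until the result has one digit.

    The first step is always performed, matching the problem statement.
    """
    steps = 1
    n = digit_sum(n)
    while n >= 10:
        n = digit_sum(n)
        steps += 1
    return steps


# Count the starting numbers from 1 to 150 that finish after exactly two steps.
count = sum(1 for n in range(1, 151) if steps_to_single_digit(n) == 2)
print(count)
```

Under this reading the script prints 60: the 45 two-digit numbers whose digits sum to at least 10, plus 15 numbers between 100 and 150.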
Top LLMs for coding
We used the latest available versions of the models, as of February 2025.
Models tested:
- OpenAI o1
- OpenAI o3-mini
- Anthropic Claude 3.7 Sonnet
- Google Gemini 2.0 Flash
- OpenAI GPT-4o
- Anthropic Claude 3.5 Sonnet
- Mistral Large
The temperature was set to 0 while benchmarking the models.
To get detailed information about the API pricing of the models, you can read LLM pricing.
Next steps
We will:
- Add more models to the benchmark, like DeepSeek R1 and Llama.
- Eliminate the problems that every model solved and use more advanced problems, to test their logical coding skills better.
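Removing the problems that every model solved could look roughly like the sketch below. The `results` mapping, the model and problem identifiers, and the function name `drop_saturated` are hypothetical and exist only for illustration. They are not real benchmark data.

```python
def drop_saturated(results: dict[str, dict[str, bool]]) -> list[str]:
    """Return the problem IDs that at least one model failed.

    `results` maps model name -> {problem ID -> solved?}. A problem missing
    from a model's record is treated as unsolved (an assumption).
    """
    problem_ids = sorted({pid for per_model in results.values() for pid in per_model})
    return [
        pid
        for pid in problem_ids
        if not all(per_model.get(pid, False) for per_model in results.values())
    ]


# Made-up outcomes for two placeholder models on three placeholder problems.
results = {
    "model_a": {"q1": True, "q2": True, "q3": False},
    "model_b": {"q1": True, "q2": False, "q3": False},
}
remaining = drop_saturated(results)
print(remaining)
```

Here `q1` is dropped because both placeholder models solved it, while `q2` and `q3` are kept because they still separate the models or remain unsolved.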
Further reading
- AI Code Assistant Benchmark
- Agentic AI Code Editor Benchmark: Windsurf vs Cursor vs Replit
- AI Agents Benchmark
- AI Hallucination Benchmark
Reference Links

