We expect the majority of software engineers to rely on AI coding assistants at least once a day by 2025.
Based on my 20 years of experience as a software professional, I selected leading AI assistants and spent more than 15 hours manually evaluating 4 programs generated by each assistant:
- Top-ranked solutions in our benchmark:
  - Cursor
  - Amazon Q
  - GitLab
  - Replit
- Others:
  - Cody
  - Gemini and Codeium for high performance
  - Codiumate
  - GitHub Copilot
  - Tabnine for concise coding
Benchmark results
Based on our evaluation criteria, this is how leading AI coding assistants are ranked:
Figure 1: Benchmark results of the AI coding tools.
Cursor, Amazon Q, GitLab, and Replit are the leading AI coding tools of this benchmark.
If you are looking for an LLM coding benchmark, in which we tested the coding skills of leading AI models such as OpenAI o1, o3-mini, and Claude 3.7 Sonnet, see our article.
AI coding tools deep dive
Amazon Q Developer

When using the error-fixing option, Amazon Q Developer first provides a fix for the code and then continues to index the project to improve the accuracy of subsequent fixes.
Gemini Code Assist
Gemini Code Assist states that generated code may be subject to licensing and provides links to the sources from which the code was drawn.
GitHub Copilot

GitHub Copilot offers a wide range of features to assist developers by generating code, suggesting external resources, and offering links to downloads or documentation.
If there is a known vulnerability in the generated code, GitHub Copilot may flag it to warn the user, as seen in Figure 3. However, keep in mind that it may not always flag all vulnerabilities, so it is crucial for developers to carefully review and test the code to ensure it meets security and performance standards.
To see our other benchmarks with AI coding tools, you can read the AI code editors benchmark (Cursor vs. Windsurf vs. Replit) and the AI website generator benchmark.
The role of natural language in AI coding
Language models for code generation are trained on vast amounts of code and natural language data to learn programming concepts and language understanding. The ability to precisely comprehend and adhere to nuanced prompts is crucial for translating product requirements into code.
AI assistants use LLMs for code generation. The code generation capability of these LLMs is commonly measured with the HumanEval test, developed by OpenAI,1 which consists of 164 programming problems. You can see the HumanEval pass@1 scores2 of some large language models in Table 1.
| Large Language Model | pass@1 Score |
|---|---|
| Claude 3.5 Sonnet | 92.0 |
| GPT-4o | 90.2 |
| GPT-4T | 87.1 |
| Claude 3 Opus | 84.9 |
| Claude 3 Haiku | 75.9 |
| Gemini Ultra | 74.4 |
| Claude 3 Sonnet | 73.0 |
| Gemini 1.5 Pro | 71.9 |
| Gemini Pro | 67.7 |
| GPT-4 | 67.0 |

Table 1: HumanEval pass@1 scores of selected large language models.
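For context, the pass@1 scores above come from HumanEval's pass@k metric, which estimates the probability that at least one of k generated samples per problem passes the unit tests. Below is a minimal sketch of the unbiased estimator described in the original HumanEval paper; the exact evaluation harness each vendor uses may differ.

```python
# Minimal sketch of HumanEval's pass@k metric (unbiased estimator from the
# original HumanEval paper): n = samples generated per problem, c = samples
# that pass the unit tests, k = sampling budget being scored (k=1 for pass@1).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled solutions passes."""
    if n - c < k:
        return 1.0  # every k-sized subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated samples pass -> pass@1 = 0.3 for that problem.
print(round(pass_at_k(n=10, c=3, k=1), 3))
```

The reported benchmark numbers average this quantity over all 164 problems.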
Methodology
We measured the success of the outputs produced by generative AI tools. For this purpose, we ran these tools with the same prompts and compared the results of these tasks:
Code generation tasks
Code generation tasks involve generating code snippets or completing programs based on natural language inputs or specifications. In this benchmark, we used 4 different prompts to test each assistant's ability to generate code. For evaluation, our CTO manually reviewed the code and rated the code generators against the following metrics:
- Is the generated code working? For our prompts, all solutions generated working code, so this criterion was left out of the graphs.
- Compliance with specs
- Code quality
- Amount of code: It is possible to deliver the same specs with quite different amounts of code. We measured the amount of code generated and ranked solutions that generated less code higher (a rough counting sketch follows this list).
- Performance
- Security: As a cybersecurity expert, I manually checked the code for OWASP’s Top 10 security flaws.
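To illustrate the "amount of code" criterion, one simple proxy is to count the non-blank, non-comment lines of each generated program. This is only an illustrative sketch; it is not necessarily the exact counting method used in the benchmark.

```python
# Rough sketch: quantify "amount of code" by counting non-blank,
# non-comment lines of a generated Python program. Illustrative only;
# not necessarily the exact measure used in this benchmark.
def count_effective_lines(source: str) -> int:
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count

sample = '''
def add(a, b):
    # returns the sum
    return a + b
'''
print(count_effective_lines(sample))  # 2
```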
This benchmark consists of the following prompts and their analysis:
- Write a calculator.
- Write a safe calculator.
- Write a calculator which only takes numbers up to 6 digits as input.
- Write a calculator that does not accept decimals as input.
Evaluation examples
For example, Codeium’s evaluation results (out of 5) for the prompt “Write a safe calculator” are:
Compliance: 0
Code Quality: 0
Code Amount: 5
Performance: 5
Security: 0
The reasoning behind the scores is that Codeium delivered highly efficient and concise code, but it has a code injection vulnerability because it passes unvalidated user input to eval, which can also be abused to run shell commands (a short demonstration follows the code below).3
```python
# Codeium's result for "Write a safe calculator."
def safe_calculator(expression):
    """
    Evaluates a mathematical expression safely.

    Args:
        expression (str): The mathematical expression to evaluate.

    Returns:
        float: The result of the expression.

    Raises:
        ValueError: If the expression is invalid.
        ZeroDivisionError: If division by zero occurs.
    """
    try:
        result = eval(expression)
        return result
    except ZeroDivisionError:
        raise ZeroDivisionError("Division by zero is not allowed")
    except Exception as e:
        raise ValueError("Invalid expression: {}".format(e))
```
Tabnine’s score for the “Write a calculator” task is:
Compliance: 5
Code Quality: 5
Code Amount: 5
Performance: 5
Security: 5
The reasons behind this evaluation are:
- Exception handling: Improves the user experience by handling errors effectively, so users never encounter an unhandled exception.
- Operator validation: Checks whether the entered operator is among those allowed, which keeps the code simple and informs the user of mistakes.
- Exit option: Defining a “quit” keyword so that users can exit the program adds functionality and convenience.
- Continuous operation: The ability to perform multiple operations makes the calculator more useful; the `while True` loop keeps the application running so users can perform as many operations as they want.
```python
# Tabnine's result for "Write a calculator"
def calculator():
    print("Simple Calculator")
    print("Enter 'quit' to exit")
    while True:
        num1 = input("Enter first number: ")
        if num1.lower() == 'quit':
            break
        operator = input("Enter operator (+, -, *, /): ")
        if operator not in ['+', '-', '*', '/']:
            print("Invalid operator. Please try again.")
            continue
        num2 = input("Enter second number: ")
        if num2.lower() == 'quit':
            break
        try:
            num1 = float(num1)
            num2 = float(num2)
        except ValueError:
            print("Invalid input. Please enter a number.")
            continue
        if operator == '+':
            result = num1 + num2
        elif operator == '-':
            result = num1 - num2
        elif operator == '*':
            result = num1 * num2
        elif operator == '/':
            if num2 == 0:
                print("Error: Division by zero.")
                continue
            result = num1 / num2
        print(f"Result: {result}")

calculator()
```
Code provided by Codeium as an answer to the prompt “Write a calculator which only takes numbers up to 6 digits as input” is scored as follows:
Compliance: 5
Code Quality: 5
Code Amount: 5
Performance: 5
Security: 5
This code exemplifies “less code,” but it validates the two inputs in two separate places, so the same error message is duplicated.
Instead of if-else or case statements, it used eval, significantly shortening the code and improving performance compared to the previous example. Developers typically avoid eval for safety reasons, as it allows execution of arbitrary code, including system-level operations.
However, since the inputs are strictly validated here (digits only, up to 6 characters), the assistant used eval with less risk. Note that eval is not the only way to avoid if-else or switch-style dispatch: a dictionary of operator functions achieves similar brevity without eval, as sketched after the code below.
```python
# Codeium's result for "Write a calculator which only takes numbers up to 6 digits as input"
def calculator():
    """
    A calculator that only accepts numbers up to 6 digits as input.
    """
    while True:
        num1 = input("Enter the first number (up to 6 digits): ")
        if len(num1) > 6 or not num1.isdigit():
            print("Invalid input. Please enter a number up to 6 digits.")
            continue
        operator = input("Enter the operator (+, -, *, /): ")
        if operator not in ['+', '-', '*', '/']:
            print("Invalid operator. Please enter one of +, -, *, /.")
            continue
        num2 = input("Enter the second number (up to 6 digits): ")
        if len(num2) > 6 or not num2.isdigit():
            print("Invalid input. Please enter a number up to 6 digits.")
            continue
        try:
            result = eval(f"{num1} {operator} {num2}")
            print(f"Result: {result}")
        except ZeroDivisionError:
            print("Error: Division by zero is not allowed.")
        except Exception as e:
            print(f"Error: {e}")

calculator()
```
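For comparison, here is a minimal, hypothetical sketch (not one of the benchmarked outputs) showing how a dictionary of operator functions can deliver similar brevity without eval:

```python
# Minimal eval-free alternative: dispatch through a dictionary of operator
# functions instead of evaluating the input string. Hypothetical sketch for
# comparison; not one of the benchmarked outputs.
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

def calculate(num1: str, op: str, num2: str) -> float:
    if op not in OPS:
        raise ValueError(f"Invalid operator: {op}")
    if not (num1.isdigit() and num2.isdigit()) or len(num1) > 6 or len(num2) > 6:
        raise ValueError("Inputs must be whole numbers of up to 6 digits")
    return OPS[op](float(num1), float(num2))

print(calculate("123456", "*", "2"))  # 246912.0
```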
Next steps
- Increasing task diversity.
- Adding a code completion assessment.
- Rolling out more objective criteria in the second version of the benchmark, since the current evaluation is manual and relies on the reviewer’s opinion.
For more on AI coding tools, you can read our other benchmarks:
- Top AI Website Generators Benchmarked
- Screenshot-to-Code Benchmark
- The Best AI Code Editor: Cursor vs. Windsurf
FAQ
What is an AI coding benchmark?
AI coding benchmarks are standardized tests designed to evaluate and compare the performance of artificial intelligence systems in coding tasks.
Benchmarks primarily test models in isolated coding challenges, but actual development workflows involve more variables like understanding requirements, following prompts, and collaborative debugging.
What is the role of language models in code generation?
Large language models (LLMs) are commonly used for code generation tasks due to their ability to learn complex patterns and relationships in code. Code LLMs can be harder to train and deploy for inference than general-purpose LLMs because generated code must be syntactically and functionally exact, while the autoregressive, token-by-token generation of transformer models offers no such guarantee. Different models have different strengths and weaknesses in code generation tasks, and the ideal approach may be to leverage multiple models.
Why are AI coding benchmarks important?
As a growing share of code becomes AI-generated, the quality of AI coding assistants will be critical, and benchmarks provide an objective way to compare them.
What are the proper evaluation metrics and environments for a benchmark?
Evaluation metrics for code generation tasks include code correctness, functionality, readability, and performance. Evaluation environments can be simulated or real-world and may involve compiling and running generated code in multiple programming languages. The evaluation process involves three stages: initial review, final review, and quality control, with a team of internal independent auditors reviewing a percentage of the tasks.