We expect the majority of software engineers to rely on AI coding assistants at least once a day by 2025.
Based on my 20 years of experience as a software professional, I selected leading AI assistants and spent more than 15 hours manually evaluating 4 programs generated by each assistant:
- Top-ranked solutions in our benchmark:
  - Cursor
  - Amazon Q
  - GitLab
  - Replit
- Others:
  - Cody
  - Gemini and Codeium for high performance
  - Codiumate
  - GitHub Copilot
  - Tabnine for concise coding
Benchmark results
Based on our evaluation criteria, this is how leading AI coding assistants are ranked:
Figure 1: Benchmark results of the AI coding tools.
Cursor, Amazon Q, GitLab, and Replit are the leading AI coding tools of this benchmark.
If you are looking for an LLM coding benchmark, in which we tested the coding skills of leading AI models such as OpenAI o1, o3-mini, and Claude 3.7 Sonnet, see our article.
AI coding tools deep dive
Amazon Q Developer

When using the error-fixing option, Amazon Q Developer first provides a fix for the code and then continues to index the project to improve the accuracy of subsequent fixes.
Gemini Code Assist
Gemini Code Assist states that generated code may be subject to licensing and provides links to the sources from which the code was drawn.
GitHub Copilot

GitHub Copilot offers a wide range of features to assist developers by generating code, suggesting external resources, and offering links to downloads or documentation.
If there is a known vulnerability in the generated code, GitHub Copilot may flag it to warn the user, as seen in Figure 3. However, keep in mind that it may not always flag all vulnerabilities, so it is crucial for developers to carefully review and test the code to ensure it meets security and performance standards.
To see our other benchmarks with AI coding tools, you can read the AI code editors benchmark (Cursor vs. Windsurf vs. Replit) and the AI website generator benchmark.
The role of natural language in AI coding
Language models for code generation are trained on vast amounts of code and natural language data to learn programming concepts and language understanding. The ability to precisely comprehend and adhere to nuanced prompts is crucial for translating product requirements into code.
AI assistants use LLMs for code generation. The code generation capability of these LLMs is commonly measured with the HumanEval test, developed by OpenAI,1 which consists of 164 programming problems. You can see the HumanEval pass@1 scores2 of some large language models in Table 1.
| Large Language Model | pass@1 Score |
|---|---|
| Claude 3.5 Sonnet | 92.0 |
| GPT-4o | 90.2 |
| GPT-4T | 87.1 |
| Claude 3 Opus | 84.9 |
| Claude 3 Haiku | 75.9 |
| Gemini Ultra | 74.4 |
| Claude 3 Sonnet | 73.0 |
| Gemini 1.5 Pro | 71.9 |
| Gemini Pro | 67.7 |
| GPT-4 | 67.0 |

Table 1: HumanEval pass@1 scores of selected large language models.
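For context, the pass@1 scores above come from HumanEval's pass@k metric, which estimates the probability that at least one of k generated samples per problem passes the unit tests. Below is a minimal sketch of the unbiased estimator described in the original HumanEval paper; the exact evaluation harness each vendor uses may differ.

```python
# Minimal sketch of HumanEval's pass@k metric (unbiased estimator from the
# original HumanEval paper): n = samples generated per problem, c = samples
# that pass the unit tests, k = sampling budget being scored (k=1 for pass@1).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled solutions passes."""
    if n - c < k:
        return 1.0  # every k-sized subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated samples pass -> pass@1 = 0.3 for that problem.
print(round(pass_at_k(n=10, c=3, k=1), 3))
```

The reported benchmark numbers average this quantity over all 164 problems.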
Methodology
We measured the success of the outputs produced by generative AI tools. For this purpose, we ran these tools with the same prompts and compared the results of these tasks:
Code generation tasks
Code generation tasks involve generating code snippets or completing programs based on natural language inputs or specifications. In this benchmark, we used 4 different prompts to test each assistant's ability to generate code. For evaluation, our CTO manually reviewed the code and rated the code generators against the following metrics:
- Is the generated code working? For our prompts, all solutions generated working code, so this criterion was left out of the graphs.
- Compliance with specs
- Code quality
- Amount of code: It is possible to deliver the same specs with quite different amounts of code. We measured the amount of code generated and ranked solutions that generated less code higher (a rough counting sketch follows this list).
- Performance
- Security: As a cybersecurity expert, I manually checked the code for OWASP’s Top 10 security flaws.
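To illustrate the "amount of code" criterion, one simple proxy is to count the non-blank, non-comment lines of each generated program. This is only an illustrative sketch; it is not necessarily the exact counting method used in the benchmark.

```python
# Rough sketch: quantify "amount of code" by counting non-blank,
# non-comment lines of a generated Python program. Illustrative only;
# not necessarily the exact measure used in this benchmark.
def count_effective_lines(source: str) -> int:
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count

sample = '''
def add(a, b):
    # returns the sum
    return a + b
'''
print(count_effective_lines(sample))  # 2
```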
This benchmark consists of the following prompts and their analysis:
- Write a calculator.
- Write a safe calculator.
- Write a calculator which only takes numbers up to 6 digits as input.
- Write a calculator that does not accept decimals as input.
Evaluation examples
For example, Codeium’s evaluation results (out of 5) for the prompt “Write a safe calculator” are:
Compliance: 0
Code Quality: 0
Code Amount: 5
Performance: 5
Security: 0
The reasoning behind the scores is that Codeium delivered highly efficient and concise code, but it has a code injection vulnerability because it passes unvalidated user input to eval, which can also be abused to run shell commands (a short demonstration follows the code below).3
```python
# Codeium's result for "Write a safe calculator."
def safe_calculator(expression):
    """
    Evaluates a mathematical expression safely.

    Args:
        expression (str): The mathematical expression to evaluate.

    Returns:
        float: The result of the expression.

    Raises:
        ValueError: If the expression is invalid.
        ZeroDivisionError: If division by zero occurs.
    """
    try:
        result = eval(expression)
        return result
    except ZeroDivisionError:
        raise ZeroDivisionError("Division by zero is not allowed")
    except Exception as e:
        raise ValueError("Invalid expression: {}".format(e))
```
Tabnine’s score for the “Write a calculator” task is:
Compliance: 5
Code Quality: 5
Code Amount: 5
Performance: 5
Security: 5
The reasons behind this evaluation are:
- Exception handling: Improves the user experience by handling errors effectively, so users never encounter an unhandled exception.
- Operator validation: Checks whether the entered operator is among those allowed, which keeps the code simple and informs the user of mistakes.
- Exit option: Defining a “quit” keyword so that users can exit the program adds functionality and convenience.
- Continuous operation: The ability to perform multiple operations makes the calculator more useful; the `while True` loop keeps the application running so users can perform as many operations as they want.
```python
# Tabnine's result for "Write a calculator"
def calculator():
    print("Simple Calculator")
    print("Enter 'quit' to exit")
    while True:
        num1 = input("Enter first number: ")
        if num1.lower() == 'quit':
            break
        operator = input("Enter operator (+, -, *, /): ")
        if operator not in ['+', '-', '*', '/']:
            print("Invalid operator. Please try again.")
            continue
        num2 = input("Enter second number: ")
        if num2.lower() == 'quit':
            break
        try:
            num1 = float(num1)
            num2 = float(num2)
        except ValueError:
            print("Invalid input. Please enter a number.")
            continue
        if operator == '+':
            result = num1 + num2
        elif operator == '-':
            result = num1 - num2
        elif operator == '*':
            result = num1 * num2
        elif operator == '/':
            if num2 == 0:
                print("Error: Division by zero.")
                continue
            result = num1 / num2
        print(f"Result: {result}")

calculator()
```
Code provided by Codeium as an answer to the prompt “Write a calculator which only takes numbers up to 6 digits as input” is scored as follows:
Compliance: 5
Code Quality: 5
Code Amount: 5
Performance: 5
Security: 5
This code exemplifies “less code,” but it validates the two inputs in two separate places, so the same error message is duplicated.
Instead of if-else or case statements, it used eval, significantly shortening the code and improving performance compared to the previous example. Developers typically avoid eval for safety reasons, as it allows execution of arbitrary code, including system-level operations.
However, since the inputs are strictly validated here (digits only, up to 6 characters), the assistant used eval with less risk. Note that eval is not the only way to avoid if-else or switch-style dispatch: a dictionary of operator functions achieves similar brevity without eval, as sketched after the code below.
```python
# Codeium's result for "Write a calculator which only takes numbers up to 6 digits as input"
def calculator():
    """
    A calculator that only accepts numbers up to 6 digits as input.
    """
    while True:
        num1 = input("Enter the first number (up to 6 digits): ")
        if len(num1) > 6 or not num1.isdigit():
            print("Invalid input. Please enter a number up to 6 digits.")
            continue
        operator = input("Enter the operator (+, -, *, /): ")
        if operator not in ['+', '-', '*', '/']:
            print("Invalid operator. Please enter one of +, -, *, /.")
            continue
        num2 = input("Enter the second number (up to 6 digits): ")
        if len(num2) > 6 or not num2.isdigit():
            print("Invalid input. Please enter a number up to 6 digits.")
            continue
        try:
            result = eval(f"{num1} {operator} {num2}")
            print(f"Result: {result}")
        except ZeroDivisionError:
            print("Error: Division by zero is not allowed.")
        except Exception as e:
            print(f"Error: {e}")

calculator()
```
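For comparison, here is a minimal, hypothetical sketch (not one of the benchmarked outputs) showing how a dictionary of operator functions can deliver similar brevity without eval:

```python
# Minimal eval-free alternative: dispatch through a dictionary of operator
# functions instead of evaluating the input string. Hypothetical sketch for
# comparison; not one of the benchmarked outputs.
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

def calculate(num1: str, op: str, num2: str) -> float:
    if op not in OPS:
        raise ValueError(f"Invalid operator: {op}")
    if not (num1.isdigit() and num2.isdigit()) or len(num1) > 6 or len(num2) > 6:
        raise ValueError("Inputs must be whole numbers of up to 6 digits")
    return OPS[op](float(num1), float(num2))

print(calculate("123456", "*", "2"))  # 246912.0
```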
Next steps
- Increasing task diversity.
- Adding a code completion assessment.
- Rolling out more objective criteria in the second version of the benchmark, since the current evaluation is manual and relies on the reviewer’s opinion.
For more on AI coding tools, you can read our other benchmarks:
- Top AI Website Generators Benchmarked
- Screenshot-to-Code Benchmark
- The Best AI Code Editor: Cursor vs. Windsurf
FAQ
What is an AI coding benchmark?
AI coding benchmarks are standardized tests designed to evaluate and compare the performance of artificial intelligence systems in coding tasks.
Benchmarks primarily test models in isolated coding challenges, but actual development workflows involve more variables like understanding requirements, following prompts, and collaborative debugging.
What is the role of language models in code generation?
Large language models (LLMs) are commonly used for code generation tasks due to their ability to learn complex patterns and relationships in code. Code LLMs can be harder to train and deploy for inference than general-purpose LLMs because generated code must be syntactically and functionally exact, while the autoregressive, token-by-token generation of transformer models offers no such guarantee. Different models have different strengths and weaknesses in code generation tasks, and the ideal approach may be to leverage multiple models.
Why are AI coding benchmarks important?
As a growing share of code becomes AI-generated, the quality of AI coding assistants will be critical, and benchmarks provide an objective way to compare them.
What are the proper evaluation metrics and environments for a benchmark?
Evaluation metrics for code generation tasks include code correctness, functionality, readability, and performance. Evaluation environments can be simulated or real-world and may involve compiling and running generated code in multiple programming languages. The evaluation process involves three stages: initial review, final review, and quality control, with a team of internal independent auditors reviewing a percentage of the tasks.