AI will have its greatest impact when AI systems start to create economic value autonomously.
We benchmarked whether frontier models can generate economic value. We prompted them to build a new digital application (e.g., a website or mobile app) that can be monetized with a SaaS or advertising-based model. The benchmarked models were Claude Sonnet 3.5 and ChatGPT-o1.
Results
Both models returned comprehensive outputs in response to our prompts: they chose a focus area for the website and provided instructions for the Cursor AI code editor to build it.
To be fair to both models, we relied on the same prompt without making any further adjustments. The only exception was that when Cursor asked us whether to continue, we responded with phrases like “yes, continue please”.
Neither website could be fully coded with Cursor. Although the code included most of the core functionality, Cursor failed to create a fully functioning website, so we could not proceed with the subsequent steps. Both models completed only 2% of our milestones.
Read the methodology to see how we measured these tools.
Step 1: Choosing the focus of the website
We prompted the frontier models to choose a domain for their app:
Claude Sonnet 3.5’s response
Project management tool for freelancers:
Pricing Structure
- Free Tier:
  - 2 active projects
  - 1 client
  - Basic invoicing
- Pro Tier ($15/month):
  - Unlimited projects
  - Unlimited clients
  - Custom invoicing
  - Priority support
  - Advanced analytics
Our assessment
What the competitors offer: One of the biggest competitors in the space, ClickUp, offers the same features at a lower price. Another top vendor, Asana, offers unlimited tasks, projects, messages, activity logs, and file storage (100 MB per file) in its Free plan.
The output of Claude Sonnet 3.5 failed to offer competitive pricing or a compelling feature to differentiate it from existing vendors in the field.
ChatGPT-o1’s response
A specialized job board for businesses hiring AI-savvy content creators (writers, marketers, social media managers using AI tools, etc.)
Pricing structure:
- Employer Plan: $199/month for unlimited postings OR $49 per job listing.
- Job Seeker: Free to create a profile and browse.
Our assessment
This pricing model presents a straightforward, flat-fee structure that could be cost-effective for employers with frequent hiring needs, offering unlimited postings for a fixed monthly rate. However, for employers with infrequent postings or those preferring to pay fees proportional to transaction amounts, existing platforms like Upwork with minimal upfront costs and percentage-based fees might appear more economical.
Therefore, ChatGPT-o1’s suggestion may not appeal to either employers or job seekers.
Our research revealed that these models cannot yet perform high-quality market research: their proposals were neither new ideas nor did they offer better features than existing competitors. They still require human researchers to improve upon existing tools.
Also, Cursor (with Claude Sonnet 3.5 as the coding LLM for both projects) could not code an entire website. This failure could be attributed either to Cursor’s limitations or to inadequate prompting. Either way, without human participants, it was not possible to generate the idea and code the entire website in this benchmark.
You can also read our AI reasoning benchmark to see how these models reason.
ARC-AGI benchmarks and results
The ARC-AGI benchmarks1 were created to evaluate general reasoning ability in artificial systems using grid-based tasks that require inferring unstated rules from examples.
ARC-AGI-1 (2019–2024)
ARC-AGI-1 was introduced in 2019 to measure fluid intelligence in artificial systems. It consisted of grid-based reasoning tasks where the solver had to infer an unstated rule from a few input-output examples and apply it to unseen test inputs.
The tasks relied only on basic cognitive priors such as object persistence, symmetry, and counting, and did not require language or specialized knowledge.
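To make the task format concrete, below is a minimal sketch of an ARC-style task in Python. The grids and the hidden rule (mirror each row left-to-right) are invented for illustration and are not real ARC tasks; the actual ARC-AGI datasets publish tasks as JSON with "train" and "test" input-output grid pairs encoded as integer color codes.

```python
# A minimal, hypothetical ARC-style task: grids are lists of lists of
# integer color codes, and the solver must infer the hidden rule from
# the "train" pairs and apply it to the "test" input.
# (Illustrative only; the grids and rule below are not real ARC tasks.)
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[2, 0], [0, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]]},  # expected output: [[0, 0], [0, 3]]
    ],
}

def infer_and_apply(grid):
    """Hidden rule in this toy example: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# The inferred rule must reproduce every training pair exactly...
for pair in task["train"]:
    assert infer_and_apply(pair["input"]) == pair["output"]

# ...and is then applied to the unseen test input.
print(infer_and_apply(task["test"][0]["input"]))  # [[0, 0], [0, 3]]
```

Scoring requires producing the exact output grid, which helps explain why program-search and brute-force approaches could solve a portion of ARC-AGI-1 tasks.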
Competitions over several years demonstrated incremental but limited progress:
- In 2020, the top submission reached 20% accuracy on the hidden evaluation set.
- From 2020 to early 2024, performance remained around 34% despite significant scaling of large language models.
- In 2024, new approaches, such as test-time adaptation, improved results. The top eligible team reached 53.5%, while another team achieved 55.5% but did not release its model.
- A preview of OpenAI’s o3 model scored 76% under lower compute settings and 88% under very high compute settings, with the latter exceeding human-level performance. Public versions later scored lower, with o3-medium reaching 53%.
Although ARC-AGI-1 spurred research activity, it showed weaknesses as a benchmark. Many tasks were vulnerable to brute-force strategies, consistent first-party human baselines were lacking, task difficulty was uneven across subsets, and repeated reuse of hidden tasks introduced risks of information leakage.
ARC-AGI-2
ARC-AGI-2 was created to address the limitations of its predecessor while preserving the same task format. It aimed to reduce reliance on brute-force solutions, calibrate task difficulty across evaluation sets, and establish clear baselines for human performance.
The development process involved extensive human testing with 407 participants, encompassing over 13,000 task attempts. The average success rate was 66%, with every task solved by at least two participants within two attempts. Median completion time per attempt was approximately 2.2 minutes.
Results on ARC-AGI-2 highlight the current gap between human and machine performance:
- Leading models, such as o3-mini and o3-medium, scored around 3%.
- The ARC Prize 2024 winning team achieved 2.5%.
- Other systems, including Claude 3.7 and Icecuber, scored below 2%.
- Scores under 5% are considered too close to noise to be meaningful.
Compared to ARC-AGI-1, where the best systems exceeded 50% accuracy, ARC-AGI-2 represents a significantly higher level of difficulty.
Its tasks are more distinctive, feature larger grids and more objects, and emphasize compositional reasoning such as multi-step transformations, contextual rule application, and symbol definition.
GDPval benchmark
GDPval was created to evaluate the performance of AI models on real-world tasks with measurable economic value. It focuses on 44 occupations from nine major sectors that contribute significantly to U.S. GDP, including healthcare, finance, manufacturing, real estate, and government.
The benchmark includes 1,320 tasks in its complete set, with about 30 tasks per occupation. A gold subset of 220 tasks has been released publicly for research and testing.
Unlike traditional benchmarks that test reasoning in academic or artificial contexts, GDPval tasks are based on actual deliverables produced by industry professionals.
These tasks can involve documents, spreadsheets, presentations, CAD files, audio, video, or customer support records. Each task is designed and validated by experts with an average of 14 years of professional experience, ensuring that the content reflects real workplace demands.
Figure 1: Human pairwise comparisons suggest that models are approaching the performance of industry experts on the GDPval gold subset.2
What it measures
GDPval evaluates three main aspects of AI performance:
- Quality of deliverables: Outputs are compared directly to those of human experts through blinded pairwise grading. Professional evaluators judge which deliverable better meets the requirements, considering correctness, structure, style, formatting, and relevance. This produces a win rate, which indicates how often a model’s output is rated equal to or better than the human expert’s deliverable (a minimal calculation sketch follows this section).
- Speed and cost efficiency: The benchmark records the time and cost required to complete tasks. Human experts typically spend about 7 hours (404 minutes) on a task, which translates to roughly $361 in wages. AI models complete tasks much faster and at lower cost, but the savings depend on how much human review and correction is required.
- Adaptability through reasoning and prompting: The benchmark also tests whether model performance improves when models are given more reasoning effort, more straightforward prompts, or scaffolding techniques. This helps measure not only raw capability but also how well models can be guided to perform complex, multi-step tasks.
Together, these measures capture both the potential benefits and the current limitations of AI in performing tasks that align with economically valuable work.
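To illustrate how pairwise grading yields a win rate, here is a minimal sketch assuming each blinded comparison is labeled as "model_better", "tie", or "human_better". The label names and sample data are hypothetical; GDPval’s actual grading pipeline is not reproduced here.

```python
# Hypothetical pairwise grades for one model across a set of tasks.
# Each entry is a blinded grader's verdict on "model deliverable vs.
# human expert deliverable" for a single task (labels are illustrative).
grades = ["model_better", "tie", "human_better", "model_better", "human_better"]

def win_tie_rate(grades):
    """Share of comparisons where the model output was rated as good as
    or better than the human expert's deliverable."""
    favorable = sum(g in ("model_better", "tie") for g in grades)
    return favorable / len(grades)

print(f"win-or-tie rate: {win_tie_rate(grades):.0%}")  # 60% for this toy sample
```

Under this reading, a rate of roughly 48%, as reported for Claude Opus 4.1 below, means the model’s deliverable was judged at least as good as the expert’s in almost half of the comparisons.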
Outcomes of the benchmark
a) Model performance vs. human experts
- The best models are approaching expert parity. For example, Claude Opus 4.1 achieved a win-or-tie rate of approximately 48%, meaning that in nearly half of the tasks its outputs were rated as good as or better than the human expert’s.
- GPT-5 was strongest on accuracy (instruction following, calculations), while Claude was strongest on aesthetics (formatting, slides, layouts).
b) Trends over time
- OpenAI’s models showed linear improvement across versions (e.g., GPT-4o → o3 → GPT-5), with performance steadily rising toward expert quality.
c) Speed & cost savings
- Measured naively, models are 90 to 300 times faster and hundreds of times cheaper than humans.
- When review and corrections are factored in, realistic gains are more modest: roughly 1.1 to 1.6 times faster and cheaper in workflows where experts review and refine AI outputs (a worked example follows this list).
- This suggests AI can already meaningfully augment professional workflows rather than fully replace them.
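The gap between the naive and realistic figures above is simple arithmetic: once an expert still has to review and fix the model’s output, the human time saved shrinks dramatically. The sketch below uses assumed numbers (7 hours of expert work, a couple of minutes of model time, five hours of review); these are illustrative assumptions, not GDPval’s exact accounting.

```python
# Illustrative comparison of naive vs. review-adjusted speedup.
# All figures below are assumptions for the sake of the example.
expert_minutes = 7 * 60        # expert completes the task alone
model_minutes = 2              # model produces a draft deliverable
review_minutes = 300           # expert reviews and corrects the draft

naive_speedup = expert_minutes / model_minutes
realistic_speedup = expert_minutes / (model_minutes + review_minutes)

print(f"naive speedup:     {naive_speedup:.0f}x")      # ~210x
print(f"realistic speedup: {realistic_speedup:.1f}x")  # ~1.4x
```

With these assumptions, the naive speedup lands within the reported 90 to 300 times range, while the review-adjusted figure falls inside the 1.1 to 1.6 times band.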
d) Failure modes
- Models most often fail due to:
  - Instruction-following errors (especially Claude, Gemini, Grok).
  - Formatting issues (especially GPT-5).
  - Occasional hallucinations or miscalculations.
- Most failures are “acceptable but subpar” rather than catastrophic, though ~3% of GPT-5’s failures were considered catastrophic (dangerous or highly inappropriate outputs).
Can/will AI generate economic value?
According to an Anthropic report,3 artificial intelligence is already generating measurable economic value through rapid adoption, productivity improvements, and automation. Individuals and enterprises increasingly use Claude for tasks such as coding, research, education, and administration, with enterprises automating approximately 77% of API-based interactions.
Businesses often prioritize tasks where AI capabilities are strongest, even when these tasks are more costly, suggesting that efficiency gains outweigh price considerations.
However, the benefits remain unevenly distributed, as high-income regions, automation-ready sectors, and workers with specialized knowledge capture a disproportionate share of the value, raising concerns about widening inequalities alongside economic advancement.
Methodology
We selected the milestones that an AI system must complete to generate economic value by building a new application, together with their weights:
- Domain identification (1%)
- Spec preparation (1%)
- App coding (8%)
- App deployment (5%)
- App testing (5%)
- Marketing (5%)
- Optimization (5%)
- Revenue generation (70%)
Each milestone was assigned a specific budget, and the results were evaluated by a human expert panel (a minimal scoring sketch based on the weights above is given below). Tools could be used within the allocated budget for each model, and we created accounts in various systems to test the models.
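To show how the milestone weights translate into the completion figures reported above, here is a minimal scoring sketch. The milestone names and weights mirror the list above; which milestones count as completed is an assumption matching our result (only domain identification and spec preparation), and the expert panel’s qualitative judgment is not modeled.

```python
# Milestone weights from the methodology (percent of the total score).
milestones = {
    "domain_identification": 1,
    "spec_preparation": 1,
    "app_coding": 8,
    "app_deployment": 5,
    "app_testing": 5,
    "marketing": 5,
    "optimization": 5,
    "revenue_generation": 70,
}

# In our runs, both models only got through the first two milestones.
completed = {"domain_identification", "spec_preparation"}

score = sum(weight for name, weight in milestones.items() if name in completed)
print(f"completion score: {score}%")  # 2%
```

This is how both models ended up with the 2% completion score reported in the Results section.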
Our first prompt: Create a website with specific revenue targets. This process will include different phases for selecting the niche, coding and deploying, and marketing.
- Business Goal: Generate $2,000 Monthly Recurring Revenue (MRR) within 2 months of deployment
- Initial marketing budget: $500
- Cannot implement any compliance and certification requirements (no HIPAA, SOC2, PCI, etc.)
- For Phase 1: Analyze and select one promising niche market that can:
- Reach $2k MRR within 2 months realistically
- Be built and marketed within our budget constraints
- Have clear monetization potential
- Show sufficient market demand
- For Phase 2: I will code the product with an agentic AI coding editor, like Cursor, v0 etc.
- You should provide me a prompt to give the editor. The prompt should include all the functions of the product. After that, we will continue with marketing, but for now, only provide results for these.
Since the models left some choices to the user, we prompted them again.
Our second prompt: Is there a specific AI coding assistant you want me to use? Cursor, Replit, V0, Lovable etc. Also, make sure that the prompt we give to these tools covers all the details of the project. Don’t make the AI coding assistant or me make a choice about the project, you will decide all the details about the project.
Reference Links

