AIMultiple Research
We follow ethical norms & our process for objectivity.
This research is not funded by any sponsors.
Updated on Apr 3, 2025

AI Agents: Operator vs Browser Use vs Project Mariner ['25]

By Cem Dilmegani

We spent more than 40 hours testing the top 4 AI agents to see whether they could help us in a business process involving tool use.

We also benchmarked the web search capabilities of AI agents; our experience with each agent is detailed below.

Tool use benchmark results

ChatGPT Operator was the most successful agent in this benchmark, completing 50% of the benchmark task.

Methodology

We aimed to test whether agents could assist in our business workflow. To use a real-life example from our company, we asked them to prepare the interactive graphs we publish on observablehq.com.

With this task, we aimed to evaluate their tool-use and coding abilities.

Although we have graph templates, changing the data requires editing the code snippets for both the graphs and the buttons.

We provided them with the following prompt:

# Observable Template Update Instructions

I have a graph template on observablehq.com and by using that template, I want you to create new graphs with the new data I will provide. Here are the instructions:

1. Access and Setup

   – Go to observablehq.com

   – Find the template named “vis_template” and fork it, name the fork as “new_graph1”, under the notebooks section.

2. Template Structure

   – Style Cell: Do not modify (contains font settings)

   – Buttons Cell: Must be updated based on new data

   – Graph Cell: Must be updated based on new data

3. Data Handling

   – You will receive data with platforms and their scores for different categories

   – Both platform names and category names in the data can be different from the template

   – The data structure will always be: platforms with scores (0-1) for each category

4. Required Updates

   – Buttons:

     * Create a button for each category in the new data

     * Keep “Overall” as the first button

     * Maintain existing button styling and responsive design

   – Graph:

     * Update platform names on the y-axis

     * Update all score values and calculations

     * Keep existing color scheme and animations

     * Maintain mobile responsiveness (< 500px breakpoint)

5. Testing Requirements

   – Verify all buttons work correctly

   – Check graph updates when categories are selected

   – Test responsive layout on mobile and desktop views

   – Ensure logo placement remains correct with new data

Remember: The template’s structure and styling should remain unchanged – only update the data and necessary category-related elements.

Here is the new data to use:

data = [
  { platform: "AcmeOCR", Scanned_docs: 0.95, Digital_text: 0.99 },
  { platform: "TextPro", Scanned_docs: 0.92, Digital_text: 0.97 },
  { platform: "DocReader", Scanned_docs: 0.88, Digital_text: 0.96 },
  { platform: "SmartScan", Scanned_docs: 0.85, Digital_text: 0.94 }
]

To maintain objectivity, we did not provide further prompts. We only responded with ‘Yes’ when asked whether to continue and entered our credentials to sign in to observablehq.com.

We scored each agent against the following 100-point rubric (restated in code after the list):

  1. Signing in, or handing control to the user to sign in to our observablehq.com account (10 points)
  2. Finding the template (10 points)
  3. Forking it (10 points)
  4. Renaming the fork (10 points)
  5. Leaving the style cell untouched (5 points)
  6. Updating the data in the code (15 points)
  7. Updating the graph code (20 points)
  8. Updating the button code (20 points)
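
For reference, the rubric can be restated as weights summing to 100 points; the key names in this sketch are our own shorthand, not anything the tools expose:

# Hypothetical restatement of the 100-point rubric above (key names are ours).
criteria = {
    "sign_in": 10,
    "find_template": 10,
    "fork_template": 10,
    "rename_fork": 10,
    "style_cell_untouched": 5,
    "update_data": 15,
    "update_graph_code": 20,
    "update_button_code": 20,
}
assert sum(criteria.values()) == 100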

Please be cautious about using AI agents with your own accounts; doing so may create security risks or trigger unwanted activity.

Our experience with the agents

ChatGPT Operator

OpenAI Operator was the easiest agent to set up in this benchmark since it runs on a website that combines the ChatGPT interface with a virtual browser; you do not need to set up a virtual environment or any software to run it.

At the sign-in step, it asked us to take over; we signed in and then handed the operation back to the agent. After that, it found the template and forked it. It changed the data successfully, but it could not update the remaining graph and button code snippets, nor could it rename the forked notebook.

Google Project Mariner

Google Project Mariner is not publicly available yet but can be tested with permission after joining the waitlist. It works directly in the browser as a Chrome extension and supports a human-in-the-loop process: for example, when a CAPTCHA appears on the screen, Mariner asks the user to take over and solve it.

Browser Use

Browser Use is an open-source AI agent that you run with your own API keys.

You can watch the agent's actions in the browser tab it opens, and you can follow the output of its actions in your terminal.1

It signed in to observablehq.com and forked the template successfully. However, it then deleted the original template, and it could not name the forked notebook correctly. Although we prompted it to keep the style cell as it is, it failed to do so and wrote data code into that cell. It could not change the graph or button code.

We ran Browser Use with a GPT-4o API key.

Browser Use can also be run through a WebUI, but we did not use it for this task.2
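
For illustration, a minimal Browser Use run looks roughly like the sketch below, based on the library's documented quickstart. It assumes the browser-use and langchain-openai packages are installed and an OPENAI_API_KEY environment variable is set; the task string is an abbreviated form of our prompt, not our exact benchmark harness.

# Minimal Browser Use run (sketch).
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main() -> None:
    agent = Agent(
        # Abbreviated version of the benchmark prompt shown above.
        task=(
            "Go to observablehq.com, find the template named 'vis_template', "
            "fork it, and rename the fork to 'new_graph1'."
        ),
        llm=ChatOpenAI(model="gpt-4o"),  # the model we used via API
    )
    await agent.run()

asyncio.run(main())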

Anthropic Computer Use

Anthropic focuses on AI safety, and we observed these efforts in its agent. We tried every possible way, but the agent would not sign in to our observablehq.com account; it refused to sign in for safety reasons.

It is also not possible to take over the process from the agent and then let it continue, since we used the virtual environment recommended by Anthropic.

Therefore, the agent scored 0 points on our task since it could not proceed.
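
For context, invoking Computer Use happens at the API level; the sketch below follows Anthropic's documented computer-use beta. The display dimensions are assumed values, and the agent loop that executes tool calls (screenshots, clicks, keystrokes) and feeds results back is omitted.

# One computer-use request (sketch); the tool-execution loop is omitted.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,   # assumed virtual display size
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Sign in to observablehq.com ..."}],
    betas=["computer-use-2024-10-22"],
)
print(response.stop_reason)  # e.g. "tool_use" when the model wants to act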

Pricing

Anthropic Computer Use requires API keys, making it potentially more expensive for long tasks than other options.

OpenAI Operator is available on the Pro plan for users located in the US.3 The Pro plan costs $200/month as of February 2025.

Browser Use is an open-source tool whose only expense is API calls.

Web search benchmark results

To investigate the business use cases of AI agents, we used two different web scraping tasks. All agents failed most of the tasks; Anthropic Computer Use and Dendrite performed slightly better than Phidata.

To learn more about web scraping, you can read Roadmap to Web Scraping: Use Cases, Methods & Tools and RPA Web Scraping.

Task 1:

Prompt: Provide all cloud GPU providers that offer H100. We need every H100 offer from each provider. Therefore a GPU provider may be presented in multiple rows when they offer multiple H100 GPU offer (e.g. an offer with a single H100 and another offer with two H100s). For each row, we need these data points: URL where offer is shared, number of GPUs as an integer, price per hour as a decimal in $. Output as json.
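
To illustrate the requested output format, one row might look like the sketch below; the field names and values are our hypothetical reading of the prompt, not an agent's actual output.

# One hypothetical row of the requested JSON output (values invented).
expected_row = {
    "url": "https://example-provider.com/gpus/h100",  # URL where the offer is shared
    "gpu_count": 2,                                   # number of H100 GPUs, integer
    "price_per_hour_usd": 4.99,                       # price per hour, decimal in $
}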

We evaluated their capabilities to:

  • Find all the correct sources (Figure 1)

  • Provide correct information (Figure 2)

Figure 1: The percentage of correct sources provided by each product.
Figure 2: The accuracy of the information provided by each product.

Task 2:

Prompt: Find B2B tech private companies that raised funding in October 2024. Format each result as: [Company name] raised [amount] in [sector/industry].

In this task, Anthropic Computer Use (Figure 3) and Phidata (Figure 4) failed to provide answers.

Figure 3: Computer Use's answer to our task.
Figure 4: Phidata's answer to our task; it provided relevant sources but not the answers.

ChatGPT search returned 7 companies, 6 of which were accurate. The remaining company was listed as having raised funding in August 2024, which does not meet our requirement for companies that raised funding in October 2024, so that entry is incorrect.

Dendrite correctly provided 2 companies, although many more exist, because it relied on incomplete search engine results.

Perplexity provided 6 companies, and while their names, raised amounts, and industries are accurate, none of them completed fundraising in October 2024, so this information does not meet our requirements.

The leaders in this task are therefore ChatGPT search and Dendrite.

Pricing

Anthropic Computer Use is priced by API requests. For example, we spent roughly $2.50 to run these 2 tasks, running each task a couple of times. At about $0.50 per task run, this is expensive; if you want agentic process automation, you can find more cost-effective options.

ChatGPT’s search functionality is available to users subscribed to the Plus and Team plans, priced at $20 per month and $25 per user per month (billed annually), respectively.

Dendrite offers a limited free plan and a Developer plan priced at $30. Specific details regarding the limitations of the free plan will be updated once they are officially published.

Phidata has free, pro, and enterprise plans, though only the free plan is currently available. They also state that the pro plan will be free for students, educators, and start-ups.

Our methodology

Versions: We used the latest version of each tool available as of November 1, 2024.

Deployment environment:

  • Dendrite and Phidata were run on our laptop.

  • Anthropic Computer Use was deployed to a cloud VM, as Anthropic recommends against running it on user devices.

  • The ChatGPT search feature and Perplexity were accessed directly on their respective websites.

Process:

  • To evaluate vendors’ web-search capabilities, we first compiled a ground-truth list of cloud H100 providers. Then, we compared it with the outputs of the AI agents.

  • To evaluate the accuracy of the information, we checked every link they provided to verify whether the information given to us was correct.

  • We did not try prompt engineering to get more accurate results.

Scoring:

Since the number of outputs they provide varies, we aimed to keep the scoring system as straightforward as possible. For Task 1, if a product returns a URL that is not from a reliable source, it receives a score of 0.

Additionally, the number of outputs ranges from 6 to 28, so it is important to note that a product with 3 correct answers out of 6 outputs and another with 14 correct answers out of 28 outputs receive the same score in Figure 2.
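
In other words, the Figure 2 score appears to reduce to a simple accuracy ratio, so differing output volumes can tie; a minimal sketch:

# Accuracy normalization implied by Figure 2: correct answers / total outputs.
def accuracy(correct: int, total: int) -> float:
    return correct / total

# A product with 3 of 6 correct ties with one at 14 of 28 correct.
assert accuracy(3, 6) == accuracy(14, 28) == 0.5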

We did not score the products for Task 2, as search results vary significantly with the browser used and the user's location, and the products scrape data from these sources accordingly. However, since ChatGPT search and Dendrite provided accurate results, they are considered the leaders for this task.

Disclaimer

Since the agents use different browsers and locations, they can encounter different sources while web scraping. To be fair to all agents, all potential sources were included in our ground truth.

Since these products are in version 1 or beta, they have various limitations. We will continue benchmarking and update results as they evolve.

Since these models are newly developed, they may introduce security vulnerabilities, so we recommend using them in a virtual machine or container. Anthropic also notes the necessity of this precaution when using Computer Use.4

Figure 5: Anthropic's warning about the usage of Computer Use.

Our experience with the tools

Anthropic Computer Use

Computer Use makes numerous API calls for a single task, so running an agent with it is slow.

We initially encountered problems due to Anthropic’s rate limits. In Tier 1, Anthropic allows users to make 50 API requests per minute. This was not enough to finish our tasks, so we needed to run the prompt multiple times.

We then requested a higher API limit and received it within hours, which made benchmarking easier.

Perplexity

Perplexity’s search tool is accessible directly on its website. Like ChatGPT search, it is not an agentic AI, but we chose to include it in our testing since our benchmark task involves web scraping.

ChatGPT search

ChatGPT's search feature is available to Pro and Team users directly within the ChatGPT interface. Although it is not an agentic AI, we included it in our testing because the focus of this benchmark is web scraping.

Dendrite

Dendrite provides example agents, such as data extraction agents, on its website, which facilitates building new agents.

Dendrite's agents ran slower than most of the other agents in this benchmark.

Unlike other agents, it requires users to enter the search query.

Phidata

Phidata provides examples, such as web search agents, on its website to make it easy to build new agents; we developed an agent in minutes (a sketch follows below).
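
The sketch below shows roughly how such an agent comes together, based on Phidata's documented web search example; it assumes the phidata, openai, and duckduckgo-search packages are installed and an OPENAI_API_KEY environment variable is set.

# Minimal Phidata web search agent, per the library's documented example.
from phi.agent import Agent
from phi.model.openai import OpenAIChat
from phi.tools.duckduckgo import DuckDuckGo

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[DuckDuckGo()],  # web search tool
    show_tool_calls=True,  # print the search calls the agent makes
)

agent.print_response(
    "Find B2B tech private companies that raised funding in October 2024."
)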

In our benchmark, Phidata's agent hallucinated results, providing links to pages and pricing information that do not exist.

FAQ

What are the AI agent applications and use cases?

AI agents can automate complex workflows, reducing the need for human intervention and increasing efficiency. They can handle exceptions and edge cases, making them more reliable than traditional automation solutions.
AI agents can perform tasks that would be difficult or tedious for humans, and they can also be used for natural language processing, data processing, and analysis.

How to build your own agents?

Choose a vendor by considering your needs, capabilities, and pricing.
Agents can be integrated with external systems using API calls and can access a wide range of data sources.
When designing the task for your AI agent, provide a prompt that is goal-oriented and not confusing to the model.

Are AI agents secure?

AI agents must be designed with data privacy and security in mind, using techniques such as encryption and access controls. At the current level of development, we suggest not sharing your sensitive data with artificial intelligence agents.

What are the business benefits of AI agents?

AI agents can increase efficiency and productivity by automating repetitive tasks and freeing up humans to focus on more complex work.
They can analyze enterprise data and automate business processes; to learn more, see agentic process automation.

How to measure the success of AI agents?

If you use an agent in your business, measure its success with metrics such as efficiency, productivity, and customer satisfaction.
Monitor the performance of AI agents over time, making adjustments as needed.
Use data and analytics to gain insight into the agents' decision-making processes and reliability.


