
We Tested Mobile AI Agents Across 65 Real-World Tasks

Cem Dilmegani
updated on Nov 6, 2025

We spent 3 days benchmarking four mobile AI agents (DroidRun, Mobile-Agent, AutoDroid, and AppAgent) across 65 real-world tasks on an Android emulator, covering calendar management, contact creation, photo capture, audio recording, and file operations.

See the benchmark results below, including a real-world performance comparison, costs, and execution times:

Mobile AI agents performance comparison


DroidRun

Highest success rate (43%) with high cost per successful task ($0.075, ~3,225 tokens)

DroidRun demonstrated the strongest performance with a 43% success rate across the 65 tasks. When examining only the task that all agents successfully completed, DroidRun consumed an average of 3,225 tokens at a cost of $0.075 per task.

This substantial resource consumption reflects DroidRun’s multi-step reasoning architecture, where the agent maintains detailed state tracking, generates explicit action plans, and provides explanations for each decision. While expensive, this comprehensive approach delivers the highest success rate in the benchmark.

Mobile-Agent

Strong performance (29%) and cost-efficient ($0.025, ~1,130 tokens)

Mobile-Agent achieved the second-highest success rate at 29% while maintaining reasonable cost-efficiency. On the single task that all agents completed, Mobile-Agent averaged $0.025 and 1,130 tokens per task.

This represents approximately one-third of DroidRun’s per-task cost while achieving about two-thirds of its success rate, making Mobile-Agent an attractive option for deployments where budget constraints are important.

However, the 14-percentage-point gap in success rate suggests that DroidRun’s additional reasoning capabilities provide meaningful value for mission-critical applications.

AutoDroid

Best cost-efficiency (14% success, $0.017, ~765 tokens) but limited effectiveness

AutoDroid demonstrated the lowest cost on the commonly completed task at just $0.017 and 765 tokens per task, making it the most economical option in the benchmark.

However, its 14% success rate, less than half of Mobile-Agent’s performance and roughly one-third of DroidRun’s, indicates that this cost advantage comes with significant trade-offs in reliability.

Despite using an action-based approach similar to DroidRun, AutoDroid’s minimal reasoning overhead results in substantial cost savings but limited task completion capability.

AppAgent

Poorest performance (7% success) with highest cost ($0.90, ~2,346 tokens)

AppAgent recorded both the lowest success rate at 7% and the highest cost on the commonly completed task, at $0.90 and 2,346 tokens per task: twelve times more expensive than DroidRun and over fifty times more costly than AutoDroid.

This poor cost-to-performance ratio stems from AppAgent’s vision-based approach, which processes labeled screenshots through multimodal LLMs for every interaction. Each screenshot sent to the multimodal LLM consumes substantial input tokens for image processing, while the actual text responses (completion tokens) remain relatively modest.

This creates a highly imbalanced token distribution where the vision processing overhead dominates the cost without corresponding improvements in task completion, as the agent struggles with coordinate calculations and UI element identification on mobile interfaces.

Mobile AI agents execution time comparison


On the single task that all agents successfully completed, AutoDroid was the fastest at 57 seconds, followed closely by Mobile-Agent at 66 seconds. DroidRun completed the task in 78 seconds, demonstrating that its multi-step reasoning architecture still enables efficient execution despite higher token consumption.

AppAgent exhibited significantly higher latency at 180 seconds, due to its vision-based approach requiring extensive screenshot processing through multimodal LLMs for every interaction.

You can read our benchmark methodology below.

Overview of mobile AI agents


DroidRun

DroidRun is an open-source framework for building mobile-native AI agents that autonomously control mobile apps and phones. It converts user interfaces into structured data that large language models can act on, enabling complex automation directly on mobile devices.

DroidRun rapidly gained traction: over 900 developers signed up within 24 hours, and the project soared to 3.8k stars on GitHub, making it one of the fastest-growing frameworks for mobile AI agents.


AutoDroid

AutoDroid is a mobile task automation system designed to perform arbitrary tasks across any Android app without manual setup. It leverages the commonsense reasoning of large language models like GPT‑4 and Vicuna, combined with automated app-specific analysis. 

AutoDroid introduces a functionality-aware UI representation to connect app interfaces with LLMs, uses exploration-based memory injection to teach the model app-specific behaviors, and includes query optimization to reduce inference costs. Evaluated on a benchmark of 158 tasks, it achieved 90.9% action accuracy and 71.3% task success, outperforming GPT‑4-only baselines.1  

Mobile-Agent

The GitHub repo X-PLUG/MobileAgent is the official implementation of Mobile-Agent, an AI agent framework designed to autonomously control mobile applications by perceiving and reasoning over their visual UI representations.

This project comes from Alibaba's X-PLUG group and was presented at ICLR 2024, aiming to push the boundaries of mobile agents through multimodal learning, particularly visual perception and instruction-following.

AppAgent

The GitHub repository TencentQQGYLab/AppAgent is an open-source research project from Tencent's QQGYLab. It introduces AppAgent, a mobile AI agent framework designed to autonomously operate and reason through Android apps without hand-written code for each individual app.


Features of mobile AI agents

Goal-oriented command handling

Users specify what they want done (e.g., “Book a ride to the airport”), not the individual steps; the agent determines which apps to open, what actions to take, and how to sequence them.

LLM-backed reasoning

Powered by large language models (e.g., GPT-4, Claude, Gemini), these agents can:

  • Identify user intent and screen content
  • Generate logical, step-by-step action plans
  • Adapt to dynamic UI changes across different app states

Structured, native app control

Instead of relying on screen-scraping:

  • Agents extract structured UI hierarchies (e.g., XML-based trees of buttons and fields)
  • They interact directly with UI elements, treating them as first-class APIs.
    • Example: DroidRun uses Android Accessibility APIs to read and act on real UI elements.
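
To make this concrete, here is a minimal sketch of structured UI extraction over ADB. It uses a uiautomator XML dump rather than the Accessibility APIs DroidRun relies on, and the helper names are our own:

```python
# Minimal sketch of structured UI extraction (hypothetical helper names).
# Uses a `uiautomator` XML dump; DroidRun itself goes through Android
# Accessibility APIs instead.
import subprocess
import xml.etree.ElementTree as ET

def dump_ui_hierarchy() -> ET.Element:
    """Dump the current screen's UI tree from a connected device/emulator."""
    subprocess.run(["adb", "shell", "uiautomator", "dump", "/sdcard/ui.xml"], check=True)
    xml_text = subprocess.run(
        ["adb", "shell", "cat", "/sdcard/ui.xml"],
        check=True, capture_output=True, text=True,
    ).stdout
    return ET.fromstring(xml_text)

def actionable_elements(root: ET.Element) -> list[dict]:
    """Collect clickable nodes with their text, resource id, and bounds."""
    elements = []
    for node in root.iter("node"):
        if node.get("clickable") == "true":
            elements.append({
                "text": node.get("text", ""),
                "resource_id": node.get("resource-id", ""),
                "bounds": node.get("bounds", ""),  # e.g. "[0,63][1080,210]"
            })
    return elements
```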

Cross-app workflow execution

Agents operate across multiple apps and multi-step workflows. They can replan if an intermediate step fails. For example, “Download a file from email → upload it to Google Drive → send a confirmation.”
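
A hypothetical control loop illustrates the replanning idea; plan_steps and execute_step stand in for an agent's LLM-backed planner and executor:

```python
# Hypothetical cross-app workflow loop with replanning on failure.
# `plan_steps(goal, failed_step=None)` and `execute_step(step)` are stand-ins
# for an agent's LLM-backed planner and executor.
def run_workflow(goal: str, plan_steps, execute_step, max_replans: int = 3) -> bool:
    steps = plan_steps(goal)          # e.g. ["open email", "download file", ...]
    replans = 0
    while steps:
        step = steps.pop(0)
        if execute_step(step):
            continue                  # step succeeded, move on
        replans += 1
        if replans > max_replans:
            return False              # give up after repeated failures
        steps = plan_steps(goal, failed_step=step)  # ask the planner to re-plan
    return True
```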

Benchmark methodology

We conducted a benchmark evaluation to assess the performance of AI mobile agents operating on the Android operating system in real-world tasks. We used the AndroidWorld framework and tested all agents on the same standard tasks.

AndroidWorld Framework

AndroidWorld is an open-source benchmark platform specifically developed by Google Research for evaluating mobile agents. This platform aims to measure the performance of agents working in real Android applications through standardized tasks.

The most important feature of AndroidWorld is that it uses real Android applications instead of artificial test environments and can automatically evaluate agents’ performance. We used 65 tasks in this study. These tasks cover daily mobile device usage scenarios such as calendar management, adding contacts, voice recording, taking photos, and file operations.

Environment Setup

System configuration: To set up the benchmark environment, we first installed Android Studio on Windows 11 operating system and configured Google’s official Android Emulator.

Virtual device setup: We created a virtual device simulating a Pixel 6 device. The specifications of this virtual device were set as Android 13 (API Level 33) operating system, 1080×2400 resolution, 8GB RAM, and 20GB storage space.

Emulator configuration: To integrate the emulator with AndroidWorld, we configured the gRPC port as 8554 because AndroidWorld communicates with the emulator through this port.
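
In practice, this amounts to passing the port at launch. A sketch of the launch call follows; the SDK path and AVD name are assumptions for a Windows 11 setup:

```python
# Sketch: launch the Pixel 6 AVD with the gRPC port AndroidWorld expects.
import subprocess

EMULATOR = r"C:\Android\Sdk\emulator\emulator.exe"  # adjust to your SDK install

subprocess.Popen([
    EMULATOR,
    "-avd", "AndroidWorldAvd",  # name of the virtual device created above (assumed)
    "-no-snapshot",             # start fresh; tasks are restored from snapshots later
    "-grpc", "8554",            # the port AndroidWorld communicates over
])
```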

Python environment setup: To prepare the Python environment, we created a new conda environment with Python 3.11 using Miniconda. After cloning the AndroidWorld repository from GitHub, we installed all dependencies using pip.

One of the most critical steps is AndroidWorld's own setup command, which took approximately 45-60 minutes. During this process, AndroidWorld automatically installed all Android applications to be tested on the emulator.

Initial state data creation: AndroidWorld created initial state data for each application: for example, it added events to the calendar application, contacts to the contacts application, and a podcast named “banana” to the podcast application. It also saved a snapshot for each task, so every task can start from a clean initial state.

Agent integrations

AutoDroid

AutoDroid integration: To integrate AutoDroid, we first cloned the repository from GitHub and installed the required Python packages. AutoDroid’s main feature is identifying UI elements by parsing XML and completing tasks with an action-based approach.

The agent assigns an index number to each clickable or focusable element on the screen and receives commands from the LLM such as tap(5) or text('hello').

AutoDroid wrapper: For integration with AndroidWorld, we created a wrapper class named autodroid_agent.py. This wrapper performs the necessary configuration in AutoDroid’s initialization method, converts the task goal coming from AndroidWorld into a prompt format AutoDroid can process, and transforms the actions generated by AutoDroid into real ADB commands using AndroidWorld’s execute_adb_call functions.

Execution flow: In AutoDroid’s step method, the agent first takes a screenshot and XML dump of the screen, parses UI elements, sends this information to the LLM, and performs tap, swipe, or text input actions according to the received response.
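
A sketch of the action-dispatch idea: parse an LLM reply such as tap(5) or text('hello') and map it to a raw adb input command. In the real wrapper these calls are routed through AndroidWorld's execute_adb_call instead; the bounds format reuses the XML-dump sketch shown earlier:

```python
# Illustrative dispatch of AutoDroid-style LLM actions to adb input commands.
import re
import subprocess

def center(bounds: str) -> tuple[int, int]:
    """Center point of a bounds string like '[0,63][1080,210]'."""
    x1, y1, x2, y2 = map(int, re.findall(r"\d+", bounds))
    return (x1 + x2) // 2, (y1 + y2) // 2

def dispatch(action: str, elements: list[dict]) -> None:
    """Execute one LLM action against the indexed element list."""
    if m := re.fullmatch(r"tap\((\d+)\)", action):
        x, y = center(elements[int(m.group(1))]["bounds"])
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
    elif m := re.fullmatch(r"text\('(.*)'\)", action):
        subprocess.run(["adb", "shell", "input", "text", m.group(1)], check=True)
    else:
        raise ValueError(f"Unrecognized action: {action}")
```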

DroidRun

DroidRun integration: We followed a similar integration process for DroidRun. After cloning the DroidRun repository from GitHub, we installed the dependencies in requirements.txt.

DroidRun’s architectural structure is more complex because it has a multi-step reasoning and state tracking system. DroidRun can explain not only what it will do at each step but also why, and can use the results of previous steps in the next step.

DroidRun wrapper: We created the droidrun_agent.py wrapper for AndroidWorld integration. The most important part in this wrapper was making DroidRun’s own CodeActAgent class compatible with AndroidWorld’s base agent interface.

Execution process: When DroidRun’s execute_task method is called, the agent goes through a task planning phase, then executes each step and evaluates the results. We adapted this process to AndroidWorld’s step-by-step execution model and implemented the tools DroidRun uses (tap_by_index, start_app, list_packages, etc.) with AndroidWorld’s ADB commands.
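
A simplified shape of that wrapper is below. The AndroidWorld side follows its base-agent interface as we used it; the DroidRun internals (run_one_step, task_complete, explanation) are hypothetical stand-ins for the real CodeActAgent cycle:

```python
# Simplified sketch of droidrun_agent.py (DroidRun-side names are stand-ins).
from android_world.agents import base_agent

class DroidRunAgent(base_agent.EnvironmentInteractingAgent):
    def __init__(self, env, droidrun_agent, name: str = "DroidRun"):
        super().__init__(env, name)
        self.agent = droidrun_agent  # DroidRun's CodeActAgent instance

    def step(self, goal: str) -> base_agent.AgentInteractionResult:
        # One plan/act/evaluate round of DroidRun, adapted to AndroidWorld's
        # step loop.
        result = self.agent.run_one_step(goal)  # hypothetical one-step API
        return base_agent.AgentInteractionResult(
            done=result.task_complete,           # agent declared the task finished
            data={"reasoning": result.explanation},
        )
```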

AppAgent

AppAgent integration: AppAgent’s integration was different from the others because it uses a vision-based approach. After cloning the AppAgent repository, we integrated the Python files in the scripts folder into AndroidWorld.

Vision-based approach: AppAgent’s working principle is as follows: it first takes a screenshot of the screen, then calculates the bounding boxes of UI elements, draws these boxes on the screenshot, assigns a number to each one, and sends this labeled screenshot to a multimodal LLM. The LLM visually determines which element should be clicked.

Wrapper configuration: The most important step in integrating AppAgent was redirecting the part that communicates with the Android device using AppAgent’s and_controller.py module to AndroidWorld’s emulator. In the appagent_agent.py wrapper, we reimplemented AppAgent’s get_screenshot and get_xml methods to work with AndroidWorld’s APIs. We also made AppAgent’s model.py file, which uses OpenAI API format, compatible with OpenRouter API.
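
A sketch of the labeled-screenshot step: draw a numbered box over each UI element before handing the image to the multimodal LLM. The function is our own illustration, with bounds in the "[x1,y1][x2,y2]" format from the XML dump shown earlier:

```python
# Illustrative screenshot labeling in the AppAgent style.
import re
from PIL import Image, ImageDraw

def label_screenshot(png_path: str, elements: list[dict], out_path: str) -> None:
    """Draw an indexed rectangle over each actionable element."""
    img = Image.open(png_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for idx, el in enumerate(elements):
        x1, y1, x2, y2 = map(int, re.findall(r"\d+", el["bounds"]))
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(idx), fill="red")
    img.save(out_path)  # this labeled image is what the LLM picks elements from
```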

Mobile-Agent (M3A)

Mobile-Agent (M3A) integration: M3A’s integration was the most comprehensive process because it is fully vision-based and has a very detailed UI analysis system. After cloning the M3A repository, we also installed the Mobile-Env Android interaction framework, which M3A depends on.

Multi-step analysis: M3A’s working principle is based on dividing the screen into grids, analyzing each grid separately, and doing multi-step planning. While creating the m3a_agent.py wrapper, we needed to integrate M3A’s own environment system with AndroidWorld’s environment. M3A normally uses its own Mobile-Env, but we redirected it to AndroidWorld’s env.

Multiple LLM calls: We observed that M3A makes multiple LLM calls at each step (such as planning, action selection, verification) and made them compatible with AndroidWorld’s step limits.
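
The sketch below shows why a single step can burn several LLM calls; llm is a stand-in for a chat-completion call, and the prompts are illustrative, not M3A's actual ones:

```python
# Illustrative multi-call step: plan, choose an action, then verify.
def m3a_style_step(llm, goal: str, screen_state: str) -> dict:
    plan = llm(f"Goal: {goal}\nScreen: {screen_state}\nWhat is the next sub-goal?")
    action = llm(f"Sub-goal: {plan}\nScreen: {screen_state}\nPick one UI action.")
    verdict = llm(f"Action taken: {action}\nDid it advance the goal? Answer yes/no.")
    return {"plan": plan, "action": action, "verified": verdict.strip().lower() == "yes"}
```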

Test procedure and data collection

Test flow: The test procedure for each agent worked as follows: First, we started the emulator with a clean snapshot. After the emulator was fully opened, we ran AndroidWorld’s run.py. We ran 65 tasks sequentially for each agent and used Claude 4.5 Sonnet for all agents.

Task execution: AndroidWorld automatically performed the following steps for each task: load the initial state of the task, start the agent, send the task goal to the agent, track the agent’s steps, stop when the maximum number of steps is reached or when the agent says “task completed”, and check whether the task was successful.

Success criteria: AndroidWorld’s task evaluation system includes predefined success criteria. For example, for the “Add contact named John Doe” task, AndroidWorld queries the contacts database to confirm the contact was added.

For calendar tasks, it checks from the database whether the event was created with the correct date, time, title, and description. At the end of each task execution, AndroidWorld provided us with execution time and success status (True/False). This data was automatically recorded and used for analysis.
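
A minimal sketch of a database-backed check in that spirit, querying the contacts provider over ADB (the helper is our own, not AndroidWorld's evaluator code):

```python
# Illustrative success check: look for a contact name in the contacts provider.
import subprocess

def contact_exists(name: str) -> bool:
    out = subprocess.run(
        ["adb", "shell", "content", "query",
         "--uri", "content://com.android.contacts/contacts",
         "--projection", "display_name"],
        check=True, capture_output=True, text=True,
    ).stdout
    return name in out
```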

Data collection: After completing the entire benchmark, we identified the single task that all agents successfully completed. Each agent then executed this task 10 times, and we averaged execution time, cost, and token consumption for more reliable performance metrics.
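
The averaging step itself is straightforward; in this sketch, run_task is a hypothetical stand-in for one AndroidWorld execution that returns (seconds, dollars, tokens):

```python
# Average the per-run metrics over repeated executions of the common task.
from statistics import mean

def average_metrics(run_task, runs: int = 10) -> dict:
    results = [run_task() for _ in range(runs)]
    times, costs, tokens = zip(*results)
    return {
        "avg_time_s": mean(times),
        "avg_cost_usd": mean(costs),
        "avg_tokens": mean(tokens),
    }
```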


Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (per Similarweb), including 55% of the Fortune 500, every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
