
Mobile AI Agents: Tools & Use Cases 

Cem Dilmegani
updated on Oct 21, 2025

At AIMultiple, we focus on developing and assessing Generative AI technologies such as custom GPTs, AI agents, and cloud GPU solutions. Another emerging area of interest is mobile AI agents. 

In this article, we explain what modern mobile AI agents are, how they work, and the tools enabling them.

Mobile AI agent tools benchmark


We benchmarked four agents on 65 real-world tasks in an Android emulator running real applications. The tasks covered calendar management, contact creation, photo capture, audio recording, and file operations, and we measured success rate, average latency per successful task (seconds), completion tokens per successful task, and cost per successful task.

DroidRun

Highest success rate (43%) with high cost per successful task ($0.075, ~3,225 tokens)

DroidRun demonstrated the strongest performance with a 43% success rate across the 65 tasks. When examining only the task that all agents successfully completed, DroidRun consumed an average of 3,225 tokens at a cost of $0.075 per task. This substantial resource consumption reflects DroidRun’s multi-step reasoning architecture, where the agent maintains detailed state tracking, generates explicit action plans, and provides explanations for each decision. While expensive, this comprehensive approach delivers the highest success rate in the benchmark.

Mobile-Agent

Strong performance (29%) and cost-efficient ($0.025, ~1,130 tokens)

Mobile-Agent achieved the second-highest success rate at 29% while maintaining reasonable cost-efficiency. On the task that all agents completed successfully, Mobile-Agent averaged $0.025 and 1,130 tokens per run. This represents approximately one-third of DroidRun’s per-task cost while achieving about two-thirds of its success rate, making Mobile-Agent an attractive option for deployments where budget constraints matter. However, the 14-percentage-point gap in success rate suggests that DroidRun’s additional reasoning capabilities provide meaningful value for mission-critical applications.

AutoDroid

Best cost-efficiency (14% success, $0.017, ~765 tokens) but limited effectiveness

AutoDroid demonstrated the lowest cost on the commonly completed task at just $0.017 and 765 tokens per run, making it the most economical option in the benchmark. However, its 14% success rate, less than half of Mobile-Agent’s performance and roughly one-third of DroidRun’s, indicates that this cost advantage comes with significant trade-offs in reliability. Despite using an action-based approach similar to DroidRun, AutoDroid’s minimal reasoning overhead results in substantial cost savings but limited task completion capability.

AppAgent

Poorest performance (7% success) with highest cost ($0.90, ~2,346 tokens)

AppAgent recorded both the lowest success rate at 7% and the highest cost on the commonly completed task at $0.90 and 2,346 tokens per run, twelve times more expensive than DroidRun and over fifty times more costly than AutoDroid. This exceptionally poor cost-to-performance ratio stems from AppAgent’s vision-based approach, which processes labeled screenshots through multimodal LLMs for every interaction. Each screenshot sent to the multimodal LLM consumes substantial input tokens for image processing, while the actual text responses (completion tokens) remain relatively modest. This creates a highly imbalanced token distribution where the vision processing overhead dominates the cost without corresponding improvements in task completion, as the agent struggles with coordinate calculations and UI element identification on mobile interfaces.

Mobile AI agent tools execution time comparison


On the single task that all agents successfully completed, AutoDroid was the fastest at 57 seconds, followed closely by Mobile-Agent at 66 seconds. DroidRun completed the task in 78 seconds, demonstrating that its multi-step reasoning architecture still enables efficient execution despite higher token consumption. AppAgent exhibited significantly higher latency at 180 seconds, due to its vision-based approach requiring extensive screenshot processing through multimodal LLMs for every interaction.

You can find our benchmark methodology below.

Mobile AI agent tools

Selection Criteria

Given the field’s novelty, we included tools/frameworks that met at least one of the following criteria:

  • 3,000+ GitHub stars
  • 100+ academic citations

GitHub star counts change rapidly, and we will update this list accordingly.

What is a mobile AI agent? 

Mobile AI agents are software systems that interact autonomously with users and mobile applications, using natural language inputs and goal-driven reasoning to complete tasks on behalf of users. Unlike traditional automation tools or early personal assistants, these agents are powered by AI. Some of their use cases include:

  • Mobile QA automation without test scripts
  • Automating mobile workflows like uploading ID documents or changing profile settings
  • AI assistants that operate apps on behalf of visually impaired or elderly users, or anyone who prefers hands-free control.
  • Everyday tasks such as creating calendar events or even completing Duolingo lessons.

DroidRun

DroidRun is an open-source framework for building mobile-native AI agents that can autonomously control mobile apps and phones. It is a foundational framework that converts user interfaces into structured data that large language models can understand and interact with, enabling complex automation directly on mobile devices.

DroidRun rapidly gained traction: over 900 developers signed up within 24 hours, and the project soared to 3.8k stars on GitHub, making it one of the fastest-growing frameworks for mobile AI agents.


AutoDroid

AutoDroid is a mobile task automation system designed to perform arbitrary tasks across any Android app without manual setup. It leverages the commonsense reasoning of large language models like GPT‑4 and Vicuna, combined with automated app-specific analysis. 

AutoDroid introduces a functionality-aware UI representation to connect app interfaces with LLMs, uses exploration-based memory injection to teach the model app-specific behaviors, and includes query optimization to reduce inference costs. Evaluated on a benchmark of 158 tasks, it achieved 90.9% action accuracy and 71.3% task success, outperforming GPT‑4-only baselines.1  

Mobile-Agent

The GitHub repo X-PLUG/MobileAgent is the official implementation of Mobile-Agent, an AI agent framework designed to autonomously control mobile applications by perceiving and reasoning over their visual UI representations.

This project comes from Alibaba’s X-PLUG group and was presented at ICLR 2024, aiming to push the boundaries of mobile agents by using multimodal learning, particularly visual perception and instruction following. See the demo video in the repository for it in action.

AppAgent

The GitHub repository TencentQQGYLab/AppAgent is an open-source research project from Tencent’s QQGYLab. It introduces AppAgent, a mobile AI agent framework designed to autonomously understand, operate, and reason through Android apps without human-written code for each individual app.

Source: AppAgent2

What are the features of a mobile AI agent?

Goal-oriented command handling

Users specify what they want done (e.g., “Book a ride to the airport”), not the individual steps.
The agent determines which apps to open, what actions to take, and how to sequence them.

LLM-backed reasoning

Often powered by large language models (e.g., GPT-4, Claude, Gemini), these agents can:

  • Understand user intent and screen content
  • Generate logical, step-by-step action plans
  • Adapt to dynamic UI changes across different app states
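
To make this concrete, here is a minimal sketch of the planning step: the agent sends the goal and a textual description of the current screen to an LLM and asks for the next action in a structured format. The model name, prompt wording, and action schema below are our own illustrative assumptions rather than any specific framework’s implementation.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (>=1.0) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def next_action(goal: str, screen_description: str) -> dict:
    """Ask an LLM for the next UI action as structured JSON (illustrative schema)."""
    prompt = (
        f"Goal: {goal}\n"
        f"Current screen elements:\n{screen_description}\n\n"
        'Reply with JSON only, e.g. {"action": "tap", "element_index": 3} '
        'or {"action": "input_text", "element_index": 5, "text": "..."}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # arbitrary choice; any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```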

Structured, native app control

Instead of relying on screen-scraping:

  • Agents extract structured UI hierarchies (e.g., XML-based trees of buttons and fields)
  • They interact directly with UI elements, treating them as first-class APIs.
    • Example: DroidRun uses Android Accessibility APIs to read and act on real UI elements.
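
A minimal sketch of this approach (our own illustration, not DroidRun’s code): dump the current UI hierarchy over ADB with uiautomator and collect the clickable elements with their bounds, so a model can refer to them by index.

```python
import re
import subprocess
import xml.etree.ElementTree as ET

def dump_clickable_elements() -> list[dict]:
    """Dump the UI hierarchy via ADB/uiautomator and return clickable nodes with centers."""
    subprocess.run(["adb", "shell", "uiautomator", "dump", "/sdcard/ui.xml"], check=True)
    xml = subprocess.run(["adb", "shell", "cat", "/sdcard/ui.xml"],
                         check=True, capture_output=True, text=True).stdout
    elements = []
    for node in ET.fromstring(xml).iter("node"):
        if node.get("clickable") == "true":
            # bounds look like "[x1,y1][x2,y2]"
            x1, y1, x2, y2 = map(int, re.findall(r"\d+", node.get("bounds", "")))
            elements.append({
                "index": len(elements),
                "text": node.get("text", ""),
                "resource_id": node.get("resource-id", ""),
                "center": ((x1 + x2) // 2, (y1 + y2) // 2),
            })
    return elements
```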

Cross-app workflow execution

Agents operate across multiple apps and multi-step workflows. They can replan if an intermediate step fails. For example, “Download a file from email → upload it to Google Drive → send a confirmation.”
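
Underneath such workflows is essentially an observe-act-verify loop that replans when a step fails. The sketch below illustrates the loop shape; the callables it takes (observe, plan_next, execute, is_done) are hypothetical stand-ins for whatever a concrete agent provides.

```python
from typing import Callable

def run_workflow(
    goal: str,
    observe: Callable[[], str],                   # returns a text description of the current screen
    plan_next: Callable[[str, str, list], dict],  # (goal, screen, history) -> next action
    execute: Callable[[dict], bool],              # performs the action, True on success
    is_done: Callable[[str, str], bool],          # (goal, screen) -> goal reached?
    max_steps: int = 20,
) -> bool:
    """Illustrative observe-act-verify loop with replanning on failure."""
    history: list[tuple[dict, bool]] = []
    for _ in range(max_steps):
        screen = observe()
        action = plan_next(goal, screen, history)
        ok = execute(action)
        history.append((action, ok))  # failed steps stay visible to the planner, so it can replan
        if ok and is_done(goal, observe()):
            return True
    return False
```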

Legacy context: traditional definitions of mobile AI agents

The term “mobile AI agent” has evolved. In the past, it referred to simple rule-based systems on phones, like calendar managers or context-aware notification tools. These agents were precursors to today’s AI systems but operated in a different technical landscape. Today, it describes autonomous software powered by large language models that can operate mobile apps on behalf of users.

Well-known systems that embodied this early definition include Siri, Google Assistant, and Amazon Alexa. Though still widely used, these assistants relied on static, rule-based architectures and did not exhibit deep reasoning or complete autonomy.

Key characteristics

Traditional mobile agents typically feature:

  • Rule-based logic: They followed pre-programmed responses and workflows with no adaptive reasoning.
  • On-device processing: Due to constraints in mobile memory, processing power, and network bandwidth, these agents performed all tasks locally.
  • Preset commands: Users had to phrase their requests in specific formats, as these systems couldn’t flexibly interpret natural language.
  • Basic context awareness: They used sensor inputs (like GPS or accelerometers) to provide location- or time-based alerts and recommendations, but their responses were predefined rather than dynamic.

Functional capabilities

These agents were designed to automate routine tasks such as:

  • Managing emails, calendars, and reminders
  • Delivering notifications based on time or location
  • Making voice-activated queries or device controls

Limitations compared to modern AI agents

Unlike modern mobile AI agents powered by LLMs, traditional systems:

  • Could not understand or interact with complex app interfaces
  • Lacked the ability to reason through multi-step tasks or adjust plans mid-execution
  • Operated in silos and couldn’t coordinate across different apps or workflows
  • Were deterministic, unable to adapt or learn from new environments or inputs

Despite their limitations, these early agents marked a significant step in mobile computing. They introduced users to voice-activated automation and laid the groundwork for the development of today’s far more capable, LLM-driven mobile AI agents.

Benchmark methodology

We conducted a benchmark evaluation to assess the performance of AI mobile agents operating on the Android operating system in real-world tasks. We used the AndroidWorld framework and tested all agents on the same standard tasks.

AndroidWorld Framework

AndroidWorld is an open-source benchmark platform specifically developed by Google Research for evaluating mobile agents. This platform aims to measure the performance of agents working in real Android applications through standardized tasks. The most important feature of AndroidWorld is that it uses real Android applications instead of artificial test environments and can automatically evaluate agents’ performance. We used 65 tasks in this study. These tasks cover daily mobile device usage scenarios such as calendar management, adding contacts, voice recording, taking photos, and file operations.

Environment Setup

To set up the benchmark environment, we first installed Android Studio on Windows 11 and configured Google’s official Android Emulator. We created a virtual device simulating a Pixel 6. We configured this virtual device with Android 13 (API Level 33), a 1080×2400 resolution, 8GB of RAM, and 20GB of storage. To integrate the emulator with AndroidWorld, we set the gRPC port to 8554, because AndroidWorld communicates with the emulator through this port.
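
For reference, launching the emulator with the expected gRPC port can be scripted as in the sketch below; the emulator path and AVD name are placeholders for our local setup.

```python
import subprocess

EMULATOR = r"C:\Android\Sdk\emulator\emulator.exe"  # placeholder SDK path
AVD_NAME = "AndroidWorldAvd"                        # placeholder AVD name

# Start the Pixel 6 AVD and expose the gRPC endpoint on port 8554,
# which AndroidWorld uses to control the device.
subprocess.Popen([EMULATOR, "-avd", AVD_NAME, "-no-snapshot", "-grpc", "8554"])
```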

To prepare the Python environment, we created a new conda environment with Python 3.11 using Miniconda. After cloning the AndroidWorld repository from GitHub, we installed all dependencies using pip. One of the most critical steps of AndroidWorld is the emulator setup process. The setup command took approximately 45-60 minutes. During this process, AndroidWorld automatically installed all Android applications to be tested on the emulator. It created initial state data for each application, for example, added some events to the calendar application, added contacts to the contacts application, and added a podcast named “banana” to the podcast application. It also saved snapshots for each task, so each task can start from a clean initial state.

Agent integrations

AutoDroid Integration

To integrate AutoDroid, we first cloned the repository from GitHub and installed the required Python packages. AutoDroid’s main feature is understanding UI elements by parsing XML and completing tasks with an action-based approach. The agent assigns an index number to each clickable or focusable element on the screen and receives commands from the LLM such as “tap(5)” or “text(‘hello’)”. For integration with AndroidWorld, we created a wrapper class named autodroid_agent.py. This wrapper performs the necessary configurations in AutoDroid’s initialization method, converts the task goal coming from AndroidWorld into a prompt format that AutoDroid can understand, and transforms the actions generated by AutoDroid into real ADB commands using AndroidWorld’s execute_adb_call functions. In AutoDroid’s step method, the agent first takes a screenshot and XML dump of the screen, parses UI elements, sends this information to the LLM, and performs tap, swipe, or text input actions according to the received response.
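
To illustrate the kind of translation this wrapper performs, the sketch below maps an action string such as tap(5) onto an ADB input command at the selected element’s screen coordinates. The function name and regexes are our own simplification, not AutoDroid’s actual code.

```python
import re
import subprocess

def execute_autodroid_action(action: str, elements: list[dict]) -> None:
    """Translate an AutoDroid-style action string into an ADB input command (simplified)."""
    if m := re.fullmatch(r"tap\((\d+)\)", action):
        x, y = elements[int(m.group(1))]["center"]
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
    elif m := re.fullmatch(r"text\('(.*)'\)", action):
        # %s encodes spaces for `adb shell input text`
        text = m.group(1).replace(" ", "%s")
        subprocess.run(["adb", "shell", "input", "text", text], check=True)
    else:
        raise ValueError(f"Unsupported action: {action}")
```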

DroidRun Integration

We followed a similar integration process for DroidRun. After cloning the DroidRun repository from GitHub, we installed the dependencies in requirements.txt. DroidRun’s architectural structure is more complex because it has a multi-step reasoning and state tracking system. DroidRun can explain not only what it will do at each step but also why, and can use the results of previous steps in the next step. We created the droidrun_agent.py wrapper for AndroidWorld integration. The most important part of this wrapper was making DroidRun’s own CodeActAgent class compatible with AndroidWorld’s base agent interface. When we call DroidRun’s execute_task method, the agent goes through a task planning phase, then executes each step and evaluates the results. We adapted this process to AndroidWorld’s step-by-step execution model. We also implemented the tools used by DroidRun (tap_by_index, start_app, list_packages, etc.) with AndroidWorld’s ADB commands.
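
Conceptually, the wrapper exposes DroidRun’s plan-then-execute flow through a step-by-step interface. The sketch below shows the shape of that adaptation; the class and method names (plan_task, execute_step) are illustrative assumptions, not verbatim code from either project.

```python
class DroidRunWrapper:
    """Illustrative adapter: expose a step() interface over a plan-then-execute agent."""

    def __init__(self, droidrun_agent):
        self.agent = droidrun_agent
        self.plan = None        # filled lazily on the first step
        self.step_index = 0

    def step(self, goal: str) -> dict:
        if self.plan is None:
            # Plan-then-execute agents draft the whole task plan up front.
            self.plan = self.agent.plan_task(goal)      # hypothetical planning call
        action = self.plan[self.step_index]
        result = self.agent.execute_step(action)        # hypothetical execution call
        self.step_index += 1
        done = self.step_index >= len(self.plan)
        return {"action": action, "result": result, "done": done}
```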

AppAgent Integration

AppAgent’s integration was different from the others because it uses a vision-based approach. After cloning the AppAgent repository, we integrated the Python files in the scripts folder into AndroidWorld. AppAgent’s working principle is as follows: it first takes a screenshot of the screen, then calculates the bounding boxes of UI elements, draws these boxes on the screenshot, assigns a number to each one, and sends this labeled screenshot to a multimodal LLM. The LLM visually determines which element should be clicked. The most important step in integrating AppAgent was redirecting the part that communicates with the Android device using AppAgent’s and_controller.py module to AndroidWorld’s emulator. In the appagent_agent.py wrapper, we reimplemented AppAgent’s get_screenshot and get_xml methods to work with AndroidWorld’s APIs. We also made AppAgent’s model.py file, which uses OpenAI API format, compatible with OpenRouter API.
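
The labeling step can be pictured as drawing a numbered box over each interactive element before the screenshot is sent to the multimodal model. The Pillow-based sketch below is our own illustration of that technique, not AppAgent’s drawing code.

```python
from PIL import Image, ImageDraw

def label_screenshot(path: str, elements: list[dict], out_path: str) -> None:
    """Draw numbered bounding boxes on a screenshot so a multimodal LLM can refer to them."""
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for i, el in enumerate(elements):
        x1, y1, x2, y2 = el["bounds"]          # pixel coordinates of the UI element
        draw.rectangle((x1, y1, x2, y2), outline="red", width=4)
        draw.text((x1 + 5, y1 + 5), str(i), fill="red")
    image.save(out_path)
```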

Mobile-Agent (M3A) Integration

M3A’s integration was the most comprehensive process because it works completely vision-based and has a very detailed UI analysis system. After cloning the M3A repository, we also installed the Mobile-Env Android interaction framework because M3A depends on this framework. M3A’s working principle is based on dividing the screen into grids, analyzing each grid separately, and doing multi-step planning. While creating the m3a_agent.py wrapper, we needed to integrate M3A’s own environment system with AndroidWorld’s environment. M3A normally uses its own Mobile-Env, but we redirected it to AndroidWorld’s env. We observed that M3A makes multiple LLM calls at each step (such as planning, action selection, verification) and made them compatible with AndroidWorld’s step limits.
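
To illustrate the grid idea, the sketch below splits a screenshot into equal tiles that can be analyzed separately; the 4×3 tiling is an arbitrary choice for illustration, not M3A’s actual configuration.

```python
from PIL import Image

def split_into_grid(path: str, rows: int = 4, cols: int = 3) -> list[Image.Image]:
    """Split a screenshot into rows x cols tiles for per-region analysis (illustrative)."""
    image = Image.open(path)
    w, h = image.size
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows, (c + 1) * w // cols, (r + 1) * h // rows)
            tiles.append(image.crop(box))
    return tiles
```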

Test procedure and data collection

The test procedure for each agent worked as follows: First, we started the emulator with a clean snapshot. After the emulator was fully opened, we ran AndroidWorld’s run.py. We ran 65 tasks sequentially for each agent and used Claude 4.5 Sonnet for all agents. AndroidWorld automatically performed the following steps for each task: load the initial state of the task, start the agent, send the task goal to the agent, track the agent’s steps, stop when the maximum number of steps is reached or when the agent says “task completed”, and check whether the task was successful.

AndroidWorld’s task evaluation system is quite sophisticated. There are predefined success criteria for each task. For example, for the “Add contact named John Doe” task, AndroidWorld queries the contacts database after the task is finished and checks whether a contact named “John Doe” has been added. For calendar tasks, it checks from the database whether the event was created with the correct date, time, title, and description. At the end of each task execution, AndroidWorld provided us with execution time and success status (True/False). This data was automatically recorded and used for analysis.
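
As an example of what such a database check looks like, the sketch below queries the contacts provider over ADB and verifies that a contact with the expected name exists. AndroidWorld defines its own per-task validators; this is only a simplified illustration.

```python
import subprocess

def contact_exists(display_name: str) -> bool:
    """Query the contacts content provider via ADB and look for the given name (simplified)."""
    result = subprocess.run(
        ["adb", "shell", "content", "query",
         "--uri", "content://com.android.contacts/contacts",
         "--projection", "display_name"],
        check=True, capture_output=True, text=True,
    )
    return display_name in result.stdout
```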

After completing the entire benchmark, we identified the task that all agents successfully completed. This task was then executed 10 times by each agent, and the average execution time, cost, and token consumption were calculated for more reliable performance metrics.

Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
