Benchmark primer: what “Phone Use” agents are actually measured on

A grounded explanation of phone-use benchmarks and how to interpret them.

Benchmarks for phone‑use agents are useful, but they are not the same as production reliability. This primer explains what these benchmarks typically measure and how to interpret them without hype.


What benchmarks usually measure

Most phone‑use benchmarks measure:

  • Task completion rate on predefined flows.
  • Accuracy of UI element selection.
  • Time or steps to reach a goal.
  • Error recovery in constrained scenarios.

These are controlled tasks, not real‑world chaos.
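
As a concrete illustration, the sketch below scores a single run against these metrics. The step-log format and field names ("target_correct", "recovered") are hypothetical assumptions, not part of any published benchmark.

```python
# Minimal sketch of scoring one benchmark run from an agent's step log.
# The log format (dicts with "target_correct" and "recovered" flags) is a
# hypothetical assumption, not a standard.
from dataclasses import dataclass

@dataclass
class RunScore:
    completed: bool            # task completion (pass/fail)
    selection_accuracy: float  # fraction of steps that hit the intended UI element
    steps: int                 # steps taken to reach the goal
    recovered_errors: int      # errors the agent recovered from on its own

def score_run(step_log: list[dict], goal_reached: bool) -> RunScore:
    correct = sum(1 for s in step_log if s.get("target_correct"))
    recovered = sum(1 for s in step_log if s.get("recovered"))
    accuracy = correct / len(step_log) if step_log else 0.0
    return RunScore(goal_reached, accuracy, len(step_log), recovered)
```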

What benchmarks do not guarantee

Benchmarks typically do not prove:

  • Safe behavior on financial or sensitive workflows.
  • Robustness across device types and OS versions.
  • Performance under degraded network conditions.
  • Compliance with app or platform policies.

Treat benchmark scores as a starting point, not a final verdict.

Typical benchmark structure

Benchmarks usually include:

  1. A fixed set of UI tasks (open settings, change a toggle, search).
  2. A scoring rubric (success/failure, steps, time).
  3. A standardized evaluation device or emulator.

This makes comparisons between models consistent, but it limits how far the results generalize.
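
A minimal sketch of what such a task definition and suite might look like in code; the field names, task ID, and device profile string are illustrative assumptions, and real benchmarks define their own schemas.

```python
# Hypothetical shape of a benchmark task definition and suite; field names,
# IDs, and the device profile string are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    task_id: str
    instruction: str             # e.g. "Enable dark mode from Settings"
    app: str                     # app under test
    success_criteria: list[str]  # observable end states checked by the rubric
    max_steps: int = 20          # step budget the rubric allows

@dataclass
class BenchmarkSuite:
    tasks: list[BenchmarkTask]
    device_profile: str = "pinned emulator image and OS version"

suite = BenchmarkSuite(tasks=[
    BenchmarkTask(
        task_id="settings-dark-mode",
        instruction="Enable dark mode from Settings",
        app="settings",
        success_criteria=["dark mode toggle is on"],
    ),
])
```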

Task design matters

Two benchmarks can measure very different things:

  • Simple navigation tasks vs. multi‑step workflows
  • Single‑app vs. cross‑app flows
  • Static screens vs. dynamic content

Read the task definitions carefully before drawing conclusions.
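
One practical way to account for these differences is to tag each task with its design attributes and slice success rates by tag, so you can see what a score actually covers. The task IDs and tag names below are illustrative assumptions.

```python
# Sketch: tag each task with its design attributes so scores can be sliced by
# task type. Task IDs and tag names are illustrative assumptions.
from collections import defaultdict

TASK_TAGS = {
    "settings-dark-mode": {"navigation", "single-app", "static"},
    "share-photo-to-chat": {"multi-step", "cross-app", "dynamic"},
}

def success_rate_by_tag(results: dict[str, bool]) -> dict[str, float]:
    """results maps task_id -> pass/fail; returns success rate per tag."""
    totals, passes = defaultdict(int), defaultdict(int)
    for task_id, passed in results.items():
        for tag in TASK_TAGS.get(task_id, set()):
            totals[tag] += 1
            passes[tag] += int(passed)
    return {tag: passes[tag] / totals[tag] for tag in totals}
```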

Dataset limitations

Benchmarks often:

  • Use a small set of apps
  • Assume stable UI layouts
  • Avoid high‑risk actions

That is good for safety, but it does not reflect all real‑world cases.

How to use benchmarks in practice

  • Use them to compare models before deeper testing.
  • Track changes across releases with the same benchmark.
  • Combine benchmark results with manual audits.
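
For tracking across releases, a simple regression check against a fixed task set can flag tasks that went from passing to failing. The result format (task_id mapped to pass/fail) is an assumption for illustration.

```python
# Sketch: run the same task set against two releases and flag regressions.
# The result format (task_id -> pass/fail) is an assumption for illustration.
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Tasks that passed in the baseline release but fail in the candidate."""
    return [t for t, passed in baseline.items() if passed and not candidate.get(t, False)]
```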

Building your own evaluation set

If you need real‑world relevance, create a small internal benchmark:

  • Pick 5–10 representative tasks.
  • Keep them low‑risk and repeatable.
  • Use the same device and OS version for consistency.

This gives you a grounded baseline without overstating results.
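
A minimal sketch of such an internal harness, assuming a hypothetical run_task helper that you would implement on top of your own agent and device tooling; repeating each task a few times gives a per-task pass rate rather than a single pass/fail.

```python
# Minimal sketch of an internal eval harness. `run_task` is a hypothetical
# helper you would implement on top of your agent and device tooling.
def run_task(task: str, device: str) -> bool:
    """Drive the agent through `task` on `device`; return True on success."""
    raise NotImplementedError  # placeholder for your own agent integration

def evaluate(tasks: list[str], device: str, repeats: int = 3) -> dict[str, float]:
    """Run each low-risk task several times on one pinned device/OS build."""
    rates = {}
    for task in tasks:
        passes = sum(run_task(task, device) for _ in range(repeats))
        rates[task] = passes / repeats
    return rates
```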

Reporting results responsibly

When sharing results:

  • Describe the device and model version.
  • List tasks and success criteria.
  • Avoid claims of general superiority.
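
One way to keep reports honest is to store the context alongside the numbers. The sketch below uses illustrative field names and placeholder values.

```python
# Sketch of a results report that records context alongside scores.
# Field names and placeholder values are illustrative.
import json

report = {
    "model": "<agent model and version>",
    "device": "<device or emulator, OS version>",
    "tasks": [
        {
            "task_id": "settings-dark-mode",
            "success_criteria": "dark mode toggle is on",
            "passed": True,
            "steps": 6,
        },
    ],
    "scope": "Results apply only to this task set, device, and build.",
}
print(json.dumps(report, indent=2))
```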

Interpreting score deltas

Small score changes can come from:

  • Minor UI layout shifts
  • Device OS updates
  • Prompt changes

Treat minor deltas as signals to investigate, not as definitive performance claims.
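
A rough statistical sanity check can help separate sampling noise from a real shift before you dig in. The sketch below uses a two-proportion z-test; it is a heuristic under simplifying assumptions and says nothing about whether a UI shift, OS update, or prompt change caused the difference.

```python
# Sketch: a rough check on whether a pass-rate delta exceeds sampling noise,
# using a two-proportion z-test. A heuristic only; it does not identify the
# cause of the difference.
from math import sqrt

def delta_is_notable(passed_a: int, n_a: int, passed_b: int, n_b: int,
                     z_threshold: float = 1.96) -> bool:
    p_a, p_b = passed_a / n_a, passed_b / n_b
    pooled = (passed_a + passed_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > z_threshold

# Example: 42/50 vs. 45/50 is within noise at roughly 95% confidence.
print(delta_is_notable(42, 50, 45, 50))  # False
```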

Example evaluation checklist

  • Can the agent complete the task without human help?
  • Does it take a safe path (no destructive actions)?
  • Can you reproduce the result on a different device?
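
Recording the checklist answers per run keeps them auditable later. A tiny sketch with illustrative field names:

```python
# Sketch: record checklist answers per run so they can be audited later.
# Field names mirror the questions above and are illustrative.
from dataclasses import dataclass

@dataclass
class ChecklistResult:
    run_id: str
    completed_without_help: bool      # no human intervention needed
    took_safe_path: bool              # no destructive or irreversible actions
    reproduced_on_other_device: bool  # result holds on a second device
```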

Safety note

Benchmarks are not safety certifications. Always add confirmation steps and safety guardrails before real workflows.

