
500,000 Tokens to Click a Dropdown

We benchmarked browser-use vision agents against auto-generated API endpoints on the same Reflex app. 47 steps and 495k tokens vs 8 calls and 12k tokens.

Palash Awasthi


AI has thrown enterprises into an arms race to automate internal workflows the fastest.

A large part of automating these workflows consists of operating internal tools. Most enterprises have 100+ dashboards, tools, and data visualization apps that employees spend significant time operating. Making these tools agent-accessible is one of the first automation steps for many enterprises.

Let's say an agent needed to resolve a customer complaint by looking up the customer, checking their orders, and updating records across an admin panel. There are two common paths to do so:

  1. Vision agents (browser/computer-use) are the most common approach in practice today from what I've seen. These agents point a vision model at the screen, reason over a screenshot, click, and repeat. This works, but it's slow, costly, and inaccurate by nature.
  2. Bespoke MCPs or APIs solve the performance problem but create a second project that needs maintenance. Setting these up for a single app involves defining tool schemas, writing handler functions that mirror the app's existing logic, managing authentication and permissions separately, and testing the integration independently of the UI. Multiply that by 15 or 100 apps, each with its own update cycle. These APIs are also stateless, requiring additional engineering bandwidth for session context. With this approach, engineering overhead compounds with app size and number.

Both approaches force a choice between reliability and engineering cost.

If the app is built with Reflex, our full-stack Python web framework, that tradeoff no longer applies. A new update programmatically generates API endpoints from the app's UI components, at the granularity of individual event handlers, so an agent can call the same function a button triggers.

The result is an application that serves both humans and agents without engineering overhead. The agent's session is stateful by default because it operates through the human layer. Any change in one interface programmatically updates the other because they share the same code path. As you build a good user experience, you're simultaneously optimizing a great agent experience.
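
As a concrete sketch: the handler and component names below are illustrative, and the endpoint-generation plugin itself is configured separately (see the docs linked at the end), but the shape is standard Reflex. The same Python event handler backs both the button a human clicks and the endpoint an agent calls.

```python
import reflex as rx


class OrderState(rx.State):
    status_message: str = ""

    def mark_delivered(self, order_id: int):
        # One code path: a human triggers this from the button below,
        # an agent triggers it through the auto-generated endpoint.
        # (The actual database update is omitted in this sketch.)
        self.status_message = f"Order {order_id} marked as delivered."


def index() -> rx.Component:
    return rx.vstack(
        rx.button(
            "Mark as delivered",
            on_click=OrderState.mark_delivered(42),  # illustrative order id
        ),
        rx.text(OrderState.status_message),
    )


app = rx.App()
app.add_page(index)
```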

To quantify the gap between this and the vision agent default, we ran a benchmark on the same Reflex app, with one agent driving its UI and another calling its auto-generated endpoints.

We used a Reflex port of the react-admin Posters Galore demo, an admin panel for managing customers, orders, and reviews. Both agents target the same running Reflex app: the vision agent navigates its compiled React frontend at localhost:3001, and the API agent calls its auto-generated endpoints at localhost:8001. The interface is the only variable, so any difference in cost or accuracy comes from the interface itself.

We gave each agent the task described above:

Find the customer named "Smith" with the most orders. Locate their most recent pending order. Accept all of their pending reviews. Mark the order as delivered.

This touches three resources and requires filtering, pagination, cross-entity lookups, and both reads and writes. These are common operations for any internal tool.

We ran the task two ways against the same pinned dataset:

Path A: Vision agent. Claude Sonnet driving the Reflex app's React UI via browser-use 0.12, vision mode, taking screenshots and executing clicks to navigate the app. The prompt was a 14-step UI walkthrough naming the sidebar items, tabs, and form fields to interact with.
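
A minimal sketch of what the Path A runner looks like, assuming browser-use's standard `Agent` interface driven by a LangChain `ChatAnthropic` model. The walkthrough text, model string, and agent options shown here are placeholders, not the benchmark's exact prompt or configuration.

```python
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic

# Abbreviated stand-in for the 14-step UI walkthrough described above.
WALKTHROUGH = """
1. Open http://localhost:3001 and click "Customers" in the sidebar.
2. Filter the customer list by the last name "Smith".
...
14. Set the order's status to "delivered" and save.
"""


async def main() -> None:
    agent = Agent(
        task=WALKTHROUGH,
        llm=ChatAnthropic(model="claude-sonnet-4-20250514"),  # "Claude Sonnet" per the post
        use_vision=True,  # a screenshot is sent to the model on every step
    )
    await agent.run()


asyncio.run(main())
```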

Path B: API agent. Claude Sonnet with tool-use, calling REST tools that map to the Reflex app's auto-generated event-handler endpoints. The prompt was the abstract task above with no UI guidance.
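
And a sketch of the Path B loop, assuming the Anthropic SDK's standard tool-use flow. The tool names echo the ones mentioned later (`list_customers`, `update_order`), but the schemas, routes, and model string are illustrative; the real ~30-line tool surface lives in the benchmark repo.

```python
import anthropic
import httpx

API_BASE = "http://localhost:8001"

TASK_PROMPT = (
    'Find the customer named "Smith" with the most orders. Locate their most '
    "recent pending order. Accept all of their pending reviews. Mark the order "
    "as delivered."
)

# Illustrative tool surface wrapping the auto-generated endpoints.
TOOLS = [
    {
        "name": "list_customers",
        "description": "List customers, optionally filtered by last name.",
        "input_schema": {
            "type": "object",
            "properties": {"last_name": {"type": "string"}},
        },
    },
    # ... list_orders, list_reviews, update_review, update_order, etc.
]


def call_tool(name: str, args: dict) -> str:
    # Hypothetical dispatch onto the plugin's generated routes.
    if name.startswith("list_"):
        resp = httpx.get(f"{API_BASE}/{name.removeprefix('list_')}", params=args)
    else:
        resp = httpx.post(f"{API_BASE}/{name}", json=args)
    return resp.text


client = anthropic.Anthropic()
messages = [{"role": "user", "content": TASK_PROMPT}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # "Claude Sonnet" per the post
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model produced its final answer
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": call_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ],
    })
```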

We used the same model on the same data, giving it the same task. All code is open source.

| Metric | Vision agent (Sonnet) | API (Sonnet) | API (Haiku) |
| --- | --- | --- | --- |
| Steps / calls | 53 ± 13 | 8 ± 0 | 8 ± 0 |
| Wall-clock time | 1003s ± 254s (~17 min) | 19.7s ± 2.8s | 7.7s ± 0.5s |
| Input tokens | 550,976 ± 178,849 | 12,151 ± 27 | 9,478 ± 809 |
| Output tokens | 37,962 ± 10,850 | 934 ± 41 | 819 ± 52 |

Numbers are mean ± sample standard deviation (n−1), with n=5 per API path and n=3 for the vision path. API Sonnet was the most consistent: identical 8 tool calls across all 5 trials and only ±27 tokens of input variation. Browser Sonnet varied widely — the longest run consumed 751k input tokens across 68 cycles (vs. 407k across 43 in the shortest), driving the large standard deviations on every metric. Full run details are available in the repo.
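
For reference, a quick check of the mean ± sample standard deviation convention, using the two vision-run step counts quoted above (43 and 68) plus an assumed middle run of 48 steps that reproduces the table's 53 ± 13:

```python
from statistics import mean, stdev

steps = [43, 48, 68]  # 43 and 68 are quoted above; 48 is an assumed middle run
# statistics.stdev is the sample standard deviation (n-1 in the denominator)
print(f"{mean(steps):.0f} ± {stdev(steps):.0f}")  # -> 53 ± 13
```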

With the abstract task as its prompt, the vision agent found one of four pending reviews, accepted it, and moved on. It never paginated. We had to rewrite the prompt as the 14-step walkthrough before the agent could complete the task, and even with that walkthrough it averaged ~53 round-trips per run, each carrying a full-page screenshot worth thousands of tokens, most of them spent looking at dropdown menus and table rows.

The API agent completed the same task on a six-sentence prompt in 8 calls with no prompt iteration. It called GET /reviews?customer_id=421&status=pending, which returned the pending reviews directly. There is nothing to infer from pixels, and thus nothing to retry.
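
Issued directly, that call looks roughly like the following. The route prefix and response shape are assumptions; the real routes are whatever the plugin generates for this app.

```python
import httpx

# The call quoted above, against the benchmark app's API at localhost:8001.
resp = httpx.get(
    "http://localhost:8001/reviews",
    params={"customer_id": 421, "status": "pending"},
)
pending_reviews = resp.json()  # assumed to return the matching reviews as JSON
```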

Haiku could not complete the vision path. The failure was specific to browser-use 0.12's structured-output schema, which Haiku could not reliably produce in either vision or text-only mode. On the API path, Haiku finished in under 8 seconds.

The cost difference follows directly from the architecture. An agent that must see to act will always pay for the seeing, regardless of how good the model gets.

With Reflex, you write your frontend and backend in Python; under the hood it compiles to a React frontend and a Starlette backend. The plugin extends that backend by exposing the interactive surface of your UI as API endpoints, making every action a human can take in the browser directly callable by an agent.

The Reflex app we benchmarked has three pages in about 500 lines of Python. The React UI a human uses and the API an agent calls both come from those same lines. There is no second codebase to maintain.

P.S. This is not a knock on vision agents or browser-use specifically. Vision agents are the best choice from a consumer's perspective when you're forced to use an app with no agent-friendly interface. They're simply obsolete when you can control how that app is built and both you and your agents are using it. We also loved using browser-use and found it succeeded more often than similar tools we tested :)

Vision results are specific to browser-use 0.12 in vision mode, and other vision agents may behave differently. The Path B runner shapes the auto-generated endpoints into a small REST tool surface of about thirty lines, which the agent sees as list_customers, update_order, and similar. The dataset is pinned and small (900 customers, 600 orders, 324 reviews), so behavior on production-scale data is not measured here. The vision agent runs through LangChain's ChatAnthropic, and the API agent runs through the Anthropic SDK directly. Reported token counts are uncached input tokens.

All benchmark code is open-source: github.com/reflex-dev/agent-benchmark

The repo includes seed data generation, the patched react-admin demo, both agent scripts, and raw results.

Reflex agent-native plugin docs: reflex/docs/enterprise/event-handler-api.md
