
500,000 Tokens to Click a Dropdown

We benchmarked browser-use vision agents against auto-generated API endpoints on the same Reflex app. 47 steps and 495k tokens vs 8 calls and 12k tokens.

Palash Awasthi


AI has thrown enterprises into an arms race to automate internal workflows the fastest.

A large part of automating these workflows consists of operating internal tools. Most enterprises have 100+ dashboards, tools, and data visualization apps that employees spend significant time operating. Making these tools agent-accessible is one of the first automation steps for many enterprises.

Let's say an agent needed to resolve a customer complaint by looking up the customer, checking their orders, and updating records across an admin panel. There are two common paths to do so:

  1. Vision agents (browser/computer-use) are the most common approach in practice today from what I've seen. These agents point a vision model at the screen, reason over a screenshot, click, and repeat. This works, but it's slow, costly, and inaccurate by nature.
  2. Bespoke MCPs or APIs solve the performance problem but create a second project that needs maintenance. Setting these up for a single app involves defining tool schemas, writing handler functions that mirror the app's existing logic, managing authentication and permissions separately, and testing the integration independently of the UI. Multiply that by 15 or 100 apps, each with its own update cycle. These APIs are also stateless, requiring additional engineering bandwidth for session context. With this approach, engineering overhead compounds with app size and number.

Both approaches force a choice between reliability and engineering cost.

If the app is built with Reflex, our full-stack Python web framework, that tradeoff no longer applies. A new update programmatically generates API endpoints from the app's UI components, at the granularity of individual event handlers, so an agent can call the same function a button triggers.

The result is an application that serves both humans and agents without engineering overhead. The agent's session is stateful by default because it operates through the human layer. Any change in one interface programmatically updates the other because they share the same code path. As you build a good user experience, you're simultaneously optimizing a great agent experience.
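
As a concrete sketch: the handler and component names below are illustrative, and the endpoint-generation plugin itself is configured separately (see the docs linked at the end), but the shape is standard Reflex. The same Python event handler backs both the button a human clicks and the endpoint an agent calls.

```python
import reflex as rx


class OrderState(rx.State):
    status_message: str = ""

    def mark_delivered(self, order_id: int):
        # One code path: a human triggers this from the button below,
        # an agent triggers it through the auto-generated endpoint.
        # (The actual database update is omitted in this sketch.)
        self.status_message = f"Order {order_id} marked as delivered."


def index() -> rx.Component:
    return rx.vstack(
        rx.button(
            "Mark as delivered",
            on_click=OrderState.mark_delivered(42),  # illustrative order id
        ),
        rx.text(OrderState.status_message),
    )


app = rx.App()
app.add_page(index)
```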

To quantify the gap between this and the vision agent default, we ran a benchmark on the same Reflex app, with one agent driving its UI and another calling its auto-generated endpoints.

We used a Reflex port of the react-admin Posters Galore demo, an admin panel for managing customers, orders, and reviews. Both agents target the same running Reflex app: the vision agent navigates its compiled React frontend at localhost:3001, and the API agent calls its auto-generated endpoints at localhost:8001. The interface is the only variable, so any difference in cost or accuracy comes from the interface itself.

We gave each agent the task described above:

Find the customer named "Smith" with the most orders. Locate their most recent pending order. Accept all of their pending reviews. Mark the order as delivered.

This touches three resources and requires filtering, pagination, cross-entity lookups, and both reads and writes. These are common operations for any internal tool.

We ran the task two ways against the same pinned dataset:

Path A: Vision agent. Claude Sonnet driving the Reflex app's React UI via browser-use 0.12, vision mode, taking screenshots and executing clicks to navigate the app. The prompt was a 14-step UI walkthrough naming the sidebar items, tabs, and form fields to interact with.
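
A minimal sketch of what the Path A runner looks like, assuming browser-use's standard `Agent` interface driven by a LangChain `ChatAnthropic` model. The walkthrough text, model string, and agent options shown here are placeholders, not the benchmark's exact prompt or configuration.

```python
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic

# Abbreviated stand-in for the 14-step UI walkthrough described above.
WALKTHROUGH = """
1. Open http://localhost:3001 and click "Customers" in the sidebar.
2. Filter the customer list by the last name "Smith".
...
14. Set the order's status to "delivered" and save.
"""


async def main() -> None:
    agent = Agent(
        task=WALKTHROUGH,
        llm=ChatAnthropic(model="claude-sonnet-4-20250514"),  # "Claude Sonnet" per the post
        use_vision=True,  # a screenshot is sent to the model on every step
    )
    await agent.run()


asyncio.run(main())
```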

Path B: API agent. Claude Sonnet with tool-use, calling REST tools that map to the Reflex app's auto-generated event-handler endpoints. The prompt was the abstract task above with no UI guidance.
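
And a sketch of the Path B loop, assuming the Anthropic SDK's standard tool-use flow. The tool names echo the ones mentioned later (`list_customers`, `update_order`), but the schemas, routes, and model string are illustrative; the real ~30-line tool surface lives in the benchmark repo.

```python
import anthropic
import httpx

API_BASE = "http://localhost:8001"

TASK_PROMPT = (
    'Find the customer named "Smith" with the most orders. Locate their most '
    "recent pending order. Accept all of their pending reviews. Mark the order "
    "as delivered."
)

# Illustrative tool surface wrapping the auto-generated endpoints.
TOOLS = [
    {
        "name": "list_customers",
        "description": "List customers, optionally filtered by last name.",
        "input_schema": {
            "type": "object",
            "properties": {"last_name": {"type": "string"}},
        },
    },
    # ... list_orders, list_reviews, update_review, update_order, etc.
]


def call_tool(name: str, args: dict) -> str:
    # Hypothetical dispatch onto the plugin's generated routes.
    if name.startswith("list_"):
        resp = httpx.get(f"{API_BASE}/{name.removeprefix('list_')}", params=args)
    else:
        resp = httpx.post(f"{API_BASE}/{name}", json=args)
    return resp.text


client = anthropic.Anthropic()
messages = [{"role": "user", "content": TASK_PROMPT}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # "Claude Sonnet" per the post
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model produced its final answer
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": call_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ],
    })
```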

We used the same model on the same data, giving it the same task. All code is open source.

| Metric | Vision agent (Sonnet) | API (Sonnet) | API (Haiku) |
| --- | --- | --- | --- |
| Steps / calls | 53 ± 13 | 8 ± 0 | 8 ± 0 |
| Wall-clock time | 1003s ± 254s (~17 min) | 19.7s ± 2.8s | 7.7s ± 0.5s |
| Input tokens | 550,976 ± 178,849 | 12,151 ± 27 | 9,478 ± 809 |
| Output tokens | 37,962 ± 10,850 | 934 ± 41 | 819 ± 52 |

Numbers are mean ± sample standard deviation (n−1), with n=5 per API path and n=3 for the vision path. API Sonnet was the most consistent: identical 8 tool calls across all 5 trials and only ±27 tokens of input variation. Browser Sonnet varied widely — the longest run consumed 751k input tokens across 68 cycles (vs. 407k across 43 in the shortest), driving the large standard deviations on every metric. Full run details are available in the repo.
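
For reference, a quick check of the mean ± sample standard deviation convention, using the two vision-run step counts quoted above (43 and 68) plus an assumed middle run of 48 steps that reproduces the table's 53 ± 13:

```python
from statistics import mean, stdev

steps = [43, 48, 68]  # 43 and 68 are quoted above; 48 is an assumed middle run
# statistics.stdev is the sample standard deviation (n-1 in the denominator)
print(f"{mean(steps):.0f} ± {stdev(steps):.0f}")  # -> 53 ± 13
```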

With the abstract task as its prompt, the vision agent found one of four pending reviews, accepted it, and moved on. It never paginated. We had to rewrite the prompt as the 14-step walkthrough before the agent could complete the task, and even with that walkthrough it averaged ~53 round-trips per run, each carrying a full-page screenshot worth thousands of tokens, most of them spent looking at dropdown menus and table rows.

The API agent completed the same task on a six-sentence prompt in 8 calls with no prompt iteration. It called GET /reviews?customer_id=421&status=pending, which returned the pending reviews directly. There is nothing to infer from pixels, and thus nothing to retry.
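
Issued directly, that call looks roughly like the following. The route prefix and response shape are assumptions; the real routes are whatever the plugin generates for this app.

```python
import httpx

# The call quoted above, against the benchmark app's API at localhost:8001.
resp = httpx.get(
    "http://localhost:8001/reviews",
    params={"customer_id": 421, "status": "pending"},
)
pending_reviews = resp.json()  # assumed to return the matching reviews as JSON
```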

Haiku could not complete the vision path. The failure was specific to browser-use 0.12's structured-output schema, which Haiku could not reliably produce in either vision or text-only mode. On the API path, Haiku finished in under 8 seconds.

The cost difference follows directly from the architecture. An agent that must see to act will always pay for the seeing, regardless of how good the model gets.

With Reflex, you write your frontend and backend in Python; under the hood it compiles to a React frontend and a Starlette backend. The plugin extends that backend by exposing the interactive surface of your UI as API endpoints, making every action a human can take in the browser directly callable by an agent.

The Reflex app we benchmarked has three pages in about 500 lines of Python. The React UI a human uses and the API an agent calls both come from those same lines. There is no second codebase to maintain.

P.S. This is not a knock on vision agents or browser-use specifically. Vision agents are the best choice from a consumer's perspective when you're forced to use an app with no agent-friendly interface. They're simply obsolete when you can control how that app is built and both you and your agents are using it. We also loved using browser-use and found it succeeded more often than similar tools we tested :)

Vision results are specific to browser-use 0.12 in vision mode, and other vision agents may behave differently. The Path B runner shapes the auto-generated endpoints into a small REST tool surface of about thirty lines, which the agent sees as list_customers, update_order, and similar. The dataset is pinned and small (900 customers, 600 orders, 324 reviews), so behavior on production-scale data is not measured here. The vision agent runs through LangChain's ChatAnthropic, and the API agent runs through the Anthropic SDK directly. Reported token counts are uncached input tokens.

All benchmark code is open-source: github.com/reflex-dev/agent-benchmark

The repo includes seed data generation, the patched react-admin demo, both agent scripts, and raw results.

Reflex agent-native plugin docs: reflex/docs/enterprise/event-handler-api.md
