Browser Use vs Claude Computer Use: DOM vs Vision

· 25 min read

Browser automation agents can now fill forms, scrape pages, and navigate multi-step flows with no human input. Two approaches have emerged for how they understand the page: vision only, where the agent sees a screenshot and clicks pixel coordinates, and DOM access, where the agent reads the page's Document Object Model (DOM), a structured map of every element and its label, alongside the screenshot. The choice between them shows up in how much debugging each approach needs and which tasks break under real conditions.

This guide runs Browser Use (DOM plus vision) and the Claude Computer Use API (vision only) through the same five tasks to show where each approach holds up and where it doesn't.

The five tasks are:

  • Filling out a complex form with date pickers and dropdowns
  • Scraping a leaderboard
  • Extracting structured JSON from search results
  • Playing the I'm Not a Robot CAPTCHA game on Neal.fun
  • Navigating from the Wikipedia article on Cleopatra to Albert Einstein using only body-text links

Here, a Browser Use agent is on level 2 of Neal.fun's I'm Not a Robot game. It selects six squares it believes contain stop signs, fails verification, reassesses, deselects the two squares that show only the sign's pole, and passes.

Video is sped up 21x. The actual task took 12 minutes 23 seconds.

I also test the developer experience of each tool by letting a coding agent do all the building. For each task, I write a prompt and hand it to Claude Code with no manual intervention. The agent picks the model and writes the full implementation, including dependency setup. I only paste in an API key when prompted.


Two approaches to browser control

Browser automation agents need to understand what is on the page before they can act on it. The two tools in this guide take fundamentally different approaches to that problem.

DOM access plus vision: Browser Use

On each step, Browser Use extracts the accessibility tree from the page: a structured list of every interactive element, its type (button, input, link), its visible label, and its index. Your LLM receives this tree alongside a screenshot, then calls one of Browser Use's built-in actions by element index.

Because the agent identifies elements by name rather than pixel coordinates, standard form interactions are precise and repeatable. Vision fills in where the DOM falls short, such as reading CAPTCHA images or dismissing ad popups that don't appear in the tree.
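To make the contrast concrete, here is a rough sketch of the kind of indexed element list a DOM-based agent receives on each step. The format, field names, and `render_elements` helper are illustrative, not Browser Use's actual serialization:

```python
# Illustrative sketch only: a simplified stand-in for the indexed element
# list a DOM-based agent sees. Field names here are hypothetical, not
# Browser Use's real format.
elements = [
    {"index": 0, "type": "input", "label": "First Name"},
    {"index": 1, "type": "input", "label": "Last Name"},
    {"index": 2, "type": "button", "label": "Submit"},
]

def render_elements(elements):
    """Render each element as an '[index] <type> "label"' line for the prompt."""
    return "\n".join(f'[{e["index"]}] <{e["type"]}> "{e["label"]}"' for e in elements)

print(render_elements(elements))
```

The agent then acts by index ("click element 2") rather than by pixel coordinates, which is why form interactions stay repeatable across layout changes.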

Browser Use is an open-source Python library (MIT, v0.12.0, 79k+ GitHub stars). You write a task in plain English, call agent.run(), and the library runs the agent loop, browser control via Playwright, DOM extraction, and action dispatch. For production use, it also supports:

  • Remote browsers and Docker containers
  • Session persistence and browser profiles for authentication reuse
  • Pydantic structured output
  • Parallel agents via asyncio.gather

When you run Browser Use locally, a browser window opens on your machine and you can watch the agent work. In production, you'd swap that for a remote or headless browser. Browser Use works only inside the browser and has no access to native desktop applications or the file system.
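The parallel-agents item above is plain asyncio. A sketch of the pattern with stub coroutines standing in for real `Agent.run()` calls, so it runs without a browser or API key:

```python
import asyncio

# Stub standing in for browser_use's Agent.run(); a real agent would
# drive a browser session here. The concurrency pattern is identical.
async def run_agent(task: str) -> str:
    await asyncio.sleep(0.01)  # simulate browser work
    return f"done: {task}"

async def main():
    tasks = ["scrape page A", "scrape page B", "scrape page C"]
    # gather() runs all agents concurrently and preserves input order.
    return await asyncio.gather(*(run_agent(t) for t in tasks))

results = asyncio.run(main())
print(results)
```

With real agents, each coroutine would own its own browser session; the gather call itself is unchanged.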

Vision only: Claude Computer Use

Claude Computer Use gives Claude a computer tool: the ability to take screenshots and send mouse and keyboard actions to pixel coordinates. Claude sees the screen as an image, with no DOM access, and infers where to click from what it sees.

Claude Computer Use is a beta Anthropic API feature that works with Claude Opus 4.6 and Sonnet 4.6. Unlike Browser Use, it supplies no agent loop: your application sends Claude a prompt and a screenshot, Claude returns a single action, your application executes it, captures a new screenshot, and sends that back. You are responsible for:

  • Browser launch and page navigation
  • Screenshot capture and base64 encoding
  • API call construction and response parsing
  • Context management and image trimming to stay within token limits
  • Rate limiting
  • Fallback logic when the model misfires

When you run Computer Use locally, no browser window appears. The agent launches a headless Playwright browser, captures screenshots internally, and sends them to the API. You never see it working, and in production there is no visible desktop to manage. The tradeoff is that you are responsible for every part of the loop that Browser Use handles for you.

Computer Use can control any application visible on screen, not just browsers. That makes it the right tool for native desktop apps and file pickers where a DOM does not exist. For browser tasks, the question is whether pixel-level vision is sufficient or whether the absence of structured element data creates problems that matter. The five tasks below show where the line is.
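To show how much of the loop falls to the developer, here is a generic sketch of the screenshot-action cycle. All three callables (`call_model`, `execute_action`, `take_screenshot`) are hypothetical stand-ins for the Anthropic API call, the Playwright dispatch, and the capture code a real implementation would supply; they are stubbed so only the loop's shape is visible:

```python
def agent_loop(call_model, execute_action, take_screenshot, max_turns=10):
    """Generic screenshot -> model -> action loop, the part Computer Use
    leaves to the developer. All three callables are injected stubs here."""
    messages = []
    for _ in range(max_turns):
        messages.append({"role": "user", "screenshot": take_screenshot()})
        action = call_model(messages)   # one API round-trip per turn
        if action["type"] == "done":
            return action["result"]
        execute_action(action)          # click / type / scroll
        messages.append({"role": "assistant", "action": action})
    raise RuntimeError("agent did not finish within max_turns")

# Stub model: click once, then declare success.
script = iter([{"type": "click", "x": 10, "y": 20},
               {"type": "done", "result": "ok"}])
result = agent_loop(
    call_model=lambda msgs: next(script),
    execute_action=lambda action: None,
    take_screenshot=lambda: b"png-bytes",
)
print(result)  # → ok
```

Everything in the responsibility list above (encoding, parsing, context trimming, rate limiting, fallbacks) attaches to one of these three callables or to the message list they share.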


Setting up

Both setups require Python 3.12 or later and an Anthropic API key. I start each tool from scratch in an empty folder.

Browser Use: setup resolves to 12 lines

I give the coding agent this prompt:

❯ Set up a Python project for a Browser Use agent. Use the browser-use Python library
with Claude as the LLM. Get any API keys or credentials you need — if there's a CLI
tool or other way to get them programmatically, use it. If you need me to provide
anything manually, tell me what you need. Once set up, run a simple smoke test: go to
example.com and print the page title.

The only thing I do is paste in an API key when prompted. The agent handles the rest: venv, dependencies, Playwright, and a working smoke test on the first run.

import asyncio
import os

from browser_use import Agent
from browser_use.llm import ChatAnthropic

async def main():
    llm = ChatAnthropic(model="claude-sonnet-4-5", api_key=os.getenv("ANTHROPIC_API_KEY"))
    agent = Agent(task="Go to example.com and tell me the page title.", llm=llm)
    print("Result:", (await agent.run()).final_result())

asyncio.run(main())

The agent loop, browser lifecycle, DOM extraction, and action dispatch are all inside browser-use. The working setup is 12 lines.

Computer Use: the smoke test is already 89 lines

I give the same coding agent the equivalent prompt:

❯ Set up a Python project for a Claude Computer Use agent that controls a local browser.
Use the Claude Computer Use API with a local Playwright browser. Get any API keys or
credentials you need — if there's a CLI tool or other way to get them programmatically,
use it. If you need me to provide anything manually, tell me what you need. Once set up,
run a simple smoke test: open a browser, take a screenshot, and print "done".

Again, I only paste in an API key when prompted. The agent creates the venv, installs dependencies, and passes the smoke test on the first run.

The smoke test is 89 lines. It is the minimum required to do one thing: launch a browser, take a screenshot, encode it as base64, send it to the Claude API with the computer tool configured, and print the response. The smoke test has no loop and no state management. Those come on top of this, and the developer writes all of it.

# developer writes: browser launch, screenshot, base64 encode, API call, parse response
browser = await p.chromium.launch(headless=True)
page = await browser.new_page(viewport={"width": 1280, "height": 800})
await page.goto("https://example.com")
screenshot_b64 = base64.standard_b64encode(await page.screenshot()).decode()
response = client.beta.messages.create(
    model="claude-opus-4-6",
    tools=[{"type": "computer_20251124", "name": "computer", ...}],
    messages=[{"role": "user", "content": [{"type": "image", ...}, {"type": "text", ...}]}],
    betas=["computer-use-2025-11-24"],
)

The Computer Use scripts for the tasks below run from 187 to 330 lines. The 89-line smoke test is the floor. Every task adds on top of it, and the developer writes all of it.

Model choices and cost

The coding agent picks claude-sonnet-4-5 for Browser Use and claude-opus-4-6 for Computer Use. I keep both throughout to measure the out-of-the-box experience rather than an optimized configuration. Opus costs roughly 15x more per token than Sonnet, which means Computer Use starts the comparison heavier on setup and heavier on cost. Across these five tasks, it doesn't earn that back.


Task 1: Form filling

Forms are where DOM access should have the clearest advantage. The accessibility tree gives the agent named elements and typed fields for every input, which is exactly what form filling requires. I use demoqa.com's automation practice form, a public test site with dynamic dropdowns, a date picker, and autocomplete fields, to see if that advantage holds.

I give both tools the same prompt:

❯ Fill out the form at https://demoqa.com/automation-practice-form with the following details:
- First name: Jane
- Last name: Smith
- Email: jane.smith@example.com
- Gender: Female
- Mobile: 0123456789
- Date of birth: 15 Mar 1990
- Subject: English
- Hobby: Reading
- Current address: 123 Test Street, London
- State: NCR
- City: Delhi

Time how long it takes from start to submission. Print whether the form submitted successfully and the time taken.

The demoqa.com form is public. You can run both tools against it using the setup prompts above.


Browser Use: first run, no debugging

The coding agent produces a script that hands the task off as a single string:

llm = ChatAnthropic(model="claude-sonnet-4-5")
agent = Agent(
    task="""
Go to https://demoqa.com/automation-practice-form and fill out the form with the following details:
- First Name: Jane
- Last Name: Smith
...
After filling all fields, click the Submit button.
Report what confirmation message or result appears after submission.
""",
    llm=llm,
)
result = await agent.run()

Browser Use works through the form without any intervention. It hits one self-correcting error on the date picker and deals with an ad popup mid-way through, but handles both without getting stuck. The confirmation modal reads "Thanks for submitting the form."

The video below shows the full run, including the ad popup and the date picker self-correction.

Video is sped up. The actual task took 8 minutes 18 seconds (498.4s).

The form submits on the first run in 25 turns and 498.4s, with no debugging.


Computer Use: 42 debug messages, date picker never solved visually

The same task requires building the agent loop from scratch. Before the loop even runs, the script needs two Playwright helpers to handle problems the loop alone can't solve.

The first is image trimming. Every turn sends all previous screenshots back to the API as base64. After a few turns the context grows large enough to hit the rate limit (30,000 input tokens per minute). The fix replaces older screenshots with a 1x1 placeholder:

def trim_old_images(messages: list, keep_last_n: int = 2) -> list:
    """Replace base64 image data in older messages with a tiny placeholder."""
    PLACEHOLDER = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg=="
    user_msg_indices = [i for i, m in enumerate(messages) if m["role"] == "user"]
    to_trim = user_msg_indices[:-keep_last_n] if len(user_msg_indices) > keep_last_n else []
    # ... replaces image source data in trimmed messages

The second is a Playwright helper to pre-fill the date of birth before the agent loop starts, because native <select> dropdowns in headless Chromium can't be reliably triggered by simulated mouse clicks:

async def set_date_of_birth(page, day: int, month: int, year: int):
    dob_input = page.locator("#dateOfBirthInput")
    await dob_input.click()
    month_select = page.locator(".react-datepicker__month-select")
    year_select = page.locator(".react-datepicker__year-select")
    await month_select.select_option(str(month - 1))  # 0-indexed
    await year_select.select_option(str(year))
    day_selector = f".react-datepicker__day--0{day:02d}:not(.react-datepicker__day--outside-month)"
    await page.locator(day_selector).click()

Getting to a working script takes 42 assistant messages across three debugging rounds: the rate-limit fix above (accumulated screenshots exceed 30,000 input tokens per minute), the date picker bypass (the <select> elements can't be triggered by simulated mouse clicks in headless Chromium, so Playwright pre-fills the date before the loop starts), and a page-load timeout fix. Once the script works, the clean run takes 19 turns and 245 seconds.

The form submits, but the date picker is never solved visually. The script needs the Playwright bypass on every attempt.


Wall-clock time hides the real cost

Computer Use's clean run is faster on wall-clock time, but it takes 42 debug messages to get there. Browser Use needs none. The date picker is the clearest example of why: Browser Use names the calendar elements directly from the DOM and clicks the right day. Computer Use sees pixels, misses consistently, and the Playwright bypass ends up running on every attempt. The image trimming workaround recurs in every subsequent task. It is a fixed overhead of the vision-only approach, not a problem specific to this form.


Task 2: Scraping speed

A static page with no forms or interactions is the task where vision-only should be most competitive. The data is readable and there is nothing to click. I give each agent a simple scrape: pull the title, URL, and point count for the top 10 stories from Hacker News.

I give Browser Use this prompt:

❯ Using the existing browser-use Python project in this folder, write and run a script
that goes to news.ycombinator.com and returns the title, URL, and point count for the
top 10 stories. Time how long it takes from agent start to finish. Print the results
and the time taken.

And Computer Use this one:

❯ Using the existing computer use setup in this folder, go to news.ycombinator.com
and return the title, URL, and point count for the top 10 stories. Use the computer
use tools (screenshots, mouse, keyboard) to interact with the browser directly. Time
how long it takes from start to finish. Print the results and the time taken. Record
the start time when you begin and the end time when done. If an interaction isn't
working after a couple of attempts, try a different approach rather than repeating
the same action.

Hacker News is public and the page layout is stable, so results are reproducible.


Browser Use: 2 steps, 86.8s, clean result

The coding agent produces a script that hands the task off as a single string:

agent = Agent(
    task=(
        "Go to news.ycombinator.com and find the top 10 stories on the front page. "
        "For each story, extract: the title, the URL it links to, and the point count. "
        "Return the results as a numbered list in this format:\n"
        "1. Title: <title>\n   URL: <url>\n   Points: <points>\n"
        "Continue for all 10 stories."
    ),
    llm=llm,
)
result = await agent.run()

Browser Use navigates to the page and returns all 10 stories in 2 agent steps with no errors. Here is the output:

=== Hacker News Top 10 ===

1. Title: You Want to Visit the UK? You Better Have a Google Play or App Store Account
URL: https://www.heltweg.org/...
Points: 57

2. Title: Show HN: Terminal Phone - E2EE Walkie Talkie from the Command Line
...

10. Title: First Website (1992)
Points: 235

Time taken: 86.8s

Computer Use: 21.45s, but the vision path failed

The script makes a single API call: it takes a screenshot, sends it to Claude, and asks for JSON back:

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    tools=[{"type": "computer_20251124", "name": "computer",
            "display_width_px": DISPLAY_WIDTH, "display_height_px": DISPLAY_HEIGHT}],
    messages=messages,
    betas=["computer-use-2025-11-24"],
)

The API call returns in 17.13s, but Claude's JSON is malformed. It truncates mid-object with JSONDecodeError: Extra data: line 63 column 1 (char 1413). The script falls back to page.evaluate(), pulling the data from the DOM directly. Here is the terminal output:

Sending screenshot to Claude Computer Use API...
API response in 17.13s
JSON parse error: Extra data: line 63 column 1 (char 1413)
Screenshot extraction failed, trying page source approach...
Extracted via page DOM

Total time: 21.45s

The faster result came from a DOM fallback, not vision

Computer Use finishes in 21.45s, Browser Use in 86.8s. That looks like a win until you check where the data came from. Claude's JSON is malformed, so the correct results come from page.evaluate(), a Playwright DOM call. The vision path fails on the task it is designed for. Browser Use's 86.8s reflects real agent work. The Browser Use script is 36 lines, a single task string handed to the agent. The Computer Use script is 187 lines of screenshot management, response parsing, and the DOM fallback that the result actually depends on.


Task 3: Structured output

PyPI's search results page lists package names and descriptions but omits version numbers. That forces any agent working from what it can see on screen to navigate to individual package pages. I give both tools the same prompt:

❯ Go to pypi.org, search for "web scraping", and return the name, latest version, and
description for the first 5 results. Return the data as structured JSON. Time how long
it takes from start to finish. Print the JSON and the time taken.

PyPI's search results are public and stable across time.


Browser Use: first run, navigates to each package page

The coding agent produces a script that hands the task off as a single string:

agent = Agent(
    task=(
        "Go to https://pypi.org/search/?q=web+scraping and find the first 5 packages listed. "
        "For each package extract: the package name, the latest version, and the short description. "
        "Return ONLY a JSON array with no extra text, in this exact format:\n"
        '[{"name": "package-name", "version": "1.2.3", "description": "short description"}, ...]\n'
        "Return exactly 5 items."
    ),
    llm=llm,
)

PyPI's search results don't show version numbers. Browser Use figures this out on its own and navigates to each of the five individual package pages to retrieve them. It completes in 12 agent steps and 241.2s with no errors. Here is the output:

[
  {"name": "kmy-web-scraping", "version": "0.1.0", "description": "search engines use scraping!"},
  {"name": "utils-web-scraping", "version": "0.1.4", "description": "All the functions that are shared between all the web scrappers"},
  {"name": "web-scraping-framework", "version": "0.6", "description": "web scraping wrapper built on puppeteer and celery."},
  {"name": "web-scraping-toolkit", "version": "0.1.0", "description": "A toolkit for web scraping"},
  {"name": "Web-Scraping-Utility", "version": "1.0.0", "description": "Web Scraping Utility"}
]

Time taken: 241.2s

Computer Use: succeeds on run three after two architectural rewrites

On the first run, PyPI's CDN returns a Fastly CAPTCHA page. The script needs browser stealth headers before the agent loop starts:

context = await browser.new_context(
    user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    extra_http_headers={"Accept-Language": "en-US,en;q=0.9", ...},
)
await context.add_init_script(
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

The second run gets past the CAPTCHA, but the agent loop does not terminate: Claude takes successive screenshots of the search results page without extracting any data or producing output. The fix is a structural rewrite into two phases, giving each API call a single bounded task: one call extracts names and descriptions from the search results page, then one call per package retrieves the version number from its individual page. The third run succeeds, completing 6 API calls in 111.5s. Here is the output:

{
  "query": "web scraping",
  "results": [
    {"name": "kmy-web-scraping", "version": "0.1.0", "description": "search engines use scraping!"},
    {"name": "utils-web-scraping", "version": "0.1.4", "description": "All the functions that are shared between all the web scrappers"},
    {"name": "web-scraping-framework", "version": "0.6", "description": "web scraping wrapper built on puppeteer and celery."},
    {"name": "web-scraping-toolkit", "version": "0.1.0", "description": "A toolkit for web scraping"},
    {"name": "Web-Scraping-Utility", "version": "1.0.0", "description": "Web Scraping Utility"}
  ],
  "time_seconds": 111.5
}
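The two-phase structure reduces to a small orchestration skeleton. Both extractor functions below are hypothetical stand-ins for the bounded API calls, stubbed with fixed data so only the control flow is shown:

```python
def two_phase_scrape(extract_search_results, extract_version, n=5):
    """Phase 1: one bounded call for names and descriptions from the
    search page. Phase 2: one bounded call per package for its version."""
    packages = extract_search_results()[:n]
    for pkg in packages:
        pkg["version"] = extract_version(pkg["name"])
    return packages

# Stubs standing in for the real per-phase API calls.
results = two_phase_scrape(
    extract_search_results=lambda: [
        {"name": "pkg-a", "description": "first"},
        {"name": "pkg-b", "description": "second"},
    ],
    extract_version=lambda name: {"pkg-a": "1.0.0", "pkg-b": "0.2.1"}[name],
)
print(results)
```

Bounding each call this way is what stops the loop from wandering: every API round-trip has one page and one extraction goal.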

Browser Use solved the version problem itself

Computer Use's clean run is faster, but it takes three attempts to get there. Browser Use gets the same results in 241.2s on the first run. Computer Use's run three finishes in 111.5s, after a Fastly CAPTCHA bypass, a full structural rewrite, and an explicit two-phase loop that the developer had to design. Browser Use encounters the same version number problem and solves it without any of that. The next task has no missing data problem to reason around, which makes it the clearest test of raw vision capability.


Task 4: Visual interaction (Neal.fun)

Neal.fun's I'm Not a Robot game presents a series of CAPTCHA-style visual puzzles where the agent reads what is on screen and clicks the correct element. There is no DOM structure available for element identification or click targeting. It is the task where a vision-only approach has its clearest advantage. I give both tools the same prompt:

❯ Go to https://neal.fun/not-a-robot and play the game. The game presents captcha-style
visual puzzles. Each level shows an instruction and you must click the correct element
on screen. Complete as many levels as you can within 10 minutes from when you start.
Track which level you reach. When the 10 minutes are up, or if the game ends, stop and
report: the highest level reached, a brief description of what each level asked you to
do, and whether you passed or failed each one. Print the results and the total time taken.

Neal.fun requires no account. Note that Browser Use ships with stealth configuration that Computer Use doesn't, so bot detection behaviour may differ from what's shown here.


Browser Use: reaches level 3, stopped by distorted text

The coding agent produces a script that runs the agent against the game for up to 200 steps:

agent = Agent(
    task=task,
    llm=llm,
    max_actions_per_step=5,
)
result = await agent.run(max_steps=200)

Browser Use passes the level 1 checkbox on the first attempt. Level 2 shows a stop sign grid and asks the agent to select every square containing a stop sign. The agent selects six squares and clicks verify, but two of those squares show only the pole, so the attempt fails. After a few more failed attempts the agent reassesses, identifies the two pole squares, deselects them, and clicks verify again. It passes.

Level 3 asks the agent to read and type distorted text. The warped rendering is unreadable from a screenshot. The agent makes several guesses, and time expires before it passes.

The full run is below, including the stop sign self-correction from the introduction and the distorted-text level where time expires.

Video is sped up 21x. The actual task took 12 minutes 23 seconds.


Computer Use: blocked by Cloudflare before the game loads

Cloudflare serves a bot verification page for the full 10 minutes. The agent takes 24 actions attempting to get past it, but headless Playwright fails bot detection every time. Here is the terminal output:

Starting Not-a-Robot game agent. Time limit: 10 minutes.

[6s] Level 0 | Actions: 0 | Remaining: 594s
[15s] Level 0 | Actions: 1 | Remaining: 585s
...

Time limit reached after 604 seconds.

GAME RESULTS
Total time: 604s (10.1 minutes)
Total actions taken: 24
Highest level reached: 0

No levels completed.

Cloudflare blocks Computer Use before the game starts

This is the task where vision-only should be strongest, and Computer Use never gets the chance. Cloudflare fingerprints the headless Playwright browser and blocks it for the full 10 minutes. Browser Use ships with stealth configuration and bot-detection handling built in. Computer Use hands you bare Playwright, and bypassing bot detection is your problem to solve.

Browser Use reaches level 3 before hitting a genuine vision limit: distorted text that no screenshot-based model can reliably read. The Cloudflare block is a different kind of failure. Computer Use's vision-only design ships with no browser scaffolding, so a bot detection problem that Browser Use handles at the library level falls entirely to the developer.


Task 5: Multi-step navigation (Wikipedia)

Multi-step navigation is the task most favorable to Computer Use: no forms, no structured data, just links both agents can see. DOM access still makes a measurable difference, but not for the reason the earlier tasks suggest. I give both tools the same prompt: navigate from the Cleopatra article to Albert Einstein using only body-text links.

❯ Play the Wikipedia navigation game. Start at the Wikipedia article for "Cleopatra"
and navigate to the Wikipedia article for "Albert Einstein" using only links in the
body text of each article. Rules: you may only click links that appear in the main
body text of each article. No search bar, no infobox, no sidebar, no hatnotes. You
cannot go back, only forward. Track every page you visit. Stop when you reach the
Albert Einstein article or after 15 clicks, whichever comes first. Report: whether
you reached the target, the full path of pages visited in order, the number of clicks
taken, and the total time. Print the results and the time taken.

Wikipedia's article structure is stable, though the path varies with current article content. The starting and ending articles are fixed.


Browser Use: DOM read commits in one step

The coding agent produces a script that runs the agent for up to 30 steps:

agent = Agent(
    task=task,  # full rules and reporting format as task string
    llm=llm,
)
result = await agent.run(max_steps=30)

Browser Use reads the full DOM of the Cleopatra page in a single step and picks Alexandria. From there it narrows: Alexandria leads to Science, Science to Natural science, Natural science to Physics, Physics to Theory of relativity, Theory of relativity to Albert Einstein. DOM access makes navigation faster but doesn't fix poor link choices. The Einstein link is on the Science page at click three, a prominent body-text link. The agent reads the DOM, does not pick it, and continues down the chain for three more hops, reaching the target in six clicks and 3 minutes 2 seconds.

The video below shows Browser Use navigating from Cleopatra to Albert Einstein in six clicks.

Video is sped up 4.3x. The actual task took 3 minutes 2 seconds.


Computer Use: nine scrolls before the first click

The agent manages context by resetting the conversation on each successful navigation and keeping a sliding window of 10 messages per page:

# Reset conversation for each new page
if new_title != current_title:
    pages_visited.append(new_title)
    conversation_messages = []

# Keep conversation window manageable
if len(conversation_messages) > 10:
    conversation_messages = conversation_messages[-10:]
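The windowing policy is easy to check on its own with stub messages; `apply_window` is a hypothetical helper equivalent to the slice above:

```python
def apply_window(messages, limit=10):
    """Keep only the most recent `limit` messages, dropping older turns."""
    return messages[-limit:] if len(messages) > limit else messages

# 25 stub turns in, 10 out: turns 0-14 fall out of context.
history = [{"role": "user", "turn": i} for i in range(25)]
window = apply_window(history)
print(len(window), window[0]["turn"])  # → 10 15
```

The page-change reset matters more than the window in practice: a fresh conversation per article keeps each Opus call from paying for screenshots of pages the agent has already left.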

The agent makes 9 scroll API calls on the Cleopatra page before committing to a link. Each scroll is a full Opus API call with a screenshot attachment. The first navigation, to Library of Alexandria, happens at 95 seconds. From there the path runs through Aristotle, Classical mechanics, and Special relativity before reaching Albert Einstein. Four of the nine clicks are misfires on pixel coordinates that don't resolve to a new page. The click log is below.

[Click 0/15] [95s]  → click(576,655) → Navigated to: Library of Alexandria
[Click 1/15] [183s] → click(649,642) (misfire)
[Click 2/15] [198s] → click(643,642) (misfire)
[Click 3/15] [213s] → click(627,314) → Navigated to: Aristotle
[Click 4/15] [241s] → click(483,514) → Navigated to: Classical mechanics
[Click 5/15] [253s] → click(393,543) (misfire)
[Click 6/15] [262s] → click(393,543) (misfire)
[Click 7/15] [303s] → click(662,496) → Navigated to: Special relativity
[Click 8/15] [319s] → click(393,278) → Navigated to: Albert Einstein
REACHED TARGET: Albert Einstein!

Clicks taken: 9
Total time: 326.3s

Both succeed, but scrolling costs Computer Use nearly double the time

Browser Use finishes in 182 seconds. Computer Use takes 326 seconds, despite reaching the target in fewer page navigations. Computer Use spends 9 API calls scrolling the Cleopatra page before making its first navigation, each one a full Opus call with a screenshot attached. Browser Use reads the full DOM in a single step and acts immediately.

DOM access doesn't fix reasoning. Browser Use misses the Einstein link on the Science page at click three and takes three more hops to get there. But poor link selection costs you a hop. Scrolling costs you an API call per viewport, multiplied across every page in the chain.


Conclusion

| Task | Browser Use | Computer Use |
|------|-------------|--------------|
| Setup | 12-line smoke test, no debugging | 89-line smoke test, no debugging |
| Form filling (demoqa.com) | Success, 25 turns, 498s, no debugging | 42 debug messages; clean run 19 turns, 245s; date picker bypassed with Playwright |
| Scraping speed (Hacker News) | Success, 2 steps, 86.8s | 21.45s, but vision extraction failed; results from DOM fallback |
| Structured output (PyPI) | Success first run, 12 steps, 241.2s | Success on run 3; 6 API calls, 111.5s |
| Visual interaction (Neal.fun) | Level 3 (failed distorted text) | Level 0 (Cloudflare blocked) |
| Multi-step navigation (Wikipedia) | Success, 6 clicks, 3m 2s | Success, 9 clicks, 5m 26s |

Vision-only hasn't closed the gap for browser tasks. Across five tasks, Computer Use requires debugging on three, is blocked entirely on one, and on the one clean win its vision path fails and a DOM fallback produces the result. Browser Use completes every task on the first run with no intervention.

The DOM access advantage compounds across tasks. The agent arrives with a complete picture of the page, with named elements and typed fields already mapped, while Computer Use arrives with a screenshot and the developer writes everything else. Browser Use scripts run 36 to 66 lines. Computer Use scripts run 187 to 330, on a model that costs 15x more per token.

Computer Use has a narrower use case. For native desktop automation or anything a DOM doesn't describe, it is the right tool. For browser automation, use a tool with DOM access like Browser Use.