BROWSER USE

Products:
- [Browser Harness](https://browser-harness.com)
- [Stealth Browsers](https://browser-use.com/stealth-browsers)
- [Browser Use Box](https://browser-use.com/bux)
- [Web Agents](https://browser-use.com/web-agents)
- [Custom Models](https://browser-use.com/custom-models)
- [Proxies](https://browser-use.com/proxies)

[Pricing](https://browser-use.com/pricing)
[Blog](https://browser-use.com/posts)
[Cloud Docs](https://docs.cloud.browser-use.com)
[Open Source Docs](https://docs.browser-use.com)

[GET STARTED](https://cloud.browser-use.com)
[GITHUB](https://github.com/browser-use/browser-use)

---

# Browser Agent Benchmark: Comparing LLM Models for Web Automation

**Author:** Alexander Yue
**Date:** 2026-01-31
> We benchmark every major LLM on 100 hard browser tasks. Browser Use Cloud scores 78%, 16 points ahead of the best open-source model.

---

![BU Bench V1: Success Rate by Model](https://browser-use.com/images/benchmark/accuracy_by_model_dark.png)

## 100 hard browser tasks, one leaderboard

To truly understand our agent performance, we built a suite of internal tools for evaluating our agent in a standardized and repeatable way so we can compare versions and models and continuously improve. We take evaluations seriously. As of now, we have over 600,000 tasks run in testing.

This is our first open source benchmark. BU Bench V1: 100 hand-selected tasks that are hard but possible, drawn from five established sources.


| Source | Tasks | Description |
| --- | --- | --- |
| Custom | 20 | Page interaction challenges (iframes, drag-and-drop, complex forms) |
| WebBench | 20 | Web browsing tasks |
| Mind2Web 2 | 20 | Multi-step web navigation |
| GAIA | 20 | General AI assistant tasks (web-based) |
| BrowseComp | 20 | Browser comprehension tasks |


Every task was run many times with different LLMs, agent settings, and frameworks. Too-easy tasks were removed. Tasks majority-voted impossible and never completed were removed. What's left is hard and verified completable.

The task set is encrypted to prevent LLM training contamination.

## The judge

Real websites can't be judged deterministically. We use an LLM judge (`gemini-2.5-flash`) with a simple true/false verdict. Rubric-based scoring sounds better in theory, but in practice LLMs give middling scores to both successes and failures. Binary verdicts are more reliable.

We hand-labeled 200 traces and measured alignment. The judge agrees with human judgments 87% of the time, differing only on partial successes and technicalities.

To ensure consistency across models, the same judge LLM, prompt, and inputs are used for every evaluation.

## Results


| Model | Type | Score |
| --- | --- | --- |
| Browser Use Cloud (bu-ultra) | Cloud | 78.0% |
| OSS + BU LLM (ChatBrowserUse-2) | OSS + Cloud LLM | 63.3% |
| claude-opus-4-6 | Open Source | 62.0% |
| gemini-3-1-pro | Open Source | 59.3% |
| claude-sonnet-4-6 | Open Source | 59.0% |
| gpt-5 | Open Source | 52.4% |
| gpt-5-mini | Open Source | 37.0% |
| gemini-2.5-flash | Open Source | 35.2% |


**Browser Use Cloud leads at 78%**, 16 points ahead of the best open-source model. Each model was evaluated multiple times and results include error bars (standard error).

The "Open Source" column means running the open-source Browser Use library with that LLM. "OSS + Cloud LLM" means the open-source library using our ChatBrowserUse-2 model, which is specifically optimized for browser automation. "Cloud" is the fully managed Browser Use Cloud agent.

![BU Bench V1: Success vs. Throughput](https://browser-use.com/images/benchmark/accuracy_vs_throughput_dark.png)

The throughput plot tells the rest of the story. Browser Use Cloud (bu-ultra) is both the most accurate **and** the fastest at ~14 tasks per hour. GPT-5 is the slowest at ~6 tasks per hour. Each bu-ultra step is slower than a smaller LLM, but it completes tasks in far fewer steps, so total wall-clock time is lower. Speed matters in production.

## Why Cloud scores higher

Browser Use Cloud is not just a model. It combines a purpose-built agent with our own browser infrastructure: stealth proxies, CAPTCHA solving, persistent filesystem, and optimized tool orchestration. The 16-point gap over the best open-source model comes from this full-stack optimization, not just a better LLM.

For users who need custom tools or self-hosting, the open-source library with ChatBrowserUse-2 (63.3%) still outperforms every standalone open-source model.

## Using the benchmark

The benchmark is open source at [github.com/browser-use/benchmark](https://github.com/browser-use/benchmark). Clone the repo, set your API keys, and run `uv run python run_eval.py`.

A single run of 100 tasks takes ~3 hours and costs ~$10 on the basic Browser Use plan. More expensive models like claude-opus-4-6 can cost nearly $100 in LLM API calls per run.

We want LLM providers to use this benchmark to test and improve their models on real agentic browsing tasks. If you need to run evaluations at larger scale, contact support@browser-use.com.

[Try Browser Use Cloud](https://cloud.browser-use.com)