Which LLM Should I Use for Browser Use? The Definitive Browser AI Benchmark
A statistically rigorous benchmark comparing the speed, cost, and accuracy of top frontier models for browser automation.

One question dominates every conversation we have with developers: "Which model should I plug into Browser Use?"
Answering this requires statistical rigor. A single test run means nothing when evaluating against the chaotic, ever-changing real web.
We run our benchmark multiple times and aggregate with statistical bootstrapping to generate real error bars. Our evaluation system has run over 600,000 tasks to date, with results validated by an LLM judge achieving 87% agreement with human labels.
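The bootstrapping described above reduces to resampling per-task pass/fail outcomes with replacement and reading off percentile bounds. Here's a minimal sketch; the resample count and the task outcomes are illustrative assumptions, not the benchmark's actual data or code:

```python
import random

def bootstrap_ci(successes, n_resamples=10_000, alpha=0.05, seed=0):
    """Estimate mean accuracy and a (1 - alpha) confidence interval
    by resampling pass/fail task outcomes with replacement."""
    rng = random.Random(seed)
    n = len(successes)
    # Mean accuracy of each resampled "run", sorted for percentile lookup.
    means = sorted(
        sum(rng.choices(successes, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples)]
    return sum(successes) / n, lo, hi

# Illustrative outcomes from repeated runs of one task set (1 = pass).
results = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
mean, lo, hi = bootstrap_ci(results)
print(f"accuracy {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The width of that interval is what the error bars in the chart below visualize: a model whose runs vary wildly against the live web gets a wide bar even if its average looks good.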
Here's what we found.
The Benchmark Tasks
To find the best model, you need the hardest test.
We curate the hardest verifiable web tasks, explicitly avoiding synthetic websites as they fail to capture the bizarre reality of how the actual web is built. Instead, we sourced tasks from millions of LLM-labeled user spans to find complex, repeatable, and anonymous interactions that represent real production value.
These aren't toy examples. Each was hand-selected and independently verified to be hard, but possible.
Results: Model Performance vs Throughput
We evaluated all major frontier models across our open-source benchmark (tasks selected from AI research labs).
(For a deep dive into our evaluation infrastructure, read our technical report.)
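The 87% judge-human agreement figure cited earlier boils down to a simple match rate over paired verdicts. A minimal sketch of that metric; the label values below are illustrative, not data from the benchmark:

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of tasks where the LLM judge's pass/fail verdict
    matches the human label for the same task."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Illustrative verdicts: the judge disagrees on one of eight tasks.
judge = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
human = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]
print(agreement_rate(judge, human))  # → 0.875
```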
The scatter plot below correlates speed and accuracy, with standard error bars for each model to account for web variance. You can view all models and their prices here.

The Model Breakdown
The Clear Winner (bu-2)
Our proprietary bu-2 endpoint dominates every category.
Introduced in Browser Use 1.0, it's the result of continuous on-task improvements based on what we know matters for our users: vastly faster execution through optimized output schema parsing, proprietary prompt structuring, and batched caching.
Verdict: Highest performance, lowest latency, highest rate limits—no separate API key required. If you want the best browser automation out of the box, use bu-2.
Gemini (SOTA Standalone)
For pure browser tasks, Gemini is currently state-of-the-art.
Strong vision, low latency, and effective agentic action selection make Gemini models highly reliable. Their massive context windows are crucial—modern web pages produce enormous DOMs, and Gemini handles them without breaking a sweat.
Verdict: Best standalone value for general web browsing.
Claude (The Coder)
Claude models are the most intelligent when it comes to code.
When the agent needs to execute custom JavaScript or extract complex structured data, Claude is unparalleled.
But for average browsing tasks—navigating, clicking, finding information—Claude scores slightly below Gemini and costs significantly more.
Verdict: Use Claude when your workflow relies on custom code execution or complex data extraction.
OpenAI
There was a time when gpt-4.1 was the undisputed SOTA browser agent.
Recent OpenAI models have mysteriously regressed on web tasks. We don't know what's going on, but they are no longer leading the pack.
Verdict: Still capable, but falling behind competitors in browser automation.
Grok
xAI tends to delay API access for Grok models until months after launch, making them irrelevant by the time they could enter agent workflows.
Verdict: Irrelevant.
Open Source
Open-source models struggle heavily here.
The best we tested, kimi k2.5 and our own fine-tuned browser-use-open-weight model, fail to even break 30% accuracy. None of the open-source models make it onto the main performance chart.
Verdict: Not yet ready for reliable, complex browser automation.
Conclusion & Recommendations
Choosing the right model comes down to your use case:
- bu-2: Best performance, lowest latency, zero setup. Use this.
- Gemini: Best standalone value for general browsing.
- Claude: Unmatched for custom code execution and structured data extraction.
A word of warning: running the full 100-task benchmark costs $10 to $100 depending on the model and takes hours of compute. This is built for researchers and LLM providers pushing state-of-the-art, not everyday users.
If you're an LLM provider or researcher looking to run these evaluations at scale, reach out at support@browser-use.com. The benchmark is open-source at github.com/browser-use/benchmark.