Browser Use = state of the art Web Agent
We're excited to announce that Browser Use has achieved state-of-the-art performance on the WebVoyager benchmark, with an impressive 89.1% success rate across 586 diverse web tasks. And the best part? We are FULLY open source (repo).
Web Agent Accuracy
Method
We took the existing WebVoyager codebase and adapted it to our needs. The prompts are slightly different, and we migrated the model calls from the plain OpenAI client to LangChain. All the code used for testing is available in the eval repository.
We ran our evaluation with GPT-4o (we will test other models as well, just give us a little bit of time).
Some tasks are impossible to solve. For example, Apple doesn't show prices for certain products in the dataset, there are no recipes for chocolate chip cookies, etc.
A lot of tasks have dates in the past (bookings, flights), so we shifted the years forward by one: 2023 became 2024 and 2024 became 2025.
We removed 55 tasks. You can find the full list at analysis.ipynb.
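The year shift described above can be sketched in a few lines. This is only an illustration, not the code from the eval repository: the function name `shift_years` and the example task text are made up, and we assume the years appear as plain four-digit numbers in the task description.

```python
import re

# Matches a standalone 2023 or 2024; longer numbers such as 20231 are left alone.
YEAR_PATTERN = re.compile(r"\b(2023|2024)\b")


def shift_years(task_text: str) -> str:
    """Move every 2023 or 2024 in a task description one year forward."""
    # A single regex pass handles both years at once, so a 2023 that
    # becomes 2024 is not bumped a second time to 2025.
    return YEAR_PATTERN.sub(lambda m: str(int(m.group(1)) + 1), task_text)


# Hypothetical task text, not taken from the dataset.
example = "Find a hotel in Paris from December 28, 2023 to January 2, 2024."
shifted = shift_years(example)
print(shifted)
```

Doing both replacements in one pass matters: two sequential string replaces (2023 to 2024, then 2024 to 2025) would push every 2023 date all the way to 2025.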
The Results
Here's how Browser Use performed across different domains:
Website | Success Rate | Avg Steps |
---|---|---|
Huggingface | 100% | 9.7 |
Google Flights | 95% | 36.2 |
Amazon | 92% | 14.7 |
GitHub | 92% | 15.9 |
Apple | 91% | 12.5 |
BBC News | 91% | 18.2 |
Cambridge Dictionary | 91% | 16.7 |
Allrecipes | 90% | 18.3 |
Coursera | 90% | 8.5 |
Google Search | 90% | 14.4 |
Google Map | 86% | 14.9 |
ESPN | 85% | 21.0 |
ArXiv | 83% | 17.6 |
Wolfram Alpha | 83% | 18.4 |
Booking | 80% | 32.7 |
Even our "worst" performing domain (Booking.com) still achieved an 80% success rate. Not too shabby!
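If you want to play with these numbers yourself, here is a small sketch that averages the per-domain success rates from the table. Note the assumption: this is an unweighted mean over the 15 domains. It will not necessarily match the overall 89.1%, which is measured across tasks, and the table does not list how many tasks each domain has.

```python
# Per-domain success rates (%) copied from the table above.
success_rates = {
    "Huggingface": 100, "Google Flights": 95, "Amazon": 92, "GitHub": 92,
    "Apple": 91, "BBC News": 91, "Cambridge Dictionary": 91, "Allrecipes": 90,
    "Coursera": 90, "Google Search": 90, "Google Map": 86, "ESPN": 85,
    "ArXiv": 83, "Wolfram Alpha": 83, "Booking": 80,
}

# Unweighted mean: every domain counts equally, regardless of task count.
unweighted_mean = sum(success_rates.values()) / len(success_rates)

# Domain with the lowest success rate.
hardest_domain = min(success_rates, key=success_rates.get)

print(f"Unweighted mean: {unweighted_mean:.1f}%")  # Unweighted mean: 89.3%
print(f"Hardest domain: {hardest_domain}")         # Hardest domain: Booking
```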
🤔 Do these results matter?
The WebVoyager dataset is hard. BUT we believe it's not actually testing the right things. It mostly tests an agent's planning, not its actual ability to understand sites (for example, complex sites with iframes and shadow DOM elements are extremely tricky, but the dataset doesn't cover them).
The dataset also includes ambiguous tasks that even humans might interpret differently.
BUT it is still the best we have. It's a great way to show off and to optimize the agent to generalize well. We are calling for a new dataset that actually tests an agent's knowledge of a specific website! If you can build one, contact us.
Analysis
We believe in transparency, so here are some caveats:
Manual correction of evaluations
The eval model is not good. That's why we added a third verdict, "unknown", for cases where the eval model is not sure.
Most of the evaluations are indeed correct, but some tasks were assessed wrongly, and the tasks marked "unknown" ended up as either "success" or "failed".
We manually reviewed the evaluations for the tasks marked either "unknown" or "failed" and corrected them where the default WebVoyager evaluator got it wrong.
- Some tasks were impossible due to outdated data (we removed these)
- Cloudflare sometimes got grumpy and blocked us
- The original evaluation model wasn't great (we manually reviewed edge cases)
- Some prompts were more ambiguous than your average fortune cookie
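To make the manual correction step concrete, here is a minimal sketch of a three-state verdict with a manual override. It is not the code from the eval repository: the names `Verdict`, `EvalResult`, `needs_review` and `success_rate` are hypothetical, the sample results are invented, and we assume that an "unknown" nobody resolved counts as not successful.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    UNKNOWN = "unknown"  # the eval model was not sure


@dataclass
class EvalResult:
    task_id: str
    auto_verdict: Verdict                     # what the eval model said
    manual_verdict: Optional[Verdict] = None  # set by a human reviewer

    @property
    def final_verdict(self) -> Verdict:
        # A manual verdict, when present, overrides the automatic one.
        if self.manual_verdict is not None:
            return self.manual_verdict
        return self.auto_verdict


def needs_review(results: List[EvalResult]) -> List[EvalResult]:
    """Select the results a human should look at: unknown or failed."""
    return [r for r in results
            if r.auto_verdict in (Verdict.UNKNOWN, Verdict.FAILED)]


def success_rate(results: List[EvalResult]) -> float:
    """Share of tasks whose final verdict is success (assumption:
    an unresolved unknown counts as not successful)."""
    if not results:
        return 0.0
    wins = sum(r.final_verdict is Verdict.SUCCESS for r in results)
    return wins / len(results)


# Invented sample data, for illustration only.
results = [
    EvalResult("task-1", Verdict.SUCCESS),
    EvalResult("task-2", Verdict.UNKNOWN, manual_verdict=Verdict.SUCCESS),
    EvalResult("task-3", Verdict.FAILED, manual_verdict=Verdict.FAILED),
    EvalResult("task-4", Verdict.FAILED, manual_verdict=Verdict.SUCCESS),
]
review_queue = [r.task_id for r in needs_review(results)]
rate = success_rate(results)
```

Keeping the automatic and manual verdicts in separate fields means every manual correction stays visible and can be audited later.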
What's Next?
We're not stopping here! Coming soon:
- make the manually labeled items more transparent
- add proxy rotation to deal with Cloudflare blocking
- test more models (Claude, GPT-4o, Llama 3, etc.)
- test different setups (single vs multiple images, single vs multiple tasks, etc.)
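As a rough idea of what proxy rotation could look like, here is a small round-robin sketch. Everything in it is an assumption for illustration: `BlockedError`, `fetch_with_rotation`, the injected `fetch` callable and the proxy addresses are hypothetical and are not part of Browser Use.

```python
from itertools import cycle, islice
from typing import Callable, Iterable


class BlockedError(Exception):
    """Hypothetical signal that a request was blocked (e.g. by Cloudflare)."""


def fetch_with_rotation(
    url: str,
    fetch: Callable[[str, str], str],
    proxies: Iterable[str],
    max_attempts: int = 3,
) -> str:
    """Try the request through proxies in round-robin order until one works."""
    proxy_list = list(proxies)
    if not proxy_list:
        raise ValueError("at least one proxy is required")
    last_error = None
    # cycle() repeats the proxy list forever; islice() caps the attempts.
    for proxy in islice(cycle(proxy_list), max_attempts):
        try:
            return fetch(url, proxy)
        except BlockedError as err:
            last_error = err  # blocked: move on to the next proxy
    raise BlockedError(f"all {max_attempts} attempts were blocked") from last_error


# Stand-in for a real request function: the first proxy is always blocked.
def fake_fetch(url: str, proxy: str) -> str:
    if proxy == "proxy-a.example:8080":
        raise BlockedError(proxy)
    return f"fetched {url} via {proxy}"


page = fetch_with_rotation(
    "https://example.com",
    fake_fetch,
    ["proxy-a.example:8080", "proxy-b.example:8080"],
)
print(page)
```

Passing the fetch function in as a parameter keeps the rotation logic independent of whichever browser or HTTP client actually makes the request.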
Want to build Web Agents?
Browser Use is 100% Open Source.
Remember: these results were achieved with our base setup - imagine what's possible with all the bells and whistles!
Stay tuned. For more updates join ↓