# Claude Fable Sets High Score on BU Bench

**Author:** Laith Weinberger
**Date:** 2026-06-10
> We ran Claude Fable 5 on BU Bench with the open-source Browser Use library. Scoring 80%, it bested all other models.

---

Anthropic's new Claude Fable 5 model scored **80.0%** on BU Bench V1 with the open-source [Browser Use library](https://github.com/browser-use/browser-use). That is a new high score, 12 percentage points ahead of the next-highest model, GPT 5.5.

The result came at an exorbitant price: **$580.87** in API cost for 100 tasks, or about $5.81 per task.

BU Bench tests whether browser agents can complete real web tasks. It requires agents to handle multi-step navigation, search, information extraction, form filling, dynamic UI interactions, iframes, PDFs, downloaded files, and synthesis across live websites. Its tasks, code, and model scores are available [here](https://github.com/browser-use/benchmark).

The chart below compares Claude Fable 5 with other open- and closed-source models.

![BU Bench V1 focused Browser Use model comparison with provider-colored bars and claude-fable-5 scoring 80 percent](https://browser-use.com/images/benchmark/claude-fable-browser-use-key-models-dark.png)

## Result

Fable passed **80 of 100** tasks. Of the 20 failures, 16 were judged incorrect, and four hit the 30-minute task timeout. Tasks took **6m 53s on average**.
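The figures above reduce to a few simple ratios. Here is a minimal Python sketch of that arithmetic, using only the numbers reported in this post. The function and key names are illustrative and are not taken from the benchmark code.

```python
# Figures reported in this post for Claude Fable 5 on BU Bench V1.
TOTAL_TASKS = 100
PASSED = 80
JUDGED_INCORRECT = 16
TIMED_OUT = 4
API_COST_USD = 580.87


def summarize(total, passed, incorrect, timed_out, cost_usd):
    """Derive pass rate and cost ratios from raw run counts (illustrative helper)."""
    failed = total - passed
    # The two failure categories should account for every failed task.
    if incorrect + timed_out != failed:
        raise ValueError("failure breakdown does not add up to total failures")
    return {
        "pass_rate": passed / total,
        "failed": failed,
        "cost_per_task": cost_usd / total,
        "cost_per_pass": cost_usd / passed,
    }


summary = summarize(TOTAL_TASKS, PASSED, JUDGED_INCORRECT, TIMED_OUT, API_COST_USD)
print(f"pass rate:     {summary['pass_rate']:.1%}")       # 80.0%
print(f"cost per task: ${summary['cost_per_task']:.2f}")  # $5.81
print(f"cost per pass: ${summary['cost_per_pass']:.2f}")  # $7.26
```

Cost per passed task is arguably the more useful number when comparing models, since it charges every failed attempt to the successes.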

That makes Fable slower per task than GPT 5.5, Opus 4.7, and BU 2.0, but faster than Gemini 3 Pro, Qwen 3.6 Plus, Kimi K2.6, and DeepSeek V4 Pro.

![BU Bench V1 Browser Use model success versus throughput chart for the selected comparison set](https://browser-use.com/images/benchmark/claude-fable-browser-use-key-models-throughput-dark.png)

The cost comparison makes the tradeoff clear.

![BU Bench V1 score versus API cost chart with claude-fable-5 far more expensive than other models](https://browser-use.com/images/benchmark/claude-fable-browser-use-cost-vs-score-dark.png)

Fable completed many of the long research and browsing tasks that usually separate stronger web agents from weaker ones. These tasks require following many steps of reasoning, checking constraints, and returning a structured answer, rather than extracting a single fact from one page.

## Failure pattern

Fable's failures mostly fell into three groups: it extracted the wrong fact or chose the wrong entity, it could not reach or search a source well enough, or it gave an answer that the trace did not fully support.

That is different from GPT 5.5's failure pattern. GPT 5.5 failed 32 tasks, often by leaving multi-part tasks unfinished. Fable failed nine of those same tasks and passed the other 23. The shared failures tended to involve fragile source access, many-step research tasks, or pages that were hard to search.

Fable also had failures of its own. In a few cases, the final answer looked plausible from the trace but still disagreed with the benchmark answer.
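The overlap between the two models' failures follows from the counts above. A short Python sketch of that bookkeeping, assuming both models were scored on the same 100 tasks (the function and key names are hypothetical):

```python
def failure_overlap(total, failed_a, failed_b, failed_both):
    """Split two models' failure counts into shared and model-specific parts."""
    only_a = failed_a - failed_both
    only_b = failed_b - failed_both
    failed_either = only_a + only_b + failed_both
    return {
        "only_a": only_a,
        "only_b": only_b,
        "failed_both": failed_both,
        "passed_both": total - failed_either,
    }


# Counts from this post: Fable failed 20 tasks, GPT 5.5 failed 32, and 9 were shared.
overlap = failure_overlap(total=100, failed_a=20, failed_b=32, failed_both=9)
print(overlap["only_a"])       # 11 tasks failed by Fable alone
print(overlap["only_b"])       # 23 tasks failed by GPT 5.5 alone
print(overlap["passed_both"])  # 57 tasks passed by both models
```

Under that assumption, 11 of Fable's 20 failures were on tasks that GPT 5.5 passed.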

Fable is clearly better at finishing complex Browser Use tasks, but its remaining failures are ones you would expect from any model.

## Final thoughts

Our initial testing suggests that Fable is better at sustaining long, complex tasks. It is better at keeping track of constraints, moving between sources, and turning messy web pages into a final answer.

Browser agents often fail in boring ways. They click the wrong result, miss one condition, lose a page, or answer before they have enough proof. Better models reduce these small failures, making web agents feel more reliable and deterministic.

The cost is still hard to ignore. Fable is far more expensive than any other model. But it also failed in fewer dumb ways.
