Strawberry vs. The Competition
We ran 12 real-world benchmarks to measure Strawberry against other AI browsers. Strawberry came out on top.
Real-World Performance, Not Lab Scores
Most AI browser benchmarks test what happens in a lab. Can the agent reason through a multi-step question? Can it navigate a test shopping site? These are useful signals, but they don't reflect real work.
Real work is messier: source 150 leads from LinkedIn, cross-check against your CRM, add only the new ones to an outreach sequence. Research 50 companies in parallel and pipe results into a structured spreadsheet. Move between five platforms without losing context or asking for hand-holding every two minutes.
We designed benchmarks modeled on workflows our users run daily. Then we ran them on Strawberry, Comet, and Atlas.

Strawberry scored 99.2 out of 100. It ran all 12 benchmarks on its own in 43 minutes. Comet scored 90.8. Solid data quality, but it stopped mid-task for things the spec already covered, like formatting questions and filter clarifications. Atlas scored 73.3 and kept picking the wrong data even when we told it exactly what to look for.
We also scored ~78% on GAIA, the most cited public benchmark for AI agents—the highest score among AI browsers you can actually download today.
Below, we'll walk through our methodology, address the obvious bias question, and show you how to validate the results yourself.
What We Tested
We designed 12 benchmarks across three tiers, each inspired by workflows we see users run every day.
Tier 1: Research Quality
Can the browser find accurate information from the right sources under specific constraints?
- YC companies matching a specific batch, stage, and geography—not just names, but funding history, founders, trajectory signals
- Tech stack verification across 100 AI SaaS companies using primary sources (no guessing from job boards)
- Testing 50 SaaS free tiers—navigate onboarding, document friction, compare pricing
- Analyzing a VC firm's last 12 investments to extract their actual thesis
- Nordic startup research using non-English sources and regional filings
Tier 2: Multi-Platform Integration
Tasks spanning multiple systems with separate logins, different interfaces, and no tolerance for dropped context.
- Source 150+ leads from LinkedIn Sales Navigator, deduplicate against live CRM, add verified prospects to outreach campaign (one workflow, three platforms)
- Talent sourcing: search LinkedIn Recruiter, cross-reference against ATS to flag existing candidates, score by fit, generate personalized outreach
Tier 3: Dynamic Content and Synthesis
The browser has to make judgment calls in real time and work with messy, changing content, not just pull what's already there.
- Analyze 50 trending GitHub repositories—extract metadata, identify patterns, surface emerging frameworks
- Mine Product Hunt's top 50 launches—analyze upvotes, comment sentiment, pricing to build a reusable GTM checklist
- Vet 100 fitness influencers across YouTube and Instagram—score on authenticity, audience quality, brand safety
- Compare 100 products across multiple retailers—reconcile specs against customer reviews to find where marketing claims break down
How We Scored Them
Each benchmark was evaluated on a 100-point scale across five dimensions:
- Data Accuracy (35 points) — Is the information factually correct? Did the browser follow constraints or drift?
- Source Quality (25 points) — Did it cite official, primary sources? We required multiple verifiable citations per claim.
- Completeness (25 points) — Are all required fields populated with real data? Vague entries like "varies" or "not available" are penalized unless the info genuinely doesn't exist.
- Speed & Efficiency (5 points) — Execution time relative to quality. We deliberately weight quality over speed: a browser that takes longer but delivers accurate results outscores one that rushes and leaves gaps.
- Insight Quality (10 points) — Thoughtful observations and appropriate confidence ratings. Does the output demonstrate understanding beyond data collection—identifying patterns, flagging limitations, explaining what matters?
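To make the arithmetic concrete, here's a minimal sketch of how a rubric like this rolls up into a single 100-point score. The dimension names and weights come from the rubric above; the per-dimension grades and the helper function are illustrative, not our actual evaluation code.

```python
# Illustrative only: rolls per-dimension rubric grades (each 0.0-1.0)
# into a single 100-point benchmark score using the weights above.

WEIGHTS = {
    "data_accuracy": 35,
    "source_quality": 25,
    "completeness": 25,
    "speed_efficiency": 5,
    "insight_quality": 10,
}

def weighted_score(grades: dict[str, float]) -> float:
    """grades maps each dimension to a 0.0-1.0 grade from the evaluator."""
    assert set(grades) == set(WEIGHTS), "all five dimensions are required"
    return sum(WEIGHTS[dim] * grades[dim] for dim in WEIGHTS)

# Hypothetical run: strong accuracy and sourcing, a slower pace, decent insight.
example = {
    "data_accuracy": 0.97,
    "source_quality": 0.95,
    "completeness": 0.92,
    "speed_efficiency": 0.80,
    "insight_quality": 0.90,
}
print(round(weighted_score(example), 1))  # 93.7 out of 100
```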
Evaluation Approach
We used standardized LLM-based evaluation (via Google's NotebookLM) to score each browser's output against a consistent rubric. The same evaluator and rubric were applied to all three browsers. This LLM-as-a-Judge method is increasingly common in AI evaluation (used by benchmarks like WebVoyager and Online-Mind2Web) because it avoids having the team that built the product manually grade its own work.
It's not perfect. An LLM evaluating another LLM has limitations. But combined with published specs and an open rubric, it means anyone can reproduce these tests. That's the real safeguard—not who scored it, but that you can verify it yourself.
One important caveat: LLM-based evaluation is non-deterministic. If you run the same benchmarks with the same rubric, you'll likely reach the same relative rankings—but exact point scores will vary between runs. We're confident in the gaps we observed, but treat the specific numbers as directional, not precise.
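If you want to reproduce the evaluation step, the shape of an LLM-as-a-Judge pass looks roughly like the sketch below. We used Google's NotebookLM as the judge; the `call_llm` function here is a hypothetical stand-in for whatever judge model and client you have access to, and the prompt wording is illustrative rather than our exact rubric prompt.

```python
import json

RUBRIC = (
    "Score the output on five dimensions, each from 0.0 to 1.0: "
    "data_accuracy, source_quality, completeness, speed_efficiency, insight_quality. "
    "Respond with a JSON object containing exactly those five keys and numeric values."
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: plug in whatever judge model/client you use."""
    raise NotImplementedError("wire this up to your own LLM client")

def judge(benchmark_spec: str, browser_output: str) -> dict:
    """Grade one browser's output against the shared rubric in a single judge pass."""
    prompt = (
        f"{RUBRIC}\n\n"
        f"Benchmark specification:\n{benchmark_spec}\n\n"
        f"Browser output to grade:\n{browser_output}"
    )
    return json.loads(call_llm(prompt))

def judge_averaged(benchmark_spec: str, browser_output: str, runs: int = 3) -> dict:
    """LLM judging is non-deterministic, so average a few passes and read the
    results as relative rankings rather than exact point totals."""
    passes = [judge(benchmark_spec, browser_output) for _ in range(runs)]
    return {dim: sum(p[dim] for p in passes) / runs for dim in passes[0]}
```

The same rubric string and the same judge go to every browser's output; the only thing that changes between runs is the output being graded.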
Addressing Bias: Why These Findings Matter
We designed this benchmark and Strawberry scored highest. Here's why the results still matter:
1. Full Transparency
We're publishing the complete benchmark specifications. Any team can run the same tests independently.
2. Standardized Evaluation
Standardized LLM-based evaluation scored all outputs using the same rubric—not our team manually grading our own work.
3. Validation Invitation
We're actively encouraging you to replicate these benchmarks on tasks that matter to your workflow, not just ours. If Strawberry falls short on your priorities, let us know. That's how we make the product better.
The benchmarks reflect real-world research tasks because they're the tasks we see users run daily and care about solving well. But the best test isn't our benchmark—it's whether Strawberry actually helps you get work done. More on that at the end.
The Results in Detail
Strawberry: 99.2/100
After a few setup questions, Strawberry ran every benchmark on its own in 43 minutes. It followed the filters we gave it, sourced from primary references, and when data wasn't available it said so rather than guessing.
Comet: 90.8/100
Data accuracy was close to Strawberry's, but it stopped mid-task for things the spec already covered. Formatting questions, clarifications on filters. It needed you in the loop.
Atlas: 73.3/100
Kept breaking constraints: pulled C++ repos instead of TypeScript, misidentified YC batches, ignored entity lists we explicitly provided. The data it did return was often fine. The problem was it didn't follow instructions.

The gap between 99.2 and 90.8 might look small on paper. In practice, it's the difference between a tool you can trust to run while you do something else and a tool you need to watch.
Why Existing Benchmarks Miss What Matters
We also ran GAIA, the most cited academic benchmark for AI agents. It's 466 multi-step reasoning tasks across three difficulty levels, scored on exact-match accuracy.
Strawberry scored ~78% on GAIA—the highest score among AI browsers you can actually download and use today. For context, OpenAI scored 67.36%, and Manus scored over 65%. The current GAIA leaderboard top score is around 91%, held by research labs running custom setups with no shipping product.
We could probably push our score higher if we optimized for it. We chose not to. GAIA tests structured reasoning inside a single environment. It's 100% English, mostly Western-centric sources, and it bundles formatting errors with genuine failures. That's fine for what it measures—it's just not designed to measure what our users need.
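To make the exact-match point concrete, here's a simplified illustration (not GAIA's official scorer, which applies its own normalization): under strict string matching, an answer that is substantively right but formatted differently still counts as a failure.

```python
# Simplified exact-match scoring for illustration; GAIA's official scorer
# applies its own normalization but remains strict once it has normalized.

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Correct only if the strings match after trivial whitespace/case cleanup."""
    return prediction.strip().lower() == ground_truth.strip().lower()

print(exact_match("1,000,000", "1000000"))                # False: same number, different formatting
print(exact_match("Saint Petersburg", "St. Petersburg"))  # False: same city, different spelling
print(exact_match("1000000", "1000000"))                  # True
```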
What This Means If You're Evaluating AI Browsers
A leaderboard score is a decent starting point. Here's what we'd recommend looking for:
- Can it complete a workflow across multiple platforms without constant hand-holding?
- Does it follow specific constraints, or drift when tasks get complex?
- Does it cite real sources, or summarize and hope you won't check?
- Can it handle non-English content and region-specific data?
Benchmarks are benchmarks. The only test that counts is whether Strawberry actually helps you get your work done. Download it, try it for free, and let us know what you think.