Benchmarks Specification
Full specifications for our 12-benchmark suite. See the results article for findings from our tests in February 2026.
Use these to independently replicate and evaluate AI browser performance. Each benchmark targets a small number of entities and asks for deep research on each, which keeps cross-browser comparison straightforward. See the Browser Output Requirements section for the required output format.
Quick Start
- Read the General Requirements section (5 min)
- Execute benchmarks B1–B12 in order, following detailed instructions for each
- Document metadata (start/end time, blockers) after each benchmark
- Export your results as markdown following the Browser Output Requirements format
- Repeat steps 2–4 for each browser you're comparing. Run the full suite once per browser before scoring.
- To score your results, provide your exported output along with the Evaluation Framework section below to a third-party LLM (e.g., Google NotebookLM, Claude) with this prompt:
Score these benchmark results using the provided evaluation framework. For each of the 12 benchmarks, assign points across the 5 scoring criteria (Data Accuracy, Source Quality, Completeness, Speed & Efficiency, Insight Quality), totaling 100 points per benchmark. Return results as CSV format: Benchmark, Data_Accuracy, Source_Quality, Completeness, Speed, Insight_Quality, Total_Score.
Using a third-party LLM as evaluator reduces subjective bias and keeps scoring consistent across runs.
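For reference, the evaluator's CSV should contain one row per benchmark (B1–B12) with exactly these columns; the scores below are invented placeholders, not reference values:

```csv
Benchmark,Data_Accuracy,Source_Quality,Completeness,Speed,Insight_Quality,Total_Score
B1,32,25,22,5,8,92
B2,30,20,23,3,7,83
B12,28,22,20,5,9,84
```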
Requirements for Full Benchmark Suite
- Benchmarks 1–5: No special tools required
- Benchmarks 6–8: LinkedIn Sales Navigator or Recruiter required
- Benchmarks 7–8: CRM/ATS access required (Attio/Ashby used internally; substitute your own tools)
- Benchmarks 9–12: No special tools required
If you don't have access to Sales Navigator or a CRM, skip B6–B8 and score out of 9 benchmarks instead of 12.
Benchmark Summary
| Benchmark | Objective | Complexity |
|---|---|---|
| B1 | YC Company Deep Dive | Medium |
| B2 | Tech Stack Verification | Medium |
| B3 | SaaS Pricing & UX Test | Medium |
| B4 | VC Investment Analysis | Medium |
| B5 | Nordic Startup Intelligence | Low |
| B6 | Sales Navigator Lead Hunt | High |
| B7 | Lead Sourcing & CRM Dedup | High |
| B8 | Talent Sourcing | High |
| B9 | GitHub Repository Intelligence Mining | High |
| B10 | Product Hunt Launch Analysis | High |
| B11 | Influencer Vetting | High |
| B12 | E-Commerce Product Search | Medium |
General Requirements
Run benchmarks B1 through B12 in order. After each one, record start time, end time, duration, and any blockers. See the Browser Output Requirements section for the full output format.
For each benchmark, report:
- Did you complete without user interaction? (Yes/No)
- If No, what required approval?
- Self-Rated Autonomy: rate 1–10
B1: YC Company Deep Dive
Objective: Find and research 3 B2B SaaS companies from Y Combinator Winter 2021 (W21) batch.
Search Instructions:
- Navigate to Y Combinator's company directory (ycombinator.com/companies)
- Apply filters:
- Batch: W21 (Winter 2021)
- Industry/Category: B2B or SaaS (or related categories like "B2B Software", "SaaS", "Enterprise Software")
- Select any 3 companies from the filtered results
- Research each company comprehensively
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company Name | Official name | As listed on YC.com |
| Founder Names | All co-founders (first & last names) | "John Smith, Jane Doe" (comma-separated) |
| Founder Profiles | URL to their LinkedIn Profiles | Full URL with https:// |
| Industry | Category as listed in the YC directory | From YC profile or similar source |
| One-Liner | Official YC description | Direct quote from YC.com |
| Funding | Total raised to date | "$12.5M" (USD, use M/B notation) |
| Latest Round | Round type & date | "Series A, Jan 2024" |
| Website | Official company website | Full URL with https:// |
| Market Signal | Recent hiring, funding, or product launch + "how are they doing" | Max 50 words each (signal + assessment) |
| Glassdoor Review | Current rating if available | Score + number of reviews |
| Confidence | Data verification level | High/Medium/Low (see rules below) |
| Sources | URLs to verification sources | Min 2 URLs, comma-separated |
Confidence Rating Rules:
- High: Data verified on 2+ official sources (YC.com, company site, Crunchbase)
- Medium: Data verified on 1 official source
- Low: Data inferred or from unofficial sources
Edge Case Handling:
- If founder names unavailable → mark "Not disclosed"
- If funding undisclosed → mark "Undisclosed" (not "Unknown")
- If One-liner differs between sources → use YC.com version
- If fewer than 3 B2B SaaS companies in W21 → expand to adjacent categories (Enterprise, Developer Tools)
B2: Tech Stack Verification
Objective: Verify the primary tech stack for 2 major software companies using official sources.
Companies to Research: Notion, Linear
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company | Official name | As branded |
| Languages | 3–5 main programming languages | "TypeScript, Python, Go" (comma-separated) |
| Cloud | Primary cloud infrastructure | AWS/GCP/Azure/Vercel/Other |
| Database | Primary database(s) used | "PostgreSQL, Redis" (comma-separated) |
| Frontend | Main UI framework(s) | "React, Vue, Svelte" etc. |
| AI/ML | Any AI tools built OR integrated | "OpenAI API, custom models" or "None" |
| Dev Tools | Key developer infrastructure | "GitHub, Figma, Vercel" (max 3–4) |
| Confidence | Verification level | High/Medium/Low |
| Sources | URLs to blog posts, GitHub, StackShare | Min 2 URLs |
Confidence Rating Rules:
- High: Sourced from company engineering blog or official GitHub
- Medium: Sourced from StackShare, hiring posts, or job listings
- Low: Inferred from third-party articles
Edge Case Handling:
- If info contradicts between sources → list both versions in Notes section
- If specific tech not found → mark "Not disclosed"
- Prefer official engineering blogs over StackShare
B3: SaaS Pricing & UX Test
Objective: Research pricing AND test free tier UX for 2 project management tools.
Products to Research: Linear, ClickUp
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Product | Official product name | As branded |
| Starter Price | Lowest PAID tier price/month | "$10/mo" (USD, per month) |
| Free Tier | Does free tier exist? | "Yes" or "No" |
| Free Limit | Max users/projects in free tier | "3 users, 1 project" or "N/A" |
| iOS Rating | Current App Store rating | "4.7/5.0" (as of the run date) |
| Founded | Year company was founded | "2021" (year only) |
| Latest Funding | Most recent round & amount | "Series B, $30M, Nov 2023" |
| UX Note | Observation from testing free tier | Max 50 words |
| Confidence | Data verification level | High/Medium/Low |
| Sources | URLs to pricing, app store, Crunchbase | Min 2 URLs |
Confidence Rating Rules:
- High: All data verified on official sites + tested free tier yourself
- Medium: Official data but didn't test UX (mark "UX not tested")
- Low: Data from third-party sources only
Edge Case Handling:
- If no free tier → mark "No" for Free Tier, "N/A" for Free Limit
- If iOS app doesn't exist → mark "No iOS app"
- If unable to test free tier → mark "UX not tested" in UX Note
B4: VC Investment Analysis
Objective: Research 2 recent VC investments with context on why the investor chose that company.
Research Target: Sequoia Capital investments in 2024 (January–December)
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company | Portfolio company name | Official name |
| Date | Investment month & year | "Jan 2024" (MMM YYYY) |
| Sector | Industry/sector | "Design Tools", "Fintech" etc. |
| Stage | Seed/Series A/B/C/etc | Exact stage |
| Amount | Investment amount or "Undisclosed" | "$100M" (USD, M/B notation) or "Undisclosed" |
| Location | HQ city, country | "San Francisco, USA" |
| Thesis | Why Sequoia invested (if stated) | Max 100 words, paraphrase OK |
| Confidence | Verification level | High/Medium/Low |
| Sources | URLs to announcements, Crunchbase | Min 2 URLs |
Confidence Rating Rules:
- High: Sourced from official Sequoia announcement or company press release
- Medium: Sourced from Crunchbase or reputable news (TechCrunch)
- Low: Inferred from secondary sources
Edge Case Handling:
- If amount not disclosed → mark "Undisclosed" (not "Unknown")
- If thesis not stated → write "Thesis not disclosed"
- All dates must fall within Jan 1 – Dec 31, 2024
B5: Nordic Startup Intelligence
Objective: Research 2 Nordic startups with focus on recent activity and hiring trends.
Companies to Research: Einride, Legora
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company | Official name | As branded |
| Total Funding | All rounds combined | "$85M" (convert to single currency: USD) |
| Latest Round | Most recent funding amount | "$25M" |
| Round Date | Month & year | "Oct 2023" (MMM YYYY) |
| Valuation | If publicly disclosed | "$400M (Series B)" or "Private" |
| Employees | Approximate headcount | "150–200" (range OK) |
| HQ | City, country | "Stockholm, Sweden" |
| Recent News | Latest hire, product, or expansion | Max 50 words |
| Confidence | Verification level | High/Medium/Low |
| Sources | Crunchbase, company announcements | Min 2 URLs |
Confidence Rating Rules:
- High: All data from Crunchbase + company announcement
- Medium: Crunchbase only, no recent confirmation
- Low: Data inferred or outdated (>12 months old)
Edge Case Handling:
- If mix of USD/EUR → convert all to USD at current rates
- If employee count varies → use most recent number or range
- If valuation never disclosed → mark "Private" (not "Unknown")
B6: Sales Navigator Lead Hunt
Objective: Find 3 B2B SaaS accounts and 2 qualified decision-maker leads per company on LinkedIn Sales Navigator.
Account Filters:
- Industry: B2B SaaS (or Software Development)
- Company Size: 50–500 employees
- Location: United States
- Funding: Series A or Series B
- Timeframe: Funded in last 2 years (2024–2026)
Lead Filters (per account):
- Job Titles: VP Sales, Director Sales, VP Revenue Ops, Director Revenue Ops, Head of Sales, CRO
- Seniority: VP level or above
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Account | Company name | Official name from Sales Nav |
| Size | Employee count | Specific number if shown |
| Location | HQ city/state | "San Francisco, CA" |
| Account URL | Company profile URL | Full LinkedIn URL |
| Lead Name | Individual's name | "Nate Robbins" |
| Title | Current job title | Exact title from profile |
| Lead URL | Profile URL | Full LinkedIn URL |
| Email | Was the email shown on the lead card? | Email address or "Not visible" |
| Confidence | Verification of seniority | High/Medium |
| Notes | Any verification details | Max 20 words |
Confidence Rating Rules:
- High: Title matches filter exactly (VP+), confirmed on profile
- Medium: Title close to filter but not exact match
Edge Case Handling:
- If only 1 VP-level candidate found → extract that 1, note in "Notes"
- If email not visible → mark "Not visible" (not "N/A")
- If account has <50 employees (error) → skip and find replacement
B7: Lead Sourcing & CRM Dedup with Outreach
Objective: Source B2B SaaS leads from LinkedIn Sales Navigator, deduplicate against CRM, and add to outreach campaign.
Scope: 1 account, 2 leads per account. Target companies under 500 employees in software.
Detailed Process:
- Find leads in LinkedIn Sales Navigator
- Verify in CRM (Attio) that they don't already exist
- Present a list to the user for approval
- Add the leads to the Lemlist campaign specified by the user, or create a new campaign and add them there
- Do not add to Attio unless the user specifically asks
Required Fields: Lead Name, Title, Company, LinkedIn URL, Email (if exists), CRM Match (Yes/No), CRM ID, Campaign Status, Confidence, Notes
B8: Talent Sourcing
Objective: Source qualified talent from LinkedIn, LinkedIn Recruiter, or online sources matching specific criteria, verify they don't exist in Ashby (ATS dedup), and prepare outreach with personalized messaging.
Default Talent Profile (unless user specifies otherwise):
- Senior Product Manager (role/seniority)
- German language proficiency
- PLG or SMB experience (product-led growth or small-to-medium business)
Required Fields: Name, Title, Current Company, Location, LinkedIn URL, Email (if found), Relevant Skills, German Proficiency (Yes/No), PLG/SMB Experience (Yes/No), Ashby Match (Yes/No), Ashby Record ID, Fit Score (1–10), Confidence, Notes
Workflow Steps:
- Search for talent on LinkedIn/LinkedIn Recruiter matching the criteria
- Find email and extract relevant information
- Check Ashby (ATS) to verify they don't already exist (cold prospects only)
- Compile a short list with fit scores for user approval
- Generate personalized outreach message (LinkedIn or email)
- Update Ashby with talent info and attach CV/resume PDF if available
Fit Scoring (1–10):
- 9–10: Perfect match (all criteria + strong PLG/SMB track record + German fluency)
- 7–8: Strong match (meets most criteria, clear relevant experience)
- 5–6: Partial match (meets core criteria but missing 1–2 elements)
- 1–4: Low match (meets title/role but weak on experience/language)
B9: GitHub Repository Intelligence Mining
Objective: Navigate to GitHub Trending page, extract top 5 trending repositories in "TypeScript" category (today), then drill into each repo to extract detailed metadata.
Required Fields: Repo Name, Owner, Stars, Forks, Open Issues, Last Commit (MMM DD, YYYY), Top 3 Contributors, Primary Language, Topics, Key Features (max 100 words), License, Confidence, Sources
Confidence Rules:
- High: All data from official repo pages
- Medium: Some data inferred
- Low: Data incomplete/unavailable
Edge Cases:
- Fewer than 5 TypeScript repos trending → extract what's available
- Data not found → mark "Not disclosed"
Estimated Time: 10–12 min
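If you want to spot-check the extracted numbers afterward, the public GitHub REST API exposes the same repository metadata. A minimal sketch in Python (the owner/repo names are placeholders; the trending list itself has no official API, so it still has to come from the Trending page):

```python
import requests

# Placeholder owner/repo pairs; fill in the names taken from the Trending page.
repos = ["microsoft/TypeScript", "vercel/next.js"]

for full_name in repos:
    # Unauthenticated calls are fine for a handful of repos (60 requests/hour).
    repo = requests.get(f"https://api.github.com/repos/{full_name}").json()
    top3 = requests.get(
        f"https://api.github.com/repos/{full_name}/contributors",
        params={"per_page": 3},  # contributors are returned sorted by commit count
    ).json()
    print(full_name)
    print("  Stars:", repo["stargazers_count"])
    print("  Forks:", repo["forks_count"])
    print("  Open issues:", repo["open_issues_count"])  # note: includes open PRs
    print("  Last push:", repo["pushed_at"])            # proxy for last commit date
    print("  Top contributors:", [c["login"] for c in top3])
    print("  License:", (repo["license"] or {}).get("spdx_id", "Not disclosed"))
```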
B10: Product Hunt Launch Analysis
Objective: Navigate to Product Hunt homepage, find today's top 5 products, then drill into each to extract upvotes, comments count, maker names, pricing model, tech stack mentions, first comment timestamp, and top 3 comment themes.
Required Fields: Product Name, Tagline, Upvotes, Comments, Makers, Website, Pricing Model, Launch Time, Top Comment Themes (max 50 words each), Tech Stack Mentioned, Confidence, Sources
Confidence Rules:
- High: All data visible on product page, comments reviewed
- Medium: Some fields inferred
- Low: Data incomplete/comments inaccessible
Edge Cases:
- Comments require login → mark "Limited access"
- Tech stack not mentioned → mark "Not disclosed"
B11: Global Sports Watch Influencer Campaign Vetting
Scenario: You are launching an advanced sports watch globally (Western English-speaking markets). Search for 5 fitness/tech influencers on YouTube & Instagram (200K–2M followers), then deep-vet each for campaign fit.
Phase 1 (5–7 min): Find 5 influencers in fitness/sports tech niche across YouTube & Instagram. Filter by follower count, location signals, technical credibility. Exclude pure fashion/lifestyle.
Phase 2 (8–12 min): For each creator extract: Creator Name, Platform, Followers, Engagement Rate, Content Focus, Tech Credibility (1–10), Audience Geography, Recent Tech Content, Authenticity Score (1–10), Brand Safety, Campaign Fit Score (1–10), Recommendation (Tier 1/Tier 2/Not Suitable), Sources
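Engagement rate is not defined in the spec; a common approximation (an assumption here, not a requirement) is average interactions on recent posts divided by follower count:

```python
def engagement_rate(recent_posts, followers):
    # recent_posts: list of (likes, comments) tuples from the creator's last
    # ~10 posts; the 10-post window and the formula are assumptions, not spec.
    interactions = [likes + comments for likes, comments in recent_posts]
    return sum(interactions) / len(interactions) / followers

# Example: a 350K-follower creator averaging ~14.5K interactions per post.
rate = engagement_rate([(12_000, 800), (15_500, 1_200), (13_000, 950)], 350_000)
print(f"{rate:.2%}")  # ~4.14%
```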
Confidence Rules:
- High: Official profiles, authentic engagement, geography confirmed, tech credibility evident
- Medium: Most data from profile, some assumptions
- Low: Incomplete data, limited access
Edge Cases:
- Demographics not public → estimate from comments, mark "Estimated"
- Private account → mark "Limited access"
- Bot followers suspected → flag in Brand Safety
- Competitor reviews (Garmin, Whoop, Oura) → note tech verification AND potential exclusivity conflict
B12: E-Commerce Product Search
Objective: Find 3 camera tripods across Amazon, B&H Photo, and Adorama. Extract detailed specifications and pricing, then synthesize into a comparison matrix with insights.
Scenario: Researching camera tripods for a professional video production setup. Compare options across 3 major retailers to find the best value proposition. Budget: $200–400.
Phase 1 (4–6 min): Search for "professional camera tripod" or "video tripod" on Amazon.com, bhphotovideo.com, and adorama.com. Find 1 tripod per site matching: price range $200–400, suitable for professional video, in stock with real-time pricing visible, has customer reviews available.
Phase 2 (8–12 min): For each tripod extract: Product Name, Retailer, Current Price, Original/List Price, Max Height, Weight Capacity, Weight (tripod), Material, Legs, Head Type, Payload, Folded Length, Carrying Case, Customer Rating, Review Count, Top Pros (2–3 most common, max 100 words), Top Cons (2–3 most common, max 100 words), Warranty, Confidence, Sources (min 2 URLs per product)
Phase 3 (2–3 min): Create comparison summary answering:
- Which offers best value? (price vs. payload capacity)
- Which is most portable? (folded length + weight)
- Which has best reviews? (rating × review count consideration; see the sketch below)
- Notable trade-offs between options
Estimated Time: 12–15 min
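The "rating × review count" consideration in Phase 3 is intentionally loose; one way to make it concrete is to damp the review count logarithmically so that a large count reinforces, but does not drown out, the rating. The weighting below is an assumption for illustration, not part of the spec:

```python
import math

# Hypothetical tripods: (name, average rating, review count).
tripods = [
    ("Tripod A", 4.8, 45),
    ("Tripod B", 4.5, 2300),
    ("Tripod C", 4.6, 410),
]

for name, rating, reviews in tripods:
    # Log-damped weight: 2,300 reviews counts for more than 45, but not 50x more.
    score = rating * math.log10(reviews + 1)
    print(f"{name}: {score:.2f}")
# Tripod B comes out ahead of A and C here: its slightly lower rating is backed
# by far more reviews, so it is the safer "best reviews" pick.
```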
Browser Output Requirements
Start your output with this header:
# [Browser Name] Benchmark Run [YYYY-MM-DD HH:MM]
The header should include:
- Your browser name (e.g., "Strawberry", "Atlas", "Comet")
- The text "Benchmark Run"
- Date and time in format: YYYY-MM-DD HH:MM
Complete Output Structure:
# [Your Browser Name] Benchmark Run [YYYY-MM-DD HH:MM]
## Metadata
- **Browser:** [Your Name]
- **Start Time:** YYYY-MM-DD HH:MM:SS
- **End Time:** YYYY-MM-DD HH:MM:SS
- **Total Duration:** XX.X minutes
---
## B1: YC Company Deep Dive
**Status:** Complete | **Entities Found:** 3/3 | **Duration:** X.X min
[Insert table here]
**Notes:** [Your notes]
---
Before submitting, verify:
- Header includes browser name
- All sections have horizontal rule separators (---)
- Tables are complete (no truncated fields)
- Status lines follow format: Status: X | Entities Found: X/X | Duration: X.X min
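A quick way to catch formatting slips before submission is to lint the exported markdown. A minimal sketch (the file name and the exact matching rules are assumptions, not part of the spec):

```python
import re
import sys

# Accepts the status line with or without bold markers, e.g.
# "**Status:** Complete | **Entities Found:** 3/3 | **Duration:** 4.2 min"
STATUS_RE = re.compile(
    r"\*{0,2}Status:\*{0,2}\s*.+\|\s*\*{0,2}Entities Found:\*{0,2}\s*\d+/\d+\s*"
    r"\|\s*\*{0,2}Duration:\*{0,2}\s*\d+(?:\.\d+)?\s*min"
)

def check(path: str) -> None:
    text = open(path, encoding="utf-8").read()
    status_lines = STATUS_RE.findall(text)
    separators = len(re.findall(r"^---\s*$", text, flags=re.MULTILINE))
    print(f"{path}: {len(status_lines)} status lines, {separators} separators")
    if len(status_lines) < 12:
        print("Warning: fewer than 12 status lines; a benchmark section may be missing.")

if __name__ == "__main__":
    check(sys.argv[1])  # e.g. python check_output.py atlas_run.md
```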
Evaluation Framework
How to Compare Browser Performance
This framework is designed for independent LLM evaluation. You are evaluating 2–3 browsers by comparing their execution of these 12 benchmarks.
Evaluation Criteria (100 points per benchmark)
Data Accuracy: 35 pts
- Hallucinations/invented data: -20 pts each
- Factual errors: -5 to -10 pts each
- Must be verifiable via provided sources
Source Quality: 25 pts
- Official sources preferred over third-party
- Minimum 2 sources per fact
- Poor sourcing: -5 to -15 pts
Completeness: 25 pts
- All required fields must be filled
- Missing/vague fields: -5 to -10 pts each
- Incomplete or superficial work: -10 to -15 pts
Insight Quality: 10 pts
- Thoughtful observations vs generic
- Appropriate confidence ratings
- Limitations clearly noted
Speed & Efficiency: 5 pts
- 5 pts: Completed within estimated time
- 3 pts: 30–50% over but high quality
- 1 pt: 50%+ over estimated time, but excellent quality justifies the overrun
- 0 pts: Incomplete, or quality sacrificed in order to rush
Quality Penalties (Applied Regardless of Time)
- Hallucinated/invented data: -20 pts per instance
- Unverifiable facts without sources: -10 to -15 pts per instance
- Required field missing/left blank: -5 to -10 pts each
- Vague/incomplete data when discoverable: -5 to -10 pts per instance
- Formatting errors or sloppy execution: -3 to -5 pts per issue
- Inconsistent/over-confident ratings: -5 to -8 pts
- Incomplete/partial benchmark execution: -10 to -20 pts
Total Score = 100 pts max per benchmark
- Quality focus (95 pts): Data Accuracy (35) + Source Quality (25) + Completeness (25) + Insight Quality (10)
- Speed (5 pts): minimal weight
- Penalties: applied independently of base scoring
A browser completing in 2 minutes with incomplete/vague data scores lower than one taking 25 minutes with thorough, verified data. Time never excuses quality shortcuts.
Duration should be tracked and reported separately for each benchmark (B1, B2, B3, etc.), not as a single total run time. Completion is what matters — a browser that completes all 12 benchmarks in 60 minutes is superior to one that completes only 3 in 15 minutes.
Overall Score Calculation
Per Benchmark: Sum of all criteria = 100 points max
Overall Browser Score: Average of all benchmarks = 100 points max
Total Score = (B1 + B2 + B3 + ... + B12) / 12
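As a sanity check on the evaluator's arithmetic, the per-benchmark and overall totals reduce to a sum and a mean. A minimal sketch with placeholder values (the zero floor on penalized scores is an assumption):

```python
# Per-benchmark total: the five criteria sum to at most 100, penalties are
# subtracted on top, and the result is floored at 0.
def benchmark_total(accuracy, source, completeness, insight, speed, penalties=0):
    return max(0, accuracy + source + completeness + insight + speed - penalties)

scores = {
    "B1": benchmark_total(32, 22, 23, 8, 5),                # placeholder values
    "B2": benchmark_total(30, 20, 20, 7, 3, penalties=10),  # e.g. one missing field
    # ... B3 through B12
}

overall = sum(scores.values()) / len(scores)  # averaged across all 12 in a full run
print(f"Overall browser score: {overall:.1f} / 100")
```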
Note on non-determinism: LLM-based evaluation is inherently non-deterministic. Scores from independent runs may vary slightly. Rubric and weights are fixed; individual judge scores are approximate.
Quality Checklist
Before submitting results, verify:
- All source URLs are valid and working (a link-check sketch follows this checklist)
- Dates are in consistent format (MMM YYYY or YYYY-MM-DD)
- Currency is noted (USD) for all funding amounts
- Confidence ratings are consistent (High/Med/Low with clear reasoning)
- No hallucinated data (all verifiable via sources)
- Metadata includes start/end times
- Markdown table formatting is clean (no broken links)
- Notes section explains any gaps or limitations
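The link check referenced in the first checklist item can be as simple as issuing a HEAD request per URL. A minimal sketch (the file name is a placeholder; some sites reject HEAD or automated requests, so treat failures as prompts to verify manually rather than as proof of a dead link):

```python
import re
import requests

# Pull anything that looks like a URL out of the exported markdown tables.
URL_RE = re.compile(r"https?://[^\s)|\],]+")

def check_links(path: str) -> None:
    text = open(path, encoding="utf-8").read()
    for url in sorted(set(URL_RE.findall(text))):
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = "error"
        if status != 200:
            print(f"CHECK MANUALLY: {url} ({status})")

check_links("atlas_run.md")  # placeholder file name
```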