
Benchmarks Specification

Full specifications for our 12-benchmark suite. See the results article for findings from our tests in February 2026.

Use these to independently replicate and evaluate AI browser performance. Each benchmark covers a small number of entities researched in depth, which keeps cross-browser comparison straightforward. See the Browser Output Requirements section for the required output format.

Quick Start

  1. Read the General Requirements section (5 min)
  2. Execute benchmarks B1–B12 in order, following detailed instructions for each
  3. Document metadata (start/end time, blockers) after each benchmark
  4. Export your results as markdown following the Browser Output Requirements format
  5. Repeat steps 2–4 for each browser you're comparing. Run the full suite once per browser before scoring.
  6. To score your results, provide your exported output along with the Evaluation Framework section below to a third-party LLM (e.g., Google NotebookLM, Claude) with this prompt:
Score these benchmark results using the provided evaluation framework. For each of the 12 benchmarks, assign points across the 5 scoring criteria (Data Accuracy, Source Quality, Completeness, Speed & Efficiency, Insight Quality), totaling 100 points per benchmark. Return results as CSV format: Benchmark, Data_Accuracy, Source_Quality, Completeness, Speed, Insight_Quality, Total_Score.

Using a third-party LLM as evaluator reduces subjective bias and keeps scoring consistent across runs.
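
For reference, the evaluator's CSV should contain one row per benchmark; an illustrative fragment with placeholder scores (not real results):

```csv
Benchmark,Data_Accuracy,Source_Quality,Completeness,Speed,Insight_Quality,Total_Score
B1,32,23,22,4,8,89
B2,35,25,20,5,7,92
```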

Requirements for Full Benchmark Suite

  • Benchmarks 1–5: No special tools required
  • Benchmarks 6–8: LinkedIn Sales Navigator or Recruiter required
  • Benchmarks 7–8: CRM/ATS access required (Attio/Ashby used internally; substitute your own tools)
  • Benchmarks 9–12: No special tools required

If you don't have access to Sales Navigator or a CRM, skip B6–B8 and score out of 9 benchmarks instead of 12.


Benchmark Summary

| Benchmark | Objective | Complexity |
| --- | --- | --- |
| B1 | YC Company Deep Dive | Medium |
| B2 | Tech Stack Verification | Medium |
| B3 | SaaS Pricing & UX Test | Medium |
| B4 | VC Investment Analysis | Medium |
| B5 | Nordic Startup Intelligence | Low |
| B6 | Sales Navigator Lead Hunt | High |
| B7 | Lead Sourcing & CRM Dedup | High |
| B8 | Talent Sourcing | High |
| B9 | GitHub Repository Intelligence Mining | High |
| B10 | Product Hunt Launch Analysis | High |
| B11 | Influencer Vetting | High |
| B12 | E-Commerce Product Search | Medium |

General Requirements

Run benchmarks B1 through B12 in order. After each one, record start time, end time, duration, and any blockers. See the Browser Output Requirements section for the full output format.

For each benchmark, report:

  1. Did you complete without user interaction? (Yes/No)
  2. If No, what required approval?
  3. Self-Rated Autonomy: rate 1–10

B1: YC Company Deep Dive

Objective: Find and research 3 B2B SaaS companies from Y Combinator Winter 2021 (W21) batch.

Search Instructions:

  1. Navigate to Y Combinator's company directory (ycombinator.com/companies)
  2. Apply filters:
    • Batch: W21 (Winter 2021)
    • Industry/Category: B2B or SaaS (or related categories like "B2B Software", "SaaS", "Enterprise Software")
  3. Select any 3 companies from the filtered results
  4. Research each company comprehensively

Required Fields:

| Field | Description | Format/Rules |
| --- | --- | --- |
| Company Name | Official name | As listed on YC.com |
| Founder Names | All co-founders (first & last names) | "John Smith, Jane Doe" (comma-separated) |
| Founder Profiles | URLs to their LinkedIn profiles | Full URL with https:// |
| Industry | YC batch category | From YC profile or similar |
| One-Liner | Official YC description | Direct quote from YC.com |
| Funding | Total raised to date | "$12.5M" (USD, use M/B notation) |
| Latest Round | Round type & date | "Series A, Jan 2024" |
| Website | Official company website | Full URL with https:// |
| Market Signal | Recent hiring, funding, or product launch + "how are they doing" | Max 50 + 50 words |
| Glassdoor Review | Current rating if available | Score + number of reviews |
| Confidence | Data verification level | High/Medium/Low (see rules below) |
| Sources | URLs to verification sources | Min 2 URLs, comma-separated |

Confidence Rating Rules:

  • High: Data verified on 2+ official sources (YC.com, company site, Crunchbase)
  • Medium: Data verified on 1 official source
  • Low: Data inferred or from unofficial sources

Edge Case Handling:

  • If founder names unavailable → mark "Not disclosed"
  • If funding undisclosed → mark "Undisclosed" (not "Unknown")
  • If One-liner differs between sources → use YC.com version
  • If fewer than 3 B2B SaaS companies in W21 → expand to adjacent categories (Enterprise, Developer Tools)

B2: Tech Stack Verification

Objective: Verify the primary tech stack for 2 major software companies using official sources.

Companies to Research: Notion, Linear

Required Fields:

| Field | Description | Format/Rules |
| --- | --- | --- |
| Company | Official name | As branded |
| Languages | 3–5 main programming languages | "TypeScript, Python, Go" (comma-separated) |
| Cloud | Primary cloud infrastructure | AWS/GCP/Azure/Vercel/Other |
| Database | Primary database(s) used | "PostgreSQL, Redis" (comma-separated) |
| Frontend | Main UI framework(s) | "React, Vue, Svelte" etc. |
| AI/ML | Any AI tools built OR integrated | "OpenAI API, custom models" or "None" |
| Dev Tools | Key developer infrastructure | "GitHub, Figma, Vercel" (max 3–4) |
| Confidence | Verification level | High/Medium/Low |
| Sources | URLs to blog posts, GitHub, StackShare | Min 2 URLs |

Confidence Rating Rules:

  • High: Sourced from company engineering blog or official GitHub
  • Medium: Sourced from StackShare, hiring posts, or job listings
  • Low: Inferred from third-party articles

Edge Case Handling:

  • If info contradicts between sources → list both versions in Notes section
  • If specific tech not found → mark "Not disclosed"
  • Prefer official engineering blogs over StackShare

B3: SaaS Pricing & UX Test

Objective: Research pricing AND test free tier UX for 2 project management tools.

Products to Research: Linear, ClickUp

Required Fields:

| Field | Description | Format/Rules |
| --- | --- | --- |
| Product | Official product name | As branded |
| Starter Price | Lowest PAID tier price/month | "$10/mo" (USD, per month) |
| Free Tier | Does free tier exist? | "Yes" or "No" |
| Free Limit | Max users/projects in free tier | "3 users, 1 project" or "N/A" |
| iOS Rating | Current App Store rating | "4.7/5.0" (check today's date) |
| Founded | Year company was founded | "2021" (year only) |
| Latest Funding | Most recent round & amount | "Series B, $30M, Nov 2023" |
| UX Note | Observation from testing free tier | Max 50 words |
| Confidence | Data verification level | High/Medium/Low |
| Sources | URLs to pricing, app store, Crunchbase | Min 2 URLs |

Confidence Rating Rules:

  • High: All data verified on official sites + tested free tier yourself
  • Medium: Official data but didn't test UX (mark "UX not tested")
  • Low: Data from third-party sources only

Edge Case Handling:

  • If no free tier → mark "No" for Free Tier, "N/A" for Free Limit
  • If iOS app doesn't exist → mark "No iOS app"
  • If unable to test free tier → mark "UX not tested" in UX Note

B4: VC Investment Analysis

Objective: Research 2 recent VC investments with context on why the investor chose that company.

Research Target: Sequoia Capital investments in 2024 (January–December)

Required Fields:

| Field | Description | Format/Rules |
| --- | --- | --- |
| Company | Portfolio company name | Official name |
| Date | Investment month & year | "Jan 2024" (MMM YYYY) |
| Sector | Industry/sector | "Design Tools", "Fintech", etc. |
| Stage | Seed/Series A/B/C, etc. | Exact stage |
| Amount | Investment amount or "Undisclosed" | "$100M" (USD, M/B notation) or "Undisclosed" |
| Location | HQ city, country | "San Francisco, USA" |
| Thesis | Why Sequoia invested (if stated) | Max 100 words, paraphrase OK |
| Confidence | Verification level | High/Medium/Low |
| Sources | URLs to announcements, Crunchbase | Min 2 URLs |

Confidence Rating Rules:

  • High: Sourced from official Sequoia announcement or company press release
  • Medium: Sourced from Crunchbase or reputable news (TechCrunch)
  • Low: Inferred from secondary sources

Edge Case Handling:

  • If amount not disclosed → mark "Undisclosed" (not "Unknown")
  • If thesis not stated → write "Thesis not disclosed"
  • All dates must fall within Jan 1 – Dec 31, 2024

B5: Nordic Startup Intelligence

Objective: Research 2 Nordic startups with focus on recent activity and hiring trends.

Companies to Research: Einride, Legora

Required Fields:

| Field | Description | Format/Rules |
| --- | --- | --- |
| Company | Official name | As branded |
| Total Funding | All rounds combined | "$85M" (convert to single currency: USD) |
| Latest Round | Most recent funding amount | "$25M" |
| Round Date | Month & year | "Oct 2023" (MMM YYYY) |
| Valuation | If publicly disclosed | "$400M (Series B)" or "Private" |
| Employees | Approximate headcount | "150–200" (range OK) |
| HQ | City, country | "Stockholm, Sweden" |
| Recent News | Latest hire, product, or expansion | Max 50 words |
| Confidence | Verification level | High/Medium/Low |
| Sources | Crunchbase, company announcements | Min 2 URLs |

Confidence Rating Rules:

  • High: All data from Crunchbase + company announcement
  • Medium: Crunchbase only, no recent confirmation
  • Low: Data inferred or outdated (>12 months old)

Edge Case Handling:

  • If mix of USD/EUR → convert all to USD at current rates
  • If employee count varies → use most recent number or range
  • If valuation never disclosed → mark "Private" (not "Unknown")

B6: Sales Navigator Lead Hunt

Objective: Find 3 B2B SaaS accounts and 2 qualified decision-maker leads per company on LinkedIn Sales Navigator.

Account Filters:

  • Industry: B2B SaaS (or Software Development)
  • Company Size: 50–500 employees
  • Location: United States
  • Funding: Series A or Series B
  • Timeframe: Funded in last 2 years (2024–2026)

Lead Filters (per account):

  • Job Titles: VP Sales, Director Sales, VP Revenue Ops, Director Revenue Ops, Head of Sales, CRO
  • Seniority: VP level or above

Required Fields:

| Field | Description | Format/Rules |
| --- | --- | --- |
| Account | Company name | Official name from Sales Nav |
| Size | Employee count | Specific number if shown |
| Location | HQ city/state | "San Francisco, CA" |
| Account URL | Company profile URL | Full LinkedIn URL |
| Lead Name | Individual's name | "Nate Robbins" |
| Title | Current job title | Exact title from profile |
| Lead URL | Profile URL | Full LinkedIn URL |
| Email | Was email shown on card? | Email address or "Not visible" |
| Confidence | Verification of seniority | High/Medium |
| Notes | Any verification details | Max 20 words |

Confidence Rating Rules:

  • High: Title matches filter exactly (VP+), confirmed on profile
  • Medium: Title close to filter but not exact match

Edge Case Handling:

  • If only 1 VP-level candidate found → extract that 1, note in "Notes"
  • If email not visible → mark "Not visible" (not "N/A")
  • If account has <50 employees (error) → skip and find replacement

B7: Lead Sourcing & CRM Dedup with Outreach

Objective: Source B2B SaaS leads from LinkedIn Sales Navigator, deduplicate against CRM, and add to outreach campaign.

Scope: 1 account, 2 leads per account. Target companies under 500 employees in software.

Detailed Process:

  1. Find leads in LinkedIn Sales Navigator
  2. Verify in CRM (Attio) that they don't already exist
  3. Present a list to the user for approval
  4. Add them to the Lemlist campaign specified by the user, or create a new campaign and add them there
  5. Do not add to Attio unless the user specifically asks

Required Fields: Lead Name, Title, Company, LinkedIn URL, Email (if exists), CRM Match (Yes/No), CRM ID, Campaign Status, Confidence, Notes
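
A minimal sketch of the dedup step, assuming leads are matched on normalized LinkedIn URLs; `existing_urls` stands in for whatever your CRM (Attio here) exports, since the lookup itself is tool-specific:

```python
def normalize_linkedin_url(url: str) -> str:
    """Normalize a LinkedIn profile URL so trivial variants compare equal."""
    url = url.strip().lower()
    url = url.split("?")[0].rstrip("/")   # drop tracking params and trailing slash
    return url.replace("http://", "https://")

def dedup_leads(candidates: list[dict], existing_urls: set[str]) -> list[dict]:
    """Set CRM Match on each lead and return only leads not already in the CRM."""
    existing = {normalize_linkedin_url(u) for u in existing_urls}
    for lead in candidates:
        match = normalize_linkedin_url(lead["LinkedIn URL"]) in existing
        lead["CRM Match"] = "Yes" if match else "No"
    return [lead for lead in candidates if lead["CRM Match"] == "No"]
```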


B8: Talent Sourcing

Objective: Source qualified talent from LinkedIn, LinkedIn Recruiter, or online sources matching specific criteria, verify they don't exist in Ashby (ATS dedup), and prepare outreach with personalized messaging.

Default Talent Profile (unless user specifies otherwise):

  1. Senior Product Manager (role/seniority)
  2. German language proficiency
  3. PLG or SMB experience (product-led growth or small-to-medium business)

Required Fields: Name, Title, Current Company, Location, LinkedIn URL, Email (if found), Relevant Skills, German Proficiency (Yes/No), PLG/SMB Experience (Yes/No), Ashby Match (Yes/No), Ashby Record ID, Fit Score (1–10), Confidence, Notes

Workflow Steps:

  1. Search for talent on LinkedIn/LinkedIn Recruiter matching the criteria
  2. Find email and extract relevant information
  3. Check Ashby (ATS) to verify they don't already exist (cold prospects only)
  4. Compile a short list with fit scores for user approval
  5. Generate personalized outreach message (LinkedIn or email)
  6. Update Ashby with talent info and attach CV/resume PDF if available

Fit Scoring (1–10):

  • 9–10: Perfect match (all criteria + strong PLG/SMB track record + German fluency)
  • 7–8: Strong match (meets most criteria, clear relevant experience)
  • 5–6: Partial match (meets core criteria but missing 1–2 elements)
  • 1–4: Low match (meets title/role but weak on experience/language)
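
As a rough illustration, the bands above could be encoded like this (a sketch only; the boolean inputs mirror the required fields, and picking the exact value within a band remains a judgment call):

```python
def fit_score(senior_pm: bool, german: bool, plg_smb: bool, strong_track_record: bool) -> int:
    """Map the B8 criteria to the fit bands above."""
    if senior_pm and german and plg_smb and strong_track_record:
        return 9    # 9-10: perfect match, all criteria + strong track record
    if senior_pm and german and plg_smb:
        return 7    # 7-8: strong match, meets most criteria
    if senior_pm and (german or plg_smb):
        return 5    # 5-6: partial match, missing 1-2 elements
    return 3        # 1-4: low match, title fits but weak elsewhere
```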

B9: GitHub Repository Intelligence Mining

Objective: Navigate to the GitHub Trending page, extract the top 5 trending repositories in the "TypeScript" category (today's view), then drill into each repo to extract detailed metadata.

Required Fields: Repo Name, Owner, Stars, Forks, Open Issues, Last Commit (MMM DD, YYYY), Top 3 Contributors, Primary Language, Topics, Key Features (max 100 words), License, Confidence, Sources

Confidence Rules:

  • High: All data from official repo pages
  • Medium: Some data inferred
  • Low: Data incomplete/unavailable

Edge Cases:

  • Fewer than 5 TypeScript repos trending → extract what's available
  • Data not found → mark "Not disclosed"

Estimated Time: 10–12 min
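
To spot-check whatever numbers a browser reports, the same fields can be pulled from the public GitHub REST API (unauthenticated requests are fine at this volume); a minimal verification sketch:

```python
import requests

def repo_snapshot(owner: str, repo: str) -> dict:
    """Fetch headline repo stats from the public GitHub REST API."""
    base = f"https://api.github.com/repos/{owner}/{repo}"
    meta = requests.get(base, timeout=10).json()
    top = requests.get(f"{base}/contributors", params={"per_page": 3}, timeout=10).json()
    return {
        "stars": meta["stargazers_count"],
        "forks": meta["forks_count"],
        "open_issues": meta["open_issues_count"],   # note: GitHub counts open PRs here too
        "last_push": meta["pushed_at"],             # close to, but not identical to, last commit
        "top_contributors": [c["login"] for c in top],
        "license": (meta.get("license") or {}).get("spdx_id", "Not disclosed"),
    }
```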


B10: Product Hunt Launch Analysis

Objective: Navigate to the Product Hunt homepage, find today's top 5 products, then drill into each to extract upvotes, comment count, maker names, pricing model, tech stack mentions, first comment timestamp, and top 3 comment themes.

Required Fields: Product Name, Tagline, Upvotes, Comments, Makers, Website, Pricing Model, Launch Time, Top Comment Themes (max 50 words each), Tech Stack Mentioned, Confidence, Sources

Confidence Rules:

  • High: All data visible on product page, comments reviewed
  • Medium: Some fields inferred
  • Low: Data incomplete/comments inaccessible

Edge Cases:

  • Comments require login → mark "Limited access"
  • Tech stack not mentioned → mark "Not disclosed"

B11: Global Sports Watch Influencer Campaign Vetting

Scenario: You are launching an advanced sports watch globally (Western English-speaking markets). Search for 5 fitness/tech influencers on YouTube & Instagram (200K–2M followers), then deep-vet each for campaign fit.

Phase 1 (5–7 min): Find 5 influencers in fitness/sports tech niche across YouTube & Instagram. Filter by follower count, location signals, technical credibility. Exclude pure fashion/lifestyle.

Phase 2 (8–12 min): For each creator extract: Creator Name, Platform, Followers, Engagement Rate, Content Focus, Tech Credibility (1–10), Audience Geography, Recent Tech Content, Authenticity Score (1–10), Brand Safety, Campaign Fit Score (1–10), Recommendation (Tier 1/Tier 2/Not Suitable), Sources
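
The spec doesn't pin down how Engagement Rate is computed. A common convention, assumed here (state in Notes if you use a different one), is average interactions per post divided by followers:

```python
def engagement_rate(avg_likes: float, avg_comments: float, followers: int) -> float:
    """Engagement rate (%): average interactions per post divided by followers."""
    return round(100 * (avg_likes + avg_comments) / followers, 2)
```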

Confidence Rules:

  • High: Official profiles, authentic engagement, geography confirmed, tech credibility evident
  • Medium: Most data from profile, some assumptions
  • Low: Incomplete data, limited access

Edge Cases:

  • Demographics not public → estimate from comments, mark "Estimated"
  • Private account → mark "Limited access"
  • Bot followers suspected → flag in Brand Safety
  • Competitor reviews (Garmin, Whoop, Oura) → note tech verification AND potential exclusivity conflict

B12: E-Commerce Product Search

Objective: Find 3 camera tripods across Amazon, B&H Photo, and Adorama. Extract detailed specifications and pricing, then synthesize into a comparison matrix with insights.

Scenario: Researching camera tripods for a professional video production setup. Compare options across 3 major retailers to find the best value proposition. Budget: $200–400.

Phase 1 (4–6 min): Search for "professional camera tripod" or "video tripod" on Amazon.com, bhphotovideo.com, and adorama.com. Find 1 tripod per site matching: price range $200–400; suitable for professional video; in stock with real-time pricing visible; customer reviews available.

Phase 2 (8–12 min): For each tripod extract: Product Name, Retailer, Current Price, Original/List Price, Max Height, Weight Capacity, Weight (tripod), Material, Legs, Head Type, Payload, Folded Length, Carrying Case, Customer Rating, Review Count, Top Pros (2–3 most common, max 100 words), Top Cons (2–3 most common, max 100 words), Warranty, Confidence, Sources (min 2 URLs per product)

Phase 3 (2–3 min): Create a comparison summary answering the following (one way to compute these is sketched after the list):

  • Which offers best value? (price vs. payload capacity)
  • Which is most portable? (folded length + weight)
  • Which has best reviews? (rating × review count consideration)
  • Notable trade-offs between options
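
One way to make these comparisons concrete (a sketch; the metrics and weightings are assumptions, not part of the spec):

```python
import math

def value_score(price_usd: float, payload_lbs: float) -> float:
    """Best value proxy: payload capacity bought per dollar (higher is better)."""
    return payload_lbs / price_usd

def portability_score(folded_length_in: float, weight_lbs: float) -> float:
    """Portability proxy: folded length plus weight, equally weighted (lower is better)."""
    return folded_length_in + weight_lbs

def review_score(rating: float, review_count: int) -> float:
    """Rating weighted by log of review count, so many reviews outweigh a few."""
    return rating * math.log10(review_count + 1)
```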

Estimated Time: 12–15 min


Browser Output Requirements

Start your output with this header:

# [Browser Name] Benchmark Run [YYYY-MM-DD HH:MM]

The header should include:

  1. Your browser name (e.g., "Strawberry", "Atlas", "Comet")
  2. The text "Benchmark Run"
  3. Date and time in format: YYYY-MM-DD HH:MM

Complete Output Structure:

# [Your Browser Name] Benchmark Run [YYYY-MM-DD HH:MM]

## Metadata
- **Browser:** [Your Name]
- **Start Time:** YYYY-MM-DD HH:MM:SS
- **End Time:** YYYY-MM-DD HH:MM:SS
- **Total Duration:** XX.X minutes

---

## B1: YC Company Deep Dive
**Status:** Complete | **Entities Found:** 3/3 | **Duration:** X.X min

[Insert table here]

**Notes:** [Your notes]

---

Before submitting, verify:

  • Header includes browser name
  • All sections have horizontal rule separators (---)
  • Tables are complete (no truncated fields)
  • Status lines follow format: Status: X | Entities Found: X/X | Duration: X.X min

Evaluation Framework

How to Compare Browser Performance

This framework is designed for independent LLM evaluation. You are evaluating 2–3 browsers by comparing their execution of these 12 benchmarks.

Evaluation Criteria (100 points per benchmark)

Data Accuracy: 35 pts

  • Hallucinations/invented data: -20 pts each
  • Factual errors: -5 to -10 pts each
  • Must be verifiable via provided sources

Source Quality: 25 pts

  • Official sources preferred over third-party
  • Minimum 2 sources per fact
  • Poor sourcing: -5 to -15 pts

Completeness: 25 pts

  • All required fields must be filled
  • Missing/vague fields: -5 to -10 pts each
  • Incomplete or superficial work: -10 to -15 pts

Insight Quality: 10 pts

  • Thoughtful observations vs generic
  • Appropriate confidence ratings
  • Limitations clearly noted

Speed & Efficiency: 5 pts

  • 5 pts: Completed within estimated time
  • 3 pts: 30–50% over but high quality
  • 1 pt: 50%+ over, but excellent quality justifies the overrun
  • 0 pts: Incomplete, or quality sacrificed for speed

Quality Penalties (Applied Regardless of Time)

  • Hallucinated/invented data: -20 pts per instance
  • Unverifiable facts without sources: -10 to -15 pts per instance
  • Required field missing/left blank: -5 to -10 pts each
  • Vague/incomplete data when discoverable: -5 to -10 pts per instance
  • Formatting errors or sloppy execution: -3 to -5 pts per issue
  • Inconsistent/over-confident ratings: -5 to -8 pts
  • Incomplete/partial benchmark execution: -10 to -20 pts

Total Score = 100 pts max per benchmark

  • Quality focus (95 pts): Data Accuracy (35) + Source Quality (25) + Completeness (25) + Insight Quality (10)
  • Speed (5 pts): minimal weight
  • Penalties: applied independently of base scoring

A browser completing in 2 minutes with incomplete/vague data scores lower than one taking 25 minutes with thorough, verified data. Time never excuses quality shortcuts.

Duration should be tracked and reported separately for each benchmark (B1, B2, B3, etc.), not as a single total run time. Completion is what matters: a browser that completes all 12 benchmarks in 60 minutes is superior to one that completes only 3 in 15 minutes.


Overall Score Calculation

Per Benchmark: Sum of all criteria = 100 points max

Overall Browser Score: Average of all benchmarks = 100 points max

Total Score = (B1 + B2 + B3 + ... + B12) / 12

Note on non-determinism: LLM-based evaluation is inherently non-deterministic. Scores from independent runs may vary slightly. Rubric and weights are fixed; individual judge scores are approximate.
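
A minimal sketch of the arithmetic, assuming penalties are subtracted after summing the five criteria and scores are floored at zero (the rubric doesn't state a floor, so the clamp is an assumption):

```python
def benchmark_score(data: int, source: int, completeness: int, insight: int,
                    speed: int, penalties: list[int]) -> int:
    """Sum the five criteria (35+25+25+10+5 = 100 max), then subtract penalties."""
    base = data + source + completeness + insight + speed
    return max(0, base - sum(penalties))   # floor at 0 (assumption)

def overall_score(per_benchmark: list[int]) -> float:
    """Overall browser score: plain average across executed benchmarks."""
    return sum(per_benchmark) / len(per_benchmark)

# Example: 32+25+22+8+5 = 92 base, minus one -10 unverifiable-fact penalty -> 82
print(benchmark_score(32, 25, 22, 8, 5, [10]))
```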


Quality Checklist

Before submitting results, verify:

  • All source URLs are valid and working
  • Dates are in consistent format (MMM YYYY or YYYY-MM-DD)
  • Currency is noted (USD) for all funding amounts
  • Confidence ratings are consistent (High/Med/Low with clear reasoning)
  • No hallucinated data (all verifiable via sources)
  • Metadata includes start/end times
  • Markdown table formatting is clean (no broken links)
  • Notes section explains any gaps or limitations