Benchmarks Specification
Full specifications for our 12-benchmark suite. See the results article for findings from our tests in February 2026.
Use these to independently replicate and evaluate AI browser performance. Each benchmark targets a small number of entities and asks for deep research on each, which keeps cross-browser comparison straightforward. See the Browser Output Requirements section for the required output format.
Quick Start
- Read the General Requirements section (5 min)
- Execute benchmarks B1–B12 in order, following detailed instructions for each
- Document metadata (start/end time, blockers) after each benchmark
- Export your results as markdown following the Browser Output Requirements format
- Repeat steps 2–4 for each browser you're comparing. Run the full suite once per browser before scoring.
- To score your results, provide your exported output along with the Evaluation Framework section below to a third-party LLM (e.g., Google NotebookLM, Claude) with this prompt:
Score these benchmark results using the provided evaluation framework. For each of the 12 benchmarks, assign points across the 5 scoring criteria (Data Accuracy, Source Quality, Completeness, Speed & Efficiency, Insight Quality), totaling 100 points per benchmark. Return results as CSV format: Benchmark, Data_Accuracy, Source_Quality, Completeness, Speed, Insight_Quality, Total_Score.
Using a third-party LLM as evaluator reduces subjective bias and keeps scoring consistent across runs.
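For reference, the evaluator's CSV should contain one row per benchmark (B1–B12) with exactly these columns; the scores below are invented placeholders, not reference values:

```csv
Benchmark,Data_Accuracy,Source_Quality,Completeness,Speed,Insight_Quality,Total_Score
B1,32,25,22,5,8,92
B2,30,20,23,3,7,83
B12,28,22,20,5,9,84
```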
Requirements for Full Benchmark Suite
- Benchmarks 1–5: No special tools required
- Benchmarks 6–8: LinkedIn Sales Navigator or Recruiter required
- Benchmarks 7–8: CRM/ATS access required (Attio/Ashby used internally; substitute your own tools)
- Benchmarks 9–12: No special tools required
If you don't have access to Sales Navigator or a CRM, skip B6–B8 and score out of 9 benchmarks instead of 12.
Benchmark Summary
| Benchmark | Objective | Complexity |
|---|---|---|
| B1 | YC Company Deep Dive | Medium |
| B2 | Tech Stack Verification | Medium |
| B3 | SaaS Pricing & UX Test | Medium |
| B4 | VC Investment Analysis | Medium |
| B5 | Nordic Startup Intelligence | Low |
| B6 | Sales Navigator Lead Hunt | High |
| B7 | Lead Sourcing & CRM Dedup | High |
| B8 | Talent Sourcing | High |
| B9 | GitHub Repository Intelligence Mining | High |
| B10 | Product Hunt Launch Analysis | High |
| B11 | Influencer Vetting | High |
| B12 | E-Commerce Product Search | Medium |
General Requirements
Run benchmarks B1 through B12 in order. After each one, record start time, end time, duration, and any blockers. See the Browser Output Requirements section for the full output format.
For each benchmark, report:
- Did you complete without user interaction? (Yes/No)
- If No, what required approval?
- Self-Rated Autonomy: rate 1–10
B1: YC Company Deep Dive
Objective: Find and research 3 B2B SaaS companies from Y Combinator Winter 2021 (W21) batch.
Search Instructions:
- Navigate to Y Combinator's company directory (ycombinator.com/companies)
- Apply filters:
- Batch: W21 (Winter 2021)
- Industry/Category: B2B or SaaS (or related categories like "B2B Software", "SaaS", "Enterprise Software")
- Select any 3 companies from the filtered results
- Research each company comprehensively
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company Name | Official name | As listed on YC.com |
| Founder Names | All co-founders (first & last names) | "John Smith, Jane Doe" (comma-separated) |
| Founder Profiles | URL to their LinkedIn Profiles | Full URL with https:// |
| Industry | Category as listed in the YC directory | From YC profile or similar source |
| One-Liner | Official YC description | Direct quote from YC.com |
| Funding | Total raised to date | "$12.5M" (USD, use M/B notation) |
| Latest Round | Round type & date | "Series A, Jan 2024" |
| Website | Official company website | Full URL with https:// |
| Market Signal | Recent hiring, funding, or product launch + "how are they doing" | Max 50 words each (signal + assessment) |
| Glassdoor Review | Current rating if available | Score + number of reviews |
| Confidence | Data verification level | High/Medium/Low (see rules below) |
| Sources | URLs to verification sources | Min 2 URLs, comma-separated |
Confidence Rating Rules:
- High: Data verified on 2+ official sources (YC.com, company site, Crunchbase)
- Medium: Data verified on 1 official source
- Low: Data inferred or from unofficial sources
Edge Case Handling:
- If founder names unavailable → mark "Not disclosed"
- If funding undisclosed → mark "Undisclosed" (not "Unknown")
- If One-liner differs between sources → use YC.com version
- If fewer than 3 B2B SaaS companies in W21 → expand to adjacent categories (Enterprise, Developer Tools)
B2: Tech Stack Verification
Objective: Verify the primary tech stack for 2 major software companies using official sources.
Companies to Research: Notion, Linear
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company | Official name | As branded |
| Languages | 3–5 main programming languages | "TypeScript, Python, Go" (comma-separated) |
| Cloud | Primary cloud infrastructure | AWS/GCP/Azure/Vercel/Other |
| Database | Primary database(s) used | "PostgreSQL, Redis" (comma-separated) |
| Frontend | Main UI framework(s) | "React, Vue, Svelte" etc. |
| AI/ML | Any AI tools built OR integrated | "OpenAI API, custom models" or "None" |
| Dev Tools | Key developer infrastructure | "GitHub, Figma, Vercel" (max 3–4) |
| Confidence | Verification level | High/Medium/Low |
| Sources | URLs to blog posts, GitHub, StackShare | Min 2 URLs |
Confidence Rating Rules:
- High: Sourced from company engineering blog or official GitHub
- Medium: Sourced from StackShare, hiring posts, or job listings
- Low: Inferred from third-party articles
Edge Case Handling:
- If info contradicts between sources → list both versions in Notes section
- If specific tech not found → mark "Not disclosed"
- Prefer official engineering blogs over StackShare
B3: SaaS Pricing & UX Test
Objective: Research pricing AND test free tier UX for 2 project management tools.
Products to Research: Linear, ClickUp
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Product | Official product name | As branded |
| Starter Price | Lowest PAID tier price/month | "$10/mo" (USD, per month) |
| Free Tier | Does free tier exist? | "Yes" or "No" |
| Free Limit | Max users/projects in free tier | "3 users, 1 project" or "N/A" |
| iOS Rating | Current App Store rating | "4.7/5.0" (as of the run date) |
| Founded | Year company was founded | "2021" (year only) |
| Latest Funding | Most recent round & amount | "Series B, $30M, Nov 2023" |
| UX Note | Observation from testing free tier | Max 50 words |
| Confidence | Data verification level | High/Medium/Low |
| Sources | URLs to pricing, app store, Crunchbase | Min 2 URLs |
Confidence Rating Rules:
- High: All data verified on official sites + tested free tier yourself
- Medium: Official data but didn't test UX (mark "UX not tested")
- Low: Data from third-party sources only
Edge Case Handling:
- If no free tier → mark "No" for Free Tier, "N/A" for Free Limit
- If iOS app doesn't exist → mark "No iOS app"
- If unable to test free tier → mark "UX not tested" in UX Note
B4: VC Investment Analysis
Objective: Research 2 recent VC investments with context on why the investor chose that company.
Research Target: Sequoia Capital investments in 2024 (January–December)
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company | Portfolio company name | Official name |
| Date | Investment month & year | "Jan 2024" (MMM YYYY) |
| Sector | Industry/sector | "Design Tools", "Fintech" etc. |
| Stage | Seed/Series A/B/C/etc | Exact stage |
| Amount | Investment amount or "Undisclosed" | "$100M" (USD, M/B notation) or "Undisclosed" |
| Location | HQ city, country | "San Francisco, USA" |
| Thesis | Why Sequoia invested (if stated) | Max 100 words, paraphrase OK |
| Confidence | Verification level | High/Medium/Low |
| Sources | URLs to announcements, Crunchbase | Min 2 URLs |
Confidence Rating Rules:
- High: Sourced from official Sequoia announcement or company press release
- Medium: Sourced from Crunchbase or reputable news (TechCrunch)
- Low: Inferred from secondary sources
Edge Case Handling:
- If amount not disclosed → mark "Undisclosed" (not "Unknown")
- If thesis not stated → write "Thesis not disclosed"
- All dates must fall within Jan 1 – Dec 31, 2024
B5: Nordic Startup Intelligence
Objective: Research 2 Nordic startups with focus on recent activity and hiring trends.
Companies to Research: Einride, Legora
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Company | Official name | As branded |
| Total Funding | All rounds combined | "$85M" (convert to single currency: USD) |
| Latest Round | Most recent funding amount | "$25M" |
| Round Date | Month & year | "Oct 2023" (MMM YYYY) |
| Valuation | If publicly disclosed | "$400M (Series B)" or "Private" |
| Employees | Approximate headcount | "150–200" (range OK) |
| HQ | City, country | "Stockholm, Sweden" |
| Recent News | Latest hire, product, or expansion | Max 50 words |
| Confidence | Verification level | High/Medium/Low |
| Sources | Crunchbase, company announcements | Min 2 URLs |
Confidence Rating Rules:
- High: All data from Crunchbase + company announcement
- Medium: Crunchbase only, no recent confirmation
- Low: Data inferred or outdated (>12 months old)
Edge Case Handling:
- If mix of USD/EUR → convert all to USD at current rates
- If employee count varies → use most recent number or range
- If valuation never disclosed → mark "Private" (not "Unknown")
B6: Sales Navigator Lead Hunt
Objective: Find 3 B2B SaaS accounts and 2 qualified decision-maker leads per company on LinkedIn Sales Navigator.
Account Filters:
- Industry: B2B SaaS (or Software Development)
- Company Size: 50–500 employees
- Location: United States
- Funding: Series A or Series B
- Timeframe: Funded in last 2 years (2024–2026)
Lead Filters (per account):
- Job Titles: VP Sales, Director Sales, VP Revenue Ops, Director Revenue Ops, Head of Sales, CRO
- Seniority: VP level or above
Required Fields:
| Field | Description | Format/Rules |
|---|---|---|
| Account | Company name | Official name from Sales Nav |
| Size | Employee count | Specific number if shown |
| Location | HQ city/state | "San Francisco, CA" |
| Account URL | Company profile URL | Full LinkedIn URL |
| Lead Name | Individual's name | "Nate Robbins" |
| Title | Current job title | Exact title from profile |
| Lead URL | Profile URL | Full LinkedIn URL |
| Email | Was the email shown on the lead card? | Email address or "Not visible" |
| Confidence | Verification of seniority | High/Medium |
| Notes | Any verification details | Max 20 words |
Confidence Rating Rules:
- High: Title matches filter exactly (VP+), confirmed on profile
- Medium: Title close to filter but not exact match
Edge Case Handling:
- If only 1 VP-level candidate found → extract that 1, note in "Notes"
- If email not visible → mark "Not visible" (not "N/A")
- If account has <50 employees (error) → skip and find replacement
B7: Lead Sourcing & CRM Dedup with Outreach
Objective: Source B2B SaaS leads from LinkedIn Sales Navigator, deduplicate against CRM, and add to outreach campaign.
Scope: 1 account, 2 leads per account. Target companies under 500 employees in software.
Detailed Process:
- Find leads in LinkedIn Sales Navigator
- Verify in CRM (Attio) that they don't already exist
- Present a list to the user for approval
- Add the leads to the Lemlist campaign specified by the user, or create a new campaign and add them there
- Do not add to Attio unless the user specifically asks
Required Fields: Lead Name, Title, Company, LinkedIn URL, Email (if exists), CRM Match (Yes/No), CRM ID, Campaign Status, Confidence, Notes
B8: Talent Sourcing
Objective: Source qualified talent from LinkedIn, LinkedIn Recruiter, or online sources matching specific criteria, verify they don't exist in Ashby (ATS dedup), and prepare outreach with personalized messaging.
Default Talent Profile (unless user specifies otherwise):
- Senior Product Manager (role/seniority)
- German language proficiency
- PLG or SMB experience (product-led growth or small-to-medium business)
Required Fields: Name, Title, Current Company, Location, LinkedIn URL, Email (if found), Relevant Skills, German Proficiency (Yes/No), PLG/SMB Experience (Yes/No), Ashby Match (Yes/No), Ashby Record ID, Fit Score (1–10), Confidence, Notes
Workflow Steps:
- Search for talent on LinkedIn/LinkedIn Recruiter matching the criteria
- Find email and extract relevant information
- Check Ashby (ATS) to verify they don't already exist (cold prospects only)
- Compile a short list with fit scores for user approval
- Generate personalized outreach message (LinkedIn or email)
- Update Ashby with talent info and attach CV/resume PDF if available
Fit Scoring (1–10):
- 9–10: Perfect match (all criteria + strong PLG/SMB track record + German fluency)
- 7–8: Strong match (meets most criteria, clear relevant experience)
- 5–6: Partial match (meets core criteria but missing 1–2 elements)
- 1–4: Low match (meets title/role but weak on experience/language)
B9: GitHub Repository Intelligence Mining
Objective: Navigate to GitHub Trending page, extract top 5 trending repositories in "TypeScript" category (today), then drill into each repo to extract detailed metadata.
Required Fields: Repo Name, Owner, Stars, Forks, Open Issues, Last Commit (MMM DD, YYYY), Top 3 Contributors, Primary Language, Topics, Key Features (max 100 words), License, Confidence, Sources
Confidence Rules:
- High: All data from official repo pages
- Medium: Some data inferred
- Low: Data incomplete/unavailable
Edge Cases:
- Fewer than 5 TypeScript repos trending → extract what's available
- Data not found → mark "Not disclosed"
Estimated Time: 10–12 min
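If you want to spot-check the extracted numbers afterward, the public GitHub REST API exposes the same repository metadata. A minimal sketch in Python (the owner/repo names are placeholders; the trending list itself has no official API, so it still has to come from the Trending page):

```python
import requests

# Placeholder owner/repo pairs; fill in the names taken from the Trending page.
repos = ["microsoft/TypeScript", "vercel/next.js"]

for full_name in repos:
    # Unauthenticated calls are fine for a handful of repos (60 requests/hour).
    repo = requests.get(f"https://api.github.com/repos/{full_name}").json()
    top3 = requests.get(
        f"https://api.github.com/repos/{full_name}/contributors",
        params={"per_page": 3},  # contributors are returned sorted by commit count
    ).json()
    print(full_name)
    print("  Stars:", repo["stargazers_count"])
    print("  Forks:", repo["forks_count"])
    print("  Open issues:", repo["open_issues_count"])  # note: includes open PRs
    print("  Last push:", repo["pushed_at"])            # proxy for last commit date
    print("  Top contributors:", [c["login"] for c in top3])
    print("  License:", (repo["license"] or {}).get("spdx_id", "Not disclosed"))
```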
B10: Product Hunt Launch Analysis
Objective: Navigate to Product Hunt homepage, find today's top 5 products, then drill into each to extract upvotes, comments count, maker names, pricing model, tech stack mentions, first comment timestamp, and top 3 comment themes.
Required Fields: Product Name, Tagline, Upvotes, Comments, Makers, Website, Pricing Model, Launch Time, Top Comment Themes (max 50 words each), Tech Stack Mentioned, Confidence, Sources
Confidence Rules:
- High: All data visible on product page, comments reviewed
- Medium: Some fields inferred
- Low: Data incomplete/comments inaccessible
Edge Cases:
- Comments require login → mark "Limited access"
- Tech stack not mentioned → mark "Not disclosed"
B11: Global Sports Watch Influencer Campaign Vetting
Scenario: You are launching an advanced sports watch globally (Western English-speaking markets). Search for 5 fitness/tech influencers on YouTube & Instagram (200K–2M followers), then deep-vet each for campaign fit.
Phase 1 (5–7 min): Find 5 influencers in fitness/sports tech niche across YouTube & Instagram. Filter by follower count, location signals, technical credibility. Exclude pure fashion/lifestyle.
Phase 2 (8–12 min): For each creator extract: Creator Name, Platform, Followers, Engagement Rate, Content Focus, Tech Credibility (1–10), Audience Geography, Recent Tech Content, Authenticity Score (1–10), Brand Safety, Campaign Fit Score (1–10), Recommendation (Tier 1/Tier 2/Not Suitable), Sources
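Engagement rate is not defined in the spec; a common approximation (an assumption here, not a requirement) is average interactions on recent posts divided by follower count:

```python
def engagement_rate(recent_posts, followers):
    # recent_posts: list of (likes, comments) tuples from the creator's last
    # ~10 posts; the 10-post window and the formula are assumptions, not spec.
    interactions = [likes + comments for likes, comments in recent_posts]
    return sum(interactions) / len(interactions) / followers

# Example: a 350K-follower creator averaging ~14.5K interactions per post.
rate = engagement_rate([(12_000, 800), (15_500, 1_200), (13_000, 950)], 350_000)
print(f"{rate:.2%}")  # ~4.14%
```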
Confidence Rules:
- High: Official profiles, authentic engagement, geography confirmed, tech credibility evident
- Medium: Most data from profile, some assumptions
- Low: Incomplete data, limited access
Edge Cases:
- Demographics not public → estimate from comments, mark "Estimated"
- Private account → mark "Limited access"
- Bot followers suspected → flag in Brand Safety
- Competitor reviews (Garmin, Whoop, Oura) → note tech verification AND potential exclusivity conflict
B12: E-Commerce Product Search
Objective: Find 3 camera tripods across Amazon, B&H Photo, and Adorama. Extract detailed specifications and pricing, then synthesize into a comparison matrix with insights.
Scenario: Researching camera tripods for a professional video production setup. Compare options across 3 major retailers to find the best value proposition. Budget: $200–400.
Phase 1 (4–6 min): Search for "professional camera tripod" or "video tripod" on Amazon.com, bhphotovideo.com, and adorama.com. Find 1 tripod per site matching: price range $200–400, suitable for professional video, in stock with real-time pricing visible, has customer reviews available.
Phase 2 (8–12 min): For each tripod extract: Product Name, Retailer, Current Price, Original/List Price, Max Height, Weight Capacity, Weight (tripod), Material, Legs, Head Type, Payload, Folded Length, Carrying Case, Customer Rating, Review Count, Top Pros (2–3 most common, max 100 words), Top Cons (2–3 most common, max 100 words), Warranty, Confidence, Sources (min 2 URLs per product)
Phase 3 (2–3 min): Create comparison summary answering:
- Which offers best value? (price vs. payload capacity)
- Which is most portable? (folded length + weight)
- Which has best reviews? (rating × review count consideration; see the sketch below)
- Notable trade-offs between options
Estimated Time: 12–15 min
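The "rating × review count" consideration in Phase 3 is intentionally loose; one way to make it concrete is to damp the review count logarithmically so that a large count reinforces, but does not drown out, the rating. The weighting below is an assumption for illustration, not part of the spec:

```python
import math

# Hypothetical tripods: (name, average rating, review count).
tripods = [
    ("Tripod A", 4.8, 45),
    ("Tripod B", 4.5, 2300),
    ("Tripod C", 4.6, 410),
]

for name, rating, reviews in tripods:
    # Log-damped weight: 2,300 reviews counts for more than 45, but not 50x more.
    score = rating * math.log10(reviews + 1)
    print(f"{name}: {score:.2f}")
# Tripod B comes out ahead of A and C here: its slightly lower rating is backed
# by far more reviews, so it is the safer "best reviews" pick.
```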
Browser Output Requirements
Start your output with this header:
# [Browser Name] Benchmark Run [YYYY-MM-DD HH:MM]
The header should include:
- Your browser name (e.g., "Strawberry", "Atlas", "Comet")
- The text "Benchmark Run"
- Date and time in format: YYYY-MM-DD HH:MM
Complete Output Structure:
# [Your Browser Name] Benchmark Run [YYYY-MM-DD HH:MM]
## Metadata
- **Browser:** [Your Name]
- **Start Time:** YYYY-MM-DD HH:MM:SS
- **End Time:** YYYY-MM-DD HH:MM:SS
- **Total Duration:** XX.X minutes
---
## B1: YC Company Deep Dive
**Status:** Complete | **Entities Found:** 3/3 | **Duration:** X.X min
[Insert table here]
**Notes:** [Your notes]
---
Before submitting, verify:
- Header includes browser name
- All sections have horizontal rule separators (---)
- Tables are complete (no truncated fields)
- Status lines follow format: Status: X | Entities Found: X/X | Duration: X.X min
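A quick way to catch formatting slips before submission is to lint the exported markdown. A minimal sketch (the file name and the exact matching rules are assumptions, not part of the spec):

```python
import re
import sys

# Accepts the status line with or without bold markers, e.g.
# "**Status:** Complete | **Entities Found:** 3/3 | **Duration:** 4.2 min"
STATUS_RE = re.compile(
    r"\*{0,2}Status:\*{0,2}\s*.+\|\s*\*{0,2}Entities Found:\*{0,2}\s*\d+/\d+\s*"
    r"\|\s*\*{0,2}Duration:\*{0,2}\s*\d+(?:\.\d+)?\s*min"
)

def check(path: str) -> None:
    text = open(path, encoding="utf-8").read()
    status_lines = STATUS_RE.findall(text)
    separators = len(re.findall(r"^---\s*$", text, flags=re.MULTILINE))
    print(f"{path}: {len(status_lines)} status lines, {separators} separators")
    if len(status_lines) < 12:
        print("Warning: fewer than 12 status lines; a benchmark section may be missing.")

if __name__ == "__main__":
    check(sys.argv[1])  # e.g. python check_output.py atlas_run.md
```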
Evaluation Framework
How to Compare Browser Performance
This framework is designed for independent LLM evaluation. You are evaluating 2–3 browsers by comparing their execution of these 12 benchmarks.
Evaluation Criteria (100 points per benchmark)
Data Accuracy: 35 pts
- Hallucinations/invented data: -20 pts each
- Factual errors: -5 to -10 pts each
- Must be verifiable via provided sources
Source Quality: 25 pts
- Official sources preferred over third-party
- Minimum 2 sources per fact
- Poor sourcing: -5 to -15 pts
Completeness: 25 pts
- All required fields must be filled
- Missing/vague fields: -5 to -10 pts each
- Incomplete or superficial work: -10 to -15 pts
Insight Quality: 10 pts
- Thoughtful observations vs generic
- Appropriate confidence ratings
- Limitations clearly noted
Speed & Efficiency: 5 pts
- 5 pts: Completed within estimated time
- 3 pts: 30–50% over but high quality
- 1 pt: 50%+ over estimated time, but excellent quality justifies the overrun
- 0 pts: Incomplete, or quality sacrificed in order to rush
Quality Penalties (Applied Regardless of Time)
- Hallucinated/invented data: -20 pts per instance
- Unverifiable facts without sources: -10 to -15 pts per instance
- Required field missing/left blank: -5 to -10 pts each
- Vague/incomplete data when discoverable: -5 to -10 pts per instance
- Formatting errors or sloppy execution: -3 to -5 pts per issue
- Inconsistent/over-confident ratings: -5 to -8 pts
- Incomplete/partial benchmark execution: -10 to -20 pts
Total Score = 100 pts max per benchmark
- Quality focus (95 pts): Data Accuracy (35) + Source Quality (25) + Completeness (25) + Insight Quality (10)
- Speed (5 pts): minimal weight
- Penalties: applied independently of base scoring
A browser completing in 2 minutes with incomplete/vague data scores lower than one taking 25 minutes with thorough, verified data. Time never excuses quality shortcuts.
Duration should be tracked and reported separately for each benchmark (B1, B2, B3, etc.), not as a single total run time. Completion is what matters — a browser that completes all 12 benchmarks in 60 minutes is superior to one that completes only 3 in 15 minutes.
Overall Score Calculation
Per Benchmark: Sum of all criteria = 100 points max
Overall Browser Score: Average of all benchmarks = 100 points max
Total Score = (B1 + B2 + B3 + ... + B12) / 12
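As a sanity check on the evaluator's arithmetic, the per-benchmark and overall totals reduce to a sum and a mean. A minimal sketch with placeholder values (the zero floor on penalized scores is an assumption):

```python
# Per-benchmark total: the five criteria sum to at most 100, penalties are
# subtracted on top, and the result is floored at 0.
def benchmark_total(accuracy, source, completeness, insight, speed, penalties=0):
    return max(0, accuracy + source + completeness + insight + speed - penalties)

scores = {
    "B1": benchmark_total(32, 22, 23, 8, 5),                # placeholder values
    "B2": benchmark_total(30, 20, 20, 7, 3, penalties=10),  # e.g. one missing field
    # ... B3 through B12
}

overall = sum(scores.values()) / len(scores)  # averaged across all 12 in a full run
print(f"Overall browser score: {overall:.1f} / 100")
```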
Note on non-determinism: LLM-based evaluation is inherently non-deterministic. Scores from independent runs may vary slightly. Rubric and weights are fixed; individual judge scores are approximate.
Quality Checklist
Before submitting results, verify:
- All source URLs are valid and working (a link-check sketch follows this checklist)
- Dates are in consistent format (MMM YYYY or YYYY-MM-DD)
- Currency is noted (USD) for all funding amounts
- Confidence ratings are consistent (High/Med/Low with clear reasoning)
- No hallucinated data (all verifiable via sources)
- Metadata includes start/end times
- Markdown table formatting is clean (no broken links)
- Notes section explains any gaps or limitations
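The link check referenced in the first checklist item can be as simple as issuing a HEAD request per URL. A minimal sketch (the file name is a placeholder; some sites reject HEAD or automated requests, so treat failures as prompts to verify manually rather than as proof of a dead link):

```python
import re
import requests

# Pull anything that looks like a URL out of the exported markdown tables.
URL_RE = re.compile(r"https?://[^\s)|\],]+")

def check_links(path: str) -> None:
    text = open(path, encoding="utf-8").read()
    for url in sorted(set(URL_RE.findall(text))):
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = "error"
        if status != 200:
            print(f"CHECK MANUALLY: {url} ({status})")

check_links("atlas_run.md")  # placeholder file name
```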