Skyvern 2.0 recently launched!

Launch YC: Skyvern 2.0 - State-of-the-art web navigation with 85.85% on the WebVoyager eval!

"Open source AI Web Agent"

TL;DR Real-World Tests: The Skyvern team ran all of the tests in Skyvern Cloud to better represent autonomous browser operation in production (i.e., none of the runs used local machines).

Open-Sourced Results: All of the runs can be inspected through the public eval UI at https://eval.skyvern.com.

Skyvern is just getting started. Try Skyvern Cloud or Skyvern Open Source out for yourself and see Skyvern in action!

Founded by Suchintan Singh & Shuchang Zheng

The founders have been working hard, cooking up something new to share with you all!

Skyvern 2.0 scored a state-of-the-art 85.85% on the WebVoyager eval.

This is best-in-class performance among web agents, giving advanced closed-source web agents like Google's Project Mariner a run for their money.


Agent Architecture

Achieving this SOTA result required expanding Skyvern’s original architecture. Skyvern 1.0 used a single prompt operating in a loop, both making decisions and taking actions on a website. This approach was a good starting point, but it scored only ~45% on the WebVoyager benchmark: it had insufficient memory of previous actions and could not perform complex reasoning.

To solve this problem, the team built a self-reflection feedback loop into Skyvern. This resulted in 2 main changes:

  1. They added a “Planner” phase, which decomposes very complex objectives into smaller, achievable goals. This gives Skyvern a working memory of what it has already completed and what is still waiting to be finished, letting it work through long, complex prompts without increasing the hallucination rate.
  2. They added a “Validator” phase, which confirms whether or not the goals the Planner generates are successfully completed. This acts as a supervisor that checks the Task executor is achieving its objectives as expected and reports any errors/tweaks back to the Planner so it can make adjustments in real time (see the sketch after this list).
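
To make the shape of this feedback loop concrete, here is a minimal sketch of a plan/execute/validate cycle. The names (`Goal`, `plan`, `execute`, `validate`) and the stubbed LLM/browser calls are illustrative assumptions, not Skyvern's actual API:

```python
# A minimal sketch of the plan -> execute -> validate loop described above.
# All names here are illustrative stand-ins, not Skyvern's actual API;
# the LLM and browser calls are stubbed out.
from dataclasses import dataclass

@dataclass
class Goal:
    description: str
    completed: bool = False

def plan(objective: str) -> list[Goal]:
    # Placeholder for the Planner LLM call that decomposes the objective.
    return [Goal(objective)]

def execute(goal: Goal, memory: list[str]) -> str:
    # Placeholder for the Task executor that drives the browser.
    return "ok"

def validate(goal: Goal, outcome: str) -> bool:
    # Placeholder for the Validator LLM call that checks goal completion.
    return outcome == "ok"

def run(objective: str, max_steps: int = 20) -> bool:
    goals = plan(objective)
    memory: list[str] = []  # working memory: finished and pending work
    for _ in range(max_steps):
        pending = [g for g in goals if not g.completed]
        if not pending:
            return True  # every goal validated; the task is complete
        goal = pending[0]
        outcome = execute(goal, memory)
        if validate(goal, outcome):
            goal.completed = True
            memory.append(f"done: {goal.description}")
        else:
            # Supervisor feedback: report the failure to the Planner so it
            # can adjust the remaining goals in real time.
            memory.append(f"failed: {goal.description}: {outcome}")
            goals = plan(objective + "\nContext:\n" + "\n".join(memory))
    return False  # step budget exhausted
```

The key design choice in this shape of loop is that the Validator's verdict is written back into the working memory the Planner sees, so a failure produces a revised plan rather than a blind retry.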

Test Setup

All tests were run in Skyvern Cloud with an async cloud browser, using a combination of GPT-4o and GPT-4o-mini as the primary decision-making LLMs. The goal of this test was to measure real-world quality: the quality represented by this benchmark is the same as what you would experience with Skyvern’s browsers running asynchronously.
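
For a rough picture of what dispatching benchmark tasks against async cloud browsers could look like, here is a hedged sketch that runs tasks concurrently and polls for results. The endpoint, payload, and status fields are assumptions for illustration, not Skyvern Cloud's documented API:

```python
# Hedged sketch: running eval tasks concurrently against a cloud browser API.
# The endpoint, payload, and status fields below are assumptions made for
# illustration; they are not Skyvern Cloud's documented interface.
import asyncio
import httpx

API = "https://api.skyvern.example/v1/tasks"  # hypothetical endpoint

async def run_task(client: httpx.AsyncClient, prompt: str) -> dict:
    resp = await client.post(API, json={"prompt": prompt})
    resp.raise_for_status()
    task_id = resp.json()["task_id"]
    while True:  # poll until the cloud agent finishes
        status = (await client.get(f"{API}/{task_id}")).json()
        if status["state"] in ("completed", "failed"):
            return status
        await asyncio.sleep(5)

async def run_benchmark(prompts: list[str]) -> list[dict]:
    # One cloud browser session per task, all running in parallel.
    async with httpx.AsyncClient(timeout=60) as client:
        return await asyncio.gather(*(run_task(client, p) for p in prompts))
```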

💡 Why is this important? Most benchmarks are run on local browsers with a relatively safe IP address and a convincing browser fingerprint. This is not representative of how autonomous agents will run in the cloud, and the founders wanted their benchmark to reflect how agents behave in production.

In addition to the above, the team made a few minor tweaks to the dataset to bring it up to date:

  1. They removed 8 tasks from the dataset because their results are no longer valid. For example, one of the tasks asked to go to apple.com and check when the Apple Vision Pro will be released; in 2025, it’s already been released and forgotten.
  2. Many of the flight/hotel booking tasks referenced old dates. They updated both the prompt and the answer to current dates for this evaluation (a sketch of this kind of cleanup follows).
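
For illustration, a cleanup along these lines could look like the sketch below. The file name, JSONL layout, and field names are assumptions about the dataset format, and the ids and dates are placeholders:

```python
# Hedged sketch of the dataset cleanup: drop stale tasks and rewrite old
# dates. The file name, JSONL layout, and field names ("id", "ques") are
# assumptions about the format; the ids and dates below are placeholders.
import json

STALE_IDS = {"example-stale-task-id"}  # placeholder for the 8 removed tasks

def update_dates(text: str) -> str:
    # Placeholder: the real edits replaced specific outdated dates in both
    # the prompt and the expected answer.
    return text.replace("2023", "2025")

with open("WebVoyager_data.jsonl") as src, open("patched.jsonl", "w") as dst:
    for line in src:
        task = json.loads(line)
        if task["id"] in STALE_IDS:  # drop tasks whose answers have expired
            continue
        task["ques"] = update_dates(task["ques"])
        dst.write(json.dumps(task) + "\n")
```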

🔍 For the curious:

The full dataset can be seen here: https://github.com/Skyvern-AI/skyvern/tree/main/evaluation/datasets

The full list of modifications can be seen here: https://github.com/Skyvern-AI/skyvern/pull/1576/commits/60dc48f4cf3b113ff1850e5267a197c84254edf1

Test Results

Skyvern is doing something out of the ordinary. In addition to the results, they are making their entire benchmark run public.

💡 Why is this important? Most benchmarks are run behind closed doors, with impressive results published without any accompanying material to verify them. This makes it hard to understand how things like hallucinations or website drift over time affect agent performance.

The founders believe this isn’t aligned with their open-source mission, and have decided to publish the full eval results to the public.

📊 All individual run results can be seen here: https://eval.skyvern.com

🔍 The entire Eval dataset can be seen here: https://github.com/Skyvern-AI/skyvern/tree/main/evaluation/datasets

Limitations of the WebVoyager benchmark

The WebVoyager benchmark tests a variety of prompts across 15 different websites. While this is a good first step in evaluating web agents, it captures only 15 hand-picked websites out of the millions of active websites on the internet.

The team at Skyvern sees a tremendous opportunity to better evaluate web agents against one another with a more comprehensive benchmark, similar in spirit to SWE-bench.

What’s on the horizon

Browser automation is still a nascent space with tons of room for improvement. While Skyvern has achieved a major milestone in agent performance, a few important issues are next to be solved:

  1. Can they improve Skyvern’s reasoning so it operates efficiently in situations with more uncertainty? Examples include vague prompts, ambiguous or highly complex websites/tools, and websites with extremely poor UX (legacy portals).
  2. Can they give Skyvern access to more tools so it can effectively log into websites, make purchases, and behave more like a human?
  3. Can they have Skyvern remember things it has already done in the past so it can do them again at a lower price point? (A sketch of this idea follows.)
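
The third item is essentially caching: if the agent records action sequences that have already been validated, it can replay them instead of paying for fresh LLM reasoning. A hypothetical sketch of that idea, with none of these names being Skyvern's actual implementation:

```python
# Hedged sketch of "remembering" past work: cache validated action sequences
# keyed by (site, goal) and replay them instead of re-planning with an LLM.
# Everything here is hypothetical, not Skyvern's implementation.
import hashlib

def plan_with_llm(site: str, goal: str) -> list[dict]:
    # Placeholder for the expensive LLM planning path.
    return [{"action": "click", "selector": "#example"}]

class ActionCache:
    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    def _key(self, site: str, goal: str) -> str:
        return hashlib.sha256(f"{site}|{goal}".encode()).hexdigest()

    def get(self, site: str, goal: str) -> list[dict] | None:
        return self._store.get(self._key(site, goal))

    def put(self, site: str, goal: str, actions: list[dict]) -> None:
        self._store[self._key(site, goal)] = actions

def run_goal(cache: ActionCache, site: str, goal: str) -> list[dict]:
    cached = cache.get(site, goal)
    if cached is not None:
        return cached  # replay: no LLM tokens spent
    actions = plan_with_llm(site, goal)
    cache.put(site, goal, actions)  # remember validated actions for next time
    return actions
```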

Learn More

🌐 Visit www.skyvern.com to learn more.
👣 Follow Skyvern on LinkedIn and X.

Posted January 19, 2025 in Launch
