WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

Abstract

Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent’s ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions.

To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web-agent performance on website security and privacy tasks. WebSP-Eval comprises (1) a manually crafted task dataset of 200 task instances across 28 websites; (2) a robust agentic system supporting account and initial-state management across runs using a custom Google Chrome extension; and (3) an automated evaluator. We evaluate a total of 8 web-agent instantiations using state-of-the-art multimodal large language models, conducting a fine-grained analysis across websites, task categories, and UI elements. Our evaluation reveals that current models have limited autonomous exploration capabilities, which prevents them from reliably solving website security and privacy tasks, and that they struggle with specific task categories and websites. Crucially, we identify stateful UI elements such as toggles and checkboxes as a primary cause of agent failure: many models fail more than 45% of the tasks that contain these elements.

Overview

Can web agents keep your accounts private and secure?

Web users routinely make small but consequential security & privacy decisions — managing cookies, tightening profile visibility, turning off ad personalization, or revoking old sessions. As web agents begin acting on our behalf, we ask whether they can make these decisions reliably. We study these website security and privacy tasks across nine categories:

Account Security & Access Control
Cookie & Tracking Consent Management
Advertising & Personalization Control
Notification & Communication Preferences
Profile Visibility & Customization
Social Safety & Content Moderation
User Privacy & Data Rights
Data & Asset Management
UI/UX Preferences

These tasks change a website’s state, which lives on the server, so a faithful comparison requires every run to start from the same initial state S₀. Creating a fresh account for every run does not scale, and live sites change constantly. WebSP-Eval addresses this through three modules (Task Curation, Agent Instantiation, and Automated Verification), detailed in the methodology below.

To the best of our knowledge, WebSP-Eval is the first benchmark to evaluate web agents on live websites tied to user accounts.

Methodology

How WebSP-Eval works

WebSP-Eval is built from three modules, each handling one stage of an evaluation run: curating tasks with a defined initial state, instantiating and running the agent, and verifying the outcome.

Task Curation

Websites are drawn from the Tranco top sites and WebVoyager and categorized via Trellix TrustedSource; sites that require PII, MFA, or biometric verification are filtered out. Two authors draft instructions grounded in NCSC, NIST, and FTC guidance. Each task pairs a user query with a rigorously defined initial state S₀: binary settings get both ON and OFF initial states, and multi-choice settings start from the least-private option. This yields 200 instances (138 distinct tasks) across 28 websites and 9 categories.
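The expansion from distinct tasks to task instances described above can be sketched as follows. This is a minimal illustration, not the authors' data format: the `TaskInstance` class, the `expand_task` helper, the field names, and the example site are all hypothetical, and the convention that `options` is ordered least-private first is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaskInstance:
    """One evaluation instance: a user query plus its initial state S0 (hypothetical schema)."""
    website: str
    query: str
    initial_state: str


def expand_task(website: str, query: str, kind: str,
                options: Optional[List[str]] = None) -> List[TaskInstance]:
    """Turn one distinct task into the instances implied by its setting type."""
    if kind == "binary":
        # Binary settings are evaluated from both initial states.
        states = ["ON", "OFF"]
    elif kind == "multi":
        # Assumption: options are ordered least-private first.
        if not options:
            raise ValueError("multi-choice tasks need a list of options")
        states = [options[0]]
    else:
        raise ValueError(f"unknown setting kind: {kind}")
    return [TaskInstance(website, query, s) for s in states]


binary = expand_task("example.com", "Turn off ad personalization", "binary")
multi = expand_task("example.com", "Make my profile visible to friends only",
                    "multi", ["Public", "Friends", "Only me"])
print(len(binary), len(multi))
```

A binary task contributes two instances and a multi-choice task one, which is why the number of instances exceeds the number of distinct tasks.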

Agent Instantiation

Built on WebVoyager with a Selenium backbone, the system runs each instance fully automatically: authenticate → initialize state S₀ → execute. A custom Chrome (Manifest V3) extension implements a record-and-replay mechanism with a deterministic data-websp-index and shadow-DOM–aware locators, so initial states stay reproducible even on highly dynamic pages. The action space is extended (scroll-to-end, scroll-within-popup, tab switching, shadow-DOM element detection) to mirror real user behavior.
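To illustrate why a replayed trace keeps the initial state reproducible, here is a minimal in-memory sketch. It does not use Selenium or the actual extension: the page is modeled as a plain dictionary from a `data-websp-index` value to a checked/unchecked flag, and `RecordedStep` and `replay` are hypothetical names. The choice to click only when the current state differs from the recorded one is an assumption of this sketch, not a confirmed detail of the extension.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RecordedStep:
    websp_index: int  # value of the element's data-websp-index attribute
    checked: bool     # state the element should have in S0


def replay(page: Dict[int, bool], trace: List[RecordedStep]) -> int:
    """Drive the (simulated) page to S0 and return the number of clicks issued."""
    clicks = 0
    for step in trace:
        if step.websp_index not in page:
            raise KeyError(f"no element with data-websp-index={step.websp_index}")
        # Assumption: click only when the current state differs from the target,
        # so replaying the same trace twice leaves the page unchanged.
        if page[step.websp_index] != step.checked:
            page[step.websp_index] = step.checked
            clicks += 1
    return clicks


trace = [RecordedStep(0, True), RecordedStep(1, False)]
page = {0: False, 1: False, 2: True}  # element 2 is not part of the trace

first = replay(page, trace)
second = replay(page, trace)
print(first, second, page)
```

In this sketch the second replay issues no clicks, which is the property that lets every run start from the same S₀ regardless of what the previous run changed.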

Automated Verification

An MLLM-as-a-judge receives the user prompt, the agent’s entire trajectory (actions + screenshots), and a manually annotated ground-truth action sequence, then returns a binary CORRECT / INCORRECT verdict with reasoning. The final judge is a majority-vote ensemble of Gemini-3.1-Pro, Claude-Opus-4.6, and GPT-5.2, reaching a 96.0% F1 and 95.5% agreement against human annotations.
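The ensemble step and its comparison against human labels can be sketched as below. The verdict strings follow the text; everything else is illustrative. In particular, treating CORRECT as the positive class for F1 is an assumption, the function names are hypothetical, and the toy verdicts are made up for the example rather than taken from the evaluation.

```python
from typing import List, Tuple


def majority_vote(verdicts: List[str]) -> str:
    """Return CORRECT if more than half of the judges say CORRECT."""
    votes = sum(v == "CORRECT" for v in verdicts)
    return "CORRECT" if 2 * votes > len(verdicts) else "INCORRECT"


def f1_and_agreement(predicted: List[str], human: List[str],
                     positive: str = "CORRECT") -> Tuple[float, float]:
    """F1 for the positive class and raw agreement with human annotations."""
    pairs = list(zip(predicted, human))
    tp = sum(p == positive and h == positive for p, h in pairs)
    fp = sum(p == positive and h != positive for p, h in pairs)
    fn = sum(p != positive and h == positive for p, h in pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    agreement = sum(p == h for p, h in pairs) / len(pairs)
    return f1, agreement


# Toy data: three judge verdicts per instance (made up for illustration).
judge_verdicts = [
    ["CORRECT", "CORRECT", "INCORRECT"],
    ["INCORRECT", "INCORRECT", "CORRECT"],
    ["CORRECT", "CORRECT", "CORRECT"],
    ["INCORRECT", "CORRECT", "INCORRECT"],
]
human_labels = ["CORRECT", "INCORRECT", "CORRECT", "CORRECT"]

ensemble = [majority_vote(v) for v in judge_verdicts]
f1, agreement = f1_and_agreement(ensemble, human_labels)
print(ensemble, f1, agreement)
```

With three judges a tie is impossible, so the vote always yields a binary verdict.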

Setting Consistent Initial State

Replaying a recorded trace to enforce the initial state S₀ on a Quora settings page.

The WebSP-Eval Chrome extension setting a website's initial state by replaying recorded toggle and checkbox actions on a Quora privacy and settings page.

Results

What we found

We instantiate the agent with eight backbone MLLMs and evaluate all 200 instances across four research questions, measuring success rate and failure rate (explicit mistakes plus timeouts / iteration-limit hits).

RQ1 — Exploration. Top-tier models do well (Gemini-3-Pro: 84.5% with navigation, 82.5% without), but forcing autonomous exploration degrades performance for seven of the eight models in Table 1. Gemini-2.5-Flash loses the most instances (127 to 106 successes in Table 1, a 16.5% relative fall when navigation is removed). When a model succeeds on only one prompt variant, it is almost always the one with explicit navigation.
RQ2 — Websites & categories. Performance is highly sensitive to site-specific UI. Even the best models stumble on sites like Steam (7 of 8 models below 50%) and on categories like UI/UX Preferences and Profile Visibility & Customization. Lower-ranked models occasionally win locally (e.g. Claude-Haiku-4.5 solves all of Coursera).
RQ3 — UI elements & state. Agents navigate links and buttons reliably but fail on stateful elements: Gemini-2.5-Flash fails 46.9% of toggle tasks. Models also show a strong bias to act even when the initial state already matches the target, succeeding far more often when a toggle starts ON than OFF.
RQ4 — Robustness. Consistency is the real bottleneck. On the strict pass₃ metric (success on all three trials), the best models — Claude-Sonnet-4.5 and Gemini-2.5-Pro — clear only 22–23% of tasks; the open-weight Gemma-3-27B reaches just 2%.
Failure analysis. Manual review of failures surfaces five modes — incorrect navigation, state misunderstanding, hallucinated success, repeated actions, and partial completion. We also observed a potentially destructive failure: Gemini-2.5-Flash deactivated a Pinterest account when merely asked to sign out.
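The gap between average success and the strict pass₃ metric in RQ4 is easy to see in a small calculation. The sketch below assumes per-task trial outcomes are stored as lists of booleans; the function names and the toy outcomes are hypothetical and not drawn from the reported results.

```python
from typing import Dict, List


def pass_all(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved on every trial (pass_3 when each task has three trials)."""
    return sum(all(runs) for runs in trials.values()) / len(trials)


def mean_success(trials: Dict[str, List[bool]]) -> float:
    """Fraction of individual trials that succeeded, pooled over all tasks."""
    total = sum(len(runs) for runs in trials.values())
    return sum(sum(runs) for runs in trials.values()) / total


# Toy outcomes for four tasks, three trials each (made up for illustration).
outcomes = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}

strict = pass_all(outcomes)
pooled = mean_success(outcomes)
print(strict, pooled)
```

A single failed trial removes a task from the strict count, so pass₃ can sit well below the pooled success rate even when most individual runs succeed.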

**Table 1.** Overall success out of 200 instances per backbone model (selected results). “WithNav” includes navigation hints in the prompt; “W/oNav” gives the task instruction only.
| Backbone model | WithNav successes | W/oNav successes |
| --- | --- | --- |
| Gemini-3-Pro-Preview | 169 | 165 |
| Gemini-2.5-Pro | 131 | 122 |
| Claude-Sonnet-4.5 | 117 | 122 |
| Gemini-2.5-Flash | 127 | 106 |
| Claude-Haiku-4.5 | 117 | 106 |
| GPT-5.1 | 108 | 88 |
| GPT-5-Mini | 91 | 87 |
| Gemma-3-27B (open-weight) | 50 | 40 |
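For readers who prefer rates to raw counts, the short sketch below converts the Table 1 counts into success percentages and the change in percentage points when navigation hints are removed. The counts are copied from the table; the variable names and the choice to report the difference in percentage points are illustrative.

```python
TOTAL_INSTANCES = 200

# (WithNav successes, W/oNav successes), copied from Table 1.
counts = {
    "Gemini-3-Pro-Preview": (169, 165),
    "Gemini-2.5-Pro": (131, 122),
    "Claude-Sonnet-4.5": (117, 122),
    "Gemini-2.5-Flash": (127, 106),
    "Claude-Haiku-4.5": (117, 106),
    "GPT-5.1": (108, 88),
    "GPT-5-Mini": (91, 87),
    "Gemma-3-27B": (50, 40),
}

rates = {}
for model, (with_nav, without_nav) in counts.items():
    with_pct = 100 * with_nav / TOTAL_INSTANCES
    without_pct = 100 * without_nav / TOTAL_INSTANCES
    # Positive drop means the model did worse without navigation hints.
    rates[model] = (with_pct, without_pct, with_pct - without_pct)

for model, (with_pct, without_pct, drop) in rates.items():
    print(f"{model:22s} {with_pct:5.1f}% {without_pct:5.1f}% {drop:+5.1f} pts")
```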