Compliance · 12 min read

Inside Automated Cookie Scanning: How CDP, Headless Browsers, and Drift Detection Actually Work

Most CMPs scan your site once and hand you a list. This article tears open the machinery: Chrome DevTools Protocol instrumentation, headless browser orchestration, three-channel client-side drift detection, and the auto-promote loop that keeps a cookie inventory accurate without human intervention.

If you've ever used a CMP's cookie scanner, you probably saw a list appear after a few minutes: cookie names, domains, expiry values, maybe a category guess. What you didn't see is what happened between clicking "scan" and getting that list. Most vendors treat their scanner as a black box, and for good reason — the implementation is where differentiation lives. This article opens the box. We'll walk through the actual protocol-level instrumentation, browser orchestration, and client-side detection systems that separate a real scanner from a glorified database lookup.

This is written for developers and privacy engineers who want to understand (or build) the machinery, not for marketers who want a feature checklist. If you want a higher-level comparison of automated vs manual approaches, see Cookie Scanning vs Manual Audit.

The Foundation: Chrome DevTools Protocol

Every serious cookie scanner runs a real browser — a full Chromium instance executing JavaScript exactly as a visitor's browser would. Most cookies don't exist in the HTML. They're created by JavaScript that runs after the page loads, by third-party scripts that chain-load other scripts, and by server responses to XHR requests. You can't discover these by parsing markup.

The key technology is the Chrome DevTools Protocol (CDP) — the same WebSocket-based RPC interface that powers the DevTools panel you open with F12. CDP gives programmatic access to everything happening inside the browser: network traffic, DOM state, JavaScript execution, storage APIs, and security contexts. Three CDP domains matter most for scanning:

Network domain. Network.requestWillBeSent fires before every HTTP request the browser makes, including third-party scripts, tracking pixels, and ad network calls. Each event carries the request URL, headers, and initiator chain (which script triggered it). Network.getAllCookies returns every cookie the browser holds, including httpOnly cookies invisible to document.cookie (newer protocol versions mark it deprecated in favor of Storage.getCookies, which returns the same data). A scanner that reads only document.cookie will miss every httpOnly cookie, which typically includes session cookies and auth tokens. CDP sees them all.

DOMStorage domain. Enumerates localStorage and sessionStorage key-value pairs along with their owning origin. Under the ePrivacy Directive, these storage mechanisms require the same consent as cookies when used for non-essential purposes.

Runtime domain. Lets scanners inject JavaScript to scroll pages (triggering lazy-loaded content), accept consent banners programmatically, and intercept storage API calls to attribute writes to specific scripts.
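To make the httpOnly point from the Network domain concrete, here is a minimal sketch that splits a cookie list into what a document.cookie reader could see and what only CDP exposes. The function name and sample data are hypothetical; the input objects only mirror the name, domain, and httpOnly fields that CDP returns for each cookie.

```javascript
// Each entry mirrors a subset of the fields CDP returns per cookie.
// splitByScriptVisibility is a hypothetical helper, not part of any library.
function splitByScriptVisibility(cookies) {
  const scriptVisible = [];
  const httpOnlyOnly = [];
  for (const cookie of cookies) {
    // httpOnly cookies never appear in document.cookie.
    (cookie.httpOnly ? httpOnlyOnly : scriptVisible).push(cookie.name);
  }
  return { scriptVisible, httpOnlyOnly };
}

// Hypothetical sample standing in for a real CDP response.
const sample = [
  { name: "_ga", domain: ".example.com", httpOnly: false },
  { name: "session_id", domain: "example.com", httpOnly: true },
];
const result = splitByScriptVisibility(sample);
console.log(result.httpOnlyOnly); // names a document.cookie scanner would miss
```

In a real scan with a recent Puppeteer version, the input would come from something like `const client = await page.createCDPSession()` followed by `const { cookies } = await client.send("Network.getAllCookies")`.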

Headless Browser Orchestration

CDP is the protocol; Puppeteer and Playwright are the most common libraries that speak it. Puppeteer (Google's Node.js library) is a thin layer over CDP: you can open a raw CDP session and subscribe to events such as Network.requestWillBeSent to receive the protocol payload directly. It is used mostly with Chromium, which is usually enough for scanning, with one caveat: cookie behavior is not identical across browsers, because Safari and Firefox apply their own tracking-prevention rules to cookie lifetime and third-party access. Playwright (Microsoft) adds a higher-level abstraction and supports Firefox and WebKit too; its advantage for scanners is mature browser context isolation, where each context gets its own cookie jar and storage, which is useful for simulating multiple visitor profiles.

The Lambda Constraint

Running a full Chromium instance requires 200-400MB per browser plus 50-100MB per tab. In a serverless environment like AWS Lambda, you're working within a configured memory ceiling (Lambda allows up to 10GB per function, though a 1-3GB setting is a common way to keep per-scan cost low) and a hard 15-minute execution limit. This imposes real constraints:

  • You need a purpose-built Chromium binary. Projects like @sparticuz/chromium provide stripped-down builds optimized for Lambda's deployment package limits.
  • Concurrent tabs must be bounded by available memory. CookieBeam's scanner dynamically adjusts parallelism — ramping from 2 to 10 tabs when memory allows, backing off when it doesn't.
  • The pay-per-invocation model means scanning costs scale linearly with usage. Per-scan costs stay under $0.01 for most sites while running a full browser with CDP instrumentation.
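The memory arithmetic behind bounded parallelism can be sketched as follows. This is an illustration only, not CookieBeam's actual algorithm: the function name and defaults are assumptions, with the defaults taken from the upper ends of the ranges quoted above (400MB for the browser, 100MB per tab) and the 2-to-10 tab range from the list.

```javascript
// All sizes in megabytes. maxConcurrentTabs is a hypothetical helper.
function maxConcurrentTabs(availableMb, opts = {}) {
  const { browserMb = 400, perTabMb = 100, minTabs = 2, maxTabs = 10 } = opts;
  // Tabs that fit after reserving memory for the browser process itself.
  const fit = Math.floor((availableMb - browserMb) / perTabMb);
  // Clamp into the 2..10 range described in the text.
  return Math.max(minTabs, Math.min(maxTabs, fit));
}

console.log(maxConcurrentTabs(1024)); // 6  -> floor((1024 - 400) / 100)
console.log(maxConcurrentTabs(3008)); // 10 -> capped at maxTabs
```

Note that the lower clamp means a very small memory budget still yields 2 tabs; a production scanner would presumably also watch actual memory usage at runtime and back off, as the text describes.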

What Scanners Actually Detect

A common misconception is that cookie scanners detect only cookies. A better description: they capture every side effect a web page produces in the browser's storage and network layers.

HTTP cookies — both server-set (via Set-Cookie headers, including httpOnly) and client-set (via document.cookie). For each cookie, a scanner records name, domain, path, expiry, httpOnly/secure/sameSite attributes, and — by correlating with the initiator chain — which script or server set it.

localStorage, sessionStorage, and IndexedDB: client-side stores used by tracking scripts for cross-session identification, A/B test bucketing, and analytics state. The first two are simple key-value stores; IndexedDB is a more capable client-side database used by some analytics and fingerprinting libraries. All three fall under the same ePrivacy "storage of information" clause, and comprehensive scanners enumerate all of them.

Third-party scripts — a <script> tag loading from connect.facebook.net is a finding even if it hasn't set a cookie yet. Script detection catalogs URLs, domains, element types (script, iframe, or tracking pixel via the 1x1 image heuristic), and baseline status.

Outbound network connections — the most recent frontier. A page can contact a tracking domain without setting a cookie, via fetch(), XHR, navigator.sendBeacon(), WebSocket, or EventSource. CookieBeam classifies each connection against a maintained database of known tracking domains — Google Analytics, Meta, Microsoft Clarity, Hotjar, LinkedIn, TikTok, and dozens more — assigning a vendor and consent category.
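Classifying a connection against a domain database comes down to a suffix match on the hostname. The sketch below assumes a tiny hypothetical table: the vendor names come from the list above, while the domain-to-category assignments and the function name are illustrative assumptions rather than CookieBeam's real data.

```javascript
// Hypothetical slice of a tracking-domain database (assumed mapping).
const TRACKING_DOMAINS = [
  { suffix: "google-analytics.com", vendor: "Google Analytics", category: "analytics" },
  { suffix: "clarity.ms", vendor: "Microsoft Clarity", category: "analytics" },
  { suffix: "hotjar.com", vendor: "Hotjar", category: "analytics" },
  { suffix: "facebook.net", vendor: "Meta", category: "marketing" },
];

function classifyConnection(url) {
  const host = new URL(url).hostname.toLowerCase();
  // Match the domain itself or any subdomain; the leading dot prevents
  // "notgoogle-analytics.com" from matching "google-analytics.com".
  const match = TRACKING_DOMAINS.find(
    (d) => host === d.suffix || host.endsWith("." + d.suffix)
  );
  return match
    ? { host, vendor: match.vendor, category: match.category }
    : { host, vendor: null, category: "unknown" };
}

console.log(classifyConnection("https://connect.facebook.net/en_US/fbevents.js").vendor);
```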

The Scan Flow: From Seed URL to Converged Inventory

Loading the homepage isn't enough. Different pages load different scripts, which set different cookies. A competent scanner covers the site's template diversity in three phases:

Phase 1: Consent and seed. Load the seed URL with a clean browser profile. Detect and programmatically accept the consent banner (CookieBeam supports its own plus OneTrust, Cookiebot, CookieYes, Complianz, Iubenda, Quantcast, and generic patterns) so all categories activate. Scroll to trigger lazy-loaded content. Wait for network activity to settle.

Phase 2: Page discovery. Fetch robots.txt, locate the sitemap (including nested sitemapindex files), and build a page list. No sitemap? Fall back to DOM link extraction, prioritizing high-value paths like /shop, /cart, /blog, and /login.
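The DOM-link fallback can be approximated with a small, pure function: resolve each href against the seed URL, keep same-origin pages, de-duplicate, and move the high-value paths to the front. The function name is hypothetical and the prefix list simply reuses the examples above; a real scanner would likely use a richer scoring scheme.

```javascript
const HIGH_VALUE_PREFIXES = ["/shop", "/cart", "/blog", "/login"];

// prioritizeLinks is a hypothetical helper for the no-sitemap fallback.
function prioritizeLinks(hrefs, seedUrl) {
  const origin = new URL(seedUrl).origin;
  const seen = new Set();
  const pages = [];
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, seedUrl); // resolves relative links
    } catch {
      continue; // skip malformed hrefs
    }
    if (url.origin !== origin) continue; // stay on the scanned site
    url.hash = ""; // "#section" variants are the same page
    if (seen.has(url.href)) continue;
    seen.add(url.href);
    pages.push(url);
  }
  const rank = (u) =>
    HIGH_VALUE_PREFIXES.some((p) => u.pathname.startsWith(p)) ? 0 : 1;
  // Array.prototype.sort is stable, so ties keep document order.
  return pages.sort((a, b) => rank(a) - rank(b)).map((u) => u.href);
}
```

For example, given the links /about, /cart, and /blog/post on https://example.com/, the cart and blog pages would be scanned before the about page.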

Phase 3: Parallel scanning with convergence. Pages are grouped by URL prefix and scanned in parallel. Within each group, the scanner stops after three consecutive pages produce zero new findings. No hard page cap — unlike scanners that impose 500-page limits regardless of whether the cookie surface has been fully explored.
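The convergence rule is easy to express as a small state machine. This is a minimal sketch, assuming findings are represented as strings (for example cookie names or script fingerprints); the class name is hypothetical.

```javascript
// Tracks one URL-prefix group and signals when scanning can stop.
class ConvergenceTracker {
  constructor(quietPagesToStop = 3) {
    this.quietPagesToStop = quietPagesToStop;
    this.known = new Set(); // every finding seen so far in this group
    this.quietStreak = 0;   // consecutive pages with zero new findings
  }

  // Record one page's findings; returns true once the group has converged.
  observe(findings) {
    let fresh = 0;
    for (const finding of findings) {
      if (!this.known.has(finding)) {
        this.known.add(finding);
        fresh++;
      }
    }
    // Any new finding resets the streak; a quiet page extends it.
    this.quietStreak = fresh === 0 ? this.quietStreak + 1 : 0;
    return this.quietStreak >= this.quietPagesToStop;
  }
}
```

A crawler would create one tracker per group, call observe() after each page, and stop pulling pages from that group as soon as it returns true.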

The Problem Periodic Scanning Can't Solve

A deep scan is a snapshot. Between scans, your site changes: a new pixel via the tag manager, a plugin auto-update, a rotated cookie name. That window, in which undisclosed cookies are active in production, is exactly what regulators look for. CNIL's 2025 SHEIN fine and the ICO's mass warning campaign both cite cookies set before or without consent. A scanner that runs monthly can't protect you from something that appeared yesterday. This is where client-side drift detection enters.

Three-Channel Drift Detection

Drift detection flips the observation model. Instead of a scanner visiting your site, your visitors' browsers become the sensors. The consent banner script — already loaded on every page — carries a baseline of known items compiled from the last deep scan. On each page load, it compares what it sees against that baseline and reports anything new.

CookieBeam implements three independent drift channels, each watching a different surface:

Channel 1: Cookie Drift

The simplest channel. On each page load, the script reads document.cookie, extracts cookie names, and compares them against the known cookie baseline. Any name not in the baseline is a drift finding. Reports go to /api/scan/drift.

This channel has a known limitation: document.cookie can't see httpOnly cookies. Those are only visible to CDP during a deep scan. Cookie drift catches client-set cookies from JavaScript — which is where most third-party tracking cookies originate.

The script checks at five strategically timed moments: page load (inline scripts and server responses), 500ms after consent is granted (scripts that activate immediately post-consent), 2 seconds after load (slow third-party scripts), 10 seconds after load (deferred analytics and advertising scripts), and when the tab regains focus (background-triggered cookies).
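The comparison at the heart of this channel is a set difference between the names in document.cookie and the baseline. A minimal sketch, with a hypothetical function name and made-up sample values:

```javascript
// cookieString has the document.cookie format: "name=value; name2=value2".
function findCookieDrift(cookieString, baseline) {
  const known = new Set(baseline);
  const drift = new Set();
  for (const pair of cookieString.split(";")) {
    const eq = pair.indexOf("=");
    // Everything before the first "=" is the name (values may contain "=").
    const name = (eq === -1 ? pair : pair.slice(0, eq)).trim();
    if (name && !known.has(name)) drift.add(name);
  }
  return [...drift];
}

// In a browser this would be findCookieDrift(document.cookie, baseline).
console.log(
  findCookieDrift("_ga=GA1.1.123; theme=dark; _fbp=fb.1.456", ["_ga", "theme"])
); // only "_fbp" is reported
```

Each of the five check moments described above would call the same function; only the trigger differs.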

Channel 2: Script Drift

This channel scans the DOM for <script src>, <iframe src>, and <img src> elements (with a 1x1 pixel heuristic for tracking images). URLs are normalized — stripping query parameters, replacing content-hash segments with wildcards — and compared against the known script fingerprint baseline. First-party scripts, CookieBeam scripts, and browser extension URLs are excluded. Reports go to /api/scan/script-drift.
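URL normalization might look like the sketch below. The definition of a "content hash" is an assumption made for illustration (a run of eight or more hex characters between separators, which is what most bundlers emit); the source does not specify CookieBeam's actual rule, and the function name is hypothetical.

```javascript
// ASSUMPTION: a content hash is 8+ hex chars bounded by ".", "-", "_" or "/".
const HASH_SEGMENT = /(^|[./_-])[0-9a-f]{8,}(?=$|[./_-])/gi;

function scriptFingerprint(rawUrl) {
  const url = new URL(rawUrl);
  // Replace each hash-like segment with "*", keeping the separator before it.
  const path = url.pathname.replace(HASH_SEGMENT, "$1*");
  // Query string and fragment are dropped by rebuilding from origin + path.
  return url.origin + path;
}

console.log(scriptFingerprint("https://cdn.example.com/assets/app.3f2a1b9c.js?v=12"));
// https://cdn.example.com/assets/app.*.js
```

With this normalization, two deployments of the same bundle with different hashes map to one fingerprint, so a routine rebuild does not show up as drift.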

Channel 3: Connection Drift

The most sophisticated channel. It wraps five browser APIs to intercept outbound requests in real time:

  • fetch() — parses the URL, extracts the hostname, records the connection, then delegates to native fetch.
  • XMLHttpRequest.open() + .send() — intercepts at open() to capture URL and method, at send() to optionally block.
  • navigator.sendBeacon() — captures the destination URL before delegating.
  • WebSocket and EventSource — intercept the constructor to capture the destination.

Unknown third-party connections are queued as drift reports to /api/scan/connection-drift, batched with a 3-second debounce timer.
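The queue-and-debounce behavior can be sketched like this. It assumes the 3-second debounce mentioned above and the 20-item batch size described in the deduplication section; the function name, the injectable timer options, and the `send` callback (which in a browser would POST to the drift endpoint) are all hypothetical.

```javascript
// Timer functions are injectable so the queue can be exercised without
// real delays; by default the global setTimeout/clearTimeout are used.
function createDriftQueue(send, opts = {}) {
  const {
    debounceMs = 3000,
    maxBatch = 20,
    setTimer = setTimeout,
    clearTimer = clearTimeout,
  } = opts;
  const pending = new Set(); // a Set drops duplicates within one window
  let timer = null;

  function flush() {
    if (timer !== null) {
      clearTimer(timer);
      timer = null;
    }
    const items = [...pending];
    pending.clear();
    for (let i = 0; i < items.length; i += maxBatch) {
      send(items.slice(i, i + maxBatch)); // one request per batch
    }
  }

  function add(domain) {
    pending.add(domain);
    if (timer !== null) clearTimer(timer); // restart the quiet period
    timer = setTimer(flush, debounceMs);
  }

  return { add, flush };
}
```

Restarting the timer on every add means a burst of connections during page load produces a single report once the page goes quiet.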

Native Reference Capture: Defending Against Wrapper Evasion

The connection drift system captures references to the native fetch, XMLHttpRequest.prototype.open, XMLHttpRequest.prototype.send, and navigator.sendBeacon in an immediately-invoked function expression (IIFE) at initialization time. This means if a third-party script later overwrites window.fetch with its own wrapper, CookieBeam's wrapper still calls the original native function — preventing wrapper chains from silently bypassing detection. The same technique extends to child windows: window.__cb_repatchWindow() can re-instrument same-origin iframes after they load.
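A minimal sketch of native reference capture for fetch, assuming a hypothetical installFetchMonitor helper that receives the window object and a callback. It is not CookieBeam's code; it only illustrates the pattern of storing the native function in a closure so later reassignments of window.fetch cannot change what the wrapper delegates to.

```javascript
function installFetchMonitor(win, onConnection) {
  // Capture the native function once, before other scripts can replace it.
  const nativeFetch = win.fetch;

  win.fetch = function monitoredFetch(input, init) {
    try {
      // input may be a string, a URL object, or a Request-like object.
      const raw = typeof input === "string" ? input : input.url || String(input);
      onConnection(new URL(raw, win.location.href).hostname);
    } catch (_) {
      // Monitoring must never break the page's own request.
    }
    // Always delegate to the captured native, not to whatever
    // win.fetch happens to be at call time.
    return nativeFetch.call(win, input, init);
  };
}
```

In a browser this would be called as installFetchMonitor(window, reportHost) as early as possible, ideally before any third-party script runs. The same closure pattern applies to XMLHttpRequest.prototype.open, XMLHttpRequest.prototype.send, and navigator.sendBeacon.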

Deduplication and Rate Limiting

Without safeguards, drift detection on a high-traffic site would generate millions of redundant reports. Every channel implements the same three-layer deduplication:

Client-side deduplication. Each browser session tracks which items (cookie names, script fingerprints, connection domains) it has already reported using sessionStorage. If a visitor navigates ten pages and the same unknown cookie appears on each, it's reported once. The dedup key includes the banner ID and hostname, preventing cross-site interference.

Batching. Up to 20 items are combined into a single API request per channel, minimizing network overhead.

Server-side rate limiting and origin validation. The drift endpoints validate that requests originate from a domain registered in the banner's configuration. Reports from unauthorized domains are logged as security events and discarded. Per-banner caps prevent flood attacks — each banner is limited to 500 unique items in each drift table.
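The client-side layer of this scheme could be sketched as follows. The storage key layout and function names are assumptions (the text only says the key includes the banner ID and hostname), and `storage` is any object with getItem/setItem, such as window.sessionStorage.

```javascript
function createSessionDedup(storage, bannerId, hostname, channel) {
  // ASSUMPTION: key layout is illustrative, not CookieBeam's actual format.
  const key = `cb_drift:${channel}:${bannerId}:${hostname}`;

  function load() {
    try {
      return new Set(JSON.parse(storage.getItem(key) || "[]"));
    } catch (_) {
      return new Set(); // corrupted entry: start fresh
    }
  }

  // Returns only items not yet reported in this session, and records them.
  return function filterUnreported(items) {
    const reported = load();
    const fresh = [];
    for (const item of items) {
      if (!reported.has(item)) {
        reported.add(item);
        fresh.push(item);
      }
    }
    if (fresh.length > 0) {
      storage.setItem(key, JSON.stringify([...reported]));
    }
    return fresh;
  };
}
```

Because the key is scoped by banner and hostname, two banners sharing a browser session keep separate "already reported" sets, which is the cross-site isolation the text describes.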

Auto-Promote: When Drift Becomes Confirmed

This is the mechanism that makes drift detection self-correcting rather than just alerting.

When an unknown item accumulates 5 independent drift reports (deduplicated per session, so this represents 5 distinct browser sessions encountering the same item), it crosses the auto-promote threshold. Here's what happens:

  1. The drift report's hit count reaches 5.
  2. The item is inserted into the corresponding detected table — detected_cookies, detected_scripts, or detected_connections — merging it with the deep scan baseline.
  3. A CDN script rebuild is triggered automatically.
  4. The next time the banner script is served, the promoted item is included in the known baseline.
  5. Visitors stop reporting it.

Each banner can auto-promote a maximum of 50 items in each channel. Beyond that, manual review is required. This cap prevents a malicious script from flooding fabricated items into the baseline via automated drift reports.
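The threshold-and-cap logic can be modeled in a few lines. This is an in-memory sketch, assuming each call to report() represents one already-deduplicated session; the class name and return values are hypothetical, and the real system would write to the detected tables and trigger a CDN rebuild instead of updating a Set.

```javascript
const PROMOTE_THRESHOLD = 5; // distinct sessions needed to promote
const PROMOTE_CAP = 50;      // auto-promotions per banner, per channel

// In-memory stand-in for one banner's drift table in one channel.
class DriftChannel {
  constructor() {
    this.hits = new Map();     // item -> number of reporting sessions
    this.promoted = new Set(); // items merged into the baseline
  }

  // Returns "known", "pending", "promoted", or "needs_review".
  report(item) {
    if (this.promoted.has(item)) return "known";
    const count = (this.hits.get(item) || 0) + 1;
    this.hits.set(item, count);
    if (count < PROMOTE_THRESHOLD) return "pending";
    // Past the cap, items wait for a human instead of being auto-promoted.
    if (this.promoted.size >= PROMOTE_CAP) return "needs_review";
    this.promoted.add(item);
    return "promoted";
  }
}
```

Reporting the same item from six sessions yields four "pending" results, then "promoted", then "known" once the item is part of the baseline.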

The result is a feedback loop: real visitors detect a new third-party cookie or script, the system verifies it through independent observations, and it's absorbed into the baseline without human intervention. The 5-hit threshold is low enough to catch changes quickly (minutes to hours on active sites) but high enough to filter out one-off anomalies from browser extensions or client-side state corruption.

One Threshold, Three Channels

All three drift channels — cookies, scripts, and connections — share the same auto-promote architecture: threshold of 5 hits, cap of 50 promotions per banner, upsert-with-reactivation semantics (so a previously dismissed item that reappears can re-trigger promotion). This consistency simplifies reasoning about the system's behavior and ensures no channel is a blind spot.

Connection Blocking: From Detection to Enforcement

The same wrappers that detect connections can also block them. When connection blocking is enabled, blocked fetch() calls receive a synthetic 204 No Content response, blocked XHR requests trigger an error event, blocked sendBeacon() returns false, and blocked WebSocket/EventSource connections get stub objects.

Each known connection is mapped to a consent category (necessary, analytics, marketing, preferences). If the visitor hasn't consented, the connection is blocked before it leaves the browser. Unknown connections are handled by the banner's Unknown Request Policy: allow (default), block, or allow-with-consent.
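The allow-or-block decision can be written as a small pure function. The category names and policy values come from the text above; the function name is hypothetical, and the meaning of allow-with-consent is not defined in the source, so this sketch assumes it means "allow only when the visitor has granted every optional category".

```javascript
const CATEGORIES = ["necessary", "analytics", "marketing", "preferences"];

// consent: e.g. { analytics: true, marketing: false, preferences: false }
// unknownPolicy: "allow" (default) | "block" | "allow-with-consent"
function shouldBlock(category, consent, unknownPolicy = "allow") {
  if (category === "necessary") return false; // never blocked
  if (CATEGORIES.includes(category)) return consent[category] !== true;

  // Unknown connection: fall back to the banner's policy.
  if (unknownPolicy === "block") return true;
  if (unknownPolicy === "allow-with-consent") {
    // ASSUMPTION: passes only when every optional category is granted.
    return !CATEGORIES.slice(1).every((c) => consent[c] === true);
  }
  return false; // "allow"
}

console.log(shouldBlock("marketing", { analytics: true })); // true
```

A fetch wrapper could call this before delegating and, when it returns true, resolve with a synthetic response such as `new Response(null, { status: 204 })` instead of sending the request.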

This differs fundamentally from script-level blocking (which prevents the <script> tag from loading). Connection-level blocking lets a script run its UI functions while suppressing its tracking requests — and it catches connections from scripts loaded before the consent banner initialized.

How This Differs from Cookie Databases

Several CMPs — most notably Cookiebot — take a fundamentally different approach: maintaining a central database of known cookie names and their purposes. When a cookie is detected, it's looked up and assigned a category. This works for common cookies (_ga is always Google Analytics), but has structural limitations:

  • It can't classify the unknown. Niche vendors, custom analytics, or SDKs that changed their naming convention all appear as "unclassified."
  • It can't detect connections or scripts. A database maps cookie names to vendors. It says nothing about network requests that don't set cookies — which is where tracking is moving as browsers restrict third-party cookies.
  • It can't detect drift. A database is a static artifact. It doesn't know when a new cookie appeared on your site.
  • It can't capture cookie attributes. It can tell you _ga is analytics, but not whether your instance is set with httpOnly, secure, sameSite=Lax, and a 2-year expiry — those depend on your specific GA4 configuration.

Protocol-level scanning with CDP captures the actual state of the actual browser on the actual site. A database captures what cookies typically do in general. For compliance, "typically" isn't good enough.

CookieBeam's Architecture: Putting It Together

Layer 1: Lambda-powered deep scanning. Puppeteer + stripped-down Chromium on AWS Lambda. CDP instrumentation captures cookies (including httpOnly), localStorage, sessionStorage, scripts, and connections. Convergence-based crawling, no arbitrary page limits, classification against a maintained vendor database.

Layer 2: Three-channel client-side drift detection. The consent banner script carries the deep scan baseline and monitors deviations in real time across cookies, scripts, and connections. Auto-promotion at 5 hits closes the loop without human intervention.

The data flows form a cycle: deep scan → CDN script build → drift baselines → drift reports → auto-promote → detected tables → next CDN rebuild → updated baseline. The cookie inventory stays accurate between scans, automatically. For a feature-by-feature CMP comparison, see How Cookie Scanners Work.

What Scanners Still Can't Do

No scanner can determine legal purpose (that's a legal judgment, not a technical measurement) or replicate every visitor state (login status, geography, and device type all affect which scripts fire). Client-side drift detection also cannot see inside cross-origin iframes, which are separate security contexts: __cb_repatchWindow() handles same-origin child windows, but cross-origin frames remain opaque to in-page scripts. And a complete cookie inventory is necessary for compliance but not sufficient; you still need a GDPR compliance process and data processing agreements with your vendors.

Automated scanning is infrastructure, not absolution. But it's infrastructure that every compliance program needs — because you can't regulate what you haven't measured.
