Every agent-readiness checklist on the web, including the one we shipped last quarter, is somebody’s opinion about what agents should care about. We got tired of guessing. So we ran the agents and measured it.
Deep Scan v2 is a different kind of benchmark. The checklist is no longer a list of protocols we think matter. It is reverse-engineered from real agents doing real tasks: where they go, what they read, what they ignore, and what actually changes whether they finish. As a side effect of moving that work off the critical path, a full scan now completes in 20 to 30 seconds instead of minutes.
The inversion: agents are the instrument, not the checker
The old model put the AI in the wrong place. Every scan sent agents out, per domain, to role-play a checklist we wrote by hand: can an agent complete this task, can it get through auth, can it recover from an error. The checklist was our opinion, and running a multi-turn agent for each item on every scan made the scan take minutes.
v2 flips where the intelligence lives. We run agents as research on our side. In our lab we spawn real agents across thousands of intents and thousands of sites, then watch where they succeed and where they stall. From that evidence we derive the checklist and its weights. The live scan reads that derived checklist, mostly statically: fast, deterministic, reproducible. AI stays in the scan only where it cannot be precomputed.
We call each of these runs an agentic journey: one agent, one intent, one site, traced turn by turn from its first click to the moment it finishes or gives up. The lab runs them by the thousand to set the weights. But the same instrument is now yours to point at your own site. At journey.ora.ai you pick an agent and an intent and watch a real agent navigate your site live. It is the exact thing v2 learns from, run on demand for you.
The animation below is one run, simplified. An agent enters from a search, lands on the homepage, reaches the docs, and only later follows links to the structured files. Then the path it walked collapses into the checks we score.
The agent enters from a search, lands on the homepage, then the docs. It only reaches the structured files later, and only by following links. A file nothing points to is reached late or never.
This is the same shift the documentation world is having out loud. Mintlify now ships an Agent Score and an open Agent-Friendly Documentation Spec because, in their words, docs “have a new audience that never tells you if it got what it needed.” They reached it from the docs side. We reached it from the agent side. We agree on the conclusion: the thing to measure is whether an agent can find it, understand it, and use it, not whether you published the right file.
What the runs showed
The clearest finding is also the most uncomfortable for a checklist built on protocols. Agents reach known resources at wildly different rates, and the drop-off from documentation to everything else is a cliff.
[Chart: share of runs that reached each resource. Resources shown: docs pages, homepage, llms.txt, .well-known/*, openapi.json, robots / sitemap, agents.md, llms-full.txt.]
A few patterns showed up again and again, across categories and intents.
Reach is not usefulness. Weighting by how often a file gets reached would be a mistake. The agent-first formats are rarely reached, but they convert when they are. On the one site we tested that ships the full stack, the rates of use once an agent actually reached a file were: openapi.json ~72%, .well-known/* ~63%, agents.md ~42%. A file can be rare and still high-value. So v2 weights by value-when-present, and uses N/A to handle absence instead of punishing it.
Plot every resource by reach against value-when-reached and the scoring rule falls out of the picture. The number follows the vertical axis, not the horizontal one.
Green counts toward the score (verified). Amber is tracked but not scored (emerging). The score follows the y-axis, not the x-axis.
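To make the N/A rule concrete, here is a minimal sketch of value-weighted scoring. The signal names, the weights, and the `score` function are all hypothetical, chosen for illustration; they are not the published v2 weights. A check that returns `None` is treated as N/A and its weight drops out of the denominator, so an absent file does not pull the number down.

```python
from typing import Optional

# Hypothetical weights, for illustration only. Not the real v2 weights.
WEIGHTS = {"docs": 40, "homepage": 30, "llms_txt": 20, "openapi": 10}


def score(results: dict[str, Optional[float]],
          weights: dict[str, int] = WEIGHTS) -> float:
    """Return a 0-100 score from per-signal results.

    Each result is a check outcome in 0..1, or None when the signal is N/A
    (for example, the site ships no OpenAPI spec at all).
    """
    # Keep only signals that have a weight and are not N/A.
    applicable = {k: v for k, v in results.items()
                  if k in weights and v is not None}
    total_weight = sum(weights[k] for k in applicable)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[k] * v for k, v in applicable.items())
    # Renormalize over applicable weights so N/A neither helps nor hurts.
    return round(100 * weighted / total_weight, 1)


# Same site, scored two ways: spec absent (N/A) vs. spec present but failing.
no_spec = {"docs": 1.0, "homepage": 1.0, "llms_txt": 0.5, "openapi": None}
bad_spec = {"docs": 1.0, "homepage": 1.0, "llms_txt": 0.5, "openapi": 0.0}
```

With these made-up numbers, `score(no_spec)` comes out higher than `score(bad_spec)`: shipping nothing is neutral, while shipping something broken costs points.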
Agents search to find their way, not to find the answer. Most homepage and docs visits come from prior brand knowledge, not search. When web search did appear, in about 38% of runs, it mostly pointed the agent at the right URL rather than supplying the answer. The task still got completed from on-site content.
AI-native files are reached late, and only by following links. Agents hit familiar pages first (homepage around turn 1, docs around turn 1.7), then reach structured files later: llms.txt and openapi.json around turn 3, agents.md around 4.6, llms-full.txt around 6.7. A file nothing points to gets reached late or never.
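Numbers like "docs around turn 1.7" are averages of the first turn on which a run touched a resource. A minimal sketch of that calculation follows, assuming a hypothetical trace format in which each run is simply the ordered list of resources the agent fetched. The sample runs are invented, not lab data.

```python
from collections import defaultdict


def mean_first_reach(runs: list[list[str]]) -> dict[str, float]:
    """Average turn (1-based) on which each resource was first reached.

    Only runs that reached the resource at all contribute to its average.
    """
    first_turns: dict[str, list[int]] = defaultdict(list)
    for run in runs:
        seen: set[str] = set()
        for turn, resource in enumerate(run, start=1):
            if resource not in seen:
                seen.add(resource)
                first_turns[resource].append(turn)
    return {res: sum(turns) / len(turns) for res, turns in first_turns.items()}


def reach_rate(runs: list[list[str]], resource: str) -> float:
    """Share of runs that reached the resource at least once."""
    return sum(resource in run for run in runs) / len(runs)


# Invented traces, for illustration only.
SAMPLE_RUNS = [
    ["homepage", "docs", "llms.txt"],
    ["docs", "homepage"],
    ["homepage", "docs", "docs", "openapi.json"],
]
```

Keeping the two measures separate matters: a resource can have a late mean first-reach turn and a low reach rate and still be worth scoring, which is the point of weighting by value rather than reach.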
The homepage is the gateway, and the easiest place to break a run. It was reached in ~84% of runs, almost always on the first turn (~95%). If its navigation is hidden or JavaScript-only, the agent goes blind from step one.
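One rough way to see what a non-JavaScript agent gets from a homepage is to parse the raw HTML and list the links present before any script runs. The sketch below uses Python's standard `html.parser`. It is an illustration under that assumption, not the check Deep Scan itself runs, and `static_links` and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags found in static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page anchors and javascript: pseudo-links; neither
        # gives an agent a new page to fetch.
        if href and not href.startswith(("#", "javascript:")):
            self.links.append(href)


def static_links(html: str) -> list[str]:
    """Return the links an agent can follow without executing JavaScript."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Two invented homepages: one with server-rendered navigation, one JS-only.
STATIC_NAV = '<nav><a href="/docs">Docs</a><a href="/pricing">Pricing</a></nav>'
JS_ONLY = '<div id="root"></div><script src="/app.js"></script>'
```

For the JS-only page the list is empty, which is the "blind from step one" case: there is nothing in the response for the agent to follow.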
Agents guess standard paths, so meet them where they look. They probe pages like /pricing and /integrations early, from habit, and those work when they exist. But /api is the convention most sites do not serve.
/docs: the first place agents look
/pricing: probed early, from habit
/integrations: probed early, from habit
/api: guessed in ~11% of runs, found in ~2%
Whatever an agent reaches, it uses. A reachable page with stale or wrong information hurts more than having no page at all.
That last one is the sharpest. Once a file is fetched it shapes the answer almost every time, and agents tend to build their answer from the first believable page they land on. Correctness on the pages agents actually reach matters more than coverage of the pages they do not.
What this changes about the score
The score now reflects what agents do, not what a working group decided a protocol should be. Two labels carry the difference. Verified signals are the ones we have evidence agents rely on. They count toward your 0-100 score, weighted by measured impact. Emerging signals are tracked and shown, but excluded from the score until the evidence says agents use them.
Verified: counts toward your score.
docs reachability and depth: reached in ~88% of runs
homepage content without JS: the gateway, reached turn 1
llms.txt quality: lower reach, high lift when used
OpenAPI spec: N/A when absent, high value when present
Emerging: tracked and shown, excluded from the score.
auth.md: we check it, agents never reach it
agents.md: promising, one site of evidence
llms-full.txt: reached in ~5% of runs
payment variants (MPP, UCP): no measured lift yet
The honest example is auth.md. We were early to it, we wrote about why we are bullish, and we still ship the full set of checks for it. But in our runs, agents have not reached it yet. So in v2 it is emerging: visible on your scorecard, not counted against your number. The day the runs show agents leaning on it, it graduates to verified and starts to weigh. Nothing about that decision is our taste. It is the data.
llms.txt cuts the other way and shows why reach alone is the wrong dial. Google has said plainly that its AI systems do not use it, and public AI search crawlers largely ignore it. But coding agents in editors like Cursor fetch it routinely as a routing layer, and when they do, it beats scraping HTML. Lower reach, real lift. So llms.txt stays verified, weighted below docs. The line between verified and emerging is not popularity. It is proven lift.
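The verified/emerging line can be pictured as a rule over evidence rather than popularity. The sketch below is one way such a rule might look. The `SignalEvidence` fields, the `label` function, both thresholds, and every number in the examples are assumptions made for illustration, not the lab's actual criteria or measurements. Note that `reach` is recorded but deliberately plays no part in the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalEvidence:
    name: str
    reach: float  # share of runs that fetched the resource, 0..1
    lift: float   # hypothetical: gain in task completion when it was used
    sites: int    # distinct sites the evidence comes from


def label(sig: SignalEvidence, min_lift: float = 0.05, min_sites: int = 2) -> str:
    """Return "verified" or "emerging" for a signal.

    Thresholds are placeholders. The decision uses lift and breadth of
    evidence only; reach is intentionally ignored.
    """
    if sig.lift >= min_lift and sig.sites >= min_sites:
        return "verified"
    return "emerging"


# Invented evidence records that mirror the cases discussed in the text.
LLMS_TXT = SignalEvidence("llms.txt", reach=0.20, lift=0.15, sites=40)
AGENTS_MD = SignalEvidence("agents.md", reach=0.05, lift=0.20, sites=1)
AUTH_MD = SignalEvidence("auth.md", reach=0.0, lift=0.0, sites=0)
```

Under these placeholder thresholds, the low-reach, real-lift record is labeled verified, while a signal backed by a single site stays emerging until more evidence arrives.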
And it is fast now
Moving the agentic work into the lab did something we did not fully expect: it made the scan fast enough to feel instant. The 10 to 15 seconds of “initiating agent” dead air is gone, because there is no per-scan agent loop to initiate. Most checks are now a static read of a derived signal. The slowly-changing measurements that genuinely need an LLM are computed on a cadence and cached, so a live scan reads the last value instead of recomputing it.
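The "computed on a cadence and cached" pattern can be sketched as two separate paths: a scheduled job that recomputes a value only when it is stale, and a live read that never computes anything. The `CadenceCache` class, its method names, and the one-hour cadence in the example are hypothetical; the expensive `compute` callable stands in for whatever LLM-backed measurement is being cached.

```python
import time
from typing import Callable, Optional


class CadenceCache:
    """Stores the last computed value per domain, refreshed on a cadence."""

    def __init__(self, compute: Callable[[str], float], max_age: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._compute = compute   # slow measurement, e.g. an LLM call
        self._max_age = max_age   # cadence in seconds
        self._clock = clock       # injectable so the cadence is testable
        self._store: dict[str, tuple[float, float]] = {}  # domain -> (value, computed_at)

    def refresh_if_stale(self, domain: str) -> bool:
        """Run by a background job. Recomputes only when the entry is stale."""
        now = self._clock()
        entry = self._store.get(domain)
        if entry is not None and now - entry[1] < self._max_age:
            return False
        self._store[domain] = (self._compute(domain), now)
        return True

    def read(self, domain: str) -> Optional[float]:
        """Run by the live scan. Returns the last value, never recomputes."""
        entry = self._store.get(domain)
        return entry[0] if entry is not None else None


# Example wiring with a stand-in measurement and a one-hour cadence.
example_cache = CadenceCache(compute=lambda domain: 0.5, max_age=3600)
```

Splitting refresh from read is the design choice that keeps the live scan's latency independent of the model: the scan pays for a dictionary lookup, and the cost of the measurement is paid off the critical path.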
[Stat panel: full scan, end to end; time to first real score; LLM calls per scan.]
None of this makes the scan dumber. The intelligence did not disappear; it moved upstream into the research loop that sets the weights, and from there it is distilled into static checks. The handful of things that truly need judgment per scan, like the run summary and your feedback, still run an AI model. Everything else is the checklist the agents already wrote.
The lab runs daily, so the bar stays honest
The agentic web does not hold still, so neither can the benchmark. We keep the lab running daily: new intents, new sites, fresh runs. When a protocol starts showing real lift, its check graduates from emerging to verified and earns weight. When a signal we counted stops mattering, it loses it. The methodology is the loop, not a frozen list.
Run agents: thousands of intents × sites, in the lab
Measure: reach and lift per signal
Derive: weights, verified vs emerging
Score: live scans read the checklist
This is the same promise we have made since v1, now with a mechanism behind it: a score from six months ago does not mean the same thing today, and that is by design. The difference in v2 is that the bar moves because the agents moved, not because we changed our minds.
Run a fresh scan at /#scan and watch it finish in under half a minute. The full methodology, including how verified and emerging are decided, is at /methodology. If your number moved from v1, the verified/emerging split is most of the story, and every check still ships the evidence string that decided it.