
Deep Scan v2: we ran the agents, then scored what they did

ora research · Jun 17, 2026 · 6 min read

Every agent-readiness checklist on the web, including the one we shipped last quarter, is somebody’s opinion about what agents should care about. We got tired of guessing. So we ran the agents and measured it.

Deep Scan v2 is a different kind of benchmark. The checklist is no longer a list of protocols we think matter. It is reverse-engineered from real agents doing real tasks: where they go, what they read, what they ignore, and what actually changes whether they finish. As a side effect of moving that work off the critical path, a full scan now completes in 20 to 30 seconds instead of minutes.

  • 20-30s: full scan, end to end
  • ~90%: less time than v1
  • daily: the lab re-measures

The inversion: agents are the instrument, not the checker

The old model put the AI in the wrong place. Every scan sent agents out, per domain, to role-play a checklist we wrote by hand: can an agent complete this task, can it get through auth, can it recover from an error. The checklist was our opinion, and running a multi-turn agent for each item on every scan made the scan take minutes.

v2 flips where the intelligence lives. We run agents as research on our side. In our lab we spawn real agents across thousands of intents and thousands of sites, then watch where they succeed and where they stall. From that evidence we derive the checklist and its weights. The live scan reads that derived checklist, mostly statically: fast, deterministic, reproducible. AI stays in the scan only where it cannot be precomputed.

We call each of these runs an agentic journey: one agent, one intent, one site, traced turn by turn from its first click to the moment it finishes or gives up. The lab runs them by the thousand to set the weights. But the same instrument is now yours to point at your own site. At journey.ora.ai you pick an agent and an intent and watch a real agent navigate your site live: the exact thing v2 learns from, run on demand for you.

The animation below is one run, simplified. An agent enters from a search, lands on the homepage, reaches the docs, and only later follows links to the structured files. Then the path it walked collapses into the checks we score.

[Animation: one agent, one run. search → homepage (turn 1) → docs (turn 1.7) → llms.txt (turn 3) → openapi.json (turn 3) → agents.md (turn 4.6); auth.md never reached.]

The agent enters from a search, lands on the homepage, then the docs. It only reaches the structured files later, and only by following links. A file nothing points to is reached late or never.

Reach and turn figures are aggregates from our agent-journey runs. The scan never role-plays this live; it reads the checklist the runs produced.
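As a rough illustration of how aggregates like these could be computed from journey traces, here is a minimal sketch. The `Journey` record, its fields, and both helper functions are hypothetical names invented for this example, not ora's actual schema or pipeline; the only thing taken from the post is the idea of one agent, one intent, one site, traced turn by turn.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class Journey:
    """One agent, one intent, one site (hypothetical record shape)."""
    agent: str
    intent: str
    site: str
    visits: list[str]  # resources in turn order; visits[0] is turn 1


def first_reach_turn(journey: Journey, resource: str) -> int | None:
    """1-based turn at which the resource was first reached, or None if never."""
    for turn, visited in enumerate(journey.visits, start=1):
        if visited == resource:
            return turn
    return None


def aggregate(journeys: list[Journey], resource: str) -> tuple[float, float | None]:
    """Return (share of runs that reached the resource, mean first-reach turn)."""
    turns = [t for j in journeys if (t := first_reach_turn(j, resource)) is not None]
    reach = len(turns) / len(journeys) if journeys else 0.0
    return reach, (mean(turns) if turns else None)
```

Averaging the first-reach turn over many runs is one way a figure such as "turn 1.7" can come out fractional; a resource no run ever reached gets a reach of 0 and no turn at all.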

This is the same shift the documentation world is having out loud. Mintlify now ships an Agent Score and an open Agent-Friendly Documentation Spec because, in their words, docs “have a new audience that never tells you if it got what it needed.” They reached it from the docs side. We reached it from the agent side. We agree on the conclusion: the thing to measure is whether an agent can find it, understand it, and use it, not whether you published the right file.

What the runs showed

The clearest finding is also the most uncomfortable for a checklist built on protocols. Agents reach known resources at wildly different rates, and the drop-off from documentation to everything else is a cliff.

Where agents actually go (% of runs reached)

  docs pages          88%
  homepage            84%
  (the documentation cliff)
  llms.txt            46%
  .well-known/*       34%
  openapi.json        17%
  robots / sitemap     9%
  agents.md            8%
  llms-full.txt        5%

Share of real task runs that reached each resource. Documentation and the homepage dominate; every purpose-built file trails far behind.

A few patterns showed up again and again, across categories and intents.

Reach is not usefulness. Weighting by how often a file gets reached would be a mistake. The agent-first formats are rarely reached, but they convert when they are. On the one site we tested that ships the full stack, agents that reached these files used them often: openapi.json ~72% of the time, .well-known/* ~63%, agents.md ~42%. A file can be rare and still high-value. So v2 weights by value-when-present, and uses N/A to handle absence instead of punishing it.
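One way to read "N/A instead of punishing absence" is that an absent signal drops out of both the numerator and the denominator. The sketch below shows that arithmetic only; the function name and the weights and results in the usage note are made up for illustration and are not ora's real values or formula.

```python
from __future__ import annotations


def weighted_score(checks: list[tuple[float, float | None]]) -> float | None:
    """Score 0-100 from (weight, result) pairs.

    result is a value in [0, 1], or None when the check is N/A
    (for example, the site ships no OpenAPI spec at all).
    N/A checks are skipped entirely, so they neither add nor subtract.
    """
    applicable = [(w, r) for w, r in checks if r is not None]
    total_weight = sum(w for w, _ in applicable)
    if total_weight == 0:
        return None  # nothing applicable to score
    return 100 * sum(w * r for w, r in applicable) / total_weight
```

With illustrative checks `[(5, 1.0), (3, 0.5), (2, None)]` this gives 81.25. Treating the N/A check as a failure would instead give 65.0, which is the penalty the N/A rule avoids.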

Plot every resource by reach against value-when-reached and the scoring rule falls out of the picture. The number follows the vertical axis, not the horizontal one.

[Chart: Reach vs value when reached. x-axis: how often agents reach it. y-axis: value when reached. Quadrants: rare, still verified; heavy weight; watch-list; common, low lift. Points: docs, homepage, llms.txt, openapi.json, .well-known/*, robots / sitemap, agents.md, llms-full.txt, auth.md.]

Green counts toward the score (verified). Amber is tracked but not scored (emerging). The score follows the y-axis, not the x-axis.

Agents search to find their way, not to find the answer. Most homepage and docs visits come from prior brand knowledge, not search. When web search did appear, in about 38% of runs, it mostly pointed the agent at the right URL rather than supplying the answer. The task still got completed from on-site content.

AI-native files are reached late, and only by following links. Agents hit familiar pages first (homepage around turn 1, docs around turn 1.7), then reach structured files many turns later: llms.txt and openapi.json around turn 3, agents.md around 4.6, llms-full.txt around 6.7. A file nothing points to gets reached late or never.
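If agents only reach structured files by following links, a useful self-check is how many links away each file sits from the homepage, and whether anything links to it at all. The sketch below runs a breadth-first search over a link map you would build yourself. The `links` structure and both function names are assumptions for this example, not part of Deep Scan.

```python
from collections import deque


def link_depths(links: dict[str, list[str]], start: str) -> dict[str, int]:
    """Breadth-first search: minimum number of link hops from `start` to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def orphans(links: dict[str, list[str]], start: str, files: list[str]) -> list[str]:
    """Files that no chain of links from `start` ever reaches."""
    reachable = link_depths(links, start)
    return [f for f in files if f not in reachable]
```

For example, with `{"/": ["/docs"], "/docs": ["/llms.txt"]}`, `/llms.txt` sits two hops from the homepage, and an `/agents.md` that nothing links to shows up in `orphans`.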

The homepage is the gateway, and the easiest place to break a run. It was reached in ~84% of runs, almost always on the first turn (~95%). If its navigation is hidden or JavaScript-only, the agent goes blind from step one.
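A quick way to approximate what a non-rendering agent sees is to parse the raw HTML, without executing any JavaScript, and list the links that are actually present. This is a minimal sketch using Python's standard `html.parser`. The `LinkCollector` class and `static_links` helper are names made up for this example, and real agents may render pages differently.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags present in the static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip empty, fragment-only, and javascript: pseudo-links.
        if href and not href.startswith(("#", "javascript:")):
            self.hrefs.append(href)


def static_links(html: str) -> list[str]:
    """Links visible without running any scripts."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

A page whose body is only `<div id="root"></div>` plus a script tag yields an empty list here, which is roughly the "blind from step one" case.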

Agents guess standard paths, so meet them where they look. They probe pages like /pricing and /integrations early, from habit, and those work when they exist. But /api is the convention most sites do not serve.
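To check the paths agents guess on your own site, something like the following sketch would do. The path list mirrors the ones named in this post; `http_status`, `probe`, and `missing` are hypothetical helpers, and the use of HEAD requests is an assumption (some servers reject HEAD, in which case a GET would be the fallback).

```python
from __future__ import annotations

from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

GUESSED_PATHS = ("/docs", "/pricing", "/integrations", "/api")


def http_status(url: str, timeout: float = 5.0) -> int | None:
    """HTTP status for a HEAD request, or None if the request itself failed."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:  # 4xx/5xx still carry a status code
        return err.code
    except (URLError, TimeoutError):
        return None


def probe(
    base_url: str,
    paths: tuple[str, ...] = GUESSED_PATHS,
    fetch: Callable[[str], int | None] = http_status,
) -> dict[str, int | None]:
    """Map each guessed path to the status the site returns for it."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) for path in paths}


def missing(results: dict[str, int | None]) -> list[str]:
    """Paths that did not answer 200."""
    return [path for path, status in results.items() if status != 200]
```

Calling `missing(probe("https://example.com"))` lists the conventional paths a site does not serve. The `fetch` parameter is injectable so the logic can be exercised without network access.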

Agents guess where things live (probe → served?)

  /docs           the first place agents look              200
  /pricing        probed early, from habit                 200
  /integrations   probed early, from habit                 200
  /api            guessed in ~11% of runs, found in ~2%    404

Agents expect a convention most sites do not serve. The fix is cheap: serve the paths they already knock on.

Whatever an agent reaches, it uses. A reachable page with stale or wrong information hurts more than having no page at all.

That last one is the sharpest. Once a file is fetched it shapes the answer almost every time, and agents tend to build their answer from the first believable page they land on. Correctness on the pages agents actually reach matters more than coverage of the pages they do not.

What this changes about the score

The score now reflects what agents do, not what a working group decided a protocol should be. Two labels carry the difference. Verified signals are the ones we have evidence agents rely on. They count toward your 0-100 score, weighted by measured impact. Emerging signals are tracked and shown, but excluded from the score until the evidence says agents use them.

Verified

Counts toward your score.

  • docs reachability and depth: reached in ~88% of runs
  • homepage content without JS: the gateway, reached turn 1
  • llms.txt quality: lower reach, high lift when used
  • OpenAPI spec: N/A when absent, high value when present

Emerging

Tracked and shown, excluded from the score.

  • auth.md: we check it, agents never reach it
  • agents.md: promising, one site of evidence
  • llms-full.txt: reached in ~5% of runs
  • payment variants (MPP, UCP): no measured lift yet

The honest example is auth.md. We were early to it, we wrote about why we are bullish, and we still ship the full set of checks for it. But in our runs, agents never actually reach it yet. So in v2 it is emerging: visible on your scorecard, not counted against your number. The day the runs show agents leaning on it, it graduates to verified and starts to weigh. Nothing about that decision is our taste. It is the data.

llms.txt cuts the other way and shows why reach alone is the wrong dial. Google has said plainly that its AI systems do not use it, and public AI search crawlers largely ignore it. But coding agents in editors like Cursor fetch it routinely as a routing layer, and when they do, it beats scraping HTML. Lower reach, real lift. So llms.txt stays verified, weighted below docs. The line between verified and emerging is not popularity. It is proven lift.
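The post does not publish the exact rule that separates verified from emerging, so the following is only a sketch of what a lift-based rule could look like. The thresholds, the `SignalEvidence` fields, and the use of "used when reached" as a stand-in for lift are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the example concrete.
MIN_USE_RATE = 0.25  # share of reaching runs in which the file shaped the result
MIN_SITES = 2        # evidence from a single site is not enough


@dataclass
class SignalEvidence:
    name: str
    reached_runs: int  # runs in which an agent fetched the resource
    used_runs: int     # of those, runs in which it was actually used
    sites: int         # distinct sites contributing evidence


def classify(evidence: SignalEvidence) -> str:
    """Return "verified" or "emerging". Popularity (reach) is deliberately not an input."""
    if evidence.reached_runs == 0 or evidence.sites < MIN_SITES:
        return "emerging"  # never reached, or too little evidence to judge
    use_rate = evidence.used_runs / evidence.reached_runs
    return "verified" if use_rate >= MIN_USE_RATE else "emerging"
```

Under a rule of this shape, a file that is never reached stays emerging no matter how thorough its checks are, a file with strong results on a single site stays emerging until more sites confirm it, and a rarely reached file can still be verified if it is used when reached.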

And it is fast now

Moving the agentic work into the lab did something we did not fully expect: it made the scan fast enough to feel instant. The 10 to 15 seconds of “initiating agent” dead air is gone, because there is no per-scan agent loop to initiate. Most checks are now a static read of a derived signal. The slowly-changing measurements that genuinely need an LLM are computed on a cadence and cached, so a live scan reads the last value instead of recomputing it.
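The "computed on a cadence and cached" pattern can be sketched as two separate paths: a scheduled job that refreshes stale values, and a scan-time read that never calls the model. The class below is an illustrative sketch, not ora's implementation; `CachedMeasurement`, its method names, and the injectable clock are all assumptions.

```python
import time
from typing import Any, Callable


class CachedMeasurement:
    """Keeps the last computed value per key; scans read, a scheduler refreshes."""

    def __init__(
        self,
        compute: Callable[[str], Any],   # the slow call, e.g. an LLM-backed measurement
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._max_age = max_age_seconds
        self._clock = clock
        self._values: dict[str, tuple[Any, float]] = {}  # key -> (value, computed_at)

    def refresh_if_stale(self, key: str) -> bool:
        """Called on a cadence, off the scan path. Returns True if it recomputed."""
        entry = self._values.get(key)
        if entry is not None and self._clock() - entry[1] < self._max_age:
            return False
        self._values[key] = (self._compute(key), self._clock())
        return True

    def read(self, key: str) -> Any:
        """Called by the live scan: returns the last value and never computes."""
        entry = self._values.get(key)
        return entry[0] if entry is not None else None
```

Keeping `read` free of any computation is the design point: the live scan's latency no longer depends on how slow the measurement is, only on how recently the scheduler ran.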

v1 vs v2, same checks (lower is better)

                              v1          v2
  Full scan, end to end       ~2-5 min    20-30s
  Time to first real score    ~15-60s     ~6-15s
  LLM calls per scan          20-40       3-5

The agentic work moved into the lab, so the live scan reads a derived checklist instead of role-playing one per run.

None of this makes the scan dumber. The intelligence did not disappear; it moved upstream into the research loop that sets the weights and is distilled into static checks. The handful of things that truly need judgment per scan, like the run summary and your feedback, still run an AI model. Everything else is the checklist the agents already wrote.

The lab runs daily, so the bar stays honest

The agentic web does not hold still, so neither can the benchmark. We keep the lab running daily: new intents, new sites, fresh runs. When a protocol starts showing real lift, its check graduates from emerging to verified and earns weight. When a signal we counted stops mattering, it loses that weight. The methodology is the loop, not a frozen list.

The methodology is a loop (repeats daily)

  1. Run agents: thousands of intents x sites, in the lab
  2. Measure: reach and lift per signal
  3. Derive: weights, verified vs emerging
  4. Score: live scans read the checklist

↺ Back to step 1, every day, so emerging signals graduate to verified when the evidence shows up.

This is the same promise we have made since v1, now with a mechanism behind it: a score from six months ago does not mean the same thing today, and that is by design. The difference in v2 is that the bar moves because the agents moved, not because we changed our minds.

Run a fresh scan at /#scan and watch it finish in under half a minute. The full methodology, including how verified and emerging are decided, is at /methodology. If your number moved from v1, the verified/emerging split is most of the story, and every check still ships the evidence string that decided it.
