Measure GEO without lying to yourself

GEO measurement is a mess.

That does not mean it is useless. It means you should stop pretending one screenshot proves anything.

The hard part is that AI answers are probabilistic, product behavior changes, engines differ, and the prompts you test may not match what real buyers ask. Many tools do not have real user prompt data, so they approximate. Sometimes that is fine. Sometimes it creates a very expensive astrology dashboard.

The 2026 paper “Don’t Measure Once” is useful here because it treats AI search visibility as a stochastic measurement problem, not a screenshot contest.

I would treat GEO measurement as a baseline-and-trend problem, not as a precise attribution channel.

separate visibility from traffic

Start with a basic distinction:

Traffic is what arrives on your site.

Visibility is whether the brand appears, gets recommended, or gets cited inside the answer.

AI referral traffic is currently small for most sites. SE Ranking’s 2026 study put it at 0.32 percent of total traffic across its dataset. That does not mean AI visibility is irrelevant. It may mean the important moment happened before the click.

If a buyer asks an answer engine for a shortlist and never sees your brand, there may be no referral to measure.

So I would report both:

referral traffic from AI platforms
answer visibility in high intent prompts

Do not collapse them into one number. They describe different things.
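For the traffic half, a minimal sketch of how referral sessions could be tagged as coming from an AI platform. The host list is my assumption, not a complete or authoritative registry, and the function names are hypothetical. Check the hosts against what actually shows up in your own analytics.

```python
from urllib.parse import urlparse

# Assumption: an illustrative, incomplete set of referrer hosts.
# Verify and extend it from your own referral logs.
AI_REFERRER_HOSTS = {
    "chatgpt.com",
    "perplexity.ai",
    "gemini.google.com",
    "copilot.microsoft.com",
    "claude.ai",
}


def is_ai_referrer(referrer_url, hosts=AI_REFERRER_HOSTS):
    """Return True if the referrer host is a listed host or a subdomain of one."""
    host = (urlparse(referrer_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return any(host == h or host.endswith("." + h) for h in hosts)


def ai_referral_share(referrers):
    """Share of sessions whose referrer is a listed AI platform."""
    referrers = list(referrers)
    if not referrers:
        return 0.0
    return sum(is_ai_referrer(r) for r in referrers) / len(referrers)


sessions = [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=example",
    "https://www.google.com/",
    "",  # direct visit, no referrer
]
print(f"AI referral share: {ai_referral_share(sessions):.0%}")
```

This only covers the traffic number. Answer visibility needs the prompt panel, and the two should stay in separate columns.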

use a fixed prompt panel

A prompt panel is just a saved set of prompts you test repeatedly.

A small diagnostic prompt panel:

Comparison: {product} vs {competitor}; {product} alternatives
Decision: How do I choose a {category}?; {category} pricing comparison
Implementation: How to implement {use case} with {stack}?; Best {category} that integrates with {stack}
Reputation: Is {product} reliable?; {category} tools recommended on Reddit

Scoring scale: 0 absent, 1 mentioned, 2 recommended, 3 recommended with support.

For each product category, I would maintain 20 to 50 prompts across these buckets:

category selection
direct comparison
alternatives
implementation
pricing
reputation
migration

Each prompt should have:

category
buyer stage
intent level
target persona
target region or market
engines tested
sampling count

The prompt panel should change slowly. If you rewrite it every month, you cannot tell whether visibility changed or the test changed.

You can add new prompts, but keep the old ones as a core baseline.
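A panel like this can live in a spreadsheet, but here is a minimal sketch of it as data, assuming hypothetical field names, engine names, and IDs. The `core` flag is my addition to mark the stable baseline prompts. `planned_runs` expands the panel into the individual samples to collect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelPrompt:
    prompt_id: str
    text: str
    category: str          # product category, e.g. "LLM observability"
    bucket: str            # comparison, alternatives, pricing, ...
    buyer_stage: str
    intent_level: str
    persona: str
    market: str
    engines: tuple
    samples_per_engine: int = 3
    core: bool = True      # core prompts form the slow-changing baseline


def planned_runs(panel):
    """List every (prompt_id, engine, sample_number) that should be collected."""
    runs = []
    for p in panel:
        for engine in p.engines:
            for n in range(1, p.samples_per_engine + 1):
                runs.append((p.prompt_id, engine, n))
    return runs


ENGINES = ("engine_a", "engine_b")  # placeholder engine names

PANEL = [
    PanelPrompt("cmp-001", "{product} vs {competitor}", "LLM observability",
                "direct comparison", "evaluation", "high",
                "engineering lead", "US", ENGINES),
    # A newer prompt: sampled more often, not yet part of the core baseline.
    PanelPrompt("rep-001", "Is {product} reliable?", "LLM observability",
                "reputation", "validation", "high",
                "engineering lead", "US", ENGINES,
                samples_per_engine=5, core=False),
]

print(len(planned_runs(PANEL)), "samples planned")
```

Making the record frozen is a small nudge in the right direction: changing a core prompt should be a deliberate new entry, not a quiet edit.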

sample repeatedly

For a lightweight process, sample each prompt three times per engine. For important prompts, sample more.

Record:

response date
engine
prompt
sample number
brand mentioned
brand recommended
rank or position
cited URLs
cited domains
competitors mentioned
answer accuracy
notes

This is tedious. It is also where the truth lives.

If a brand appears in one out of three samples, that is different from appearing in three out of three. A single screenshot hides that difference.
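A minimal sketch of one sample record and the count that a single screenshot hides. The class and function names are hypothetical, the engine names and date are placeholders, and cited domains are derived from the cited URLs rather than typed in twice.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Sample:
    response_date: date
    engine: str
    prompt: str
    sample_number: int
    brand_mentioned: bool
    brand_recommended: bool
    position: Optional[int] = None           # rank in the answer, if any
    cited_urls: list = field(default_factory=list)
    competitors_mentioned: list = field(default_factory=list)
    answer_accurate: Optional[bool] = None   # None means not yet reviewed
    notes: str = ""

    @property
    def cited_domains(self):
        # Derived from the URLs so the two fields cannot disagree.
        hosts = (urlparse(u).hostname for u in self.cited_urls)
        return sorted({h for h in hosts if h})


def appearance_counts(samples):
    """Return {(engine, prompt): (mentions, total_samples)}."""
    counts = defaultdict(lambda: [0, 0])
    for s in samples:
        key = (s.engine, s.prompt)
        counts[key][0] += int(s.brand_mentioned)
        counts[key][1] += 1
    return {k: tuple(v) for k, v in counts.items()}


day = date(2026, 1, 15)  # placeholder date
prompt = "best LLM observability tools"
samples = [
    Sample(day, "engine_a", prompt, 1, True, False,
           cited_urls=["https://docs.example.com/guide"]),
    Sample(day, "engine_a", prompt, 2, False, False),
    Sample(day, "engine_a", prompt, 3, False, False),
    Sample(day, "engine_b", prompt, 1, True, True, position=1),
    Sample(day, "engine_b", prompt, 2, True, True, position=2),
    Sample(day, "engine_b", prompt, 3, True, False),
]

for (engine, _), (hits, total) in appearance_counts(samples).items():
    print(f"{engine}: mentioned in {hits} of {total} samples")
```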

score simply

I would start with a simple score:

0: absent
1: mentioned
2: recommended
3: recommended with a useful cited source or clear supporting reason

Then calculate:

mention rate
recommendation rate
average score
cited source count
competitor share of recommendations
accuracy issue count

Do this by prompt bucket and by engine.

The bucket view matters because not all prompts are equal. If you win brand-name prompts but lose “best X for Y” prompts, you are visible only to people who already know you. That is not the same as category visibility.
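The scoring and the per-bucket rollup can be sketched in a few lines. This assumes each sample has already been judged by a person as mentioned, recommended, and supported, and that a recommendation always counts as a mention. The function names and sample data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def score(mentioned, recommended, supported):
    """Map one sample to the 0 to 3 scale."""
    if not mentioned:
        return 0
    if not recommended:
        return 1
    return 3 if supported else 2


def summarize(rows):
    """rows: iterable of (bucket, engine, mentioned, recommended, supported)."""
    groups = defaultdict(list)
    for bucket, engine, mentioned, recommended, supported in rows:
        groups[(bucket, engine)].append(score(mentioned, recommended, supported))

    report = {}
    for key, scores in groups.items():
        n = len(scores)
        report[key] = {
            "samples": n,
            "mention_rate": sum(s >= 1 for s in scores) / n,
            "recommendation_rate": sum(s >= 2 for s in scores) / n,
            "average_score": mean(scores),
        }
    return report


rows = [
    # Brand-name prompts: the brand does well.
    ("brand", "engine_a", True, True, True),
    ("brand", "engine_a", True, True, False),
    ("brand", "engine_a", True, False, False),
    # Category prompts: the brand is mostly absent.
    ("category", "engine_a", False, False, False),
    ("category", "engine_a", False, False, False),
    ("category", "engine_a", True, False, False),
]

for (bucket, engine), stats in summarize(rows).items():
    print(bucket, engine, stats)
```

Grouping by bucket and engine before averaging is the point of the sketch. A single blended score across these six samples would hide that the category bucket has no recommendations at all.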

track source drift

AI answers can change sources over time. Some vendor reporting in the GEO market claims very high citation churn, sometimes up to 90 percent in certain contexts. I would not build a worldview around one vendor number, but source drift is clearly real enough to track. If the same prompt can return different citations across samples, the source map is part of the measurement, not a footnote.

So track domains along with brand mentions.

For each month:

top cited domains
new domains
disappeared domains
competitor-owned domains
third-party domains
your own domains

This tells you where the answer layer is getting its facts.

If a new comparison site starts appearing in several prompts and you are absent from it, that is an action item. If your docs are cited more often after a documentation rewrite, that is useful evidence, even if referral traffic barely moved.
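Month-over-month source drift is mostly set arithmetic. A minimal sketch, assuming you have a flat list of cited domains per month (one entry per citation) and your own hypothetical lists of owned and competitor-owned domains:

```python
from collections import Counter


def source_drift(previous, current, own=frozenset(), competitor=frozenset(), top_n=3):
    """Compare cited domains between two months.

    previous, current: lists of cited domains, one entry per citation.
    own, competitor: sets of domains you classify by hand (assumption).
    """
    counts = Counter(current)
    prev_set, cur_set = set(previous), set(current)

    def owner(domain):
        if domain in own:
            return "own"
        if domain in competitor:
            return "competitor"
        return "third_party"

    by_owner = Counter()
    for domain, n in counts.items():
        by_owner[owner(domain)] += n

    return {
        "top_cited": counts.most_common(top_n),
        "new": sorted(cur_set - prev_set),
        "disappeared": sorted(prev_set - cur_set),
        "citations_by_owner": dict(by_owner),
    }


# Placeholder domains for illustration only.
previous = ["docs.example.com", "reviews.example.org",
            "rival.example.net", "reviews.example.org"]
current = ["compare.example.org", "compare.example.org", "compare.example.org",
           "docs.example.com", "rival.example.net", "rival.example.net"]

drift = source_drift(previous, current,
                     own={"docs.example.com"},
                     competitor={"rival.example.net"})
for key, value in drift.items():
    print(key, value)
```

Anything in the `new` list that is cited often and does not mention you is a candidate action item.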

report uncertainty in plain language

The reporting style matters. If the market is noisy, the report should admit it.

I would include a measurement note like this:

This report is based on a fixed prompt panel sampled three times per engine between {date} and {date}. AI answers vary by session, account, location, product mode, and retrieval behavior. Treat the numbers as a visibility baseline and trend indicator, not exact market share.

This may sound less impressive than a confident dashboard. Good. I would trust it more.
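If you want a number behind that note, one option is an interval around each rate. This is a sketch using the Wilson score interval, which is my choice and not something the sources above prescribe. It assumes samples are independent, which repeated runs against the same engine may not be, so read the result as a rough sanity check.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating point error at the edges.
    return max(0.0, center - half), min(1.0, center + half)


for hits, total in [(1, 3), (3, 3), (1, 15), (11, 15)]:
    low, high = wilson_interval(hits, total)
    print(f"{hits} of {total}: {low:.0%} to {high:.0%}")
```

With three samples the interval is very wide: for 1 of 3 it runs from roughly 6 percent to 79 percent, and it overlaps the interval for 3 of 3. That is why three samples per engine is a lightweight floor, and why important prompts deserve more.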

connect metrics to actions

A GEO report that ends at “your score is 42” is not useful.

Every metric should point to one of a few action types:

publish or improve owned content
update docs
create a comparison page
add facts, citations, or benchmarks
fix outdated public information
earn third-party mentions
respond to reputation gaps
adjust positioning

Example:

Finding: In "best LLM observability tools" prompts, the brand was recommended in 1 of 15 samples. Competitor A was recommended in 11 of 15. The most cited source was {domain}, where Competitor A appears and the brand does not.

Action: Try to earn inclusion on {domain}, then publish a stronger owned comparison page covering tracing, evals, cost tracking, integrations, and deployment options.

That is much better than a chart.

do not fake attribution

The temptation will be to turn AI visibility into revenue attribution too early.

Be careful.

Some AI traffic shows up as referral traffic. Some does not. Some influence happens through brand search later. Some happens inside a sales call when a buyer says, “I saw you recommended somewhere.” Some is impossible to separate from normal SEO and content work.

I would use softer attribution until there is enough data:

AI referral sessions
assisted conversions from known AI referrers
branded search lift after visibility campaigns
sales call mentions
self-reported attribution
high intent prompt visibility over time
competitor displacement in recommendation prompts

None of these is perfect. Together, they are better than pretending the channel is as clean as paid search.

the monthly report I would use

A report shape that does not hide the uncertainty, as a monthly AI answer visibility review:

Summary: mention rate, recommendation rate, average score, main competitor gap.
Source map: top cited domains, new sources, disappeared sources, and missing inclusion targets.
Actions: owned content, third-party source work, docs cleanup, and measurement changes.

As a fill-in template:

# AI answer visibility report

Period: {month}
Category: {category}
Prompt panel: {count} prompts
Engines: {engines}
Samples: {samples}

## summary

Mention rate: {x}
Recommendation rate: {y}
Average score: {z}
Main competitor gap: {competitor}
Largest movement: {movement}

## wins

- {specific prompt bucket improved}
- {new source started citing us}
- {old inaccurate description disappeared}

## losses

- {competitor gained in category prompts}
- {important source stopped appearing}
- {engine has outdated information}

## source map

Top cited domains:
1. {domain}
2. {domain}
3. {domain}

Missing source opportunities:
1. {domain}
2. {domain}

## actions for next month

1. {owned content action}
2. {third-party action}
3. {measurement action}

## measurement note

This is a sampled baseline, not exact attribution.

If that report feels boring, it is probably on the right track.

Measurement in this market should be boring. The work is already uncertain enough. The report does not need extra drama.