SEO → GEO
The evolution of search in the age of generative AI.
For twenty years search had a stable grammar: query → list of links → click. That grammar is breaking down.
For twenty years online search worked according to a stable grammar: a user types a query, an engine returns an ordered list of links, the user clicks and lands on a site. On that mechanism — crawling, indexing, ranking, click-through — an entire discipline was built, SEO, and with it the economic model of the open web, where visibility translated into traffic and traffic into value.
Generative engines (Google with AI Overviews and AI Mode, ChatGPT, Perplexity, Microsoft Copilot, Claude) no longer primarily return a list of links: they read the web on the user's behalf, synthesise an answer and cite a handful of sources. The click, which used to be the goal, becomes the exception. The unit of competition changes (no longer the page but the single chunk of content that a retrieval system can extract and cite), the metrics change (from SERP position to share of citation), and the players change (a fragmented, fast-evolving archipelago that extends beyond the West). Out of this shift comes GEO: Generative Engine Optimization.
This document has three aims: to explain how the engines really work at the level of retrieval (embedding, chunking, re-ranking, query fan-out) and source selection, engine by engine; to distinguish what is documented from what is inferred via reverse-engineering or merely asserted by vendors; and to place the phenomenon in the Italian and European context, where the regulatory framework (AI Act, TDM opt-out, DSA, GDPR) is the most stringent in the world and concretely shapes what GEO can and cannot do. The tone is educational and analytical: understand the mechanism, not sell a recipe.
Five points
- Zero-click is structural: in 2024, 58.5% of US Google searches ended without a click (SparkToro/Datos), rising to 68.01% in early 2026. With an AI Overview, the organic CTR of the top page drops by 47-61% depending on the study (Authoritas, Pew, Seer Interactive, Ahrefs).
- GEO originates from the paper by Aggarwal et al. (IIT Delhi/Princeton, KDD 2024): statistics, citations and quotations raise visibility "by up to 40%"; keyword stuffing is the only tested method that worsens it.
- Each engine has a different pipeline (ChatGPT on Bing/scraping, Perplexity with its own crawler, Claude on Brave, Copilot on Bing/Prometheus, Gemini with fan-out): all of them use RAG, embeddings and chunk-level selection, so structure, freshness, authority and citability matter more than traditional positioning.
- AI visibility is a distribution, not a score: a single measurement has a standard error of 0.370 (useless); you need 7-10+ runs per prompt (paper "Don't Measure Once", arXiv:2604.07585).
- Google itself (May 2026) declares that "GEO is still SEO" and debunks 5 myths, among them llms.txt and manual chunking.
Key Findings
- Search behaviour has changed structurally, not marginally. Zero-click went from ~50% (SparkToro 2019) to 68.01% (Q1 2026, US). Gartner forecast (Feb 2024) a 25% decline in traditional search volume by 2026 — a forecast, not a final figure.
- Traditional organic ranking remains important but is no longer sufficient. Ahrefs (Jan 2026): only 38% of pages cited in AI Overviews are also in the organic top 10 (it was 76% in July 2025).
- GEO tactics with empirical evidence are few and specific: statistics, source citations, quotations, freshness, answer-first structure. Many popular pieces of advice (llms.txt, schema-as-hack, manual chunking) have no evidence of working and some are contradicted by Google.
- The Italian/EU market is lagging but accelerating fast: AI Overviews in Italy since 26 March 2025; GenAI use in Italy at 20% (below the EU average of 33%, Eurostat 2025); after the FIEG complaint (15 October 2025), AGCOM referred the case to the EU Commission under art. 65 DSA (29 April 2026). Fragmentation is global: in China, the main assistants (Doubao, ERNIE, DeepSeek, Qwen and others) together exceed 900 million monthly active users, on estimates that vary by source.
- The EU regulatory context is the most stringent in the world: AI Act, GDPR, the TDM opt-out under art. 4 CDSM, the Garante-OpenAI case and publisher disputes shape how GEO can operate in Europe.
The evolution from SEO to GEO
How traditional search worked (and still works)
Classic SEO rests on three phases: crawling (a bot like Googlebot discovers and downloads pages), indexing (pages are analysed and stored in an index), ranking (an algorithm orders the pages for a query, producing the SERP, the list of "blue links"). The economic model of the open web was based on click-through: the user searched, saw a list of results and clicked on a site, generating monetisable traffic.
How AI-driven search works
Generative engines do not primarily return a list of links, but synthesise an answer from multiple sources using an LLM, citing some of them inline. This produces zero-click answers: the user gets the answer without visiting any site. Google integrated the paradigm with AI Overviews (generative boxes at the top of the SERP) and AI Mode (a full conversational experience).
Turning points and timeline
- May 2023: Google announces the Search Generative Experience (SGE) as an experiment in Search Labs.
- 14 May 2024: Google officially launches AI Overviews in the US (at Google I/O).
- 28 Oct 2024: AI Overviews extended to more than 100 countries and territories.
- 31 Oct 2024: OpenAI launches ChatGPT Search.
- 5 Mar 2025: AI Overviews move to Gemini 2.0; Google announces AI Mode as a Labs experiment.
- 26 Mar 2025: AI Overviews arrive in Italy and other European countries.
- 20 May 2025: AI Mode extended to all US users.
- 15 May 2026: Google publishes its official GEO guidance ("Optimizing your website for generative AI features").
- 5 Jun 2026: Google publishes guidance on third-party SEO services and updates "Do you need an SEO?", naming AEO/GEO as a service category.
Perplexity (founded in 2022) popularised the concept of the "answer engine" with transparent citations. Claude (Anthropic) added web search in 2025.
The founding paper "GEO: Generative Engine Optimization"
The term was formalised in the paper by Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande (IIT Delhi/Princeton/Allen AI), published at KDD 2024 (arXiv:2311.09735, DOI 10.1145/3637528.3671900). Key results, verified against the original text:
- They introduced GEO-bench, a benchmark of 10,000 queries from different domains.
- They tested 9 methods of optimization: Authoritative, Keyword Stuffing, Statistics Addition, Quotation Addition, Cite Sources, Fluency Optimization, Easy-to-Understand, Technical Terms, Unique Words.
- Two metrics: Position-Adjusted Word Count (words attributed to a source, weighted by position in the answer) and Subjective Impression (a qualitative G-Eval score across 7 dimensions).
- The best methods (Quotation Addition +41%, Statistics Addition +31%, Cite Sources +28%, Fluency Optimization +28%) underpin the paper's headline claim that visibility in generative answers can rise by "up to 40%".
- Keyword Stuffing is the only method that worsened visibility (−8%): SEO tactics do not transfer automatically.
- Effectiveness varies by domain: Statistics in "Law & Government" and "Opinion" queries; Quotation in "History" and "People & Society".
- GEO favours low-ranking sites: Cite Sources raised visibility by 115.1% for sites in fifth SERP position.
- The Fluency + Statistics combination beat any single method by more than 5.5%.
- Validation on Perplexity.ai: improvements up to 37%.
Sourcing note: method titles, headline percentages and prose phrases confirmed against arXiv. v3 reports a 15-30% boost for the stylistic methods, against "10-20%" in earlier versions — a version discrepancy worth flagging.
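As a rough illustration of the first metric, the sketch below computes a position-adjusted word count for one generated answer. The exponential decay `exp(-pos / n)` is my reading of the paper's definition and should be treated as an assumption; the `answer` data and the source names are hand-written and hypothetical.

```python
import math

def position_adjusted_word_count(sentences):
    """sentences: ordered list of (text, source_id or None) for one answer.
    Returns {source_id: share}, with each sentence's word count weighted
    by an exponential decay on its position (assumed form)."""
    n = len(sentences)
    total_words = sum(len(text.split()) for text, _ in sentences)
    scores = {}
    for pos, (text, source) in enumerate(sentences):
        if source is None:  # sentence not attributed to any source
            continue
        weight = math.exp(-pos / n)  # earlier sentences weigh more
        scores[source] = scores.get(source, 0.0) + len(text.split()) * weight
    return {src: value / total_words for src, value in scores.items()}

# Hypothetical answer: same word count for both cited sentences,
# so only the position differs.
answer = [
    ("Zero-click searches keep rising every year.", "site_a"),
    ("This trend affects publishers of all sizes.", None),
    ("Structured answers are cited more often.", "site_b"),
]
pawc = position_adjusted_word_count(answer)
print(pawc)
```

With equal word counts, the source cited in the opening sentence scores higher than the one cited at the end, which is the behaviour the metric is meant to capture.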
Data on the change in search behaviour
- SparkToro/Datos (2024): 58.5% of US and 59.7% of EU Google searches without a click. For every 1,000 US searches, only 360 clicks to the open web.
- SparkToro/Similarweb (2026): 68.01% of US Google searches without a click in the first 4 months of 2026 (+7.56 points since 2024).
- Ahrefs (Dec 2025): an AIO correlates with an average CTR of −58% for the top page.
- Pew Research (Jul 2025): across 68,879 real searches, clicks on a traditional link at 8% with an AIO vs 15% without (≈ −47%); only 1% click a cited source; 26% of sessions with an AIO end entirely (vs 16%). Google contested the methodology.
- Seer Interactive (Sep 2025, >25M impressions): organic CTR for queries with an AIO −61% (1.76% → 0.61%).
- Gartner (Feb 2024): forecast of −25% in traditional search volume by 2026 (a forecast, not a final figure).
- Publisher impact: Digital Content Next (Aug 2025) a median 10% drop in Google referral traffic; Press Gazette/Chartbeat: −33% globally in 2025 (−38% US, −17% Europe). The damage scales with site size.
How the engines work
All generative engines use some form of RAG (Retrieval-Augmented Generation): instead of relying only on "parametric" knowledge (learned in training), they retrieve fresh content from the web and use it to build the answer. The underlying technical thesis: the page is no longer the unit of competition, the chunk is.
- ChatGPT (OpenAI): retrieval via third-party scraping APIs (historically tied to Bing; Seer found 87% overlap with Bing's top results); query fan-out; source selection that weighs authority, structure and freshness.
- Google Gemini / AI Overviews / AI Mode: its own web index + Knowledge Graph + Shopping; query fan-out documented via API; selection that also draws from outside the top 10 (Ahrefs Jan 2026: only 38% of citations from the top 10).
- Perplexity: RAG with its own crawler (PerplexityBot); strong sensitivity to freshness; typically 3-5 sources per answer; multi-level ML reranking.
- Microsoft Copilot: the Prometheus model on the Bing index; a "Bing Orchestrator" that generates iterative internal queries (fan-out); numbered citations [1][2]; the first engine to codify GEO in its own Webmaster Guidelines (February 2026).
- Claude (Anthropic): retrieval via an external provider (overlap with Brave); sentence-level citation; three bots (ClaudeBot training, Claude-User fetch, Claude-SearchBot indexing).
2-bis · The three families of retrieval
- Dense (bi-encoder): query and document are encoded separately into a single dense vector (e.g. 768 dimensions); relevance is cosine similarity. Very fast (pre-computed vectors + ANN) but, as Towards Data Science puts it, "the model compresses all meaning into one vector before any comparison happens": query and document never interact at the token level. GEO consequence: a chunk with three concepts produces an "average" vector that represents none of the three well.
- Sparse (BM25, SPLADE): exact (or expanded) lexical match, unbeatable on proper nouns, product codes and technical terms, the cases where dense fails.
- Late interaction (ColBERT and successors): keeps embeddings at the token level and computes relevance with MaxSim. Weaviate: dense methods "pool token-wise embeddings into a single representation while ColBERT embeddings keep the token-wise representations in a multi-vector". Advantage: explainability; disadvantage: storage (BEIR: ~20GB/1M docs vs 0.4GB for BM25 and ~3GB for dense).
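The contrast between a single pooled vector and token-level MaxSim can be sketched with toy numbers. Everything here is illustrative: the 3-dimensional "token embeddings" are invented (one axis per concept), and mean pooling stands in for whatever pooling a real bi-encoder applies.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mean_pool(vectors):
    """Collapse token vectors into one vector (dense, single-vector view)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def maxsim(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching doc token."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical token embeddings.
query = [[1.0, 0.0, 0.0]]                                        # asks about concept 1
focused_chunk = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]               # one idea
mixed_chunk = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # three ideas

dense_focused = cosine(mean_pool(query), mean_pool(focused_chunk))
dense_mixed = cosine(mean_pool(query), mean_pool(mixed_chunk))
late_focused = maxsim(query, focused_chunk)
late_mixed = maxsim(query, mixed_chunk)

print(round(dense_focused, 3), round(dense_mixed, 3))
print(round(late_focused, 3), round(late_mixed, 3))
```

In this toy setup the pooled vector of the three-idea chunk drifts away from the query, while MaxSim still finds the one matching token. Since a writer cannot know which scheme an engine uses, the one-idea-per-chunk rule is the safe choice.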
The figure that matters: hybrid dense+sparse approaches beat every single method. A single study (Jan 2026, on MS MARCO; a low-visibility source with an unusually low dense baseline, so to be treated as indicative) reports Recall@10 rising from 13.9% with dense alone to 80.8% with hybrid. That is roughly a 5.8× improvement (about +480% in relative terms, often quoted loosely as "580%"). The general principle is in any case confirmed by more solid peer-reviewed literature. Real engines combine "meaning" (dense) and "exact word" (sparse), so content must serve both: clear concepts and exact terminology.
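A minimal sketch of hybrid retrieval, under stated assumptions: BM25 is implemented from its textbook formula, the dense ranking is a hard-coded stand-in for what an embedding model might return, and the two lists are merged with reciprocal rank fusion (RRF). RRF is one common fusion rule chosen here for illustration; the text above does not attribute it to any specific engine.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Textbook BM25 over whitespace tokens (no stemming, no stop words)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1/(k + rank) across the input rankings."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

docs = [
    "The XR-200 battery lasts twelve hours",                # exact product code
    "Battery life of wireless headphones varies by model",
    "Long-lasting power for the newest headset",            # paraphrase only
]
scores = bm25_scores("XR-200 battery", docs)
sparse_ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
dense_ranking = [2, 0, 1]  # hypothetical output of an embedding model
hybrid_ranking = rrf([sparse_ranking, dense_ranking])
print(sparse_ranking, dense_ranking, hybrid_ranking)
```

The document that carries both the exact code and the right meaning ends up first after fusion, which is the practical reading of "clear concepts and exact terminology".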
[Figure: dense (semantic) vs sparse (lexical) retrieval]
2-bis · Embeddings
What a vector captures (and what it does not). Three operational implications: (1) the same model is mandatory for index and query — otherwise the vectors live in misaligned spaces and the similarity is noise; it is the #1 cause of silently broken RAG; (2) context rot — Chroma's research (Jul 2025, 18 models including GPT-4.1, Claude 4, Gemini 2.5) shows that retrieval degrades as context length grows, even on simple tasks: burying the answer in the middle of a wall of text makes it less retrievable; (3) an average vector is not a good vector — several distinct concepts in one chunk = a diffuse embedding.
2-bis · Chunking
Common sense says "semantic chunking is best"; the benchmarks say the opposite, and the divergence is instructive. Vecta Benchmark (Feb 2026): recursive splitting at 512 tokens first (69%), semantic at 54% (fragments of ~43 tokens); the author: the conversation about chunking has been "dominated by theory rather than measurement". MDPI Bioengineering (Nov 2025): in the clinical domain, adaptive 87% vs 13-50% for fixed-size (p=0.001). arXiv 2506.17277 (chemistry): recursive up to +45% domain-weighted precision. arXiv 2512.05411 (enterprise): on well-structured documentation, naive beats semantic and recursive. arXiv 2506.06339 (Arabic): sentence-aware best, semantic consistently worst.
They do not reconcile — and that is the point. No strategy is universally optimal; it depends on document structure and query type. A defensible robust default: recursive splitting at 400-512 tokens with 10-20% overlap, when you have no specific reason to do otherwise. Why care even if you don't control the engine's chunker? Because you control how well-splittable the page is: sharp semantic boundaries (clear headings, one idea per section, self-contained answers) produce coherent chunks with any strategy. Content structure is the chunking you can control.
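The robust default described above can be sketched in a few lines. This is a simplified illustration, not the chunker of any engine: "tokens" are approximated by whitespace-separated words, the separator hierarchy is an assumption, and the overlap is added as a second pass. To keep the final chunks within 512 tokens, the split budget is set to 448 and the overlap to 64 (12.5%).

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarsest boundary first (assumed order)

def recursive_split(text, max_tokens=448, separators=SEPARATORS):
    """Split text into chunks of at most max_tokens whitespace tokens,
    preferring the coarsest separator that keeps chunks within budget."""
    if len(text.split()) <= max_tokens:
        return [text.strip()] if text.strip() else []
    if not separators:  # last resort: fixed windows of words
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, max_tokens, finer)
    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate.split()) <= max_tokens:
            current = candidate          # keep merging small parts
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(part.split()) <= max_tokens:
            current = part
        else:                            # part itself too big: go finer
            chunks.extend(recursive_split(part, max_tokens, finer))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks

def add_overlap(chunks, overlap_tokens=64):
    """Prefix every chunk after the first with the tail of the previous one."""
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
        else:
            tail = chunks[i - 1].split()[-overlap_tokens:]
            out.append(" ".join(tail) + " " + chunk)
    return out

# Two synthetic 300-word paragraphs.
para_a = " ".join(f"a{i}" for i in range(300))
para_b = " ".join(f"b{i}" for i in range(300))
chunks = recursive_split(para_a + "\n\n" + para_b)
overlapped = add_overlap(chunks)
print([len(c.split()) for c in overlapped])
```

Because the paragraph break is tried first, the split falls exactly on the boundary between the two paragraphs. That is the sense in which sharp boundaries in the content make any chunker behave well.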
2-bis · Re-ranking
Real retrieval is almost always two-stage: Stage 1 — Recall (bi-encoder/BM25/hybrid), a wide net with 20-150 fast candidates; Stage 2 — Precision (cross-encoder) that re-evaluates each query-chunk pair together ([CLS] query [SEP] document [SEP]), reorders and keeps the top 3-8. Operational numbers: a lightweight cross-encoder (ms-marco-MiniLM-L-6-v2) ~50ms/20 docs; Cohere Rerank ~200ms; an LLM reranker 1-3s. Typical gain +5-15 points nDCG@10 or +10-25% accuracy. Default: top-20→50, rerank, pass top-3→8; beyond 50 candidates it "adds latency without meaningfully improving recall". GEO implication: the cross-encoder rewards direct, query-specific relevance — not keyword density, not length, not domain authority in itself. It is the mechanical foundation of why answer-first works: it is not style, it is alignment with the cross-encoder.
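The two-stage flow can be sketched as a small pipeline with a pluggable second-stage scorer. Both scorers below are stand-ins written for this example: `lexical_overlap` plays the role of the fast recall stage, and `toy_pair_scorer` is a crude heuristic (query terms that appear early score higher) used only so the code runs without a model. In practice the second stage would wrap a real cross-encoder such as the ones named above.

```python
def lexical_overlap(query, chunk):
    """Stage-1 stand-in: fraction of query terms present in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

def toy_pair_scorer(query, chunk):
    """Stage-2 stand-in (NOT a cross-encoder): rewards query terms that
    appear near the start of the chunk."""
    tokens = chunk.lower().split()
    score = 0.0
    for term in query.lower().split():
        if term in tokens:
            score += 1.0 / (1 + tokens.index(term))
    return score

def retrieve_then_rerank(query, chunks, pair_scorer, recall_k=20, final_k=3):
    # Stage 1: cheap scoring, wide net.
    candidates = sorted(chunks, key=lambda c: lexical_overlap(query, c),
                        reverse=True)[:recall_k]
    # Stage 2: the scorer sees query and chunk together, then we cut hard.
    reranked = sorted(candidates, key=lambda c: pair_scorer(query, c),
                      reverse=True)
    return reranked[:final_k]

chunks = [
    "Our company was founded in 1999 and after many years of testing we "
    "learned that espresso brewing temperature matters",
    "Espresso brewing temperature should sit in a narrow range for balanced extraction",
    "Grinder maintenance tips for home baristas",
]
top = retrieve_then_rerank("espresso brewing temperature", chunks,
                           toy_pair_scorer, final_k=1)
print(top[0])
```

Both of the first two chunks survive stage 1 with the same overlap score; only the second stage separates the one that answers immediately from the one that buries the answer.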
From the mechanism to the markup rule
| Mechanical fact | Operational rule for the content |
|---|---|
| Dense compresses everything into one vector | One idea per section; don't mix 3 themes in a paragraph |
| Sparse rewards exact match | Include exact terms/codes/names, not just synonyms |
| Hybrid beats single methods | Clear concepts and precise terminology together |
| Context rot | Answer up top, not buried halfway down the page |
| Self-contained chunks = robust retrieval | Each section should make sense read on its own |
| Clear headings = sharp boundaries | Semantic HTML: <h2> as a question + answer below |
| Cross-encoder rewards direct relevance | Answer-first: the answer in the first 40-60 tokens |
| Re-ranking cuts to the top 3-8 | You need to be the most relevant chunk, not just relevant |
2-ter · Per-engine reverse-engineering
Epistemic distinction: what follows mixes official facts (documentation, APIs), reverse-engineering analyses not confirmed by the makers, and vendor claims. I flag each case as it comes. This is the most volatile area of the document: the pipelines change from one model update to the next.
ChatGPT — the web.run tool
The most detailed source is the RESONEO/Meteoria study (Olivier de Segonzac, May 2026), which decompiled the mobile app, sniffed the network packets and reconstructed the system prompt. The internal engine is called web.run: before GPT-5.3 it sent compact textual commands separated by pipes (fast|query|recency), after 5.3 structured JSON objects. The tool supports 12 operations (up from 4): search_query, open, find, click, screenshot, product_query and specialised widgets, plus a genui system. The query fan-out chains 2-10+ rounds; the novel product fan-out (browse_rewritten_queries) launches a separate shopping search for each individual product. It is ChatGPT-User (not OAI-SearchBot) that fetches the pages during the conversation; Google tracking markers (strlid) in product URLs reveal a backend that leans on third-party providers and on Google behind the scenes.
With the switch to GPT-5.3 Instant (4 Mar 2026) the unique domains cited per answer dropped from 19 to 15 (−20%) — the "Bigfoot Effect": concentration on a few authoritative domains (URL-per-domain ratio stable at 1.26). Reddit is the only domain exempt from the per-word copyright limits in the reconstructed system prompt. (Reverse-engineering.) Strategic point: the study distinguishes parametric visibility (what the model learned in training — stable, shaped by press coverage, Wikipedia, authoritative sites) from dynamic visibility (what it retrieves in real time, volatile). The link: "the model formulates the web queries pointing at sources it already knows. A brand absent from the parametric memory will not even be considered as a candidate."
Caveat: the same prompt on 5.2/5.3/5.4 produces different fan-out, sources and passages. Citation in ChatGPT is not reproducible like a Google ranking: it must be tested model by model.
Gemini / AI Mode — documented fan-out
Unlike ChatGPT, here there is official documentation, via the Gemini grounding API. The response returns webSearchQueries (the queries actually executed — e.g. for "who won Euro 2024" it generates ["UEFA Euro 2024 winner", "who won euro 2024"]), groundingChunks (the sources, with uri and title) and groundingSupports (the text segment → source chunk mapping, with startIndex/endIndex character by character): every sentence of the answer is anchored to specific chunks. From the Gemini 2.5 technical report: Gemini 2.0 was "the first family of models trained to natively call tools like Google Search"; Gemini 2.5 "interleaves search capabilities with internal thought processes" for multi-hop queries. Search is interleaved with reasoning, not occasional.
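To make the three fields concrete, here is a small sketch that walks a grounding payload and maps each answer segment to its sources. The payload is hand-written to mimic the shape described above (field names as cited in the text; URLs and titles are placeholders), not a captured API response. The sample is plain ASCII, so the offsets behave as character positions; for other text, check in the API reference how the offsets are measured.

```python
answer_text = "Spain won Euro 2024. They beat England 2-1 in the final."

# Hypothetical payload in the documented shape.
grounding_metadata = {
    "webSearchQueries": ["UEFA Euro 2024 winner", "who won euro 2024"],
    "groundingChunks": [
        {"web": {"uri": "https://example.com/euro-2024-final", "title": "example.com"}},
        {"web": {"uri": "https://example.org/euro-2024-recap", "title": "example.org"}},
    ],
    "groundingSupports": [
        {"segment": {"startIndex": 0, "endIndex": 20}, "groundingChunkIndices": [0]},
        {"segment": {"startIndex": 21, "endIndex": 56}, "groundingChunkIndices": [0, 1]},
    ],
}

def segments_to_sources(text, metadata):
    """Return [(segment_text, [source_uri, ...]), ...] for one answer."""
    chunks = metadata.get("groundingChunks", [])
    rows = []
    for support in metadata.get("groundingSupports", []):
        segment = support["segment"]
        start = segment.get("startIndex", 0)  # may be absent when it is 0
        span = text[start:segment["endIndex"]]
        uris = [chunks[i]["web"]["uri"]
                for i in support.get("groundingChunkIndices", [])]
        rows.append((span, uris))
    return rows

rows = segments_to_sources(answer_text, grounding_metadata)
for span, uris in rows:
    print(f"{span!r} <- {uris}")
print("fan-out queries:", grounding_metadata["webSearchQueries"])
```

Logging `webSearchQueries` over many prompts is the cheapest way to see which sub-questions the engine actually asks about a topic.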
Scale: in mid-2025 the models powered ~1.5 billion monthly users in AI Overviews and ~400M in the Gemini app; at the end of 2025/early 2026 the official Alphabet numbers rise to 2 billion users for AI Overviews and 750 million MAU for the Gemini app (Q4 2025). AI Mode uses a "custom version of Gemini" with fan-out, which decomposes the query into many parallel sub-queries — the reason why pages not in the top 10 get cited. GEO consequence: optimising for Gemini means covering the tree of sub-questions of a topic, not a single keyword.
Perplexity / Sonar
RAG with its own crawler (PerplexityBot) + realtime fetch (Perplexity-User). Sonar is the proprietary model built on open Llama architectures; at the product level Perplexity is multi-model and selects the best model at runtime per mode (search/reasoning/research). Pipeline: (1) query decomposition; (2) retrieval from its own index + realtime; (3) three-level reranking: Layer 1 candidate retrieval with classic scoring, Layer 2 ranking by authority/relevance, Layer 3 ML reranking that reportedly favours earned media from Tier-1 publications (a citation on TechCrunch or Forbes as an externally verified authority signal; independent analysis by Yeşilyurt, Aug 2025, not officially confirmed); (4) synthesis with inline citations. Citation is a "two-step dance": inclusion in the retrieval set, then selection of the paragraph. Freshness dominates: an article "updated 2 hours ago" was cited 38% more often than its identical twin dated a month earlier; the stale twin rarely disappeared from the retrieval set but was demoted in the synthesis. Scale: ~780M queries in May 2025 (+20% MoM, Srinivas statement, Bloomberg Tech).
Claude — sentence-level citation
Retrieval via an external provider (Profound analyses indicate strong overlap with Brave Search) and fetch from the result URLs. The web search and citations tool documentation specifies that documents are split into chunks at sentence granularity: the output returns cited_text blocks, with title and url. Consequence: a well-built, self-contained sentence is the smallest citable unit — the most extreme case of the "the chunk is the unit of competition" principle. Three bots: ClaudeBot (training), Claude-User (fetch on user request), Claude-SearchBot (indexing). All declare that they respect robots.txt.
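A short sketch of how sentence-level citations could be collected from a response. The `response_content` list is hand-written to mimic the shape described above (text blocks carrying a `citations` list with `cited_text`, `title` and `url`); the values are placeholders, and the exact block layout should be checked against the current tool documentation.

```python
from collections import Counter

# Hypothetical response content, not a captured API response.
response_content = [
    {"type": "text", "text": "The guide recommends one idea per section.",
     "citations": [
         {"type": "web_search_result_location",
          "url": "https://example.com/guide",
          "title": "Example guide",
          "cited_text": "Keep one idea per section."},
     ]},
    {"type": "text", "text": " It also suggests question headings.",
     "citations": [
         {"type": "web_search_result_location",
          "url": "https://example.com/guide",
          "title": "Example guide",
          "cited_text": "Phrase each heading as a question."},
     ]},
    {"type": "text", "text": " No source backs this closing remark."},
]

def collect_citations(content_blocks):
    """Flatten all citations into rows of cited sentence, url and title."""
    rows = []
    for block in content_blocks:
        if block.get("type") != "text":
            continue
        for citation in block.get("citations") or []:
            rows.append({
                "cited_text": citation.get("cited_text"),
                "url": citation.get("url"),
                "title": citation.get("title"),
            })
    return rows

citations = collect_citations(response_content)
per_url = Counter(row["url"] for row in citations)
print(citations)
print(per_url)
```

Comparing the `cited_text` strings with your own page shows which exact sentences were lifted, which is more actionable than knowing only that the page was cited.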
Microsoft Copilot — GEO in policy
Copilot is the only major engine that has codified GEO in its own official policy. Microsoft describes Prometheus as a model that combines "the fresh and comprehensive Bing index, ranking, and answers results with the creative reasoning capabilities of … GPT models"; the Bing Orchestrator "generate[s] a set of internal queries iteratively", which is the internal query fan-out mechanism. Citations are numbered [1][2] and linked to the source page.
The rewrite of the Bing Webmaster Guidelines (27 Feb 2026) treats "grounding results and citations" as a separate eligibility outcome and introduces GEO as an official category. Three controls matter: NOARCHIVE prevents the content from being used in Copilot answers; NOCACHE limits it to URL, title and snippet (Microsoft advises against it on pages you want cited); the data-nosnippet attribute marks, at paragraph level, the text that Bing must not show or cite. IndexNow notifies Bing on every change; Google does not support it. Bing Webmaster Tools' AI Performance Report (public preview since Feb 2026) shows citation counts, cited URLs and a sample of the grounding queries.
On the link with ChatGPT: Seer (6 Feb 2025) found that 87% of SearchGPT citations coincide with Bing's top-20 organic results (vs 56% for Google), an independent measurement of coincidence; the "~92% via Bing API" figure is an unconfirmed vendor claim.
| Dimension | ChatGPT | Gemini | Perplexity | Copilot | Claude |
|---|---|---|---|---|---|
| Index/source | Third-party scraping (+Google traces) | Google index + KG + Shopping | Own index + realtime | Bing/Prometheus | External (Brave overlap) |
| Retrieval bot | ChatGPT-User | Googlebot (Google-Extended is a robots.txt control token, not a crawler) | Perplexity-User | bingbot / Bing | Claude-User / Claude-SearchBot |
| Fan-out | Yes (web.run, 2-10+ rounds) | Yes (documented via API) | Yes (query decomposition) | Yes (Bing Orchestrator) | Yes (multiple searches) |
| Citation | Inline, varies by model | sentence→chunk (groundingSupports) | Inline, paragraph | Numbered [1][2] + panel | Inline, sentence (cited_text) |
| Distinctive | Few authoritative domains (Bigfoot) | Sub-question tree | Freshness + Tier-1 | GEO in policy | Sentence granularity |
| Transparency | Low (rev-eng) | Medium (official API) | Low-medium | High (docs + report) | Medium (docs) |
What makes content citable
Tactics with empirical support
- Statistics and specific data (GEO paper: Statistics Addition +31%).
- Source citations and quotations (Quotation Addition +41%, Cite Sources +28%).
- Freshness (strong for Perplexity and news/trend queries).
- Answer-first structure with a question heading and a direct answer in the first 40-60 tokens (alignment with the cross-encoder).
- Authority/E-E-A-T and third-party citations; non-commodity content with first-hand experience (confirmed by Google's 2026 guidance).
- Brand mentions: Previsible study on 1.96M sessions → brand search volume is the strongest predictor of AI citations (correlation 0.334), more than backlinks.
Caveat: vendor claims such as "data-rich cited 2.7x more" or "FCP <0.4s = 6.7 citations" circulate but are not verifiable against a primary source — see the Anti-hype section.
Correlation with traditional ranking (conflicting data): in July 2025 Ahrefs found 76% overlap between AIO citations and the top 10, in January 2026 only 38% (partly due to better detection, partly to fan-out). Semrush: ~90% of ChatGPT citations from URLs outside Google's top 20. Ranking helps but is not a necessary condition.
robots.txt for AI crawlers: the 2024 "block all AI bots" strategy is counterproductive. Distinguish training bots (GPTBot, ClaudeBot, Google-Extended) from retrieval/search bots (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot): blocking the latter removes the site from AI citations.
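One way to check that a robots.txt file really separates the two groups is to parse it with Python's standard `urllib.robotparser`. The policy below is only an example of the training/retrieval split described above (whether to block training bots at all is a business decision), and `example.com` is a placeholder. Real crawlers may match user-agent tokens differently from Python's parser, so treat the result as a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Example policy: block training crawlers, allow retrieval/search crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
"""

TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def access_report(agents, url="https://example.com/guide"):
    """Map each user-agent token to True (may fetch) or False (blocked)."""
    return {agent: parser.can_fetch(agent, url) for agent in agents}

training_access = access_report(TRAINING_BOTS)
retrieval_access = access_report(RETRIEVAL_BOTS)
print("training: ", training_access)
print("retrieval:", retrieval_access)
```

Running the same report against the live file after every deployment catches the common accident where a blanket `Disallow` rule silently removes the site from AI citations.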
The Italian and European market
Adoption
Eurostat 2025: use of GenAI tools in Italy at 20%, below the EU average of 33% and far from Norway (56%) and Denmark (48%); it reflects the European north-south gap. ChatGPT in Europe: average monthly active users from 11.2 to 41.3 million by March 2025 (~+270%). Italy was the first country in the world to temporarily block ChatGPT (March 2023).
AI Overviews timing in Europe
They arrived in Italy on 26 March 2025 (alongside Austria, Belgium, Germany, Ireland, Poland, Portugal, Spain, Switzerland), ~10 months after the US, in Italian and on Gemini 2.0. They trigger for long-tail informational queries. AI Mode in Italian was not fully launched as of mid-2026.
The global landscape: the Chinese AI engines
Fragmentation is not only a Western affair. The Chinese market is the world's second pole, with a more crowded competition than the US one. Baidu ERNIE: ERNIE 4.5 open-sourced on 30 June 2025 (10 MoE variants up to 424B parameters, Apache 2.0 licence); ERNIE Assistant at 202 million MAU in December 2025; in Q4 2025 subscription revenue from the AI accelerator infrastructure grew +143% YoY (up from +128% in Q3) and the call volume of the AI search API +110% QoQ (Baidu press release and earnings call, 26 February 2026). DeepSeek: cost-efficient open-source models that shook the market in early 2025; in mid-2025 one estimate attributed ~34% of the developer API share to DeepSeek vs ~18% for ERNIE; integrated into Baidu Search and Zhihu. The others: according to QuestMobile (via Caixin), in March 2026 Doubao (ByteDance) is in the lead with ~345 million MAU, ahead of Qwen (Alibaba, ~166M) and DeepSeek (~127M), with Tencent Yuanbao among the top four; the combined MAU of the main players exceed 900 million.
For an Italian freelancer these engines are context, not daily action. The point is structural: the GEO logic (retrieval, grounding, citations, fan-out) is substantially the same everywhere, and fragmentation is a global trend, not a Western anomaly. The Chinese MAU counts diverge widely between sources: treat them as estimates and always quote them with source and date.
Testing GEO without fooling yourself
This is the section that separates serious work from theatre. Thesis: visibility in AI search is a distribution, not a score. Treating it like Google rank tracking is the underlying methodological error from which almost all the unreliable numbers in circulation derive.
Why a single measurement is useless (with the numbers)
The most rigorous figure comes from the paper "Don't Measure Once: Measuring Visibility in AI Search (GEO)" (Schulte et al., arXiv:2604.07585, 10 April 2026), which measured 4 engines × 8 prompts × 3 campaigns with 10 runs each (1,216-1,726 per-brand series). A single run has a standard error of 0.370 (95% CI ±0.724; Table 16, Appendix J): a true rate of 50% can appear anywhere between −22% and +122% — "essentially uninformative", indistinguishable from noise. At 7 runs the standard error drops to 0.081 (±0.158) and at 8 runs to 0.062 (±0.121). The source overlap between two consecutive days can fall to 34-42%. A second paper (Sielinski, March 2026, arXiv:2603.08924) converges: citation distributions follow a power law and 95% of ChatGPT Shopping titles appear in fewer than 30% of the runs of the same prompt. Minimum defensible floor: 10+ runs per prompt.
Citation drift: volatility over time
Volatility is not only run-to-run, it is also temporal. Monthly drift (% of domains present in July but absent in June for the same prompts): 40-60% (Profound). Half-yearly drift: 70-90% comparing January with July; BrightEdge reports a 70% churn of cited domains within six months (70× volatility gap). Platform shocks: the share of Reddit citations in ChatGPT collapsed from ~60% to ~10% in a few weeks in September 2025 (Semrush, 13 weeks); the model change of 4 March 2026 cut cited domains by 20% from one day to the next. Rule: measure with windows, not snapshots (weekly for strategic queries).
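Drift as defined above reduces to set arithmetic on the cited domains of two windows. The sketch below follows that definition (share of domains in the later window that were absent from the earlier one); the Jaccard overlap is added as one plausible reading of "source overlap", which is an assumption. The domain names are invented.

```python
def domain_drift(previous, current):
    """Share of domains cited in `current` that were absent from `previous`."""
    previous, current = set(previous), set(current)
    if not current:
        return 0.0
    return len(current - previous) / len(current)

def jaccard_overlap(a, b):
    """Shared domains divided by all domains seen in either window."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical cited domains for the same prompts in two monthly windows.
june = {"alpha.example", "beta.example", "gamma.example",
        "delta.example", "epsilon.example"}
july = {"alpha.example", "beta.example", "zeta.example",
        "eta.example", "theta.example"}

drift = domain_drift(june, july)
overlap = jaccard_overlap(june, july)
print(f"drift June -> July: {drift:.0%}, overlap: {overlap:.0%}")
```

Each window should already aggregate several runs per prompt; otherwise run-to-run noise gets counted as drift.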
Minimum defensible protocol
- Define 20-30 prompts from a real buyer, not vanity brand queries ("what is the best X for Y", not "tell me about [my brand]").
- Run across multiple engines (ChatGPT, Perplexity, Google AI Overviews/AI Mode, Gemini) — visibility in one does not predict the others.
- Repeat each prompt 7-10 times, spread over several days (not 10 times in the same minute).
- Log three things per run: whether you appeared, which publication was cited, which competitor was named in your place.
- Compute bootstrap confidence intervals on the detection rate per brand, not bare averages.
- Report per-engine (the cross-engine aggregate hides the patterns).
- To test a change: measure the baseline over a window, apply it, wait for the re-crawl, measure over an equivalent window. Compare distributions, not points. Use a control group (unmodified pages) to separate the effect from background drift.
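The confidence-interval step of the protocol can be sketched with the standard library alone. This is a plain percentile bootstrap on the detection rate of one brand, one engine and one prompt; the run log is invented, and the number of resamples and the fixed seed are arbitrary choices made for reproducibility.

```python
import random

def bootstrap_ci(outcomes, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.
    Returns (observed_rate, lower_bound, upper_bound)."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Hypothetical log: 1 = brand cited in that run, 0 = not cited (10 runs).
runs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
rate, ci_low, ci_high = bootstrap_ci(runs)
print(f"detection rate {rate:.0%}, 95% CI [{ci_low:.0%}, {ci_high:.0%}]")
```

Even with ten runs the interval stays wide, so a before/after comparison should look at whether the two intervals separate, not at the difference between two point estimates.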
Metrics that matter (and one to drop)
From Nick Lafferty's reference (2026) and the cited studies: Citation Share per engine (the central metric); Time-to-First-Citation (reported as a distribution, with median, P75 and P90, never as an average); Inline Brand Hyperlink Share (the share of answers with a clickable link; its weight has grown since the ChatGPT change of 7 May 2026, which tripled B2B SaaS referrals); Co-citation Rate; Citation Rank Stability (the Schulte paper metric that almost all dashboards skip). To drop if you sell software/services: the Shopping Trigger Rate. Across ~2 million prompts, 79% never triggered Shopping and only ~6% trigger reliably; the prompt category alone predicts the trigger with 95-97% accuracy.
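Two of these metrics are easy to compute from a run log. In the sketch below, Citation Share is taken to mean "share of runs on an engine in which the brand was cited", which is an assumed working definition, and the log entries and day counts are invented. Time-to-First-Citation is summarised with `statistics.quantiles` so that the tail stays visible.

```python
import statistics

def citation_share(runs, brand):
    """runs: list of {"engine": str, "cited": set of brand names}.
    Returns {engine: share of runs in which `brand` was cited}."""
    totals, hits = {}, {}
    for run in runs:
        engine = run["engine"]
        totals[engine] = totals.get(engine, 0) + 1
        hits[engine] = hits.get(engine, 0) + int(brand in run["cited"])
    return {engine: hits[engine] / totals[engine] for engine in totals}

def ttfc_summary(days):
    """Median, P75 and P90 of time-to-first-citation, in days."""
    cuts = statistics.quantiles(days, n=100, method="inclusive")
    return {"median": statistics.median(days), "p75": cuts[74], "p90": cuts[89]}

# Hypothetical run log and time-to-first-citation samples.
log = [
    {"engine": "chatgpt", "cited": {"acme", "globex"}},
    {"engine": "chatgpt", "cited": {"globex"}},
    {"engine": "chatgpt", "cited": {"initech"}},
    {"engine": "chatgpt", "cited": set()},
    {"engine": "perplexity", "cited": {"acme"}},
    {"engine": "perplexity", "cited": {"acme", "initech"}},
]
days_to_first_citation = [3, 5, 6, 8, 9, 12, 15, 21, 30, 60]

share = citation_share(log, "acme")
summary = ttfc_summary(days_to_first_citation)
print(share)
print(summary, "mean:", statistics.mean(days_to_first_citation))
```

In the sample data the mean sits well above the median because of one slow page, which is why the text insists on percentiles.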
Replicating the GEO paper yourself
The original GEO paper is replicable at low cost. (1) Take 10-20 target pages/chunks and create two variants: baseline and treated (+ cited statistics, or + quotations). (2) Build 30-50 realistic queries. (3) Submit the queries with search active, 7-10 runs per query per variant, alternating the order to avoid position bias. (4) Measure the Position-Adjusted Word Count, in addition to presence/absence. (5) Compare the distributions with a non-parametric test (Mann-Whitney). Calibrated expectation: +20-40%, not "10x", with Statistics and Quotation as the strongest levers. If you see +300%, it is almost certainly noise from too small a sample.
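Step (5) can be run without external libraries. The sketch below implements the Mann-Whitney U test with average ranks for ties and a two-sided p-value from the normal approximation; it applies no tie or continuity correction, so for small samples a statistics package is the safer choice. The two samples are invented per-query Position-Adjusted Word Count values.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie or
    continuity correction). Returns (U statistic for x, p-value)."""
    combined = sorted((value, group)
                      for group, sample in enumerate((x, y))
                      for value in sample)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1, n2 = len(x), len(y)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u_x, p_value

# Hypothetical visibility scores per query: baseline vs treated variant.
baseline = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.10, 0.07]
treated = [0.14, 0.16, 0.12, 0.15, 0.13, 0.17, 0.11, 0.15]

u_stat, p_value = mann_whitney_u(baseline, treated)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

A rank-based test suits this data because visibility scores are skewed and bounded, so a comparison of means would be dominated by a few outlier queries.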
Practical rule: if a GEO claim does not state how many times it repeated each prompt and over what period, it is anecdote. With SE 0.370 at one run, any two-decimal figure without repeated runs is statistically suspect.
The framework that constrains GEO in Europe
This section matters for an Italian freelancer more than it seems: the choices around robots.txt, licensing and handling client content have concrete legal implications under EU law, the most stringent in the world on AI and copyright.
TDM exception and opt-out (art. 4 CDSM): the legal cornerstone
The foundation of commercial AI training in Europe is Directive (EU) 2019/790 (CDSM): art. 3 covers TDM for scientific purposes (research organisations, not commercial AI); art. 4 covers general (commercial) TDM, which became "the cornerstone of commercial AI training in the EU", although it was added in the final stages of the legislative process without an impact assessment on GenAI. The key mechanism is the opt-out under art. 4(3): TDM is permitted by default unless a reservation is expressed in an appropriate manner, "for instance by means of machine-readable tools" for content publicly available online (Recital 18). The bridge to the AI Act: art. 53(1)(c) obliges GPAI model providers to "identify and respect … the reservations of rights expressed pursuant to art. 4(3)"; Recital 106 establishes a "Brussels effect" — the obligation applies to any provider placing a GPAI model on the EU market "regardless of the jurisdiction in which the training acts take place". Even a model trained in the US, if offered in the EU, must respect European opt-outs.
The unresolved problem: what makes an opt-out "valid"
The directive does not prescribe a single technical standard, and national case law diverges. Kneschke v. LAION (Hamburg Court, 27 Sep 2024): the construction of the dataset was covered by the German equivalent of art. 3, but with doubts about art. 4 for the downstream commercial exploitation; a later ruling held that an opt-out in "natural language" in the ToS may qualify as machine-readable. DPG Media v. HowardsHome (Amsterdam Court, late 2024): the reservation must be "practically detectable and processable by automated systems". The two directions are "markedly different", so relying only on a clause in the terms of use is risky: a technical signal is also needed (robots.txt, metadata, headers). An additional argument (Synodinou-Vrakas, Nov 2025): datasets built by indiscriminate scraping may include works that are publicly accessible but not "lawfully accessed", and such works fall outside the protection of the TDM exception.
The EU Parliament's push to reform the opt-out
Within the own-initiative procedure 2025/2058(INI) "Copyright and generative AI" (JURI committee, rapporteur Axel Voss), two distinct documents should be kept apart: (1) the JURI study PE 774095 (Prof. Nicola Lucchi, 9 July 2025), which concludes that training "far exceeds the scope of the current TDM exceptions"; (2) the draft report / Motion for a resolution PE775.433 (27 June 2025), which calls for clearer rules, transparency on training data and a remuneration obligation. The resolution was adopted in plenary on 10 March 2026 (T10-0066/2026). The majority nonetheless considers a new legislative instrument unnecessary "at this stage", a sign of unresolved political tension.
The Italian case: FIEG vs Google, the DSA and the opt-out dilemma
15 October 2025: FIEG files a formal complaint with AGCOM against AI Overviews and AI Mode, calling them "traffic killers". The charge is not copyright infringement but a breach of the DSA: according to the complaint, the two features amount to improper competition, cause a structural reduction in visibility and revenue, and pose "a systemic risk to the economic sustainability of the entire information ecosystem". 29 April 2026 (a separate, subsequent act): following hearings with Google, FIEG and FISC, AGCOM, in its role as national Digital Services Coordinator, decides to refer to the European Commission, under art. 65 DSA, a request for assessment of Google's AI Overviews and AI Mode in relation to arts. 27, 34 and 35 DSA (systemic risks to pluralism and freedom of information; transparency of recommender systems). The press release is dated 30 April 2026; the decision was taken with the dissenting vote of Commissioner Elisa Giomi. It is a referral aimed at the possible opening of a Commission investigation, not an autonomous AGCOM sanctioning proceeding.
Exercising the opt-out (blocking AI crawlers) protects copyright but removes the content from generative answers, zeroing out GEO visibility. Publishers want to be able to opt out without losing visibility — but technically, today, the two are largely the same lever.
For an SME/e-commerce site (not a publisher) the rational choice is almost always not to opt out of retrieval bots (you want to be cited); opting out of training bots has near-zero cost in immediate visibility. The two kinds of bot must be told apart: blocking GPTBot (training) does not remove you from ChatGPT Search; blocking OAI-SearchBot/ChatGPT-User does.
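The training/retrieval split can be checked before deploying a robots.txt. The sketch below uses Python's standard `urllib.robotparser` to confirm that a sample policy blocks GPTBot while leaving OAI-SearchBot and ChatGPT-User free to fetch. The robots.txt content, the `example.com` URL and the helper name are illustrative assumptions; the parser only shows what the file declares, not whether a given crawler honours it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: opt out of training, stay visible to retrieval.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/page"):
    """Return {agent: True/False} for whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "ChatGPT-User"])
# result == {"GPTBot": False, "OAI-SearchBot": True, "ChatGPT-User": True}
```

The same check can be run against the live file of a client site by passing its text to `allowed_agents`, which makes an accidental blanket block of retrieval bots easy to spot.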
The Privacy Authority (Garante) and GDPR: the Italian precedent
The Italian Garante was the first regulator in the world to restrict ChatGPT (31 March 2023, Order 112/2023), citing the lack of a privacy notice, the absence of a legal basis for training and inadequate protection of minors. The service was reactivated on 28 April 2023 after corrective measures (privacy notice, right to object including for non-users, age verification). The 2024 fine (Order 755) amounted to €15 million; OpenAI appealed and the Court of Rome annulled it (judgment no. 4153/2026, filed on 18 March 2026), after which the order was removed from the Garante's website. (The reasoning was not yet public as of mid-2026.) The "right to object" extended to non-users is a privacy-based opt-out precedent parallel to the copyright one.
AI Act operational timeline
- 1 August 2024: entry into force (Reg. EU 2024/1689).
- 2 August 2025: the obligations for GPAI models apply (copyright policy and a "sufficiently detailed summary" of training data); the GPAI Code of Practice published.
- 2 August 2026: general applicability of most remaining provisions, including the obligations to label AI-generated content and deepfakes.
For a freelancer: the direct obligations fall on model providers, not on those who publish sites. But labelling clients' AI-generated content and correctly managing opt-out/licensing become part of professional due diligence.
Operational implications
What still holds from classic SEO
Crawlability and indexing (extended to Bing for ChatGPT and to AI retrieval bots); authority/E-E-A-T; semantic HTML and speed; server-side rendering (many AI bots fetch but do not execute JS).
What is new in GEO
Optimisation at the chunk/passage level (not the page); topical breadth for query fan-out (pillar + cluster); entity/brand building; surface diversification (YouTube, Reddit); indexing on Bing.
Operational recommendations in three phases
- Phase 1 — Technical foundations (immediate): verify indexing on Bing Webmaster Tools (a prerequisite for ChatGPT) and use IndexNow; audit robots.txt allowing the retrieval bots (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User) even while blocking training; server-side rendering of the key content and clean semantic HTML.
- Phase 2 — Content (1-3 months): rewrite the top pages in answer-first format (question heading, direct answer in the first 40-60 tokens); add cited statistics and quotations (the levers with the most evidence); update the key content quarterly (freshness, especially for Perplexity); structure into self-contained chunks that cover the sub-questions (for fan-out), a pillar + cluster architecture — by writing well, not by splitting artificially.
- Phase 3 — Authority and measurement (3-6 months): build brand mentions and third-party citations (G2/Trustpilot, digital PR, YouTube, Reddit; Wikipedia for entity grounding where relevant); implement an AI tracking tool (Otterly entry-level, Peec AI for European multilingual, Profound enterprise) and monitor trends with repeated runs, not snapshots.
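The server-side rendering check in Phase 1 can be approximated offline. Since many AI bots fetch the HTML but do not execute JavaScript, the question is whether the key sentences are already present in the raw response. The sketch below is a rough check under stated assumptions: it extracts the text from an HTML string with the standard `html.parser`, ignoring `script`, `style` and `template` contents, and reports which phrases are found. Class and function names are hypothetical, and fetching the page is left out on purpose.

```python
from html.parser import HTMLParser

_SKIPPED_TAGS = ("script", "style", "template")


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring script/style/template contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(raw_html):
    """Text a non-JS client would see, with whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())


def content_in_raw_html(raw_html, key_phrases):
    """Return {phrase: True/False} for presence in the unrendered HTML."""
    text = visible_text(raw_html).lower()
    return {phrase: phrase.lower() in text for phrase in key_phrases}


server_rendered = "<html><body><h1>What is GEO?</h1><p>GEO is the practice of being cited.</p></body></html>"
client_rendered = '<html><body><div id="root"></div><script>render("GEO is the practice of being cited.")</script></body></html>'
```

Running `content_in_raw_html` on both samples with the phrase "GEO is the practice" finds it in the server-rendered page and not in the client-rendered one, which is the failure mode a non-JS bot would hit.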
Thresholds that change the choices
- If AI search exceeds 1-2% of total site traffic (today it is typically <1%, but it converts better than organic), increase the GEO investment.
- If the overlap between AI citations and organic ranking is low, prioritise chunkability and authority.
- For Italy, monitor the launch of AI Mode in Italian and the evolution of the FIEG-AGCOM case and the EU copyright framework.
What to debunk in the GEO discourse
This section isolates the GEO claims that circulate as truths but rest on weak, absent or contrary evidence. A single criterion applies: primary source or replicable experiment versus repetition among influencers.
Google debunks 5 myths (15 May 2026)
On 15 May 2026 Google Search Central published the official guide "Optimizing your website for generative AI features on Google Search" (announced by John Mueller). It is the most explicit on-record statement on what works for AI Overviews and AI Mode. The underlying thesis: "AEO and GEO are still SEO", because the AI features run on the same ranking systems as classic Search. Google classifies as unnecessary: (1) llms.txt files and special markup — "You don't need to create new machine readable files, AI text files, markup, or Markdown to appear in generative AI search"; (2) content chunking — the systems "are able to understand the nuance of multiple topics on a page"; (3) AI-specific rewrites; (4) inauthentic mentions (artificial link-building and mentions); (5) excessive use of schema/structured data. What to actually do: solid SEO, non-commodity content with unique perspectives and first-hand experience, multimodal assets.
Reinforcing the position, on 5 June 2026 Google published "Google Search's guidance on using third-party SEO tools, services, and advice" and updated "Do you need an SEO?", explicitly naming AEO and GEO as service categories. Google legitimises the discipline but narrows its scope — it remains "still SEO" — and urges caution toward anyone promising shortcuts or guarantees of citation in AI answers.
The honest tension over chunking
Google is right on one point: you don't have to physically split the page into micro-files or rewrite it in an "AI format", because its systems do the chunking on their side and understand multi-topic pages. But "manual chunking is not needed" ≠ "structure doesn't matter": RAG research shows that clean, self-contained chunks improve retrieval all else being equal, and you get them by writing well (clear headings, one idea per section, direct answers), not by manipulating the structure for the bot. A crucial distinction: the guidance applies to Google (which runs on Search ranking); for ChatGPT, Perplexity and Claude (which run their own RAG pipelines) structure remains more relevant. Generalising "chunking is dead" to all engines is an over-extension. Verdict: stop selling "chunking optimisation" as a service in itself; keep writing well-structured content.
llms.txt — the textbook case of hype
llms.txt is a proposal by Jeremy Howard (September 2024): a Markdown file in the site root to help LLMs use a site at inference time. It was born for technical documentation aimed at dev tools, not as an SEO lever. The evidence against: Google (Mueller, Illyes) states it does not use it and compares it to the obsolete "keywords meta tag"; Otterly detects 84 requests out of 62,100 in 90 days (0.1%); Ahrefs (May 2026, ~38,000 valid files across 137,210 domains) finds that 97% of the files receive no requests; SE Ranking (~300,000 domains) finds no correlation with citations; Search Engine Land: "there is no data or evidence showing that llms.txt files boost AI inclusion." The grain of truth: Wix (AI Search Lab, over 1,400 files, Nov 2025) estimates that indexed files rose from ~30,000-60,000 to ~120,000 (May 2026, a peak of ~200,000 in April), but it is a self-interested, unverified estimate (the "125,000" figure in the subtitle does not match the "~120,000" in the body) and, in the same source's words, "this will not make or break your GEO strategy." Verdict: as an AI-citation lever it is unproven, with mostly contrary evidence. It has one legitimate, narrow use case (documentation for agents/dev tools). Realistic priority: very low.
Serving pages in Markdown too — marginal, often hype
Is it worth publishing a Markdown version of pages to get cited better? For mainstream engines it is marginal/situational, not proven, and in some forms risky. Technical distinction: content negotiation via Accept: text/markdown on the same URL is legitimate, standard HTTP; separate .md files at separate URLs are described by Google and Bing as potential cloaking and a doubling of the crawl budget consumed. The main AI and search crawlers (GPTBot, OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, Googlebot, Bingbot) do not negotiate Markdown; only some coding agents do so in a live session (Claude Code, Cursor, OpenCode; Checkly test, Feb 2026). Empirical evidence points to a null or non-significant effect: Profound (a controlled experiment, 381 pages, Jan-Feb 2026) finds a ~16% lift that is not statistically significant and is driven by outliers; Otterly finds 0% AI bot traffic and zero citations to .md files. The real advantage of Markdown is tokenisation (~80% fewer tokens according to Cloudflare), but that benefit accrues to whoever consumes the page, and the consumers already convert the HTML themselves (Jina Reader, Firecrawl, RAG pipelines, Claude Code via Turndown). Google (15 May 2026): "You don't need to create new machine readable files, AI text files, markup, or Markdown … as Google Search itself doesn't use them"; Mueller (Feb 2026) calls converting pages just for the bots "such a stupid idea". Verdict: useful only for technical/SaaS/API documentation consulted by agents in real time; for a generic site, clean semantic HTML matters far more. If you implement it, use content negotiation with Vary: Accept, never separately indexable .md files.
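For the narrow case where Markdown is worth serving, the difference between content negotiation and separate .md URLs can be made concrete. The sketch below is a simplified, framework-free illustration: one URL, two representations, selected from the Accept header, with Vary: Accept on every response so caches keep the variants apart. The function names and the selection rule (Markdown only when text/markdown is listed explicitly with a quality at least equal to HTML's) are assumptions made for illustration, not a complete implementation of HTTP content negotiation.

```python
def parse_accept(header):
    """Parse an Accept header into {media_type: q}; a missing q means 1.0."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media] = q
    return prefs


def negotiate(accept_header, html_body, markdown_body):
    """Pick a representation for the SAME URL; returns (headers, body).

    Markdown is served only when the client lists text/markdown explicitly
    and does not rank HTML higher; everyone else gets HTML.
    """
    prefs = parse_accept(accept_header or "")
    md_q = prefs.get("text/markdown", 0.0)
    html_q = max(
        prefs.get("text/html", 0.0),
        prefs.get("text/*", 0.0),
        prefs.get("*/*", 0.0),
    )
    headers = {"Vary": "Accept"}  # tells caches the response depends on Accept
    if md_q > 0 and md_q >= html_q:
        headers["Content-Type"] = "text/markdown; charset=utf-8"
        return headers, markdown_body
    headers["Content-Type"] = "text/html; charset=utf-8"
    return headers, html_body


agent_headers, agent_body = negotiate(
    "text/markdown, text/html;q=0.9", "<h1>Pricing</h1>", "# Pricing"
)
browser_headers, browser_body = negotiate(
    "text/html,application/xhtml+xml,*/*;q=0.8", "<h1>Pricing</h1>", "# Pricing"
)
```

Because both variants live at one URL and HTML stays the default, a crawler that never sends Accept: text/markdown sees exactly the page a browser sees, which is what keeps this approach clear of the cloaking concern raised for separate .md files.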
Schema/structured data — useful, but not for the reason you're told
The hype claim: "without schema.org you don't get cited by AI". Google (May 2026) declares it not required to generate AI answers; Pedro Dias and others have shown that schema does not influence ChatGPT citations. Correlation studies exist but are confounded by third variables: sites with schema also tend to be better maintained and more authoritative — the correlation does not isolate schema as the cause. Balanced verdict: it remains useful for its classic purposes (rich results, parsing, entity disambiguation) and for traditional Search, not as an "AI trick".
Vendors' bait numbers
Precise percentages without a named primary source: "data-rich cited 2.7x more", "FCP <0.4s = 6.7 average citations", "well-organised headings = 2.8x more cited" (AirOps — the last at least has a declared dataset of 45,000 citations, more defensible). Methodology, sample size, number of runs and a control group are often missing. In light of SE 0.370 at one run, any two-decimal figure obtained without repeated runs is statistically suspect. Practical rule: if a claim does not state how many times it repeated each prompt and over what period, it is anecdote.
"SEO is dead" — the opposite over-extension
Equally false: the Ahrefs/Semrush data show that traditional organic ranking remains correlated (even if no longer a necessary condition) with AI citation; AI Overviews run on top of the Search ranking system; "parametric visibility" is built with the same signals as classic SEO (authority, press coverage, Wikipedia, backlinks, mentions). Google itself titles its position "AEO and GEO are still SEO". GEO is an extension of SEO, not a replacement. The foundations are more important, not less; what changes is the level of competition (chunk vs page), the surfaces and the metrics.
| GEO claim | Verdict | Priority |
|---|---|---|
| Statistics + source citations | Proven | High |
| Content freshness | Proven | High |
| Non-commodity content / first-hand experience | Confirmed | High |
| Answer-first structure | Solid | High |
| Brand mentions > backlinks | 1 study | Medium |
| Schema for AI citation | Overrated | Low |
| Manual chunking | Myth (Google) | Low |
| llms.txt as a lever | Unproven | Very low |
| Serving pages in Markdown | Marginal | Low |
| "2.7x" bait numbers | Anecdote | Ignore |
| "SEO is dead" | False | — |
The details on web.run, query fan-out and system prompts derive from independent analyses (RESONEO/Meteoria, AirOps, Dejan), not from complete official documentation, and they change from one model to the next. The pipelines change quickly (Gemini 3 in Jan 2026, the ChatGPT 5.3 switch in Mar 2026): every figure has a validity date.