
Detect Content Cannibalization at Scale with Vector Embeddings: A Python and n8n Pipeline

Keyword cannibalization is one of those problems most SEO teams “kind of” know about but rarely solve at scale. You have 4,000 URLs, dozens of pages targeting overlapping queries, and a vague feeling that Google is rotating the wrong page into the SERP. A manual audit using Search Console filters and a spreadsheet works for a 200-URL site. It collapses the moment you cross a few thousand.

In 2026 there is a better way: treat cannibalization as a semantic similarity problem, compute it with vector embeddings, and run the whole pipeline on a schedule from n8n. This post walks through the exact workflow we use to flag cannibalization across mid-sized sites, including the Python code, the n8n nodes, and the thresholds that actually matter.

Why classic cannibalization audits break at scale

The traditional method goes like this: pull Search Console “Queries” data, find queries where multiple URLs receive clicks, then inspect each one. That works for a handful of cases but misses the bigger problem — cannibalization is not always visible in the queries report. Two pages can compete because their content overlaps even when GSC only shows one of them ranking. The other page is silently dragging the topic down without ever appearing in your filters.

Embeddings catch this. When you convert page bodies into vectors (1,536 dimensions with OpenAI’s text-embedding-3-small), you can measure how semantically close two pages are regardless of whether Search Console has surfaced a shared query yet. With that model, pages with cosine similarity above ~0.92 are almost certainly competing for the same intent. That’s the signal we want.

If you’ve already built the keyword-research workflow described in our n8n + GSC API guide, you have most of the inputs you need: a list of URLs and their primary query clusters. The pipeline below extends that foundation.

The pipeline at a glance

Five stages, all orchestrated from a single n8n workflow that runs nightly:

  1. Crawl & extract — pull the main content from each indexable URL.
  2. Embed — compute an embedding per page (OpenAI text-embedding-3-small or a local model).
  3. Cluster — group near-duplicates using cosine similarity.
  4. Score — combine semantic similarity with GSC overlap to produce a cannibalization score.
  5. Report — write the flagged pairs to Google Sheets and ping Slack.

The whole job runs in roughly 8–15 minutes for a 5,000-URL site once embeddings are cached. After the first full run, only changed URLs need re-embedding.

Step 1: Crawl and extract main content

You need clean body text, not full HTML. Boilerplate (nav, footer, sidebar) inflates similarity scores because every page on the site shares it. Use trafilatura — it consistently beats Readability and BeautifulSoup-based extraction on real-world layouts.

import trafilatura, requests

def fetch_main_text(url: str) -> str:
    r = requests.get(url, timeout=20, headers={"User-Agent": "SACAuditBot/1.0"})
    r.raise_for_status()
    text = trafilatura.extract(
        r.text,
        include_comments=False,
        include_tables=False,
        favor_precision=True,
    )
    return (text or "").strip()

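In n8n, wrap this in a Code (Python) node, or call a small FastAPI service if you prefer to keep n8n stateless. Depending on your n8n version and hosting, the Code node may not be able to import third-party packages such as trafilatura, in which case the separate service is the safer route. The output is one row per URL with a body_text field. Truncate long pages before embedding: text-embedding-3-small accepts at most 8,191 tokens per input, and anything beyond roughly 8,000 tokens rarely improves similarity for SEO content. Note that the Step 2 code cuts by characters (text[:8000]), which for English text is a much stricter limit of roughly 2,000 tokens.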
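To make the truncation step concrete, here is a minimal sketch that trims body text to an approximate token budget and drops thin pages before they reach the embedding step. The function name prepare_body and the 4-characters-per-token ratio are assumptions for illustration (the ratio is a rough rule of thumb for English, not an exact tokenizer count); the 250-word floor mirrors the soft-404 filter discussed under Common pitfalls.

```python
from typing import Optional

def prepare_body(text: str, max_tokens: int = 8000,
                 chars_per_token: int = 4, min_words: int = 250) -> Optional[str]:
    """Return text trimmed to an approximate token budget, or None if the page is too thin."""
    # Thin or empty pages embed close to each other, so skip them entirely.
    if len(text.split()) < min_words:
        return None
    # Assumption: ~4 characters per token. Use a real tokenizer if you need exact counts.
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    # Cut at the last space inside the budget so no word is split in half.
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut > 0 else max_chars]
```

Returning None for thin pages lets the caller filter rows with a simple truthiness check before embedding.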

Step 2: Generate embeddings (and cache them)

The same input produces effectively the same embedding, so re-embedding unchanged pages is wasted spend and caching is non-negotiable. We use a tiny SQLite store keyed on (url, content_hash). If the hash hasn’t changed since the last run, we skip the API call entirely.

import hashlib, sqlite3, json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
db = sqlite3.connect("embeddings.db")
# One row per (url, content hash); the vector is stored as JSON text.
db.execute("""CREATE TABLE IF NOT EXISTS emb (
    url TEXT, hash TEXT, vector TEXT,
    PRIMARY KEY (url, hash))""")

def embed(url: str, text: str) -> list[float]:
    # Hash the full extracted text so any content change invalidates the cache.
    h = hashlib.sha256(text.encode()).hexdigest()
    row = db.execute(
        "SELECT vector FROM emb WHERE url=? AND hash=?", (url, h)
    ).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: no API call
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text[:8000],  # character cut, not a token count
    )
    vec = resp.data[0].embedding
    db.execute("INSERT OR REPLACE INTO emb VALUES (?,?,?)",
               (url, h, json.dumps(vec)))
    db.commit()
    return vec

If you want to avoid OpenAI, swap in sentence-transformers with all-MiniLM-L6-v2. It’s 384-dimensional, runs on CPU, and is accurate enough for cannibalization work. The downstream code doesn’t care about model choice, only that every URL is embedded with the same model. The similarity thresholds do depend on the model, though, so re-validate the 0.92 cutoff on your own pages after switching.
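A minimal sketch of the local-model swap might look like the following. The function name embed_local and the optional model argument are illustrative assumptions, not part of the pipeline above; the argument exists so you can load the model once and reuse it, or pass a stand-in during testing. The sentence-transformers import happens lazily so the heavy dependency only loads when it is actually needed.

```python
import numpy as np

def embed_local(texts, model=None):
    """Embed texts with a sentence-transformers model and return unit-length rows."""
    if model is None:
        # Lazy import: requires `pip install sentence-transformers`.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = np.asarray(model.encode(list(texts)), dtype=np.float32)
    # Normalise each row so a plain dot product equals cosine similarity.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)
```

If you switch models on an existing install, start with a fresh cache file: the SQLite table in Step 2 is keyed on URL and content hash only, so it cannot tell vectors from different models apart.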

Step 3: Find the dangerous pairs

Naive pairwise comparison is O(n²), which is fine up to about 5,000 URLs (roughly 12.5 million pairs). Beyond that, use FAISS or an HNSW index for approximate nearest neighbors. Here’s the straightforward version:

import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_cannibal_pairs(vectors: dict, threshold: float = 0.92):
    items = list(vectors.items())
    pairs = []
    for (u1, v1), (u2, v2) in combinations(items, 2):
        sim = cosine(np.array(v1), np.array(v2))
        if sim >= threshold:
            pairs.append((u1, u2, round(sim, 4)))
    return sorted(pairs, key=lambda x: -x[2])

The 0.92 threshold is the value that has held up best across the dozen sites we’ve audited. Below 0.88 you get false positives — pages on the same topic that legitimately serve different intents. Above 0.95 you mostly catch near-duplicate pages (often templated content or accidental duplicates from CMS migrations).
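If the pure-Python loop feels slow near the 5,000-URL mark, one option is to compute all similarities in a single matrix product. The sketch below is an alternative to find_cannibal_pairs, not a replacement we claim the pipeline ships with; the name find_pairs_vectorized is hypothetical. It is still exact O(n²) work, just pushed into NumPy, and the full similarity matrix for 5,000 URLs takes roughly 100 MB at float32.

```python
import numpy as np

def find_pairs_vectorized(vectors: dict, threshold: float = 0.92):
    """Return (url_a, url_b, similarity) tuples, highest similarity first."""
    urls = list(vectors)
    if len(urls) < 2:
        return []
    X = np.asarray([vectors[u] for u in urls], dtype=np.float32)
    # Normalise rows so the matrix product below yields cosine similarities.
    X = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    sims = X @ X.T
    # Keep only the strict upper triangle so each pair appears once, without self-pairs.
    upper = np.triu(np.ones(sims.shape, dtype=bool), k=1)
    rows, cols = np.nonzero(upper & (sims >= threshold))
    pairs = [(urls[r], urls[c], round(float(sims[r, c]), 4))
             for r, c in zip(rows, cols)]
    return sorted(pairs, key=lambda p: -p[2])
```

It takes the same url-to-vector dictionary and returns the same tuple shape, so the scoring step downstream does not need to change.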

Step 4: Combine with Search Console signal

Semantic similarity alone produces a long list. To prioritize, intersect it with real-world ranking conflicts. For each pair, ask GSC: are both URLs receiving impressions on at least one shared query?

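def shared_queries(gsc_data, url_a, url_b, min_impr=10):
    # gsc_data maps URL -> list of rows with "query" and "impressions".
    # URLs with no Search Console data are treated as having no queries.
    qa = {row["query"] for row in gsc_data.get(url_a, []) if row["impressions"] >= min_impr}
    qb = {row["query"] for row in gsc_data.get(url_b, []) if row["impressions"] >= min_impr}
    return qa & qb

def cannibalization_score(sim, shared_q_count):
    # sim is the embedding similarity (0-1). Shared queries amplify it,
    # capped at 10 shared queries (a maximum multiplier of 2x).
    return round(sim * (1 + min(shared_q_count, 10) / 10), 3)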

A pair with similarity 0.93 and 4 shared queries gets a score of 1.302. A pair with the same similarity but zero shared queries gets 0.93. We sort the report by this composite score so editors look at the most actionable conflicts first.
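The scoring functions expect gsc_data to map each URL to a list of query rows. As a sketch of how that lookup could be built, the snippet below groups Search Console rows by page. It assumes the rows come from a Search Analytics query with dimensions ["query", "page"], so each row carries keys in that order; index_gsc_rows and the example.com URLs are made up for illustration. shared_queries and cannibalization_score follow the definitions above and are repeated only so the snippet runs on its own.

```python
from collections import defaultdict

def index_gsc_rows(rows):
    """Group rows by page URL. Assumes row["keys"] == [query, page]."""
    by_url = defaultdict(list)
    for row in rows:
        query, page = row["keys"]
        by_url[page].append({"query": query, "impressions": row["impressions"]})
    return by_url

def shared_queries(gsc_data, url_a, url_b, min_impr=10):
    qa = {r["query"] for r in gsc_data.get(url_a, []) if r["impressions"] >= min_impr}
    qb = {r["query"] for r in gsc_data.get(url_b, []) if r["impressions"] >= min_impr}
    return qa & qb

def cannibalization_score(sim, shared_q_count):
    return round(sim * (1 + min(shared_q_count, 10) / 10), 3)

# Made-up sample rows for two competing pages.
rows = [
    {"keys": ["crm pricing", "https://example.com/a"], "impressions": 120},
    {"keys": ["crm pricing", "https://example.com/b"], "impressions": 45},
    {"keys": ["crm cost", "https://example.com/a"], "impressions": 30},
    {"keys": ["crm cost", "https://example.com/b"], "impressions": 4},
]
gsc_data = index_gsc_rows(rows)
shared = shared_queries(gsc_data, "https://example.com/a", "https://example.com/b")
score = cannibalization_score(0.93, len(shared))
```

Here only "crm pricing" clears the 10-impression floor on both pages, so the pair has one shared query and the score is 0.93 × 1.1.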

Step 5: Decide what to do with each pair

Detection is the easy part. Resolution requires editorial judgment, but the workflow can pre-classify each conflict to save reviewer time. We use three buckets:

  • Merge — similarity > 0.95 AND shared queries > 5. These are near-duplicates. Pick the stronger URL, 301 the weaker one, and absorb any unique sections.
  • Differentiate — similarity 0.88–0.94 AND shared queries 1–5. The pages cover overlapping but distinct intents. Rewrite one to target a clearly different query or audience.
  • Canonical — pages that must coexist (paginated content, regional variants) but compete in SERP. Add or fix rel=canonical rather than redirect.
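The three buckets can be pre-assigned with a small function like the sketch below. Two parts are assumptions on our side rather than rules stated above: the must_coexist flag (similarity alone cannot tell that two pages are paginated or regional variants, so that has to come from your own URL metadata) and the "review" fallback for pairs that land outside the stated ranges, such as very high similarity with few shared queries.

```python
def classify_pair(sim: float, shared_q: int, must_coexist: bool = False) -> str:
    """Suggest a resolution bucket for one flagged pair."""
    # Hypothetical flag: set it from your own pagination or hreflang data.
    if must_coexist:
        return "canonical"
    if sim > 0.95 and shared_q > 5:
        return "merge"
    if 0.88 <= sim <= 0.94 and 1 <= shared_q <= 5:
        return "differentiate"
    # Assumption: anything outside the stated ranges goes to a human reviewer.
    return "review"
```

The bucket is only a suggestion for the sheet; the editor still makes the final call on each pair.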

The n8n workflow writes each pair to a Google Sheet with the recommended bucket, the similarity score, the shared queries, and links to both URLs. The editorial team works through that sheet weekly.

The n8n wiring

A high-level node sequence:

  1. Schedule Trigger — nightly at 02:00 site-local time.
  2. Google Search Console node — pull last 28 days of query+page data.
  3. Code node (Python) — extract unique URLs.
  4. Code node — run fetch_main_text in parallel batches of 20.
  5. Code node — call the embedding function with caching.
  6. Code node — run pair detection and scoring.
  7. Google Sheets node — append flagged pairs.
  8. Slack node — send a summary if any pair has score > 1.2.
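For the fetch step in the node sequence above, batching can be sketched with a thread pool. This is one possible shape under stated assumptions, not the exact node code: fetch_in_batches is a hypothetical helper, and the fetch argument stands in for fetch_main_text from Step 1. Failed URLs are recorded as None so one bad page does not abort the whole nightly run.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_in_batches(urls, fetch, batch_size=20):
    """Fetch URLs in batches of batch_size; return {url: text or None}."""
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # Timeouts and HTTP errors become None instead of stopping the run.
            return None

    results = {}
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(urls), batch_size):
            batch = urls[start:start + batch_size]
            # pool.map preserves input order, so zip pairs each URL with its result.
            for url, text in zip(batch, pool.map(safe_fetch, batch)):
                results[url] = text
    return results
```

Processing one batch at a time keeps at most 20 requests in flight, which is gentler on the target site than submitting every URL at once.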

If you need a deeper walkthrough of the n8n stack choice, our n8n vs Make vs Zapier comparison covers when each platform makes sense. For sites under 1,000 URLs you can run the entire workflow inside a single self-hosted n8n container with no external services.

Results from three real audits

We ran this pipeline against three production sites in Q1 2026. Numbers below are real but anonymized.

  • Site A — 1,800 URLs (SaaS blog). 47 pairs flagged. After merging 12 and differentiating 18, organic clicks to the surviving URLs rose 23% in the following 90 days. Six pairs were duplicate-feed bugs in the CMS.
  • Site B — 4,300 URLs (affiliate site). 312 pairs flagged, dominated by accidental category/tag duplication. Cleanup cut indexable URLs by 31% and impressions per surviving URL rose by an average of 38%.
  • Site C — 12,000 URLs (programmatic). 1,100+ pairs. The audit revealed a template bug producing near-identical pages across regions. Fixing the template was a one-line code change that resolved most pairs at once.

The pattern across all three: most cannibalization is structural, not editorial. Once you can see the duplicates as a list of URL pairs sorted by severity, the fix is usually clear — and often a single root cause explains dozens of pairs.

Common pitfalls

A few things you’ll trip over on the first run:

  • Boilerplate inflation. If extraction includes the nav and footer, every page becomes “similar” to every other page. Test extraction on 10 random URLs before trusting the embeddings.
  • Mixed languages. Embedding models handle multiple languages but mixing them within one comparison set lowers the signal. Split by language first.
  • Soft 404s. Empty pages and thin pages all embed close to each other. Filter out URLs with fewer than ~250 words of extracted content before pairing.
  • Threshold drift. Re-validate the 0.92 cutoff on your own corpus. Pick 20 known-duplicate and 20 known-distinct pairs, score them, and adjust until the gap is clean.
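The threshold check in the last bullet can be sketched as a few lines of Python. The helper name suggest_threshold and the midpoint rule are assumptions for illustration: it takes the similarity scores of your known-duplicate and known-distinct pairs, and if the two groups separate cleanly it proposes the midpoint of the gap as a cutoff.

```python
def suggest_threshold(dup_scores, distinct_scores):
    """Return a cutoff between the two groups, or None if they overlap."""
    lowest_dup = min(dup_scores)
    highest_distinct = max(distinct_scores)
    if lowest_dup <= highest_distinct:
        # No clean gap: inspect the overlapping pairs by hand before picking a cutoff.
        return None
    # Assumption: the midpoint of the gap is a reasonable starting point.
    return round((lowest_dup + highest_distinct) / 2, 3)
```

A None result is useful in its own right: it usually means the extraction is still leaking boilerplate or that some of the labelled pairs deserve a second look.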

Want more pipelines like this? Bookmark SEOAutomationClub.com — we publish a new automation playbook every weekday morning. The pattern of “GSC + embeddings + n8n” generalizes well: the same architecture powers our content decay detector and the cannibalization workflow above.

FAQs

What’s the difference between keyword cannibalization and content cannibalization?

Keyword cannibalization is when two URLs compete for the same query in Search Console. Content cannibalization is broader — two URLs overlap semantically even if SERPs haven’t yet flipped between them. The embedding approach in this post catches both.

Do I need OpenAI to run this pipeline?

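No. sentence-transformers with all-MiniLM-L6-v2 runs on CPU and is free. For sites under 10,000 URLs the quality is enough, with one caveat: by default the model truncates input at 256 word pieces, so it only sees the opening of a long page. OpenAI’s text-embedding-3-small is faster and slightly better at long-form content; it costs about $0.02 per 1,000 pages, assuming roughly 1,000 tokens per page.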
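To sanity-check the cost for your own site, the arithmetic is simple enough to script. The defaults below are assumptions: $0.02 per million tokens is the list price we have in mind for text-embedding-3-small, and 1,000 tokens per page is a rough average. Check the current pricing page and your own page lengths before budgeting.

```python
def embedding_cost_usd(pages: int, tokens_per_page: int = 1000,
                       usd_per_million_tokens: float = 0.02) -> float:
    """Estimate the one-off cost of embedding every page once."""
    # Both defaults are assumptions; override them with your own numbers.
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000

first_run = embedding_cost_usd(5000, tokens_per_page=2000)
```

With those assumptions a 5,000-URL site at 2,000 tokens per page costs about $0.20 for the first full pass, and later runs only pay for changed pages.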

How often should this audit run?

The workflow above is scheduled nightly, but that is overkill for most sites. Weekly is sensible. Re-embed only URLs whose content hash has changed. The SQLite cache pattern in Step 2 handles this automatically, so a weekly run typically processes only 1–5% of URLs after the first full pass.

Can I use this on a JavaScript-rendered site?

Yes, but swap requests for a headless browser or a rendering API. The Bright Data Scraping Browser and Playwright both work; the rest of the pipeline is identical because it consumes plain text downstream.
