
Reverse-Engineering a Competitor’s Programmatic SEO Setup: A Python Teardown of Sitemaps, Templates, and Internal Link Graphs

Every SEO team eventually runs into the same wall: a competitor is ranking for thousands of long-tail queries you have not even thought to target, and they are doing it with pages that all look suspiciously similar. That is a programmatic SEO (pSEO) setup — one HTML template, one dataset, and an internal linking scheme stamped out across hundreds or thousands of URLs. The instinct is to admire the scale and move on. The better move is to take it apart.

This post is a hands-on teardown. We will treat a competitor’s pSEO site as a black box and reverse-engineer the three things that actually make it work: the URL footprint (what they templated and how big the surface area is), the page template (which blocks are static boilerplate vs. data-driven), and the internal link graph (how authority flows to the money pages). Everything here runs on Python you can copy into a notebook, and we close with an n8n recipe to turn the teardown into a repeatable monitoring job.

What a programmatic SEO setup actually is

Before you can reverse-engineer something, you need a mental model of it. A pSEO site is almost always three components glued together. First, a dataset — a spreadsheet, database table, or API feed with one row per page (cities, products, “X vs Y” pairs, integrations, ZIP codes). Second, a template — a single page layout with slots that get filled from each row. Third, an internal linking layer — hub pages, related-item modules, and breadcrumb trails that connect the templated pages so crawlers can discover and rank them.

When you understand a competitor’s setup at this level, you are not copying their content. You are learning which dataset dimensions Google rewarded, how thin or thick their template is, and where the internal-link equity is concentrated. That is intelligence you can act on, and it is far more durable than scraping their copy.
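To make the three components concrete, here is a minimal sketch of how a comparison-page generator could be wired together. The tool list, the template string, and the helper names (`build_pages`, `related_links`) are all hypothetical; a real competitor's stack will differ, but the shape (rows in, URLs and link modules out) is what the teardown looks for.

```python
from itertools import combinations

# Hypothetical dataset: one entry per tool; a page is generated per pair.
TOOLS = ["Asana", "Trello", "Notion"]

# Hypothetical template with slots that are filled from each row.
TEMPLATE = "<h1>{a} vs {b}</h1><p>Compare {a} and {b} on pricing and features.</p>"

def slugify(name):
    return name.lower().replace(" ", "-")

def build_pages(tools):
    """Return {url_path: html} for every unordered pair of tools."""
    pages = {}
    for a, b in combinations(tools, 2):
        path = f"/compare/{slugify(a)}-vs-{slugify(b)}"
        pages[path] = TEMPLATE.format(a=a, b=b)
    return pages

def _tools_in(path):
    # "/compare/asana-vs-trello" -> {"asana", "trello"}
    return set(path.rsplit("/", 1)[-1].split("-vs-"))

def related_links(path, pages, limit=2):
    """Internal linking layer: link a page to others that share a tool."""
    related = [p for p in pages if p != path and _tools_in(path) & _tools_in(p)]
    return sorted(related)[:limit]

pages = build_pages(TOOLS)
```

Three tools yield three pages; a few hundred rows yield tens of thousands, which is why the footprint grows so quickly once the template exists.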

Step 1 — Map the footprint from sitemaps

The fastest, most polite way to size a pSEO operation is the XML sitemap. pSEO sites almost always expose their full URL inventory there, often split across a sitemap index. Pull the index, walk every child sitemap, and you have the complete list of templated URLs without crawling a single rendered page.

import requests, re
from collections import Counter
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

UA = {"User-Agent": "Mozilla/5.0 (compatible; SEO-research/1.0)"}
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_sitemap(url):
    r = requests.get(url, headers=UA, timeout=20)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return ET.fromstring(r.content)

def all_urls(sitemap_url):
    root = fetch_sitemap(sitemap_url)
    # If this is a sitemap index, recurse into child sitemaps
    children = [loc.text.strip() for loc in root.findall(".//sm:sitemap/sm:loc", NS) if loc.text]
    if children:
        urls = []
        for child in children:
            urls.extend(all_urls(child))
        return urls
    # Otherwise it is a plain URL set; strip the whitespace that often surrounds <loc> values
    return [loc.text.strip() for loc in root.findall(".//sm:url/sm:loc", NS) if loc.text]

urls = all_urls("https://competitor.example/sitemap.xml")
print(f"Total URLs listed in sitemaps: {len(urls)}")
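Two practical wrinkles are worth a sketch. The sitemap is not always at /sitemap.xml, but robots.txt commonly declares it on `Sitemap:` lines, and child sitemaps are sometimes shipped as .xml.gz files that need decompressing before they are parsed. The helper names below (`sitemaps_from_robots`, `maybe_gunzip`) are illustrative, and both work on content you have already downloaded.

```python
import gzip

def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found

def maybe_gunzip(raw):
    """Decompress a sitemap payload if it starts with the gzip magic bytes."""
    return gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw
```

A typical use would be to fetch /robots.txt once, feed each declared sitemap into the recursive walk, and pass the raw response bytes through `maybe_gunzip` before handing them to the XML parser.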

Raw counts are interesting, but the real signal is in the path patterns. Collapse every URL into a normalized template by replacing the variable segments with a placeholder, then count how many URLs share each shape. The patterns with thousands of members are your competitor’s programmatic plays.

def to_pattern(url):
    path = urlparse(url).path.strip("/")
    segs = []
    for seg in path.split("/"):
        if "-vs-" in seg:
            # Keep the comparison shape visible: "asana-vs-trello" -> "{var}-vs-{var}"
            segs.append("{var}-vs-{var}")
        elif re.search(r"\d", seg) or "-" in seg:
            # Treat numeric or slug-like segments as variable
            segs.append("{var}")
        else:
            segs.append(seg)
    return "/" + "/".join(segs)

patterns = Counter(to_pattern(u) for u in urls)
for pat, n in patterns.most_common(15):
    print(f"{n:>6}  {pat}")

A typical output reveals the strategy instantly: /compare/{var}-vs-{var} with 4,200 URLs, /tools/{var} with 1,800, and a thin /blog/{var} tail. You now know exactly which dataset dimensions they bet on. This same sitemap-first approach is the backbone of our automated competitor content gap analysis workflow, which adds Search Console and Common Crawl data on top of the URL inventory.
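A digit-or-hyphen heuristic will miss single-word slugs such as /tools/calculator, which stay as literal segments and fragment the counts. One alternative, sketched here under an assumed threshold, is to call a segment variable when many distinct values appear under the same parent path. The function name `cardinality_patterns` and the `min_distinct=20` default are illustrative choices, not a standard.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def cardinality_patterns(urls, min_distinct=20):
    """Collapse a segment to {var} when its parent path has many distinct children."""
    paths = [urlparse(u).path.strip("/").split("/") for u in urls]

    # Map each parent prefix to the set of segment values seen directly under it.
    children = defaultdict(set)
    for segs in paths:
        for i, seg in enumerate(segs):
            children[tuple(segs[:i])].add(seg)

    counts = Counter()
    for segs in paths:
        out = []
        for i, seg in enumerate(segs):
            is_variable = len(children[tuple(segs[:i])]) >= min_distinct
            out.append("{var}" if is_variable else seg)
        counts["/" + "/".join(out)] += 1
    return counts
```

The trade-off is that small templated sections (fewer than `min_distinct` pages) stay uncollapsed, so it can be worth running both approaches and comparing the top patterns.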

Step 2 — Fingerprint the page template

Once you know the patterns, sample a handful of URLs from the largest one and figure out how much of each page is boilerplate versus data. The cheap, effective trick is to fetch two or three pages from the same pattern and diff them. Whatever is identical across pages is the template; whatever changes is the per-row data.

from bs4 import BeautifulSoup
import random

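# Pull pages from the largest pattern by path prefix; cap the sample at what exists
compare_urls = [u for u in urls if urlparse(u).path.startswith("/compare/")]
sample = random.sample(compare_urls, min(3, len(compare_urls)))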
texts = []
for u in sample:
    html = requests.get(u, headers=UA, timeout=20).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    texts.append(soup.get_text("\n", strip=True).split("\n"))

# Lines present on every sampled page = template boilerplate
boiler = set(texts[0])
for t in texts[1:]:
    boiler &= set(t)
unique_per_page = [len([l for l in t if l not in boiler]) for t in texts]
print(f"Boilerplate lines shared across all pages: {len(boiler)}")
print(f"Unique lines per page: {unique_per_page}")

If the “unique lines per page” number is small — say 6 to 10 — you are looking at a thin template that is one Google helpful-content update away from trouble. If it is large and the unique content is substantive, the competitor invested in real per-page value, and you will need to match that depth rather than out-publish them on volume. This ratio is the single most useful number a teardown produces, because it tells you whether the moat is the data or the editorial effort.
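Line counts treat a one-word label and a 200-word paragraph as equal, so a character-weighted version of the same ratio can be more telling. The sketch below assumes each page has already been reduced to a list of visible text lines, as in the diff above; `template_ratio` is an illustrative name.

```python
def template_ratio(pages):
    """pages: list of pages, each a list of visible text lines.

    Returns (number of boilerplate lines, [share of unique characters per page]).
    """
    # Lines present on every page are treated as template boilerplate.
    boiler = set(pages[0]).intersection(*map(set, pages[1:]))
    shares = []
    for lines in pages:
        total = sum(len(line) for line in lines)
        unique = sum(len(line) for line in lines if line not in boiler)
        shares.append(unique / total if total else 0.0)
    return len(boiler), shares
```

A page whose unique share sits near zero is almost pure template; one where most characters are unique carries real per-row content, whatever the raw line count says.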

Watch for client-side rendering

If the diff shows almost no unique content, do not assume the pages are thin — they may be rendering data client-side with JavaScript. In that case requests only sees the shell. Switch to a headless browser or a rendering API (Bright Data’s Web Unlocker and Scraping Browser both return the post-JavaScript DOM) so your fingerprint reflects what Googlebot actually indexes after rendering.
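Before reaching for a headless browser, a cheap check on the raw HTML can suggest whether client-side rendering is likely. This is a heuristic sketch using only the standard library; the thresholds (`min_visible`, `script_ratio`) are assumptions to tune against pages you have inspected by hand.

```python
from html.parser import HTMLParser

class _TextMeter(HTMLParser):
    """Counts characters of visible text versus inline script text."""

    def __init__(self):
        super().__init__()
        self.visible = 0
        self.script = 0
        self._stack = []  # currently open <script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            if self._stack[-1] == "script":
                self.script += len(data)
        else:
            self.visible += len(data.strip())

def looks_client_rendered(html, min_visible=500, script_ratio=5.0):
    """True when there is little visible text and far more inline script."""
    meter = _TextMeter()
    meter.feed(html)
    return meter.visible < min_visible and meter.script > script_ratio * max(meter.visible, 1)
```

When this returns True for a sampled URL, treat the diff result as untrustworthy and re-fetch that page through a renderer before drawing conclusions about template thickness.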

Step 3 — Reconstruct the internal link graph

Volume without internal linking is dead weight. The pages that rank in a large pSEO site are almost always the ones that receive the most internal links from hub pages and related-item modules. Crawl a sample of the templated URLs, extract on-site links, and build a directed graph so you can rank pages by inbound internal links.

import networkx as nx
from urllib.parse import urljoin

DOMAIN = "competitor.example"
G = nx.DiGraph()

def internal_links(url):
    html = requests.get(url, headers=UA, timeout=20).text
    soup = BeautifulSoup(html, "html.parser")
    out = set()
    for a in soup.select("a[href]"):
        target = urljoin(url, a["href"])
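        host = urlparse(target).netloc
        # Exact host or a true subdomain; a bare endswith would also match "notcompetitor.example"
        if host == DOMAIN or host.endswith("." + DOMAIN):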
            out.add(target.split("#")[0])
    return out

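for u in random.sample(urls, min(300, len(urls))):
    try:
        targets = internal_links(u)
    except requests.RequestException:
        continue  # skip pages that time out or error rather than abort the crawl
    for target in targets:
        G.add_edge(u, target)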

# Rank pages by how much internal authority points at them
pr = nx.pagerank(G)
top = sorted(pr.items(), key=lambda kv: kv[1], reverse=True)[:15]
for url, score in top:
    print(f"{score:.5f}  {url}")

Running PageRank over the internal graph surfaces the competitor’s intended money pages — the URLs they deliberately funnel equity toward. Cross-reference those against the patterns from Step 1 and you can see, for example, that the /compare/ pattern gets the bulk of internal links while /blog/ is starved. That is a deliberate prioritization decision you can learn from. If you want to apply the same thinking to your own site, our guide on automated internal linking strategies covers how to build these link modules at scale instead of by hand.
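The cross-reference can be sketched as a simple aggregation: count inbound internal links per site section. The function below keys on the first path segment rather than the full pattern, which is an assumption made to keep it self-contained; `inbound_by_section` is an illustrative name.

```python
from collections import Counter
from urllib.parse import urlparse

def inbound_by_section(edges):
    """edges: iterable of (source_url, target_url) pairs.

    Returns inbound internal-link counts keyed by the target's first path segment.
    """
    totals = Counter()
    for source, target in edges:
        if source == target:
            continue  # ignore self-links
        first = urlparse(target).path.strip("/").split("/")[0]
        totals["/" + first] += 1
    return totals
```

With a NetworkX graph, the edge list can be passed directly, for example `inbound_by_section(G.edges())`. Dividing each total by the number of URLs in that section gives links per page, which is fairer when sections differ widely in size.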

Step 4 — Estimate what actually ranks

A footprint of 4,000 pages does not mean 4,000 ranking pages. The final teardown step is separating the indexed-and-ranking pages from the dead inventory. You will not have the competitor’s Search Console, but you can approximate ranking reality two ways: sample the pattern against live SERPs for representative queries, and check which URLs are actually indexed with a site: probe. A SERP API (Bright Data’s SERP API returns structured Google results) makes the first approach scriptable.

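# Sketch: sample up to 50 templated URLs, derive each target query from the
# slug, and check whether the URL ranks in the top 20. serp_api is a placeholder:
# wire it to your SERP provider so it returns a list of result URLs.
def slug_to_query(url):
    # ".../compare/asana-vs-trello" -> "asana vs trello"
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1].replace("-", " ")

def same_page(a, b):
    # Match on host and path so another site with the same slug does not count
    return urlparse(a)[1:3] == urlparse(b)[1:3]

pattern_urls = [u for u in urls if urlparse(u).path.startswith("/compare/")]
picked = random.sample(pattern_urls, min(50, len(pattern_urls)))
hits = sum(any(same_page(r, u) for r in serp_api(slug_to_query(u))) for u in picked)
print(f"Estimated coverage: {hits / max(len(picked), 1):.0%} of sampled pages rank top-20")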

An estimated 8% top-20 coverage on a 4,000-page pattern still means roughly 320 ranking pages — a serious asset. A 1% coverage rate, on the other hand, tells you the competitor is mostly generating index bloat, and you can win the same space with a smaller, higher-quality set. This is the difference between copying a strategy and improving on it. If you want to detect where your own templated pages are competing against each other for the same query, our content cannibalization detection pipeline uses vector embeddings to flag the overlap.
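A sample of 50 gives a noisy estimate, so it helps to put an interval around the coverage rate before extrapolating to the whole pattern. The sketch below uses the Wilson score interval, a standard choice for small-sample proportions; applying it here is a suggestion, not part of the original workflow.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion hits/n."""
    if n == 0:
        return (0.0, 1.0)
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

low, high = wilson_interval(4, 50)  # 4 of 50 sampled pages rank = 8%
print(f"Coverage is plausibly between {low:.1%} and {high:.1%}")
print(f"On a 4,000-page pattern: {low * 4000:.0f} to {high * 4000:.0f} ranking pages")
```

With 4 hits in 50, the interval runs from roughly 3% to 19%, so the 320-page figure is better read as a midpoint of a wide range. Sampling a few hundred URLs narrows it considerably if the SERP API budget allows.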

Turning the teardown into an n8n monitoring job

A one-time teardown is useful; a scheduled one is a competitive radar. Step 1 in particular maps cleanly onto an n8n workflow: a Schedule Trigger (the Cron node in older n8n versions) fires weekly, an HTTP Request node pulls the competitor’s sitemap index, a Code node runs the pattern-collapsing logic, and a comparison node diffs this week’s pattern counts against last week’s stored snapshot. When a pattern grows by more than a threshold (say the /compare/ count jumps from 4,200 to 6,000), the workflow posts an alert to Slack and appends the delta to a Google Sheet.
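The comparison step is small enough to sketch in full. The version below is plain Python for consistency with the rest of the post; depending on your n8n version, the Code node may need it ported to JavaScript. The function name and both thresholds (`min_growth`, `min_new`) are assumptions to adjust.

```python
def pattern_deltas(previous, current, min_growth=0.2, min_new=100):
    """Compare two {pattern: url_count} snapshots and return alert-worthy growth.

    A pattern is flagged when it added at least `min_new` URLs and grew by at
    least `min_growth` (0.2 = 20%). Brand-new patterns count as infinite growth.
    """
    alerts = []
    for pattern, now in current.items():
        before = previous.get(pattern, 0)
        added = now - before
        growth = added / before if before else float("inf")
        if added >= min_new and growth >= min_growth:
            alerts.append({"pattern": pattern, "before": before, "now": now, "added": added})
    return sorted(alerts, key=lambda a: a["added"], reverse=True)
```

Requiring both an absolute and a relative threshold keeps the alert quiet when a huge pattern drifts by a few pages or a tiny one doubles from 5 to 10.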

That single signal — “a competitor just shipped 1,800 new templated pages” — is often the earliest warning you will get that they are expanding into your keyword space. Pipe the weekly pattern counts into a dashboard and the trend becomes obvious at a glance; our BigQuery and Looker Studio dashboard teardown shows how to wire that visualization layer up.

Results and takeaways

A disciplined pSEO teardown gives you four numbers that no keyword tool will hand you: the size of each templated pattern, the content-to-boilerplate ratio of the template, the internal-link distribution across patterns, and an estimated ranking coverage rate. Together they tell you whether a competitor’s moat is the dataset, the editorial depth, or the link architecture — and therefore where you should compete and where you should not bother.

The meta-lesson is that scale in SEO is legible. Anything stamped out from a template leaves a fingerprint in the sitemap and the link graph, and a few hundred lines of Python are enough to read it. Run the teardown once to understand a competitor, then schedule it to watch them.

Found this useful? Bookmark SEOAutomationClub and check back each week — we publish working code and real automation playbooks, not generic listicles. If you are building your own pSEO operation next, start with the internal linking automation guide so your pages actually get discovered and ranked.

Frequently asked questions

Is reverse-engineering a competitor’s programmatic SEO setup legal?

Reading public sitemaps and rendered pages is standard competitive research, but you should respect each site’s robots.txt, rate-limit your requests, and never scrape content to republish it. The goal of a teardown is to understand strategy — pattern sizes, template depth, and link architecture — not to copy editorial content, which raises both legal and duplicate-content risks.
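As a sketch of what respecting robots.txt and rate-limiting can look like in practice, the standard library's `urllib.robotparser` can filter a URL list and report any declared crawl delay. The helper names and the "SEO-research" agent token are illustrative, and the one-second fallback delay is an assumption, not a rule.

```python
from urllib.robotparser import RobotFileParser

def build_robots(robots_txt):
    """Parse a robots.txt body that has already been downloaded."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def polite_filter(urls, rp, agent="SEO-research"):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [u for u in urls if rp.can_fetch(agent, u)]

def crawl_delay(rp, agent="SEO-research", default=1.0):
    """Seconds to wait between requests: the declared Crawl-delay, else a default."""
    delay = rp.crawl_delay(agent)
    return float(delay) if delay is not None else default
```

In a crawl loop, filter the sample with `polite_filter` first and call `time.sleep(crawl_delay(rp))` between requests.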

What is the difference between a programmatic SEO teardown and a content gap analysis?

A content gap analysis tells you which keywords or topics a competitor ranks for that you do not. A pSEO teardown goes one level deeper into the mechanics: it reveals which URL patterns were templated, how thin or thick each page is, and how internal links concentrate authority. Gap analysis tells you what to target; a teardown tells you how the competitor built the machine that targets it.

How do I handle JavaScript-rendered pSEO pages?

Standard HTTP libraries like requests only return the initial HTML shell, so JavaScript-rendered content appears empty. Use a headless browser (Playwright or Puppeteer) or a rendering API that executes JavaScript and returns the final DOM. This ensures your template fingerprint reflects what Googlebot indexes after rendering, not the pre-render skeleton.

How often should I re-run a programmatic SEO teardown?

For active competitors, a weekly sitemap snapshot is enough to catch new templated pages early, and that is light enough to schedule in n8n without straining the target site. Reserve the full template-fingerprint and internal-link-graph crawl for monthly deep dives, since those are more request-intensive and change more slowly than raw URL counts.
