Automated Competitor Content Gap Analysis: A Python Workflow Using Search Console and Common Crawl

Content gap analysis is one of those SEO tasks that gets paid lip service in slide decks and then quietly skipped because the manual version is brutal. Pull your keywords, pull your competitor’s keywords, diff, prioritize, brief, write. By the time a strategist finishes mapping a single competitor for one segment, the SERPs have already moved.

The trick is to stop treating it as a one-off audit and start treating it as a recurring pipeline. With Google Search Console, the Common Crawl index, and a few hundred lines of Python, you can refresh a competitor gap report weekly with almost no human input. Below is the workflow we run, the failure modes you will hit, and the design decisions that matter.

Why most content gap tools quietly underdeliver

Commercial gap tools sell a tidy promise: feed them two domains, get a list of keywords the competitor ranks for that you do not. The output looks impressive in a screenshot. In practice the lists are noisy because they draw on a finite keyword universe (whatever the vendor crawls) and rarely consider whether the gap is actionable for your site.

You end up with three problems. The first is irrelevance: a third of suggested keywords belong to a topic cluster you have no business pursuing. The second is duplication: vendors flag keywords you already rank for under a slightly different query intent. The third is staleness: the data is refreshed monthly, but content velocity has moved to weekly cycles.

Building the pipeline yourself fixes all three. You get to define the universe (your own GSC plus a competitor crawl), the relevance filter (your topical clusters), and the cadence (whatever your scheduler runs).

The four data sources that actually matter

You need less data than vendors suggest. The pipeline runs on four inputs, all free or near-free.

1. Google Search Console (your domain)

Pull the last 90 days of query performance via the Search Analytics API. You want query, impressions, clicks, position, and the URL each query maps to. Ninety days smooths out weekly noise without dragging in old seasonal patterns.
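
A minimal sketch of the request body for that pull, assuming the standard Search Analytics query fields (startDate, endDate, dimensions, rowLimit, startRow). The helper name build_gsc_request is hypothetical, and the code only builds the payload; sending it (for example with google-api-python-client via service.searchanalytics().query(siteUrl=..., body=body).execute()) is left out so the snippet runs without credentials.

```python
from datetime import date, timedelta


def build_gsc_request(end: date, days: int = 90,
                      start_row: int = 0, row_limit: int = 25000) -> dict:
    """Build a Search Analytics query body covering `days` days ending at `end`.

    Hypothetical helper: only the dict keys come from the API; the function
    name and defaults are illustrative.
    """
    start = end - timedelta(days=days - 1)  # inclusive window of `days` days
    return {
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        # One row per (query, landing page) pair, so each query maps to a URL.
        "dimensions": ["query", "page"],
        "rowLimit": row_limit,   # 25,000 is the documented per-request maximum
        "startRow": start_row,   # increase by row_limit to page through results
    }


body = build_gsc_request(date(2025, 3, 31))
print(body["startDate"], body["endDate"])
```

Each response row carries keys (here the query and page), clicks, impressions, ctr, and position, which covers the fields listed above.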

2. Common Crawl (competitor pages)

Common Crawl publishes roughly monthly snapshots of the web at petabyte scale. For a single competitor, you only need the WARC records for their domain from the most recent crawl. The CDX index lets you query by domain prefix and pull just the HTML pages you care about, usually a few thousand URLs for a mid-size publisher.
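
As a rough illustration, a domain-prefix query against the public CDX index server can be built like this. The crawl ID and competitor.example are placeholders (current crawl IDs are listed at https://index.commoncrawl.org/collinfo.json), and cdx_query_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def cdx_query_url(crawl_id: str, domain: str) -> str:
    """Return a CDX index URL listing every capture under `domain`."""
    params = {
        "url": f"{domain}/*",   # trailing /* asks for a prefix match
        "output": "json",       # one JSON object per line
        # Only the fields later stages need: filtering and WARC retrieval.
        "fl": "url,status,mime,filename,offset,length",
    }
    return f"https://index.commoncrawl.org/{crawl_id}-index?{urlencode(params)}"


# Placeholder crawl ID and domain, for illustration only.
print(cdx_query_url("CC-MAIN-2024-10", "competitor.example"))
```

Large domains are paginated by the index server, so a production version would also need to walk the result pages.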

3. A title and H1 extractor

Once you have the competitor’s HTML, extract title tag, H1, meta description, and the first 200 words. That is enough surface area to classify each page into a topic cluster without paying for full embeddings on the entire document. We use BeautifulSoup; lxml works too.
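
The extraction step could look roughly like the sketch below. To keep it dependency-free it uses the standard library html.parser as a stand-in for BeautifulSoup, so it is not the parser named above; the class and function names are hypothetical, and treating the text of p tags as the page body is a simplifying assumption.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collect title, first H1, meta description, and paragraph text."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts, self.h1_parts, self.para_parts = [], [], []
        self.meta_description = ""
        self._open = None       # tag whose text is currently being captured
        self._h1_done = False   # only the first H1 is kept

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.meta_description = (attr.get("content") or "").strip()
        elif tag in ("title", "p") or (tag == "h1" and not self._h1_done):
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            if tag == "h1":
                self._h1_done = True
            self._open = None

    def handle_data(self, data):
        target = {"title": self.title_parts, "h1": self.h1_parts,
                  "p": self.para_parts}.get(self._open)
        if target is not None:
            target.append(data)


def summarize(html: str, max_words: int = 200) -> dict:
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    squash = lambda parts: " ".join("".join(parts).split())
    intro_words = " ".join(parser.para_parts).split()[:max_words]
    return {
        "title": squash(parser.title_parts),
        "h1": squash(parser.h1_parts),
        "meta_description": parser.meta_description,
        "intro": " ".join(intro_words),
    }
```

Real pages will need more care (boilerplate paragraphs in navigation and footers, for instance), which is where a full parser such as BeautifulSoup or lxml earns its place.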

4. An embedding model

OpenAI’s text-embedding-3-small or any open Sentence-BERT model is fine here. You will embed each competitor page (title + H1 + intro) and each of your own ranked queries. Cosine similarity gives you a cheap way to ask “does my site cover this page’s topic anywhere?”

The pipeline in five stages

The workflow takes 20-40 minutes end to end on a mid-size site. Each stage has a clear failure mode, so write checkpoints — pickle the intermediate dataframes between stages and you can rerun the last step without burning the API budget again.
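
One way to sketch those checkpoints is a small wrapper that loads a pickle if it exists and otherwise runs the stage and saves its result. The checkpoint helper and the file layout are hypothetical; any picklable object, including a pandas dataframe, would work as the stage output.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def checkpoint(path: Path, compute: Callable[[], Any], force: bool = False) -> Any:
    """Return the cached result at `path`, or run `compute` and cache it."""
    if path.exists() and not force:
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first so a crash cannot leave a half-written pickle.
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(result, fh)
    tmp.replace(path)
    return result
```

Passing force=True reruns a single stage, which is useful when you change the logic of a late stage but want to keep the expensive API pulls from earlier ones.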

Stage 1: Inventory your own coverage

Hit the GSC API for queries with at least 50 impressions in the last 90 days. Group them by landing page, then concatenate each page’s top 10 queries and embed the result. That single vector represents what the page is “about” from Google’s perspective, which is the only opinion that matters for gap analysis.
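
The grouping step might be sketched as follows. The row shape (page, query, impressions) and the function name are assumptions, as is ranking the “top 10” by impressions; the text to embed is returned, and the embedding call itself is left out.

```python
from collections import defaultdict


def page_query_strings(rows, min_impressions: int = 50, top_n: int = 10) -> dict:
    """Map each landing page to one string built from its top queries."""
    by_page = defaultdict(list)
    for row in rows:
        if row["impressions"] >= min_impressions:
            by_page[row["page"]].append(row)
    result = {}
    for page, items in by_page.items():
        # Assumption: "top" means highest impressions.
        top = sorted(items, key=lambda r: r["impressions"], reverse=True)[:top_n]
        result[page] = "; ".join(r["query"] for r in top)
    return result


rows = [
    {"page": "/a", "query": "blue widgets", "impressions": 500},
    {"page": "/a", "query": "widget sizes", "impressions": 900},
    {"page": "/a", "query": "rare query", "impressions": 10},
    {"page": "/b", "query": "gadget repair", "impressions": 60},
]
print(page_query_strings(rows))
```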

Stage 2: Pull the competitor’s URL inventory

Use the Common Crawl CDX API to list all URLs from the competitor domain in the most recent monthly crawl. Filter to status 200 and content-type text/html, and exclude tag and category archives. Most competitors tend to have the bulk of their value concentrated in a small share of URLs (the familiar 80/20 pattern), so this list will be smaller than you expect.
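
A sketch of that filter, applied to the JSON lines the CDX index returns. The field names (url, status, mime) follow the index output; the set of path segments treated as archives is an assumption you would adapt per competitor, and filter_cdx_lines is a hypothetical name.

```python
import json
from urllib.parse import urlsplit

# Assumption: archive pages live under path segments like these.
ARCHIVE_SEGMENTS = {"tag", "tags", "category", "categories"}


def filter_cdx_lines(lines) -> list:
    """Keep one record per URL for 200-status HTML pages outside archives."""
    kept = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # CDX values are strings, so compare against "200", not 200.
        if rec.get("status") != "200" or rec.get("mime") != "text/html":
            continue
        segments = set(urlsplit(rec["url"]).path.lower().strip("/").split("/"))
        if segments & ARCHIVE_SEGMENTS:
            continue
        kept[rec["url"]] = rec  # a later capture of the same URL wins
    return list(kept.values())
```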

Stage 3: Fetch and extract

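For each competitor URL, fetch the WARC record from Common Crawl’s S3 bucket with a byte-range request, using the filename, offset, and length from the CDX index (reads from within the bucket’s AWS region incur no egress charge). Extract title, H1, meta description, and intro paragraph. Embed the combined string. Cache aggressively: a published crawl is immutable, so a cache keyed on crawl ID plus WARC filename and offset never needs invalidating.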
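
A hedged sketch of the fetch, assuming a CDX record with filename, offset, and length fields. It reads over HTTPS from data.commoncrawl.org instead of going through an S3 client, which is a substitution for brevity; the same byte range applies either way. Function names are hypothetical.

```python
import gzip
import urllib.request


def warc_range_request(record: dict) -> tuple[str, dict]:
    """Return the URL and Range header that select one WARC record."""
    offset, length = int(record["offset"]), int(record["length"])
    url = f"https://data.commoncrawl.org/{record['filename']}"
    # HTTP ranges are inclusive, hence the -1 on the end byte.
    return url, {"Range": f"bytes={offset}-{offset + length - 1}"}


def fetch_warc_record(record: dict) -> bytes:
    """Download and decompress a single record (performs a network call)."""
    url, headers = warc_range_request(record)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        # Each record is its own gzip member, so the slice decompresses alone.
        return gzip.decompress(response.read())


def warc_cache_key(crawl_id: str, record: dict) -> str:
    """Stable cache key: the slice of an archived file never changes."""
    return f"{crawl_id}/{record['filename']}@{record['offset']}+{record['length']}"
```

The decompressed bytes contain the WARC headers, the HTTP response headers, and then the HTML, so the extractor needs to skip past the first two blocks.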

Stage 4: Match and score

For each competitor page embedding, find the nearest neighbor among your own page embeddings by cosine distance (1 minus cosine similarity). If the nearest distance exceeds your threshold (we use 0.35 in practice), nothing on your site is close enough and the competitor page is a gap candidate. If it falls below, you already cover that topic and the question becomes “are you ranking?”, which leads to a different workflow we have written up in our technical SEO monitoring guide.
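
The matching rule can be sketched in plain Python, reading the 0.35 threshold as a cosine distance (1 minus cosine similarity). Embeddings are assumed to be plain lists of floats keyed by URL; the function names are hypothetical, and a brute-force scan stands in for whatever nearest-neighbor index you would use at scale.

```python
import math


def cosine_distance(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat an all-zero vector as unrelated to everything
    return 1.0 - dot / norm


def find_gaps(competitor: dict, own: dict, threshold: float = 0.35) -> list:
    """Return (competitor_url, nearest_own_page, distance) for gap candidates."""
    gaps = []
    for url, vec in competitor.items():
        nearest_page, nearest_dist = min(
            ((page, cosine_distance(vec, own_vec)) for page, own_vec in own.items()),
            key=lambda pair: pair[1],
        )
        if nearest_dist > threshold:  # nothing of ours is close enough
            gaps.append((url, nearest_page, nearest_dist))
    # Largest distance first: the topics furthest from existing coverage.
    return sorted(gaps, key=lambda g: g[2], reverse=True)
```

With a few thousand competitor pages and a few thousand of your own, the brute-force loop is usually fast enough once vectorized; an approximate index only becomes worthwhile well beyond that.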

Stage 5: Prioritize

Gap candidates are useless without a priority score. We multiply three numbers: competitor page authority (use referring domains from your backlink tool, or host-level PageRank from Common Crawl’s web graph as a free but coarser proxy), estimated search volume (use GSC’s query data from similar pages as a baseline, not a third-party volume estimate), and topical fit (how close the candidate is to your existing strongest cluster).
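
As a sketch, the score is a plain product of the three inputs, written out as a ranked CSV. The GapCandidate fields, their scales, and the example numbers are all hypothetical; how you measure each factor is up to you.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class GapCandidate:
    url: str
    authority: float     # e.g. referring domains (assumed scale)
    est_volume: float    # e.g. impressions on your similar pages
    topical_fit: float   # e.g. similarity to your strongest cluster


def priority(c: GapCandidate) -> float:
    return c.authority * c.est_volume * c.topical_fit


def ranked_csv(candidates) -> str:
    """Return the candidates as CSV text, highest priority first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "authority", "est_volume", "topical_fit", "priority"])
    for c in sorted(candidates, key=priority, reverse=True):
        writer.writerow([c.url, c.authority, c.est_volume, c.topical_fit,
                         round(priority(c), 4)])
    return buffer.getvalue()
```

Because the factors are multiplied, rescaling any one of them by a positive constant leaves the ranking unchanged, so the inputs do not need a common scale. The flip side is that a zero in any factor removes the candidate from contention entirely.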

The output is a CSV of perhaps 30-80 gap candidates, ranked. From there, briefing and writing are human work, or another pipeline entirely, which connects neatly with automated internal linking strategies once the new posts ship.

Failure modes you will hit

Three things tend to break this pipeline in production.

Common Crawl’s monthly cadence means your competitor data lags 3-6 weeks. For most niches that is fine. For breaking news or fast-moving SaaS verticals, you will want to layer a custom Scrapy crawler on top that hits the competitor’s sitemap weekly and merges with the Common Crawl baseline.

Embedding drift is real. If you change embedding models mid-pipeline (cheaper provider, new version), your stored vectors will not be comparable. Version the embedding model in your cache key and recompute when you upgrade.
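
A minimal sketch of a versioned cache key, assuming vectors are cached per input string. The key format and the function name are hypothetical; the point is only that the model identifier is part of the key, so switching models can never return a stale vector.

```python
import hashlib
from typing import Optional


def embedding_cache_key(model: str, text: str,
                        dimensions: Optional[int] = None) -> str:
    """Key that changes whenever the model, output size, or text changes."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    dims = "default" if dimensions is None else str(dimensions)
    return f"{model}:{dims}:{digest}"


key_small = embedding_cache_key("text-embedding-3-small", "widget sizing guide")
key_other = embedding_cache_key("multilingual-e5-large", "widget sizing guide")
print(key_small != key_other)
```

Including the output dimension matters for models that let you request shortened vectors, since those are not comparable with full-length ones either.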

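GSC API rate limits hit harder than the headline numbers suggest. The documented Search Analytics quota is around 1,200 queries per minute per property, but Google also applies load-based limits, and expensive requests (grouping by both page and query over long date ranges) exhaust those sooner. If you run this for multiple clients, spread the runs across the day.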
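
Spreading runs can be as simple as assigning each property an evenly spaced start time. This is a sketch under that assumption; the property names are placeholders and staggered_schedule is a hypothetical helper that you would feed into whatever scheduler you already use.

```python
from datetime import datetime, timedelta


def staggered_schedule(properties, day_start: datetime,
                       window: timedelta = timedelta(hours=24)) -> dict:
    """Assign each property a start time, evenly spaced across `window`."""
    step = window / max(len(properties), 1)
    return {prop: day_start + i * step for i, prop in enumerate(properties)}


schedule = staggered_schedule(
    ["client-a.example", "client-b.example", "client-c.example", "client-d.example"],
    datetime(2025, 1, 6),
)
for prop, start in schedule.items():
    print(prop, start.isoformat())
```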

Where this fits in a larger automation stack

Content gap analysis is the upstream half of a publishing pipeline. The downstream half is brief generation, draft writing, and on-page optimization. If your team is already producing automated schema markup and structured data at scale, layering a weekly gap report on top closes the loop: the same script tells you what to write next.

The discipline that separates a useful gap pipeline from a vanity dashboard is the priority score. Without it, you get a list. With it, you get a queue — and queues are what production publishing teams actually consume.

Frequently asked questions

How often should I rerun a content gap analysis?

Weekly is the sweet spot for most sites. Common Crawl refreshes monthly, so daily runs only add value if you layer a custom competitor crawler on top. Quarterly is too slow for any actively growing niche.

Can I run this without Common Crawl?

Yes, but there is a trade-off. A direct Scrapy crawl of the competitor gives you fresher data but eats your IP reputation and risks getting blocked. Common Crawl is anonymous and free at the data layer. For most use cases, the freshness penalty is acceptable.

What embedding model is best for SEO content?

OpenAI’s text-embedding-3-small is the cheapest practical option in 2026 at production scale. Voyage AI’s voyage-3-lite is competitive for English. For multilingual sites, multilingual-e5-large is the open-source default. The model matters less than caching and versioning.

How do I avoid flagging keywords I already rank for?

The cosine distance threshold from Stage 4 is your control. Start at 0.35 and tune: too low and you over-trigger on near-duplicates, too high and you miss real gaps. Build a small labeled set of 50 gap-or-not examples and pick the threshold that maximizes F1 on that set.
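
The tuning loop might look like this sketch, which treats the threshold as a cosine distance as in Stage 4. Each labeled example is assumed to be a pair of (nearest distance, is it really a gap); the candidate grid of 0.05 steps and the function names are illustrative choices.

```python
def f1_at(threshold: float, labeled) -> float:
    """F1 of the rule 'distance > threshold means gap' on labeled pairs."""
    tp = fp = fn = 0
    for distance, is_gap in labeled:
        predicted = distance > threshold
        if predicted and is_gap:
            tp += 1
        elif predicted and not is_gap:
            fp += 1
        elif not predicted and is_gap:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labeled, candidates=None) -> float:
    """Return the candidate with the highest F1 (lowest one on ties)."""
    if candidates is None:
        candidates = [round(0.05 * i, 2) for i in range(1, 20)]
    return max(candidates, key=lambda t: f1_at(t, labeled))
```

With only 50 examples the F1 curve is often flat across several neighboring thresholds, so it is worth looking at the whole curve before trusting a single best value.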

Does this work for ecommerce sites?

The pipeline transfers, but the unit of analysis changes. You compare product category pages to competitor category pages, and the priority score weights GMV potential over impressions. The Common Crawl extraction needs richer rules — product cards, not just title and H1.
