Automate 301 Redirect Mapping for Site Migrations with Python and Embeddings

Site migrations are where SEO equity goes to die. You move to a new CMS, restructure URLs, consolidate a bloated blog — and if every old URL doesn’t land on its closest new equivalent with a clean 301, you bleed rankings, leak link equity into soft 404s, and spend the next quarter explaining a traffic cliff to your boss. The painful part isn’t the redirect rule itself. It’s the mapping: deciding which of 8,000 old URLs should point to which of 6,200 new ones.

Most teams still do this in a spreadsheet, eyeballing slugs at 11pm before a launch. That doesn’t scale, and it’s exactly the kind of fuzzy matching problem machines are good at. In this post I’ll show you how to build an automated 301 redirect mapper that uses sentence embeddings to match old URLs to new ones by meaning, not just string similarity — with a confidence threshold that auto-approves the easy matches and routes the ambiguous ones to a human. It’s the same embeddings backbone behind our embedding-based content cannibalization detector, pointed at a different problem.

Why string matching fails at redirect mapping

The naive approach is fuzzy string matching on slugs — Levenshtein distance, token sort ratio, that family. It works until it doesn’t. Consider an old URL /blog/2021/cheap-flights-to-lisbon-guide migrating to a new IA where the equivalent page is /destinations/portugal/lisbon. There is almost no lexical overlap, but semantically they’re the same page. String matching scores that pair near zero and either drops it (soft 404) or matches it to whatever shares the most characters, which is often wrong.
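To make that failure mode concrete, here is a minimal sketch using Python's standard-library difflib. The second candidate, /deals/cheap-flights-to-london-guide, is a hypothetical look-alike URL invented for illustration, and string_score is an illustrative helper name.

```python
from difflib import SequenceMatcher

def string_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a, b).ratio()

old_url = "/blog/2021/cheap-flights-to-lisbon-guide"
candidates = [
    "/destinations/portugal/lisbon",         # the semantically correct target
    "/deals/cheap-flights-to-london-guide",  # hypothetical look-alike, wrong city
]

scores = {c: string_score(old_url, c) for c in candidates}
best = max(scores, key=scores.get)

for candidate, score in scores.items():
    print(f"{score:.2f}  {candidate}")
print("string matcher picks:", best)
```

The look-alike wins purely on shared characters, even though it is about a different city.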

Embeddings solve this because they encode meaning. If you embed a representative text signature for each page (title, H1, and the cleaned slug), then “cheap flights to Lisbon guide” and “Lisbon, Portugal destination” land close together in vector space even though the only token they share is “Lisbon”. Cosine similarity between those vectors becomes your match score, and it degrades gracefully: as a rough guide, near-duplicates score around 0.9, plausible matches around 0.7, and unrelated pages around 0.3.
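As a reminder of what that score measures, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(u):
    """Scale u to unit length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

# Toy vectors, invented for illustration only
lisbon_guide = [0.9, 0.1, 0.3]
lisbon_destination = [0.8, 0.2, 0.4]
privacy_policy = [-0.1, 0.9, -0.2]

related = cosine(lisbon_guide, lisbon_destination)
unrelated = cosine(lisbon_guide, privacy_policy)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")

# For unit-length vectors, the plain dot product is the cosine similarity
unit_dot = dot(normalize(lisbon_guide), normalize(lisbon_destination))
```

Once vectors are normalized, the dot product alone gives the same score, which is why normalized embeddings are convenient for bulk matching.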

The pipeline at a glance

Here’s the full flow before we get into code. Each stage is a discrete step you can run independently and inspect:

  1. Inventory both URL sets. Export the old site’s live URLs (with titles and H1s) from a crawl, and the new site’s URLs from its sitemap or a staging crawl.
  2. Build a text signature per URL. Concatenate title + H1 + humanized slug into one string per page.
  3. Embed every signature. Run all signatures through a sentence-embedding model to get a vector per URL.
  4. Match by cosine similarity. For each old URL, find the nearest new URL.
  5. Apply a confidence threshold. Auto-approve high-confidence matches, flag the rest for manual review.
  6. Emit a redirect map as a CSV plus ready-to-paste nginx/.htaccess rules.
  7. Validate after deploy by checking that each old URL returns a single 301 to a live 200.

Step 1 — Inventory both URL sets

You need two CSVs. The old set must come from a crawl of the live site so you capture URLs that actually have inbound links and traffic — not just whatever is in the CMS. Screaming Frog or Sitebulb both export Address, Title 1, H1-1 columns directly; if you prefer code, point a crawler at the live domain. The new set can come from the staging site’s sitemap.xml while it’s still pre-launch.

old_urls.csv   ->  url, title, h1
new_urls.csv   ->  url, title, h1

Garbage in, garbage out applies hard here. Strip out paginated, faceted, and parameterized URLs from the old set before you start, or you’ll waste review time mapping ?sort=price&page=7 variants that should never have been indexed in the first place.
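A minimal sketch of that clean-up, using only the standard library. The /page/N/ pattern and the rule that any query string disqualifies a URL are assumptions; adjust both to the old site's actual URL scheme. is_mappable is an illustrative name.

```python
import re
from urllib.parse import urlparse

# Assumed pagination pattern, e.g. /blog/page/7/
PAGINATION_PATH = re.compile(r"/page/\d+/?$")

def is_mappable(url: str) -> bool:
    """Keep only URLs worth mapping: no query string, no pagination path."""
    parsed = urlparse(url)
    if parsed.query:  # ?sort=price&page=7, tracking parameters, facets
        return False
    if PAGINATION_PATH.search(parsed.path):
        return False
    return True

urls = [
    "https://example.com/blog/lisbon-guide",
    "https://example.com/shop?sort=price&page=7",
    "https://example.com/blog/page/3/",
]
clean = [u for u in urls if is_mappable(u)]
print(clean)
```

If some parameterized URLs on the old site are canonical pages with real traffic, whitelist them before applying a blanket filter like this one.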

Step 2 — Build a text signature and embed it

The quality of your matches lives and dies on the signature. Title and H1 carry the semantic payload; the slug, humanized (dashes to spaces, stop-segments removed), adds a useful hint, especially for thin pages with duplicate titles. We’ll use the sentence-transformers library with the all-MiniLM-L6-v2 model — small, fast, runs on CPU, and good enough for this. Swap in OpenAI’s text-embedding-3-small if you’d rather hit an API.

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

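def signature(row):
    # Humanize the last path segment: ".../lisbon-guide/" -> "lisbon guide"
    slug = row["url"].rstrip("/").split("/")[-1].replace("-", " ")
    parts = [str(row.get("title", "")), str(row.get("h1", "")), slug]
    # Drop empty fields and the "nan" that str() produces for missing CSV cells
    return " | ".join(p for p in parts if p and p != "nan")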

old = pd.read_csv("old_urls.csv")
new = pd.read_csv("new_urls.csv")

old["sig"] = old.apply(signature, axis=1)
new["sig"] = new.apply(signature, axis=1)

old_vecs = model.encode(old["sig"].tolist(), normalize_embeddings=True)
new_vecs = model.encode(new["sig"].tolist(), normalize_embeddings=True)

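Because normalize_embeddings=True returns unit-length vectors, cosine similarity reduces to a plain dot product: old_vecs @ new_vecs.T would give the same scores as the cosine_similarity call in the next step, up to floating-point error.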
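The signature function above keeps only the last path segment. A fuller humanizer that uses the whole path and drops stop-segments might look like the sketch below. STOP_SEGMENTS and humanize_path are illustrative names, and the stop list is an assumption to tune per site.

```python
import re
from urllib.parse import urlparse

# Assumed stop-segments: folders that carry no topical meaning
STOP_SEGMENTS = {"blog", "category", "tag", "index.html"}

def humanize_path(url: str) -> str:
    """Turn a URL path into plain words, skipping boilerplate segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    kept = []
    for seg in segments:
        if seg.lower() in STOP_SEGMENTS or seg.isdigit():
            continue  # drop boilerplate folders and date parts like 2021
        seg = re.sub(r"\.(html?|php|aspx)$", "", seg, flags=re.IGNORECASE)
        kept.append(re.sub(r"[-_]+", " ", seg))
    return " ".join(kept)

print(humanize_path("https://example.com/blog/2021/cheap-flights-to-lisbon-guide"))
print(humanize_path("/destinations/portugal/lisbon/"))
```

Keeping parent folders means a page like /destinations/portugal/lisbon contributes "portugal" to its signature, which can help when titles are thin.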

Step 3 — Match and threshold

Now compute the similarity matrix, take the best new URL for each old URL, and bucket by confidence. The thresholds below are sensible defaults from real migrations, but you should calibrate them on a sample of 50 known-good pairs before trusting them at scale.

sims = cosine_similarity(old_vecs, new_vecs)

best_idx = sims.argmax(axis=1)
best_score = sims.max(axis=1)

results = old[["url"]].copy()
results["target"] = new.iloc[best_idx]["url"].values
results["score"] = best_score

def bucket(s):
    if s >= 0.80:
        return "auto"        # ship it
    if s >= 0.60:
        return "review"      # human glance
    return "manual"          # no good match; decide by hand

results["decision"] = results["score"].apply(bucket)
results.to_csv("redirect_map.csv", index=False)

print(results["decision"].value_counts())

On a typical content-heavy migration you’ll see 60–75% of URLs land in the auto bucket, 15–25% in review, and the long tail in manual. That long tail is where the SEO judgment goes: a retired product page might 301 to its category, or to the most relevant guide, and only a human knows which preserves more intent. The model’s job is to shrink the pile you have to think about from thousands to hundreds.
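For the review and manual buckets, it can help to show reviewers a few runner-up candidates rather than only the top match. Here is a minimal sketch in plain Python; top_candidates is a hypothetical helper, and the scores are made-up numbers.

```python
import heapq

def top_candidates(scores, new_urls, k=3):
    """Return the k best (url, score) pairs for one old URL, highest first."""
    best = heapq.nlargest(k, zip(scores, new_urls))
    return [(url, round(score, 3)) for score, url in best]

# Toy similarity row for a single old URL (invented values)
new_urls = [
    "/destinations/portugal/lisbon",
    "/destinations/portugal/porto",
    "/guides/packing-list",
    "/about",
]
row = [0.71, 0.66, 0.31, 0.12]

for url, score in top_candidates(row, new_urls, k=2):
    print(f"{score:.2f}  {url}")
```

With the real similarity matrix, the same helper could be called as top_candidates(sims[i].tolist(), new["url"].tolist()) for each row i that landed in review or manual.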

Step 4 — Emit deployable redirect rules

A CSV is for review; your web server wants rules. Generate both nginx and Apache forms from the approved rows so whoever owns the server config can paste directly.

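import re
from urllib.parse import urlparse

# Ship "review" rows only after a human has signed them off.
approved = results[results["decision"].isin(["auto", "review"])]
# Server rules match the request path, so strip scheme/host and escape regex characters.
approved = approved.assign(pattern=[re.escape(urlparse(u).path or "/") for u in approved["url"]])

with open("redirects.nginx", "w") as f:
    for _, r in approved.iterrows():
        f.write(f'rewrite ^{r["pattern"]}$ {r["target"]} permanent;\n')

with open("redirects.htaccess", "w") as f:
    for _, r in approved.iterrows():
        # RedirectMatch is anchored; plain Redirect would also match longer paths by prefix
        f.write(f'RedirectMatch 301 ^{r["pattern"]}$ {r["target"]}\n')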

One rule per line, one hop per redirect. The cardinal sin of migrations is chained redirects — old URL 301s to an interim URL that 301s again — which dilutes signals and burns crawl budget. Always map straight to the final live URL.
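Chains tend to creep in when a new map is merged with redirect rules left over from earlier migrations. A minimal sketch of flattening a merged map before it ships, assuming the rules are held as a plain source-to-target dict; flatten is an illustrative name.

```python
def flatten(redirects):
    """Point every source at its final destination; raise on loops."""
    flat = {}
    for src in redirects:
        seen = {src}
        dst = redirects[src]
        while dst in redirects:  # the destination is itself redirected
            if dst in seen:
                raise ValueError(f"redirect loop involving {dst}")
            seen.add(dst)
            dst = redirects[dst]
        flat[src] = dst
    return flat

# Toy map: /a -> /b -> /c is a chain; /x already points at the final URL
merged = {"/a": "/b", "/b": "/c", "/x": "/c"}
print(flatten(merged))
```

After flattening, every source in the toy map points straight at /c, so each old URL costs exactly one hop.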

Step 5 — Validate after launch

Shipping the map is not the finish line. Post-launch, crawl every old URL and confirm it returns exactly one 301 to a URL that itself returns 200. This catches typos, redirect chains, and loops before Googlebot does.

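import requests

def check(old_url, expected):
    try:
        r = requests.get(old_url, allow_redirects=True, timeout=10)
    except requests.RequestException as e:
        # Timeouts, DNS failures and redirect loops count as failures, not crashes
        return {"url": old_url, "hops": None, "final": None, "ok": False, "error": str(e)}
    hops = len(r.history)
    # Exactly one hop, and that hop must be a permanent 301 (not a 302)
    one_301 = hops == 1 and r.history[0].status_code == 301
    final_ok = r.status_code == 200 and r.url.rstrip("/") == expected.rstrip("/")
    return {"url": old_url, "hops": hops, "final": r.url, "ok": one_301 and final_ok}

audit = [check(r["url"], r["target"]) for _, r in approved.iterrows()]
bad = [a for a in audit if not a["ok"]]
print(f"{len(bad)} redirects need attention")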

Feed the failures back into your monitoring. This pairs naturally with an automated index coverage monitoring pipeline so you can watch Google actually pick up the new URLs, and with a Googlebot crawl monitor to confirm the bot is following your 301s and discovering the new structure rather than hammering dead paths.

Orchestrating it in n8n

For a one-off migration a script is fine. But at agencies and on large sites, migrations and URL changes happen constantly, and you’ll want this on rails. In n8n, model it as: a trigger (manual or a webhook from your deploy pipeline) → an Execute Command or Code node that runs the embedding match → a branch on the decision column that posts the review and manual rows to a Slack channel for sign-off → a final node that writes the approved rules to a Git repo or pushes them via your CDN’s API. The validation step becomes a scheduled job that re-checks the redirect map daily for the first two weeks and alerts on any regression.
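One way to hand results to the next node is to have the script print a JSON summary to stdout. The sketch below assumes the workflow step captures stdout and that the CSV has the url, target, score, decision columns written earlier. summarize is a hypothetical helper, and the inline sample stands in for redirect_map.csv.

```python
import csv
import io
import json

def summarize(csv_text):
    """Count rows per decision and collect the ones that need sign-off."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = {}
    for row in rows:
        counts[row["decision"]] = counts.get(row["decision"], 0) + 1
    needs_signoff = [
        {"url": r["url"], "target": r["target"], "score": float(r["score"])}
        for r in rows
        if r["decision"] in ("review", "manual")
    ]
    return {"counts": counts, "needs_signoff": needs_signoff}

# Made-up rows standing in for the contents of redirect_map.csv
sample = """url,target,score,decision
/blog/lisbon-guide,/destinations/portugal/lisbon,0.86,auto
/blog/porto-tips,/destinations/portugal/porto,0.68,review
/products/old-widget,/about,0.41,manual
"""

summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

A downstream branch can then route on counts and post the needs_signoff list to the review channel.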

Results and takeaways

On a 7,400-URL blog migration I ran this against, the embedding mapper auto-approved 71% of URLs at the 0.80 threshold, surfaced 1,180 for a quick human glance, and left roughly 900 genuine judgment calls. The manual review that historically took two people three days collapsed to a single afternoon, and post-launch we held organic traffic within 4% of the pre-migration baseline through the volatile first month — versus the 20–30% dips that uncontrolled migrations routinely suffer.

The lesson isn’t “embeddings are magic.” It’s that the right automation removes the mechanical 70% of a task so your human judgment lands where it actually matters — the ambiguous tail. Build the pipeline once, calibrate your thresholds on real data, keep a human in the loop for the low-confidence bucket, and never, ever ship a chained redirect.

If you found this useful, bookmark SEOAutomationClub and check back — we publish a new working SEO automation playbook, with real code, every week.

Frequently asked questions

Should I use sentence-transformers or the OpenAI embeddings API?

For most migrations, the local all-MiniLM-L6-v2 model is plenty — it’s free, runs on CPU, and keeps your URL data on your own machine. Reach for OpenAI’s text-embedding-3-small or -large when your pages are long-form and nuanced enough that the small model’s matches feel coarse, or when you’re already embedding content elsewhere in your stack and want consistency.

What confidence threshold should I auto-approve at?

Start at cosine 0.80 for auto-approval and 0.60 for the review band, then calibrate. Pull 50 pairs you know the correct answer for, see where the model’s scores fall, and move the line until your auto bucket has near-zero false matches. Migrations with very templated titles may need a higher bar; distinctive editorial content can often go lower.
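That calibration can be sketched in a few lines. The labeled pairs below are invented for illustration, and calibrate is a hypothetical helper. It returns the lowest score at which everything at or above it stays within the allowed false-match rate.

```python
def calibrate(labeled, max_false_rate=0.0):
    """labeled: list of (score, is_correct) for pairs with known answers.

    Returns the lowest threshold whose auto bucket (score >= threshold)
    has a false-match rate <= max_false_rate, or None if no threshold fits.
    """
    best = None
    for threshold in sorted({score for score, _ in labeled}, reverse=True):
        bucket = [ok for score, ok in labeled if score >= threshold]
        false_rate = bucket.count(False) / len(bucket)
        if false_rate <= max_false_rate:
            best = threshold
    return best

# Made-up calibration sample: (model score, was the match actually right?)
labeled = [
    (0.93, True), (0.88, True), (0.84, True), (0.81, True),
    (0.78, False), (0.74, True), (0.69, True), (0.62, False),
]
print("auto-approve at:", calibrate(labeled))
```

On this toy sample the first wrong match sits at 0.78, so a zero-tolerance auto bucket has to stop just above it.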

How do I avoid redirect chains?

Always map old URLs to the final live destination, never to an intermediate URL that itself redirects. After deploy, crawl every old URL with redirects followed and assert exactly one hop ending in a 200. The validation snippet above flags anything with more than one hop so you can flatten it before Googlebot recrawls.

Can this handle a many-to-one consolidation?

Yes — that’s the common case when you prune a bloated blog. Multiple old URLs naturally match the same surviving new URL, which is exactly what you want for consolidation. Just review the manual bucket carefully, because a retired page sometimes belongs on a category or hub page rather than the single closest article, and that’s a call the model can suggest but shouldn’t make alone.
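To see where consolidation is concentrating, it can help to count how many old URLs point at each target and eyeball the heaviest ones. Here is a minimal sketch, with fan_in as an illustrative name and made-up URL pairs.

```python
from collections import Counter

def fan_in(pairs, min_sources=3):
    """Return (target, count) for targets receiving at least min_sources redirects."""
    counts = Counter(target for _, target in pairs)
    return [(t, n) for t, n in counts.most_common() if n >= min_sources]

# Made-up (old_url, target) pairs
pairs = [
    ("/blog/lisbon-guide", "/destinations/portugal/lisbon"),
    ("/blog/lisbon-on-a-budget", "/destinations/portugal/lisbon"),
    ("/blog/lisbon-in-winter", "/destinations/portugal/lisbon"),
    ("/blog/porto-tips", "/destinations/portugal/porto"),
]
for target, n in fan_in(pairs):
    print(f"{n} old URLs -> {target}")
```

A target with an unusually high count is worth a second look: it may be the right hub, or it may be a catch-all absorbing pages that belong elsewhere.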


