Automate Image Alt Text at Scale with a Vision LLM: A Python + n8n Pipeline
Most SEO automation effort goes into words: titles, meta descriptions, internal links, schema. Images get ignored — and on a content site with thousands of media files, that neglect quietly adds up. Empty alt attributes hurt accessibility, leave Google Image Search guessing, and now starve the multimodal models behind AI Overviews and visual search of the one signal they rely on to understand a picture. Writing alt text by hand doesn’t scale past a few dozen images. This post walks through a working pipeline that inventories every image missing alt text, generates accurate descriptions with a vision LLM, runs them through a quality gate, and writes them back to your CMS automatically.
The approach is deliberately model-agnostic and CMS-agnostic in concept, but the code targets the stack most readers here run: Python for the logic, the WordPress REST API for inventory and writeback, a vision model (Claude or GPT-4o) for generation, and n8n to schedule and orchestrate the whole thing. Expect to process roughly 500–1,000 images per hour at a cost of fractions of a cent each.
Why image alt text is an automation problem in 2026
Alt text serves three distinct consumers, and all three have raised the stakes recently. Screen readers have always needed it for accessibility — that hasn’t changed and remains a legal requirement in many jurisdictions. Google Image Search has used alt text as a primary ranking and relevance signal for years. The newer pressure is multimodal AI: when an LLM-driven SERP feature or a tool like Google Lens parses a page, descriptive alt text is a cheap, structured hint about what an image contains and why it’s on the page. A missing or generic alt="" means the model either ignores the image or has to run its own (expensive, error-prone) vision inference to guess.
The catch is that good alt text is contextual. “A laptop on a desk” is technically accurate and useless. “A terminal window showing an n8n workflow exporting Search Console data to BigQuery” is what actually helps. Generic CDN-style auto-alt (“image_4471.jpg”) is no better than nothing. That contextual quality is exactly what a vision LLM can now produce reliably and at scale — which is what makes this a tractable automation target rather than a manual chore.
Pipeline architecture
The pipeline has four stages, each independently testable:
1. Inventory: query the CMS for media items where the alt field is empty.
2. Generate: send each image (plus page context) to a vision model with a constrained prompt.
3. Quality gate: reject anything too long, too short, keyword-stuffed, or obviously hallucinated before it touches the live site.
4. Writeback: POST the alt text onto the media item via the REST API.
n8n wraps these as a scheduled workflow with error branches and a Slack notification on completion.
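One way to keep the four stages independently testable is to pass each one in as a function. The sketch below is an illustration only: run_pipeline and its summary keys are hypothetical names, not part of the scripts in this post, and the demo uses stand-in stages so nothing touches the network.

```python
from typing import Callable, Iterable

def run_pipeline(
    inventory: Callable[[], Iterable[dict]],
    generate: Callable[[dict], str],
    gate: Callable[[str], tuple],
    writeback: Callable[[dict, str], bool],
) -> dict:
    """Run the four stages in order and return counts plus a review list."""
    summary = {"written": 0, "rejected": 0, "failed": 0, "review": []}
    for item in inventory():
        alt = generate(item)
        ok, reason = gate(alt)
        if not ok:
            # Gate failures are held for a human instead of being published.
            summary["rejected"] += 1
            summary["review"].append({"id": item["id"], "alt": alt, "reason": reason})
        elif writeback(item, alt):
            summary["written"] += 1
        else:
            summary["failed"] += 1
    return summary

def demo_gate(alt: str) -> tuple:
    """Stand-in gate: require at least three words."""
    return (True, "ok") if len(alt.split()) >= 3 else (False, "length")

# Stand-in stages to show the wiring; swap in the real functions later.
demo = run_pipeline(
    inventory=lambda: [{"id": 1}, {"id": 2}],
    generate=lambda item: "Bar chart of monthly organic clicks" if item["id"] == 1 else "Chart",
    gate=demo_gate,
    writeback=lambda item, alt: True,
)
print(demo["written"], demo["rejected"])  # prints: 1 1
```

Because each stage is just a callable, you can unit-test the gate with canned strings and dry-run the writeback with a function that only logs.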
Step 1: Inventory images missing alt text
WordPress exposes media as a REST collection. The alt_text field is a top-level property of each media item (media_details holds dimensions and generated sizes), and there’s no native “empty alt” filter, so you paginate and filter client-side. Use the _fields parameter to pull only the fields you need and keep responses small.
import requests

BASE = "https://example.com/wp-json/wp/v2"
AUTH = ("automation_user", "xxxx xxxx xxxx xxxx")  # application password
# Browser-like User-Agent; some hosts block the default python-requests agent.
UA = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.0) "
                    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15"}

def images_missing_alt():
    """Return every image attachment whose alt_text is empty or whitespace."""
    page, missing = 1, []
    while True:
        r = requests.get(f"{BASE}/media", headers=UA, auth=AUTH, timeout=30,
                         params={"per_page": 100, "page": page,
                                 "media_type": "image",
                                 "_fields": "id,source_url,alt_text,title,post"})
        if r.status_code == 400:  # WordPress returns 400 once page runs past the last page
            break
        r.raise_for_status()  # surface auth or server errors instead of reporting zero images
        items = r.json()
        if not items:
            break
        for m in items:
            if not (m.get("alt_text") or "").strip():
                missing.append(m)
        page += 1
    return missing

targets = images_missing_alt()
print(f"{len(targets)} images need alt text")
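If your media library is huge, cache this inventory and only re-scan items uploaded since the last run: store the timestamp of the last successful run and pass it as the after parameter, which takes an ISO 8601 date, on subsequent runs. (The include parameter accepts an explicit list of IDs, not a range, so it can’t express “everything newer than ID N”.)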
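A minimal sketch of that incremental scan, assuming a small JSON checkpoint file next to the script. The file name and the helper names (load_last_run, save_last_run, media_params) are hypothetical. Timezone handling is also an assumption: the timestamp is written without an offset, so check it against your site’s timezone setting before relying on it.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional

STATE_FILE = Path("alt_text_state.json")  # hypothetical checkpoint location

def load_last_run(path: Path = STATE_FILE) -> Optional[str]:
    """Return the timestamp of the previous run, or None on the first run."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("last_run")

def save_last_run(when: datetime, path: Path = STATE_FILE) -> None:
    """Persist the start time of this run so the next one can resume from it."""
    path.write_text(json.dumps({"last_run": when.strftime("%Y-%m-%dT%H:%M:%S")}))

def media_params(last_run: Optional[str], page: int = 1) -> dict:
    """Build /media query parameters, adding `after` only when a checkpoint exists."""
    params = {"per_page": 100, "page": page, "media_type": "image",
              "_fields": "id,source_url,alt_text,title,post"}
    if last_run:
        params["after"] = last_run  # date filter on the upload date
    return params
```

Record the run’s start time before you begin paginating and save it only after the run finishes, so an upload that arrives mid-run is picked up next time rather than skipped.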
Step 2: Generate alt text with a vision model
The prompt does the heavy lifting. You want descriptions that are specific, under ~125 characters (the practical limit most screen readers announce comfortably), free of “image of” filler, and grounded in what’s actually visible. Passing the surrounding post title as context dramatically improves relevance, because the model can disambiguate a screenshot or chart against the article’s subject.
import anthropic, base64, httpx

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """You write image alt text for an SEO website.
Rules:
- Describe only what is visibly in the image.
- Be specific and concrete; name tools, charts, or UI if recognizable.
- 8-16 words. No "image of", "picture of", or "photo showing".
- Do not invent text you cannot read. Do not keyword-stuff.
Context (the article this image appears in): "{context}"
Return ONLY the alt text, nothing else."""

def gen_alt(image_url, context):
    resp = httpx.get(image_url, follow_redirects=True, timeout=30)
    resp.raise_for_status()
    # Use the served Content-Type instead of assuming JPEG; a PNG or WebP
    # labelled as image/jpeg can be rejected by the API.
    media_type = resp.headers.get("content-type", "image/jpeg").split(";")[0].strip()
    img = base64.standard_b64encode(resp.content).decode()
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=80,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": media_type, "data": img}},
            {"type": "text", "text": PROMPT.format(context=context)},
        ]}],
    )
    # Strip whitespace and any quotes the model wrapped around the answer.
    return msg.content[0].text.strip().strip('"')
GPT-4o works equally well here — swap the client and pass the image as a data URL. Whichever you choose, send images at a reduced resolution (1024px on the long edge is plenty) to cut latency and token cost; you’re describing content, not auditing pixels.
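For the GPT-4o route, here is a sketch of building the data URL and the user message. It assumes the Chat Completions image-input shape (an image_url content part carrying a data URL). The helper names are hypothetical, and the MIME type is guessed from magic bytes so nothing depends on the file extension.

```python
import base64

def sniff_media_type(data: bytes) -> str:
    """Guess the image MIME type from its leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized image format")

def to_data_url(data: bytes) -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{sniff_media_type(data)};base64,{encoded}"

def openai_user_message(image_bytes: bytes, prompt: str) -> dict:
    """Build one user message with a text part and an image part."""
    return {"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": to_data_url(image_bytes)}},
    ]}
```

You would then pass messages=[openai_user_message(image_bytes, prompt)] to client.chat.completions.create(model="gpt-4o", ...). Resizing to 1024px on the long edge needs an imaging library such as Pillow and is left out of this sketch.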
Step 3: The quality gate
Never write model output straight to production. A cheap rules layer catches the failure modes that erode trust: descriptions that balloon past the screen-reader limit, one-word non-answers, repeated target keywords (a spam signal), or telltale hallucination phrases. Anything that fails the gate gets logged for human review instead of published.
BANNED = ("image of", "picture of", "photo of", "this image")

def passes_gate(alt, primary_kw=None):
    """Return (ok, reason); reason names the first rule that failed."""
    words = alt.split()
    if not (3 <= len(words) <= 20):
        return False, "length"
    if len(alt) > 140:
        return False, "too long for screen readers"
    if any(b in alt.lower() for b in BANNED):
        return False, "filler phrase"
    # The stuffing check only runs when the caller supplies a target keyword.
    if primary_kw and alt.lower().count(primary_kw.lower()) > 1:
        return False, "keyword stuffing"
    return True, "ok"
This is the same principle behind the quality gate in our programmatic SEO pipeline with an indexing quality gate: automation only earns trust when a deterministic check stands between the model and the live site.
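The gate’s description mentions telltale hallucination phrases and logging failures for human review, neither of which the rules shown include. Here is a sketch of both, under stated assumptions: the marker list is hypothetical and should be tuned against your own rejected samples, and the CSV layout is just one convenient shape for a review queue.

```python
import csv
import io

# Hypothetical marker phrases; tune this list against your own rejected samples.
HEDGE_MARKERS = ("appears to", "seems to", "possibly", "i can't", "i cannot",
                 "unable to", "as an ai", "sorry")

def hallucination_flag(alt: str) -> bool:
    """True when the text hedges or refuses instead of describing the image."""
    lowered = alt.lower()
    return any(marker in lowered for marker in HEDGE_MARKERS)

def write_review_rows(rows, handle) -> None:
    """Write rejected items as CSV so a human can review them later."""
    writer = csv.DictWriter(handle, fieldnames=["id", "alt", "reason"])
    writer.writeheader()
    writer.writerows(rows)

rejected = []
samples = [(101, "Line chart of weekly impressions in Search Console"),
           (102, "The image appears to show some kind of dashboard")]
for media_id, alt in samples:
    if hallucination_flag(alt):
        rejected.append({"id": media_id, "alt": alt, "reason": "hedging phrase"})

buffer = io.StringIO()  # swap for open("review.csv", "w", newline="") in practice
write_review_rows(rejected, buffer)
```

Substring matching is deliberately crude. It will occasionally flag a legitimate description, which is acceptable when the fallback is a human look rather than a silent publish.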
Step 4: Write the alt text back
WordPress accepts a POST to the media endpoint to update alt_text. Use a form-urlencoded body and the same browser User-Agent: some managed hosts (Bluehost, SiteGround) run ModSecurity rules that return a 406 for large JSON bodies or default Python user agents.
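def write_alt(media_id, alt):
    # Form-encoded POST; WordPress treats a POST to an existing media item as an update.
    r = requests.post(f"{BASE}/media/{media_id}", headers=UA, auth=AUTH,
                      data={"alt_text": alt}, timeout=30)
    return r.status_code == 200

for m in targets:
    # The attachment title is the only context in the inventory response;
    # look up the parent post via m["post"] if you want the article title instead.
    context = m.get("title", {}).get("rendered", "")
    alt = gen_alt(m["source_url"], context)
    ok, reason = passes_gate(alt)
    if not ok:
        print(f"[skip] {m['id']}: {reason} -> {alt!r}")
    elif write_alt(m["id"], alt):
        print(f"[done] {m['id']}: {alt}")
    else:
        # Keep gate rejections and failed writebacks distinguishable in the log.
        print(f"[fail] {m['id']}: writeback failed -> {alt!r}")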
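One detail worth handling before the title reaches the prompt: the title.rendered value from the REST API is HTML, so it can contain entities such as &#8217; or inline tags. A small sketch of cleaning it; clean_context and its 200-character cap are assumptions, not part of the original script.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_context(rendered: str, max_chars: int = 200) -> str:
    """Strip tags, decode entities, collapse whitespace, and cap the length."""
    text = html.unescape(TAG_RE.sub("", rendered))
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]

print(clean_context("Google&#8217;s <em>Search</em>  Console &amp; BigQuery"))
```

In the writeback loop this would wrap the existing lookup, for example context = clean_context(m.get("title", {}).get("rendered", "")).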
Step 5: Orchestrate and schedule in n8n
Wrapping the script in n8n turns a one-off backfill into ongoing maintenance. A Schedule trigger fires nightly; an HTTP Request node hits the media endpoint to pull new uploads; a Split Out node (Item Lists in older n8n releases) splits them into individual items; the vision call runs per item (the native Anthropic/OpenAI nodes or a Code node); an IF node enforces the quality gate; and a final HTTP Request writes the alt text back. Add a Slack node on the error branch so a failed batch surfaces the same way as your other monitors, following the pattern from automating technical SEO monitoring with Python and Slack alerts. Self-hosting n8n keeps per-execution costs at zero, so the only marginal cost is the model tokens.
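If the Python script runs on its own rather than through n8n’s Slack node, the completion notice can go out via a Slack incoming webhook, which accepts a JSON body with a text field. This is a sketch: the function names are hypothetical, and the webhook URL is whatever your workspace issues.

```python
import json
import urllib.request

def build_summary(written: int, rejected: int, failed: int) -> dict:
    """Return an incoming-webhook payload summarizing one run."""
    status = "completed with errors" if failed else "completed"
    text = (f"Alt text run {status}: {written} written, "
            f"{rejected} held for review, {failed} failed writebacks.")
    return {"text": text}

def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Calling post_to_slack(webhook_url, build_summary(40, 3, 0)) at the end of a run gives you the same visibility as the n8n error branch when you are testing locally.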
Results, costs, and what to watch
On a backfill of around 1,200 legacy images, expect roughly 90% to clear the quality gate on the first pass; the rest are usually decorative images (icons, dividers, spacers) that should genuinely carry alt="" rather than a description — worth filtering by dimensions before you spend tokens on them. Token cost lands around US$0.001–0.003 per image with current Sonnet/GPT-4o pricing, so a four-figure backfill costs a couple of dollars, and the nightly maintenance run is effectively free.
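The arithmetic behind those figures, as a quick estimator you can rerun with your own library size. The per-image prices and the 500–1,000 images per hour throughput are the article’s estimates, not measured values; the function names are illustrative.

```python
def backfill_cost(images: int, low: float = 0.001, high: float = 0.003) -> tuple:
    """Return the (low, high) token cost in US dollars for a backfill."""
    return (round(images * low, 2), round(images * high, 2))

def backfill_hours(images: int, slow: int = 500, fast: int = 1000) -> tuple:
    """Return the (best, worst) run time in hours at the stated throughput."""
    return (round(images / fast, 1), round(images / slow, 1))

cost_low, cost_high = backfill_cost(1200)
best, worst = backfill_hours(1200)
print(f"${cost_low:.2f} to ${cost_high:.2f}, {best} to {worst} hours")
# prints: $1.20 to $3.60, 1.2 to 2.4 hours
```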
Two things to monitor over time. First, re-validate a random sample by hand for the first few runs — vision models occasionally over-describe text in screenshots, and tuning the “do not invent text you cannot read” instruction fixes most of it. Second, track Google Search Console’s image search impressions before and after; alt text is one input among many, but a previously alt-less library usually shows measurable image-impression lift within a few weeks of indexing.
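For the manual re-validation step, a uniform random sample avoids only ever checking the first page of results. A minimal sketch; the sample size of 25 is an arbitrary assumption and review_sample is a hypothetical helper.

```python
import random

def review_sample(written, k=25, seed=None):
    """Pick up to k written items uniformly at random for manual checking."""
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    items = list(written)
    return rng.sample(items, min(k, len(items)))
```

Feed it the list of (media ID, alt text) pairs from a run and paste the result into a spreadsheet; a fixed seed lets a second reviewer pull the identical sample.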
The broader takeaway is that on-page SEO has a long tail of “small, repetitive, judgement-light” tasks — alt text, schema, internal links — that were impractical to automate before vision and language models got cheap and reliable. Alt text is one of the cleanest wins because the input is unambiguous (a single image), the output is short, and a deterministic gate can catch nearly every failure. If you’ve already automated schema markup deployment at scale and internal linking with AI, image alt text is the obvious next module in the same maintenance system.
Frequently asked questions
Should every image get AI-generated alt text?
No. Purely decorative images — icons, background textures, spacers — should have an empty alt="" so screen readers skip them. Filter these out by dimensions or filename before sending anything to the model; describing decoration adds noise for assistive tech and wastes tokens.
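A sketch of that pre-filter, assuming you add media_details to the _fields list so each item carries its width and height. The 64px threshold and the filename pattern are hypothetical starting points, not tested defaults.

```python
import re

MIN_EDGE = 64  # assumed threshold in pixels; tune for your theme's assets
# Match decorative keywords only as whole filename tokens, so "silicon" is not "icon".
DECORATIVE_NAME = re.compile(
    r"(?:^|[-_])(?:icon|spacer|divider|sprite|bg)(?:[-_.\d]|$)", re.I)

def is_decorative(media: dict) -> bool:
    """Heuristic: tiny or thin images, or telltale filenames, are decoration."""
    details = media.get("media_details") or {}
    width, height = details.get("width"), details.get("height")
    if width and height and min(width, height) < MIN_EDGE:
        return True
    filename = media.get("source_url", "").rsplit("/", 1)[-1]
    return bool(DECORATIVE_NAME.search(filename))
```

Run it over the inventory before generation and route the matches to a list you review once, since a heuristic like this will misclassify the occasional small but meaningful image.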
Will Google penalize AI-generated alt text?
Google’s guidance targets unhelpful, keyword-stuffed alt text regardless of how it’s produced. Accurate, concise, human-readable descriptions are fine whether a person or a model wrote them. The quality gate exists specifically to keep output within those bounds — that’s the difference between automation and scaled abuse.
Claude or GPT-4o for vision alt text?
Both produce strong results for this task and pricing is comparable. Claude tends to be slightly more conservative about inventing text it can’t read, which suits the “describe only what’s visible” rule; GPT-4o is marginally faster on large batches. Test both on a sample of your own images and pick on accuracy, not benchmarks.
How do I keep new uploads covered without re-scanning everything?
Store the timestamp of your last run and, on each scheduled run, fetch only media uploaded after that point using the REST API’s after date filter (an ISO 8601 timestamp). The include parameter takes explicit IDs rather than a range, so a date checkpoint is the simpler option. This turns a one-time backfill into cheap incremental maintenance that runs in seconds.
