How I Built an llms.txt Generator That Actually Works at Scale


DEV Community·David Evdoshchenko·27 days ago


#stage #llms #typescript #fullscreen #const #pages


This is the technical companion to my post "I Built an llms.txt Generator, Showed It to the Creator of the Standard, and Had to Rewrite Everything". The human side is there; here's just the engineering.

The goal: automatically generate a proper llms.txt hierarchy for any website. Not a flat index of summaries, but a structured set of MD files where semantically related pages are merged into coherent documents. Here's how each layer works and what broke along the way.

The Architecture

    Sitemap → Crawler → Embedder → Clusterer → Summarizer → llms.txt + MD files

Five stages. Each runs at a different speed. Each has its own failure modes.

Stage 1: Crawling

Standard crawling with content extraction. The output per page: path, title, clean text. Pages that fail to crawl are tracked but don't stop the pipeline; a missing page just doesn't contribute to its cluster.

Stage 2: Embeddings + Caching

Each page gets converted to a vector using Gemini's gemini-embedding-001 model.…
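To make the Stage 1 failure handling concrete, here is a minimal sketch of a crawl loop that records failed pages without aborting the run. It is not the article's actual code: the names `Page`, `CrawlResult`, `FetchPage`, and `crawlAll` are hypothetical, and the fetch-and-extract step is assumed to be supplied by the caller. It also crawls sequentially for simplicity, which a real crawler probably would not.

```typescript
// Hypothetical per-page output, mirroring the "path, title, clean text" description.
interface Page {
  path: string;
  title: string;
  text: string;
}

// Hypothetical result shape: successful pages plus a record of failures.
interface CrawlResult {
  pages: Page[];
  failed: { path: string; reason: string }[];
}

// Assumed caller-supplied function that fetches a page and extracts its content.
type FetchPage = (path: string) => Promise<Page>;

async function crawlAll(paths: string[], fetchPage: FetchPage): Promise<CrawlResult> {
  const result: CrawlResult = { pages: [], failed: [] };
  for (const path of paths) {
    try {
      result.pages.push(await fetchPage(path));
    } catch (err) {
      // Track the failure and keep going; the page simply won't reach later stages.
      const reason = err instanceof Error ? err.message : String(err);
      result.failed.push({ path, reason });
    }
  }
  return result;
}
```

Downstream stages would then consume only `result.pages`, while `result.failed` remains available for reporting or retries.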
