This is the technical companion to my post "I Built an llms.txt Generator, Showed It to the Creator of the Standard, and Had to Rewrite Everything". The human side is there; here's just the engineering.

The goal: automatically generate a proper llms.txt hierarchy for any website. Not a flat index of summaries, but a structured set of MD files where semantically related pages are merged into coherent documents. Here's how each layer works and what broke along the way.

The Architecture

Sitemap → Crawler → Embedder → Clusterer → Summarizer → llms.txt + MD files

Five stages. Each runs at a different speed. Each has its own failure modes.

Stage 1: Crawling

Standard crawling with content extraction. The output per page: path, title, clean text. Pages that fail to crawl are tracked but don't stop the pipeline: a missing page just doesn't contribute to its cluster.

Stage 2: Embeddings + Caching

Each page gets converted to a vector using Gemini's gemini-embedding-001 model.…
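To make the handoff between the first two stages concrete, here is a minimal sketch, not the generator's actual code. It assumes a hypothetical `fetch` callable that returns one page or raises, and a hypothetical `embed` callable standing in for the call to gemini-embedding-001. The cache key (a SHA-256 hash of the clean text) and the names `Page`, `crawl_all`, and `embed_pages` are illustrative assumptions as well.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    path: str
    title: str
    text: str  # clean, extracted text


def crawl_all(
    paths: list[str], fetch: Callable[[str], Page]
) -> tuple[list[Page], list[tuple[str, str]]]:
    """Crawl every path; record failures instead of aborting the run."""
    pages: list[Page] = []
    failed: list[tuple[str, str]] = []
    for path in paths:
        try:
            pages.append(fetch(path))
        except Exception as exc:  # any crawl error is tracked, never fatal
            failed.append((path, str(exc)))
    return pages, failed


def embed_pages(
    pages: list[Page],
    embed: Callable[[str], list[float]],
    cache: dict[str, list[float]],
) -> dict[str, list[float]]:
    """Return one vector per page path, calling `embed` only on cache misses."""
    vectors: dict[str, list[float]] = {}
    for page in pages:
        # Assumption: the cache is keyed by a hash of the page's clean text,
        # so unchanged content is never embedded twice.
        key = hashlib.sha256(page.text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed(page.text)
        vectors[page.path] = cache[key]
    return vectors
```

Because failed pages never reach `embed_pages`, they simply have no vector and so cannot end up in any cluster, which matches the behavior described for Stage 1. Passing a plain dict as the cache keeps the sketch self-contained; a persistent store would slot in behind the same two operations.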