A lot of "learn English with YouTube" tools just dump every word from the captions into your face and call it a vocabulary list. The result is 80% noise — pronouns, articles, contractions, proper nouns, and the same 200 high-frequency words repeated until your brain melts. When I was building TubeVocab, the hardest engineering problem wasn't scraping subtitles or shipping the React UI. It was the linguistic plumbing between raw caption text and a card a B1 learner would actually benefit from studying . That plumbing is a 4-stage NLP pipeline I tuned over 14 days and ~3,000 manual quality reviews. Here it is end-to-end, with the spaCy snippets that actually run in prod. Stage 1: Lemmatization (one card per lemma, not per inflection) If your subtitle says "running, ran, runs, has run" in one video, learners don't need 4 cards. They need one card for run with all forms surfaced as examples. spaCy's en_core_web_sm lemmatizer does ~95% of this for free.…