The 4 NLP stages between raw YouTube subtitles and a flashcard you'd actually study


DEV Community·qcrao·23 days ago


#stage #nlp #spacy #esl #cards #fullscreen


A lot of "learn English with YouTube" tools just dump every word from the captions into your face and call it a vocabulary list. The result is 80% noise: pronouns, articles, contractions, proper nouns, and the same 200 high-frequency words repeated until your brain melts.

When I was building TubeVocab, the hardest engineering problem wasn't scraping subtitles or shipping the React UI. It was the linguistic plumbing between raw caption text and a card a B1 learner would actually benefit from studying. That plumbing is a 4-stage NLP pipeline I tuned over 14 days and ~3,000 manual quality reviews. Here it is end-to-end, with the spaCy snippets that actually run in prod.

Stage 1: Lemmatization (one card per lemma, not per inflection)

If your subtitle says "running, ran, runs, has run" in one video, learners don't need 4 cards. They need one card for run with all forms surfaced as examples. spaCy's en_core_web_sm lemmatizer does ~95% of this for free.…
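The one-card-per-lemma idea can be sketched roughly as follows. This is a minimal illustration, not the author's production snippet: the function names `group_forms_by_lemma` and `lemma_cards_from_text` are hypothetical, and the choice to lowercase forms and skip non-alphabetic tokens is an assumption. It relies only on spaCy's `spacy.load` and the token attributes `text`, `lemma_`, and `is_alpha`.

```python
from collections import defaultdict


def group_forms_by_lemma(tokens):
    """Map each lemma to the sorted surface forms observed for it.

    `tokens` is any iterable of objects exposing `text`, `lemma_` and
    `is_alpha`, which is what spaCy tokens provide.
    """
    forms = defaultdict(set)
    for tok in tokens:
        # Assumption: punctuation and numbers never become cards.
        if not tok.is_alpha:
            continue
        forms[tok.lemma_.lower()].add(tok.text.lower())
    return {lemma: sorted(surface) for lemma, surface in forms.items()}


def lemma_cards_from_text(text, model="en_core_web_sm"):
    """Hypothetical helper: run a spaCy pipeline and group by lemma."""
    # Imported lazily so the grouping logic can be used without spaCy.
    import spacy

    nlp = spacy.load(model)  # the model must be downloaded beforehand
    return group_forms_by_lemma(nlp(text))
```

Calling `lemma_cards_from_text("She was running, then ran, and now runs.")` would ideally return a single `run` entry listing all three forms. Because the small model gets only about 95% of cases right by the article's own estimate, some inflections may land under a different lemma and need a manual or rule-based pass.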
