The 4 NLP stages between raw YouTube subtitles and a flashcard you'd actually study

DEV Community·qcrao·23 days ago
#stage #nlp #spacy #esl #cards #fullscreen

A lot of "learn English with YouTube" tools just dump every word from the captions into your face and call it a vocabulary list. The result is 80% noise: pronouns, articles, contractions, proper nouns, and the same 200 high-frequency words repeated until your brain melts.

When I was building TubeVocab, the hardest engineering problem wasn't scraping subtitles or shipping the React UI. It was the linguistic plumbing between raw caption text and a card a B1 learner would actually benefit from studying. That plumbing is a 4-stage NLP pipeline I tuned over 14 days and ~3,000 manual quality reviews. Here it is end-to-end, with the spaCy snippets that actually run in prod.

Stage 1: Lemmatization (one card per lemma, not per inflection)

If your subtitle says "running, ran, runs, has run" in one video, learners don't need 4 cards. They need one card for run with all forms surfaced as examples. spaCy's en_core_web_sm lemmatizer does ~95% of this for free.…
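To make the "one card per lemma" idea concrete, here is a minimal sketch of how surface forms might be grouped under a shared lemma. It is not the snippet that runs in TubeVocab's production pipeline; the function names `group_forms_by_lemma` and `spacy_pairs` are hypothetical, and the only spaCy calls assumed are `spacy.load`, `token.is_alpha`, `token.text`, and `token.lemma_`.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def group_forms_by_lemma(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Collect each distinct surface form under its lemma, in first-seen order.

    Hypothetical helper: `pairs` is any iterable of (surface_form, lemma).
    """
    forms: dict[str, list[str]] = defaultdict(list)
    for surface, lemma in pairs:
        surface, lemma = surface.lower(), lemma.lower()
        if surface not in forms[lemma]:
            forms[lemma].append(surface)
    return dict(forms)


def spacy_pairs(text: str, model: str = "en_core_web_sm") -> Iterator[tuple[str, str]]:
    """Yield (surface_form, lemma) for alphabetic tokens in `text`.

    Assumes spaCy and the named model are installed; spaCy is imported
    lazily so the grouping helper above works without it.
    """
    import spacy

    nlp = spacy.load(model)
    for token in nlp(text):
        if token.is_alpha:  # skip punctuation, numbers, and symbols
            yield token.text, token.lemma_
```

With the model installed, `group_forms_by_lemma(spacy_pairs(caption_text))` would return one entry per lemma, each listing the inflections seen in the captions, which could then be surfaced as examples on a single card. Note that an auxiliary such as "has" in "has run" has its own lemma, so it would land in a separate group unless a later stage filters it out.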
