Originally published on Medium . Let me start with a confession: my first attempt at building a video time prediction model was a disaster. I'd spent 3 months reading papers, collecting datasets, and training models. But when I finally deployed it, the results were laughable. I was trying to use a 3D CNN to extract features from video frames, and then feed those features into an LSTM to predict the time. It sounded good on paper, but in practice, it was a mess. The model was overfitting, underfitting, and just generally not working. I tried tweaking the architecture, adjusting the hyperparameters, and even switching to a different dataset. But no matter what I did, I just couldn't seem to get it to work. And then, one day, I stumbled upon a paper about SlowFast networks, and everything changed. The Before: When Everything Technically Works But Nothing Really Does My model was technically working, in the sense that it was producing outputs and not crashing.…