Hi all,

I’ve just finished CS231n and I’m now looking to learn how to build and train multimodal video-audio models. I’m trying to understand the practical side of:

- video/audio processing models
- alignment and fusion strategies

Are there any good resources, tutorials, or project-based guides to get started with this (preferably implementation-focused rather than theory-heavy)?

submitted by /u/DecafToGo