I don't know exactly how Grok works, but I'd bet that when it's analyzing a "video" it's actually looking at a grid of still images sampled from the video. So it's skipping most of the frames, and there's no motion data or anything. Someone running the wrong way across a track is exactly the kind of "thing that looks like the training set but not quite" that will screw up an AI.
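To make that guess concrete, here's a rough sketch of what "sampling a grid of frames" could look like. This is purely an assumption about the general approach, not anything known about Grok's actual pipeline; the function names and the frame budget are made up for illustration.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    Hypothetical helper: illustrates uniform sampling only, not any
    real model's preprocessing.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the clip has.
    num_samples = min(num_samples, total_frames)
    # Split the clip into equal segments and take each segment's midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def skipped_fraction(total_frames: int, num_samples: int) -> float:
    """Fraction of frames the model never sees under this scheme."""
    if total_frames <= 0:
        return 0.0
    kept = len(sample_frame_indices(total_frames, num_samples))
    return 1 - kept / total_frames


if __name__ == "__main__":
    # Assumed numbers: a 10-second clip at 30 fps, 16 frames in the grid.
    frames = 10 * 30
    print(sample_frame_indices(frames, 16))
    print(f"skipped: {skipped_fraction(frames, 16):.0%}")
```

With those assumed numbers, 16 frames out of 300 means about 95% of the clip is never looked at, and the frames that survive are roughly 0.6 seconds apart, so direction of travel has to be inferred from a handful of stills.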
Since video generation is really taking off, I'd expect full video analysis to be part of the next generation of multi-modal LLMs. Give it about a year.