Title: Towards Dynamic Key Frames Based Temporal Segmentation for Activity Recognition
Conference: ACIIDS 2026
Tags: Activity Recognition, Dynamic Key Frames, Video Temporal Segmentation, Video Understanding, Vision Transformer

Abstract: Human activity recognition attracts many researchers in computer vision because of its numerous applications, such as health care systems, sports video analysis, and human-computer interaction. Deep learning has become dominant in activity recognition, but it requires substantial computational resources and processing time. This work tackles these issues by introducing a key frame extraction method that captures representative frames of an activity and discards redundant video frames. Kernel Temporal Segmentation is applied to divide a video into meaningful segments without training data. The dynamic key frames of each video are then the middle frames of these segments, and the activity is represented by these key frames instead of the video as a whole. A Vision Transformer extracts visual features from the key frames for activity representation. In addition, RAFT is used to extract optical flow for motion features around each key frame. Finally, an efficient machine learning method recognizes the performed action from the key frame representation. Our experiments are conducted on four human activity datasets (UCF11, UCF50, UCF101, and HMDB51) to demonstrate the generalizability, robustness, and scalability of our method. The best recognition rates are 98.32%, 95.53%, 93.72%, and 89.49%, respectively, showing that the proposed method is accurate, stable, and scales to large datasets.
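The key frame selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Kernel Temporal Segmentation has already produced a list of segment end indices (change points), and simply picks the middle frame index of each resulting segment.

```python
def middle_key_frames(boundaries):
    """Return the middle frame index of each temporal segment.

    `boundaries` is assumed to be the output of a temporal segmentation
    step (e.g. KTS): the end index of each segment, in increasing order,
    with the last entry equal to the total number of frames.
    """
    key_frames, start = [], 0
    for end in boundaries:
        # Middle frame of the segment [start, end)
        key_frames.append((start + end) // 2)
        start = end
    return key_frames


# Example: a 120-frame video segmented at frames 30 and 75
print(middle_key_frames([30, 75, 120]))  # [15, 52, 97]
```

Each returned index would then be used to sample the corresponding frame for ViT feature extraction, with a few neighboring frames taken around it for RAFT optical flow.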
