Publication Type : Conference Proceedings
Publisher : IEEE
Source : World Conference on Communication & Computing (WCONF)
Url : https://ieeexplore.ieee.org/abstract/document/10692267
Campus : Bengaluru
School : School of Engineering
Year : 2024
Abstract : This paper introduces a multi-modal automatic video segmentation strategy that combines audio transcripts with OCR output from video frames. The audio is first segmented into smaller chunks based on silence duration, and each chunk is transcribed using Whisper ASR. Textual content is also extracted from the video frames using Tesseract OCR. The audio transcript and the OCR output are then embedded using a sentence transformer, and the resulting embeddings are clustered using a hierarchical agglomerative clustering approach. To extract the relevant subtopic in each cluster, the KeyBERT model is employed. The proposed architecture was tested on the publicly available LPM dataset, with NMI, IoU, MoF and F1 score used for evaluation. The proposed method performed relatively better on long-duration videos, with average MoF, IoU and F1 scores of 0.78, 0.72 and 0.54 respectively.
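The abstract's pipeline clusters segment-level embeddings with hierarchical agglomerative clustering. A minimal, self-contained sketch of that clustering step is shown below, using average-linkage merging over cosine distance on toy vectors; the function names, linkage choice, and sample vectors are illustrative assumptions, not the paper's actual implementation (which operates on sentence-transformer embeddings of transcript and OCR text).

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def agglomerative(vectors, n_clusters):
    # Start with each segment embedding in its own cluster,
    # then greedily merge the closest pair until n_clusters remain.
    clusters = [[i] for i in range(len(vectors))]

    def avg_linkage(c1, c2):
        # Average pairwise cosine distance between cluster members
        dists = [cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg_linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy stand-ins for segment embeddings: two near-duplicate pairs
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = agglomerative(segments, n_clusters=2)
```

In the paper's setting, each cluster of segments would then be passed to KeyBERT to extract a representative subtopic label; a library implementation such as scikit-learn's `AgglomerativeClustering` would typically replace this hand-rolled loop in practice.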
Cite this Research Publication : M Vasuki, M Arun Gangadharan, Jibin Thomas Daniel, Arjun Sadashiv, Vivek Venugopal, Susmitha Vekkot, "Multi-Modal Automatic Video Segmentation with Sentence Transformer Embeddings and KeyBERT-Based Subtopic Extraction," 2024 2nd World Conference on Communication & Computing (WCONF).