Enhancing Bangla video comprehension through multimodal feature integration and attention-based encoder-decoder captioning models for single-action videos

This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.

Bibliographic Details
Main Authors:	Das, Saurav; Biswas, Shammo; Fahim, Taimoor; Sanjan, M.A.B. Siddique; Tarannum, Tasnia Alam
Other Authors:	Alam, Md. Ashraful
Format:	Thesis
Language:	English
Published:	Brac University 2024
Subjects:	Video captioning; Bangla language; Video processing; Natural language processing; Feature fusion; Encoder-decoder framework; Multimodal fusion; GRU-Gaussian attention model; CIDEr score; Natural language processing (Computer science); Neural networks (Computer science)
Online Access:	http://hdl.handle.net/10361/24342

id	10361-24342
record_format	dspace
spelling	10361-24342 2024-10-17T21:05:16Z Enhancing Bangla video comprehension through multimodal feature integration and attention-based encoder-decoder captioning models for single-action videos Das, Saurav Biswas, Shammo Fahim, Taimoor Sanjan, M.A.B. Siddique Tarannum, Tasnia Alam Alam, Md. Ashraful Alam, Md. Golam Rabiul Department of Computer Science and Engineering, Brac University Video captioning Bangla language Video processing Natural language processing Feature fusion Encoder-decoder framework Multimodal fusion GRU-Gaussian attention model CIDEr score Natural language processing (Computer science). Neural networks (Computer science). This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024. Cataloged from PDF version of thesis. Includes bibliographical references (pages 52-55). Video understanding and description play an important role in computer vision and natural language processing. The ability to automatically generate natural language descriptions of video content has many real-world applications, ranging from accessibility tools to multimedia retrieval systems. Understanding and describing video content in natural language is challenging in general, and even more so for resource-constrained languages like Bangla. This study investigates the integration of a feature fusion method with an attention-based encoder-decoder framework to improve video comprehension and generate accurate Bangla captions for single-action video clips. We propose a novel multimodal fusion model that combines visual features from video frames with motion information derived from optical flow. The fused multimodal representations are then fed into an attention-based encoder-decoder architecture that generates descriptive captions in Bangla.
To facilitate our research, we collected and annotated a new dataset of single-action videos sourced from various online platforms. Extensive experiments are conducted on this newly created Bangla single-action video dataset, with the models evaluated using the standard metrics BLEU, METEOR, and CIDEr. Among the models tested, including architectural variations, the GRU-Gaussian Attention model achieves the best performance, generating captions closest to the ground truth. As this is a new dataset with no previous benchmarks, the proposed approach establishes a strong baseline for Bangla video captioning, achieving a BLEU score of 0.53 and a CIDEr score of 0.492. Additionally, we analyze the attention mechanisms to interpret the learned representations, providing insights into the model’s behavior and decision-making process. This work on developing solutions for under-resourced languages paves the way for enhanced video comprehension, with potential applications in human-computer interaction, accessibility, and multimedia retrieval. Saurav Das Shammo Biswas Taimoor Fahim M.A.B. Siddique Sanjan Tasnia Alam Tarannum B.Sc. in Computer Science 2024-10-17T05:33:21Z 2024-10-17T05:33:21Z ©2024 2024-05 Thesis ID 20101100 ID 20101359 ID 23241093 ID 19201068 ID 20301179 http://hdl.handle.net/10361/24342 en Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. 65 pages application/pdf Brac University
institution	Brac University
collection	Institutional Repository
language	English
topic	Video captioning Bangla language Video processing Natural language processing Feature fusion Encoder-decoder framework Multimodal fusion GRU-Gaussian attention model CIDEr score Natural language processing (Computer science). Neural networks (Computer science).
description	This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.
author2	Alam, Md. Ashraful
author_facet	Alam, Md. Ashraful Das, Saurav Biswas, Shammo Fahim, Taimoor Sanjan, M.A.B. Siddique Tarannum, Tasnia Alam
format	Thesis
author	Das, Saurav Biswas, Shammo Fahim, Taimoor Sanjan, M.A.B. Siddique Tarannum, Tasnia Alam
author_sort	Das, Saurav
title	Enhancing Bangla video comprehension through multimodal feature integration and attention-based encoder-decoder captioning models for single-action videos
publisher	Brac University
publishDate	2024
url	http://hdl.handle.net/10361/24342
