Towards Santali linguistic inclusion: building the first Santali-to-English translation model using mT5 transformer and data augmentation

This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2023.

Bibliographic details
Main Authors: Billah, Syed Mohammed Mostaque; Subarna, Ateya Ahmed; Sarna, Sudipta Nandi; Wasit, Ahmad Shawkat; Shawkat, Ahmad
Other Authors: Sadeque, Farig Yousuf
Format: Thesis
Language: English
Published: Brac University 2024
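Subjects: Parallel corpus; Machine translation; Neural Machine Translation; Low resource language; Aligner; Computer linguistics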
Online Access: http://hdl.handle.net/10361/23605
id 10361-23605
record_format dspace
spelling 10361-23605 2024-06-26T21:02:41Z
Towards Santali linguistic inclusion: building the first Santali-to-English translation model using mT5 transformer and data augmentation
Billah, Syed Mohammed Mostaque; Subarna, Ateya Ahmed; Sarna, Sudipta Nandi; Wasit, Ahmad Shawkat; Shawkat, Ahmad; Sadeque, Farig Yousuf
Department of Computer Science and Engineering, Brac University
Parallel corpus; Machine translation; Neural Machine Translation; Low resource language; Aligner; Computer linguistics
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2023. Cataloged from PDF version of thesis. Includes bibliographical references (pages 38-39).
Around seven million people in India, Bangladesh, Bhutan, and Nepal speak Santali, making it roughly the third most widely spoken Austroasiatic language. Despite its prominence within the Munda subfamily of the Austroasiatic language family, Santali lacks global recognition. Currently, no translation models exist for the Santali language. This paper aims to remove Santali from the NLP spectrum. We examine the feasibility of building Santali-English translation models from the available Santali corpora. This paper addresses the low-resource problem and, with promising results, examines the possibility of using the Santali language. We believe that our study will open the door to further exploration of Santali-English machine translation.
Syed Mohammed Mostaque Billah; Ateya Ahmed Subarna; Sudipta Nandi Sarna; Ahmad Shawkat Wasit; Anika Fariha Chowdhury
B.Sc in Computer Science
2024-06-26T07:24:09Z 2024-06-26T07:24:09Z ©2023 2023-09
Thesis
ID 20101057; ID 23341089; ID 20101257; ID 20101398; ID 20101042
http://hdl.handle.net/10361/23605
en
Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
45 pages application/pdf Brac University
institution Brac University
collection Institutional Repository
language English
topic Parallel corpus
Machine translation
Neural Machine Translation
Low resource language
Aligner
Computer linguistics
description This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2023.
author2 Sadeque, Farig Yousuf
author_facet Sadeque, Farig Yousuf
Billah, Syed Mohammed Mostaque
Subarna, Ateya Ahmed
Sarna, Sudipta Nandi
Wasit, Ahmad Shawkat
Shawkat, Ahmad
format Thesis
author Billah, Syed Mohammed Mostaque
Subarna, Ateya Ahmed
Sarna, Sudipta Nandi
Wasit, Ahmad Shawkat
Shawkat, Ahmad
author_sort Billah, Syed Mohammed Mostaque
title Towards Santali linguistic inclusion: building the first Santali-to-English translation model using mT5 transformer and data augmentation
publisher Brac University
publishDate 2024
url http://hdl.handle.net/10361/23605