A proposed automated extraction procedure of Bangla text for corpus creation in unicode

Includes bibliographical references (page 5).

Bibliographic record details
Main authors: Pavel, Dewan Shahriar Hossain; Sarkar, Asif Iqbal; Khan, Mumit
Other authors: Center for Research on Bangla Language Processing (CRBLP), BRAC University
Format: Article
Language: English
Published: BRAC University 2010
Available online: http://hdl.handle.net/10361/672
Abstract: This paper addresses automated Bangla corpus creation, which will significantly help lexicon development, morphological analysis, automatic part-of-speech detection, automatic grammar extraction, and machine translation. The plan is to collect all free Bangla documents on the World Wide Web, together with available offline documents, and extract all the words in them to build a large repository of text. This body of text, or corpus, will be used for several Bangla language processing purposes after it is converted to Unicode text. The conversion process is itself an associated and equally important research and development issue. Among several possible procedures, our research focuses on a combination of font and language detection and Unicode conversion of retrieved Bangla text as a solution for automatic Bangla corpus creation; the paper describes the methodology.
Keywords: Corpus; TTF (true type font); OTF (open type font); Unicode; Converter; Crawler; Search engine; N-gram
Dates listed in the record: 2006; 2010-12-08T04:07:37Z; 2019-09-29T05:27:43Z
File format: application/pdf
Institution: BRAC University
Collection: Institutional Repository