A proposed automated extraction procedure of Bangla text for corpus creation in unicode
Includes bibliographical references (page 5).
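Main Authors: | Pavel, Dewan Shahriar Hossain; Sarkar, Asif Iqbal; Khan, Mumit |
---|---|
Other Authors: | Center for Research on Bangla Language Processing (CRBLP), BRAC University |
Format: | Article |
Language: | English |
Published: | BRAC University, 2010 |
Subjects: | Corpus TTF (true type font) OTF (open type font) Unicode Converter Crawler Search engine N-gram |
Online Access: | http://hdl.handle.net/10361/672 |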
id |
10361-672 |
---|---|
record_format |
dspace |
spelling |
10361-672 2019-09-29T05:27:43Z A proposed automated extraction procedure of Bangla text for corpus creation in unicode Pavel, Dewan Shahriar Hossain Sarkar, Asif Iqbal Khan, Mumit Center for Research on Bangla Language Processing (CRBLP), BRAC University Corpus TTF (true type font) OTF (open type font) Unicode Converter Crawler Search engine N-gram Includes bibliographical references (page 5). This paper addresses automated Bangla corpus creation, which will significantly help lexicon development, morphological analysis, automatic part-of-speech detection, automatic grammar extraction, and machine translation. The plan is to collect all free Bangla documents on the World Wide Web, along with available offline documents, and extract all the words in them to build a large repository of text. This body of text, or corpus, will be used for several Bangla language processing purposes after it is converted to Unicode text. The conversion process is itself an associated and equally important research and development issue. Among several possible procedures, our research focuses on a combination of font and language detection and Unicode conversion of the retrieved Bangla text as a solution for automatic Bangla corpus creation; the paper describes the methodology. Dewan Shahriar Hossain Pavel Asif Iqbal Sarkar Mumit Khan 2010-12-08T04:07:37Z 2010-12-08T04:07:37Z 2006 2006 Article http://hdl.handle.net/10361/672 en application/pdf BRAC University |
institution |
Brac University |
collection |
Institutional Repository |
language |
English |
topic |
Corpus TTF (true type font) OTF (open type font) Unicode Converter Crawler Search engine N-gram |
description |
Includes bibliographical references (page 5). |
author2 |
Center for Research on Bangla Language Processing (CRBLP), BRAC University |
author_facet |
Center for Research on Bangla Language Processing (CRBLP), BRAC University Pavel, Dewan Shahriar Hossain Sarkar, Asif Iqbal Khan, Mumit |
format |
Article |
author |
Pavel, Dewan Shahriar Hossain Sarkar, Asif Iqbal Khan, Mumit |
author_sort |
Pavel, Dewan Shahriar Hossain |
title |
A proposed automated extraction procedure of Bangla text for corpus creation in unicode |
title_sort |
proposed automated extraction procedure of bangla text for corpus creation in unicode |
publisher |
BRAC University |
publishDate |
2010 |
url |
http://hdl.handle.net/10361/672 |
_version_ |
1814307241785819136 |