Similarity search for Bangla

This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2011.

Bibliographic Details
Main Authors:	Morshed, Mahbub; Shahed, Md. Shahid
Other Authors:	Abdullah, Matin Saad
Format:	Thesis
Language:	English
Published:	BRAC University 2011
Subjects:	Computer science and engineering
Online Access:	http://hdl.handle.net/10361/1518

id	10361-1518
record_format	dspace
spelling	10361-1518 2022-01-26T10:04:47Z Similarity search for Bangla Morshed, Mahbub Shahed, Md. Shahid Abdullah, Matin Saad Department of Computer Science and Engineering, BRAC University Computer science and engineering This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2011. Cataloged from PDF version of thesis report. Includes bibliographical references (page 29). Because of typos and misspellings, search engines often cannot give users the information they are looking for. Large search engines such as Google provide a "Did you mean" suggestion, but most open-source search engines do not include such an option. Our goal was to find an efficient way to implement an exhaustive similarity search and to develop such an option for a Bangla search engine. We used Solr for this, configuring it with the Levenshtein distance and Jaro-Winkler algorithms to provide "Did you mean" for English. To implement this for Bangla, however, we needed a Bangla stemmer, which was not present in Solr. To build an efficient stemmer, tokens must be tagged properly according to their parts of speech, because the stemming process differs between parts of speech. There are different approaches to the problem of assigning a part-of-speech (POS) tag to each word of a natural language sentence. We used NLTK to develop a regular expression tagger for Bangla verbs using the common suffixes ( 1 i r ) found in Bangla grammar. We then analyzed its performance on main verbs extracted from a 100K-token tagged corpus. In this thesis we also compare the performance of a few POS tagging techniques for Bangla, e.g. the statistical approach (n-gram) and the transformation-based approach (Brill's tagger). A supervised POS tagging approach requires a large annotated training corpus to tag properly.
We used the 100K-token hand-tagged corpus developed by Microsoft India to implement these techniques. Mahbub Morshed Md. Shahid Shahed B. Computer Science and Engineering 2011-12-07T10:55:26Z 2011-12-07T10:55:26Z 2011 2011-04 Thesis ID 09201023 ID 07101007 http://hdl.handle.net/10361/1518 en BRAC University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. 31 pages application/pdf BRAC University
institution	Brac University
collection	Institutional Repository
language	English
topic	Computer science and engineering
spellingShingle	Computer science and engineering Morshed, Mahbub Shahed, Md. Shahid Similarity search for Bangla
description	This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2011.
author2	Abdullah, Matin Saad
author_facet	Abdullah, Matin Saad Morshed, Mahbub Shahed, Md. Shahid
format	Thesis
author	Morshed, Mahbub Shahed, Md. Shahid
author_sort	Morshed, Mahbub
title	Similarity search for Bangla
title_short	Similarity search for Bangla
title_full	Similarity search for Bangla
title_fullStr	Similarity search for Bangla
title_full_unstemmed	Similarity search for Bangla
title_sort	similarity search for bangla
publisher	BRAC University
publishDate	2011
url	http://hdl.handle.net/10361/1518
work_keys_str_mv	AT morshedmahbub similaritysearchforbangla AT shahedmdshahid similaritysearchforbangla
_version_	1814306833496539136
