Detecting document similarity in large document collecting using MapReduce and the Hadoop framework

This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2012.

Bibliographic details
Authors:	Momtaz, Anik, Amreen, Sadika
Other authors:	Khan, Mumit
Format:	Thesis
Language:	English
Published:	BRAC University 2013
Subjects:	Computer science and engineering
Online access:	http://hdl.handle.net/10361/2379

id	10361-2379
record_format	dspace
abstract	The need to process data is becoming ever more challenging because of the exponential growth of the data itself. We are talking about exabytes, zettabytes and even yottabytes of data, generally referred to as Big Data. Conventional data processing methods have therefore become obsolete for Big Data: it is simply not feasible to analyze data of such volume on a single machine. This is where Hadoop comes in. Simply put, using the Hadoop Distributed File System (HDFS), an enormous chunk of data can be divided into smaller pieces and distributed among multiple machines, referred to as nodes, which process the pieces in parallel using a technique called MapReduce. The potential of such a concept is limitless. For our thesis, we used the HDFS to identify similarities between multiple documents. The initial idea was to build an algorithm to detect full or partial plagiarism in documents, as countless materials of interest are readily available on the internet. However, after successfully implementing an algorithm for the English language, we realized that there is no record of any work on document similarity detection for the Bangla language. Therefore, with some modifications to our existing algorithm (as the Bangla language is completely different from the English language as far as construction is concerned), we were able to develop an algorithm to detect document similarities on a broad scale using the Ferret model.
notes	Cataloged from PDF version of thesis report. Includes bibliographical references (page 46). Department of Computer Science and Engineering, BRAC University. Anik Momtaz, Sadika Amreen. B. Computer Science and Engineering. Thesis. ID 08201002, ID 09101003. 54 pages, application/pdf. Dates: 2012; 2012-12; 2013-04-30T17:29:45Z; 2022-01-26T10:04:54Z.
institution	Brac University
collection	Institutional Repository
language	English
topic	Computer science and engineering
description	This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2012.
author2	Khan, Mumit
format	Thesis
author	Momtaz, Anik Amreen, Sadika
title	Detecting document similarity in large document collecting using MapReduce and the Hadoop framework
publisher	BRAC University
publishDate	2013
url	http://hdl.handle.net/10361/2379
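
The abstract in this record names MapReduce and the Ferret model but gives no algorithmic detail. As a rough illustration only, the sketch below assumes the commonly described form of Ferret-style comparison: each document is reduced to its set of word trigrams, and two documents are scored by the number of shared trigrams divided by the number of distinct trigrams in either. The map and reduce steps are simulated in plain Python rather than run on Hadoop, and the function names (`trigrams`, `map_phase`, `reduce_phase`, `resemblance`) and the whitespace tokenization are assumptions of this sketch, not the thesis authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def trigrams(text):
    """Return the set of word trigrams in a text (whitespace tokenization is an assumption)."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def map_phase(doc_id, text):
    """Map step: emit one (trigram, doc_id) pair per distinct trigram in the document."""
    for tri in trigrams(text):
        yield tri, doc_id


def reduce_phase(pairs):
    """Reduce step: group document ids by trigram, then count shared trigrams per document pair."""
    postings = defaultdict(set)
    for tri, doc_id in pairs:
        postings[tri].add(doc_id)
    shared = defaultdict(int)
    for ids in postings.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return shared


def resemblance(docs):
    """Score each document pair that shares a trigram: shared trigrams / distinct trigrams in either."""
    sizes = {doc_id: len(trigrams(text)) for doc_id, text in docs.items()}
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    shared = reduce_phase(pairs)
    return {
        (a, b): n / (sizes[a] + sizes[b] - n)
        for (a, b), n in shared.items()
    }


if __name__ == "__main__":
    sample = {
        "a": "the quick brown fox jumps",
        "b": "the quick brown fox sleeps",
    }
    print(resemblance(sample))
```

Pairs with no trigram in common never appear in the result, which is what keeps the pairwise comparison manageable on a large collection: only documents that co-occur under at least one trigram key are compared. Handling Bangla text would need a tokenizer suited to that language, which this sketch does not attempt.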

