Clustering web pages based on doc type structure in a distributed manner

This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2015.

Bibliographic Details
Main Authors: Kader, Kazi Samiul; Nawar, Sagufa; Ananna, Nusrat Sharmin; Khan, Sarah
Other Authors: Mostakim, Moin
Format: Thesis
Language: English
Published: BRAC University 2015
Subjects: Computer science and engineering; Web page clustering
Online Access: http://hdl.handle.net/10361/4381
id 10361-4381
record_format dspace
Record timestamp: 2022-01-26T10:10:24Z
Department: Department of Computer Science and Engineering, BRAC University
Degree: B. Computer Science and Engineering
Notes: Cataloged from PDF version of thesis report. Includes bibliographical references (pages 56-58).
Abstract: Web page clustering is an important part of modern web technology. By grouping similar web pages together, we can find related information and suggest similar choices. All modern search engines depend on web page clustering. The topic is interesting because it presents both a novel academic challenge and a practical application. In this thesis, we clustered web pages using their HTML tag structure. We represented each web page as a vector of tag percentages and clustered the vectors using the k-means and DBSCAN clustering algorithms. We selected k-means and DBSCAN because they are well-known clustering algorithms and because they had not been applied together and compared in the field of web page clustering as we did in this thesis. After clustering three different categories of five websites in three stages, both algorithms achieved a minimum accuracy of over 88% compared to the original clusters. We used the Weka data mining software because it is well tested in terms of accuracy and efficiency and is open source.
Dates: 2015-09-03T10:27:20Z; 2015
Student IDs: 11101011; 11101007; 11101031; 11201022
Rights: BRAC University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
Extent: 58 pages
File format: application/pdf
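The approach summarized in the abstract (each page represented as a vector of tag percentages, then clustered) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the thesis used Weka, and the names `TagCounter`, `tag_percentages`, and `kmeans`, the tag vocabulary, the sample pages, and the choice to seed centroids with the first k vectors are all assumptions made for illustration. DBSCAN is omitted to keep the sketch short.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_percentages(html, vocabulary):
    """Percentage of all opening tags accounted for by each vocabulary tag."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [100.0 * parser.counts[tag] / total for tag in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd-style k-means; seeding with the first k vectors is an assumption."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample pages: two text-heavy, two table-heavy.
pages = [
    "<html><body><p>a</p><p>b</p><p>c</p></body></html>",
    "<html><body><table><tr><td>1</td></tr><tr><td>2</td></tr></table></body></html>",
    "<html><body><p>x</p><p>y</p></body></html>",
    "<html><body><table><tr><td>3</td></tr></table></body></html>",
]
vocabulary = ["p", "table", "tr", "td"]  # hypothetical tag vocabulary
vectors = [tag_percentages(page, vocabulary) for page in pages]
labels = kmeans(vectors, k=2)
```

With these sample pages, the paragraph-heavy pages and the table-heavy pages end up in separate clusters, which is the kind of structural grouping the abstract describes.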
institution Brac University
collection Institutional Repository
language English
topic Computer science and engineering
Web page clustering
author2 Mostakim, Moin
format Thesis
author Kader, Kazi Samiul
Nawar, Sagufa
Ananna, Nusrat Sharmin
Khan, Sarah
title Clustering web pages based on doc type structure in a distributed manner
publisher BRAC University
publishDate 2015
url http://hdl.handle.net/10361/4381