Cleaning of web scraped data with Python

This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2019.

Bibliographic details
Main author: Tarannum, Tasnuva
Other authors: Majumdar, Mahbub Alam
Format: Thesis
Language: English
Published: 2019
Subjects:
Online access: http://hdl.handle.net/10361/12354
id 10361-12354
record_format dspace
institution Brac University
collection Institutional Repository
language English
topic Web scraped data
Python
Scripting languages (Computer science)
Data collection
author2 Majumdar, Mahbub Alam
author_facet Majumdar, Mahbub Alam
Tarannum, Tasnuva
format Thesis
author Tarannum, Tasnuva
author_sort Tarannum, Tasnuva
title Cleaning of web scraped data with Python
publishDate 2019
url http://hdl.handle.net/10361/12354
spelling 10361-12354 2022-01-26T10:20:02Z Cleaning of web scraped data with Python Tarannum, Tasnuva Majumdar, Mahbub Alam Department of Computer Science and Engineering, Brac University Web scraped data Python Scripting languages (Computer science) Data collection This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2019. Cataloged from PDF version of thesis. Includes bibliographical references (pages 24-25). Today, data plays an essential role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, studies show that many such applications fail to work effectively. High-quality data is a key to business success today. The quality of any large real-world data set depends on several factors, among which the source of the data is often the decisive one. It is now recognized that an inordinate proportion of the data in most data sources is dirty. Clearly, a database application with a high proportion of dirty data is not reliable for data mining or for deriving business intelligence, and the quality of decisions based on such business intelligence is also unreliable. To ensure high-quality data, enterprises need a process, methodologies, and resources to monitor and analyze the quality of data, as well as methodologies for preventing, detecting, and fixing dirty data. This thesis focuses on improving data quality in database applications with the help of current data cleaning methods.
It gives a systematic and comparative description of the research issues related to improving the quality of data, and addresses several research issues related to data cleaning. In the first part of the thesis, the literature on data cleaning and data quality is reviewed and discussed. Building on this review, a rule-based taxonomy of dirty data is proposed in the second part of the thesis. The proposed taxonomy summarizes the dirtiest data types and also serves as the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This in turn informs the design of the DDS process in the proposed data cleaning framework described in the third part of the thesis. The framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process. Finally, a set of approximate string matching algorithms is studied and experimental work is undertaken. Approximate string matching is an important part of many data cleaning approaches and has been studied extensively for many years. The experimental work in the thesis confirmed the claim that there is no clear best technique. It shows that the characteristics of the data, such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset, and even the type of typographical error in a string, have a significant effect on the performance of the selected techniques. The characteristics of the data also affect the choice of suitable threshold values for the selected matching algorithms.
The findings from these experimental results provide the basis for improving the design of the "algorithm selection mechanism" in the data cleaning framework, which enhances the performance of data cleaning systems in database applications. Tasnuva Tarannum B. Computer Science and Engineering 2019-07-14T07:13:43Z 2019-07-14T07:13:43Z 2019 2019-04 Thesis ID 14101133 http://hdl.handle.net/10361/12354 en Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. 25 pages application/pdf
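The abstract's remarks on approximate string matching and threshold selection can be illustrated with a minimal Python sketch. It is not taken from the thesis: the choice of Levenshtein edit distance, the function names `levenshtein`, `similarity`, and `find_matches`, and the sample strings and thresholds are all assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_matches(query: str, candidates: list[str], threshold: float) -> list[str]:
    """Return candidates whose case-insensitive similarity to query
    meets the (hypothetical) threshold."""
    q = query.lower()
    return [c for c in candidates if similarity(q, c.lower()) >= threshold]


if __name__ == "__main__":
    names = ["pyhton", "Python", "java"]  # illustrative data only
    print(find_matches("python", names, threshold=0.6))
    print(find_matches("python", names, threshold=0.8))
```

With these sample strings, lowering the threshold from 0.8 to 0.6 admits the transposed spelling "pyhton" as a match, which is one small way the suitable threshold can depend on the kind of error present in the data.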