Section 1: Introduction

Advances in computing have produced many techniques for building massive databases. One of them is web scraping, used most commonly by statisticians, data scientists, computer scientists and web developers to accumulate vast amounts of data that can then be processed and analyzed with statistical methods. As the name suggests, web scraping is a way to extract information such as specific numbers, text and tables from the World Wide Web, using software that can store and manage everything that has been downloaded.

Regardless of the browser we use, web pages rely on technologies such as HTML/XML, AJAX, and JSON to present their information. Strictly speaking, HTML and XML are markup languages, JSON is a data format, and AJAX is a technique for loading data in the background. When a person visits a web page, whether it is social media, Wikipedia or a search engine like Google or Bing, using a browser means using HTML (Munzert et al., 2014). What the browser displays differs from the underlying HTML: HTML is the source code of the page, and the browser renders that code into a user-friendly view. In particular, this article explains the features of HTML that matter for implementing a web scraping tool successfully and how they relate to Python.
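To make the difference between the HTML source and the rendered page concrete, the following is a minimal sketch that uses only Python's standard library. The HTML snippet, the tickers, the prices, and the names `SAMPLE_HTML`, `TableCellParser` and `extract_rows` are all hypothetical and chosen for illustration; this is not the code presented in Section 4.

```python
from html.parser import HTMLParser

# Hypothetical HTML source, standing in for what a browser would receive
# from a server before rendering it as a formatted table.
SAMPLE_HTML = """
<html><body>
<h1>Prices</h1>
<table>
<tr><th>Ticker</th><th>Price</th></tr>
<tr><td>AAA</td><td>10.5</td></tr>
<tr><td>BBB</td><td>20.0</td></tr>
</table>
</body></html>
"""


class TableCellParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell that appears outside any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            # Trim surrounding whitespace once the cell is complete.
            self.rows[-1][-1] = self.rows[-1][-1].strip()

    def handle_data(self, data):
        # Text outside table cells (such as the <h1> heading) is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


def extract_rows(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableCellParser()
    parser.feed(html)
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

The tags (`<table>`, `<tr>`, `<td>`) never appear on the rendered page, yet they are what a scraper relies on to locate the data. In practice the HTML would come from a downloaded page rather than a hard-coded string.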

The primary purpose of this article is to show how useful web scraping is and how statisticians can take advantage of it. At the end of Section 4, Python code is provided, together with an explanation, to give a sense of the scope of this technique.

The article is organized as follows. Section 2 explains why web scraping is useful for statisticians. Section 3 explains why web scraping can be challenging in some scenarios and what its legal consequences are. Section 4 provides Python code that illustrates a simple implementation of web scraping using a financial web page. Finally, conclusions are presented in the last section.


A Useful Tool to Collect Data: Web Scraping