A Step by Step Guide to Web Scraping in Python

As data scientists, we are always on the lookout for new data and information to analyze and manipulate. One of the main ways to find data right now is to scrape the web for a particular inquiry.

When we browse the internet, we come across a massive number of websites that display various data in the browser. If, for some reason, we want to use this data for a project or an ML algorithm, we can (but shouldn't) gather it manually by copying the sections we want and pasting them into a doc or CSV file.

Needless to say, that would be quite a tedious task. That's why most data scientists and developers go with web scraping using code. It's easier to write code that extracts data from 100 webpages than to go through them by hand.

Web Scraping is the technique used by programmers to automate the process of finding and extracting data from the internet within a relatively short time.
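As a rough illustration of what "extracting data with code" can look like, here is a minimal sketch that uses only Python's standard library. The sample HTML, the `LinkTextParser` class, and the `extract_links` function are made up for this example; a real scraper would first download the page and would likely use a dedicated parsing library.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Hypothetical helper: collects (href, text) for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []            # collected (href, text) tuples
        self._current_href = None  # href of the <a> tag we are inside, if any
        self._text_parts = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in an HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample page, standing in for HTML downloaded from a website.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <a href="/first">First article</a>
  <a href="/second">Second <b>article</b></a>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, "->", text)
```

The same function could be run over 100 pages in a loop, which is the kind of repetition that makes copying and pasting by hand impractical.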

The most important question when it comes to web scraping is: is it legal?

Is web scraping legal?

Short answer: yes.

The more detailed answer: scraping publicly available data for non-commercial purposes was declared completely legal in late January 2020.

You might wonder: what does publicly available mean?

Publicly available information is information that _anyone_ can see or find on the internet without the need for _special_ access. So, information on Wikipedia, on social media, or in Google's search results is publicly available data.

Now, social media is somewhat complicated, because parts of it are not publicly available, such as when a user sets their information to private. In that case, it is _illegal_ to scrape this information.

