Web Scraping Basics: How to scrape data from a website in Python

In data science we often say “garbage in, garbage out”. If you do not have data of good quality and in sufficient quantity, you are unlikely to get much insight out of it. Web scraping is one of the important methods for retrieving third-party data automatically. In this article, I will cover the basics of web scraping and use two examples to illustrate the two different ways to do it in Python.

What is Web Scraping

Web scraping is an automated way to retrieve unstructured data from a website and turn it into structured data for analysis. For example, if you want to analyse which kinds of face mask sell better in Singapore, you may want to scrape all the face mask information from an e-commerce website such as Lazada.

Can you scrape from all the websites?

Scraping can make a website's traffic spike and may even bring down its server. For this reason, not all websites allow people to scrape them. How do you know which websites allow it? Look at the website's ‘robots.txt’ file. Simply append /robots.txt to the root URL of the site you want to scrape (for example, google.com/robots.txt) and you will see whether the website host allows you to scrape it.

Take Google.com as an example:

robots.txt file of Google.com

You can see that Google disallows scraping for many of its paths. However, it allows certain paths such as ‘/m/finance’, so if you want to collect information on finance, this is a place that Google explicitly permits crawlers to access.

Also note the User-agent line in the first row. Here Google specifies rules that apply to all user-agents, but a website may give certain user-agents special permissions, so check whether a more specific section applies to yours.
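The check described above can also be done in code. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs and the user-agent names `my-scraper` and `friendly-bot` are made up for illustration; they are not Google's actual rules.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, loosely modelled on the kind of rules described above.
# Python's parser applies the first matching rule, so the more specific
# "Allow" line is listed before the broader "Disallow" line.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /m/finance
Disallow: /m/

User-agent: friendly-bot
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Rules for all user-agents ("*") apply to our hypothetical scraper.
print(parser.can_fetch("my-scraper", "https://example.com/m/finance"))
print(parser.can_fetch("my-scraper", "https://example.com/search"))

# A user-agent with its own section gets the special permissions listed there.
print(parser.can_fetch("friendly-bot", "https://example.com/search"))
```

Against a live site, you would instead call `set_url()` with the address of the robots.txt file and then `read()` to download it before calling `can_fetch()`.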

How does web scraping work?

A web scraper works like a bot that browses the different pages of a website and copies down all their contents. When you run the code, it sends a request to the server, and the data is contained in the response you get back. You then parse the response data and extract the parts you want.
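As a rough sketch of this request, response and parse cycle, the snippet below uses only the Python standard library. The function names, the `my-scraper/0.1` user-agent string and the sample HTML are assumptions made for illustration. `fetch_html` shows how the request would be sent but is not called here, so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str, user_agent: str = "my-scraper/0.1") -> str:
    """Send a GET request and return the response body as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Parse the response text and pull out the part we want."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# A hand-written stand-in for the body of a server response.
sample_response = (
    "<html><head><title> Face masks </title></head>"
    "<body><p>Products go here</p></body></html>"
)
print(extract_title(sample_response))
```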

How do we do web scraping?

Alright, finally we are here. There are two different approaches to web scraping, depending on how the website structures its content.

Approach 1: If a website stores all its information in the HTML front end, you can use code to download the HTML contents directly and extract the useful information.

There are roughly 5 steps, as below:

  1. Inspect the HTML of the website that you want to crawl
  2. Access the URL of the website using code and download all the HTML contents of the page
  3. Format the downloaded content into a readable format
  4. Extract the useful information and save it in a structured format
  5. For information displayed across multiple pages of the website, you may need to repeat steps 2–4 to get the complete information.
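The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code for any particular site: the HTML layout (an `li` element with class `product` and a `data-price` attribute) and the `FAKE_PAGES` dictionary are invented stand-ins, and the standard-library `html.parser` is used in place of a third-party parsing library so that the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for a real site: page number -> HTML. In practice, step 2 would
# download each page over HTTP instead of reading from this dictionary.
FAKE_PAGES = {
    1: '<ul><li class="product" data-price="3.50">Cloth mask</li>'
       '<li class="product" data-price="9.90">Surgical mask (50 pcs)</li></ul>',
    2: '<ul><li class="product" data-price="12.00">KN95 mask (10 pcs)</li></ul>',
}


class ProductParser(HTMLParser):
    """Steps 3 and 4: walk the HTML and collect one row per product."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_product = False
        self._current_price = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self._in_product = True
            self._current_price = float(attrs["data-price"])

    def handle_data(self, data):
        if self._in_product and data.strip():
            self.rows.append({"name": data.strip(), "price": self._current_price})

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_product = False


def download_page(page: int) -> str:
    """Step 2 (simulated): return the HTML of one listing page."""
    return FAKE_PAGES[page]


def scrape_all(pages) -> list:
    """Step 5: repeat steps 2 to 4 for every page and combine the rows."""
    rows = []
    for page in pages:
        parser = ProductParser()
        parser.feed(download_page(page))
        rows.extend(parser.rows)
    return rows


rows = scrape_all(sorted(FAKE_PAGES))

# Save in a structured format (CSV, written to memory here for simplicity).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because the parser depends on the tag and class names found during step 1, this is exactly the part that has to be adjusted whenever the site changes its front-end structure.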

*Pros and cons of this approach:* It is simple and direct. However, if the website's front-end structure changes, you need to adjust your code accordingly.

Approach 2: If a website stores its data behind an API and queries that API each time a user visits the site, you can simulate the request and query the data directly from the API.
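A minimal sketch of this second approach is shown below, again using only the standard library. The endpoint `https://example.com/api/products`, the query parameters `q` and `page`, and the shape of the JSON payload are all hypothetical. In practice you would find the real ones by watching the requests the site makes in your browser's developer tools. `query_api` is defined but not called, so the example runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, used only for illustration.
API_BASE = "https://example.com/api/products"


def build_api_url(keyword: str, page: int) -> str:
    """Recreate the URL the website itself would request."""
    return f"{API_BASE}?{urlencode({'q': keyword, 'page': page})}"


def query_api(url: str) -> dict:
    """Send the simulated request and decode the JSON response."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def extract_items(payload: dict) -> list:
    """Keep only the fields we care about from each returned item."""
    return [
        {"name": item["name"], "price": item["price"]}
        for item in payload.get("items", [])
    ]


# A hand-written stand-in for what such an API might return.
sample_payload = json.loads(
    '{"items": [{"name": "Cloth mask", "price": 3.5, "stock": 12}]}'
)

print(build_api_url("face mask", 1))
print(extract_items(sample_payload))
```

Since the API already returns structured data, there is no HTML to parse; the work is in finding the right request and picking out the fields you need.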
