I had a shameful secret. It is one that affects a surprising number of people in the data science community. And I was too lazy to face the problem and tackle it head-on.
I didn’t know how to scrape data.
For the majority of the time, it didn’t impact my life — I had access to datasets, or other people had developed custom scrapers / APIs for what I needed.
But every so often I would look at a website and wish that I could grab some of that sweet, original data to do some serious analysis.
Well, no more.
Relatively recently, I taught myself how to scrape websites with Python using a combination of BeautifulSoup, requests and regular expressions.
The whole process was far easier than I thought it was going to be, and as a result I am able to make my own data sets.
So here, I wanted to share my experience so you can do it yourself as well. As with my other articles, I include the entire code in my git repo here so you can follow along or adapt the code for your own purposes.
I assume you’re familiar with Python. Even if you’re relatively new to it, though, this tutorial shouldn’t be too tricky.
You’ll need BeautifulSoup, requests, and pandas. Install each (in your virtual environment) with pip install [PACKAGE_NAME]. Note that BeautifulSoup is published on PyPI as beautifulsoup4, so that is the name to give pip.
You can find my code here: https://github.com/databyjp/beginner_scraping
Once we learn how to scrape data, the skill can be applied to almost any site. But it is important to get the fundamentals right, so let’s start somewhere that is easy while still reflecting the real world.
Many of you know that I’m a sports fan — so let’s get started by scraping our numerical data, which we will get from ScrapeThisSite.com.
As the creative name suggests, this site is designed for practicing scraping. Given that the data is in tables, it is also easy to check that the data has been scraped correctly.
Before we do anything, we need the raw data. This is where the requests library comes in. Getting the data is straightforward, just taking a line of code as follows:
import requests
page = requests.get("https://scrapethissite.com/pages/forms/")
It’s that easy to get a copy of the web page. To check that the page has been loaded correctly, try:
assert page.status_code == 200
If you don’t get an error, the server responded with HTTP status 200 (OK), which means the page was downloaded successfully. How good is that? Now to the meat of the problem: getting data from our page.
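Before moving on, here is one way the download-and-check step could be wrapped into a reusable helper. This is only a sketch: the function name fetch_html and the 10-second timeout are my own choices for illustration, not something the site or the requests library requires.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    fetch_html and the default timeout are illustrative assumptions.
    """
    # A timeout stops the script from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # which is a stricter, more descriptive check than a bare assert.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://scrapethissite.com/pages/forms/") should then give you the page source as text, or raise an exception if the request fails.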
To scrape a site, we need to identify which part of the website holds the information that we are after. Although this is easy visually, it’s annoyingly difficult to do in code.
Your best friend in this task is the “inspect element” button on your browser. There are different ways to actually address the elements to be scraped, but that’s secondary. First you need to identify the data being scraped.
For instance, let’s say that I would like to scrape this page.
Our first table to scrape (https://scrapethissite.com/pages/forms/)
Before we go any further, take a look at the underlying code. Here’s a small sample of it.
Source code for the page
Given that it’s designed to be used by folks learning scraping, it’s not all that difficult to read. Still, it’s a pain correlating what you see here with what you see rendered.
Instead, highlight the relevant element on the page, right-click and choose “inspect element”. This will bring up a layout similar to the one below, although it will vary by browser.
“Inspect element” button — your new best friend
The panel that comes up shows the DOM (Document Object Model), the browser’s tree representation of the page. Without getting too technical, this lets you match each piece of code with the rendered end result.
I highly recommend scrolling through the various elements here, selecting them, and generally observing the DOM’s structure.
Explore your DOM
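You can do a rough version of the same exploration in code. The sketch below parses a small, made-up HTML snippet (not the real page) with BeautifulSoup and lists every tag with its nesting depth and CSS classes. The helper name outline and the sample markup are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <body>
    <div id="page">
      <h1>Example Teams</h1>
      <table class="table">
        <tr><th>Team Name</th><th>Wins</th></tr>
        <tr class="team"><td class="name">Example Team A</td><td class="wins">10</td></tr>
      </table>
    </div>
  </body>
</html>
"""


def outline(html):
    """Return (depth, tag name, class list) for every tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all(True):  # True matches every tag
        # tag.parents includes the top-level BeautifulSoup object, so subtract 1.
        depth = len(list(tag.parents)) - 1
        rows.append((depth, tag.name, tag.get("class", [])))
    return rows


# Print an indented outline of the document structure.
for depth, name, classes in outline(SAMPLE_HTML):
    print("  " * depth + name, classes)
```

Running something like this on a real page gets noisy quickly, which is why the browser inspector is usually the nicer tool for the first look.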
More concretely, let’s see what we would do to scrape the table showing the conference standings as below.
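As a rough preview of where this is heading, here is a minimal sketch of pulling rows out of a table once you know which tags and classes hold the data. The HTML, the class names (team, name, year, wins) and the values are all hypothetical stand-ins; check the real ones with “inspect element” before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; class names and values are made up for illustration.
SAMPLE_HTML = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr class="team">
    <td class="name"> Example Team A </td>
    <td class="year"> 2001 </td>
    <td class="wins"> 10 </td>
  </tr>
  <tr class="team">
    <td class="name"> Example Team B </td>
    <td class="year"> 2001 </td>
    <td class="wins"> 7 </td>
  </tr>
</table>
"""


def parse_teams(html):
    """Return one dict per <tr class="team"> row in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    teams = []
    # class_ (with the underscore) filters on the CSS class attribute.
    for row in soup.find_all("tr", class_="team"):
        teams.append({
            # get_text(strip=True) drops the surrounding whitespace.
            "name": row.find("td", class_="name").get_text(strip=True),
            "year": int(row.find("td", class_="year").get_text(strip=True)),
            "wins": int(row.find("td", class_="wins").get_text(strip=True)),
        })
    return teams


teams = parse_teams(SAMPLE_HTML)
```

A list of dicts like this can be handed straight to pandas.DataFrame(teams) if you want to carry on in pandas.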