Got data? How I taught myself how to scrape websites in a few hours (and you can, too)

I had a shameful secret. It is one that affects a surprising number of people in the data science community. And I was too lazy to face the problem and tackle it head-on.

I didn’t know how to scrape data.

Most of the time, it didn’t affect my life: I had access to datasets, or other people had developed custom scrapers or APIs for what I needed.

But every so often I would look at a website and wish that I could grab some of that sweet, original data to do some serious analysis.

Well, no more.

Relatively recently, I taught myself how to scrape websites with Python using a combination of BeautifulSoup, requests and regular expressions.

The whole process was far easier than I thought it was going to be, and as a result I am able to make my own data sets.

So I wanted to share my experience so that you can do it yourself as well. As with my other articles, the entire code is in my git repo (linked below), so you can follow along or adapt it for your own purposes.

Before we get started

Packages

I assume you’re familiar with Python. Even if you’re relatively new, though, this tutorial shouldn’t be too tricky.

You’ll need BeautifulSoup, requests, and pandas. Install each (in your virtual environment) with pip install [PACKAGE_NAME]. Note that BeautifulSoup is published on PyPI as beautifulsoup4 and imported as bs4.

You can find my code here: https://github.com/databyjp/beginner_scraping

Let’s make a dataset

Once we learn how to scrape data, the skill can be applied to almost any site. But it is important to get the fundamentals right, so let’s start somewhere that is easy while still reflecting the real world.

Many of you know that I’m a sports fan, so let’s get started by scraping some numerical data from ScrapeThisSite.com.

As the creative name suggests, this site is designed for practicing scraping. Given that the data is in tables, it is also easy to check that the data has been scraped correctly.

Get the raw data

Before we do anything, we need the raw data. This is where the requests library comes in. Getting the data is straightforward and takes just one line of code after the import:

import requests
page = requests.get("https://scrapethissite.com/pages/forms/")

It’s that easy to get a copy of the web page. To check that the page has been loaded correctly, try:

assert page.status_code == 200

If you don’t get an error, the server responded with status code 200 (OK), which means the page was downloaded successfully. How good is that? Now to the meat of the problem: getting data from our page.
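If you want something a little more defensive than a bare assert, here is a minimal sketch of the same download step wrapped in a function. The name fetch_html and the 10-second timeout are my own assumptions for illustration and are not part of the original code; requests.get, its timeout argument and raise_for_status are standard requests calls.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text (hypothetical helper)."""
    # The timeout stops the call from hanging forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://scrapethissite.com/pages/forms/") should then give you the page source as a string, or raise an exception if the server returns an error code.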

Getting into your element

To scrape a site, we need to identify which part of the website holds the information that we are after. Although this is easy visually, it’s annoyingly difficult to do in code.

Your best friend in this task is the “inspect element” button in your browser. There are different ways to actually address the elements to be scraped, but that’s secondary. First you need to identify the data being scraped.

For instance, let’s say that I would like to scrape this page.

Our first table to scrape (https://scrapethissite.com/pages/forms/)

Before we go any further, take a look at the underlying code. Here’s a small sample of it.

Source code for the page

Given that it’s designed to be used by folks learning scraping, it’s not all that difficult to read. Still, it’s a pain correlating what you see here with what you see rendered.

Instead, highlight the relevant element on the page, right-click, and choose “inspect element”. This will bring up a layout similar to the one below, although it will vary by browser.

“Inspect element” button — your new best friend

The code that comes up is a view of the DOM (Document Object Model), the browser’s tree representation of the page. Without getting too technical, this lets you match each piece of code with its rendered result.
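To make the idea of the DOM as a tree a bit more tangible, here is a rough sketch using BeautifulSoup on a tiny hand-written snippet. The markup, the class names and the ancestry helper are all made up for illustration and are not taken from the actual page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a page; tags and class names are illustrative only.
html = """
<div id="page">
  <table class="table">
    <tr class="team">
      <td class="name">Team A</td>
      <td class="wins">10</td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def ancestry(tag):
    """Return tag names from the outermost ancestor down to `tag` itself."""
    # tag.parents walks upward; the last item is the BeautifulSoup object,
    # whose name is "[document]", so it is filtered out here.
    names = [p.name for p in tag.parents if p.name != "[document]"]
    return list(reversed(names)) + [tag.name]


cell = soup.find("td", class_="name")
path = ancestry(cell)
print(" > ".join(path))
```

The printed path is the same kind of nesting you see when you expand elements in the inspector: each cell sits inside a row, which sits inside a table, and so on.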

I highly recommend scrolling through the various elements here, selecting them, and generally observing the DOM’s structure.

Explore your DOM
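Once you can see the structure, there are a few common ways to address an element from code. The sketch below shows three of them on a hand-written snippet (again, the markup, class names and values are hypothetical): searching by tag and class, using a CSS selector, and matching cell text with a regular expression.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real table.
html = """
<table class="table">
  <tr class="team"><td class="name">Team A</td><td class="wins">10</td></tr>
  <tr class="team"><td class="name">Team B</td><td class="wins">7</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. By tag name and class attribute.
by_class = [td.get_text(strip=True) for td in soup.find_all("td", class_="name")]

# 2. By CSS selector: name cells that are direct children of a "team" row.
by_selector = [td.get_text(strip=True) for td in soup.select("tr.team > td.name")]

# 3. By regular expression: cells whose text consists only of digits.
numeric_cells = soup.find_all("td", string=re.compile(r"^\d+$"))
wins = [int(td.get_text()) for td in numeric_cells]

print(by_class, by_selector, wins)
```

Both find_all and select return Tag objects, so which one to use mostly comes down to which reads most clearly for the page at hand.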

More concretely, let’s see what we would do to scrape the table showing the conference standings as below.
