Beginner's guide to Web Scraping in Python 3

So let’s start…

First things first, we’ll need to install a few essential libraries.

The five packages we’ll need are requests, bs4, re, time, and selenium. re and time are part of the standard library, so they already come with your installation of Python 3.

Install the other three by typing the following in your terminal:

pip install requests bs4 selenium

#  You may need to add --user to the end, depending on your system

You will also need to install ChromeDriver, which can be found here. Make sure you download the version that matches your installed Chrome; instructions are on the website.

Now we will start scraping the Hacker News front page!

The first thing we need to do in any Python project is to import the libraries we need.

import requests, bs4, re, time

So let’s make our first page request by getting Python to download the page data into a variable using requests.get():

pagedata = requests.get("https://news.ycombinator.com/")

In order to parse the variable into readable HTML, we’ll use BeautifulSoup.

cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')

We use BeautifulSoup because it parses the raw page data into proper HTML, which ends up looking like this:

<html op="news"><head><meta content="origin" name="referrer"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><link href="news.css?Swdnfjd2lvQXPAqH2Hs6" rel="stylesheet" type="text/css"/>
<link href="favicon.ico" rel="shortcut icon"/>
<link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>
<title>Hacker News</title></head><body><center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
<a href="login?goto=news">login</a>
</span></td>

Now that we have the HTML, we can use some Regex magic to grab the links to the discussion threads.

If we open Chrome DevTools by right-clicking on the comments link and selecting 'Inspect', we can see that the code for the link includes an ID number.

If we go to the actual site and hover over each comment thread link, we can see that the links follow a common format, which is https://news.ycombinator.com/item?id= + the ID number. What we can do then is make a regular expression to find the ID and then use it to search through our page data for all the IDs:

IDsearch = re.compile(r'id=(\d+)')		# This searches for 'id=' followed by a string of numbers, and captures just the numbers
threadIDs = IDsearch.findall(str(cleanpagedata))		# We need to convert the BeautifulSoup output to a string in order to search with regex

But this gives us a bit of a problem. If we look at the results, we actually get 121 results, when we only have 30 links to scrape!

threadIDs = {list} ['19279396', '19279396', '19279396', '19279396', '19277809', '19277809', '19277809', '19277809', '19279003', '19279003', '19279003', '19279003', '19278075', '19278075', '19278075', '19278075', '19273955', '19273955', '19273955', '19273955', '19278555', '19
 000 = {str} '19279396'
 001 = {str} '19279396'
 002 = {str} '19279396'
 003 = {str} '19279396'
 004 = {str} '19277809'
 005 = {str} '19277809'
 006 = {str} '19277809'
 007 = {str} '19277809'
 008 = {str} '19279003'
 009 = {str} '19279003'
 010 = {str} '19279003'
...
 110 = {str} '19271487'
 111 = {str} '19271487'
 112 = {str} '19271487'
 113 = {str} '19277272'
 114 = {str} '19277272'
 115 = {str} '19277272'
 116 = {str} '19277272'
 117 = {str} '19274175'
 118 = {str} '19274175'
 119 = {str} '19274175'
 120 = {str} '19274175'
 __len__ = {int} 121

The reason is that, if you look at the HTML, each ID actually appears multiple times (four times in the output above) when we use that regular expression. Now, we could solve this by converting our list into a set and back into a list (there's a short sketch of that after the output below), but looking at the HTML we could also just use another part of the markup that only appears once per story. In this example, I'll use vote?id=(\d+)&amp instead:

IDsearch = re.compile(r'vote\?id=(\d+)&amp')		# don't forget the \ before the ? in the regular expression; characters like ? are special in regex, so they need to be escaped to be matched literally
threadIDs = IDsearch.findall(str(cleanpagedata))

Which comes up with a much better result:

threadIDs = {list} ['19279396', '19277809', '19279003', '19278075', '19273955', '19278555', '19278936', '19274941', '19277846', '19277653', '19274406', '19278891', '19278302', '19276113', '19277263', '19276977', '19277978', '19275755', '19276751', '19277910', '19275738', '19
 00 = {str} '19279396'
 01 = {str} '19277809'
 02 = {str} '19279003'
 03 = {str} '19278075'
 04 = {str} '19273955'
 05 = {str} '19278555'
 06 = {str} '19278936'
 07 = {str} '19274941'
 08 = {str} '19277846'
 09 = {str} '19277653'
 10 = {str} '19274406'
 11 = {str} '19278891'
 12 = {str} '19278302'
 13 = {str} '19276113'
 14 = {str} '19277263'
 15 = {str} '19276977'
 16 = {str} '19277978'
 17 = {str} '19275755'
 18 = {str} '19276751'
 19 = {str} '19277910'
 20 = {str} '19275738'
 21 = {str} '19276542'
 22 = {str} '19277411'
 23 = {str} '19270646'
 24 = {str} '19277765'
 25 = {str} '19273403'
 26 = {str} '19277473'
 27 = {str} '19271487'
 28 = {str} '19277272'
 29 = {str} '19274175'
 __len__ = {int} 30
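
As promised, here is what the set-style de-duplication of the first search's results could look like. A plain set would scramble the front-page ordering, so this minimal sketch uses dict.fromkeys, which keeps the first occurrence of each ID (uniqueIDs is just a name made up for illustration):

IDsearch = re.compile(r'id=(\d+)')
uniqueIDs = list(dict.fromkeys(IDsearch.findall(str(cleanpagedata))))      # dict.fromkeys drops duplicates but preserves order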

Now that we have the IDs and we know the format of the links, we can easily combine the two with a quick loop:

commentlinks = []
for i in range(len(threadIDs)):
    commentlinks.append("https://news.ycombinator.com/item?id=" + threadIDs[i])
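
If you prefer, the same loop can also be written as a single list comprehension, which builds exactly the same list:

commentlinks = ["https://news.ycombinator.com/item?id=" + threadID for threadID in threadIDs]      # equivalent to the loop above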

And we have our list of links to the top 30 threads on Hacker News!

commentlinks = {list} ['https://news.ycombinator.com/item?id=19279396', 'https://news.ycombinator.com/item?id=19277809', 'https://news.ycombinator.com/item?id=19279003', 'https://news.ycombinator.com/item?id=19278075', 'https://news.ycombinator.com/item?id=19273955', 'https://n
 00 = {str} 'https://news.ycombinator.com/item?id=19279396'
 01 = {str} 'https://news.ycombinator.com/item?id=19277809'
 02 = {str} 'https://news.ycombinator.com/item?id=19279003'
 03 = {str} 'https://news.ycombinator.com/item?id=19278075'
 04 = {str} 'https://news.ycombinator.com/item?id=19273955'
 05 = {str} 'https://news.ycombinator.com/item?id=19278555'
 06 = {str} 'https://news.ycombinator.com/item?id=19278936'
 07 = {str} 'https://news.ycombinator.com/item?id=19274941'
 08 = {str} 'https://news.ycombinator.com/item?id=19277846'
 09 = {str} 'https://news.ycombinator.com/item?id=19277653'
 10 = {str} 'https://news.ycombinator.com/item?id=19274406'
 11 = {str} 'https://news.ycombinator.com/item?id=19278891'
 12 = {str} 'https://news.ycombinator.com/item?id=19278302'
 13 = {str} 'https://news.ycombinator.com/item?id=19276113'
 14 = {str} 'https://news.ycombinator.com/item?id=19277263'
 15 = {str} 'https://news.ycombinator.com/item?id=19276977'
 16 = {str} 'https://news.ycombinator.com/item?id=19277978'
 17 = {str} 'https://news.ycombinator.com/item?id=19275755'
 18 = {str} 'https://news.ycombinator.com/item?id=19276751'
 19 = {str} 'https://news.ycombinator.com/item?id=19277910'
 20 = {str} 'https://news.ycombinator.com/item?id=19275738'
 21 = {str} 'https://news.ycombinator.com/item?id=19276542'
 22 = {str} 'https://news.ycombinator.com/item?id=19277411'
 23 = {str} 'https://news.ycombinator.com/item?id=19270646'
 24 = {str} 'https://news.ycombinator.com/item?id=19277765'
 25 = {str} 'https://news.ycombinator.com/item?id=19273403'
 26 = {str} 'https://news.ycombinator.com/item?id=19277473'
 27 = {str} 'https://news.ycombinator.com/item?id=19271487'
 28 = {str} 'https://news.ycombinator.com/item?id=19277272'
 29 = {str} 'https://news.ycombinator.com/item?id=19274175'
 __len__ = {int} 30

Now that we have the thread links, we'll get Python to scrape each thread page for the article link and the name of the first commenter. Let's start with just one page.

First, I got Python to just grab the first link in the list:

thread = requests.get(commentlinks[0])
cleanthread = bs4.BeautifulSoup(thread.text, 'html.parser')

Using Chrome DevTools, we can see that the link we want to scrape is coded as:

<a class="storylink" href="https://babluboy.github.io/bookworm/">Bookworm: A Simple, Focused eBook Reader</a>

So we can write our regular expression and then put the result into a variable:

singlethreadlinksearch = re.compile(r'\<a class="storylink" href="(.+?)"\>')		# note: < and > aren't special characters in regex, so escaping them is optional here (unlike the ? earlier, which must be escaped)
singlethreadlink = singlethreadlinksearch.findall(str(cleanthread))
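
As an aside, since cleanthread is already a BeautifulSoup object, the same link can be pulled out without regex. A minimal sketch, assuming the page still marks the link with the storylink class:

storylink_tag = cleanthread.find('a', class_='storylink')      # let BeautifulSoup find the story link directly
if storylink_tag is not None:
    singlethreadlink = [storylink_tag['href']]      # wrap in a list to match findall()'s output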

Easy!

Next, we need the top commenter.

When we look through Chrome DevTools, we can see that user IDs are tagged as "user?id=[userID]".

So all we need to do is get our regular expression set up and then grab all the user IDs off the page:

commenterIDsearch = re.compile(r'user\?id=(.+?)"')
commenterIDs = commenterIDsearch.findall(str(cleanthread))

If we look at the actual page, we can see that the OP is actually the first user ID that shows up, which means that the top commenter's ID will be the second ID in our list. To get that, we can use:

firstcommenter = commenterIDs[1]		# Remember that Python lists start with 0

Easy!

Now, to put this all together, we'll wrap everything in a loop so it gives us all the results automatically.

First, let's turn our previous code into a function that scrapes a thread and returns the results:

def scrapethread(cleanthread):		# We need to feed the thread data into the function
    singlethreadlinksearch = re.compile(r'\<a class="storylink" href="(.+?)"\>')		
    singlethreadlink = singlethreadlinksearch.findall(str(cleanthread))
    commenterIDsearch = re.compile(r'user\?id=(.+?)"')
    commenterIDs = commenterIDsearch.findall(str(cleanthread))
    try:
        firstcommenter = commenterIDs[1]		# If there are no commenters this will fail with an IndexError, so we wrap it in a try/except just in case
    except IndexError:
        firstcommenter = "No commenters"
    return singlethreadlink, firstcommenter		# Return the variables

And then write the loop to scrape the results:

results = []        # We want our results to come back as a list
for i in range(len(commentlinks)):
    thread = requests.get(commentlinks[i])      # Go to each link
    cleanthread = bs4.BeautifulSoup(thread.text, 'html.parser')
    link, commenter = scrapethread(cleanthread)        # Scrape the data and return them to these variables
    results.append(link + [commenter])      # Append the results - note that the link actually returns as a list, rather than a string
    time.sleep(30)

Were you wondering why I asked you to import time at the beginning? Most sites will block rapid, repeated requests to stop you from spamming their servers with scraping traffic (it's also just impolite to overload other people's servers with requests), which is why we pause with time.sleep(30) between requests.
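
If you find yourself doing this a lot, the pause and the request can be wrapped together. This is only a hypothetical helper (polite_get isn't from any library), and the delay and timeout values are up to you and the site's tolerance:

def polite_get(url, delay=30):
    time.sleep(delay)                       # pause before every request so we don't hammer the server
    return requests.get(url, timeout=10)    # give up rather than hang if the site doesn't respond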

Now, when we run the code, we have a complete list of the links and first commenters in our results variable!

results = {list} [['https://ourworldindata.org/the-link-between-life-expectancy-and-health-spending-us-focus', 'chrismeller'], ['https://www.sqlite.org/json1.html', 'Sean1708'], ['https://github.com/Microsoft/nni', 'No commenters'], ['https://github.com/pugwonk/gif2xlsx/bl
 00 = {list} ['https://ourworldindata.org/the-link-between-life-expectancy-and-health-spending-us-focus', 'chrismeller']
 01 = {list} ['https://www.sqlite.org/json1.html', 'Sean1708']
 02 = {list} ['https://github.com/Microsoft/nni', 'No commenters']
 03 = {list} ['https://github.com/pugwonk/gif2xlsx/blob/master/README.md', 'cosmie']
 04 = {list} ['https://www.universityofcalifornia.edu/press-room/uc-terminates-subscriptions-worlds-largest-scientific-publisher-push-open-access-publicly', 'pwthornton']
 05 = {list} ['https://github.com/triska/lisprolog', 'jasim']
 06 = {list} ['https://adnauseam.io/', 'Spare_account']
 07 = {list} ['item?id=19274941', 'zrail']
 08 = {list} ['https://www.nytimes.com/2019/02/24/business/china-pig-technology-facial-recognition.html', 'rectang']
 09 = {list} ['No commenters']
 10 = {list} ['https://www.bbc.co.uk/news/technology-47408969', 'jawns']
 11 = {list} ['https://twitter.com/joose_rajamaeki/status/1096397000520749056', 'hpaavola']
 12 = {list} ['https://energy.stanford.edu/news/cheap-renewables-won-t-stop-global-warming-says-bill-gates', 'wjnc']
 13 = {list} ['http://tonsky.me/blog/github-redesign/', 'scott_s']
 14 = {list} ['No commenters']
 15 = {list} ['https://sod.pixlab.io/articles/license-plate-detection.html', 'king_magic']
 16 = {list} ['No commenters']
 17 = {list} ['https://www.tesla.com/blog/35000-tesla-model-3-available-now', 'BenoitEssiambre']
 18 = {list} ['https://github.com/remacs/remacs', 'melling']
 19 = {list} ['https://lineageos.org/Changelog-22/', 'fro0116']
 20 = {list} ['No commenters']
 21 = {list} ['https://bitbucket.org/blog/meet-bitbucket-pipes-30-ways-to-automate-your-ci-cd-pipeline', 'imcotton']
 22 = {list} ['https://erlef.org/', 'yetihehe']
 23 = {list} ['https://blog.mozilla.org/blog/2019/02/28/sharing-our-common-voices-mozilla-releases-the-largest-to-date-public-domain-transcribed-voice-dataset/', 'sgc']
 24 = {list} ['https://github.com/zelon88/xPress', 'kstenerud']
 25 = {list} ['https://pagedraw.io/', 'abraae']
 26 = {list} ['http://cs.lmu.edu/~ray/notes/nasmtutorial/', 'jdsully']
 27 = {list} ['https://hacks.mozilla.org/2019/02/rewriting-a-browser-component-in-rust/', 'atoav']
 28 = {list} ['No commenters']
 29 = {list} ['https://www.sendthemtomir.com/blog/cli-2-factor-authentication', 'spectralblu']
 __len__ = {int} 30

OK, now that we've gone through a standard HTML page, let's try again with a JavaScript-rendered page.

For this part, we’ll try to scrape https://vuejs.github.io/vue-hackernews/#!/news/1

We'll start by getting requests to grab the data:

jspagedata = requests.get("https://vuejs.github.io/vue-hackernews/#!/news/1")
jspagedataclean = bs4.BeautifulSoup(jspagedata.text, 'html.parser')

Hmm, but what's this? When we look at our jspagedataclean variable, none of the stories are there:

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Vue.js HN Clone</title>
<meta content="initial-scale=1, maximum-scale=1, user-scalable=no, minimal-ui" name="viewport"/>
<link href="static/logo.png" rel="icon" type="image/x-icon"/>
</head>
<body>
<div id="app"></div>
<script src="static/build.js"></script>
</body>
</html>

That's because the page relies on JavaScript to load the data, and the requests module can't execute JavaScript; it only downloads the raw HTML.

This is where Selenium comes in: it drives a real browser (Chrome, via the ChromeDriver we installed earlier), which runs the JavaScript for us.

Let’s start again from the beginning by importing all the modules we need.

from selenium import webdriver
import requests, bs4, re, time

We'll launch the browser and point it at the site:

driver = webdriver.Chrome()
driver.get("https://vuejs.github.io/vue-hackernews/#!/news/1")
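
By default this opens a visible Chrome window. If you'd rather not see one, Chrome can also be launched headless. A minimal sketch, assuming a reasonably recent version of Selenium and Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")              # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://vuejs.github.io/vue-hackernews/#!/news/1")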

Now we can load the page source into BeautifulSoup and repeat the process:

jspagedata = bs4.BeautifulSoup(driver.page_source, 'html.parser')  
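
One thing to watch out for: depending on your connection, the JavaScript may not have finished rendering by the time we grab driver.page_source. A minimal sketch using Selenium's explicit waits, assuming the story links render with class="title" (which is what DevTools shows below):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one story link to appear before parsing the page source
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "a.title")))
jspagedata = bs4.BeautifulSoup(driver.page_source, 'html.parser')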

We can quickly create our regular expressions by copying the relevant outerHTML from DevTools.

And use the same method to create our link list:

jsIDsearch = re.compile(r'\<a href="#\/item\/(\d+)"')		# escaping the / and < is optional in a Python regex, since neither is a special character
jsthreadIDs = jsIDsearch.findall(str(jspagedata))

jscommentlinks = []
for i in range(len(jsthreadIDs)):
    jscommentlinks.append('https://vuejs.github.io/vue-hackernews/#/item/' + jsthreadIDs[i])

Note that the regular expressions and URL format are different from the ones we used on the real Hacker News site.

And then, just like before, we use Chrome DevTools to find the information we need and write a function to scrape the page:

def jsscrapethread(jscleanthread):
    jssinglethreadlinksearch = re.compile(r'\<a class="title" href="(.+?)"')
    jssinglethreadlink = jssinglethreadlinksearch.findall(str(jscleanthread))
    jscommenterIDsearch = re.compile(r'#\/user\/(.+?)"')
    jscommenterIDs = jscommenterIDsearch.findall(str(jscleanthread))
    try:
        jsfirstcommenter = jscommenterIDs[1]
    except IndexError:
        jsfirstcommenter = "No Commenters"
    return jssinglethreadlink, jsfirstcommenter

jsresults = []        # We want our results to come back as a list
for i in range(len(jscommentlinks)):
    driver.get(jscommentlinks[i])      # Go to each link; driver.get() navigates the browser and doesn't return the page, so there's no need to store its result
    jscleanthread = bs4.BeautifulSoup(driver.page_source, 'html.parser')
    jslink, jscommenter = jsscrapethread(jscleanthread)        # Scrape the data and return them to these variables
    jsresults.append(jslink + [jscommenter])      # Append the results - note that the link actually returns as a list, rather than a string
    time.sleep(30)

And, just like before, we end up with our full list of links and first commenters:

jsresults = {list} [['https://techcrunch.com/2019/03/03/facebook-phone-number-look-up/', 'https://twitter.com/rqou_/status/1101331385632022528', 'https://qmlbook.github.io/', 'https://retrobitch.wordpress.com/2019/02/12/pac-man-the-untold-story-of-how-we-really-played-the-ga
 00 = {list} ['https://techcrunch.com/2019/03/03/facebook-phone-number-look-up/', 'https://twitter.com/rqou_/status/1101331385632022528', 'https://qmlbook.github.io/', 'https://retrobitch.wordpress.com/2019/02/12/pac-man-the-untold-story-of-how-we-really-played-the-gam
 01 = {list} ['https://twitter.com/rqou_/status/1101331385632022528', 'kpcyrd']
 02 = {list} ['https://qmlbook.github.io/', 'swang']
 03 = {list} ['https://retrobitch.wordpress.com/2019/02/12/pac-man-the-untold-story-of-how-we-really-played-the-game/', 'mattigames']
 04 = {list} ['https://github.com/jakevdp/PythonDataScienceHandbook', 'wespiser_2018']
 05 = {list} ['https://arxiv.org/abs/1902.07254', 'fulafel']
 06 = {list} ['https://www.wbez.org/shows/wbez-news/the-middle-class-is-shrinking-everywhere-in-chicago-its-almost-gone/e63cb407-5d1e-41b1-9124-a717d4fb1b0b', 'undefined']
 07 = {list} ['https://thegeez.net/2019/03/03/serverless_collab.html', 'jseliger']
 08 = {list} ['https://github.com/Munksgaard/session-types', 'undefined']
 09 = {list} ['https://www.institutionalinvestor.com/article/b1db3jy3201d38/The-MBA-Myth-and-the-Cult-of-the-CEO', 'wcrichton']
 10 = {list} ['https://www.scriptcrafty.com/2019/02/sensible-software-engineering/', 'lordnacho']
 11 = {list} ['#/item/19288954', 'AdieuToLogic']
 12 = {list} ['https://spectrum.ieee.org/energy/the-smarter-grid/chinas-ambitious-plan-to-build-the-worlds-biggest-supergrid', 'spricket']
 13 = {list} ['https://www.cbc.ca/news/politics/facebook-canada-data-pressure-1.5041063', 'jcoffland']
 14 = {list} ['https://www.neowin.net/news/google-reveals-high-severity-flaw-in-macos-kernel/', 'jszymborski']
 15 = {list} ['https://getpolarized.io/2019/03/01/polar-personal-knowledge-repository.html', 'Someone1234']
 16 = {list} ['https://beeisbeautiful.wordpress.com/2012/07/16/bumblebees-sleeping-in-flowers/', 'jively']
 17 = {list} ['http://www.cs.cmu.edu/~aada/courses/15251s15/www/notes/godel-letter.pdf', 'presscast']
 18 = {list} ['https://blog.adacore.com/ten-years-of-using-spark-to-build-cubesat-nano-satellites-with-students', 'TaupeRanger']
 19 = {list} ['http://sbjs.rocks/#/', 'tectonic']
 20 = {list} ['https://www.eugenewei.com/blog/2019/2/19/status-as-a-service', 'um_ya']
 21 = {list} ['https://www.bloomberg.com/news/articles/2019-03-03/spacex-crew-dragon-docks-with-international-space-station', 'soneca']
 22 = {list} ['http://wambook.sourceforge.net/wambook.pdf', 'Klathmon']
 23 = {list} ['https://www.rust-lang.org/static/pdfs/Rust-npm-Whitepaper.pdf', 'greenyoda']
 24 = {list} ['https://news.harvard.edu/gazette/story/2019/02/harvard-geneticist-no-populations-dna-is-pure/', 'nothrabannosir']
 25 = {list} ['https://www.xaprb.com/blog/three-vacation-policies/', 'thaumasiotes']
 26 = {list} ['#/item/19296733', 'closeparen']
 27 = {list} ['http://buffettfaq.com/', 'ioseph']
 28 = {list} ['https://github.com/markehammons/Wayland-McWayface_JVM-edition', 'nur0n']
 29 = {list} ['https://circleci.com/blog/designing-a-package-manager-from-the-ground-up/', 'Sir_Cmpwn']
 __len__ = {int} 30
