Jake Whittaker

An Introduction to Web Scraping in Python

Web scraping is a technique for extracting information from websites. In this article we will learn the basics of web scraping with Python using the “requests” and “BeautifulSoup” packages.

Table of Contents

  • Setting Up Your Python Web Scraper
  • Making Web Requests
  • Wrangling HTML With BeautifulSoup
  • Using BeautifulSoup to Get Mathematician Names
  • Getting the Popularity Score
  • Putting It All Together
  • Conclusion

What is web scraping all about?

Imagine that one day, out of the blue, you find yourself thinking “Gee, I wonder who the five most popular mathematicians are?”

You do a bit of thinking, and you get the idea to use Wikipedia’s XTools to measure the popularity of a mathematician by equating popularity with pageviews. For example, look at the page on Henri Poincaré. There, you can see that Poincaré’s pageviews for the last 60 days are, as of December 2017, around 32,000.

Next, you Google “famous mathematicians” and find this resource that lists 100 names. Now you have a page listing mathematicians’ names as well as a website that provides information about how “popular” that mathematician is. Now what?

This is where Python and web scraping come in. Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.

In this tutorial, you will be writing a Python program that downloads the list of 100 mathematicians and their XTools pages, selects data about their popularity, and finishes by telling us the top 5 most popular mathematicians of all time! Let’s get started.

Important: We’ve received an email from an XTools maintainer informing us that scraping XTools is harmful and that automation APIs should be used instead:

This article on your site is essentially a guide to scraping XTools […] This is not necessary, and it’s causing problems for us. We have APIs that should be used for automation, and furthermore, for pageviews specifically folks should be using the official pageviews API.
The example code in the article was modified to no longer make requests to the XTools website. The web scraping techniques demonstrated here are still valid, but please do not use them on web pages of the XTools project. Use the provided automation API instead.
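In that spirit, here is a sketch of how you might build a request URL for the official Wikimedia Pageviews REST API instead of scraping. The endpoint shape below is an assumption based on the API’s documented pattern; check the official API documentation before relying on it:

```python
# A sketch of building a URL for the official Wikimedia Pageviews REST API.
# The endpoint shape below is an assumption; consult the official docs.
PAGEVIEWS_TEMPLATE = (
    'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
    '{project}/all-access/all-agents/{article}/daily/{start}/{end}'
)

def build_pageviews_url(article, start, end, project='en.wikipedia'):
    """Build a per-article pageviews URL; dates are YYYYMMDD strings."""
    return PAGEVIEWS_TEMPLATE.format(
        project=project, article=article, start=start, end=end)

url = build_pageviews_url('Henri_Poincare', '20171001', '20171130')
# The resulting URL could then be fetched with requests.get(url).json()
```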

Setting Up Your Python Web Scraper

You will be using Python 3 and Python virtual environments throughout the tutorial. Feel free to set things up however you like. Here is how I tend to do it:

$ python3 -m venv venv
$ . ./venv/bin/activate

You will need to install only these two packages:

  • requests for performing your HTTP requests
  • BeautifulSoup4 for handling all of your HTML processing

Let’s install these dependencies with pip:

$ pip install requests BeautifulSoup4

Finally, if you want to follow along, fire up your favorite text editor and create a file called mathematicians.py. Get started by including these import statements at the top:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

Making Web Requests

Your first task will be to download web pages. The requests package comes to the rescue. It aims to be an easy-to-use tool for doing all things HTTP in Python, and it doesn’t disappoint. In this tutorial, you will need only the requests.get() function, but you should definitely check out the full documentation when you want to go further.

First, here’s your function:

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    It is always a good idea to log errors.
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

The simple_get() function accepts a single url argument. It then makes a GET request to that URL. If nothing goes wrong, you end up with the raw HTML content for the page you requested. If there were any problems with your request (like the URL is bad, or the remote server is down), then your function returns None.

You may have noticed the use of the closing() function in your definition of simple_get(). The closing() function ensures that any network resources are freed when they go out of scope in that with block. Using closing() like that is good practice and helps to prevent fatal errors and network timeouts.
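To see what closing() buys you without making a network request, here is a minimal standalone sketch using a fake response object (FakeResponse is an illustrative stand-in, not part of requests):

```python
from contextlib import closing

class FakeResponse:
    """A stand-in for a network response object that must be closed."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

resp = FakeResponse()
with closing(resp):
    pass  # ... work with the response here ...

# closing() guarantees that close() was called when the block exited:
print(resp.closed)  # True
```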

You can test simple_get() like this:

>>> from mathematicians import simple_get
>>> raw_html = simple_get('https://realpython.com/blog/')
>>> len(raw_html)

>>> no_html = simple_get('https://realpython.com/blog/nope-not-gonna-find-it')
>>> no_html is None
True

Wrangling HTML With BeautifulSoup

Once you have raw HTML in front of you, you can start to select and extract. For this purpose, you will be using BeautifulSoup. The BeautifulSoup constructor parses raw HTML strings and produces an object that mirrors the HTML document’s structure. The object includes a slew of methods to select, view, and manipulate DOM nodes and text content.

Consider the following quick and contrived example of an HTML document:

<!DOCTYPE html>
<html>
  <head>
    <title>Contrived Example</title>
  </head>
  <body>
    <p id="eggman"> I am the egg man </p>
    <p id="walrus"> I am the walrus </p>
  </body>
</html>

If the above HTML is saved in the file contrived.html, then you can use BeautifulSoup like this:

>>> from bs4 import BeautifulSoup
>>> raw_html = open('contrived.html').read()
>>> html = BeautifulSoup(raw_html, 'html.parser')
>>> for p in html.select('p'):
...     if p['id'] == 'walrus':
...         print(p.text)

I am the walrus

Breaking down the example, you first parse the raw HTML by passing it to the BeautifulSoup constructor. BeautifulSoup accepts multiple back-end parsers, but the standard back-end is 'html.parser', which you supply here as the second argument. (If you neglect to supply that 'html.parser', then the code will still work, but you will see a warning print to your screen.)

The select() method on your html object lets you use CSS selectors to locate elements in the document. In the above case, html.select('p') returns a list of paragraph elements. Each p has HTML attributes that you can access like a dict. In the line if p['id'] == 'walrus', for example, you check if the id attribute is equal to the string 'walrus', which corresponds to <p id="walrus"> in the HTML.
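CSS selectors can also do the filtering for you. As a small sketch building on the contrived example, an id selector like 'p#walrus' locates the element directly, with no Python-side loop:

```python
from bs4 import BeautifulSoup

raw_html = '<p id="eggman">I am the egg man</p><p id="walrus">I am the walrus</p>'
html = BeautifulSoup(raw_html, 'html.parser')

# A CSS id selector finds the matching element directly:
walrus = html.select('p#walrus')[0]
print(walrus['id'])  # walrus
print(walrus.text)   # I am the walrus
```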

Using BeautifulSoup to Get Mathematician Names

Now that you have given the select() method in BeautifulSoup a short test drive, how do you find out what to supply to select()? The fastest way is to step out of Python and into your web browser’s developer tools. You can use your browser to examine the document in some detail. I usually look for id or class element attributes or any other information that uniquely identifies the information I want to extract.

To make matters concrete, turn to the list of mathematicians you saw earlier. If you spend a minute or two looking at this page’s source, you can see that each mathematician’s name appears inside the text content of an <li> tag. To make matters even simpler, <li> tags on this page seem to contain nothing but names of mathematicians.

Here’s a quick look with Python:

>>> raw_html = simple_get('http://www.fabpedigree.com/james/mathmen.htm')
>>> html = BeautifulSoup(raw_html, 'html.parser')
>>> for i, li in enumerate(html.select('li')):
...     print(i, li.text)

0 Isaac Newton
 Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

# ... and many more ...

The above experiment shows that some of the <li> elements contain multiple names separated by newline characters, while others contain just a single name. With this information in mind, you can write your function to extract a single list of names:

def get_names():
    """
    Downloads the page where the list of mathematicians is found
    and returns a list of strings, one per mathematician
    """
    url = 'http://www.fabpedigree.com/james/mathmen.htm'
    response = simple_get(url)

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        names = set()
        for li in html.select('li'):
            for name in li.text.split('\n'):
                if len(name) > 0:
                    names.add(name.strip())
        return list(names)

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))

The get_names() function downloads the page and iterates over the <li> elements, picking out each name that occurs. Next, you add each name to a Python set, which ensures that you don’t end up with duplicate names. Finally, you convert the set to a list and return it.
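The deduplication step can be seen in isolation with a tiny sketch: adding the same (stripped) name to a set twice leaves only one copy.

```python
names = set()
for line in ['Leonhard Euler ', 'Carl F. Gauss', 'Leonhard Euler']:
    names.add(line.strip())

# The second 'Leonhard Euler' is silently dropped by the set:
print(sorted(names))  # ['Carl F. Gauss', 'Leonhard Euler']
```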

Getting the Popularity Score

Nice, you’re nearly done! Now that you have a list of names, you need to pick out the pageviews for each one. The function you write is similar to the function you made to get the list of names, only now you supply a name and pick out an integer value from the page.

Again, you should first check out an example page in your browser’s developer tools. It looks as if the text appears inside an <a> element, and the href attribute of that element always contains the string 'latest-60' as a substring. That’s all the information you need to write your function:

def get_hits_on_name(name):
    """
    Accepts a `name` of a mathematician and returns the number
    of hits that mathematician's Wikipedia page received in the
    last 60 days, as an `int`
    """
    # url_root is a template string that is used to build a URL.
    # Per the notice above, the real XTools URL has been removed.
    url_root = 'URL_REMOVED_SEE_NOTICE_ABOVE'
    response = simple_get(url_root.format(name))

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')

        hit_link = [a for a in html.select('a')
                    if a['href'].find('latest-60') > -1]

        if len(hit_link) > 0:
            # Strip commas
            link_text = hit_link[0].text.replace(',', '')
            try:
                # Convert to integer
                return int(link_text)
            except ValueError:
                log_error("couldn't parse {} as an `int`".format(link_text))

    log_error('No pageviews found for {}'.format(name))
    return None

Putting It All Together

You have reached a point where you can finally find out which mathematician is most beloved by the public! The plan is simple:

  • Get a list of names
  • Iterate over the list to get a “popularity score” for each name
  • Finish by sorting the names by popularity

Simple, right? Well, there’s one thing that hasn’t been mentioned yet: errors.

Working with real-world data is messy, and trying to force messy data into a uniform shape will invariably result in the occasional error jumping in to mess with your nice clean vision of how things ought to be. Ideally, you would like to keep track of errors when they occur in order to get a better sense of the quality of your data.

For your present purposes, you will track instances in which you could not find a popularity score for a given mathematician’s name. At the end of the script, you will print a message showing the number of mathematicians who were left out of the rankings.

Here’s the code:

if __name__ == '__main__':
    print('Getting the list of names....')
    names = get_names()
    print('... done.\n')

    results = []

    print('Getting stats for each name....')

    for name in names:
        try:
            hits = get_hits_on_name(name)
            if hits is None:
                hits = -1
            results.append((hits, name))
        except Exception:
            results.append((-1, name))
            log_error('error encountered while processing '
                      '{}, skipping'.format(name))

    print('... done.\n')

    results.sort()
    results.reverse()

    if len(results) > 5:
        top_marks = results[:5]
    else:
        top_marks = results

    print('\nThe most popular mathematicians are:\n')
    for (mark, mathematician) in top_marks:
        print('{} with {} pageviews'.format(mathematician, mark))

    no_results = len([res for res in results if res[0] == -1])
    print('\nBut we did not find results for '
          '{} mathematicians on the list'.format(no_results))
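The sorting step in the plan above works because Python compares tuples element by element, so a list of (hits, name) pairs orders by hit count first. A minimal standalone sketch, with hit counts borrowed from the final report:

```python
# Tuples sort element by element, so (hits, name) pairs order by hit
# count first; reverse=True puts the most popular mathematician first.
results = [(581612, 'Isaac Newton'), (-1, 'Unknown'), (1089615, 'Albert Einstein')]
results.sort(reverse=True)

top_marks = results[:2]
print(top_marks)  # [(1089615, 'Albert Einstein'), (581612, 'Isaac Newton')]
```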

That’s it!

When you run the script, you should see the following report:

The most popular mathematicians are:

Albert Einstein with 1089615 pageviews
Isaac Newton with 581612 pageviews
Srinivasa Ramanujan with 407141 pageviews
Aristotle with 399480 pageviews
Galileo Galilei with 375321 pageviews

But we did not find results for 19 mathematicians on the list


Conclusion

Web scraping is a big field, and you have just finished a brief tour of that field, using Python as your guide. You can get pretty far using just requests and BeautifulSoup, but as you followed along, you may have come up with a few questions:

  • What happens if page content loads as a result of asynchronous JavaScript requests? (Check out Selenium’s Python API.)
  • How do I write a web spider or search engine bot that traverses large portions of the web?
  • What is this Scrapy thing I keep hearing about?
