Scraping involves extracting information from HTML pages available on the Web. In Python, scraping can be performed with the Selenium library.

In this tutorial, I illustrate how to scrape a list of terms, distributed over two levels of nested pages, through Python Selenium. As an example, I scrape the list of terms from Brocardi.

The full code of this tutorial can be downloaded from my GitHub repository.

Installation

The selenium library can be easily installed via pip with the command pip install selenium. In addition to the library, I also need to install the driver for my browser, whose version must match the version of the browser itself. In this tutorial, I exploit the Chrome browser. I can check its version by entering chrome://settings/help in the address bar of my browser.

In my case the version is 80, thus I can download the Chrome driver from this link. Once downloaded, I can put the file into a generic folder of my file system, and I need to add the path to the Chrome driver to the $PATH variable:

  • Windows users - in this video I explain how to install the Chrome driver for Selenium on Windows 10.
  • macOS/Linux users - edit the .bash_profile or .profile file by adding the line export PATH="<path to web driver>:$PATH" and then open a new terminal session (or log out and back in) so the change takes effect.
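As a quick sanity check of the $PATH configuration, the snippet below (a minimal sketch using Python's standard shutil module, not part of the original code) verifies whether an executable named chromedriver is reachable; the executable name is an assumption and may differ on your system (e.g. chromedriver.exe on Windows).

```python
import shutil

def driver_on_path(executable="chromedriver"):
    """Return the full path to the executable if it is on $PATH, else None."""
    return shutil.which(executable)

if driver_on_path() is None:
    print("chromedriver not found: check your $PATH configuration")
else:
    print("chromedriver found at", driver_on_path())
```

If the driver is found, Selenium's webdriver.Chrome() will be able to locate it without any extra configuration.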

Recognise the Web Site Structure

In order to scrape data from a Web site, firstly I need to study the URI structure. In my example, the list of terms is organized alphabetically, and for each letter of the alphabet there is a dedicated page, available at <basic_url>/dizionario/<current_letter>/ (first level of URI). For example, for the letter a, the dedicated page is available at https://www.brocardi.it/dizionario/a/. In addition, the list of terms for each letter is split across multiple pages. For each letter, the first page is available at the first level of URI, while starting from the second page, the URI changes and is available at <basic_url>/dizionario/<current_letter>/?page=<page_number>. For example, for the letter a, the list of terms in the second page is available at the link https://www.brocardi.it/dizionario/a/?page=2.
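The URI structure described above can be captured in a small helper function; term_list_url is a name of my own choosing, not part of the site or the original code, and serves only as a sketch:

```python
BASE_URL = "https://www.brocardi.it"

def term_list_url(letter, page=1):
    """Build the URL of the page listing the terms for a given letter.

    The first page lives at <basic_url>/dizionario/<letter>/, while
    subsequent pages append the ?page=<n> query string.
    """
    url = f"{BASE_URL}/dizionario/{letter}/"
    if page > 1:
        url += f"?page={page}"
    return url
```

For instance, term_list_url("a") yields https://www.brocardi.it/dizionario/a/ and term_list_url("a", 2) yields https://www.brocardi.it/dizionario/a/?page=2.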

Environment Setup

In my code, I need to implement two loops: an external loop over letters and an internal loop over pages. I note that some letters are missing (j, k, w, x, y). For the external loop, I build a list containing all the letters except the missing ones. I exploit string.ascii_lowercase to build the list of letters.

#data-collection #python #data-science

Scraping Data from Nested HTML Pages with Python Selenium