Overview

Selenium is a portable framework for testing web applications. It is open-source software released under the Apache License 2.0 that runs on Windows, Linux and macOS. Despite serving its major purpose, Selenium is also used as a web scraping tool. Without delving into the components of Selenium, we shall focus on a single component that is useful for web scraping, WebDriver. Selenium WebDriver provides us with an ability to control a web browser through a programming interface to create and execute test cases.

In our case, we shall be using it for scraping data from websites. Selenium comes in handy when websites display content dynamically i.e. use JavaScripts to render content. Even though Scrapy is a powerful web scraping framework, it becomes useless with these dynamic websites. My goal for this tutorial is to make you familiarize with Selenium and carry out some basic web scraping using it.

Let us start by installing selenium and a webdriver. WebDrivers support 7 Programming Languages: Python, Java, C#, Ruby, PHP, .Net and Perl. The examples in this manual are with Python language. There are tutorials available on the internet with other languages.

This is the third part of a 4 part tutorial series on web scraping using Scrapy and Selenium. You can reach part-1 by clicking here and part-2 by clicking here. These two parts dealt with web scraping using Scrapy.


Installing Selenium and WebDriver

Installing Selenium

Installing Selenium on any Linux OS is easy. Just execute the following command in a terminal and Selenium would be installed automatically.

pip install selenium

Installing WebDriver

Selenium officially has WebDrivers for 5 Web Browsers. Here, we shall see the installation of WebDriver for two of the most widely used browsers: Chrome and Firefox.

Installing Chromedriver for Chrome

First, we need to download the latest stable version of chromedriver from Chrome’s official site. It would be a zip file. All we need to do is extract it and put it in the executable path.

wget https://chromedriver.storage.googleapis.com/83.0.4103.39/chromedriver_linux64.zip

unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/local/bin/

Installing Geckodriver for Firefox

Installing geckodriver for Firefox is even simpler since it is maintained by Firefox itself. All we need to do is execute the following line in a terminal and you are ready to play around with selenium and geckodriver.

sudo apt install firefox-geckodriver

Examples

There are two examples with increasing levels of complexity. First one would be a simpler webpage opening and typing into textboxes and pressing key(s). This example is to showcase how a webpage can be controlled through Selenium using a program. The second one would be a more complex web scraping example involving mouse scrolling, mouse button clicks and navigating to other pages. The goal here is to make you feel confident to start web scraping with Selenium.

Example 1 — Logging into Facebook using Selenium

Let us try out a simple automation task using Selenium and chromedriver as our training wheel exercise. For this, we would try to log into a Facebook account and we are not performing any kind of data scraping. I am assuming that you have some knowledge of identifying HTML tags used in a webpage using the browser’s developer tools. The following is a piece of python code that opens up a new Chrome browser, opens the Facebook main page, enters a username, password and clicks Login button.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

user_name = "Your E-mail"
password = "Your Password"
# Creating a chromedriver instance
driver = webdriver.Chrome()  # For Chrome
# driver = webdriver.Firefox() # For Firefox
# Opening facebook homepage
driver.get("https://www.facebook.com")
# Identifying email and password textboxes
email = driver.find_element_by_id("email")
passwd = driver.find_element_by_id("pass")
# Sending user_name and password to corresponding textboxes
email.send_keys(user_name)
passwd.send_keys(password)
# Sending a signal that RETURN key has been pressed
passwd.send_keys(Keys.RETURN)
# driver.quit()

#web-scraping #selenium #web-scraping-series #python

Web Scraping with Selenium
4.25 GEEK