In a previous tutorial, we saw how to use the Scrapy framework to solve lots of common web scraping problems.

Today we are going to take a look at Selenium and BeautifulSoup (with Python ❤️) in a step-by-step tutorial.

It’s time to use Selenium

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
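
As a quick illustration of the remote case, Selenium can drive a browser running on another machine through a Selenium server or Grid. This is only a sketch: the server URL is a placeholder and it assumes a Selenium server is already running at that address.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# connect to a remote Selenium server instead of starting a local browser
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',  # placeholder server address
    desired_capabilities=DesiredCapabilities.CHROME,
)
driver.get('https://google.com')
driver.quit()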

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser end-to-end testing (acceptance tests).

Nowadays it is still used for testing, but also as a general browser automation platform and, of course, for web scraping!

Selenium is really useful when you have to perform actions on a website (a short sketch follows this list), such as:

* clicking on buttons
* filling forms
* scrolling
* taking a screenshot
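
Here is a minimal sketch of what those actions look like in code. The URL and the element locators are placeholders, not taken from a real page, and it assumes Chrome and chromedriver are installed as described in the next section.

from selenium import webdriver

driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get('https://example.com/login')  # placeholder URL

# fill a form and click the submit button (element locators are assumptions)
driver.find_element_by_name('username').send_keys('my_user')
driver.find_element_by_name('password').send_keys('my_password')
driver.find_element_by_css_selector('button[type="submit"]').click()

# scroll to the bottom of the page
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

# save a screenshot of the current viewport
driver.save_screenshot('screenshot.png')

driver.quit()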

It is also very useful for executing JavaScript code. Let’s say you want to scrape a Single Page Application and you can’t find an easy way to call the underlying APIs directly; in that case, Selenium might be what you need.
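
For instance, once a page has finished rendering, you can run arbitrary JavaScript inside it with execute_script and get the result back in Python. The following is only a sketch: the URL is a placeholder, and it again assumes Chrome and chromedriver are set up as described below.

from selenium import webdriver

driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get('https://example.com')  # placeholder URL

# run JavaScript in the page; the script's return value comes back to Python
title = driver.execute_script('return document.title;')
link_count = driver.execute_script('return document.querySelectorAll("a").length;')
print(title, link_count)

driver.quit()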

Installation

We will use Chrome in our example, so make sure you have it installed on your local machine:

* Chrome download page
* Chrome driver binary
* selenium package

In order to install the Selenium package, as always, I recommend that you create a virtual environment (using virtualenv, for example) and then:

pip install selenium

Quickstart

Once you have downloaded both Chrome and ChromeDriver and installed the selenium package, you should be ready to start the browser:

from selenium import webdriver

DRIVER_PATH = './chromedriver'  # path to the chromedriver executable you downloaded
driver = webdriver.Chrome(executable_path=DRIVER_PATH)  # start a local Chrome instance
driver.get('https://google.com')  # navigate to a URL

This will launch Chrome in headful mode (like a regular Chrome window, except that it is controlled by your Python code). You should see a message stating that the browser is being controlled by automated test software.

In order to run Chrome in headless mode (without any graphical user interface), for example on a server, set the headless option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # run Chrome without opening a window
options.add_argument("--window-size=1920,1200")  # set a fixed viewport size

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.thewindpower.net/country_media_es_3_espana.php")
# print(driver.page_source)  # uncomment to dump the rendered HTML
driver.quit()
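
Since the goal is to combine Selenium with BeautifulSoup, here is a minimal sketch of handing the rendered HTML to BeautifulSoup once the page has loaded. It assumes the beautifulsoup4 package is installed, and the td selector is only an assumption about the page’s structure, not something taken from the site.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options, executable_path='./chromedriver')
driver.get("https://www.thewindpower.net/country_media_es_3_espana.php")

# parse the HTML rendered by the headless browser with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
cells = [td.get_text(strip=True) for td in soup.find_all('td')]  # assumed selector
print(len(cells), 'table cells found')

driver.quit()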

