Scrapy Vs Selenium Vs Beautiful Soup for Web Scraping

A complete explanation of the Scrapy, Selenium, and Beautiful Soup scraping tools.

The most popular Python libraries used by web scraping developers are Beautiful Soup, Scrapy, and Selenium, but each library has its own pros and cons; nothing is perfect in this world. To explain the various aspects of each library and their differences, I would like to start with each module's core implementation and its working mechanism. After that, we will dive into the differences between the modules. Let's start with the Scrapy library.

Originally published by Sri Manikanta Palakollu at https://towardsdatascience.com

Scrapy

Scrapy is an open-source collaborative framework for extracting the data we need from websites. Its performance is ridiculously fast, and it is one of the most powerful libraries available out there. One of the key advantages of Scrapy is that it is built on top of Twisted, an asynchronous networking framework, which means Scrapy sends requests using a non-blocking mechanism.

Asynchronous requests use non-blocking I/O calls to the server, which offers significant advantages over synchronous requests.
The key features of Scrapy are:

  1. Scrapy has built-in support for extracting data from HTML sources using XPath and CSS expressions (see the spider sketch after this list).
  2. It is a portable library, i.e. written in Python and running on Linux, Windows, Mac, and BSD.
  3. It is easily extensible.
  4. It is faster than other existing scraping libraries; it can extract data from websites up to 20 times faster than other tools.
  5. It consumes much less memory and CPU.
  6. It helps us build robust and flexible applications with a bunch of built-in functions.
  7. It has good community support for developers, but the documentation is not great for beginners because it is not very beginner friendly.
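
To make this concrete, here is a minimal sketch of a Scrapy spider (not from the original article) that uses both CSS and XPath selectors; quotes.toscrape.com is a public practice site, and the spider name and field names are just illustrative:

import scrapy


class QuotesSpider(scrapy.Spider):
    # 'quotes' and the target site are illustrative choices, not from the article
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # CSS expressions extract each quote block and its fields
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # an XPath expression follows the pagination link
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Running this with scrapy runspider quotes_spider.py -o quotes.json would crawl every page, and the requests are scheduled asynchronously, which is where the Twisted-based non-blocking I/O mentioned above pays off.
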
Beautiful Soup

When it comes to Beautiful Soup, it really is a beautiful tool for web scrapers because of its core features. It helps the programmer quickly extract data from a given web page and pull the data out of HTML and XML files. But the problem with Beautiful Soup is that it can't do the entire job on its own; the library requires other modules to get the work done.

The dependencies of Beautiful Soup are:

  1. A library is needed to make a request to the website, because Beautiful Soup cannot make a request to a server on its own. To overcome this, it takes the help of the most popular libraries, Requests or urllib2, which make the request to the server for us (see the sketch after this list).
  2. After downloading the HTML or XML data to our local machine, Beautiful Soup requires an external parser to parse the downloaded data. The most famous parsers are lxml's XML parser, lxml's HTML parser, html5lib, and html.parser.
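
As a small sketch of how these dependencies fit together (the URL is just a placeholder), Requests fetches the page and Beautiful Soup parses it with whichever parser you pick:

import requests
from bs4 import BeautifulSoup

# Requests (or urllib2) performs the HTTP call that Beautiful Soup cannot make itself
response = requests.get('https://example.com')

# the second argument selects the external parser: 'html.parser', 'lxml', or 'html5lib'
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.string)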

The advantages of Beautiful Soup are:

  1. It is easy to learn and master. For example, if we want to extract all the links from a web page, it can be done as simply as:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

for link in soup.find_all('a'):  # find all anchor tags
    print(link.get('href'))

In the above code, we are using html.parser to parse the content of html_doc (a string containing the HTML markup). This simplicity is one of the strongest reasons for developers to use Beautiful Soup as a web scraping tool.

  2. It has good, comprehensive documentation, which helps us learn things quickly.

  3. It has good community support for figuring out the issues that arise while working with the library.

Selenium

Finally, when it comes to Selenium for web scraping: first of all, you should remember that Selenium is designed to automate tests for web applications. It provides a way for the developer to write tests in a number of popular programming languages such as C#, Java, Python, Ruby, etc. The framework was developed to perform browser automation. Let's have a look at sample code that automates the browser.

# Importing the required Modules.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.send_keys("selenium")
elem.send_keys(Keys.RETURN)
assert "Google" in driver.title
driver.close()

From the above code, we can conclude that the API is very beginner friendly; you can easily write code with Selenium. That is why it is so popular in the developer community. Even though Selenium is mainly used to automate tests for web applications, it can also be used to develop web spiders; many people have done this before.

The key features of Selenium are:

  1. It can easily work with core JavaScript concepts (the DOM).
  2. It can easily handle AJAX and PJAX requests (see the explicit-wait sketch after this list).
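
As a rough sketch of how Selenium copes with content that JavaScript injects after the initial page load (the URL and the element id 'results' are purely hypothetical), an explicit wait blocks until the AJAX-rendered element appears:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL

# wait up to 10 seconds for the dynamically loaded element to show up in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'results'))
)
print(element.text)
driver.quit()
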
Choosing the Appropriate Library

When it comes to selecting a particular library to perform a web scraping operation, we need to consider various key factors, because every library has its own pros and cons. In this selection criteria section, we will discuss the factors to consider while choosing a library for a project. The key factors are:

Extensibility

**Scrapy:** The architecture of Scrapy is well designed, letting us customize the middleware to add our own functionality. This feature helps make our project more robust and flexible.

One of the biggest advantages of Scrapy is that we can migrate an existing project to another project very easily. So for large or complex projects, Scrapy is the best choice.

If your project needs proxies or a data pipeline, then Scrapy would be the best choice.
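
For illustration, here is a minimal sketch (not from the article) of the kind of downloader middleware Scrapy lets you plug in; RandomProxyMiddleware and the PROXY_LIST setting are hypothetical names:

import random


class RandomProxyMiddleware:
    """Assign a random proxy from a configurable list to every outgoing request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting you would define yourself in settings.py
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)

Enabling it is a one-line entry in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyMiddleware': 350} (the path and priority are illustrative).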

**Beautiful Soup:** When it comes to a small project, or one with low complexity, Beautiful Soup can do the task amazingly well. It helps us keep our code simple and flexible.

If you are a beginner who wants to learn quickly and perform web scraping operations, then Beautiful Soup is the best choice.

**Selenium:** When you are dealing with a website built around core JavaScript, then Selenium would be the best choice, but the data size should be limited.

Performance

**Scrapy:** It can do things quickly because of its built-in use of asynchronous system calls. The existing libraries out there are not able to beat the performance of Scrapy.

**Beautiful Soup:** Beautiful Soup is pretty slow at performing a given task, but we can overcome this with multithreading; however, the programmer needs to understand multithreading well. This is the downside of Beautiful Soup.
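
As a rough sketch of that multithreading idea (the URLs and parsing logic are placeholders), a thread pool lets the slow network calls overlap while Beautiful Soup does the parsing:

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def fetch_title(url):
    # the network request dominates the run time, so threads overlap nicely here
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return url, soup.title.string if soup.title else None


urls = ['https://example.com', 'https://www.python.org']
with ThreadPoolExecutor(max_workers=4) as executor:
    for url, title in executor.map(fetch_title, urls):
        print(url, title)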

**Selenium:** It can handle scraping up to a certain scale, but it is not equivalent to Scrapy.

Ecosystem

**Scrapy:** It has a good ecosystem; we can use proxies and VPNs to automate tasks. This is one of the reasons for choosing this library for complex projects, as we can send multiple requests from multiple proxy addresses.

**Beautiful Soup:** This library has a lot of dependencies in the ecosystem, which is one of its downsides for a complex project.

**Selenium:** It has a good ecosystem for development, but the problem is that we can't utilize proxies very easily.

From the above three common factors, you need to decide which one is the right choice for your next project.

Conclusion

I hope you now have a clear understanding of Scrapy, Selenium, and Beautiful Soup. I have discussed the most popular web scraping libraries in a detailed manner, but the selection of a library is still a big task. I would suggest the following:

If you are dealing with complex scraping operations that require high speed and low power consumption, then Scrapy would be a great choice.
If you are new to programming and want to work on web scraping projects, you should go for Beautiful Soup. You can learn it easily and perform operations very quickly, up to a certain level of complexity.
When you want to deal with core JavaScript-based web applications and automate the browser with AJAX/PJAX requests, then Selenium would be a great choice.


Building A Concurrent Web Scraper With Python and Selenium

This is a quick post that looks at how to speed up a simple, Python-based web scraping and crawling script with parallel processing via the multiprocessing library. We'll also break down the script itself and show how to test the parsing functionality.

After completing this tutorial you should be able to:

  1. Scrape and crawl websites with Selenium and parse HTML with Beautiful Soup
  2. Set up unittest to test the scraping and parsing functionalities
  3. Set up multiprocessing to execute the web scraper in parallel
  4. Configure headless mode for ChromeDriver with Selenium
Project Setup

Clone down the repo if you'd like to follow along. From the command line run the following commands:

$ git clone git@github.com:calebpollman/web-scraping-parallel-processing.git
$ cd web-scraping-parallel-processing
$ python3.7 -m venv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt

The above commands may differ depending on your environment.
Install ChromeDriver globally. (We're using version 73.0.3683.20).

Script Overview

The script traverses and scrapes the first 20 pages of Hacker News for information about the currently listed articles, using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML.

script.py:

import datetime
from time import sleep, time

from scrapers.scraper import get_driver, connect_to_base, \
    parse_html, write_to_file


def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')


if __name__ == '__main__':
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    browser = get_driver()
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)
        current_page = current_page + 1
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

Let's start with the main-condition block. After setting a few variables, the browser is initialized via get_driver() from scrapers/scraper.py.

if __name__ == '__main__':
    # set variables
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'

    ########
    # here #
    ########
    browser = get_driver()
    # scrape and crawl
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)
        current_page = current_page + 1
    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

A while loop is then configured to control the flow of the overall scraper.

if __name__ == '__main__':
    # set variables
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    browser = get_driver()
    # scrape and crawl

    ########
    # here #
    ########
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)
        current_page = current_page + 1
    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

Within the loop, run_process() is called, which houses the connection and scraping functions.

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

In run_process(), the browser instance and a page number are passed to connect_to_base().

def run_process(page_number, filename, browser):

    ########
    # here #
    ########
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

This function attempts to connect to Hacker News and then uses Selenium's explicit wait functionality to ensure the element with id='hnmain' has loaded before continuing.

def connect_to_base(browser, page_number):
    base_url = f'https://news.ycombinator.com/news?p={page_number}'
    connection_attempts = 0
    while connection_attempts < 3:
        try:
            browser.get(base_url)
            # wait for table element with id = 'hnmain' to load
            # before returning True
            WebDriverWait(browser, 5).until(
                EC.presence_of_element_located((By.ID, 'hnmain'))
            )
            return True
        except Exception as ex:
            connection_attempts += 1
            print(f'Error connecting to {base_url}.')
            print(f'Attempt #{connection_attempts}.')
    return False

Review the Selenium docs for more information on explicit wait.
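
For reference, the names used above (WebDriverWait, EC, By) normally come from the following Selenium imports, which presumably sit at the top of scrapers/scraper.py in the repo; they are shown here only for completeness:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
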
To emulate a human user, sleep(2) is called after the browser has connected to Hacker News.

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):

        ########
        # here #
        ########
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

Once the page has loaded and sleep(2) has executed, the browser grabs the HTML source, which is then passed to parse_html().

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)

        ########
        # here #
        ########
        html = browser.page_source

        ########
        # here #
        ########
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

parse_html() uses Beautiful Soup to parse the HTML, generating a list of dicts with the appropriate data.

def parse_html(html):
    # create soup object
    soup = BeautifulSoup(html, 'html.parser')
    output_list = []
    # parse soup object to get article id, rank, score, and title
    tr_blocks = soup.find_all('tr', class_='athing')
    article = 0
    for tr in tr_blocks:
        article_id = tr.get('id')
        article_url = tr.find_all('a')[1]['href']
        # check if article is a hacker news article
        if 'item?id=' in article_url:
            article_url = f'https://news.ycombinator.com/{article_url}'
        load_time = get_load_time(article_url)
        try:
            score = soup.find(id=f'score_{article_id}').string
        except Exception as ex:
            score = '0 points'
        article_info = {
            'id': article_id,
            'load_time': load_time,
            'rank': tr.span.string,
            'score': score,
            'title': tr.find(class_='storylink').string,
            'url': article_url
        }
        # appends article_info to output_list
        output_list.append(article_info)
        article += 1
    return output_list

This function also passes the article URL to get_load_time(), which loads the URL and records the subsequent load time.

def get_load_time(article_url):
    try:
        # set headers
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        # make get request to article_url
        response = requests.get(
            article_url, headers=headers, stream=True, timeout=3.000)
        # get page load time
        load_time = response.elapsed.total_seconds()
    except Exception as ex:
        load_time = 'Loading Error'
    return load_time

The output is added to a CSV file.

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)

        ########
        # here #
        ########
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

write_to_file():

def write_to_file(output_list, filename):
    for row in output_list:
        with open(filename, 'a') as csvfile:
            fieldnames = ['id', 'load_time', 'rank', 'score', 'title', 'url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow(row)
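
As a side note, write_to_file() reopens the CSV file once per row. A slightly leaner variant (not part of the original repo) opens the file once per call and writes all rows; csv is already imported in that module since the original function uses csv.DictWriter:

def write_to_file(output_list, filename):
    fieldnames = ['id', 'load_time', 'rank', 'score', 'title', 'url']
    # open once per page of results instead of once per row
    with open(filename, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerows(output_list)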

Finally, back in the while loop, the page_number is incremented and the process starts over again.

if __name__ == '__main__':
    # set variables
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    browser = get_driver()
    # scrape and crawl
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)

        ########
        # here #
        ########
        current_page = current_page + 1
    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

Want to test this out? Grab the full script here.
It took about 355 seconds (nearly 6 minutes) to run:

(env)$ python script.py
Scraping page #1...
Scraping page #2...
Scraping page #3...
Scraping page #4...
Scraping page #5...
Scraping page #6...
Scraping page #7...
Scraping page #8...
Scraping page #9...
Scraping page #10...
Scraping page #11...
Scraping page #12...
Scraping page #13...
Scraping page #14...
Scraping page #15...
Scraping page #16...
Scraping page #17...
Scraping page #18...
Scraping page #19...
Scraping page #20...
Elapsed run time: 355.06936597824097 seconds

Keep in mind that there may not be content on all 20 pages, so the elapsed time may differ on your end. This script was run when there was content on 16 pages (461 records).
Got it? Great! Let's add some basic testing.

Testing

To test the parsing functionality without initiating the browser and, thus, making repeated GET requests to Hacker News, you can download the page HTML and parse it locally. This helps avoid scenarios where your IP gets blocked for making too many requests too quickly while writing and testing the parsing function, and it saves you the time of firing up a browser every time you run the script.

test/test_scraper.py:

import unittest

from scrapers.scraper import parse_html


class TestParseFunction(unittest.TestCase):

    def setUp(self):
        with open('test/test.html', encoding='utf-8') as f:
            html = f.read()
            self.output = parse_html(html)

    def tearDown(self):
        self.output = []

    def test_output_is_not_none(self):
        self.assertIsNotNone(self.output)

    def test_output_is_a_list(self):
        self.assertTrue(isinstance(self.output, list))

    def test_output_is_a_list_of_dicts(self):
        self.assertTrue(all(isinstance(elem, dict) for elem in self.output))


if __name__ == '__main__':
    unittest.main()

Ensure all is well:

(env)$ python test/test_scraper.py
...
----------------------------------------------------------------------
Ran 3 tests in 64.225s

OK

64 seconds?! Want to mock get_load_time() to bypass the GET request?

import unittest
from unittest.mock import patch

from scrapers.scraper import parse_html


class TestParseFunction(unittest.TestCase):

    @patch('scrapers.scraper.get_load_time')
    def setUp(self, mock_get_load_time):
        mock_get_load_time.return_value = 'mocked!'
        with open('test/test.html', encoding='utf-8') as f:
            html = f.read()
            self.output = parse_html(html)

    def tearDown(self):
        self.output = []

    def test_output_is_not_none(self):
        self.assertIsNotNone(self.output)

    def test_output_is_a_list(self):
        self.assertTrue(isinstance(self.output, list))

    def test_output_is_a_list_of_dicts(self):
        self.assertTrue(all(isinstance(elem, dict) for elem in self.output))


if __name__ == '__main__':
    unittest.main()

Test:

(env)$ python test/test_scraper.py
...
----------------------------------------------------------------------
Ran 3 tests in 0.423s

OK

Configure Multiprocessing

Now comes the fun part! By making just a few changes to the script, we can speed things up:

import datetime
from itertools import repeat
from time import sleep, time
from multiprocessing import Pool, cpu_count

from scrapers.scraper import get_driver, connect_to_base, \
    parse_html, write_to_file


def run_process(page_number, filename):
    browser = get_driver()
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
        browser.quit()
    else:
        print('Error connecting to hackernews')
        browser.quit()


if __name__ == '__main__':
    start_time = time()
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    with Pool(cpu_count()-1) as p:
        p.starmap(run_process, zip(range(1, 21), repeat(output_filename)))
    p.close()
    p.join()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

With the multiprocessing library, Pool is used to spawn a number of subprocesses based on the number of CPUs available on the system (minus one since the system processes take up a core).

This script was tested on an i7 MacBook Pro with 8 cores.
Run:

(env)$ python script_parallel.py
Elapsed run time: 62.95027780532837 seconds

Check out the completed script here.

Configure Headless ChromeDriver

To speed things up even further we can run Chrome in headless mode by simply updating get_driver() in scrapers/scraper.py:

def get_driver():
    # initialize options
    options = webdriver.ChromeOptions()
    # pass in headless argument to options
    options.add_argument('--headless')
    # initialize driver
    driver = webdriver.Chrome(chrome_options=options)
    return driver

Run:

(env)$ python script_parallel.py
Elapsed run time: 58.14033889770508 seconds

Conclusion

With a small amount of variation from the original code, we were able to configure parallel processing in the script and set up ChromeDriver to run a headless browser, taking the script's run time from around 355 seconds to just over 58 seconds. In this specific scenario that's roughly an 84% reduction in run time, which is a huge improvement.

I hope this helps your scripts. You can find the code in the repo.
