Building A Concurrent Web Scraper With Python and Selenium

This is a quick post that looks at how to speed up a simple, Python-based web scraping and crawling script with parallel processing via the multiprocessing library. We'll also break down the script itself and show how to test the parsing functionality.

After completing this tutorial you should be able to:

  1. Scrape and crawl websites with Selenium and parse HTML with Beautiful Soup
  2. Set up unittest to test the scraping and parsing functionalities
  3. Set up multiprocessing to execute the web scraper in parallel
  4. Configure headless mode for ChromeDriver with Selenium

Project Setup

Clone down the repo if you'd like to follow along. From the command line run the following commands:

$ git clone git@github.com:calebpollman/web-scraping-parallel-processing.git
$ cd web-scraping-parallel-processing
$ python3.7 -m venv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt

The above commands may differ depending on your environment.

Install ChromeDriver globally. (We're using version 73.0.3683.20.)
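
If you're on Linux, one way to do that (assuming you want the same 73.0.3683.20 release; swap in the build that matches your installed Chrome) is to download the binary and drop it on your PATH:

$ wget https://chromedriver.storage.googleapis.com/73.0.3683.20/chromedriver_linux64.zip
$ unzip chromedriver_linux64.zip
$ sudo mv chromedriver /usr/local/bin/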

Script Overview

The script traverses and scrapes the first 20 pages of Hacker News for information about the currently listed articles, using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML.

script.py:

import datetime
from time import sleep, time

from scrapers.scraper import get_driver, connect_to_base, \
    parse_html, write_to_file


def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')


if __name__ == '__main__':
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    browser = get_driver()
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)
        current_page = current_page + 1
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

Let's start with the main block, guarded by the if __name__ == '__main__' condition. After setting a few variables, the browser is initialized via get_driver() from scrapers/scraper.py.

if __name__ == '__main__':
    # set variables
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'

    ########
    # here #
    ########
    browser = get_driver()
    # scrape and crawl
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)
        current_page = current_page + 1
    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

A while loop is then configured to control the flow of the overall scraper.

if __name__ == '__main__':
    # set variables
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    browser = get_driver()
    # scrape and crawl

    ########
    # here #
    ########
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)
        current_page = current_page + 1
    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

Within the loop, run_process() is called, which houses the connection and scraping functions.

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

In run_process(), the browser instance and a page number are passed to connect_to_base().

def run_process(page_number, filename, browser):

    ########
    # here #
    ########
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

This function attempts to connect to Hacker News and then uses Selenium's explicit wait functionality to ensure the element with id='hnmain' has loaded before continuing.

def connect_to_base(browser, page_number):
    base_url = f'https://news.ycombinator.com/news?p={page_number}'
    connection_attempts = 0
    while connection_attempts < 3:
        try:
            browser.get(base_url)
            # wait for table element with id = 'hnmain' to load
            # before returning True
            WebDriverWait(browser, 5).until(
                EC.presence_of_element_located((By.ID, 'hnmain'))
            )
            return True
        except Exception as ex:
            connection_attempts += 1
            print(f'Error connecting to {base_url}.')
            print(f'Attempt #{connection_attempts}.')
    return False

Review the Selenium docs for more information on explicit wait.
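
For reference, the WebDriverWait, EC, and By helpers used above come from Selenium's support modules, imported in scrapers/scraper.py along the lines of:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
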
To emulate a human user, sleep(2) is called after the browser has connected to Hacker News.

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):

        ########
        # here #
        ########
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

Once the page has loaded and sleep(2) has executed, the browser grabs the HTML source, which is then passed to parse_html().

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)

        ########
        # here #
        ########
        html = browser.page_source

        ########
        # here #
        ########
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

parse_html() uses Beautiful Soup to parse the HTML, generating a list of dicts with the appropriate data.

def parse_html(html):
    # create soup object
    soup = BeautifulSoup(html, 'html.parser')
    output_list = []
    # parse soup object to get article id, rank, score, and title
    tr_blocks = soup.find_all('tr', class_='athing')
    article = 0
    for tr in tr_blocks:
        article_id = tr.get('id')
        article_url = tr.find_all('a')[1]['href']
        # check if article is a hacker news article
        if 'item?id=' in article_url:
            article_url = f'https://news.ycombinator.com/{article_url}'
        load_time = get_load_time(article_url)
        try:
            score = soup.find(id=f'score_{article_id}').string
        except Exception as ex:
            score = '0 points'
        article_info = {
            'id': article_id,
            'load_time': load_time,
            'rank': tr.span.string,
            'score': score,
            'title': tr.find(class_='storylink').string,
            'url': article_url
        }
        # appends article_info to output_list
        output_list.append(article_info)
        article += 1
    return output_list

This function also passes the article URL to get_load_time(), which loads the URL and records the subsequent load time.

def get_load_time(article_url):
    try:
        # set headers
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        # make get request to article_url
        response = requests.get(
            article_url, headers=headers, stream=True, timeout=3.000)
        # get page load time
        load_time = response.elapsed.total_seconds()
    except Exception as ex:
        load_time = 'Loading Error'
    return load_time
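
Note that since stream=True is set, the response body is never downloaded, so response.elapsed effectively measures the time until the response headers arrive rather than a full page load. A quick, hypothetical sanity check (values will vary by site and network):

from scrapers.scraper import get_load_time

print(get_load_time('https://news.ycombinator.com/'))
# prints a float such as 0.2, or the string 'Loading Error' on a timeout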

The output is added to a CSV file.

def run_process(page_number, filename, browser):
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)

        ########
        # here #
        ########
        write_to_file(output_list, filename)
    else:
        print('Error connecting to hackernews')

write_to_file():

def write_to_file(output_list, filename):
    for row in output_list:
        with open(filename, 'a') as csvfile:
            fieldnames = ['id', 'load_time', 'rank', 'score', 'title', 'url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow(row)
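
Since write_to_file() never calls writeheader(), the CSV has no header row; if you want to load the results back in, supply the field names yourself. A small sketch (the filename is just an example):

import csv

fieldnames = ['id', 'load_time', 'rank', 'score', 'title', 'url']
with open('output_20190301120000.csv') as csvfile:  # substitute your own output file
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    for row in reader:
        print(row['rank'], row['title'])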

Finally, back in the while loop, the page_number is incremented and the process starts over again.

if __name__ == '__main__':
    # set variables
    start_time = time()
    current_page = 1
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    browser = get_driver()
    # scrape and crawl
    while current_page <= 20:
        print(f'Scraping page #{current_page}...')
        run_process(current_page, output_filename, browser)

        ########
        # here #
        ########
        current_page = current_page + 1
    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

Want to test this out? Grab the full script here.
It took about 355 seconds (nearly 6 minutes) to run:

(env)$ python script.py
Scraping page #1...
Scraping page #2...
Scraping page #3...
Scraping page #4...
Scraping page #5...
Scraping page #6...
Scraping page #7...
Scraping page #8...
Scraping page #9...
Scraping page #10...
Scraping page #11...
Scraping page #12...
Scraping page #13...
Scraping page #14...
Scraping page #15...
Scraping page #16...
Scraping page #17...
Scraping page #18...
Scraping page #19...
Scraping page #20...
Elapsed run time: 355.06936597824097 seconds

Keep in mind that there may not be content on all 20 pages, so the elapsed time may be different on your end. This script was run when there was content on 16 pages (461 records).
Got it? Great! Let's add some basic testing.

Testing

To test the parsing functionality without initiating the browser and, thus, making repeated GET requests to Hacker News, you can download the page HTML and parse it locally. This helps you avoid getting your IP blocked for making too many requests too quickly while writing and testing the parsing function, and it saves time by not firing up a browser on every run.
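
The test below reads a saved copy of the page from test/test.html. If you ever need to (re)generate that file, a one-off sketch like this (not part of the repo) would do it:

# save page 1 of Hacker News for local testing
from scrapers.scraper import connect_to_base, get_driver

browser = get_driver()
if connect_to_base(browser, 1):
    with open('test/test.html', 'w', encoding='utf-8') as f:
        f.write(browser.page_source)
browser.quit()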

test/test_scraper.py:

import unittest

from scrapers.scraper import parse_html


class TestParseFunction(unittest.TestCase):

    def setUp(self):
        with open('test/test.html', encoding='utf-8') as f:
            html = f.read()
            self.output = parse_html(html)

    def tearDown(self):
        self.output = []

    def test_output_is_not_none(self):
        self.assertIsNotNone(self.output)

    def test_output_is_a_list(self):
        self.assertTrue(isinstance(self.output, list))

    def test_output_is_a_list_of_dicts(self):
        self.assertTrue(all(isinstance(elem, dict) for elem in self.output))


if __name__ == '__main__':
    unittest.main()

Ensure all is well:

(env)$ python test/test_scraper.py
...
----------------------------------------------------------------------
Ran 3 tests in 64.225s

OK

64 seconds?! Want to mock get_load_time() to bypass the GET request?

import unittest
from unittest.mock import patch

from scrapers.scraper import parse_html


class TestParseFunction(unittest.TestCase):

    @patch('scrapers.scraper.get_load_time')
    def setUp(self, mock_get_load_time):
        mock_get_load_time.return_value = 'mocked!'
        with open('test/test.html', encoding='utf-8') as f:
            html = f.read()
            self.output = parse_html(html)

    def tearDown(self):
        self.output = []

    def test_output_is_not_none(self):
        self.assertIsNotNone(self.output)

    def test_output_is_a_list(self):
        self.assertTrue(isinstance(self.output, list))

    def test_output_is_a_list_of_dicts(self):
        self.assertTrue(all(isinstance(elem, dict) for elem in self.output))


if __name__ == '__main__':
    unittest.main()

Test:

(env)$ python test/test_scraper.py
...
----------------------------------------------------------------------
Ran 3 tests in 0.423s

OK

Configure Multiprocessing

Now comes the fun part! By making just a few changes to the script, we can speed things up:

import datetime
from itertools import repeat
from time import sleep, time
from multiprocessing import Pool, cpu_count

from scrapers.scraper import get_driver, connect_to_base, \
    parse_html, write_to_file


def run_process(page_number, filename):
    browser = get_driver()
    if connect_to_base(browser, page_number):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
        browser.quit()
    else:
        print('Error connecting to hackernews')
        browser.quit()


if __name__ == '__main__':
    start_time = time()
    output_timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    output_filename = f'output_{output_timestamp}.csv'
    with Pool(cpu_count()-1) as p:
        p.starmap(run_process, zip(range(1, 21), repeat(output_filename)))
    p.close()
    p.join()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f'Elapsed run time: {elapsed_time} seconds')

With the multiprocessing library, Pool is used to spawn a number of subprocesses based on the number of CPUs available on the system (minus one since the system processes take up a core).
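
To see exactly what starmap is fed, here's a quick illustration using a placeholder filename and just three pages; each tuple is unpacked into one run_process(page_number, filename) call:

from itertools import repeat
from multiprocessing import cpu_count

print(cpu_count())  # e.g. 8, so Pool(cpu_count() - 1) spawns 7 workers

print(list(zip(range(1, 4), repeat('output.csv'))))
# [(1, 'output.csv'), (2, 'output.csv'), (3, 'output.csv')]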

This script was tested on an i7 MacBook Pro with 8 cores.
Run:

(env)$ python script_parallel.py
Elapsed run time: 62.95027780532837 seconds

Check out the completed script here.

Configure Headless ChromeDriver

To speed things up even further, we can run Chrome in headless mode by updating get_driver() in scrapers/scraper.py:

def get_driver():
    # initialize options
    options = webdriver.ChromeOptions()
    # pass in headless argument to options
    options.add_argument('--headless')
    # initialize driver
    driver = webdriver.Chrome(chrome_options=options)
    return driver
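
One caveat: on Selenium releases newer than the one pinned in requirements.txt, the chrome_options keyword is deprecated in favor of options, in which case the driver would be created with:

driver = webdriver.Chrome(options=options)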

Run:

(env)$ python script_parallel.py
Elapsed run time: 58.14033889770508 seconds

Conclusion

With a small amount of variation from the original code, we were able to configure parallel processing in the script and set up ChromeDriver to run a headless browser, taking the script's run time from around 355 seconds to just over 58 seconds. In this specific scenario that's roughly 84% faster, which is a huge improvement.

I hope this helps with your own scripts. You can find the code in the repo.
