Write a Highly Efficient Python Web Crawler


Mark Duan


July 14, 2015

As in my previous blog post, I used a Python web-crawling library to crawl static websites. With Scrapy, you can write a custom downloader middleware, which can deal with dynamic content on a page, such as elements rendered by JavaScript.
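As a rough sketch of what such a middleware can look like (the class name and the Selenium-based rendering here are illustrative, not the exact code from that post), a downloader middleware intercepts each request in process_request and can return a ready-made response:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumRenderMiddleware(object):
    # Illustrative downloader middleware: render each page in a real
    # browser so JavaScript runs before Scrapy parses the response.
    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        self.driver.get(request.url)      # the browser executes the JavaScript
        body = self.driver.page_source    # rendered HTML
        # Returning a response here short-circuits Scrapy's own downloader.
        return HtmlResponse(request.url, body=body, encoding='utf-8')

The middleware is then enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.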

However, Scrapy already implements much of the underlying machinery for us: it uses its own dispatcher, and it runs downloaded pages through a pipeline for parsing. One drawback of relying on such a library is that strange bugs are hard to track down, because the jobs run in parallel.

In this tutorial, I want to show the structure of a simple and efficient web crawler.

First of all, we need a scheduler that can run the jobs in parallel, because most of the crawler's time is spent waiting on requests. I use gevent to schedule the jobs. Gevent is built on a C event loop (libevent in older releases, libev in newer ones) and combines event-based I/O with lightweight coroutines called greenlets to run many jobs concurrently.
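To see the idea in isolation before the full crawler, here is a minimal sketch (the URLs are placeholders, and it assumes Python 2-era urllib2, matching the code below): after monkey-patching, each blocking read yields control to the event loop, so the fetches overlap instead of running one after another.

import gevent
from gevent import monkey
monkey.patch_all()  # make blocking stdlib I/O cooperative

import urllib2

def fetch(url):
    # Looks blocking, but the greenlet yields while waiting on the socket.
    return urllib2.urlopen(url).read()

urls = ['http://example.com', 'http://example.org']
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
print [len(job.value) for job in jobs]  # sizes of the downloaded pages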

Here is the sample code for the crawler:

import gevent
from gevent import monkey
from gevent.queue import Queue
monkey.patch_socket()  # patch sockets so greenlets switch during network I/O

from selenium import webdriver

class WebCrawler:
    def __init__(self, urls=None, num_worker=1):
        self.url_queue = Queue()
        for url in urls or []:
            self.url_queue.put(url)
        self.num_worker = num_worker

    def initialize_an_image_disabled_driver(self):
        # Disable image loading so pages render faster (Firefox preference).
        profile = webdriver.FirefoxProfile()
        profile.set_preference('permissions.default.image', 2)
        return webdriver.Firefox(firefox_profile=profile)

    def worker(self, pid):
        driver = self.initialize_an_image_disabled_driver()  # initialize the webdriver
        # TODO: catch exceptions so one bad URL does not kill the worker
        while not self.url_queue.empty():
            url = self.url_queue.get()
            driver.get(url)
            # collect the script, iframe and img elements from the page
            elems = driver.find_elements_by_xpath("//script | //iframe | //img")
        driver.quit()

    def run(self):
        jobs = [gevent.spawn(self.worker, i) for i in xrange(self.num_worker)]
        gevent.joinall(jobs)
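To run it (the URL list here is just an illustration):

crawler = WebCrawler(urls=['http://example.com/a',
                           'http://example.com/b'],
                     num_worker=4)
crawler.run()

gevent.spawn creates one greenlet per worker, and gevent.joinall blocks until every worker has drained the queue. Because the sockets are monkey-patched, the workers overlap their network waits rather than running sequentially.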

