In the previous tutorials, we worked with Scrapy and Selenium individually. In this tutorial, I will explain why the two are worth combining and how to do it. Let us start with the need to combine Selenium with Scrapy.

Why combine Selenium with Scrapy?

The main drawback of Scrapy is that it cannot natively handle dynamic websites, i.e. websites that use JavaScript (React, Vue, etc.) to render content as and when needed. For example, trying to extract the list of countries from http://openaq.org/#/countries using Scrapy returns an empty list. To demonstrate this, open the Scrapy shell with the command:

scrapy shell https://openaq.org/#/countries

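The resulting shell session is shown below. Both XPath queries find nothing, because the HTML returned by the server contains only an empty app-container div; the country cards are added afterwards by JavaScript, which Scrapy does not execute.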

In [1]: response.xpath('//h1[@class="card__title"]/a/text()').get()
In [2]: response.xpath('//h1[@class="card__title"]/a/text()').getall()
Out[2]: []
In [5]: print(response)
<200 https://openaq.org/>
In [6]: print(response.text)
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
    <meta name="description" content="OpenAQ is a community of scientists, software developers, and lovers of open environmental data" />
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no" />
<title>OpenAQ</title>
<!-- Twitter -->
    <meta name="twitter:card" content="summary" />
    <meta name="twitter:site" content="@OpenAQ" />
    <meta name="twitter:title" content="OpenAQ">
    <meta name="twitter:description" content="OpenAQ is a community of scientists, software developers, and lovers of open environmental data" />
    <meta name="twitter:image:src" content="assets/graphics/meta/default-meta-image.png" />
    <!--/ Twitter -->
<!-- OG -->
    <meta property="og:site_name" content="OpenAQ" />
    <meta property="og:title" content="OpenAQ" />
    <meta property="og:url" content="https://openaq.org/" />
    <meta property="og:type" content="website" />
    <meta property="og:description" content="OpenAQ is a community of scientists, software developers, and lovers of open environmental data" />
    <meta property="og:image" content="assets/graphics/meta/default-meta-image.png" />
    <!--/ OG -->
<link rel="icon" type="image/png" sizes="96x96" href="assets/graphics/meta/favicon.png" />
    <link rel="icon" type="image/png" sizes="192x192" href="assets/graphics/meta/android-chrome.png" />
    <link rel="apple-touch-icon" sizes="180x180" href="assets/graphics/meta/apple-touch-icon.png" />
<link href="https://fonts.googleapis.com/css?family=Oxygen:400,700|Source+Sans+Pro:300,300i,400,400i,700,700i&display=swap" rel="stylesheet">
<link rel="stylesheet" href="/assets/styles/main-e543e76bd1.css">
</head>
  <body>
    <div id="app-container">
      <!-- page -->
    </div>
<script>
      (function(b,o,i,l,e,r){b.GoogleAnalyticsObject=l;b[l]||(b[l]=
      function(){(b[l].q=b[l].q||[]).push(arguments)});b[l].l=+new Date;
      e=o.createElement(i);r=o.getElementsByTagName(i)[0];
      e.src='//www.google-analytics.com/analytics.js';
      r.parentNode.insertBefore(e,r)}(window,document,'script','ga'));
      ga('create','UA-66787377-1');ga('send','pageview');
    </script>
<script src="/assets/scripts/vendor-48750b5367.js"></script>
<script src="/assets/scripts/bundle-44f67f4c5c.js"></script>
</body>
</html>
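
To make the two observations in this output concrete, here is a minimal sketch that uses only the Python standard library rather than Scrapy itself. It shows that the `#/countries` fragment is split off from the URL (fragments are handled by the browser and are not sent to the server, which is why the response is for https://openaq.org/), and that the static HTML contains no `h1` element with class `card__title`. `STATIC_HTML` is a trimmed copy of the response above; `RENDERED_SNIPPET` is a hypothetical piece of markup, inferred only from the XPath expression, standing in for what JavaScript might add after rendering.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

# Trimmed copy of the static response shown above (head and inline scripts omitted).
STATIC_HTML = """<!doctype html>
<html lang="en">
  <body>
    <div id="app-container">
      <!-- page -->
    </div>
    <script src="/assets/scripts/bundle-44f67f4c5c.js"></script>
  </body>
</html>"""

# HYPOTHETICAL markup: an assumption about what a rendered country card
# could look like, based only on the XPath used in the shell session.
RENDERED_SNIPPET = '<h1 class="card__title"><a href="#">Example country</a></h1>'


class CardTitleCounter(HTMLParser):
    """Counts <h1 class="card__title"> start tags, the elements the XPath targets."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Exact class match, mirroring @class="card__title" in the XPath.
        if tag == "h1" and dict(attrs).get("class") == "card__title":
            self.count += 1


def count_card_titles(html):
    """Return how many matching headings appear in the given HTML string."""
    parser = CardTitleCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Split the URL into the part that is requested and the client-side fragment.
url, fragment = urldefrag("https://openaq.org/#/countries")
print(url)
print(fragment)

# The static page has no matching headings; the hypothetical rendered snippet has one.
print(count_card_titles(STATIC_HTML))
print(count_card_titles(RENDERED_SNIPPET))
```

The gap between the two counts is exactly what a browser driven by Selenium is meant to close: it executes the page's scripts so that the rendered markup, rather than the empty container, is available for extraction.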

