Web scraping can be an important tool for data collection. While major social media platforms such as Twitter and Reddit provide APIs for quickly extracting data with existing Python packages, you may sometimes encounter tasks that are difficult to solve through an API. For instance, the Reddit API lets you extract posts and comments from subreddits (online communities on Reddit), but it is hard to get posts and comments by keyword search (you will see more clearly what I mean in the next section). Moreover, not every web page has an API for data extraction. In these cases, manual web scraping becomes the best option. However, many web pages nowadays implement a web-design technique called infinite scrolling: instead of using traditional pagination, the page automatically loads more content when the user scrolls to the bottom. While this is very convenient for users, it makes web scraping harder. In this story, I will show the Python code I developed to auto-scroll web pages, and demonstrate how to use it to scrape URLs on Reddit as an example.


Selenium for Infinite Scroll Web Pages: What Is the Problem?

Let’s say that I want to extract the posts and comments about COVID-19 on Reddit for sentiment analysis. I go to Reddit.com and search for “COVID-19”; the resulting page looks as follows:


Search Results for COVID-19 on Reddit.com (Before Scrolling)

The text highlighted in blue boxes shows the subreddits. Notice that they are all different. Therefore, to get all of these posts through the Reddit API, I would first have to fetch the posts from each subreddit and then write extra code to filter out the posts related to COVID-19. That is a very complicated process, so in this case manual scraping is favored.
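To give a concrete idea of the auto-scrolling approach before diving in, here is a minimal sketch of a Selenium scroll loop (not the exact code from this story; the function name `scroll_to_bottom` and the `pause`/`max_scrolls` parameters are my own illustrative choices). The core idea is to repeatedly scroll to `document.body.scrollHeight` and stop once the page height stops growing, i.e. no new content is being loaded:

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_scrolls=50):
    """Scroll an infinite-scroll page until its height stops growing.

    `driver` is any Selenium WebDriver; `pause` gives the page time
    to load new content after each scroll.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for new content to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: we reached the real bottom
        last_height = new_height
    return last_height
```

With a real browser this would be used as something like `driver = webdriver.Chrome()`, `driver.get(search_url)`, then `scroll_to_bottom(driver)` before extracting the post elements.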


Using Python and Selenium to Scrape Infinite Scroll Web Pages