Why?

Web scraping has made my life SO MUCH EASIER. Yet the process of actually extracting content from websites which lock their content down behind proprietary systems is rarely discussed, which makes it extremely difficult, if not impossible, to reformat that information into a more useful form. Over a few years I’ve found several (nearly) fail-proof techniques to help me out, and now I’d like to pass them on.

I’m going to walk you through the process of converting a web-only book to a PDF. The idea, though, is to highlight how you can replicate or modify the approach for your own circumstances!

If you have any other tricks (or even useful scripts) for tasks like these, make sure to let me know, as creating these life-hack scripts is an interesting hobby!

Reproducibility/Applicability?

The example I’m outlining is from a website which provides online-only study guides (to protect their security I’m excluding specific URLs). Along the way I’ll point out several flaws/hiccups which often come up when web scraping!

Mistakes to Avoid?

I’ve made several mistakes when trying to scrape limited-access information. Each mistake consumed large amounts of time and energy, so here they are:

  • Using AutoHotKey or similar to directly control the mouse/keyboard (this produces dodgy, inconsistent behavior)
  • Loading all pages and then exporting a HAR file (HAR files don’t contain the actual data and take ages to load)
  • Attempting to use GET/HEAD requests (most pages use authorization schemes which aren’t realistically reversible)

Slow Progress

It seems quick and easy to write a short 300-line script to scrape these websites, but it’s always more difficult than that. Here are some potential hurdles, with solutions:

  • Browser profile used by Selenium changing
      • Programmatically find the profile (see the profile sketch after this list)
  • Not knowing how long to wait for a link to load
      • Detect when the link isn’t equal to the current one
      • Or use browser JavaScript (where possible, described more below; see the navigation sketch after this list)
  • Needing to find information about the current web page’s content
      • Look at potential JavaScript functions and URLs (see the page-inspection sketch after this list)
  • Restarting a long script when it fails
      • Reduce the number of lookups for files
      • Copy files to predictable locations
      • Before doing anything complex, check those files (see the caching sketch after this list)
  • Not knowing what a long script is up to
      • Print any necessary output (only for steps which take considerable time and don’t have another progress metric)
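
For the first hurdle, here is a minimal sketch of programmatically finding the browser profile rather than hard-coding its randomly named directory. It assumes Firefox on Linux with the standard ~/.mozilla/firefox layout, and the helper name find_default_firefox_profile is my own, not something from the original script:

```python
import glob
import os

from selenium import webdriver
from selenium.webdriver.firefox.options import Options


def find_default_firefox_profile():
    """Locate the default Firefox profile instead of hard-coding its
    randomly named directory (which changes between installs)."""
    # Assumes the standard Linux profile layout; adjust for Windows/macOS.
    candidates = glob.glob(os.path.expanduser("~/.mozilla/firefox/*.default*"))
    if not candidates:
        raise FileNotFoundError("No Firefox profile found")
    # If several match, take the most recently used one.
    return max(candidates, key=os.path.getmtime)


options = Options()
options.profile = find_default_firefox_profile()  # reuse existing logins/cookies
driver = webdriver.Firefox(options=options)
```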
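
For the link-loading hurdle, a sketch of waiting on navigation by comparing the URL against the one we started on, then asking the browser’s own JavaScript whether the new document has finished loading. click_and_wait_for_navigation is a hypothetical helper, not part of Selenium itself:

```python
from selenium.webdriver.support.ui import WebDriverWait


def click_and_wait_for_navigation(driver, link_element, timeout=30):
    """Click a link, then wait until the browser is on a different URL and
    the new document reports itself fully loaded."""
    old_url = driver.current_url
    link_element.click()
    wait = WebDriverWait(driver, timeout)
    # The link isn't equal to the one we started on any more...
    wait.until(lambda d: d.current_url != old_url)
    # ...and the page's own JavaScript says loading has finished.
    wait.until(lambda d: d.execute_script("return document.readyState") == "complete")
```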
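
For finding information about the current page, one option is to ask the page itself which global JavaScript functions it defines, since proprietary viewers often expose their own navigation/rendering calls. A rough sketch, with a helper name of my own choosing:

```python
def list_page_functions(driver):
    """Return the names of global JavaScript functions defined on the page -
    handy for spotting the viewer's own navigation/rendering calls."""
    return driver.execute_script(
        "return Object.keys(window).filter(k => typeof window[k] === 'function');"
    )
```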
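
And for restarts and progress, a sketch of the caching idea: write each page to a predictable location, skip anything already there on a re-run, and only print for the genuinely slow step. The cache directory name and helper are illustrative assumptions:

```python
import pathlib

CACHE_DIR = pathlib.Path("scraped_pages")  # predictable location for every page
CACHE_DIR.mkdir(exist_ok=True)


def save_page(driver, page_number):
    """Write the current page's HTML under a predictable name so a re-run
    can skip it, printing progress only for the slow scraping step."""
    target = CACHE_DIR / f"page_{page_number:04d}.html"
    if target.exists():
        # Already scraped on a previous run - cheap check, no browser work needed.
        return target
    print(f"page {page_number}: scraping...")
    target.write_text(driver.page_source, encoding="utf-8")
    return target
```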

#web-scraping #data #python #caching #selenium

Life Hack Web Scraping