Web scraping has made my life SO MUCH EASIER. Yet the process of actually extracting content from websites that lock their content down with proprietary systems is rarely discussed, which makes it extremely difficult, if not impossible, to reformat that information into a usable format. Over a few years I've found several (nearly) fail-proof techniques to help me out, and now I'd like to pass them on.
I'm going to walk you through the process of converting a web-only book into a PDF. The real goal, though, is to show how you can replicate or adapt this approach for your own circumstances!
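At a high level, the pipeline is: enumerate the book's chapter pages, save each one locally, then stitch the saved HTML into a single PDF. Here's a minimal sketch of that skeleton; the URL pattern, chapter count, and the use of `wkhtmltopdf` are all assumptions for illustration, not the specific site I'm describing.

```python
import pathlib


def chapter_urls(base_url: str, count: int) -> list:
    """Guess chapter URLs from a numeric pattern.

    Assumption: the book paginates as .../chapter-1, .../chapter-2, ...
    Real sites often need a scrape of the table of contents instead.
    """
    return [f"{base_url}/chapter-{i}" for i in range(1, count + 1)]


def build_pdf_command(html_files, output="book.pdf") -> list:
    """Build a wkhtmltopdf invocation (hypothetical tool choice) that
    stitches saved pages into one PDF, in chapter order."""
    return ["wkhtmltopdf", *map(str, html_files), output]


if __name__ == "__main__":
    urls = chapter_urls("https://example.com/book", 3)
    pages = [pathlib.Path(f"chapter-{i}.html") for i in range(1, 4)]
    print(build_pdf_command(pages))
```

You'd run the resulting command with `subprocess.run` once every chapter is saved; keeping the fetch and the PDF step separate means a conversion failure never forces a re-download.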
If you have any other tricks (or even useful scripts) for tasks like these, make sure to let me know, as creating these life-hack scripts is an interesting hobby!
The example I'm outlining comes from a website that provides online-only study guides (to respect their security, I'm excluding specific URLs). Along the way I'll point out several flaws and hiccups that often come up when web scraping!
I've made several mistakes while scraping for limited-access information, and each one cost me large amounts of time and energy, so here they are:
It seems quick and easy to knock out a short script for scraping these websites, but the job is always more difficult than that. Here are the most common hurdles, along with solutions:
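One hurdle worth solving up front: long scrapes crash, and re-downloading everything from scratch wastes time and hammers the site. A simple on-disk cache fixes both. Below is a minimal sketch using only the standard library; the `scrape_cache` directory name and the SHA-256 filename scheme are my own choices, and real scripts would add rate limiting and error handling.

```python
import hashlib
import pathlib
import urllib.request

# Hypothetical cache location; any writable directory works.
CACHE_DIR = pathlib.Path("scrape_cache")


def cached_get(url: str) -> str:
    """Fetch a URL, serving repeat requests from an on-disk cache so a
    crashed run never re-downloads pages it already has."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Hash the URL so it becomes a safe, fixed-length filename.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    path.write_text(body, encoding="utf-8")
    return body
```

The same wrapper idea works if you swap `urllib` for Selenium (cache `driver.page_source` instead of the raw response), which is handy when the site renders its content with JavaScript.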
#web-scraping #data #python #caching #selenium