Whether it be Kaggle, Google Cloud, or the federal government, there’s plenty of reliable open-sourced data on the web. While there are plenty of reasons to hate being alive in our current chapter of humanity, open data is one of the few redeeming qualities of life on Earth today. But what is the opposite of “open” data, anyway?
As with anything free and easily accessible, open data rarely carries much value on its own; the data inherently worth anything is either harvested privately or stolen from sources that would prefer you didn’t. This is the sort of data business models can be built around, as social media platforms such as LinkedIn have shown us while our personal information is bought and sold by data brokers. These companies attempted to sue individual programmers like ourselves for scraping the data they themselves collected via the same means, and epically lost in a court of law:
LinkedIn Data Scraping Ruled Legal
The topic of scraping data on the web tends to raise questions about the ethics and legality of scraping, to which I plead: don’t hold back. If you aren’t personally disgusted by the prospect of your life being transcribed, sold, and frequently leaked, the court system has ruled that you legally have a right to scrape data. The name of this publication is not People Who Play It Safe And Slackers. We’re a home for those who fight to take power back, and we’re going to scrape the shit out of you.
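Web scraping in Python is dominated by three major libraries: BeautifulSoup, Scrapy, and Selenium. Each of these libraries targets a very different use case: BeautifulSoup parses HTML you have already fetched, Scrapy is a full framework for crawling many pages, and Selenium drives a real browser for pages that render their content with JavaScript. It’s essential to understand what we’re choosing and why.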
We’ll be using BeautifulSoup, which should genuinely be anybody’s default choice until the circumstances ask for more. BeautifulSoup is more than enough to steal data.
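To give a feel for why BeautifulSoup makes a reasonable default, here is a minimal sketch of parsing a page and pulling out its links. The HTML string, the `SAMPLE_HTML` name, and the `extract_links` helper are all made up for illustration; in a real scraper the markup would come from an HTTP response rather than a hardcoded string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page we have already fetched.
SAMPLE_HTML = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1>Open Data Sources</h1>
    <a href="https://www.kaggle.com">Kaggle</a>
    <a href="https://cloud.google.com">Google Cloud</a>
    <a>No link here</a>
  </body>
</html>
"""


def extract_links(html: str) -> list[tuple[str, str]]:
    """Return (link text, href) pairs for every anchor that has an href."""
    # "html.parser" is Python's built-in parser, so no extra dependency is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)  # skip anchors without an href
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
page_title = soup.title.get_text(strip=True)
links = extract_links(SAMPLE_HTML)

print(page_title)
for text, href in links:
    print(f"{text}: {href}")
```

A handful of lines gets us from raw markup to structured data, with no browser automation or crawling framework involved.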