Scraping Data on the Web with Python's BeautifulSoup Library

Use Python's BeautifulSoup library to assist in the honest act of systematically stealing data without permission.

Whether it be Kaggle, Google Cloud, or the federal government, there's plenty of reliable open-sourced data on the web. While there are plenty of reasons to hate being alive in our current chapter of humanity, open data is one of the few redeeming qualities of life on Earth today. But what is the opposite of "open" data, anyway?

As with anything free and easily accessible, the only data inherently worth anything is either harvested privately or stolen from sources that would prefer you didn't. This is the sort of data business models can be built around, as social media platforms such as LinkedIn have shown us: our personal information is bought and sold by data brokers. These companies attempted to sue individual programmers like ourselves for scraping the data they collected via the same means, and epically lost in a court of law:

LinkedIn Data Scraping Ruled Legal

The topic of scraping data on the web tends to raise questions about the ethics and legality of scraping, to which I plead: don't hold back. If you aren't personally disgusted by the prospect of your life being transcribed, sold, and frequently leaked, the court system has ruled that you legally have a right to scrape data. The name of this publication is not People Who Play It Safe And Slackers. We're a home for those who fight to take power back, and we're going to scrape the shit out of you.

Tools for the Job

Web scraping in Python is dominated by three major libraries: BeautifulSoup, Scrapy, and Selenium. Each of these libraries intends to solve for very different use cases. Thus it's essential to understand what we're choosing and why.

  • BeautifulSoup is one of the most prolific Python libraries in existence, in some part having shaped the web as we know it. BeautifulSoup is a lightweight, easy-to-learn, and highly effective way to programmatically isolate information on a single webpage at a time. It's common to use BeautifulSoup in conjunction with the requests library, where requests will fetch a page, and BeautifulSoup will extract the resulting data.
  • Scrapy has an agenda much closer to mass pillaging than BeautifulSoup. Scrapy is a tool for building crawlers: these are absolute monstrosities unleashed upon the web like a swarm, loosely following links and hastily grabbing data where data exists to be grabbed. Because Scrapy serves the purpose of mass-scraping, it is much easier to get in trouble with Scrapy.
  • Selenium isn't a scraping tool so much as an automation tool that can be used to scrape sites. Selenium is the nuclear option for attempting to navigate sites programmatically, and should be treated as such: there are much better options for simple data extraction.

We'll be using BeautifulSoup, which should genuinely be anybody's default choice until the circumstances ask for more. BeautifulSoup is more than enough to steal data.
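
The fetch-then-extract split described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `SAMPLE_HTML` page, the `fetch_html` and `extract_page_data` helper names, and the choice to pull out the title and links are all assumptions made for the example. It also assumes the `bs4` package is installed (and `requests`, if you call `fetch_html`).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for the HTML that requests would return.
SAMPLE_HTML = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1>Example Page</h1>
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No href here</a>
  </body>
</html>
"""


def fetch_html(url: str) -> str:
    """Fetch a page's raw HTML with requests (needs network access)."""
    import requests  # imported here so the parsing half runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def extract_page_data(html: str) -> dict:
    """Pull the title and every link target out of one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python
    title = soup.title.get_text(strip=True) if soup.title else None
    # href=True skips anchor tags that have no href attribute at all
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return {"title": title, "links": links}


page = extract_page_data(SAMPLE_HTML)
print(page["title"])
print(page["links"])
```

Against a live site, the same extraction would presumably run as `extract_page_data(fetch_html(url))`, with requests handling the network half and BeautifulSoup handling the parsing half.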
