BeautifulSoup: Everything a Data Scientist Should Know

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (the name comes from "tag soup"). It builds a parse tree for each parsed page, which can be used to extract data from HTML and is therefore useful for web scraping. Here we will use Beautiful Soup 4.
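As a minimal sketch of what "parsing malformed markup" looks like in practice, the snippet below feeds Beautiful Soup a fragment whose tags are never closed and then reads text back out of the resulting parse tree. The sample string and variable names are made up for illustration, and the built-in "html.parser" backend is assumed so that no extra parser package is required.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed markup: neither <b> nor <p> is ever closed.
broken_html = "<p>Unclosed paragraph <b>bold text"

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(broken_html, "html.parser")

# The parse tree can still be navigated by tag name.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()

print(bold_text)
print(paragraph_text)
print(soup)  # the markup as Beautiful Soup understood it
```

Printing the soup object is a quick way to see how the parser interpreted the broken input before you start extracting data from it.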

What is Web Scraping?

Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.
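To make the "table format" part concrete, here is a small sketch that turns already-extracted records into CSV text using only the standard library. The rows, field names, and the helper name rows_to_csv are hypothetical placeholders, not data from any real site.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records, standing in for data scraped from a page.
scraped_rows = [
    {"title": "First post", "url": "https://example.com/1"},
    {"title": "Second post", "url": "https://example.com/2"},
]

csv_text = rows_to_csv(scraped_rows, ["title", "url"])
print(csv_text)
```

To save to a local file instead of a string, the same writer can be pointed at a file opened with open("scraped.csv", "w", newline=""), where the file name is again just an example.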

There are mainly two ways to extract data from a website:

1. Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook.

2. Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction.
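The second approach can be sketched roughly as follows: given the HTML of a page, collect the text and target of every link. The helper name extract_links and the sample page are assumptions made for illustration; the HTML is an inline string so the example runs without network access.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


# Hypothetical page content standing in for a downloaded webpage.
sample_html = """
<html><body>
  <h1>Sample page</h1>
  <a href="https://example.com/a">First link</a>
  <a href="https://example.com/b">Second link</a>
  <a>Anchor without href</a>
</body></html>
"""

links = extract_links(sample_html)
for text, href in links:
    print(text, href)
```

For a live page, the html argument would typically come from something like requests.get(url).text, subject to the site's terms of use.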

BeautifulSoup Library’s Advantages & Disadvantages:

This table summarizes the advantages and disadvantages of each parser library.

[Image: table of the advantages and disadvantages of each parser library]
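One practical difference between parser libraries is how they repair invalid markup. The sketch below, which is only an illustration and not a benchmark, parses the same broken fragment with each backend that happens to be installed and skips the ones that are missing. The helper name parse_with_each and the sample fragment are assumptions.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse markup with each backend; None marks a backend that is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is unavailable.
            results[name] = None
    return results


# Invalid fragment: an opening <a> followed by a stray closing </p>.
results = parse_with_each("<a></p>")
for name, repaired in results.items():
    print(name, "->", repaired if repaired is not None else "not installed")
```

Comparing the printed trees side by side shows why it helps to name the parser explicitly rather than rely on whichever one is installed.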

Install The BeautifulSoup Library:

Install the library into your Python environment with the **_pip_** command. Also install the supporting packages: lxml, html5lib, and requests.

pip install lxml
pip install html5lib
pip install beautifulsoup4
pip install requests
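After running the commands above, a quick check like the following can confirm that the packages are importable. This is a small sketch using only the standard library; the helper name installed_packages is made up. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def installed_packages(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# beautifulsoup4 is imported as "bs4"; the other import names match their pip names.
status = installed_packages(["bs4", "lxml", "html5lib", "requests"])
for name, found in status.items():
    print(name, "installed" if found else "missing")
```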
