Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name comes from "tag soup"). It creates a parse tree for each parsed page that can be used to extract data from HTML, which makes it useful for web scraping. This article uses Beautiful Soup 4.
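To make the idea of a parse tree concrete, here is a minimal sketch. The HTML string is invented for illustration, and its final paragraph tag is deliberately left unclosed. The example assumes Python's built-in "html.parser" backend, so nothing beyond beautifulsoup4 itself is needed.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from a real site); the <p> tag is never closed.
html = (
    "<html><body><h1>Fruits</h1>"
    "<ul><li>Apple</li><li>Banana</li></ul>"
    "<p>Unclosed paragraph"
)

# Build the parse tree with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Navigate the tree by tag name and collect text.
heading = soup.h1.get_text()
items = [li.get_text(strip=True) for li in soup.find_all("li")]
paragraph = soup.p.get_text()

print(heading)
print(items)
print(paragraph)
```

Even though the paragraph was never closed, it is still reachable as `soup.p`, because Beautiful Soup closes any open tags when it reaches the end of the document.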

  • What is Web Scraping?

Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in tabular (spreadsheet-like) format.
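As a rough sketch of the "extract and save in table format" step, the snippet below turns an HTML table into CSV text. The table contents and the helper name `table_to_rows` are made up for this example. The CSV is written to an in-memory buffer here; replacing the buffer with `open("fruits.csv", "w", newline="")` would save it to a local file instead.

```python
import csv
import io

from bs4 import BeautifulSoup

# Illustrative HTML table (invented data).
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Banana</td><td>0.50</td></tr>
</table>
"""


def table_to_rows(html_text):
    """Return the table as a list of rows, each a list of cell strings."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = table_to_rows(html)

# Serialize the rows as CSV into an in-memory text buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```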

There are mainly two ways to extract data from a website:

  1. Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.

  2. Access the HTML of the webpage and extract useful information from it. This technique is called web scraping, web harvesting, or web data extraction.
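The second approach can be sketched as two small steps: download the page's HTML, then pull data out of it. The function names `fetch_html` and `extract_links` are hypothetical helpers written for this example, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text; raises on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_links(html_text):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
```

A typical call would be `extract_links(fetch_html("https://example.com"))`, where the URL is only a placeholder. Before scraping a real site, check its terms of use and whether it offers an API.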

  • BeautifulSoup Library’s Advantages & Disadvantages:

This table summarizes the advantages and disadvantages of each parser library.

[Image: table comparing the advantages and disadvantages of each parser library]
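The parser is chosen by the second argument to the `BeautifulSoup` constructor. The sketch below tries "lxml", then "html5lib", then the built-in "html.parser", and uses the first one that is installed. The helper name `parse_with_available` and the preference order are assumptions made for this example, not a recommendation from the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative markup.
html = "<p>Hello, <b>parser</b></p>"


def parse_with_available(html_text, parsers=("lxml", "html5lib", "html.parser")):
    """Return (parser_name, soup) using the first parser that is installed."""
    for name in parsers:
        try:
            return name, BeautifulSoup(html_text, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            continue
    raise RuntimeError("No usable parser found")


parser_name, soup = parse_with_available(html)
bold_text = soup.b.get_text()
print(parser_name, bold_text)
```

Different parsers can build different trees from the same malformed document, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.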

  • Install The BeautifulSoup Library:

This library can be installed in your Python environment with the pip command. Also install the supporting packages lxml, html5lib, and requests.

pip install lxml
pip install html5lib
pip install beautifulsoup4
pip install requests
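After running the commands above, a quick check like the following can confirm which packages are importable. It uses only the standard library. The helper name `check_installed` is made up for this sketch. Note that the beautifulsoup4 package is imported under the module name `bs4`.

```python
import importlib.util

# Map each pip package name to the module name used in import statements.
PACKAGES = {
    "lxml": "lxml",
    "html5lib": "html5lib",
    "beautifulsoup4": "bs4",
    "requests": "requests",
}


def check_installed(packages=PACKAGES):
    """Return {pip_name: True/False} depending on whether the module is found."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in packages.items()
    }


status = check_installed()
for pip_name, ok in status.items():
    print(f"{pip_name}: {'installed' if ok else 'missing'}")
```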

