If you are a data scientist, engineer, analyst, or simply someone who collects data as a hobby, you will often need to build your own dataset, despite the huge number of datasets available on the internet, by scraping the messy, sprawling, wild web. To do so, you need to get familiar with what we call web scraping (also known as crawling or harvesting).

Objective: Using the BeautifulSoup library in Python, build a bot that crawls the names of private universities, along with the URLs of their home pages, in a user-specified country, and saves the results as an xlsx file.

We will be using the following libraries:

```python
# Required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests
from progressbar import ProgressBar
```
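These imports map onto the pipeline: `requests` fetches pages, `BeautifulSoup` parses them, and `pandas` handles the xlsx export at the end. As a minimal sketch of that final export step, assuming the scraped results have been collected as two parallel lists (the names and URLs below are invented sample data, not real scraped output):

```python
import pandas as pd

# Invented sample results standing in for the scraped data.
names = ["University A", "University B"]
urls = ["https://example-uni-a.edu", "https://example-uni-b.edu"]

# Two columns: university name and home page URL.
df = pd.DataFrame({"University": names, "Website": urls})

try:
    # Writing xlsx requires an engine such as openpyxl to be installed.
    df.to_excel("universities.xlsx", index=False)
except ImportError:
    pass  # no xlsx engine available in this environment
```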

How does web scraping work?

When you open your browser and click a page’s link, the browser sends a request to the web server that holds the web page’s files; we call this a **GET** request, since we are getting the page’s files from the server. The server processes the incoming request over HTTP (and several other protocols) and sends back the files required to display the page. The browser then renders the page’s HTML source in an elegant, readable form.
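The round trip above can be sketched end to end. To keep the example self-contained (no external network), it spins up a tiny local HTTP server and sends it a GET request with `requests`; the page content and local address are illustrative assumptions, not the tutorial's target site:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Stand-in page content served by our toy "web server".
PAGE = b"<html><body><h1>Hello</h1></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server answers the GET request with the raw HTML source.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The scraper's GET request: the same thing the browser sends,
# minus the rendering step afterwards.
response = requests.get(f"http://127.0.0.1:{server.server_port}/")
html = response.text

server.shutdown()
```

`response.text` is the raw HTML source that a browser would render; a scraper works with it directly instead.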

In web scraping, we create a **GET** request that mimics the one sent by the browser so we can get the raw HTML source of the page, then we wrangle that source to extract the desired data by filtering HTML tags.
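A minimal sketch of that wrangling step with BeautifulSoup: parse the raw HTML and filter for the tags we want. The snippet below is an invented stand-in for a real page listing universities, not the actual site this tutorial scrapes:

```python
from bs4 import BeautifulSoup

# Invented HTML, shaped like a page listing universities with links.
raw_html = """
<ul>
  <li><a href="https://example-uni-a.edu">University A</a></li>
  <li><a href="https://example-uni-b.edu">University B</a></li>
</ul>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Filter HTML tags: keep every <a>, pulling its text and href.
universities = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
```

Running this leaves `universities` holding (name, URL) pairs, the same shape of data the final bot will write out as an xlsx file.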


Data Scraping Tutorial: an easy project for beginners.