Web scraping is one of the most important techniques for data collection. In Python, BeautifulSoup, Selenium, and **XPath** are among the most important tools for accomplishing web scraping tasks.
In this article, we will focus on BeautifulSoup and how to use it to scrape GDP data from a Wikipedia page. The data we need on this site is in the form of a table.
Take a look at the following image, and then we can go ahead and define the components of an HTML table.
From the above image we can deduce the following:
The `<table>` tag defines an HTML table.
An HTML table consists of one `<table>` element and one or more `<tr>`, `<th>`, and `<td>` elements. The `<tr>` element defines a table row, the `<th>` element defines a table header cell, and the `<td>` element defines a table data cell. An HTML table may also include `<caption>`, `<thead>`, `<tbody>`, and `<tfoot>` elements.
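As a minimal sketch (the sample HTML below is invented for illustration), BeautifulSoup with Python's built-in `html.parser` can pull the header and data cells out of such a table:

```python
from bs4 import BeautifulSoup

# A minimal table: one <table>, two rows (<tr>),
# header cells (<th>) and data cells (<td>).
html = """
<table>
  <tr><th>Country</th><th>GDP (US$ million)</th></tr>
  <tr><td>United States</td><td>25,000,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headers = [th.get_text() for th in soup.find_all("th")]
cells = [td.get_text() for td in soup.find_all("td")]
print(headers)  # ['Country', 'GDP (US$ million)']
print(cells)    # ['United States', '25,000,000']
```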
Our interest is in inspecting the elements of a given site (in this case, the site we want to scrape; the far right of Figure 1 shows the elements of the site). On most computers, you can visit the site and press **Ctrl+Shift+I** to inspect the page you wish to scrape.
Note: Elements of a web page are identified using a class or id attribute on the tag. Ids are unique, but classes are not. This means that a given class can identify more than one web element, while an id identifies one and only one element.
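To make the id-versus-class distinction concrete, here is a small sketch (the HTML and names are invented): `find(id=...)` returns the single matching element, while `find_all(class_=...)` can return several:

```python
from bs4 import BeautifulSoup

html = """
<div id="summary">Unique element</div>
<p class="note">First note</p>
<p class="note">Second note</p>
"""
soup = BeautifulSoup(html, "html.parser")

# An id identifies one and only one element...
summary = soup.find(id="summary")
# ...while a single class can identify several elements.
notes = soup.find_all(class_="note")

print(summary.get_text())  # Unique element
print(len(notes))          # 2
```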
Let's now look at an image of the site we want to scrape.
Fig 3
From this Figure note the following:
class = "wikitable sortable jquery"
Note that the tag element contains three classes identifying one table (classes are separated by whitespace). Apart from serving as general references to page elements, as we use them here, classes and ids are also used as hooks for styling with languages like CSS.

Required packages: bs4, lxml, pandas and requests.
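A useful detail here: BeautifulSoup matches on any one of an element's space-separated classes, so `wikitable` alone is enough to find the table. A small sketch with invented HTML:

```python
from bs4 import BeautifulSoup

# The tag carries three space-separated classes, as on the Wikipedia page.
html = '<table class="wikitable sortable jquery"><tr><td>data</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Matching a single class is enough, even though the tag has three.
table = soup.find("table", class_="wikitable")
print(table["class"])  # ['wikitable', 'sortable', 'jquery']
```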
Once you have installed these packages, we can go through the code.
In this snippet, we import the necessary packages and parse the HTML content of the site.
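Since the snippet itself appears as an image, here is a sketch of what it likely does; the URL and the helper's name are my assumptions, not the author's exact code:

```python
from bs4 import BeautifulSoup

def parse_gdp_table(html):
    """Collect the rows of the first 'wikitable' found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        # A row may contain header cells (<th>), data cells (<td>), or both.
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

# Fetch and parse the live page (URL is an assumption based on the article):
# import requests
# html = requests.get(
#     "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
# ).text
# rows = parse_gdp_table(html)
```

From here, the rows can be handed to pandas for cleaning and analysis.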
#web-scraping #editors-pick #python #html