Have you ever heard of web scraping? It is an automated technique used to extract large amounts of data from websites. If you are interested in getting started with web scraping, this tutorial is for you!
Imagine you need to pull a huge amount of data from a particular website. Is it possible to do so without manually visiting each webpage and copying the data? Yes, it is, using a technique called “Web Scraping”.
Web Scraping is an automated technique used to extract large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer. Web Scraping is becoming increasingly popular since the data extracted from the web can serve many different purposes.
Web Scraping has a lot of applications but implementing it can be slightly intimidating, so in this article, I will break down the process in elaborate steps to help you understand it better.
But before we get into that, here are some important points to remember about Web Scraping: always check a website’s Terms of Service and robots.txt file before scraping it, and space out your requests so you do not overload the site’s servers.
To extract data using Web Scraping with Python, you need to follow these basic steps: find the URL you want to scrape, inspect the page to locate the data, write the code to extract it, run the code, and store the data in the required format.
Now, let us implement these steps in an example and see how to extract data from the Flipkart website using Python.
Here are the libraries we will use for Web Scraping: Selenium to automate the browser and load the page, BeautifulSoup to parse the HTML and extract the data, and pandas to store the extracted data in the required format.
Now, let’s get started with the demonstration.
Pre-requisites: Python 3.x with the Selenium, BeautifulSoup (bs4), and pandas libraries installed; the Google Chrome browser; Ubuntu operating system
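Assuming a working Python 3 setup, the three libraries can be installed with pip. Note that the PyPI package for BeautifulSoup is called beautifulsoup4, which provides the bs4 module:

```shell
# Install the libraries used in this tutorial.
# "beautifulsoup4" is the PyPI name for the bs4 module.
pip install selenium beautifulsoup4 pandas
```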
We are going to scrape the Flipkart website to extract the Price, Name, and Rating of laptops; the exact URL appears in the code below.
The data on the website is nested in tags, so we need to inspect the page to see under which tag the data we want to scrape is nested. To inspect, simply right-click on the element and click “Inspect”.
When you click on “Inspect”, a “Browser Inspector Box” will open on your screen.
For this example, let us extract the Price, Name, and Rating, which are nested in “div” tags.
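To see how nested tags map to BeautifulSoup’s find calls, here is a minimal, self-contained sketch using a hard-coded HTML snippet. The class names here (product-card, name, price, rating) are made up for illustration; the real Flipkart class names appear later in the tutorial:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a product card as it might appear in the page source.
html = """
<a href="/laptop-x" class="product-card">
  <div class="name">Laptop X</div>
  <div class="price">₹49,999</div>
  <div class="rating">4.3</div>
</a>
"""

soup = BeautifulSoup(html, 'html.parser')
card = soup.find('a', attrs={'class': 'product-card'})

# Each piece of data is nested in its own <div> inside the anchor tag.
name = card.find('div', attrs={'class': 'name'}).text
price = card.find('div', attrs={'class': 'price'}).text
rating = card.find('div', attrs={'class': 'rating'}).text

print(name, price, rating)
```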
First, create a Python file. For this, open a terminal in Ubuntu and type gedit <your file name> with .py extension.
Let the file name be “web-s”. Now, here is the command:
gedit web-s.py
Now, let’s write our code in this file.
Before that, you need to import all the necessary libraries:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
We have to set the path to chromedriver in order to configure webdriver to use the Chrome browser:
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
Refer to the below code to open the URL:

products = []  # List to store the names of the products
prices = []    # List to store the prices of the products
ratings = []   # List to store the ratings of the products
driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")
Now that we have written the code to open the URL, let’s extract the data from the website. As mentioned earlier, the data we want to extract is nested in tags. So, we have to find the <div> tags with those respective class-names, extract the data and store it in a variable. Refer to the code below:
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)
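Class-based selectors like these break easily when the site changes its markup, and find returns None when a tag is missing (for example, an unrated product), which makes .text raise an AttributeError. Here is a hedged variant of the loop that skips incomplete cards, run against a hard-coded snippet since the live class names change over time (the class names card, name, price, and rating are placeholders):

```python
from bs4 import BeautifulSoup

# Two sample product cards; the second one has no rating.
html = """
<a href="/a" class="card"><div class="name">Laptop A</div><div class="price">₹50,000</div><div class="rating">4.5</div></a>
<a href="/b" class="card"><div class="name">Laptop B</div><div class="price">₹60,000</div></a>
"""

products, prices, ratings = [], [], []
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', href=True, attrs={'class': 'card'}):
    name = a.find('div', attrs={'class': 'name'})
    price = a.find('div', attrs={'class': 'price'})
    rating = a.find('div', attrs={'class': 'rating'})
    # Skip cards where any field is missing instead of crashing on .text
    if name and price and rating:
        products.append(name.text)
        prices.append(price.text)
        ratings.append(rating.text)

print(products)  # only Laptop A has all three fields
```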
Use the below command to run the code:

python web-s.py
After extracting the data, you might want to store it in a particular format. For this example, we will store it in CSV (Comma Separated Values) format. To do this, add the following lines to your code:
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
Now, run the whole code again and you will get a file named “products.csv” which will contain your extracted data.
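As a standalone check of the CSV step, the same DataFrame-to-CSV round trip can be sketched with sample data (the file name sample_products.csv is just for illustration):

```python
import pandas as pd

products = ['Laptop A', 'Laptop B']
prices = ['₹50,000', '₹60,000']
ratings = ['4.5', '4.1']

# Build the table, one row per product, and write it out.
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('sample_products.csv', index=False, encoding='utf-8')

# Read it back to confirm the round trip preserved the data.
check = pd.read_csv('sample_products.csv')
print(check.shape)  # (2, 3)
```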
Python really makes Web Scraping easy because of its easily understandable syntax and its large collection of libraries.
I hope this article was informative and helped you get familiar with the concept of Web Scraping using Python. Now, you can go ahead and try Web Scraping by experimenting with different modules and applications of Python. If you don’t already know this language, why not learn Python this year?
#python