A lot of people at different levels of an organization may need to collect external data from the internet for various reasons: analyzing the competition, aggregating news feeds to follow trends in particular markets, or collecting daily stock prices to build predictive models…

Whether you’re a data scientist or a business analyst, you may find yourself in this situation from time to time, asking the age-old question: how can I possibly extract this website’s data to conduct market analysis?

One free way to extract a website’s data and structure it is scraping.

In this post, you’ll learn about data scraping and how to easily build your first scraper in Python.


P.S. This article accompanies an introductory video tutorial 🎥 I made about data scraping in Python. If you’re interested, you can watch it here.

What is data scraping? 🧹

Let me spare you long definitions.

Broadly speaking, data scraping is the process of extracting a website’s data programmatically and structuring it according to one’s needs. Many companies are using data scraping to gather external data and support their business operations: this is currently a common practice in multiple fields.

What do I need to know to learn data scraping in Python?

Not much. To build small scrapers, you’ll need to be a little familiar with Python and HTML syntax.

To build scalable, industrial-strength scrapers, you’ll need to know one or two frameworks such as Scrapy or Selenium.

Build your first scraper in Python

Set up your environment

Let’s learn how to turn a website into structured data! To do this, you’ll first need to install the following libraries:

  • requests: to send HTTP requests like GET and POST. We’ll mainly use it to fetch the source page of any given website.
  • BeautifulSoup: to parse HTML and XML data very easily
  • lxml: a fast parser that BeautifulSoup can use under the hood to speed up HTML and XML parsing
  • pandas: to structure the data in dataframes and export it in the format of your choice (JSON, Excel, CSV, etc.)

If you’re using Anaconda, you should be good to go: all these packages are already installed. Otherwise, you should run the following commands:

pip install requests
pip install beautifulsoup4
pip install lxml
pip install pandas
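
To see how these four libraries fit together, here’s a minimal sketch of a full fetch-parse-export round trip. The URL and the h2 tag here are just placeholders, not the site we’ll scrape below:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Fetch the raw HTML source of a page (placeholder URL)
response = requests.get("https://example.com")
response.raise_for_status()

# Parse it with BeautifulSoup, using lxml as the underlying parser
soup = BeautifulSoup(response.text, "lxml")

# Extract every <h2> headline as a toy example of "structuring" the data
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Load the results into a dataframe and export them to CSV
pd.DataFrame({"headline": headlines}).to_csv("headlines.csv", index=False)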

To make it easier to follow along with my video tutorial, I also used a Jupyter notebook to keep the process interactive.

What website and data are we going to scrape?

A friend of mine asked me if I could help him scrape this website.

So I decided to turn it into a tutorial.

This website is called Premium Beauty News. It publishes recent trends in the beauty market. If you look at the front page, you’ll see that the articles that we want to scrape are organized in a grid.

Screenshot made by the author — Article headlines

Over multiple pages:

Screenshot made by the author — Pagination: here’s where scraping comes in handy

Of course, we won’t extract only the headline of each article appearing on these pages. We’ll go inside each post and grab everything we need:

the title, the date, the abstract:

Screenshot made by the author

And, of course, the rest of the post’s content.
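
Putting that plan into code, a scraper for this kind of grid-plus-pagination layout could look like the sketch below. Keep in mind that the pagination scheme and every CSS class used here (article-link, date, abstract, content) are illustrative assumptions, not the site’s actual markup; you’d find the real selectors by inspecting the pages with your browser’s developer tools:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://www.premiumbeautynews.com"  # the site's front page

def get_article_links(page_url):
    """Collect the link of every article listed on one grid page."""
    soup = BeautifulSoup(requests.get(page_url).text, "lxml")
    # "article-link" is a hypothetical class name; inspect the page for the real one
    return [urljoin(page_url, a["href"])
            for a in soup.find_all("a", class_="article-link")]

def parse_article(article_url):
    """Go inside one post and grab the title, date, abstract, and content."""
    soup = BeautifulSoup(requests.get(article_url).text, "lxml")
    # All of these selectors are placeholders for the site's actual markup
    return {
        "title": soup.find("h1").get_text(strip=True),
        "date": soup.find("span", class_="date").get_text(strip=True),
        "abstract": soup.find("p", class_="abstract").get_text(strip=True),
        "content": soup.find("div", class_="content").get_text(strip=True),
    }

articles = []
for page in range(1, 4):  # loop over a few paginated grid pages
    page_url = f"{BASE_URL}/?page={page}"  # hypothetical pagination scheme
    for link in get_article_links(page_url):
        articles.append(parse_article(link))

# Structure everything in a dataframe and export it
pd.DataFrame(articles).to_csv("articles.csv", index=False)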

