A lot of people at different levels of an organization may need to collect external data from the internet for various reasons: analyzing the competition, aggregating news feeds to follow trends in particular markets, or collecting daily stock prices to build predictive models…
Whether you’re a data scientist or a business analyst, you may find yourself in this situation from time to time, asking the everlasting question: How can I possibly extract this website’s data to conduct market analysis?
One possible (and free) way to extract a website’s data and structure it is scraping.
In this post, you’ll learn about data scraping and how to easily build your first scraper in Python.
P.S.: This article supports an introductory video tutorial 🎥 I made about data scraping in Python. If you’re interested, you can watch it here.
Let me spare you long definitions.
Broadly speaking, data scraping is the process of extracting a website’s data programmatically and structuring it according to one’s needs. Many companies are using data scraping to gather external data and support their business operations: this is currently a common practice in multiple fields.
Not much. To build small scrapers, you’ll need to be a little bit familiar with Python and HTML syntax.
To build scalable, industrial-grade scrapers, you’ll need to know one or two frameworks such as Scrapy or Selenium.
Let’s learn how to turn a website into structured data! To do this, you’ll first need to install the following libraries: requests, beautifulsoup4, lxml, and pandas.
If you’re using Anaconda, you should be good to go: all these packages are already installed. Otherwise, run the following commands:
pip install requests
pip install beautifulsoup4
pip install lxml
pip install pandas
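To check that the installation worked, you can parse a tiny HTML snippet with BeautifulSoup and the lxml parser. This is just a sanity check, not part of the scraper itself; the snippet and its class name are made up for the test:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to verify that beautifulsoup4 and lxml work together
html = "<html><body><h1 class='title'>Hello, scraping!</h1></body></html>"

# Parse the snippet with the lxml parser we just installed
soup = BeautifulSoup(html, "lxml")

# Locate the <h1> tag by its class and print its text
print(soup.find("h1", class_="title").text)  # Hello, scraping!
```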
To make it easy to follow along with the video tutorial, I also used a Jupyter notebook to keep the process interactive.
A friend of mine asked me if I could help him scrape this website, so I decided to turn it into a tutorial.
The website is called Premium Beauty News. It publishes recent trends in the beauty market. If you look at the front page, you’ll see that the articles we want to scrape are organized in a grid.
Screenshot made by the author — Article headlines
Over multiple pages:
Screenshot made by the author — Pagination: here’s where scraping comes in handy
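Since the articles are spread over several pages, the scraper needs to generate one URL per page and visit them in turn. Here’s a minimal sketch of that idea; the `?page=` query parameter is a hypothetical pattern for illustration, so check the site’s actual pagination links before using it:

```python
# Hypothetical base URL and pagination pattern — inspect the real
# pagination links on Premium Beauty News to find the actual scheme.
base_url = "https://www.premiumbeautynews.com/en/"

def page_urls(n_pages):
    """Build one URL per results page (assumed ?page=N parameter)."""
    return [f"{base_url}?page={page}" for page in range(1, n_pages + 1)]

# The scraper would loop over these URLs and fetch each one with requests
for url in page_urls(3):
    print(url)
```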
Of course, we won’t stop at the headline of each article appearing on these pages. We’ll go inside each post and grab everything we need:
the title, the date, the abstract:
Screenshot made by the author
And, of course, the full content of the post.
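The extraction step itself can be sketched with BeautifulSoup. The HTML below is a made-up stand-in for a Premium Beauty News article page, and all the tag names and classes are assumptions — the real selectors must be read off the site’s actual markup with your browser’s inspector:

```python
from bs4 import BeautifulSoup

# Hypothetical article markup — the real site's tags and classes will differ
html = """
<article>
  <h1 class="article-title">A new trend in the beauty market</h1>
  <span class="article-date">4 June 2021</span>
  <p class="article-abstract">A short summary of the post.</p>
  <div class="article-content"><p>The full body of the article.</p></div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab each field by its (assumed) class name and strip stray whitespace
post = {
    "title": soup.find("h1", class_="article-title").text.strip(),
    "date": soup.find("span", class_="article-date").text.strip(),
    "abstract": soup.find("p", class_="article-abstract").text.strip(),
    "content": soup.find("div", class_="article-content").get_text(strip=True),
}

print(post["title"])
```

In the real scraper, the `html` string would come from `requests.get(url).text` for each post URL, and the resulting dictionaries could be collected into a pandas DataFrame.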