Web scraping is a technique for extracting data from websites, and many tools can do it. In this article I want to explain how we can extract data from a website using Scrapy, a Python framework.

We will use Scrapy to scrape data from this search page: https://www.jobstreet.vn/j?sp=search&q=C%C3%B4ng+ngh%E1%BB%87+th%C3%B4ng+tin&l

We will take the URL for each job title, such as Giang vien…, Nhan vien…, and many more. After that, we will extract the data from each of those pages.

Requirements

  1. A basic understanding of Scrapy (https://docs.scrapy.org/en/latest/index.html).
  2. A working knowledge of the Python programming language, especially object-oriented programming (OOP).
  3. A code editor and Python installed on your PC or laptop.
  4. Google Chrome. The browser options mentioned in this article are the ones available in Chrome.

What you will learn

  1. Web crawling with a Scrapy spider.
  2. Scraping by parsing HTML.
  3. Scraping through a JSON API.
  4. Debugging Scrapy in the terminal.

Project’s steps

Here are the project’s steps for scraping the site.

  • Finish reading this article first, then work through the practice.
  • Scrape the main page and get the URLs of all the job titles on it.
  • Scrape every page behind those URLs.
  • Scrape the text on the pages that have the ads-post label.
  • Scrape the text on the pages that have the non-ads-post label.
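The first scraping step above (get the job URLs from the main page) can be sketched with nothing but the Python standard library. This is only an illustration, not the spider itself: the sample HTML, the `job-link` class name, and the `/job/...` paths are made up, since the real jobstreet.vn markup is not shown here. In a Scrapy spider the same extraction would normally be done with selectors on the response.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.jobstreet.vn/"

# Hypothetical listing markup; the real page structure will differ.
SAMPLE_LISTING = """
<div class="job"><a class="job-link" href="/job/101">Giang vien</a></div>
<div class="job"><a class="job-link" href="/job/102">Nhan vien</a></div>
<a href="/about">About</a>
"""


class JobLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry the assumed 'job-link' class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "job-link" in classes and attributes.get("href"):
            self.links.append(attributes["href"])


def extract_job_urls(html, base_url=BASE_URL):
    """Return absolute URLs for every job link found in the listing HTML."""
    parser = JobLinkParser()
    parser.feed(html)
    # Relative hrefs are resolved against the site root before being visited.
    return [urljoin(base_url, href) for href in parser.links]


job_urls = extract_job_urls(SAMPLE_LISTING)
```

Each URL in `job_urls` would then be requested in turn, which is the "scrape every page" step.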

The job titles on the main page fall into two categories: ads-post and non-ads-post. An ads-post is a sponsored job title, and each one is marked with an ad sign.

That’s the point! We can scrape the data from a non-ads-post using the HTML parsing method. But that doesn’t work for an ads-post, because in this case its data can only be obtained using the JSON API method.
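The split between the two methods can be sketched as a small dispatcher: parse HTML for a non-ads-post, decode JSON for an ads-post. The standard library stands in for Scrapy here, and the payload shapes are assumptions (a job title inside `<h1>` for the HTML page, a `title` key in the JSON body), because the real responses are not shown in this article.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Grab the text of the first <h1>, assumed here to hold the job title."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Keep only the first non-empty text found inside an <h1>.
        if self._in_h1 and self.title is None and data.strip():
            self.title = data.strip()


def parse_non_ads_post(html):
    """HTML parsing method: read the title out of the page markup."""
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title, "source": "html"}


def parse_ads_post(body):
    """JSON API method: decode the body and read the (assumed) 'title' key."""
    data = json.loads(body)
    return {"title": data.get("title"), "source": "json"}


def parse_job(body, is_ad):
    """Pick the parsing method based on whether the post is an ads-post."""
    return parse_ads_post(body) if is_ad else parse_non_ads_post(body)
```

Inside a Scrapy callback, the equivalent calls would likely be something like `response.css("h1::text").get()` for the HTML case and `response.json()` for the JSON case, with the selector adjusted to the real page.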

I assume that you have already read or understood the Scrapy theory in the documentation linked under Requirements.

Web Scrapping (HTML Parsing and JSON API) using Python Spider-Scrapy