A Quicker Way to Build Datasets through Web Scraping

If you want to skip the HTML tag digging and get straight to scraping, here’s the gist. **Note that the scraper tries to match each item in your wanted list exactly.** Otherwise, read on for a short background on web scraping, when it’s useful to scrape websites, and some challenges you may encounter while scraping.

	from autoscraper import AutoScraper
	## replace with desired url
	url = 'https://www.yelp.com/biz/chun-yang-tea-flushing-new-york-flushing' 
	## make sure that autoscraper can exactly match the items in your wanted_list 
	wanted_list = ['A review']     ## replace with item(s) of interest

	## build the scraper
	scraper = AutoScraper()
	result = scraper.build(url, wanted_list)

	## get similar results, and check which rules to keep
	groups = scraper.get_result_similar(url, grouped=True)
	groups.keys()
	groups['rule_io6e'] ## replace with rule(s) of interest

	## keep rules and save the model to disk
	scraper.keep_rules(['rule_io6e']) ## pass a list of rule ids; replace with rule(s) of interest
	scraper.save('yelp-reviews')    ## replace with desired model name

	#-------------------------------------------------------------------------
	## using the model later (in a new session, create a scraper before loading)
	scraper = AutoScraper()
	scraper.load('yelp-reviews')
	new_url = ""                    ## replace with desired url
	scraper.get_result_similar(new_url)
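Because the scraper looks for an exact match, it can help to check your wanted items against the page text before building. The following is a minimal sketch of such a pre-check using only the standard library. The `TextCollector` class, the `find_unmatched` helper, and the sample HTML are all hypothetical names and data made up for illustration, and the check only approximates what the library does internally: it compares each wanted item against the stripped text nodes of the page.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the stripped text of every non-empty text node in a page."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def find_unmatched(html, wanted_list):
    """Return the wanted items that do not exactly equal any text node."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    seen = set(collector.texts)
    return [item for item in wanted_list if item not in seen]


# Hypothetical page fragment standing in for a downloaded review page.
sample_html = """
<div class="review"><p>Great bubble tea, short wait.</p></div>
<div class="review"><p>Too sweet for me.</p></div>
"""

# The first item is copied in full; the second is only part of a review.
wanted_list = ["Great bubble tea, short wait.", "Too sweet"]
unmatched = find_unmatched(sample_html, wanted_list)
print(unmatched)
```

Any item reported as unmatched is a candidate for re-copying in full from the page before you pass the list to `scraper.build`.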

#dataset #python #web-scraping #data-collection

towardsdatascience.com
