If you want to skip the HTML tag digging and get straight to scraping, here’s the gist. **Note that the scraper looks for an exact match for each item in your wanted list.** Otherwise, read on for a short background on web scraping, when it’s useful to scrape websites, and some challenges you may run into while scraping.
from autoscraper import AutoScraper
## replace with desired url
url = 'https://www.yelp.com/biz/chun-yang-tea-flushing-new-york-flushing'
## make sure that autoscraper can exactly match the items in your wanted_list
wanted_list = ['A review'] ## replace with item(s) of interest
## build the scraper
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
## get similar results, and check which rules to keep
groups = scraper.get_result_similar(url, grouped=True)
groups.keys()
groups['rule_io6e'] ## replace with rule(s) of interest
## keep rules and save the model to disk
scraper.keep_rules(['rule_io6e']) ## pass a list of rule ids; replace with rule(s) of interest
scraper.save('yelp-reviews') ## replace with desired model name
#-------------------------------------------------------------------------
## using the model later (in a new session, create a fresh AutoScraper before loading)
scraper = AutoScraper()
scraper.load('yelp-reviews')
new_url = "" ## replace with desired url
scraper.get_result_similar(new_url)
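
Checking which rules to keep is a manual step in the gist above. As a rough sketch of one way to narrow it down, the snippet below assumes `groups` is a plain dict that maps rule ids to lists of scraped strings, and picks the rules whose results contain a wanted item verbatim. The helper `rules_matching`, the `sample_groups` data, and the rule ids in it are made up for illustration and are not part of AutoScraper.

```python
def rules_matching(groups, wanted_list):
    """Return the rule ids whose results contain at least one wanted item verbatim."""
    wanted = set(wanted_list)
    return [
        rule_id
        for rule_id, results in groups.items()
        if wanted.intersection(results)
    ]


# Hypothetical stand-in for the grouped results; real rule ids and text will differ.
sample_groups = {
    'rule_a': ['A review', 'Another review'],
    'rule_b': ['Write a review', 'Log in'],
    'rule_c': ['A review '],  # trailing space, so this is not an exact match
}

keep = rules_matching(sample_groups, ['A review'])
print(keep)  # ['rule_a']
```

The resulting list could then be passed to `scraper.keep_rules(keep)`. It is still worth eyeballing the matched groups, since a rule that contains your wanted item may also pull in text you do not want.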