Data science projects, among others, increasingly need additional data that can be obtained through web scraping, and Google search is a common starting point.
In this guide we will walk through a script that collects links from Google search results.
Let’s start with the imports. To collect links from the top n pages of Google search results, I am using Selenium and BeautifulSoup.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

I am also using the webdriver_manager package, which comes in quite handy. With it there is no need to download a web driver manually if you don’t have one, and you don’t have to hard-code a path to the driver. The package supports most major browsers.
Next, we set some preferences for the web browser. To stop a browser window from popping up when the code runs, I use the ‘--headless’ argument. A handful of other options let you customise the browser for the task at hand.

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")

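We can now start ChromeDriver. Selenium needs to know where the driver executable lives; instead of typing a path by hand, we use the path returned by webdriver_manager’s install() call. In Selenium 4 that path is wrapped in a Service object and the browser preferences are passed with the options keyword (the older positional driver path and the chrome_options keyword were deprecated in Selenium 4 and dropped in later 4.x releases).

    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
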
Once the web driver is set up, we can move on to the main part of the code, where we collect the links from the Google search results.

    from urllib.parse import quote_plus

    # Query to obtain links for
    query = 'comprehensive guide to web scraping in python'
    links = []  # empty list to capture the final results

    # Number of Google result pages to go through; each page contains 10 links
    n_pages = 20

    for page in range(1, n_pages + 1):
        # 'start' is the offset of the first result on the page: 0, 10, 20, ...
        url = "http://www.google.com/search?q=" + quote_plus(query) + "&start=" + str((page - 1) * 10)
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Each result link sits in a div with this class; Google's class names can change
        search = soup.find_all('div', class_="yuRUbf")
        for h in search:
            links.append(h.a.get('href'))

The code takes two inputs: the query of interest and the number of Google result pages to go through. Each page contains 10 search results, which is why the start parameter in the URL advances in steps of 10.
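To make the relationship between the two inputs and the requested URLs concrete, here is a minimal sketch that only builds the URLs, without opening a browser. The helper name `build_search_urls` and the constant `RESULTS_PER_PAGE` are illustrative names of my own, not part of the script above; the base URL and the `q` and `start` parameters are the ones the script already uses.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # taken from the text: each result page holds 10 links


def build_search_urls(query, n_pages, base="http://www.google.com/search"):
    """Return one search URL per result page, first page first (hypothetical helper)."""
    urls = []
    for page in range(n_pages):
        # 'start' is the zero-based offset of the first result on the page.
        params = {"q": query, "start": page * RESULTS_PER_PAGE}
        # urlencode escapes spaces and other special characters in the query.
        urls.append(base + "?" + urlencode(params))
    return urls


urls = build_search_urls("web scraping in python", n_pages=3)
for url in urls:
    print(url)
```

Printing the URLs before handing them to the driver is a cheap way to check that the page count and offsets are what you expect.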
Once the parameters are in place, we load the URL with the Selenium web driver and parse the page source with BeautifulSoup, using html.parser. The page comes back as HTML; you can see the markup behind it by inspecting the page in your browser, which is also how to find the class of the div that wraps each result link (yuRUbf here).
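To see what the extraction step pulls out of the markup, the sketch below runs the same idea on a small hand-written HTML snippet: take the first link inside every div with class yuRUbf. It is a stand-in rather than the script's method: it uses Python's built-in `html.parser` module directly (the parser that BeautifulSoup is asked to use above), so it runs without a browser or third-party packages. The class `ResultLinkParser` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect the first <a href> inside each <div> carrying the container class."""

    def __init__(self, container_class="yuRUbf"):
        super().__init__()
        self.container_class = container_class
        self.links = []
        self._depth = 0          # <div> nesting depth inside a container (0 = outside)
        self._need_link = False  # True until the current container's first link is taken

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth:
                # A nested div inside a container: track it so we know when we leave.
                self._depth += 1
            elif self.container_class in (attrs.get("class") or "").split():
                self._depth = 1
                self._need_link = True
        elif tag == "a" and self._depth and self._need_link and attrs.get("href"):
            self.links.append(attrs["href"])
            self._need_link = False

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


# Hand-written sample; real result pages are far larger and may use other class names.
sample_html = """
<div class="yuRUbf"><a href="https://example.com/first"><h3>First</h3></a></div>
<div class="other"><a href="https://example.com/ignored">Ignored</a></div>
<div class="yuRUbf"><div><a href="https://example.com/second">Second</a></div></div>
"""

parser = ResultLinkParser()
parser.feed(sample_html)
links = parser.links
print(links)
```

Testing the selector on a saved snippet like this makes it easier to notice when the class name on the live page has changed and the script starts returning an empty list.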