Learn how to use Python web scraping to find cheap flights in this guide. We'll cover everything you need to know, from choosing the right tools to scraping flight data from popular websites. By the end of this guide, you'll be able to build your own flight scraper to find the best deals on your next trip.
Do you love data science and traveling? Read on to learn how to combine the two and use Python to find cheap flights! This tutorial shows how to create a web scraping program that searches for cheap airline flight prices and then sends these prices to your …
In this tutorial, I will show you how to use Python to automatically browse a website like Expedia every hour, look up flights for a particular route, and email you the best fare each time.
The end result is this nice email:
We will work as follows:
Let’s get started!
Let’s go ahead and import our libraries:
Selenium (for accessing websites and automation testing):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
Pandas (we will mainly use Pandas for structuring our data):
import pandas as pd
Time and datetime (for adding delays and getting the current time; we will see why later):
import time
import datetime
We need those for connecting to our email and sending our message:
import smtplib
from email.mime.multipart import MIMEMultipart
Note: I will not go too deeply into web scraping with Selenium. For a more detailed introduction, see my previous tutorials on scraping with Selenium and on web scraping in general (Part 1 and Part 2).
# Selenium 4 takes the driver path through a Service object instead of executable_path
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service('/chromedriver'))
This will open an empty browser telling you that this browser is being controlled by automated test software like so:
Next, I will quickly go to Expedia to check the interface and the options available to choose from.
I right-click the ticket type (roundtrip, one way, etc.) and choose Inspect to see the tags related to it.
As we can see below it has a ‘label’ tag with ‘id = flight-type-roundtrip-label-hp-flight’.
Accordingly, I will use those to store the tags and ids for the three different ticket types as follows:
#Setting ticket types paths
return_ticket = "//label[@id='flight-type-roundtrip-label-hp-flight']"
one_way_ticket = "//label[@id='flight-type-one-way-label-hp-flight']"
multi_ticket = "//label[@id='flight-type-multi-dest-label-hp-flight']"
Then I define a function to choose a ticket type:
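def ticket_chooser(ticket):
    try:
        # find_element(By.XPATH, ...) replaces the removed find_element_by_xpath
        ticket_type = browser.find_element(By.XPATH, ticket)
        ticket_type.click()
    except Exception:
        # Ignore a missing or unclickable ticket type and carry on
        pass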
The above sequence is the same sequence I will use for the rest of the code (look for tags and ids or other attributes and define a function to make the choice on the web page).
Below I define a function to choose the departure country.
def dep_country_chooser(dep_country):
    fly_from = browser.find_element(By.XPATH, "//input[@id='flight-origin-hp-flight']")
    time.sleep(1)
    fly_from.clear()
    time.sleep(1.5)
    fly_from.send_keys(' ' + dep_country)
    time.sleep(1.5)
    # First suggestion in the autocomplete dropdown
    first_item = browser.find_element(By.XPATH, "//a[@id='aria-option-0']")
    time.sleep(1.5)
    first_item.click()
I follow the below logic: find the origin input field, clear it, type the departure location, and then click the first suggestion in the dropdown.
Note that I am using time.sleep between steps to give a chance to the page’s elements to update/load between steps. Without time.sleep, sometimes our script acts faster than the page loads and thus tries to access elements that didn’t load yet causing our code to break.
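As an alternative to fixed delays, here is a minimal sketch of a polling helper that waits only as long as needed. The names `wait_until` and `element_is_ready` are hypothetical and not part of the original script; the "element" here is simulated so the example runs without a browser. Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} seconds'.format(timeout))
        time.sleep(poll_interval)


# Stand-in for a page element that only "loads" on the third check.
attempts = {'count': 0}


def element_is_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3


wait_until(element_is_ready, timeout=5, poll_interval=0.01)
print(attempts['count'])
```

In the scraper, the condition could be a small function that tries to locate an element and returns it once found, so the script continues as soon as the page is ready instead of always sleeping for the full delay.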
Let’s do the same for the arrival country.
def arrival_country_chooser(arrival_country):
    fly_to = browser.find_element(By.XPATH, "//input[@id='flight-destination-hp-flight']")
    time.sleep(1)
    fly_to.clear()
    time.sleep(1.5)
    fly_to.send_keys(' ' + arrival_country)
    time.sleep(1.5)
    # First suggestion in the autocomplete dropdown
    first_item = browser.find_element(By.XPATH, "//a[@id='aria-option-0']")
    time.sleep(1.5)
    first_item.click()
Departure date:
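def dep_date_chooser(month, day, year):
    dep_date_button = browser.find_element(By.XPATH, "//input[@id='flight-departing-hp-flight']")
    dep_date_button.clear()
    # The field expects the date as MM/DD/YYYY
    dep_date_button.send_keys(month + '/' + day + '/' + year)
Very straightforward: find the departure date field, clear it, and type the date in MM/DD/YYYY format.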
Return date:
def return_date_chooser(month, day, year):
    return_date_button = browser.find_element(By.XPATH, "//input[@id='flight-returning-hp-flight']")
    # .clear() does not work on this field, so delete its content key by key
    for i in range(11):
        return_date_button.send_keys(Keys.BACKSPACE)
    return_date_button.send_keys(month + '/' + day + '/' + year)
For the return date, clearing whatever was written wasn’t working for some reason (probably because the page autofills this field and does not let me override it with .clear()).
I worked around this by using Keys.BACKSPACE, which simply tells Selenium to press backspace (to delete whatever is written in the date field). I put it in a for loop that presses backspace 11 times to delete all the characters of the date in the field.
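The date functions above take the month, day, and year as strings. As a small sketch, a helper (the name `date_parts` is hypothetical, not from the original script) could produce those zero-padded strings from a Python date. It also shows that a MM/DD/YYYY date is 10 characters long, so 11 backspaces are enough to clear it.

```python
import datetime


def date_parts(date):
    """Split a date into zero-padded (month, day, year) strings."""
    return date.strftime('%m'), date.strftime('%d'), date.strftime('%Y')


month, day, year = date_parts(datetime.date(2019, 4, 1))
date_text = month + '/' + day + '/' + year
print(date_text)       # 04/01/2019
print(len(date_text))  # 10
```

The three values can then be passed straight to the date functions, for example `dep_date_chooser(month, day, year)`.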
Define the function that will click the search button.
def search():
    search_button = browser.find_element(By.XPATH, "//button[@class='btn-primary btn-action gcw-submit']")
    search_button.click()
    # Give the results page time to load fully
    time.sleep(15)
    print('Results ready!')
Here it is better to use a long delay of 15 seconds or so to make sure all results are loaded before we proceed to the next steps.
The resulting webpage is as follows (with the fields I am interested in marked):
We will use this sequence to compile our data:
Below is the code:
df = pd.DataFrame()
def compile_data():
    global df
    global dep_times_list
    global arr_times_list
    global airlines_list
    global price_list
    global durations_list
    global stops_list
    global layovers_list
    #departure times
    dep_times = browser.find_elements(By.XPATH, "//span[@data-test-id='departure-time']")
    dep_times_list = [value.text for value in dep_times]
    #arrival times
    arr_times = browser.find_elements(By.XPATH, "//span[@data-test-id='arrival-time']")
    arr_times_list = [value.text for value in arr_times]
    #airline name
    airlines = browser.find_elements(By.XPATH, "//span[@data-test-id='airline-name']")
    airlines_list = [value.text for value in airlines]
    #prices (keep only the number after the dollar sign)
    prices = browser.find_elements(By.XPATH, "//span[@data-test-id='listing-price-dollars']")
    price_list = [value.text.split('$')[1] for value in prices]
    #durations
    durations = browser.find_elements(By.XPATH, "//span[@data-test-id='duration']")
    durations_list = [value.text for value in durations]
    #stops
    stops = browser.find_elements(By.XPATH, "//span[@class='number-stops']")
    stops_list = [value.text for value in stops]
    #layovers
    layovers = browser.find_elements(By.XPATH, "//span[@data-test-id='layover-airport-stops']")
    layovers_list = [value.text for value in layovers]
    now = datetime.datetime.now()
    current_date = (str(now.year) + '-' + str(now.month) + '-' + str(now.day))
    current_time = (str(now.hour) + ':' + str(now.minute))
    current_price = 'price' + '(' + current_date + '---' + current_time + ')'
    #fill the DataFrame row by row; a missing value in any list is skipped
    for i in range(len(dep_times_list)):
        try:
            df.loc[i, 'departure_time'] = dep_times_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'arrival_time'] = arr_times_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'airline'] = airlines_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'duration'] = durations_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'stops'] = stops_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'layovers'] = layovers_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, str(current_price)] = price_list[i]
        except Exception as e:
            pass
    print('Excel Sheet Created!')
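The per-field try/except blocks above guard against the scraped lists having different lengths. As a rough sketch of another way to handle this, the lists can be lined up with `itertools.zip_longest`, which pads the shorter ones with None. The helper name `build_rows` and the sample values are made up for illustration. Unlike the loop above, which stops at the number of departure times, this version produces as many rows as the longest list.

```python
from itertools import zip_longest


def build_rows(columns):
    """Turn {'column name': [values...]} into a list of row dicts.

    Shorter lists are padded with None, so one missing value does not
    drop the whole row.
    """
    names = list(columns)
    rows = []
    for values in zip_longest(*(columns[name] for name in names)):
        rows.append(dict(zip(names, values)))
    return rows


# Made-up sample data standing in for the scraped lists.
scraped = {
    'departure_time': ['8:00am', '9:30am'],
    'airline': ['Airline A', 'Airline B'],
    'stops': ['1 stop'],  # one value missing on purpose
}
rows = build_rows(scraped)
print(rows[1])
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)` to get one column per key.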
One thing worth mentioning is that I give the price column a new name every time the code runs, using this snippet of code:
now = datetime.datetime.now()
current_date = (str(now.year) + '-' + str(now.month) + '-' + str(now.day))
current_time = (str(now.hour) + ':' + str(now.minute))
current_price = 'price' + '(' + current_date + '---' + current_time + ')'
This is because I want the column header to state the time of that particular run, so that I can later see how the price changes over time.
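As a variation on the snippet above, `strftime` can build the same kind of header in one call. This is only a sketch, and the function name `price_column_name` is hypothetical. Note one difference: `strftime` zero-pads the values, so 9:05 on April 1, 2019 becomes `price(2019-04-01---09:05)`, whereas the original snippet would give `price(2019-4-1---9:5)`. The padded form has the side benefit that the column names sort chronologically as text.

```python
import datetime


def price_column_name(now):
    """Column header in the same shape as above, but zero-padded."""
    return now.strftime('price(%Y-%m-%d---%H:%M)')


print(price_column_name(datetime.datetime(2019, 4, 1, 9, 5)))
# price(2019-04-01---09:05)
```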
In this part I will set up three functions: one to connect to the mail server, one to create the message, and one to send the email.
First, I also need to store my email login credentials in two variables as follows:
#email credentials
username = 'myemail@hotmail.com'
password = 'XXXXXXXXXXX'
def connect_mail(username, password):
    global server
    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
#Create message template for email
def create_msg():
    global msg
    msg = '\nCurrent Cheapest flight:\n\nDeparture time: {}\nArrival time: {}\nAirline: {}\nFlight duration: {}\nNo. of stops: {}\nPrice: {}\n'.format(cheapest_dep_time,
                                                                                                                                                       cheapest_arrival_time,
                                                                                                                                                       cheapest_airline,
                                                                                                                                                       cheapest_duration,
                                                                                                                                                       cheapest_stops,
                                                                                                                                                       cheapest_price)
Here I create the message using placeholders ‘{}’ for the values to be passed in during each run.
Also, the variables used here like cheapest_arrival_time, cheapest_airline, etc. will be defined later when we start running all our functions to hold the values for each particular run.
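The placeholder idea can also be sketched with named fields, so that the message is built from a single dict instead of separate global variables. The names `MESSAGE_TEMPLATE`, `build_message`, and `sample_flight` are hypothetical, and the flight values are made up only to show the shape of the result.

```python
MESSAGE_TEMPLATE = (
    '\nCurrent Cheapest flight:\n\n'
    'Departure time: {departure_time}\n'
    'Arrival time: {arrival_time}\n'
    'Airline: {airline}\n'
    'Flight duration: {duration}\n'
    'No. of stops: {stops}\n'
    'Price: {price}\n'
)


def build_message(flight):
    """Fill the template from a dict describing one flight."""
    return MESSAGE_TEMPLATE.format(**flight)


# Made-up values, only to show the shape of the result.
sample_flight = {
    'departure_time': '8:00am',
    'arrival_time': '6:45pm',
    'airline': 'Airline A',
    'duration': '16h 45m',
    'stops': '1 stop',
    'price': '650',
}
print(build_message(sample_flight))
```

Named placeholders make it harder to pass the values in the wrong order, which is easy to do with six positional `{}` fields.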
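from email.mime.text import MIMEText

def send_email(msg):
    global message
    message = MIMEMultipart()
    message['Subject'] = 'Current Best flight'
    message['From'] = 'myemail@hotmail.com'
    message['To'] = 'myotheremail@hotmail.com'
    # Attach the text body so the headers above are sent together with it
    message.attach(MIMEText(msg, 'plain'))
    server.sendmail('myemail@hotmail.com', 'myotheremail@hotmail.com', message.as_string())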
Now we will finally run our functions. We will use the below logic.
The data scraping part:
The email part:
Finally, we save our DataFrame to an Excel sheet and sleep for 3600 seconds (1 hour).
This loop will run 8 times in one-hour intervals, thus it will run for 8 hours. You can tweak the timing to your preference.
for i in range(8):
    link = 'https://www.expedia.com/'
    browser.get(link)
    time.sleep(5)
    #choose flights only
    flights_only = browser.find_element(By.XPATH, "//button[@id='tab-flight-tab-hp']")
    flights_only.click()
    ticket_chooser(return_ticket)
    dep_country_chooser('Cairo')
    arrival_country_chooser('New york')
    dep_date_chooser('04', '01', '2019')
    return_date_chooser('05', '02', '2019')
    search()
    compile_data()
    #save values for email (the first row is the top result on the page)
    current_values = df.iloc[0]
    cheapest_dep_time = current_values.iloc[0]
    cheapest_arrival_time = current_values.iloc[1]
    cheapest_airline = current_values.iloc[2]
    cheapest_duration = current_values.iloc[3]
    cheapest_stops = current_values.iloc[4]
    #the last column holds the price from the current run
    cheapest_price = current_values.iloc[-1]
    print('run {} completed!'.format(i))
    create_msg()
    connect_mail(username,password)
    send_email(msg)
    print('Email sent!')
    df.to_excel('flights.xlsx')
    time.sleep(3600)
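The loop above takes the first row of the results as the cheapest flight, which assumes the results page is sorted by price. A minimal sketch that picks the lowest price explicitly could look like the following. The helper names `parse_price` and `cheapest_index` are hypothetical, the sample prices are made up, and the sketch assumes whole-dollar price strings like those collected in `price_list`.

```python
def parse_price(text):
    """Convert a price string such as '$1,234' or '1,234' to an int."""
    return int(text.replace('$', '').replace(',', '').strip())


def cheapest_index(price_texts):
    """Return the position of the lowest price in the list."""
    prices = [parse_price(text) for text in price_texts]
    return prices.index(min(prices))


# Made-up sample prices in the order they might appear on the page.
sample_prices = ['1,120', '$980', '1,045']
print(cheapest_index(sample_prices))  # 1
```

With this in place, the row for the email could be selected with something like `df.iloc[cheapest_index(price_list)]` instead of always using row 0.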
Now I will be getting this email every hour for the next 8 hours:
I also have this neat Excel sheet with all the flights and it will keep updating each hour with a new column for the current price:
Now you can take this further by applying so many other ideas such as:
If you have other ideas don’t hesitate to share!
That’s it! I hope you found it useful.