Web scraping is a favorite pastime of many programmers. It feels like two out of every three projects I get involved with end up needing me to do some web scraping. That said, I have seen a LOT of bad web scraping scripts. Even worse, there are people actually charging for code with these issues.

1. Don’t Hard Code Session Cookies

Stop. Just. Stop. Anything you hard code has the potential to fail miserably. Here is an example of what this can look like, and what you should do instead.

Your client has a site they want to scrape that requires a login. No problem, right? Just log in to the site from a browser, grab the session cookie, and send it every time your code calls the server. Ez pz. What you don’t know is the TTL (time to live) for that session. What if the session expires after one month? Then once you have handed your client their script, you have at most one month before their code is dead in the water.

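# This is bad. Don't do this.
import requests

url_list = ["https://fakewebsite.com/page1"]  # placeholder URLs

# The hard-coded cookie stops working as soon as the session expires.
headers = {
    'Cookie': '_session=23ln4teknl4iowgel'
}
for url in url_list:
    response = requests.get(url, headers=headers)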

So what should you do instead? Code your program to log in, and use a session object so your cookies get sent with every request!

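import requests

login_data = {"username": "...", "password": "..."}  # placeholder; use the site's real form fields

s = requests.Session()
# The session keeps the cookies set by the login response and sends them on later requests.
s.post("https://fakewebsite.com/login", data=login_data)

for url in url_list:
    response = s.get(url)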

It takes just a little extra work, but it will save you from having to constantly update the code.
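A session can still expire partway through a long scrape, so it can help to log in again when that happens. Here is a rough sketch of one way to do it. The function name `fetch_with_relogin` is made up for this example, and it assumes the site answers with a 401 or 403 status once the session is no longer valid. Plenty of sites redirect to the login page instead, so adjust the check to whatever your target actually does. The `session` argument is expected to behave like a `requests.Session`.

```python
LOGIN_URL = "https://fakewebsite.com/login"  # same placeholder URL as in the example above


def fetch_with_relogin(session, url, login_data, login_url=LOGIN_URL):
    """GET a page; if the session looks expired, log in again and retry once."""
    response = session.get(url)
    # Assumption: an expired session shows up as a 401 or 403 response.
    if response.status_code in (401, 403):
        # Log in again so the session picks up a fresh cookie.
        session.post(login_url, data=login_data).raise_for_status()
        # Retry only once, so a broken login cannot loop forever.
        response = session.get(url)
    return response
```

With this in place, the `s.get(url)` call in the scraping loop would become `fetch_with_relogin(s, url, login_data)`.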

If You are Web Scraping Don’t Do These Things