Web Scraping: is it good or is it evil? Well, it doesn't sit entirely in either camp. It can be used to automate the collection of readily available and accessible data, or of data about yourself stored by third parties. On the other hand, it can be abused by sending a server thousands of requests an hour, or by accessing content that sits behind a paywall. One thing is definite about Web Scraping: it sure is fun!

Generally, the data owners do not like it. Some sites run detection algorithms that look for automated bots and block the associated IP address. Avoiding detection is a part of Web Scraping, but I don't focus on this side of it in my tips (except tip 1!). Obscuring your IP address with a VPN, rotating web drivers, inserting random pauses, performing random clicks on screen and using ActionChains(driver).move_to_element(element).perform() are all techniques that attempt to fool these algorithms (a sketch follows below). But I tend to think that if you need to employ these techniques in your scripts, you've probably already crossed the line. Remember to keep it responsible.
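
To make that concrete, here's a minimal sketch of the pause-and-move idea in Selenium's Python bindings. It's illustrative only: it assumes you already have a working driver instance and a located element.

import random
import time
from selenium.webdriver.common.action_chains import ActionChains

# Assumes `driver` and `element` already exist.
time.sleep(random.uniform(1.0, 4.0))  # random pause to mimic human reading time
actions = ActionChains(driver)
# Move the mouse to the element, hesitate briefly, then click
actions.move_to_element(element).pause(random.uniform(0.2, 0.8)).click().perform()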

1. Access Denied

I had used Selenium for multiple web app projects with no issues before coming across a site that returned the page source as <html><head><title>Access Denied</title></head>. I discovered that we can often get around this by adding an option to the web driver when it is first instantiated.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def __init__(self):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without opening a window
    options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36")
    # chrome_path points at your chromedriver executable
    self.driver = webdriver.Chrome(service=Service(chrome_path), options=options)

The user-agent option adds a request header that makes the request look like it comes from a regular browser rather than an automated script, so the server treats it as an actual user. I now include this option in all my web projects.
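
As a quick sanity check, here's a hedged usage sketch: assuming the __init__ above belongs to a wrapper class (Scraper here is a hypothetical name), you can confirm the block is gone by loading a page and inspecting the title.

scraper = Scraper()  # hypothetical class containing the __init__ above
scraper.driver.get("https://example.com")  # any page that previously blocked you
print(scraper.driver.title)  # should be the real page title, not "Access Denied"
scraper.driver.quit()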

#selenium #python #web-scraping
