Web Scraping with Python

Of course, we won’t be able to cover every aspect of every tool we discuss, but this post should be enough to give you a good idea of which tool does what, and when to use which.

Note: when I talk about Python in this post, you should assume that I mean Python 3.

Table of Contents:

  • Web Fundamentals
  • Manually opening a socket and sending the HTTP request
  • urllib3 & LXML
  • requests & BeautifulSoup
  • Scrapy
  • Selenium & Chrome --headless
  • Conclusion

Web Fundamentals

The internet is really complex: there are many underlying technologies and concepts involved in viewing a simple web page in your browser. I won’t pretend to explain everything, but I will show you the most important things you have to understand in order to extract data from the web.

HyperText Transfer Protocol

HTTP uses a client/server model, where an HTTP client (a browser, your Python program, curl, Requests…) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache…).

Then the server answers with a response (the HTML code, for example) and closes the connection. HTTP is called a stateless protocol because each transaction (request/response) is independent. FTP, for example, is stateful.

Basically, when you type a website address in your browser, the HTTP request looks like this:

GET /product/ HTTP/1.1
Host: example.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/web\
Accept-Encoding: gzip, deflate, sdch, br
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit\
/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36

In the first line of this request, you can see several things:

  • The GET verb or method being used, meaning we request data from the specific path: /product/. There are other HTTP verbs.
  • The version of the HTTP protocol. In this tutorial we will focus on HTTP 1.

The remaining lines are header fields.
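To make the anatomy of a request concrete, here is a minimal sketch that splits a raw request into its first line and its header fields. The function name parse_request is made up for this illustration, and the parsing is deliberately naive (no body, no multi-line headers).

```python
def parse_request(raw_request):
    """Split a raw HTTP request into (method, path, version, headers)."""
    # The head of the request ends at the first blank line
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")

    # First line: "<method> <path> <version>"
    method, path, version = lines[0].split(" ", 2)

    # Remaining lines: "Name: value" header fields
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, path, version, headers


raw = (
    "GET /product/ HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
method, path, version, headers = parse_request(raw)
print(method, path, version)  # GET /product/ HTTP/1.1
print(headers["Host"])        # example.com
```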

Here are the most important header fields:

  • Host: The domain name of the server. If no port number is given, the default port is assumed (80 for HTTP).
  • User-Agent: Contains information about the client originating the request, including the OS information. In this case, it is my web browser (Chrome) on OSX. This header is important because it is used either for statistics (how many users visit my website on mobile vs desktop) or to prevent violations by bots. Because headers are sent by the client, they can be modified (this is called “Header Spoofing”), and that is exactly what we will do to make our scrapers look like a normal web browser.
  • Accept: The content types that are acceptable as a response. There are lots of different content types and sub-types: text/plain, text/html, image/jpeg, application/json…
  • Cookie: name1=value1;name2=value2… This header field contains a list of name-value pairs. These are called session cookies and are used to store data. Cookies are what websites use to authenticate users and/or store data in your browser. For example, when you fill in a login form, the server checks whether the credentials you entered are correct. If so, it redirects you and sets a session cookie in your browser. Your browser then sends this cookie with every subsequent request to that server.
  • Referer: The Referer header (the name is misspelled in the HTTP specification itself) contains the URL from which the current URL was requested. This header is important because websites use it to change their behavior based on where the user came from. For example, lots of news websites have a paid subscription and let you view only 10% of a post, but if the user came from a news aggregator like Reddit, they let you view the full content. They use the referrer to check this. Sometimes we will have to spoof this header to get to the content we want to extract.
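Since headers are just text chosen by the client, spoofing them amounts to writing different values. The sketch below serializes a request with hand-picked headers. build_request is a hypothetical helper and every header value is a placeholder, so treat it as an illustration rather than a recipe.

```python
def build_request(path, headers):
    """Serialize a GET request line and header fields into raw HTTP bytes."""
    lines = [f"GET {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # A blank line marks the end of the header section
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# All values below are placeholders chosen for illustration
spoofed_headers = {
    "Host": "example.com",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)",
    "Accept": "text/html",
    # On the wire the header name is spelled "Referer"
    "Referer": "https://www.reddit.com/",
    "Cookie": "name1=value1; name2=value2",
}
raw_request = build_request("/product/", spoofed_headers)
print(raw_request.decode("ascii"))
```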

And the list goes on…you can find the full header list here.

A server will respond with something like this:

HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<meta charset="utf-8" /> ...[HTML CODE]

On the first line, we have a new piece of information, the HTTP code 200 OK. It means the request has succeeded. As with the request headers, there are lots of HTTP codes. They are split into classes, the four most common being 2XX for successful requests, 3XX for redirects, 4XX for bad requests (the most famous being 404 Not Found), and 5XX for server errors.

Then, if you are sending this HTTP request with your web browser, the browser will parse the HTML code, fetch any assets it references (JavaScript files, CSS files, images…) and render the result in the main window.
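As a small illustration of these classes, the sketch below reads a status line and maps the code to its class using the first digit. Both function names are made up for this example.

```python
def parse_status_line(status_line):
    """Return (version, code, reason) from a line such as 'HTTP/1.1 200 OK'."""
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


def status_class(code):
    """Map a status code to its class, based on the first digit."""
    classes = {
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "other")


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")
print(code, reason, "->", status_class(code))
```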

In the next parts we will see the different ways to perform HTTP requests with Python and extract the data we want from the responses.

Manually opening a socket and sending the HTTP request 


The most basic way to perform an HTTP request in Python is to open a socket and manually send the HTTP request.

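import socket

HOST = 'www.google.com'  # Server hostname or IP address
PORT = 80  # Port

client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (HOST, PORT)
client_socket.connect(server_address)

request_header = b'GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n'
client_socket.sendall(request_header)

response = ''
while True:
    recv = client_socket.recv(1024)
    if not recv:
        # An empty read means the server has closed the connection
        break
    response += recv.decode('utf-8', errors='ignore')

print(response)
client_socket.close()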
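The text collected from the socket still contains the status line and the headers in front of the HTML. A minimal sketch of how it could be split is shown below. split_response is a hypothetical helper and the sample response is made up, so the real one from the server will be longer.

```python
def split_response(raw_response):
    """Split a decoded HTTP response into (status_line, headers, body)."""
    # Headers and body are separated by the first blank line
    head, _, body = raw_response.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")

    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# A shortened, made-up response standing in for what the socket returned
sample = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<!DOCTYPE html><html><body>Hello</body></html>"
)
status_line, headers, body = split_response(sample)
print(status_line)
print(headers["content-type"])
print(body)
```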


Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions.

Regular Expressions

A regular expression (RE, or Regex) is a search pattern for strings. With regex, you can search for a particular character/word inside a bigger body of text.

For example, you could identify all phone numbers inside a web page. You can also replace items: for example, you could replace all uppercase tags in poorly formatted HTML with lowercase ones. You can also validate inputs…
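Here is a short sketch of both ideas, searching and replacing. The sample markup and the phone number format (three groups of digits separated by dashes) are assumptions made for the example.

```python
import re

page = '<P>Call us at 555-123-4567 or 555-987-6543.</P><BR>'

# Find every phone number written as three groups of digits (an assumed format)
phone_numbers = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', page)

# Lowercase the tag name in every opening or closing tag
lowercased = re.sub(
    r'<(/?)([A-Z][A-Z0-9]*)',
    lambda m: f'<{m.group(1)}{m.group(2).lower()}',
    page,
)

print(phone_numbers)
print(lowercased)
```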

The pattern used by the regex is applied from left to right. Each source character is only used once. You may be wondering why it is important to know about regular expressions when doing web scraping.

After all, there are all kinds of Python modules to parse HTML with XPath or CSS selectors.

In an ideal semantic world, data is easily machine-readable, and the information is embedded inside relevant HTML elements with meaningful attributes.

But the real world is messy: you will often find huge amounts of text inside a p element. When you want to extract a specific piece of data from this text, for example a price, a date or a name, you will have to use regular expressions.

Note: Here is a great website to test your regex: https://regex101.com/ and one awesome blog to learn more about them. This post will only cover a small fraction of what you can do with regular expressions.

Regular expressions can be useful when you have this kind of data:

<p>Price : 19.99$</p>

We could select this text node with an XPath expression, and then use this kind of regex to extract the price:
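A minimal sketch of such a regex is shown below. It assumes the price always has two decimals and a trailing dollar sign. The pattern is an illustration and not necessarily the exact one this post originally used.

```python
import re

text_node = 'Price : 19.99$'

# \s* tolerates the spaces around the colon; the group captures the amount
price_pattern = re.compile(r'Price\s*:\s*(\d+\.\d{2})\$')

match = price_pattern.search(text_node)
if match:
    price = float(match.group(1))
    print(price)
```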


To extract the text inside an HTML tag, it is annoying to use a regex, but doable:

import re

html_content = '<p>Price : 19.99$</p>'

m = re.match('<p>(.+)</p>', html_content)
if m:
    # group(1) holds the text captured between the tags
    print(m.group(1))

As you can see, manually sending the HTTP request with a socket and parsing the response with regular expressions can be done, but it’s complicated, and there are higher-level APIs that can make this task easier.

urllib3 & LXML

Disclaimer: It is easy to get lost in the urllib universe in Python. Python 2 had urllib and urllib2 in the standard library. In Python 3, urllib2 was split into multiple modules under the urllib package. You can also find urllib3, a third-party package that should not become part of the standard library anytime soon. This whole confusing situation deserves a blog post of its own. In this part, I’ve chosen to talk only about urllib3, as it is widely used in the Python world, by pip and Requests to name only two.

urllib3 is a high-level package that allows you to do pretty much whatever you want with an HTTP request. It lets us do what we did above with socket in far fewer lines of code.

import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')

Much more concise than the socket version. Not only that, but the API is straightforward and you can do many things easily, like adding HTTP headers, using a proxy, POSTing forms …

For example, had we decided to set some headers and to use a proxy, we would only have to do this:

import urllib3
user_agent_header = urllib3.make_headers(user_agent='<USER AGENT>')
pool = urllib3.ProxyManager('<PROXY IP>', headers=user_agent_header)
r = pool.request('GET', 'https://www.google.com/')

See? Exactly the same number of lines. However, there are some things that urllib3 does not handle very easily. For example, if we want to add a cookie, we have to manually create the corresponding header and add it to the request.

There are also things that urllib3 can do that Requests can’t, for example creating and managing connection pools and proxy pools, or controlling the retry strategy.

To put it simply, urllib3 sits between Requests and socket in terms of abstraction, although way closer to Requests than to socket.
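To show what “manually” means for cookies, here is a small sketch. build_cookie_header and fetch_with_cookies are hypothetical helpers and the cookie names are placeholders. The second function is only defined, not called, so nothing is sent over the network.

```python
def build_cookie_header(cookies):
    """Serialize a dict of cookies into the value of a Cookie header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def fetch_with_cookies(url, cookies):
    """Send a GET request with a hand-built Cookie header (not called here)."""
    import urllib3  # imported lazily so the helper above works without it

    http = urllib3.PoolManager()
    headers = {"Cookie": build_cookie_header(cookies)}
    return http.request("GET", url, headers=headers)


# Placeholder cookie names and values
cookie_value = build_cookie_header({"name1": "value1", "name2": "value2"})
print(cookie_value)
```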

This time, to parse the response, we are going to use the lxml package and XPath expressions.


XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or HTML document). Like the Document Object Model, XPath has been a W3C standard since 1999. Even if XPath is not a programming language in itself, it allows you to write expressions that directly access a specific node, or a specific node-set, without having to go through the entire HTML tree (or XML tree).

Think of XPath as regexp, but specifically for XML/HTML.

To extract data from an HTML document with XPath we need 3 things:

  • an HTML document
  • some XPath expressions
  • an XPath engine that will run those expressions

To begin, we will use the HTML that we got thanks to urllib3. We just want to extract all the links from the Google homepage, so we will use one simple XPath expression, //a, and we will use LXML to run it. LXML is a fast and easy-to-use XML and HTML processing library that supports XPath.


pip install lxml

Below is the code that comes just after the previous snippet:

from lxml import html

# We reuse the response from urllib3
data_string = r.data.decode('utf-8', errors='ignore')

# We instantiate a tree object from the HTML
tree = html.fromstring(data_string)

# We run the XPath against this HTML
# This returns a list of elements
links = tree.xpath('//a')
for link in links:
    # For each element we can easily get back the URL
    print(link.get('href'))

The output is the href of each link found on the page, one per line.



You have to keep in mind that this example is really simple and doesn’t show you how powerful XPath can be (note: this XPath expression could have been changed to //a/@href to avoid having to iterate over the links to get their href).

XPath expressions, like regexp, are really powerful and one of the fastest ways to extract information from HTML. And like regexp, XPath can quickly become messy, hard to read and hard to maintain.
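For reference, here is what the //a/@href variant could look like. To keep the sketch runnable offline it parses a small, made-up document instead of the Google homepage.

```python
from lxml import html

# A small, made-up document standing in for a downloaded page
document = """
<html><body>
  <a href="https://example.com/about">About</a>
  <a href="/contact">Contact</a>
  <a>No link here</a>
</body></html>
"""

tree = html.fromstring(document)

# Selecting the attribute directly returns strings, not elements
hrefs = tree.xpath('//a/@href')
print(hrefs)
```

Anchors without an href attribute are simply skipped, so there is no need to filter out empty values afterwards.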

Requests & BeautifulSoup

Requests is the king of Python packages. With more than 11 000 000 downloads, it is the most widely used package for Python.


pip install requests

Making a request with Requests (no comment) is really easy:

import requests

r = requests.get('https://www.scrapingninja.co')

With Requests it is easy to perform POST requests, handle cookies, query parameters…
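The sketch below shows how query parameters, cookies and form data are expressed with Requests. It only prepares the requests without sending them, so you can inspect what would go on the wire. The URLs and values are placeholders.

```python
import requests

# Build the requests without sending them
get_request = requests.Request(
    'GET',
    'https://example.com/search',        # placeholder URL
    params={'q': 'python', 'page': 2},   # becomes the query string
    cookies={'name1': 'value1'},         # becomes a Cookie header
).prepare()

post_request = requests.Request(
    'POST',
    'https://example.com/login',              # placeholder URL
    data={'acct': 'alice', 'pw': 'secret'},   # sent as a form-encoded body
).prepare()

print(get_request.url)
print(get_request.headers['Cookie'])
print(post_request.body)
```

In real code you would usually call requests.get(url, params=...) or requests.post(url, data=...) directly, which builds and sends the request in one step.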

Authentication to Hacker News

Let’s say we want to create a tool to automatically submit our blog post to Hacker News or any other forum, like Buffer does. We would need to authenticate to those websites before posting our link. That’s what we are going to do with Requests and BeautifulSoup!

Here is the Hacker News login form and the associated DOM:

There are three <input> tags on this form, the first one has a type hidden with a name “goto” and the two others are the username and password.

If you submit the form inside your Chrome browser, you will see that there is a lot going on: a redirect happens and a cookie is set. This cookie will be sent by Chrome on each subsequent request so that the server knows that you are authenticated.

Doing this with Requests is easy: it handles redirects automatically for us, and cookies can be handled with the Session object.

The next thing we will need is BeautifulSoup, which is a Python library that will help us parse the HTML returned by the server, to find out if we are logged in or not.


pip install beautifulsoup4

So all we have to do is to POST these three inputs with our credentials to the /login endpoint and check for the presence of an element that is only displayed once logged in:

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://news.ycombinator.com'
USERNAME = ''  # your Hacker News username
PASSWORD = ''  # your Hacker News password

# The Session object keeps the cookies between requests
s = requests.Session()

data = {'goto': 'news', 'acct': USERNAME, 'pw': PASSWORD}
r = s.post(f'{BASE_URL}/login', data=data)

soup = BeautifulSoup(r.text, 'html.parser')
if soup.find(id='logout') is not None:
    print('Successfully logged in')
else:
    print('Authentication Error')

In order to learn more about BeautifulSoup, we could try to extract every link on the homepage.

By the way, Hacker News offers a powerful API, so we’re doing this as an example, but you should use the API instead of scraping it!

The first thing we need to do is to inspect the Hacker News home page to understand the structure and the different CSS classes that we will have to select:

We can see that all posts are inside a <tr class="athing">, so the first thing we will need to do is to select all these tags. This can be easily done with:

links = soup.find_all('tr', class_='athing')

Then for each link, we will extract its id, title, url and rank:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.find_all('tr', class_='athing')

formatted_links = []

for link in links:
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        'url': link.find_all('td')[2].a['href'],
        # Use the current row, not links[0], so each post gets its own rank
        'rank': int(link.td.span.text.replace('.', '')),
    }
    formatted_links.append(data)

print(formatted_links)


As you saw, Requests and BeautifulSoup are great libraries to extract data and automate different things by posting forms. If you want to do large-scale web scraping projects, you could still use Requests, but you would need to handle lots of things yourself.

When you need to scrape lots of web pages, there are many things you have to take care of:

  • finding a way of parallelizing your code to make it faster
  • handling errors
  • storing results
  • filtering results
  • throttling your requests so you don’t overload the server
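To give an idea of what handling these things yourself involves, here is a rough sketch using only the standard library. The fetch function is a stand-in that never touches the network, and the URLs, retry count and delay are arbitrary choices for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real download; swap in requests.get(url).text."""
    if url.endswith('/broken'):
        raise ConnectionError(f'could not reach {url}')
    return f'<html>{url}</html>'


def fetch_with_retry(url, retries=2, delay=0.01):
    """Try a download a few times, pausing between attempts."""
    for attempt in range(retries + 1):
        try:
            return url, fetch(url)
        except ConnectionError:
            if attempt < retries:
                time.sleep(delay)  # back off before the next attempt
    return url, None  # give up and record the failure


urls = [f'https://example.com/page/{i}' for i in range(1, 4)]
urls.append('https://example.com/broken')

# max_workers caps how many requests are in flight at once
with ThreadPoolExecutor(max_workers=2) as executor:
    results = dict(executor.map(fetch_with_retry, urls))

# Filter: keep only the pages that were actually downloaded
pages = {url: body for url, body in results.items() if body is not None}
print(len(pages), 'pages stored,', len(results) - len(pages), 'failed')
```

Even this toy version needs a thread pool, a retry loop and a filtering step, which is the kind of plumbing a framework takes off your hands.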

Fortunately for us, tools exist that can handle those things for us.


Scrapy

Scrapy is a powerful Python web scraping framework. It provides many features to download web pages asynchronously, then process and save them. It handles concurrency, crawling (the process of going from link to link to find every URL in a website), sitemap crawling and much more.

Scrapy also has an interactive mode called the Scrapy Shell. With Scrapy Shell you can test your scraping code really quickly, like XPath expressions or CSS selectors.

The downside of Scrapy is that the learning curve is steep: there is a lot to learn.

To follow up on our example about Hacker News, we are going to write a Scrapy Spider that scrapes the first 15 pages of results and saves everything in a JSON file.

You can easily install Scrapy with pip:

pip install Scrapy

Then you can use the Scrapy CLI to generate the boilerplate code for our project:

scrapy startproject hacker_news_scraper

Inside hacker_news_scraper/spiders we will create a new Python file with our Spider’s code:

from bs4 import BeautifulSoup
import scrapy


class HnSpider(scrapy.Spider):
    name = 'hacker-news'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [f'https://news.ycombinator.com/news?p={i}' for i in range(1, 16)]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.find_all('tr', class_='athing')

        for link in links:
            # Each yielded dict becomes one item in the output file
            yield {
                'id': link['id'],
                'title': link.find_all('td')[2].a.text,
                'url': link.find_all('td')[2].a['href'],
                'rank': int(link.td.span.text.replace('.', '')),
            }

There are a lot of conventions in Scrapy. Here we define a list of starting URLs. The name attribute will be used to call our Spider from the Scrapy command line.

The parse method will be called on each URL in the start_urls list.

We then need to tune Scrapy a little bit in order for our Spider to behave nicely against the target website.

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5

You should always turn this on. It makes sure the target website is not slowed down by your spiders, by analyzing the response time and adapting the number of concurrent requests.

You can run this code with the Scrapy CLI and choose between different output formats (CSV, JSON, XML…):
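AutoThrottle has a few companion settings that are worth knowing about. The fragment below is a sketch of what a polite settings.py could contain. The setting names are Scrapy’s, but the values are arbitrary examples that you should tune for the site you are scraping.

```python
# settings.py (fragment)

# Let Scrapy adapt the delay between requests to the server's response time
AUTOTHROTTLE_ENABLED = True
# Upper bound (in seconds) on the delay when the server is slow to answer
AUTOTHROTTLE_MAX_DELAY = 60
# Average number of requests Scrapy should send in parallel to each server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Hard cap on simultaneous requests to a single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# Respect the rules published in the site's robots.txt
ROBOTSTXT_OBEY = True
```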

scrapy crawl hacker-news -o links.json

And that’s it! You will now have all your links in a nicely formatted JSON file.

Selenium & Chrome --headless

Scrapy is really nice for large-scale web scraping tasks, but it is not enough if you need to scrape a Single Page Application written with a JavaScript framework, because it won’t be able to render the JavaScript code.

It can be challenging to scrape these SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce what the JavaScript code does, meaning manually inspecting all the network calls with your browser inspector and replicating the AJAX calls that contain the interesting data.
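The upside of replicating an AJAX call is that the response is usually JSON, so there is no HTML to parse at all. The sketch below uses a made-up payload in place of a real response body copied from the network inspector. The field names are assumptions for the example.

```python
import json

# Made-up payload, standing in for the body of an AJAX response copied
# from the browser's network inspector
ajax_body = '''
{"items": [
    {"id": 1, "title": "First post", "url": "https://example.com/1"},
    {"id": 2, "title": "Second post", "url": "https://example.com/2"}
], "next_page": 2}
'''

payload = json.loads(ajax_body)

# The data is already structured, so no HTML parsing is needed
titles = [item['title'] for item in payload['items']]
print(titles)
print('next page:', payload['next_page'])
```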

In some cases, there are just too many asynchronous HTTP calls involved to get the data you want and it can be easier to just render the page in a headless browser.

Another great use case is taking a screenshot of a page, and this is what we are going to do with the Hacker News homepage (again!).

You can install the selenium package with pip:

pip install selenium

You will also need Chromedriver:

brew install chromedriver

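Then we just have to import the webdriver from the selenium package, configure Chrome to run headless, set a window size (otherwise it is really small), load the page and save the screenshot:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # older Selenium releases also accept options.headless = True
options.add_argument('--window-size=1920,1200')  # example size
driver = webdriver.Chrome(options=options)  # expects chromedriver on your PATH

driver.get('https://news.ycombinator.com')
driver.save_screenshot('hn_homepage.png')  # example output file name
driver.quit()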

You should get a nice screenshot of the homepage:

You can do much more with the Selenium API and Chrome, like:

  • Executing Javascript
  • Filling forms
  • Clicking on Elements
  • Extracting elements with CSS selectors / XPath expressions

Selenium and Chrome in headless mode are really the ultimate combination to scrape anything you want. You can automate anything that you could do with your regular Chrome browser.

The big drawback is that Chrome needs lots of memory and CPU power. With some fine-tuning you can reduce the memory footprint to 300-400 MB per Chrome instance, but you still need 1 CPU core per instance.

If you want to run several Chrome instances concurrently, you will need powerful servers (the cost goes up quickly) and constant monitoring of resources.


Conclusion

Here is a quick recap table of every technology we discussed in this post. Do not hesitate to tell us in the comments if you know of resources that you feel belong here.

I hope that this overview will help you best choose your Python scraping tools and that you learned things reading this post.

