Garry Taylor

How to build a URL crawler to map a website using Python?

#python

Edward Jackson

A simple project for learning the fundamentals of web scraping

Before we start, let’s make sure we understand what web scraping is:

Web scraping is the process of extracting data from websites to present it in a format users can easily make sense of.
In this tutorial, I want to demonstrate how easy it is to build a simple URL crawler in Python that you can use to map websites. While this program is relatively simple, it can provide a great introduction to the fundamentals of web scraping and automation. We will be focusing on recursively extracting links from web pages, but the same ideas can be applied to a myriad of other solutions.

Our program will work like this:

  1. Visit a web page
  2. Scrape all unique URLs found on the web page and add them to a queue
  3. Recursively process URLs one by one until we exhaust the queue
  4. Print results

First Things First

The first thing we should do is import all the necessary libraries. We will be using BeautifulSoup, requests, and urllib for web scraping.

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque

Next, we need to select a URL to start crawling from. While you can choose any webpage with HTML links, I recommend using ScrapeThisSite. It is a safe sandbox that you can crawl without getting in trouble.

url = "https://scrapethissite.com"

Next, we are going to need to create a new deque object so that we can easily add newly found links and remove them once we are finished processing them. Pre-populate the deque with your url variable:

# a queue of urls to be crawled next
new_urls = deque([url])

We can then use a set to store unique URLs once they have been processed:

# a set of urls that we have already processed 
processed_urls = set()

We also want to keep track of local (same domain as the target), foreign (different domain than the target), and broken URLs:

# a set of urls inside the target website
local_urls = set()

# a set of urls outside the target website
foreign_urls = set()

# a set of broken urls
broken_urls = set()

Time To Crawl

With all that in place, we can now start writing the actual code to crawl the website.

We want to look at each URL in the queue, find any additional URLs within that page, and add each one to the end of the queue until none are left. As soon as we finish scraping a URL, we remove it from the queue and add it to the processed_urls set for later use.

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move url from the queue to processed url set
    url = new_urls.popleft()
    processed_urls.add(url)
    # print the current url
    print("Processing %s" % url)

Next, add exception handling to catch any broken URLs and add them to the broken_urls set for later use:

try:
    response = requests.get(url)

except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError, requests.exceptions.InvalidURL, requests.exceptions.InvalidSchema):
    # add broken urls to their own set, then continue
    broken_urls.add(url)
    continue

We then need to get the base URL of the webpage so that we can easily differentiate local and foreign addresses:

# extract base url to resolve relative links
parts = urlsplit(url)
base = "{0.netloc}".format(parts)
strip_base = base.replace("www.", "")
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/')+1] if '/' in parts.path else url
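
For example (with a made-up URL), urlsplit breaks an address into named parts, which is what the base and base_url strings above are built from:

parts = urlsplit("https://www.example.com/pages/page1.html")
# SplitResult(scheme='https', netloc='www.example.com', path='/pages/page1.html', query='', fragment='')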

Initialize BeautifulSoup to process the HTML document:

soup = BeautifulSoup(response.text, "lxml")
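
If you do not have the lxml parser installed, Python's built-in parser works as well:

soup = BeautifulSoup(response.text, "html.parser")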

Now scrape the web page for all links and add each one to its corresponding set:

for link in soup.find_all('a'):
    # extract link url from the anchor
    anchor = link.attrs["href"] if "href" in link.attrs else ''

    if anchor.startswith('/'):
        local_link = base_url + anchor
        local_urls.add(local_link)
    elif strip_base in anchor:
        local_urls.add(anchor)
    elif not anchor.startswith('http'):
        local_link = path + anchor
        local_urls.add(local_link)
    else:
        foreign_urls.add(anchor)

Since I want to limit my crawler to local addresses only, I use the following to add new URLs to the queue:

for i in local_urls:
    if i not in new_urls and i not in processed_urls:
        new_urls.append(i)

If you want to crawl all URLs, including foreign ones, use this instead:

for i in local_urls.union(foreign_urls):
    if i not in new_urls and i not in processed_urls:
        new_urls.append(i)

Warning: The way the program currently works, crawling foreign URLs will take a VERY long time. You could also get into trouble for scraping websites without permission. Use at your own risk!
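
As a courtesy, you could also check a site's robots.txt before crawling it. The snippet below is just a minimal sketch using the standard library's urllib.robotparser (the helper name is made up), not part of the program itself:

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="*"):
    # fetch and parse the site's robots.txt, then ask if `url` may be crawled
    parts = urlsplit(url)
    robots = RobotFileParser()
    robots.set_url("{0.scheme}://{0.netloc}/robots.txt".format(parts))
    robots.read()
    return robots.can_fetch(user_agent, url)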

Here is all my code:

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque

url = "https://scrapethissite.com"
# a queue of urls to be crawled
new_urls = deque([url])

# a set of urls that have already been processed
processed_urls = set()
# a set of domains inside the target website
local_urls = set()
# a set of domains outside the target website
foreign_urls = set()
# a set of broken urls
broken_urls = set()

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError, requests.exceptions.InvalidURL, requests.exceptions.InvalidSchema):
        # add broken urls to their own set, then continue
        broken_urls.add(url)
        continue
    
    # extract base url to resolve relative links
    parts = urlsplit(url)
    base = "{0.netloc}".format(parts)
    strip_base = base.replace("www.", "")
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # create a beautiful soup object for the html document
    soup = BeautifulSoup(response.text, "lxml")

    for link in soup.find_all('a'):
        # extract link url from the anchor
        anchor = link.attrs["href"] if "href" in link.attrs else ''

        if anchor.startswith('/'):
            local_link = base_url + anchor
            local_urls.add(local_link)
        elif strip_base in anchor:
            local_urls.add(anchor)
        elif not anchor.startswith('http'):
            local_link = path + anchor
            local_urls.add(local_link)
        else:
            foreign_urls.add(anchor)

    # add new local urls to the queue
    for i in local_urls:
        if i not in new_urls and i not in processed_urls:
            new_urls.append(i)

print(processed_urls) 

And that should be it. You have just created a simple tool to crawl a website and map all URLs found!

In Conclusion

Feel free to build upon and improve this code. For example, you could modify the program to search web pages for email addresses or phone numbers as you crawl them. You could even extend functionality by adding command-line arguments that provide the option to define output files, limit the search depth, and much more.
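
As a minimal sketch of the email-address idea (the pattern and set name are illustrative, and the pattern is deliberately simple rather than RFC-complete), you could add something like this inside the while loop, right after fetching a page:

import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
emails = set()

# inside the while loop, after `response = requests.get(url)`:
emails.update(EMAIL_RE.findall(response.text))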

Thanks for reading ❤

If you liked this post, share it with all of your programming buddies!

Follow me on Facebook | Twitter

Learn More

Complete Python Bootcamp: Go from zero to hero in Python 3

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python and Django Full Stack Web Developer Bootcamp

Complete Python Masterclass

The Python Bible™ | Everything You Need to Program in Python

Computer Vision Using OpenCV

An A-Z of useful Python tricks

A Complete Machine Learning Project Walk-Through in Python

Learning Python: From Zero to Hero

A guide to Face Detection in Python

*Originally published on https://medium.freecodecamp.org*

Ray Patel

Lambda, Map, Filter functions in python

Welcome to my blog. In this article, we will learn about Python's lambda function, map function, and filter function.

Lambda function in Python: a lambda is a one-line anonymous function; it takes any number of arguments but can only have one expression. The Python lambda syntax is:

Syntax: x = lambda arguments : expression

Now I will show you some Python lambda function examples:
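
Here is a minimal sketch of lambda on its own and together with map and filter:

# a one-line squaring function
square = lambda x: x ** 2
print(square(4))    # 16

numbers = [1, 2, 3, 4, 5]

# map applies a function to every item of an iterable
print(list(map(lambda x: x * 2, numbers)))    # [2, 4, 6, 8, 10]

# filter keeps only the items for which the function returns True
print(list(filter(lambda x: x % 2 == 0, numbers)))    # [2, 4]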

#python #anonymous function python #filter function in python #lambda #lambda python 3 #map python #python filter #python filter lambda #python lambda #python lambda examples #python map

Ray Patel

Top 30 Python Tips and Tricks for Beginners

Welcome to my blog. In this article, you are going to learn the top Python tips and tricks, a few of which are sketched in code below.

1) swap two numbers.

2) Reversing a string in Python.

3) Create a single string from all the elements in list.

4) Chaining Of Comparison Operators.

5) Print The File Path Of Imported Modules.

6) Return Multiple Values From Functions.

7) Find The Most Frequent Value In A List.

8) Check The Memory Usage Of An Object.
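
A few of these can be sketched in a line or two each (the numbers refer to the list above):

import sys
from collections import Counter

# 1) swap two numbers without a temporary variable
a, b = 5, 10
a, b = b, a

# 2) reverse a string
print("Python"[::-1])    # nohtyP

# 3) create a single string from all the elements in a list
print(" ".join(["Python", "is", "fun"]))    # Python is fun

# 4) chain comparison operators
print(1 < a < 100)    # True

# 7) find the most frequent value in a list
print(Counter([1, 2, 3, 2, 2]).most_common(1)[0][0])    # 2

# 8) check the memory usage of an object
print(sys.getsizeof(a))    # size in bytes (varies by implementation)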

#python #python hacks tricks #python learning tips #python programming tricks #python tips #python tips and tricks #python tips and tricks advanced #python tips and tricks for beginners #python tips tricks and techniques #python tutorial #tips and tricks in python #tips to learn python #top 30 python tips and tricks for beginners

Why Use WordPress? What Can You Do With WordPress?

Can you use WordPress for anything other than blogging? To your surprise, yes. WordPress is more than just a blogging tool, and it has helped thousands of websites and web applications to thrive. WordPress powers around 40% of all websites, and today in our blog, we will visit some amazing uses of WordPress other than blogging.
What Is The Use Of WordPress?

WordPress is the most popular website platform in the world. It is the first choice of businesses that want to set up a feature-rich and dynamic Content Management System. So, if you ask what WordPress is used for, the answer is: everything. It is a super-flexible, feature-rich and secure platform that offers everything needed to build unique websites and applications. Let's get to know these uses:

1. Multiple Websites Under A Single Installation
WordPress Multisite allows you to develop multiple sites from a single WordPress installation. You can download WordPress and start building the websites you want to launch under a single server. Literally speaking, you can handle hundreds of sites from one single dashboard, which deserves applause.
It is a highly efficient platform that allows you to easily run several websites under the same login credentials. One of the best things about WordPress is the themes it has to offer. You can simply download themes and plugins for your various sites and save space without losing speed.

2. WordPress Social Network
WordPress can be used for high-end projects such as a social media network. If you don't have the money and patience to hire a coder and invest months in building a feature-rich social media site, go for WordPress. It is one of the most amazing uses of WordPress: its stunning CMS is unbeatable, and you can build sites as good as Facebook or Reddit. It just makes the process a lot easier.
To set up a social media network, you would have to download a WordPress plugin called BuddyPress. It allows you to set up a community page with ease and provides all the necessary features of a community or social media site. It has direct messaging, activity streams, user groups, extended profiles, and so much more. You just have to download and configure it.
If BuddyPress doesn’t meet all your needs, don’t give up on your dreams. You can try out WP Symposium or PeepSo. There are also several themes you can use to build a social network.

3. Create A Forum For Your Brand’s Community
Communities are very important for your business. They help you stay in constant connection with your users and consumers, and allow you to turn them into a loyal customer base. While there are many good technologies that can be used for building a community page, good old WordPress is still the best.
It is the best community development technology. If you want to build your online community, you need to consider all the amazing features you get with WordPress. Plugins such as bbPress, an open-source, template-driven PHP/MySQL forum software, are very simple and don't hamper the experience of the website.
Other tools such as wpForo and Asgaros Forum are equally good for creating a community page. They are lightweight tools that are easy to manage and integrate easily with your WordPress site. However, there is one tiny problem: you need some technical knowledge to build a WordPress community page.

4. Shortcodes
Since we gave you a problem in the previous section, we will also give you a perfect solution for it. You might not know how to code, but you have shortcodes. Shortcodes help you execute functions without having to write code. They are an easy way to build an amazing website, add new features, and customize plugins. Rather than memorizing multiple lines of code, you can have zero technical knowledge and still build a feature-rich website or application.
There are also plugins like Shortcoder, Shortcodes Ultimate, and the Basics available on WordPress that can be used, so you would not even have to remember the shortcodes.

5. Build Online Stores
If you are still wondering why to use WordPress, use it to build an online store. It is an affordable technology that helps you build a feature-rich eCommerce store and start selling your goods online.
WooCommerce is an extension of WordPress and is one of the most used eCommerce solutions. WooCommerce holds a 28% share of the global market and is one of the best ways to set up an online store. It allows you to build user-friendly and professional online stores and has thousands of free and paid extensions. Moreover, it is an open-source platform, so you don't have to pay for a license.
Apart from WooCommerce, there are Easy Digital Downloads, iThemes Exchange, Shopify eCommerce plugin, and so much more available.

6. Security Features
WordPress takes security very seriously. It offers tons of external solutions that help you safeguard your WordPress site. While there is no way to ensure 100% security, it provides regular updates with security patches and several plugins to help with backups, two-factor authentication, and more.
By choosing hosting providers like WP Engine, you can improve the security of the website. Such providers help with threat detection, managed patching and updates, internal security audits for customers, and much more.

Read More

#use of wordpress #use wordpress for business website #use wordpress for website #what is use of wordpress #why use wordpress #why use wordpress to build a website

How To Compare Tesla and Ford Company By Using Magic Methods in Python

Magic methods are the special methods which give us the ability to hook into built-in syntactic features such as '<', '>', '==', '+', etc.

You must have worked with such methods without knowing them to be magic methods. Magic methods can be identified by their names, which start with __ and end with __, like __init__, __call__, __str__, etc. These methods are also called dunder methods, because their names start and end with a Double UNDERscore (dunder).

There are a number of such special methods in Python, and you might have come across some of them too. We will take just a few of them as examples to understand how they work and how we can use them.

1. __init__

class AnyClass:
    def __init__(self):
        print("Init called on its own")
obj = AnyClass()

The first example is __init__, and as the name suggests, it is used for initializing objects. The __init__ method is called on its own, i.e., whenever an object is created for the class, __init__ is invoked automatically.

The output of the above code is given below. Note how we did not call the __init__ method, yet it got invoked as we created an object of class AnyClass.

Init called on its own

2. __add__

Let's move on to another example: __add__ gives us the ability to use the built-in + operator. Let's see how.

class AnyClass:
    def __init__(self, var):
        # store the value so __add__ can use it later
        self.some_var = var
    def __add__(self, other_obj):
        # invoked whenever two AnyClass objects are combined with +
        print("Calling the add method")
        return self.some_var + other_obj.some_var
obj1 = AnyClass(5)
obj2 = AnyClass(6)
obj1 + obj2    # prints "Calling the add method" and evaluates to 11

#python3 #python #python-programming #python-web-development #python-tutorials #python-top-story #python-tips #learn-python

Art Lind

Python Tricks Every Developer Should Know

Python is awesome. It's one of the easiest languages, with a simple and intuitive syntax. But wait, have you ever thought that there might be ways to write your Python code even simpler?

In this tutorial, you’re going to learn a variety of Python tricks that you can use to write your Python code in a more readable and efficient way like a pro.

Let’s get started

Swapping values in Python

Instead of creating a temporary variable to hold one of the values while swapping, you can do this instead:

>>> FirstName = "kalebu"
>>> LastName = "Jordan"
>>> FirstName, LastName = LastName, FirstName 
>>> print(FirstName, LastName)
Jordan kalebu

#python #python-programming #python3 #python-tutorials #learn-python #python-tips #python-skills #python-development