How to Extract All Website Links using BeautifulSoup in Python

In this tutorial, you will learn how to build a link extractor tool in Python from scratch using requests and BeautifulSoup libraries.

Viewing what a web page links to is one of the major steps of the SEO diagnostics process. It is also useful in the information-gathering phase for penetration testers who are assessing a website they are authorized to test.

Let's install the dependencies:

pip3 install requests bs4 colorama

Open up a new Python file and follow along, let's import the modules we need:

import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

We are going to use colorama just to print in different colors, to distinguish between internal and external links:

# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET

We will need two global variables: one for all internal links of the website and one for all external links:

# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
  • Internal links are URLs that link to other pages of the same website.
  • External links are URLs that link to other websites.

Not all links in anchor tags (a tags) are valid (I've experimented with this): some link to parts of the same page, and some are javascript code. So let's write a function to validate URLs:

def is_valid(url):
    """Checks whether `url` is a valid URL."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

This makes sure that a proper scheme (protocol, e.g. http or https) and a domain name exist in the URL.
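As a quick sanity check, here is how the function (copied from above) behaves on a few sample inputs; example.com is a placeholder domain:

```python
from urllib.parse import urlparse

def is_valid(url):
    """Checks whether `url` is a valid URL."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

print(is_valid("https://example.com/page"))  # True: has a scheme and a domain
print(is_valid("#section"))                  # False: fragment-only link
print(is_valid("javascript:void(0)"))        # False: scheme but no domain
```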

Now let's build a function to return all the valid URLs of a web page:

def get_all_website_links(url):
    """Returns all URLs found on `url` that belong to the same website."""
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

First, I initialized the urls set variable; I've used a Python set here because we don't want redundant links.

Second, I extracted the domain name from the URL; we will need it to check whether a grabbed link is external or internal.

Third, I downloaded the HTML content of the web page and wrapped it in a soup object to ease HTML parsing.

Let's get all the HTML a tags (anchor tags, which contain the links of the web page):

    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue

So we get the href attribute and check if there is something there. Otherwise, we just continue to the next link.

Since not all links are absolute, we will need to join relative URLs with the page's base URL (e.g. when href is "/search", urljoin resolves it against the page URL into an absolute link):

        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)

Now we need to remove HTTP GET parameters from the URLs, since they would cause redundancy in the set; the code below handles that:

        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
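To see the joining and the stripping in isolation, here is a small standalone sketch (example.com is a placeholder domain):

```python
from urllib.parse import urljoin, urlparse

base = "https://example.com/articles/"   # placeholder page URL
href = "/search?q=python#results"        # relative link with a query and a fragment

# 1) resolve the relative link against the page URL
absolute = urljoin(base, href)

# 2) keep only scheme, domain and path, dropping GET parameters and fragments
parsed = urlparse(absolute)
clean = parsed.scheme + "://" + parsed.netloc + parsed.path

print(absolute)  # https://example.com/search?q=python#results
print(clean)     # https://example.com/search
```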

Let's finish up the function:

        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{GRAY}[!] External link: {href}{RESET}")
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        urls.add(href)
        internal_urls.add(href)
    return urls

All we did here is check the following:

  • If the URL isn't valid, continue to the next link.
  • If the URL is already in the internal_urls, we don't need that either.
  • If the URL is an external link, print it in gray color and add it to our global external_urls set and continue to the next link.

Finally, after all the checks, the URL is an internal link; we print it and add it to our urls and internal_urls sets.
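One caveat: `domain_name not in href` is a plain substring test, so a crafted URL such as https://example.com.evil.org/ (example.com as a placeholder) would be misclassified as internal. A stricter variant, sketched below, compares the parsed host exactly:

```python
from urllib.parse import urlparse

def is_internal(href, domain_name):
    # exact host comparison instead of a substring match
    return urlparse(href).netloc == domain_name

print(is_internal("https://example.com/page", "example.com"))       # True
print(is_internal("https://example.com.evil.org/", "example.com"))  # False
```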

The above function only grabs the links of one specific page; what if we want to extract all the links of the entire website? Let's do that:

# number of urls visited so far will be stored here
total_urls_visited = 0

def crawl(url, max_urls=50):
    """Crawls a web page and extracts all links.
    You'll find all links in the `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): maximum number of URLs to crawl, default is 50.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)

This function crawls the website: it gets all the links of the first page, then calls itself recursively to follow every link extracted previously. However, this can cause problems; the program would run for a very long time on large websites. As a result, I've added a max_urls parameter to exit when we reach a certain number of checked URLs.
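As a side note, deep recursion can also hit Python's recursion limit on heavily linked sites. The same bound can be enforced iteratively with a queue; here is a minimal sketch, where the `get_links` callable stands in for the get_all_website_links function above:

```python
from collections import deque

def crawl_iterative(start_url, get_links, max_urls=50):
    """Breadth-first crawl bounded by `max_urls`.
    `get_links` is any callable returning the links found on a page,
    e.g. the get_all_website_links function defined above."""
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_urls:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        queue.extend(get_links(url))
    return visited
```

Testing it against a tiny in-memory link graph instead of a live site:

```python
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(crawl_iterative("a", lambda u: graph.get(u, [])))  # {'a', 'b', 'c'}
```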

Alright, let's test this. Make sure you only run it against a website you're authorized to crawl; otherwise, I'm not responsible for any harm you cause.

if __name__ == "__main__":
    crawl("https://example.com")  # replace with a site you're authorized to crawl
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))

I tested this on a live website. However, I highly encourage you not to hammer sites this way; it sends a lot of requests, will load the web server, and may get your IP address blocked.

Here is a part of the output:

(screenshot of the colored internal and external link listing omitted)

Awesome, right? I hope this tutorial was useful and inspires you to build tools like this using Python.

I edited the code a little so you can save the output URLs to a file; check the full code.
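The extended version isn't shown here, but saving the two sets to text files can be sketched like this (the function and file names are my own, not from the original code):

```python
def save_links(domain_name, internal_urls, external_urls):
    # hypothetical helper: write each set of collected URLs to its own file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for url in sorted(internal_urls):
            print(url.strip(), file=f)
    with open(f"{domain_name}_external_links.txt", "w") as f:
        for url in sorted(external_urls):
            print(url.strip(), file=f)
```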

Happy Scraping ♥

