Motivation

When I searched for the keyword “machine learning” on GitHub, I found 246,632 machine learning repositories. I expect the owners and contributors of the top repositories among them to be experts, or at least competent, in machine learning. So I decided to extract the profiles of these users to gain some insight into their backgrounds and statistics.

My Method for Scraping

Tools

To scrape, I use three tools:

Beautiful Soup to extract the URLs of all the repositories under the machine learning tag. Beautiful Soup is a Python library that makes it easy to scrape data from websites. If you are not familiar with Beautiful Soup, I wrote a tutorial on how to scrape with it in this article.

Detailed Tutorials for Beginners: Web Scrape Movie Database from Multiple Pages with Beautiful Soup

PyGithub to extract the information about the users. PyGithub is a Python library for working with the GitHub API v3. With it, you can manage your GitHub resources (repositories, user profiles, organizations, etc.) from Python scripts.
Requests to extract the information about the repositories and the links to contributors’ profiles.

Methods

I scrape the owners as well as the top 30 contributors of the top 90 repositories that appear in the search results.
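As a rough sketch of this step, the snippet below uses Requests against the GitHub REST API's contributors endpoint (`GET /repos/{owner}/{repo}/contributors`) and merges the accounts from several repositories, skipping repeats and organization accounts. The helper names (`contributors_url`, `fetch_contributors`, `merge_accounts`) are made up for illustration, and it is an assumption that the data comes from the API rather than from the HTML pages, so the actual scraper may work differently.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"


def contributors_url(full_name, top_n=30):
    """Build the contributors URL for a repo given as 'owner/name'."""
    query = urlencode({"per_page": top_n})
    return f"{API_ROOT}/repos/{quote(full_name)}/contributors?{query}"


def fetch_contributors(full_name, top_n=30):
    """Download up to top_n contributors of one repository (needs network access)."""
    import requests  # third-party; imported here so the pure helpers run without it

    response = requests.get(contributors_url(full_name, top_n), timeout=10)
    response.raise_for_status()
    return response.json()


def merge_accounts(account_lists):
    """Flatten per-repository account lists, dropping repeats and organizations."""
    seen = set()
    merged = []
    for accounts in account_lists:
        for account in accounts:
            login = account["login"].lower()  # GitHub logins are case-insensitive
            if account.get("type") == "Organization" or login in seen:
                continue
            seen.add(login)
            merged.append(account)
    return merged


# Offline demonstration with hand-written sample data (not real scrape output).
sample = [
    [{"login": "udacity", "type": "Organization"}, {"login": "alice", "type": "User"}],
    [{"login": "Alice", "type": "User"}, {"login": "bob", "type": "User"}],
]
print([account["login"] for account in merge_accounts(sample)])  # ['alice', 'bob']
```

Unauthenticated API requests are rate limited, so a real run over 90 repositories would likely need an access token.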

After removing duplicates and profiles that belong to organizations, such as udacity, I obtain a list of 1208 users. For each user, I scrape the 20 data points listed below:

new_profile.info()

Specifically, the first 13 data points are obtained from here
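Since PyGithub is the tool used for user information, collecting profile fields might look roughly like the sketch below. The field list is only an illustrative subset of attributes that PyGithub's `NamedUser` exposes; the exact 13 fields are not listed here, so treat the selection, and the helper names `profile_row` and `fetch_profile`, as assumptions.

```python
from types import SimpleNamespace

# Illustrative subset of profile attributes exposed by PyGithub's NamedUser;
# the article does not list its exact 13 fields, so this choice is an assumption.
PROFILE_FIELDS = ("login", "name", "company", "location", "bio",
                  "followers", "following", "public_repos", "created_at")


def profile_row(user):
    """Copy the selected attributes of a user object into a plain dict."""
    return {field: getattr(user, field, None) for field in PROFILE_FIELDS}


def fetch_profile(login, token):
    """Look up one user through PyGithub (needs network access and a token)."""
    from github import Auth, Github  # third-party: pip install PyGithub

    client = Github(auth=Auth.Token(token))
    return profile_row(client.get_user(login))


# Offline demonstration with a stand-in object instead of a real API call.
fake_user = SimpleNamespace(login="alice", name="Alice", followers=42)
row = profile_row(fake_user)
print(row["login"], row["followers"], row["company"])  # alice 42 None
```

Building one dict per user keeps the rows easy to load into a pandas DataFrame later.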

The rest of the data points are obtained from the repositories of a user:

total_stars is the total number of stars across all of a user's repositories
max_star is the highest star count of any single repository
forks is the total number of forks across all repositories
descriptions are the descriptions of all of a user's repositories
contribution is the number of contributions within the last year
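The repository-based fields above can be sketched as a small aggregation. The example assumes each repository is a dict shaped like the GitHub REST API's repository objects (`stargazers_count`, `forks_count`, `description`); the function name `summarize_repos` and the choice to join descriptions into a single string are illustrative assumptions. The contribution count is left out because it does not come from the repository records.

```python
def summarize_repos(repos):
    """Aggregate repository-level numbers into one record per user."""
    stars = [repo.get("stargazers_count", 0) for repo in repos]
    return {
        "total_stars": sum(stars),
        "max_star": max(stars, default=0),  # 0 for a user with no repositories
        "forks": sum(repo.get("forks_count", 0) for repo in repos),
        "descriptions": " ".join(
            repo["description"] for repo in repos if repo.get("description")
        ),
    }


# Hand-written sample data (not real scrape output).
repos = [
    {"stargazers_count": 120, "forks_count": 30, "description": "Deep learning notes"},
    {"stargazers_count": 5, "forks_count": 1, "description": None},
    {"stargazers_count": 75, "forks_count": 9, "description": "Scraping utilities"},
]
summary = summarize_repos(repos)
print(summary["total_stars"], summary["max_star"], summary["forks"])  # 200 120 40
```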

Visualize the Data
