Visualizing GitHub’s Global Community

Visualizing GitHub’s Global Community

Visualizing GitHub’s global community. How our visualize global is built how we illustrate at GitHub. Because of the volume of world data generated on GitHub every day, the size of our databases, as well as the importance of keeping GitHub fast and community.

This is the second post in a series about how we built our new homepage.

  1. How our globe is built
  2. How we collect and use the data behind the globe
  3. How we made the page fast and performant
  4. How we illustrate at GitHub
  5. How we designed the homepage and wrote the narrative

In the first post, my teammate  Tobias shared  how we made the 3D globe come to life, with lots of nitty gritty details about Three.js, performance optimization, and delightful touches.

But there’s another side to the story—the data! We hope you enjoy the read. 

Data goals

When we kicked off the project, we knew that we didn’t want to make just another animated globe. We wanted the data to be interesting and engaging. We wanted it to be real, and most importantly, we wanted it to be live.

Luckily, the data was there.

The challenge then became designing a data service that addressed the following challenges:

  1. How do we query our massive volume of data?
  2. How do we show you the most interesting bits?
  3. How do we geocode user locations in a way that respects privacy?
  4. How do we expose the computed data back to the monolith?
  5. How do we not break GitHub? 

Let’s begin, shall we?

Querying GitHub

So, how hard could it be to show you some recent pull requests? It turns out it’s actually very simple:

class GlobeController < ApplicationController
  def data
    pull_requests = PullRequest
      .where(open: true)
      .joins(:repositories)
      .where("repository.is_open_source = true")
      .last(10_000)

    render json: pull_requests
  end
end

Just kidding 

Because of the volume of data generated on GitHub every day, the size of our databases, as well as the importance of keeping GitHub fast and reliable, we knew we couldn’t query our production databases directly.

Luckily, we have a data warehouse and a fantastic team that maintains it. Data from production is fetched, sanitized, and packaged nicely into the data warehouse on a regular schedule. The data can then be queried using  Presto, a flavor of SQL meant for querying large sets of data.

We also wanted the data to be as fresh as possible. So instead of querying snapshots of our MySQL tables that are only copied over once a day, we were able to query data coming from our  Apache Kafka event stream that makes it into the data warehouse much more regularly.

engineering github

What is Geek Coin

What is GeekCash, Geek Token

Best Visual Studio Code Themes of 2021

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

How to Compare Multiple GitHub Projects with Our GitHub Stats tool

In this article we are going to compare three most popular machine learning projects for you.

Deploying my portfolio website on Github Pages using Github Actions.

Deploying my portfolio website on Github Pages using Github Actions. I recently deployed my portfolio site and wanted to try out github actions and this is my experience of automating the deployment.

Stay Safe on GitHub: Security Practices to Follow

As developers in this deeply interconnected community use open source code to build software, Github security should be a top priority. This is because extensive code re-use increases the risk of distributing vulnerabilities from one dependency or repository to another. As such, every contributor should focus on creating a secure development environment. Here are eight security practices that GitHub users can follow to stay safe and protect their code:

Stay Safe on GitHub: Security Practices to Follow

As developers in this deeply interconnected community use open source code to build software, Github security should be a top priority. This is because extensive code re-use increases the risk of distributing vulnerabilities from one dependency or repository to another. As such, every contributor should focus on creating a secure development environment. Here are eight security practices that GitHub users can follow to stay safe and protect their code:

Stay Safe on GitHub: Security Practices to Follow

As developers in this deeply interconnected community use open source code to build software, Github security should be a top priority. This is because extensive code re-use increases the risk of distributing vulnerabilities from one dependency or repository to another. As such, every contributor should focus on creating a secure development environment. Here are eight security practices that GitHub users can follow to stay safe and protect their code: