GitHub Availability Report: May 2021

Introduction

In May, we experienced two incidents resulting in significant impact and degraded state of availability for API requests, GitHub Pages, GitHub Actions and the GitHub Packages service, specifically the GitHub Packages Container registry service.

May 8 06:46 UTC (lasting 46 minutes)

This incident was caused by failures in an underlying MySQL database, which caused some operations to time out for the GitHub Container registry service. During this incident, some customers viewing packages in the UI or interacting with the registry through “docker push” and “docker pull” may have experienced failures while the engineering team investigated. After we failed over to one of our database replicas, the affected systems were restored.
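
As an illustration of that failover pattern, here is a minimal Python sketch that tries the primary database first and falls back to a read replica when a connection or query times out. The host names, credentials, and the use of PyMySQL are assumptions made for the example; this is not GitHub’s implementation, and the fallback is only safe for read traffic.

```python
# Hypothetical sketch: fall back to a read replica when the primary times out.
# Host names and credentials are placeholders, not GitHub infrastructure.
import pymysql

HOSTS = ["mysql-primary.internal", "mysql-replica-1.internal"]

def query_with_failover(sql, params=()):
    last_error = None
    for host in HOSTS:
        try:
            conn = pymysql.connect(
                host=host,
                user="registry",
                password="...",
                database="packages",
                connect_timeout=5,  # fail fast instead of hanging on a sick primary
                read_timeout=5,
            )
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.MySQLError as exc:
            last_error = exc  # timeout or connection failure: try the next host
    raise last_error

rows = query_with_failover(
    "SELECT name, version FROM packages WHERE owner = %s", ("octocat",)
)
```

Short connect and read timeouts matter as much as the fallback itself: they are what let a client notice an unhealthy primary and move on instead of hanging.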

Our internal engineering team is now prioritizing work that will reduce the impact to customers should a similar underlying outage happen again. This work includes creating internal documentation, dashboards, and enhanced alerts so we can quickly triage the cause of operation failures. We will also continue to maintain and add replicas in different regions and availability zones, which serve as a line of defense against unexpected regional outages.

GitHub Availability Report: April 2021

In April, we experienced two incidents resulting in significant impact and degraded state of availability for API requests and the GitHub Packages service, specifically the GitHub Packages Container registry service.

April 1 21:30 UTC (lasting one hour and 34 minutes)

This incident was caused by failures in our DNS resolution, resulting in a degraded state of availability for the GitHub Packages Container registry service. During this incident, some of our internal services that support the Container registry experienced intermittent failures when trying to connect to dependent services. The inability to resolve requests to these services resulted in users being unable to push new container images to the Container registry as well as pull existing images. The Container registry is currently in a public beta, and only beta users were impacted during this incident. The broader GitHub Packages service remained unaffected.

As a next step, we are looking at increasing the cache times of our DNS resolutions to decrease the impact of intermittent DNS resolution failures in the future.
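
As a rough sketch of the caching idea (and not GitHub’s internal resolver setup), the example below uses the dnspython package to put an LRU answer cache in front of a resolver, so repeated lookups are served locally until the record’s TTL expires rather than hitting upstream DNS on every request.

```python
# Hypothetical sketch: cache DNS answers so brief upstream resolution failures
# don't immediately break lookups for names we have resolved recently.
import dns.resolver  # from the dnspython package

resolver = dns.resolver.Resolver()
resolver.cache = dns.resolver.LRUCache(max_size=10000)  # answers reused until TTL expiry
resolver.lifetime = 2.0                                 # fail fast if upstream DNS is unhealthy

def resolve_ipv4(hostname):
    answer = resolver.resolve(hostname, "A")  # served from the cache when possible
    return [rdata.address for rdata in answer]

print(resolve_ipv4("ghcr.io"))  # example name; a second call within the TTL skips upstream
print(resolve_ipv4("ghcr.io"))
```

Increasing the cache times mentioned above effectively means longer TTLs on these answers (or briefly serving stale entries), so a short window of upstream resolution failures is less likely to surface to callers.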

How to Compare Multiple GitHub Projects with Our GitHub Stats tool

If you have project code hosted on GitHub, chances are you might be interested in checking some numbers and stats such as stars, commits and pull requests.

You might also want to compare similar projects in terms of the stats mentioned above, for whatever reason interests you.

We have the right tool for you: a simple, easy-to-use utility called GitHub Stats.

Let’s dive right into what we can get out of it.

Getting started

This interactive tool is really easy to use. Follow the three steps below and you’ll get what you want in real time:

1. Head to the GitHub repo of the tool

2. Enter as many projects as you need to check on

3. Hit the Update button beside each metric

In this article, we are going to compare three of the most popular machine learning projects for you.
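
If you want to script the same comparison yourself, the public GitHub REST API exposes most of these numbers directly. The sketch below is only an illustration (the repository names are examples, and this is not how the GitHub Stats tool is implemented internally); it fetches stars, forks, and open issues for a list of projects and prints them sorted by stars.

```python
# Hypothetical sketch: compare a few repositories via the public GitHub REST API.
# The repository list is an example; swap in whatever projects you care about.
import requests

REPOS = ["tensorflow/tensorflow", "pytorch/pytorch", "scikit-learn/scikit-learn"]

def fetch_stats(repo):
    resp = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
    resp.raise_for_status()
    data = resp.json()
    return {
        "repo": repo,
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "open_issues": data["open_issues_count"],
    }

for stats in sorted((fetch_stats(r) for r in REPOS), key=lambda s: -s["stars"]):
    print(f"{stats['repo']:<30} stars {stats['stars']:>7}  "
          f"forks {stats['forks']:>6}  open issues {stats['open_issues']:>5}")
```

Unauthenticated requests to the GitHub API are rate-limited, so pass a personal access token in the Authorization header if you plan to run this repeatedly.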

GitHub Availability Report: November 2020

Introduction

In November, we experienced two incidents resulting in significant impact and degraded state of availability for issues, pull requests, and GitHub Actions services.

November 2 12:00 UTC (lasting 32 minutes)

The SSL certificate for *.githubassets.com expired, impacting web requests for the GitHub.com UI and services. An auto-generated issue indicating that the certificate was within 30 days of expiration had been opened, but it was not addressed in time. Once the impact was reported, the on-call engineer remediated it promptly.
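
As a hedged illustration of the kind of automated check that can back up such an expiry reminder, the Python sketch below opens a TLS connection to a host and reports how many days remain on its certificate; the host name and the 30-day threshold are example values only.

```python
# Hypothetical sketch: warn when a host's TLS certificate is close to expiring.
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host, port=443):
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like: 'Jun  1 12:00:00 2021 GMT'
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

remaining = days_until_expiry("github.githubassets.com")  # example host
if remaining < 30:
    print(f"WARNING: certificate expires in {remaining} days")
else:
    print(f"OK: {remaining} days of validity left")
```

Running a check like this on a schedule is a common backstop for cases where an issue-based reminder slips through, which is essentially the gap this incident exposed.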

Introducing the GitHub Availability Report

What is the Availability Report?

Historically, GitHub has published post-incident reviews for major incidents that impact service availability. Whether we’re sharing new investments to infrastructure or detailing site downtimes, our belief is that we can collectively grow as an industry by learning from one another. This month, we’re excited to introduce the GitHub Availability Report.

What can you expect?

On the first Wednesday of each month, we’ll publish a report describing GitHub’s availability, including a description of any incidents that may have occurred and update you on how we are evolving our engineering systems and practices in response. You should expect these updates to include a summary of what happened, as well as a technical explanation for incidents where we believe the occurrence was novel and contains information that helps engineers around the world learn how to improve product operations at scale.

Why are we doing this?

Availability and performance are core features of GitHub, and so is how we respond to service disruptions. We strive to engineer systems that are highly available and fault-tolerant, and we expect that most of these monthly updates will recap periods of time where GitHub was >99% available. When things don’t go as planned, rather than waiting to share information about particularly interesting incidents, we want to describe all of the events that may impact you. Our hope is that by increasing our transparency and sharing what we’ve learned, rather than simply reporting minutes of downtime on a status page, everyone can learn from our experiences. At GitHub, we take the trust you place in us very seriously, and we hope this is a way for you to help hold us accountable for continuously improving our operational excellence as well as our product functionality.

GitHub Availability Report: July 2020

In July, we experienced one incident that resulted in a degraded state of availability for GitHub.com. We’d like to share what we learned from this incident with the community, in the spirit of being transparent about our service disruptions and helping other services improve their own operations.

July 13 08:18 UTC (lasting four hours and 25 minutes)

The incident started when our production Kubernetes Pods started getting marked as unavailable. This cascaded through our clusters resulting in a reduction in capacity, which ultimately brought down our services. Investigation into the Pods revealed that a single container within the Pod was exceeding its defined memory limits and being terminated. Even though that container is not required for production traffic to be processed, the nature of Kubernetes requires that all containers be healthy for a Pod to be marked as available.

Normally when a Pod runs into this failure mode, the cluster will recover within a minute or so. In this case, the container in the Pod was configured with an ImagePullPolicy of Always, which instructed Kubernetes to fetch a new container image every time the container started. However, due to a routine DNS maintenance operation that had been completed earlier, our clusters were unable to reach our image registry, and Pods failed to start. The impact grew when a redeploy was triggered in an attempt to mitigate the issue, and we saw the failure propagate across our production clusters. It wasn’t until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services.
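
To make the ImagePullPolicy detail concrete, here is a minimal sketch using the official Kubernetes Python client. It builds a Pod whose sidecar-style container uses imagePullPolicy IfNotPresent, so a restart can reuse the image already cached on the node instead of depending on the registry (and on DNS) being reachable. The names, image, and memory limit are illustrative, not GitHub’s actual configuration.

```python
# Hypothetical sketch: a Pod spec whose non-critical container can restart from the
# locally cached image ("IfNotPresent") instead of re-pulling it every time ("Always").
from kubernetes import client

sidecar = client.V1Container(
    name="metrics-sidecar",                      # illustrative name, not a real GitHub component
    image="registry.example.com/metrics:1.2.3",  # placeholder image
    image_pull_policy="IfNotPresent",            # reuse the cached image if the registry is unreachable
    resources=client.V1ResourceRequirements(
        limits={"memory": "256Mi"},               # the kind of limit whose breach terminated the container
        requests={"memory": "128Mi"},
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="web-frontend", labels={"app": "web"}),
    spec=client.V1PodSpec(containers=[sidecar]),
)
```

Whether Always or IfNotPresent is appropriate depends on how image tags are managed; the point here is only that the pull policy determines whether a Pod restart depends on the image registry being reachable.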

Moving forward, we’ve identified a number of areas to address this quarter:

  • Enhancing monitoring to ensure Pod restarts do not fail again due to this same pattern
  • Minimizing our dependency on the image registry
  • Expanding validation during DNS changes
  • Reevaluating all the existing Kubernetes deployment policies

In parallel, we have an ongoing workstream to improve our approach to progressive deployments that will provide the ability to carefully evaluate the impact of deployments in a more incremental fashion. This is part of a broader engineering initiative focused on reliability that we will have more details on in the coming months.
