In March, we experienced three incidents resulting in significant impact and degraded state of availability for issues, pull requests, webhooks, API requests, GitHub Pages, and GitHub Actions services.

Follow-up to March 1 09:59 UTC (lasting one hour and 42 minutes)

As mentioned in the February availability report, our service monitors detected a high error rate on creating check suites for workflow runs, which affected the Actions service. This incident resulted in the failure or delay of some queued jobs for a period of time. Additionally, some customers using the search/filter functionality to find workflows may have experienced incomplete search results.

Upon further investigation of this incident, we identified that the issue was caused by check suite IDs exceeding the maximum Int32 value (2,147,483,647). We had anticipated that check suite IDs and check run IDs would cross this limit and had migrated all database columns to bigint six months earlier. Our codebase, which consists of Ruby, Go, and C#, does not have any explicit type casting to Int32. However, we failed to identify that a GraphQL library we depend on uses Int32 when unmarshalling JSON.
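To illustrate the failure mode, here is a minimal Python sketch, not the code or library involved in the incident. The function names and the `id` payload field are hypothetical, and `struct.pack` stands in for a typed decoder that stores the ID in a fixed-width integer field. It shows how an ID one past the Int32 maximum is rejected by a 32-bit field while a 64-bit field accepts it.

```python
import json
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit integer


def decode_id_as_int32(payload: str) -> int:
    """Hypothetical decoder that keeps the ID in a signed 32-bit field."""
    value = json.loads(payload)["id"]
    # The "i" format is a signed 32-bit integer; struct.pack raises
    # struct.error when the value does not fit in that range.
    struct.pack("<i", value)
    return value


def decode_id_as_int64(payload: str) -> int:
    """Hypothetical decoder that keeps the ID in a signed 64-bit field."""
    value = json.loads(payload)["id"]
    # The "q" format is a signed 64-bit integer, comparable to a bigint column.
    struct.pack("<q", value)
    return value


if __name__ == "__main__":
    payload = json.dumps({"id": INT32_MAX + 1})
    try:
        decode_id_as_int32(payload)
    except struct.error as exc:
        print("32-bit decode failed:", exc)
    print("64-bit decode succeeded:", decode_id_as_int64(payload))
```

A boundary test of this kind, run against every layer that parses an ID (including third-party libraries), is one way such a narrowing could be caught before IDs reach the limit in production.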

When Actions identifies that a job needs to be run on a repo (triggered by webhooks or cron schedules), we first create a check suite. Those individual check suites were successfully created, since the database could handle values greater than the Int32 maximum, but processing the responses failed because an external library we were using expected an Int32. As a result, jobs failed to be queued and the check suites were left in a pending state. We deployed a code fix to mitigate the issue after validating that it would not lead to data integrity issues in other microservices that may rely on check suite IDs.

GitHub Availability Report: March 2021