Introduction

Choosing the right steps when working in the field of data science is truly no silver bullet. Most data scientists might have their custom workflow, which could be more or less automated, depending on their area of work. Using Kubernetes can be a tremendous enhancement when trying to automate workflows on a large scale. In this blog post, I would like to take you on my journey of doing data science while integrating the overall workflow into Kubernetes.

The target of the research I did in the past few months was to find any useful information about all those thousands of GitHub issues and pull requests (PRs) we have in the Kubernetes repository. What I ended up with was a fully automated, in Kubernetes running Continuous Integration (CI) and Deployment (CD) data science workflow powered by Kubeflow and Prow. You may not know both of them, but we get to the point where I explain what they’re doing in detail. The source code of my work can be found in the kubernetes-analysis GitHub repository, which contains everything source code-related as well as the raw data. But how to retrieve this data I’m talking about? Well, this is where the story begins.

Getting the Data

The foundation for my experiments is the raw GitHub API data in plain JSON format. The necessary data can be retrieved via the GitHub issues endpoint, which returns all pull requests as well as regular issues in the REST API. I exported roughly 91000 issues and pull requests in the first iteration into a massive 650 MiB data blob. This took me about 8 hours of data retrieval time because for sure, the GitHub API is rate limited. To be able to put this data into a GitHub repository, I’d chosen to compress it via [xz(1)](https://linux.die.net/man/1/xz). The result was a roundabout 25 MiB sized tarball, which fits well into the repository.

I had to find a way to regularly update the dataset because the Kubernetes issues and pull requests are updated by the users over time as well as new ones are created. To achieve the continuous update without having to wait 8 hours over and over again, I now fetch the delta GitHub API data between the last update and the current time. This way, a Continuous Integration job can update the data on a regular basis, whereas I can continue my research with the latest available set of data.

From a tooling perspective, I’ve written an all-in-one Python executable, which allows us to trigger the different steps during the data science experiments separately via dedicated subcommands. For example, to run an export of the whole data set, we can call:

> export GITHUB_TOKEN=<MY-SECRET-TOKEN>
> ./main export
INFO | Getting GITHUB_TOKEN from environment variable
INFO | Dumping all issues
INFO | Pulling 90929 items
INFO | 1: Unit test coverage in Kubelet is lousy. (~30%)
INFO | 2: Better error messages if go isn't installed, or if gcloud is old.
INFO | 3: Need real cluster integration tests
INFO | 4: kubelet should know which containers it is managing
… [just wait 8 hours] …

#kubernetes #kubernetes’ history #github #github #kubeflow #tensorflow

My exciting journey into Kubernetes’ history
1.45 GEEK