Tasty waffles and a seemingly dry data migration project, how does that go together?

In this article I am going to discuss:

  1. Data migration as a sample use case in which waffle charts can be a useful tool to reveal hidden patterns in your dataset
  2. How to prepare the dataset with pandas so that waffle charts can be produced at scale
  3. How to tweak tableau to generate the desired output, since waffle charts are not supported out of the box

Sample Use Case: Data Migration

Let’s say you are working as a data scientist in a cross-functional technical team and are confronted with the following challenge:

Your team is executing a big scale file migration between** two systems**. The files are **grouped into separate entities **each consisting of approximately 5000 to 1 0000 files. These entities are completely independent so there is no overlap. A file can only be assigned to one unique entity. The goal is to migrate a multitude of these entities in separate batches.

The migration scope** per file** can be broken down into a sequence of four distinct steps:

  1. Migration of the source file
  2. Migration of the related metadata
  3. Assignment of the file to a new folder structure
  4. Quality Assurance

All four steps have to be conducted in this exact same order.

The migration itself is run by automated scripts, which print the current status to a logfile. The logfile is then pre-processed and available as an excel-document in the below format:

Your job is to monitor and make sense of this data by communicating the overall migration status to your team in an intuitive way.

Brainstorming while understanding the dataset

Naturally the first step would be to create a Jupyter Notebook and load the logfile as a pandas dataframe for further processing.

import pandas as pd
import numpy as np

migration_status = pd.read_excel("migration_status.xlsx")
migration_status.head()

To get a general feel for the dataset, you could first run pandas’ .describe() and .info() methods:

The sample data set contains 5292 rows and no missing values. Migration steps are encoded as boolean values (0 = False; 1 = True). To get a first idea you could print the value_count for all four distinct migration steps

for column in migration_status.columns[2:]:
    print(migration_status[column].value_counts())

which yields the following results

1    5292
Name: SOURCE MIGRATED, dtype: int64
1    4958
0    334 
Name: METADATA MIGRATED, dtype: int64
1    4957
0    335 
Name: ASSIGNED TO FOLDER, dtype: int64
1    3343
0    1949
Name: QA EXECUTED, dtype: int64

You could then add an additional column “MIGRATION_STATUS” which reflects the overall migration status per file. Since booleans (0 and 1) are used in the respective columns, this can be easily achieved by calculating the sum of all migration step columns and then dividing it by 4 (the total amount of migration steps).

migration_status["MIGRATION_STATUS"] = migration_status.iloc[:, 2:].sum(axis=1) / 4

Using value_counts on the new column already reveals useful insights

migration_status["MIGRATION_STATUS"].value_counts()

1.00    3342
0.75    1616
0.25    334 
Name: MIGRATION_STATUS, dtype: int64

The output tells us that 3342 files have been fully migrated. 1616 files have been 75% migrated, whereas the migration of 334 files has just started and is currently stuck at 25%.

So far so good …

But isn’t there more information hidden in the data? Will your team look at these dry summary statistics and get a real grasp of the current migration status? Your team may also want to know the following:

Which file ranges have been migrated and which haven’t?

Are there file ranges that are stuck in a certain status?

Can we get this information all on a single slide in an easy to understand visual?

A simple spreadsheet and summary statistics based approach doesn’t seem to reveal the bigger picture here.

Waffles to the rescue!

Foto von Brigitte Tohm von Pexels

While considering different possible visualization types, such as pie-charts, stacked-bar charts, line-charts, heatmaps etc. you realize that these all don’t seem to accurately cater to the sequential nature of the migration process. After one of your colleagues is again complaining about his PC, since it got so slow that it was no longer usable, you suddenly remember the days of the good old defragmentation screen, which is (you guessed it) a type of modified waffle-chart.

Interestingly the defragmentation screen already appears here in reddit’s nostalgia section.

At first glance this visualization type might appear anachronistic, even a little bit nostalgic and too playful. But on second thought it is very intuitive, since it uses a learned visual pattern. Most importantly it does a great job by adding sequential and location based information to the chart in a very compact format.

After doing some research you decide to use a combination of pandas and tableau to prepare the data and generate the waffle charts.

Preparing the Data

#pandas #tableau #data-science #waffle #data-visualization #data analysis

How Waffle Charts Can Help With Your Next Data Migration Project
1.30 GEEK