As I mentioned in the previous part, I’m going to redo my earlier analysis of the Gojek dataset. After learning more about data analytics and looking back at my first analysis, I realized that there are quite a few things that could be done better. In this part, I’ll cover what I’d do differently. If you haven’t checked out the first part, you can read it here.
What did I miss
Looking back at my first analysis, a couple of things were missing that should’ve been there from the start, and one of them is among the most important things a data analyst does: data cleaning.
From what I’ve heard from many data analysts, data cleaning takes roughly 50% to 80% of their total time in the analysis process. So don’t ever skip your cleaning task. Before moving forward, always check what you could do to improve your data quality and make the processing in later steps easier.
Garbage in, garbage out, remember? Yup, let’s do the least enjoyable part of data analysis for the sake of our results ;).
Even if your data already looks good when it comes to you, it’s best to always verify. Another reason to do so is that it helps you understand your data better. In my case, the data looked clean when I first saw it, but as I said, it pays to scan through it again. Here’s what I did:
- First, I loaded the CSV file into MySQL (FYI, I only used Excel the first time).
- Then, after exploring the table, I came up with a few things that I thought should be done.
- First, I changed the NULL values to 0 in the transaction value and number of orders columns. I did this because NULL/missing values are usually either removed or replaced with something else. In this case, I simply used 0 and changed the status to “failed” (more on this later). Please note that if you keep the NULLs, you might run into problems when you do calculations on the numerical columns: in SQL, arithmetic involving NULL returns NULL, and aggregates such as AVG skip NULL rows, so keep that in mind.
- Then, I changed “failed/timeout” and “other” in the status column to “failed”. They actually mean the same thing, so merging them avoids confusion later on.
- Next, I set the status to “failed” wherever the number of orders is 0 or transactions_value is 0. Even if the row has the status “complete”, “cancelled”, or anything else, I think neither of the two should be 0. If one is, the comparison between services in terms of number of orders and transaction value wouldn’t be apples to apples (imagine GO-FOOD having 0 orders but a transaction value of 100000, and GO-JEK having 10 orders but a transaction value of 50000). I think the actual best practice here would be to talk to the data collection team and verify the data, rather than directly marking the rows as “failed”. But since we can’t do that, let’s just assume the data collection team already told us that all of them were “system error/failed.”
- Lastly, I dropped all the rows from April, because we only have one day of data for that month. Since we want to make a quarterly report, the April data has to go.
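To make the four cleaning steps above more concrete, here is a minimal sketch of how they could look as SQL statements. It is not the exact script I ran: it uses Python's built-in `sqlite3` module as a stand-in for MySQL so it runs without a server, and the table name `transactions`, the columns `date`, `service`, `status` and `num_orders`, and all of the sample rows are made up for illustration (only `transactions_value` matches a name used above).

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        date TEXT,
        service TEXT,
        status TEXT,
        num_orders INTEGER,
        transactions_value INTEGER
    )
    """
)

# Invented sample rows, one for each situation described in the list.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        ("2020-01-05", "GO-FOOD", "complete", 0, 100000),      # 0 orders, non-zero value
        ("2020-02-10", "GO-JEK", "complete", 10, 50000),       # clean row
        ("2020-02-11", "GO-JEK", "failed/timeout", None, None),  # NULL values
        ("2020-03-01", "GO-FOOD", "other", 3, 20000),          # status to merge
        ("2020-04-01", "GO-JEK", "complete", 5, 25000),        # lone April row
    ],
)

# Step 1: replace NULL with 0 in both numeric columns.
conn.execute(
    """
    UPDATE transactions
    SET num_orders = COALESCE(num_orders, 0),
        transactions_value = COALESCE(transactions_value, 0)
    """
)

# Step 2: merge the equivalent statuses into "failed".
conn.execute(
    "UPDATE transactions SET status = 'failed' "
    "WHERE status IN ('failed/timeout', 'other')"
)

# Step 3: any row with 0 orders or 0 transaction value becomes "failed".
conn.execute(
    "UPDATE transactions SET status = 'failed' "
    "WHERE num_orders = 0 OR transactions_value = 0"
)

# Step 4: drop the April rows so only the full quarter remains.
conn.execute("DELETE FROM transactions WHERE strftime('%m', date) = '04'")
conn.commit()

rows = conn.execute(
    "SELECT date, service, status, num_orders, transactions_value "
    "FROM transactions ORDER BY date"
).fetchall()
for row in rows:
    print(row)
```

The `UPDATE` and `DELETE` statements should carry over to MySQL almost unchanged, since `COALESCE` exists there too. The one SQLite-specific piece is the April filter: in MySQL, something like `WHERE MONTH(date) = 4` would do the same job, assuming the column is stored as a date type.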
#personal #data #data-analysis #data-visualization