Three years ago, I graduated from university with a bachelor's degree in Business Analytics (Big Data). I aspired to be a data scientist and wanted to apply all the machine learning models I had learnt in school to change the world. Yes, "change the world" was the phrase I actually used in many of my interviews at the time, because I believed it.

Now, looking back on the past three years, I would say I was too young and too idealistic about applying data science in the real world. In this article, I want to talk about the difference between data science in school and at work. I hope it resonates with some current data scientists and helps set expectations for people who aspire to become data scientists.

Factor 1: Data Availability

By data availability, I mean both data quantity and data quality. It breaks down into several questions.

Do we have the data we need in the first place?

Do we have enough cases to build a valid machine learning model (for both the training and testing datasets)?

How good is the quality of the data? (Is data missing? Is it well distributed across sectors?)

How up to date is the dataset?

Does the data exist in the format you want?
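These questions can be turned into a quick check before any modelling starts. The snippet below is only a minimal sketch in plain Python: the function name `audit_dataset`, the column names, the toy records, and the `min_rows=30` threshold are all illustrative assumptions, not values from a real project.

```python
from collections import Counter
from datetime import date


def audit_dataset(rows, required_columns, group_column, date_column, min_rows):
    """Summarise quantity, quality, and recency for a list of dict records."""
    # Do we have the data we need? List required columns that never appear.
    present = set().union(*(row.keys() for row in rows)) if rows else set()
    missing_columns = sorted(set(required_columns) - present)

    # Missing data: share of rows where each required column is absent or empty.
    missing_rate = {}
    for col in required_columns:
        n_missing = sum(1 for row in rows if row.get(col) in (None, ""))
        missing_rate[col] = n_missing / len(rows) if rows else 1.0

    # Recency: the latest date found in the date column, if any.
    dates = [row[date_column] for row in rows if row.get(date_column)]

    return {
        "n_rows": len(rows),
        "enough_rows": len(rows) >= min_rows,
        "missing_columns": missing_columns,
        "missing_rate": missing_rate,
        "group_counts": dict(Counter(row.get(group_column) for row in rows)),
        "latest_record": max(dates) if dates else None,
    }


# Toy records, invented purely for illustration.
rows = [
    {"sector": "retail", "revenue": 1.2, "updated": date(2019, 1, 1)},
    {"sector": "retail", "revenue": None, "updated": date(2020, 6, 30)},
    {"sector": "finance", "revenue": 3.4, "updated": date(2018, 3, 15)},
    {"sector": "retail", "revenue": 0.9, "updated": date(2020, 1, 1)},
]

report = audit_dataset(
    rows,
    required_columns=["revenue", "headcount"],
    group_column="sector",
    date_column="updated",
    min_rows=30,  # assumed threshold; the right number depends on the model
)
for key, value in report.items():
    print(key, value)
```

On this toy input the report flags three problems at once: too few rows, a required column (`headcount`) that does not exist at all, and a sample skewed towards one sector.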

**In School:** The teacher would provide a dataset, or we would need to source our own. Even when we were required to find our own dataset, I would usually go to the World Bank and use one of its macroeconomic datasets, because World Bank datasets are complete and usually of good quality.

Most of our effort goes into building machine learning models rather than finding a good dataset or cleaning up data.

**At Work:** Unfortunately, unless you work at one of the tech giants like Google or Facebook, data is not easy to get, especially good-quality data. We need to either scrape data from online sources or purchase it from third-party vendors.

Because of limits on sample size or the number of predictors, I sometimes could not apply a random forest algorithm or even a train-test split, because both need a minimum number of rows or columns to be meaningful. In addition, when I work on a data science project, around 90% of my time goes to data engineering or cleaning up the dataset to make sure it is in good shape.
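A little arithmetic shows why a train-test split stops being informative on a small dataset. This is a rough sketch, not a rule: the function name `holdout_summary` and the 20% test fraction are assumptions chosen for illustration.

```python
def holdout_summary(n_rows, test_fraction=0.2):
    """Return split sizes and how much one test row moves the accuracy score."""
    n_test = max(1, round(n_rows * test_fraction))
    n_train = n_rows - n_test
    return {
        "n_train": n_train,
        "n_test": n_test,
        # Accuracy can only change in steps of 1 / n_test.
        "accuracy_step": 1 / n_test,
    }


for n_rows in (20, 200, 20000):
    print(n_rows, holdout_summary(n_rows))
```

With 20 rows, the test set holds only 4 cases, so a single misclassified row shifts the measured accuracy by 25 percentage points. With 20,000 rows the same mistake barely registers.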

Factor 2: Business Need

The comparison may sound a bit unfair, because a school is not really a normal business entity. However, that is exactly the point. A school does not depend on your machine learning model to make a profit, so the level of business need required to start a project is very different.

**In School:** I can pick any problem statement and use a machine learning model to solve it. Even if the result would have little value because it is too difficult to implement, I can still do it.

For example, in school I correlated GDP with life expectancy, which does not seem to have direct business value.

**At Work:** People get paid for every project they do, and companies definitely do not want to waste money on projects with low business value. Before we start a data science project, we need to make sure the problem statement is worth exploring. Ideally, it should come from the business owners.

During the project, we constantly communicate with the business owners about what we are building and check whether we are heading in the right direction. This is because the business owners will ultimately be the ones using the product or insights.
