The global COVID-19 pandemic has left many of us with extra time to work on our data science project portfolios. With everyone applying to jobs, how can you make sure yours stands out? Read on to find out.


1. Use more unique data


Photo by Randy Fath on Unsplash

Iris, Galton, Titanic, Northwind Traders, Superstore, Go Data Warehouse. While you were studying data science in school, you no doubt came across at least one of these data sets. There is a reason for that: they help demonstrate concepts like clustering, regression, logistic regression, database structures, data visualization, and report building. Each data set is clean and small, but that isn’t all they have in common: everyone has worked with these data sets. There are no new or exciting projects being built on the training data sets. No recruiter is going to look at your Titanic project (one of the most popular data sets on Kaggle) and say, “We need this person on our team.”

There are no new or exciting projects being built on the training data sets

We live in the data age, and that means there is no shortage of data sets readily available for download. Get your data from somewhere more exciting than Kaggle or the sets you learned machine learning on. A good place to branch out is Data.gov. In 2013, President Obama signed an executive order making open and machine-readable data the new default for government information. That means there is a wealth of searchable information ready to download right from Data.gov. Federal student loan program data, federal aid to states, and accidental drug-related deaths are just a few of the more than 200,000 data sets available for your use. Just make sure to read the metadata provided with each file so you understand what you are working with.
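If you would rather search the catalog from code, catalog.data.gov exposes a standard CKAN API. Here is a minimal sketch; the query is just an example, and the endpoint details may change, so check the site’s API documentation:

```python
import requests

# Search the Data.gov catalog through its CKAN API
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "student loan", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

for dataset in resp.json()["result"]["results"]:
    print(dataset["title"])
    # Each dataset lists downloadable resources (CSV, JSON, etc.)
    for res in dataset.get("resources", []):
        print("   ", res.get("format"), res.get("url"))
```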

Want to make things a little more personal? Use your own data! Anything you do can be turned into data. Many gyms are closed during stay-at-home orders, so maybe you’re working out from home. All of those exercises can be tracked: how many reps you do, which muscle groups you work, and which days you train. The best part about using your own data is that you are the subject matter expert. You may end up with smaller data sets to work with, but you will have a much deeper understanding of how the data was captured and full control over adding new variables or dimensions.
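For instance, a few weeks of logged workouts fit comfortably in a pandas DataFrame. The entries below are made up; in practice you would load your own log from a spreadsheet or CSV:

```python
import pandas as pd

# A hypothetical home-workout log; real data might come from pd.read_csv()
workouts = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-01", "2020-05-01", "2020-05-03"]),
    "exercise": ["push-up", "squat", "push-up"],
    "muscle_group": ["chest", "legs", "chest"],
    "reps": [20, 15, 25],
})

# Total reps per muscle group
print(workouts.groupby("muscle_group")["reps"].sum())

# Which days of the week you actually train on
print(workouts["date"].dt.day_name().value_counts())
```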

None of these sound advanced enough? Take a look at web scraping, the automated process of collecting unstructured data from the internet. You will have to write code in a language such as R or Python to capture the data, and you will have to do your own research into the values you capture and how the website you are scraping arrived at them. The end result will be far more distinctive, but it also creates more work: you have to learn about the data and then clean what you collected.
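Here is a minimal sketch of that workflow using requests and BeautifulSoup; the URL and the CSS selectors are placeholders you would replace after inspecting the real page (and checking that its terms of service allow scraping):

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Placeholder target; swap in the page you actually want to scrape
URL = "https://example.com/products"

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# The tag and class names are hypothetical; inspect the real page's
# HTML to find the elements that hold the values you care about
rows = []
for item in soup.select("div.product"):
    name = item.select_one("h2")
    price = item.select_one("span.price")
    if name and price:
        rows.append({
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True),
        })

print(rows)  # still raw text: the cleaning work starts here
```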

2. Do a data cleaning project


Photo by Kelly Sikkema on Unsplash

Speaking of data cleaning: real-world data is disgusting, so be sure to wear your face mask while working with it. Jokes aside, when someone asks for a model that uses data to predict customer churn, there is almost never a clean, ready-to-use data source to build that model from. Most classes will not prepare you to handle the sorts of dirty data that organizations actually have. This is a critical skill that you need to showcase in at least one of your projects.

Speaking of data cleaning: real-world data is disgusting

There are many tasks that can be associated with cleaning data. A good place to start is understanding the data. Government and publicly available data will often have a data dictionary containing descriptions of each dimension, measure, observation, and table in the data. This will help you understand what data was collected, when it was collected, and who collected it.
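Once you have the data dictionary in hand, a quick first pass in pandas helps confirm the file matches its documentation. The file name here is hypothetical:

```python
import pandas as pd

# Hypothetical file; pair whatever you load with its data dictionary
df = pd.read_csv("accidental_drug_deaths.csv")

print(df.shape)   # how many observations and dimensions you actually got
print(df.dtypes)  # do the types match what the dictionary claims?
print(df.describe(include="all"))  # quick summary of every column
```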

Understanding what you are looking at is key to data validation. Once you know what a variable represents, you can use the data dictionary, common sense, or a subject matter expert to determine which values don’t make sense. For example, temperatures should fall within a certain range. If the data dictionary specifies the units as Kelvin, any zero or negative values would be suspect. If it is temperature data from Bermuda, warm readings would make sense, and anything too hot or too cold for the island would be suspect. For something like manufacturing welding temperatures, you may want to ask a professor or engineering student for more guidance. The key in this step is to find values that don’t look right.
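As a concrete version of that Kelvin check, here is a small pandas sketch; the readings are invented, and the 330 K upper bound is an assumption that suits weather-style data:

```python
import pandas as pd

# Toy temperature readings in Kelvin (column name is an assumption)
readings = pd.DataFrame({"temp_kelvin": [293.2, 0.0, -5.0, 310.7, 1200.0]})

# Kelvin can never be zero or negative; the upper bound depends on
# context (1200 K would be absurd for weather data)
suspect = readings[(readings["temp_kelvin"] <= 0) | (readings["temp_kelvin"] > 330)]
print(suspect)
```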

Another area to look into is how to handle missing values. Like data validation, handling missing values depends on context. If you are looking at auto loan data and a loan in good standing has a missing value for repossession status, you won’t be worried about that gap. If your project involves psychological assessments and answers to many questions are missing, you may take a different course of action, like eliminating the observation. Sometimes missing values make sense in your context. As with data validation, work with your subject matter experts and peers to decide what to do with missing values.

Sometimes missing values make sense in your context
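Here is a pandas sketch of that loan example; the column names and values are invented. The point is that a meaningful gap can be encoded as a feature rather than dropped or imputed:

```python
import pandas as pd
import numpy as np

# Toy loan records: a missing repossession date on a current loan is expected
loans = pd.DataFrame({
    "loan_status": ["current", "current", "defaulted"],
    "repossession_date": [np.nan, np.nan, "2020-03-15"],
})

# Count missing values per column before deciding how to treat them
print(loans.isna().sum())

# Encode the meaningful missingness explicitly instead of dropping rows
loans["repossessed"] = loans["repossession_date"].notna()
print(loans)
```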

