Simple steps to improve your data science projects and get noticed
The global COVID-19 pandemic has left many people with a lot of time on their hands to work on their data science project portfolios. With everyone applying to jobs, how can you make sure that yours stands out? Read on to find out.
Iris, Galton, Titanic, Northwind Traders, Superstore, Go Data Warehouse. While you were studying data science in school, you no doubt came across at least one of these data sets. There is a reason for that: they help demonstrate concepts like clustering, regression, logistic regression, database structures, data visualization, and report building. Each data set is clean and small, but that isn’t all they have in common: everyone has worked with them. There are no new or exciting projects being built on these training data sets. No recruiter is going to look at your Titanic project (one of the most popular data sets on Kaggle) and say, “We need this person on our team.”
We live in the data age, and that means there is no shortage of data sets that are easily available for download. Get your data from somewhere more exciting than Kaggle or the data you learned machine learning on. A good place to branch out to is Data.gov. In 2013 President Obama signed an executive order making open and machine-readable data the new default for government information. This means that there is a wealth of searchable information ready to download right from Data.gov. Federal student loan program data, federal aid to states data, and accidental drug-related deaths are just a few of the over 200,000 data sets available for your use. Just make sure to look at the metadata provided with each file so you understand what you are working with.
Want to make things a little more personal? Use your own data! Anything you do can be turned into data. Many gyms are closed during stay-at-home orders, so maybe you’re working out from home. All of those exercises you are doing can be tracked. Look at how many reps you are doing, which muscle groups you are working on, and which days you are working out. The best part about using your own data is that you are the subject matter expert. You may end up with smaller data sets to work with, but you will have a much deeper understanding of how the data was captured, and you control whether new variables or dimensions get added to it.
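As a rough illustration of what a personal workout log could look like once it is turned into data, here is a minimal sketch. The log entries, exercise names, and column layout are all made up for this example; your own tracking would likely capture different fields.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical log: (ISO date, exercise, muscle group, reps).
workouts = [
    ("2020-04-06", "push-up", "chest", 40),
    ("2020-04-06", "squat", "legs", 60),
    ("2020-04-08", "push-up", "chest", 45),
    ("2020-04-10", "lunge", "legs", 30),
]

# Total reps per muscle group.
reps_by_group = defaultdict(int)
for _day, _exercise, group, reps in workouts:
    reps_by_group[group] += reps

# Count distinct workout days per weekday (0 = Monday ... 6 = Sunday).
workout_days = {date.fromisoformat(day) for day, *_ in workouts}
days_by_weekday = Counter(d.weekday() for d in workout_days)

print(dict(reps_by_group))
print(dict(days_by_weekday))
```

Because you recorded the data yourself, adding a new dimension (weight used, rest time, how you felt) is just a matter of adding a field to each entry.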
Do none of these sound advanced enough? Take a look at web scraping. Web scraping is the automated process of collecting unstructured data from the internet. You will have to write code in a language such as R or Python to capture the data. You will also have to do your own research about the values you capture and how the website you are scraping got those values. The end result will be much more distinctive, but it will take more work to learn about the data and clean what you collect.
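To give a feel for the parsing half of web scraping, here is a minimal sketch using only Python's standard library. The HTML is an inline, made-up sample so the example runs without a network connection; the `TableScraper` class and the city and price values are assumptions for illustration, not taken from any real site. In a real project you would first download the page and check that the site's terms allow scraping.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Price</th></tr>
  <tr><td>Austin</td><td>$1,200</td></tr>
  <tr><td>Denver</td><td>$1,450</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows

# Scraped values arrive as text, so cleaning starts immediately.
prices = {city: int(price.replace("$", "").replace(",", ""))
          for city, price in records}
print(header, prices)
```

Note that even this tiny example needs a cleaning step: the prices come back as strings with currency symbols and thousands separators.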
Speaking of data cleaning, real-world data is disgusting, so be sure to wear your face mask while working with it. Jokes aside, when someone asks for a model that uses data to predict customer churn, there is almost never a clean, ready-to-use data source to build that model from. Most classes will not prepare you to handle the sorts of dirty data that organizations have available. This is a critical skill that you need to showcase in at least one of your projects.
There are many tasks that can be associated with cleaning data. A good place to start is understanding the data. Government and publicly available data will often have a data dictionary containing descriptions of each dimension, measure, observation, and table in the data. This will help you understand what data was collected, when it was collected, and who collected it.
Understanding what you are looking at is key to data validation. Once you know what a variable is, you may be able to use the data dictionary, common sense, or a subject matter expert to determine which values don’t make sense. For example, temperatures should fall in a certain range of values. If the data dictionary specifies the unit of measurement as kelvins, any 0 or negative values would be suspect. If it is air temperature data from Bermuda, warm temperatures would make sense, and anything too hot or too cold for the island would be suspect. For something like manufacturing welding temperatures, you may want to look to a professor or engineering student for more guidance. The key in this step is to find values that don’t look right.
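A range check like the one described above can be sketched in a few lines. The readings and the plausible bounds below are invented for illustration; in practice the bounds would come from the data dictionary or a subject matter expert.

```python
def flag_out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical air temperature readings, documented as kelvins.
readings_kelvin = [293.1, 295.4, 0.0, 291.8, -4.0, 2950.0]

# Assumed plausible range for outdoor air temperature, in kelvins.
suspect = flag_out_of_range(readings_kelvin, low=250.0, high=330.0)
print(suspect)
```

The function only flags values; deciding whether a flagged reading is an error, a unit mix-up, or a genuine extreme still takes context.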
Another area to look into is how to handle missing values. Like data validation, context matters when handling missing values. If you are looking at financial loan data for cars and a loan is in good standing and has a missing value for repossession status, you won’t be worried about that value being missing. If your project involves psychological assessments and you are missing answers to a lot of questions, you may take a different course of action, like eliminating the observation. Sometimes missing values make sense in your context. As with data validation, work with your subject matter experts and peers to understand what to do with missing values.
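The two scenarios above call for different treatment, which a short sketch can make concrete. The field names, category labels, and the 50% threshold are assumptions chosen for this example, not rules; the right choices depend on your data and your subject matter experts.

```python
# Hypothetical car loan records; None marks a missing value.
loans = [
    {"loan_id": 1, "standing": "good", "repossession_status": None},
    {"loan_id": 2, "standing": "default", "repossession_status": "repossessed"},
    {"loan_id": 3, "standing": "default", "repossession_status": None},
]


def resolve_repossession(loan):
    """Interpret a missing repossession status using the loan's standing."""
    if loan["repossession_status"] is not None:
        return loan["repossession_status"]
    if loan["standing"] == "good":
        return "not applicable"  # missing value makes sense here
    return "unknown"  # missing value that deserves a follow-up


statuses = [resolve_repossession(loan) for loan in loans]


def drop_sparse(responses, max_missing_share=0.5):
    """Keep only respondents whose share of missing answers is acceptable."""
    kept = []
    for answers in responses:
        missing = sum(a is None for a in answers)
        if missing / len(answers) <= max_missing_share:
            kept.append(answers)
    return kept


# Hypothetical assessment answers, one list per respondent.
surveys = [[3, 4, None, 5], [None, None, None, 2], [1, 2, 3, 4]]
usable = drop_sparse(surveys)
print(statuses, usable)
```

The loan example keeps every record and labels the gap, while the assessment example removes the observation entirely, which mirrors the point that context drives the decision.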