Table of Contents

  1. Introduction
  2. Kaggle
  3. Datasets
  4. Summary
  5. References

Introduction

Over time, you might notice the same datasets appearing in data science blogs, undergraduate studies, graduate courses, and online learning. Some reflect current events, while others are general yet extremely popular datasets used for practicing and showcasing data science techniques and processes. The most important aspect of these datasets is that they can promote the greater good by bringing intelligent minds together to solve a pressing issue. There are several sites where datasets are hosted, but I find myself going back to the same one: Kaggle. This platform offers countless datasets and ranks them by trending metrics. I will be discussing four of the top 10 data science datasets right now.

As data becomes easier to obtain, the focus shifts even more toward what you do with it. These datasets highlight certain calls to action, tasks, and inspirations, so if you are unsure how to handle the data, this part of the dataset information can be quite useful.

Kaggle

Kaggle [2] is a platform for data analysts, data scientists, and machine learning engineers that allows them to collaborate on solving problems, compete, and, overall, learn from one another. At the time of writing, there are nearly 46,000 datasets on Kaggle. You can filter the datasets by ‘Hottest’, ‘Most Votes’, ‘New’, ‘Updated’, and ‘Usability’.

The datasets I describe in this article are four of the top 10 when sorted by the ‘Hottest’ filter.

Datasets

Below, I will highlight the names, descriptions, and facts about four of the most popular datasets on Kaggle. Some datasets also have calls to action, tasks, inspiration, and prizes. Of course, in these unprecedented times, the top dataset pertains to COVID-19.

Description —

This dataset has around 7,900 votes. Its main purpose is to serve as an artificial intelligence (AI) challenge with AI2, CZI, MSR, Georgetown, NIH, and The White House. This open dataset was released in response to the COVID-19 pandemic and consists of nearly 15 GB of data. There are about 17 tasks associated with it, for example, ‘What do we know about COVID-19 risk factors?’. Data scientists are encouraged to apply natural language processing and other AI techniques to the dataset to ultimately support the fight against this prevalent disease.

This alone is what separates Kaggle from other dataset websites: it encourages people from different backgrounds to come together to tackle a pressing problem.
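To give a rough idea of what applying natural language processing to a task question could look like, here is a minimal sketch in Python. It ranks a few documents by how often they contain terms from the task question quoted above. Everything in it is an assumption made for illustration: the abstracts are invented placeholder strings rather than records from the Kaggle dataset, and the names `tokenize`, `score`, and `rank` are hypothetical helpers, not part of any Kaggle tooling.

```python
import re
from collections import Counter

# Task question quoted in the dataset description above.
QUERY = "What do we know about COVID-19 risk factors?"

# Hypothetical placeholder abstracts, invented for illustration only.
# They are NOT taken from the Kaggle dataset.
ABSTRACTS = {
    "paper_a": "Placeholder text about risk factors and other risk considerations.",
    "paper_b": "Placeholder text about protein structure prediction.",
    "paper_c": "Placeholder text mentioning a single risk factor.",
}

# A tiny, hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"what", "do", "we", "know", "about", "a", "an", "the", "and", "of"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def score(query, document):
    """Count how many tokens in the document match a term from the query."""
    query_terms = set(tokenize(query))
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in query_terms)


def rank(query, abstracts):
    """Return the keys of matching abstracts, highest score first (ties by key)."""
    scored = [(score(query, text), key) for key, text in abstracts.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [key for value, key in ordered if value > 0]


if __name__ == "__main__":
    print(rank(QUERY, ABSTRACTS))
```

Plain term overlap is about the simplest possible baseline. It has no stemming, so ‘factor’ does not match ‘factors’, and a real attempt at one of the tasks would likely need something stronger, such as TF-IDF weighting or a pretrained language model.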

Along with the description, a dataset has other key features, including the ‘Call to Action’ and ‘Prizes’.


The Top Data Science Datasets Right Now