When we want to work on a data science or machine learning problem, it is nice to find that a dataset suitable for it is already available and ready to use on a platform like Kaggle. It makes our life much easier, because collecting data can be a difficult and slow process. Data is the new gold. By making our datasets public and by promoting open-source thinking among data science and machine learning practitioners, we can accelerate progress in this field. A good place to do so is Kaggle, which is for data scientists what GitHub is for software developers.

If we happen to have collected an interesting dataset, it is good practice to publish it on Kaggle so that others can use it too. Doing so can also increase our reputation on Kaggle, which may help us in getting a job in the field.

Let us get started.


Now, assuming you already have a dataset to publish, the first step is to create the dataset entry. From your Kaggle homepage, go to the “Data” tab in the left panel.

Next, click on “New Dataset” to create your dataset entry.

A dialog now opens where you can give your dataset a name, edit its URL, and upload the files.
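The name and URL from the dialog can also be prepared in a file. As a minimal sketch, assuming you use the official Kaggle command-line tool, which reads a `dataset-metadata.json` file from the folder it uploads, the snippet below writes that file. The helper name `write_dataset_metadata`, the default license, and any username or slug you pass in are illustrative assumptions, not something taken from the dialog.

```python
import json
from pathlib import Path


def write_dataset_metadata(folder, title, owner, slug, license_name="CC0-1.0"):
    """Write the dataset-metadata.json file that the Kaggle CLI looks for.

    `owner` is your Kaggle username and `slug` is the part of the
    dataset URL that identifies this dataset.
    """
    metadata = {
        "title": title,                        # name shown on the dataset page
        "id": f"{owner}/{slug}",               # determines the dataset URL
        "licenses": [{"name": license_name}],  # assumed default, change as needed
    }
    path = Path(folder) / "dataset-metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path
```

With the data files placed in the same folder, the Kaggle CLI should be able to create the dataset entry from it. Check the tool's own help output for the exact command and options.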

If your dataset is large, you can upload an archive and Kaggle will automatically decompress it, so that anyone who visits its page can see the individual files in it.
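Creating such an archive takes only a few lines. The sketch below uses Python's standard `shutil.make_archive` to pack a folder into a single `.zip` file; the function name `zip_dataset_folder` and the choice to write the archive next to the folder are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def zip_dataset_folder(folder):
    """Pack every file under `folder` into <folder>.zip, next to the folder.

    Returns the path of the archive that was created.
    """
    folder = Path(folder).resolve()
    # The archive is written beside the folder, not inside it,
    # so it cannot end up containing itself.
    archive_stem = folder.parent / folder.name
    # make_archive appends the ".zip" extension on its own.
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=str(folder))
    return Path(archive)
```

Paths inside the archive are relative to the folder, so the files should appear on the dataset page without an extra top-level directory.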

Note the “private” icon in the bottom-right corner of the dialog. By default, a new dataset is private, so that only you and the people you specify can access it. This is the preferred way to create it: add the extra information, make sure everything is OK, and then make it public. You can also create it directly as public by toggling the private/public button in the dialog.
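The Kaggle command-line tool follows the same private-by-default convention. As a hedged sketch, the helper below only builds the argument list for `kaggle datasets create`; the `-p` and `--public` options are given as I understand them from the CLI's help text, so confirm them with `kaggle datasets create --help` before relying on this. The function name `kaggle_create_command` is hypothetical.

```python
def kaggle_create_command(folder, public=False):
    """Build the argument list for `kaggle datasets create`.

    The dataset is created as private unless `public` is True,
    mirroring the default of the dialog.
    """
    command = ["kaggle", "datasets", "create", "-p", str(folder)]
    if public:
        # Assumed flag for creating the dataset as public right away.
        command.append("--public")
    return command
```

Passing the result to `subprocess.run(..., check=True)` would run the upload, provided the Kaggle CLI is installed and an API token is configured.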

As an example, I will upload a dataset of Medium articles scraped using Python and Beautiful Soup. If you are interested in how I collected this data, you can read my previous post here.


Publishing your first dataset on Kaggle