When you work in data science, one of the hardest parts is discovering which data to use to solve a business problem.

Remember that before trying to get data to solve a problem, you need to understand the context of the business and the project. By context I mean all the specifics of how a company runs its projects, how the company is set up, who its competitors are, how many departments it has, the different objectives and goals they have, and how they measure success or failure.

Once you have all of that, you can start thinking about getting the data required to solve the business problem. In this article I won’t talk much about data collection. Instead, I want to discuss and show you the process of enriching the data you already have with new data.

Remember that getting new data has to be done systematically. It’s not just pulling data out of nowhere: we have to do it consistently, plan it, and create a process for it. This depends on engineering, architecture, DataOps, and more things that I’ll discuss in other articles.

Setting up the environment

In this article, we will be using three things: Python, GitHub, and Explorium. If you want to know more about Explorium, check this article:

Where is the data? Or how to enrich your datasets and create new features automatically.

Let’s start by creating a new git repo. Here we will store our data, code, and documentation. Go to your terminal, create a new folder, and move into it:

mkdir data_discovery
cd data_discovery

Then initialize the git repo:

git init

Now create a remote repo on GitHub. Then go back to your terminal and type (change the URL to yours):

git remote add origin https://github.com/FavioVazquez/data_discovery.git

Now let’s check that the remote was added:

git remote -v

You should see (with your own URL of course):

origin https://github.com/FavioVazquez/data_discovery.git (fetch)
origin https://github.com/FavioVazquez/data_discovery.git (push)

Now let’s start a Readme file (I’m using Vim):

vim Readme.md

And write whatever you want in there.

Now let’s add the file to git:

git add .

And create our first commit:

git commit -m "Add readme file"

Finally, let’s push this to our GitHub repo:

git push --set-upstream origin master
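
If you want to rehearse the steps above before touching GitHub, here is a minimal sketch that runs them in a throwaway directory. It rests on a few assumptions that are not part of the original walkthrough: a local bare repository stands in for the GitHub remote, the Readme is written with echo instead of Vim, a demo identity is passed inline so the commit works on a machine with no git config, and the branch name is looked up instead of hard-coded, since depending on your Git setup the first branch may be called master or main.

```shell
#!/bin/sh
# Sketch: rehearse the repo setup in a temporary directory.
# A local bare repo stands in for GitHub, so nothing leaves your machine.
set -e

workdir="$(mktemp -d)"
git init --quiet --bare "$workdir/remote.git"

mkdir "$workdir/data_discovery"
cd "$workdir/data_discovery"
git init --quiet

# Point "origin" at the stand-in remote (use your own GitHub URL in practice).
git remote add origin "$workdir/remote.git"

# Stand-in for editing the Readme by hand.
echo "# data_discovery" > Readme.md
git add .

# The identity is set inline only so the sketch runs without a global git config.
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add readme file"

# Push whatever the current branch is called (master or main).
branch="$(git symbolic-ref --short HEAD)"
git push --quiet --set-upstream origin "$branch"

# Read the latest commit subject back from the stand-in remote.
pushed_subject="$(git --git-dir="$workdir/remote.git" log -1 --format=%s "$branch")"
echo "$pushed_subject"
```

If the last line prints the commit message, the same sequence should work against your real remote once you swap in the GitHub URL.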

By this point, your GitHub repo should show the Readme file and your first commit.

Finding the data

Let’s find some data. I’m not going to go through the whole data science process here (understanding the data, exploring it, modeling, and so on). I’m just going to find some interesting data as a demo for you.
