How to Build A Data Set For Your Machine Learning Project

A machine learning model can seem like a miracle, but it won't amount to anything if you don't feed it a good data set.

Originally published by Alexandre Gonfalonieri at

Are you thinking about AI for your organization? Have you identified a use case with a proven ROI? Perfect! But not so fast… do you have a data set?

Most companies struggle to build an AI-ready data set, or simply ignore the issue, so I thought this article might help you a little.

Let’s start with the basics…

A data set is a collection of data. In other words, a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question.

In Machine Learning projects, we need a training data set. It is the actual data set used to train the model for performing various actions.

Why do I need a data set?

ML depends heavily on data. Without data, it is impossible for an “AI” to learn. Data is the most crucial ingredient, the one that makes algorithm training possible… No matter how great your AI team is or how large your data set is, if your data set is not good enough, your entire AI project will fail! I have seen fantastic projects fail because we didn’t have a good data set, despite having the perfect use case and very skilled data scientists.

A supervised AI is trained on a corpus of training data.

During AI development, we always rely on data. From training, tuning and model selection to testing, we use three different data sets: the training set, the validation set, and the testing set. The validation set is used to tune and select the final ML model.

You might think that gathering data is enough, but it is the opposite. In every AI project, classifying and labeling data sets takes most of our time, especially when the data sets must be accurate enough to reflect a realistic view of the market/world.

I want to introduce you to the first two data sets we need, the training data set and the test data set. They serve different purposes during your AI project, and the success of the project depends heavily on them.

  1. The training data set is the one used to train an algorithm, such as a neural network, to learn and produce results. It includes both input data and the expected output.

Training sets make up the majority of the total data, around 60%. During training, the model's parameters are fit to the data in a process known as adjusting weights.

  2. The test data set is used to evaluate how well your algorithm was trained with the training data set. In AI projects, we can’t reuse the training data set in the testing stage because the algorithm would already know the expected output, which would tell us nothing about how it handles new data.

Testing sets represent around 20% of the data. The test set pairs input data with verified correct outputs, generally checked by a human.
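To make the proportions concrete, here is a minimal sketch of a 60/20/20 split using only the Python standard library. The function name `split_dataset`, the file names and the labels are made up for illustration, and I am assuming the remaining 20% goes to the validation set.

```python
import random

def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left, about 20% here
    return train, validation, test

# Hypothetical labeled examples: (file name, class id)
examples = [(f"image_{i}.jpg", i % 3) for i in range(100)]
train, validation, test = split_dataset(examples)
print(len(train), len(validation), len(test))  # 60 20 20
```

Shuffling before cutting matters: if the data is sorted by date or by class, an unshuffled split would give the test set a different distribution from the training set.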

Based on my experience, it is a bad idea to keep adjusting the model after the testing phase. Doing so will likely overfit the model to the test set.


What is overfitting?

A well-known issue for data scientists… Overfitting is a modeling error which occurs when a function is too closely fit to a limited set of data points.
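A toy illustration may help. The sketch below builds a noisy data set (the true rule is `x > 0.5`, with 20% of the labels flipped) and compares a model that memorizes every training point, a 1-nearest-neighbour lookup, with a deliberately simple threshold rule. Everything here is invented for the demonstration, and the simple rule is hard-coded for brevity rather than learned.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points x in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # simulate a labeling error
        data.append((x, label))
    return data

def memorizer_predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def threshold_predict(x):
    """A simple rule that ignores individual training points."""
    return x > 0.5

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(42)
train = make_data(200, rng)
test = make_data(200, rng)

memo_train = accuracy(lambda x: memorizer_predict(train, x), train)
memo_test = accuracy(lambda x: memorizer_predict(train, x), test)
simple_test = accuracy(threshold_predict, test)

print(f"memorizer: train={memo_train:.2f} test={memo_test:.2f}")
print(f"simple rule: test={simple_test:.2f}")
```

The memorizer scores perfectly on the data it has seen because it has also memorized the noise, and its score drops on new data. That gap between training and test accuracy is the usual symptom of overfitting.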

How much data is needed?

Every project is unique, but as a rule of thumb I’d say you need about 10 times as many examples as the model has parameters. The more complicated the task, the more data you need.

What type of data do I need?
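As a quick back-of-the-envelope check, the heuristic can be written down in a few lines. This is only a sketch: the helper names are invented, and I use a linear model as the example because its parameter count is easy to state (one weight per feature plus a bias).

```python
def linear_model_parameters(n_features):
    """A linear model has one weight per feature plus one bias term."""
    return n_features + 1

def rule_of_thumb_examples(n_parameters, factor=10):
    """Apply the 'ten times the parameter count' heuristic."""
    return factor * n_parameters

n_params = linear_model_parameters(n_features=25)
needed = rule_of_thumb_examples(n_params)
print(n_params, needed)  # 26 260
```

Treat the result as a starting estimate, not a guarantee. Noisy labels or a hard task can push the real requirement much higher.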

I always start AI projects by asking precise questions to the company decision-maker. What are you trying to achieve through AI? Based on your answer, you need to consider what data you actually need to address the question or problem you are working on. Make some assumptions about the data you require and be careful to record those assumptions so that you can test them later if needed.

Below are some questions to help you:

  • What data can you use for this project? You must have a clear picture of everything that you can use.
  • What data do you wish you had but is not available? I like this question since we can often find a way to simulate this data.

I have a data set, what now?

Not so fast! You should know that all data sets are inaccurate to some degree. At this point in the project, we need to do some data preparation, a very important step in the machine learning process. Basically, data preparation is about making your data set more suitable for machine learning. It is a set of procedures that consumes most of the time spent on machine learning projects.

Even if you have the data, you can still run into problems with its quality, as well as biases hidden within your training sets. To put it simply, the quality of training data determines the performance of machine learning systems.

Have you heard about AI biases? 

An AI can be easily influenced… Over the years, data scientists have found that some popular data sets used to train image recognition models contained gender biases.

As a consequence, AI applications are taking longer to build because we are trying to make sure that the data is correct and integrated properly.

What if I don’t have enough data?

It can happen that you lack the data required to integrate an AI solution. I am not gonna lie to you, it takes time to build an AI-ready data set if you still rely on paper documents or .csv files. I recommend first taking the time to build a modern data collection strategy.

If you already determined the objective of your ML solution, you can ask your team to spend time creating the data or outsource the process. In my latest project, the company wanted to build an image recognition model but had no pictures. As a consequence, we spent weeks taking pictures to build the data set and finding out ways for future customers to do it for us.

Do you have a data strategy? 

Creating a data-driven culture in an organization is perhaps the hardest part of being an AI specialist. When I try to explain why the company needs a data culture, I can see frustration in the eyes of most employees. Indeed, data collection can be an annoying task that burdens your employees. However, we can automate most of the data gathering process!

Another issue could be data accessibility and ownership… In many of my projects, I noticed that my clients had enough data, but that the data was locked away and hard to access. You must create connections between data silos in your organization. In order to get special insights, you must gather data from multiple sources.

Regarding ownership, compliance is also an issue with data sources — just because a company has access to information, doesn’t mean that it has the right to use it! Don’t hesitate to ask your legal team about this (GDPR in Europe is one example).

Quality, Scope and Quantity!

Machine Learning is not only about large data sets. Indeed, you don’t feed the system with every known data point in any related field. We want to feed the system with carefully curated data, hoping it can learn, and perhaps extend, at the margins, knowledge that people already have.

Most companies believe that it is enough to gather all the data they can, combine it and let the AI find insights. It is not.

When building a data set, you should aim for a diversity of data. I always recommend that companies gather both internal and external data. The goal is to build a unique data set that will be hard for your competitors to copy. Machine learning applications do require a large number of data points, but this doesn’t mean the model has to consider a wide range of features.

We want meaningful data related to the project. You may possess rich, detailed data on a topic that simply isn’t very useful. An AI expert will ask you precise questions about which fields really matter, and how those fields will likely matter to your application of the insights you get.

In my latest mission, I had to help a company build an image recognition model for Marketing purposes. The idea was to build and confirm a proof of concept. This company had no data set except some 3D renders of their products. We wanted the AI to recognize the product, read the packaging, determine if it was the right product for the customer and help them understand how to use it.
Our data set was composed of 15 products, and for each we managed to get 200 pictures. This number is justified by the fact that it was still a prototype; otherwise, I would have needed far more pictures! It also assumes that you are using transfer learning techniques.
When it comes to pictures, we needed different backgrounds, lighting conditions, angles, etc.
Every day, I used to select 20 pictures randomly from the training set and analyze them. It gave me a good idea of how diverse and accurate the data set was.
Every time I’ve done this, I have discovered something important regarding our data. It could be an unbalanced number of pictures with the same angle, incorrect labels, etc.
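That daily check is easy to script. Here is a minimal sketch, assuming each picture comes with a small metadata record. The fields `file`, `label` and `angle`, and the way the records are generated, are made up for this example and are not the actual project data.

```python
import random
from collections import Counter

def daily_audit_sample(training_set, k=20, seed=None):
    """Draw k examples at random for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(training_set, min(k, len(training_set)))

def summarize(sample, field):
    """Count how often each value of `field` appears in the sample."""
    return Counter(item[field] for item in sample)

# Hypothetical metadata: 15 products x 200 pictures = 3000 records
training_set = [
    {
        "file": f"img_{i:04d}.jpg",
        "label": f"product_{i % 15}",
        "angle": "front" if i % 4 else "side",
    }
    for i in range(3000)
]

sample = daily_audit_sample(training_set, k=20, seed=7)
print(summarize(sample, "angle"))  # spot an unbalanced number of angles
print(summarize(sample, "label"))  # spot over- or under-represented products
```

The counts only flag imbalance. Checking whether a label is actually correct still means opening the picture and looking at it.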

A good idea is to start with a model that has been pre-trained on a large existing data set and use transfer learning to fine-tune it on the smaller data set you’ve gathered.

Data Preprocessing

Alright, let’s get back to our data set. At this step, you have gathered the data that you judge essential, diverse and representative for your AI project. Preprocessing includes selecting the right data from the complete data set and building a training set. The process of putting the data together in this optimal format is known as feature transformation.

  1. Format: The data might be spread across different files, for example sales results from different countries with different currencies, languages, etc., which need to be gathered together to form a single data set.
  2. Data Cleaning: In this step, our goal is to deal with missing values and remove unwanted characters from the data.
  3. Feature Extraction: In this step, we focus on analysis and optimisation of the number of features. Usually, a member of the team has to find out which features are important for prediction and select them for faster computations and low memory consumption.
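The three steps above can be sketched on a tiny example. The rows, the exchange rates and the column names below are all invented, and the cleaning function assumes that "." is the decimal separator and that commas and spaces are thousands separators.

```python
import re

# Hypothetical exchange rates; real ones would come from a trusted source.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}

raw_rows = [
    {"country": "US", "currency": "USD", "amount": "1,200.50", "units": "10", "notes": "ok"},
    {"country": "FR", "currency": "EUR", "amount": "900", "units": None, "notes": "n/a"},
    {"country": "DE", "currency": "EUR", "amount": "€1 000", "units": "7", "notes": ""},
]

def clean_number(text):
    """Data cleaning: drop every character except digits and the decimal point."""
    if text is None:
        return None
    digits = re.sub(r"[^0-9.]", "", text)
    return float(digits) if digits else None

def preprocess(rows, keep=("country", "amount_usd", "units")):
    cleaned = []
    for row in rows:
        amount = clean_number(row["amount"])
        cleaned.append({
            "country": row["country"],
            # Format: bring every amount into a single currency.
            "amount_usd": None if amount is None
            else round(amount * RATES_TO_USD[row["currency"]], 2),
            "units": clean_number(row["units"]),
            "notes": row["notes"],
        })
    # Data cleaning: fill missing unit counts with the mean of the known ones.
    known = [r["units"] for r in cleaned if r["units"] is not None]
    mean_units = sum(known) / len(known)
    for r in cleaned:
        if r["units"] is None:
            r["units"] = mean_units
    # Feature selection: keep only the columns judged useful for prediction.
    return [{name: r[name] for name in keep} for r in cleaned]

for record in preprocess(raw_rows):
    print(record)
```

Filling gaps with the mean is only one option. Dropping the row or adding a "missing" flag can be a better choice, depending on why the value is missing.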


The perfect data strategy

The most successful AI projects are those that integrate a data collection strategy into the service/product life cycle. Indeed, data collection can’t be a series of one-off exercises. It must be built into the core product itself. Basically, every time a user engages with your product/service, you want to collect data from the interaction. The goal is to use this constant flow of new data to improve your product/service.

When you reach this level of data usage, every new customer you add makes the data set bigger and thus the product better, which attracts more customers, which makes the data set better, and so on. It is a kind of virtuous circle.


The best long-term ML projects are those that leverage dynamic, constantly updated data sets. The advantage of building such a data collection strategy is that it becomes very hard for your competitors to replicate your data set. The AI improves with more data, and in some cases, such as collaborative filtering, this is especially valuable. Collaborative filtering makes suggestions based on the similarity between users, so it improves with access to more data: the more user data one has, the more likely it is that the algorithm can find a similar user.
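To show why more user data helps, here is a minimal sketch of user-based collaborative filtering with cosine similarity. The users, items and ratings are invented, and real systems use far more refined methods. The point is only that a recommendation is possible once a similar user exists in the data.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: score}
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "carol": {"A": 1, "C": 5, "E": 4},
}

def cosine_similarity(u, v):
    """Cosine similarity between two users' rating vectors (unrated items count as 0)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(score * score for score in u.values()))
    norm_v = sqrt(sum(score * score for score in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Find the most similar other user and suggest their items the user has not rated."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine_similarity(ratings[user], ratings[name]))
    unseen = {item: score for item, score in ratings[nearest].items()
              if item not in ratings[user]}
    return nearest, sorted(unseen, key=unseen.get, reverse=True)

print(recommend("alice", ratings))  # ('bob', ['D'])
```

With only three users the nearest neighbour may still be a poor match. Every additional user raises the odds of finding someone whose tastes really are close.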

This means that you need a strategy for continuous improvement of your data set for as long as there’s any user benefit to better model accuracy. If you can, find creative ways to harness even weak signals to access larger data sets.

Once again, let me use the example of an image recognition model. In my last project, we designed a way for users to take pictures of our products and send them to us. These pictures would then be used to feed our AI system and make it smarter over time.

Another approach is to increase the efficiency of your labeling pipeline. For instance, we used to rely a lot on a system that suggested labels predicted by the initial version of the model so that labelers could make faster decisions.
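Such a label-suggestion step could look roughly like the sketch below. It is not the system we used: `fake_predict` is a stand-in for the initial model, the file names and scores are invented, and the 0.9 confidence threshold is an arbitrary assumption.

```python
def suggest_labels(items, predict, confidence_threshold=0.9):
    """Split items into confident suggestions and ones that need a full review."""
    quick_confirm, full_review = [], []
    for item in items:
        label, confidence = predict(item)
        if confidence >= confidence_threshold:
            # The labeler only has to accept or reject the suggestion.
            quick_confirm.append((item, label, confidence))
        else:
            # The model is unsure, so the labeler decides from scratch.
            full_review.append((item, label, confidence))
    return quick_confirm, full_review

def fake_predict(item):
    """Stand-in for the first version of the model (hypothetical scores)."""
    scores = {
        "img_1.jpg": ("shampoo", 0.97),
        "img_2.jpg": ("soap", 0.55),
        "img_3.jpg": ("shampoo", 0.91),
    }
    return scores[item]

quick_confirm, full_review = suggest_labels(
    ["img_1.jpg", "img_2.jpg", "img_3.jpg"], fake_predict
)
print(len(quick_confirm), "to confirm,", len(full_review), "to review")
```

A human still looks at every item in both queues. The gain is that confirming a good suggestion is much faster than labeling from nothing.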

Finally, I have seen companies just hiring more people to label new training inputs… It takes time and money but it works, though it can be difficult in organizations that don’t traditionally have a line item in their budget for this kind of expenditure.

Despite what most SaaS companies are saying, Machine Learning requires time and preparation. Whenever you hear the term AI, you must think about the data behind it. I hope that this article will help you understand the key role of data in ML projects and convince you to take time to reflect on your data strategy.
