Many thanks to renowned data scientist for his inspiration and guidance on this tutorial.
The image above illustrates a non-IID dataset (IID stands for Independent and Identically Distributed). A collection of random variables (images in our case) is independent and identically distributed if each random variable (image) has the same probability distribution as the others and all are mutually independent. In part 1 of this series, we used the CIFAR10 dataset, which is an example of IID data, but a real-world use case calls for a non-IID dataset that represents the real-world scenario. So, what is non-IID data? What changes should be made to the current dataset (CIFAR10) to obtain non-IID data for Federated Learning?
These are some of the questions that will be answered in this tutorial. This blog is part 2 of the series Preserving Data Privacy in Deep Learning and focuses on converting CIFAR10 into a non-IID dataset that is then divided among the clients. After completing this tutorial, you will know:
An analogy to the non-IID dataset. Photo by Harsh Yadav
Real-life data (referring to objects, values, attributes, and other aspects) is essentially not independent and identically distributed (non-IID). In contrast, most existing analytical and machine learning methods assume IID data. So, there needs to be a proper approach to handle this type of real-world dataset. This tutorial lays the foundations of a non-IID dataset and thus sets the stage for implementing various federated learning techniques that tackle the problem of getting insights from non-IID data. Non-IIDness is a common problem that causes unstable performance in deep learning models. In the literature, the non-IID image classification problem is largely understudied.
NICO (Non-IID Image dataset with contexts) is one such benchmark dataset that can be further used to develop state-of-the-art machine learning models to tackle non-IID data.
In this series, CIFAR10 is used as the benchmark dataset and is then converted into a non-IID dataset. To learn more about the basics of federated learning, please head over to part 1 of this series. In this tutorial, we will create two different types of dataset: one replicates real-life data, i.e. the **real-world dataset**, and the other is an extreme example of a non-IID dataset.
**REAL-WORLD DATASET:** CIFAR10 is randomly divided among the given number of clients. So, a client can have images from any number of classes; say, one client has images from only 1 class and another client has images from 5 classes. This type of dataset replicates the real-world scenario where clients can have different types of images.
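To make the idea concrete, here is a minimal sketch of one way such a split could be produced. It is not necessarily the method used later in this series: the helper name `split_real_world`, the rule that each client draws a random number of classes, and the even sharing of a class among the clients that hold it are all assumptions made for illustration. Synthetic labels stand in for CIFAR10 so the snippet runs without downloading anything.

```python
import numpy as np


def split_real_world(labels, num_clients, num_classes=10, seed=0):
    """Hypothetical helper: split sample indices among clients so that
    each client holds images from a random number of classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Assumption: each client first draws how many classes it holds
    # (between 1 and num_classes), then which classes those are.
    client_classes = [
        rng.choice(num_classes, size=rng.integers(1, num_classes + 1), replace=False)
        for _ in range(num_clients)
    ]

    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        holders = [k for k, classes in enumerate(client_classes) if c in classes]
        if not holders:
            continue  # no client drew this class, so its images stay unassigned
        # Shuffle the indices of class c and share them among its holders.
        idx = rng.permutation(np.flatnonzero(labels == c))
        for k, part in zip(holders, np.array_split(idx, len(holders))):
            client_indices[k].extend(part.tolist())

    parts = [np.array(ix, dtype=np.int64) for ix in client_indices]
    return parts, client_classes


# Synthetic stand-in for CIFAR10 labels: 10 classes, 100 samples each.
# With torchvision, the labels could instead be taken from the dataset's
# `targets` attribute.
labels = np.repeat(np.arange(10), 100)
parts, classes = split_real_world(labels, num_clients=5)

for k, ix in enumerate(parts):
    # Per-class image counts for client k; zeros mark classes it does not hold.
    print(f"client {k}:", np.bincount(labels[ix], minlength=10))
```

Each printed row shows how many images of each class one client received, so clients with many zeros hold only a few classes. Note that under these assumptions a class nobody draws is left out entirely, which may or may not be acceptable for a given experiment.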
Preserving Data Privacy in Deep Learning | Part 2: Distribution of a balanced dataset into a non-IID/real-world dataset, further divided into clients for federated learning.