Preserving Data Privacy in Deep Learning

Image for post

Many thanks to renowned data scientist for his inspiration and guidance on this tutorial.

The image above illustrates a non-IID dataset, where IID stands for independent and identically distributed. A collection of random variables (images, in our case) is IID if every random variable (image) has the same probability distribution as the others and all are mutually independent. In part 1 of this series, we used the CIFAR10 dataset as an example of IID data, but a real-world use case calls for a non-IID dataset. So, what is non-IID data? What changes must be made to the current dataset (CIFAR10) to obtain non-IID data for federated learning?

These are some of the questions that this tutorial answers. This blog is part 2 of the series Preserving Data Privacy in Deep Learning and focuses on converting CIFAR10 into a non-IID dataset that is then divided among the clients. After completing this tutorial, you will know:

Non-IID Dataset
Conversion of a balanced dataset into the non-IID and real-world datasets
Forming clients encapsulating a part of this non-IID/real-world dataset
Use cases of image classification using federated learning


An analogy to the non-IID dataset. Photo by Harsh Yadav

Real-life data (objects, values, attributes, and other aspects) is essentially not independent and identically distributed (non-IID), whereas most existing analytical and machine learning methods assume IID data. A proper approach is therefore needed to handle such real-world datasets. This tutorial lays the foundations of non-IID datasets and sets the stage for implementing federated learning techniques that extract insights from non-IID data. Non-IIDness is a common problem that causes unstable performance in deep learning models, yet the non-IID image classification problem is largely understudied in the literature.

NICO (Non-IID Image dataset with contexts) is one such benchmark dataset that can be further used to develop state-of-the-art machine learning models to tackle non-IID data.
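To make the IID versus non-IID distinction concrete, here is a minimal sketch that compares two ways of sharding the same labels across clients. It uses synthetic labels as a stand-in for CIFAR10 (10 classes, 100 samples each) so that it runs without downloading anything; the names `split_evenly`, `class_counts`, and the constants are illustrative assumptions, not code from this series.

```python
import random
from collections import Counter

NUM_CLASSES = 10         # CIFAR10 has 10 classes
SAMPLES_PER_CLASS = 100  # assumption: small synthetic stand-in, not the real dataset size
NUM_CLIENTS = 5

# Synthetic, perfectly balanced labels: 0,0,...,1,1,...,9,9
labels = [c for c in range(NUM_CLASSES) for _ in range(SAMPLES_PER_CLASS)]


def split_evenly(indices, num_clients):
    """Cut a list of sample indices into equally sized contiguous shards."""
    size = len(indices) // num_clients
    return [indices[k * size:(k + 1) * size] for k in range(num_clients)]


def class_counts(shard):
    """Count how many samples of each class a shard contains."""
    return Counter(labels[i] for i in shard)


# IID-style split: shuffle first, so every shard sees roughly the same class mix.
rng = random.Random(0)
shuffled = list(range(len(labels)))
rng.shuffle(shuffled)
iid_shards = split_evenly(shuffled, NUM_CLIENTS)

# Non-IID split: sort by label first, so every shard sees only a few classes.
sorted_idx = sorted(range(len(labels)), key=lambda i: labels[i])
non_iid_shards = split_evenly(sorted_idx, NUM_CLIENTS)

for name, shards in (("IID", iid_shards), ("non-IID", non_iid_shards)):
    for k, shard in enumerate(shards):
        print(name, "client", k, sorted(class_counts(shard).items()))
```

With the shuffled split, every client's class histogram looks like a small copy of the whole dataset. With the label-sorted split, each client holds only two classes, so no single client's data follows the overall distribution.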

In this series, CIFAR10 is used as the benchmark dataset and is then converted into a non-IID dataset. To learn more about the basics of federated learning, please head over to part 1 of this series. In this tutorial, we will create two different types of dataset: one replicates real-life data, i.e. the **real-world dataset**, and the other is an extreme example of a non-IID dataset.

**REAL-WORLD DATASET:** CIFAR10 is randomly divided among the given number of clients, so a client can have images from any number of classes. For example, one client may have images from only 1 class while another has images from 5 classes. This type of dataset replicates the real-world scenario where clients can have different types of images.
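One possible way to sketch such a split is shown below. This is an illustrative assumption rather than the exact procedure used in this series: each client draws a random number of classes, and the images of each class are dealt round-robin to the clients that drew it. The function name `real_world_split` and the synthetic labels are hypothetical; with real data, the labels would come from the CIFAR10 training set.

```python
import random
from collections import defaultdict


def real_world_split(labels, num_clients, num_classes=10, seed=0):
    """Hypothetical helper: give each client images from a random set of classes.

    Returns (clients, client_classes), where clients[k] is a list of sample
    indices and client_classes[k] is the list of classes client k drew.
    """
    rng = random.Random(seed)

    # Group sample indices by class and shuffle within each class.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)

    # Each client draws between 1 and num_classes distinct classes.
    client_classes = [
        rng.sample(range(num_classes), rng.randint(1, num_classes))
        for _ in range(num_clients)
    ]

    # Deal the images of each class round-robin to the clients that drew it.
    clients = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        takers = [k for k, cls in enumerate(client_classes) if c in cls]
        if not takers:
            continue  # no client drew this class; its images stay unassigned
        for j, idx in enumerate(by_class[c]):
            clients[takers[j % len(takers)]].append(idx)
    return clients, client_classes


# Synthetic stand-in for CIFAR10 labels: 10 classes, 100 samples each.
labels = [c for c in range(10) for _ in range(100)]
clients, client_classes = real_world_split(labels, num_clients=5)

for k, (idxs, cls) in enumerate(zip(clients, client_classes)):
    print(f"client {k}: {len(idxs)} images from classes {sorted(cls)}")
```

In a PyTorch setting, each index list could then be wrapped in `torch.utils.data.Subset` to build a per-client dataset. Note that in this sketch, images of a class that no client drew are simply left out, which is a design choice of the example rather than a requirement of the approach.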


towardsdatascience.com
