Homomorphic encryption intro:

Part 1: Overview and use cases

Part 2: HE landscape and CKKS

Part 3: Encoding and decoding in CKKS

Introduction

Advancements in machine learning algorithms have resulted in widespread adoption across industries. Nonetheless, areas dealing with sensitive and private data, such as healthcare or finance, have lagged behind due to regulatory constraints designed to protect users’ data.

With the emergence of Machine Learning as a Service, entities now provide model inference as a service. We can distinguish three parties in such scenarios: a model owner, such as a hospital, who has trained a model; a host, such as a cloud provider, providing computing power; and a client wanting to benefit from the service. The model owner can also be the host in some scenarios. Trust must be established between these parties, as the client does not want her data to be leaked, and the model owner wants to protect her model.

In this article, we will have a quick look at Machine Learning and the advances it has brought. Then we will see how the data dependence of Machine Learning makes it unsuitable for some sensitive use cases, and how new solutions can make training and inference of models possible on encrypted data. Finally, we will focus on Homomorphic Encryption and see what use cases it can cover.

This article is non-technical and is aimed at a broad audience. The following articles will dig deeper into the technicalities of Homomorphic Encryption, with both the theory and a Python implementation of a Homomorphic Encryption scheme.

I. The privacy concerns around data exploitation

Machine Learning, and especially Deep Learning, the sub-field of Machine Learning focused on Deep Neural Networks (DNNs), has advanced the state of the art on a wide variety of tasks. Diverse domains such as image recognition with ResNet, text processing with BERT, and speech generation with WaveNet have all seen massive improvements thanks to Deep Learning, while other models fell behind by a considerable margin.

The underlying principle of Deep Learning is to train models on massive amounts of data by optimizing the model’s loss. By doing so, Deep Learning has managed to reach human-like performance, as it is able to find complex patterns that would be invisible to the human eye.

Machine Learning therefore seems to pave the way for new technological leaps, but it relies heavily on data, both for training and for inference.

Most of the data used in these Deep Learning models comes from public and non-personal sources, such as WikiText or ImageNet. Nonetheless, other scenarios might require more sensitive data: a speech-to-text model might require people to record their voices, a diagnosis tool will require private health data to be sent, and a credit analysis tool might need to look at financial information.

While we saw how powerful Machine Learning is, we see here that some sensitive use cases cannot be directly addressed by such an approach, as the data is too sensitive to be shared, either for training or for inference. There therefore seems to be a trade-off between data privacy and data efficiency: one can uphold the confidentiality of private data by imposing strict processes where data is only seen by a trusted human expert, at the cost of making the process long and expensive, or use trained models with excellent performance and scalability, but at the expense of exposing data during training and inference.

However, several techniques have emerged in the past years that make it possible to reconcile privacy and efficiency. Among them, three seem the most promising:

  • Homomorphic Encryption (HE) is a public-key cryptographic scheme. The user creates a pair of secret and public keys, uses the public key to encrypt her data, then sends it to a third party which performs computations on the encrypted data. Thanks to the homomorphic properties of encryption and decryption, the user can take the encrypted result and decrypt it with her secret key to obtain the output of the computation on her data, without ever having shown it in the clear to the third party (a minimal sketch of this workflow follows this list).
  • Secure Multi-Party Computation (SMPC) is a different paradigm which relies more on communication between the participants. The data, as well as the model, can be split into shares, and each actor only sends a few shares of her data, so that the others cannot reconstruct the initial data but can still participate and perform computations on the shares. Once each party has finished, everything can be aggregated and the result is revealed to each party (see the secret-sharing sketch after this list).
  • Trusted Execution Environments (TEE) enable the development of software with hardware guarantees of privacy. Intel’s SGX technology provides an implementation of such a system: its enclave technology allows programs to be executed in isolation from other programs. All inbound and outbound data is encrypted, and computation in the clear only happens within the enclave. The enclave’s code and integrity can then be checked externally.

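To make the HE workflow more concrete, here is a minimal sketch in Python. It uses the Paillier cryptosystem, an additively homomorphic scheme (not the CKKS scheme covered in later parts of this series), with tiny hard-coded primes chosen purely for illustration; a real deployment would use a vetted library and much larger keys.

```python
import random
from math import gcd

# Toy Paillier cryptosystem: a public-key scheme that is additively
# homomorphic, i.e. multiplying two ciphertexts adds the underlying
# plaintexts. Primes are tiny and hard-coded for illustration only.

def keygen():
    p, q = 1789, 1847                 # toy primes; real keys use ~1024-bit primes
    n = p * q
    n2 = n * n
    lam = (p - 1) * (q - 1)           # a multiple of lcm(p-1, q-1), which suffices here
    g = n + 1
    # mu = (L(g^lam mod n^2))^-1 mod n, with L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    r = random.randrange(1, n)
    while gcd(r, n) != 1:             # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# The client encrypts her values; the third party combines the ciphertexts
# without ever seeing 42 or 58; only the client can decrypt the result.
pk, sk = keygen()
c1, c2 = encrypt(pk, 42), encrypt(pk, 58)
c_sum = (c1 * c2) % (pk[0] ** 2)      # homomorphic addition of the plaintexts
assert decrypt(pk, sk, c_sum) == 100
```

Paillier only supports additions on encrypted data; schemes such as CKKS, discussed in Parts 2 and 3, also allow multiplications on encrypted values.

Similarly, the intuition behind SMPC can be illustrated with additive secret sharing. The sketch below makes simplifying assumptions: honest parties, a sum as the only computation, and an arbitrarily chosen public prime modulus.

```python
import random

P = 2**61 - 1  # a public prime modulus, chosen arbitrarily for this sketch

def share(secret, n_parties=3):
    """Split a secret into n_parties additive shares; any subset of
    fewer than n_parties shares reveals nothing about the secret."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two secrets are split among three parties...
a_shares, b_shares = share(25), share(17)
# ...each party adds the shares she holds, without seeing the secrets...
sum_shares = [(a + b) % P for a, b in zip(a_shares, b_shares)]
# ...and only the aggregated result is revealed.
assert reconstruct(sum_shares) == 42
```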