# Training with Multiple Workers using TensorFlow Quantum

Training large machine learning models is a core capability of TensorFlow. Over the years, scale has become an important feature of modern machine learning systems for NLP, image recognition, drug discovery, and more. Using multiple machines to boost computational power and throughput has led to great advances in the field. Similarly, in quantum computing and quantum machine learning, the availability of more machine resources speeds up the simulation of larger quantum states and more complex systems. This tutorial walks through how to use TensorFlow and TensorFlow Quantum to conduct large-scale and distributed QML simulations. Running larger simulations with greater FLOP/s counts unlocks research possibilities that would otherwise be out of reach at smaller scales. The figure below outlines approximate scaling capabilities for several different hardware settings for quantum simulation.

This tutorial uses Kubernetes to simplify this process. [Kubernetes](https://kubernetes.io/) is an open-source container orchestration system and a proven platform for effectively managing large-scale workloads. While it is possible to have a multi-worker setup with a cluster of physical or virtual machines, Kubernetes offers many advantages, including:

*   Service discovery - workers can easily identify each other using well-known DNS names, rather than manually configuring IP destinations (see the `TF_CONFIG` sketch after this list).
*   Automatic bin-packing - your workloads are automatically scheduled on different machines based on resource demand and current consumption.
*   Automated rollouts and rollbacks - the number of worker replicas can be changed by updating a configuration, and Kubernetes automatically adds or removes workers in response, scheduling them on machines where resources are available.

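To make the service-discovery point concrete: a multi-worker TensorFlow job is typically wired together through the `TF_CONFIG` environment variable, and Kubernetes DNS names slot directly into it. The sketch below is a minimal, hypothetical example; the pod names (`qcnn-worker-0`, `qcnn-worker-1`), the headless service name (`training-service`), and the port are placeholders for whatever your cluster actually defines.

```python
import json
import os

# Stable DNS names that a Kubernetes headless service (here called
# "training-service") would give each worker pod -- no IP bookkeeping needed.
workers = [
    "qcnn-worker-0.training-service:2222",
    "qcnn-worker-1.training-service:2222",
]

# TF_CONFIG tells tf.distribute which peers exist and which one this pod is.
# In a real deployment, each pod sets its own "index" (e.g. from an env var).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": 0},
})
```
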
This tutorial guides you through a TensorFlow Quantum multi-worker setup using [Google Cloud](https://cloud.google.com/) products, including [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine), a managed Kubernetes platform. You will have the chance to take the single-worker [Quantum Convolutional Neural Network (QCNN) tutorial](https://www.tensorflow.org/quantum/tutorials/qcnn) in TensorFlow Quantum and augment it for multi-worker training.

From our experiments in the multi-worker setting, training a 23-qubit QCNN with 1,000 training examples (roughly 3,000 circuits simulated using full state vector simulation) takes 5 minutes per epoch on a 32-node (512-vCPU) cluster, which costs a few US dollars. By comparison, the same training job would take roughly 4 hours per epoch on a single worker. Pushing things a little further, [hundreds of thousands of 30-qubit circuits could be run in a few hours using more than 10,000 virtual CPUs](https://blog.tensorflow.org/2020/11/characterizing-quantum-advantage-in.html), which could have taken weeks in a single-worker setting. Actual performance and cost may vary depending on your cloud setup, such as the VM machine type and total cluster running time. Before performing larger experiments, we recommend starting with a small cluster, like the one used in this tutorial.

The source code for this tutorial is available in the [TensorFlow Quantum](https://github.com/tensorflow/quantum/tree/research/qcnn_multiworker) GitHub repository. `README.md` contains the quickest way to get this tutorial up and running. This tutorial instead walks through each step in detail, to help you understand the underlying concepts and integrate them with your own projects. Let’s get started!

### **1\. Setting up Infrastructure in Google Cloud**

### **2\. Preparing Your Kubernetes Cluster**

### **3\. Training with MultiWorkerMirroredStrategy**
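
As a preview of this step, the core pattern behind `tf.distribute.MultiWorkerMirroredStrategy` is to create the strategy first and then build and compile the model inside its scope, so variables are mirrored across workers. The sketch below uses a stand-in dense model rather than the tutorial's actual QCNN, purely to show the shape of the code:

```python
import tensorflow as tf

# The strategy reads TF_CONFIG (see above) to discover its peer workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Stand-in model; the tutorial builds a QCNN with TensorFlow Quantum here.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Every worker runs this same script; gradients are all-reduced across workers
# each step, so the mirrored copies of the model stay in sync.
# model.fit(train_dataset, epochs=5)
```

Note that each worker pod executes the identical program; only its `TF_CONFIG` task index differs.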

### **4\. Understanding Training Performance Using TensorBoard**
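
A typical way to get training metrics and profiling data out of the cluster is the standard Keras TensorBoard callback, pointed at shared storage that all workers (and your TensorBoard instance) can reach. The bucket path below is a placeholder:

```python
import tensorflow as tf

# Write logs to shared storage, e.g. a Cloud Storage bucket (placeholder name).
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir="gs://your-bucket/qcnn-logs",
    histogram_freq=1,        # log weight histograms once per epoch
    profile_batch="10,20",   # capture a profiler trace over batches 10-20
)

# Passed to fit(), this makes loss curves and the profiler's per-op timings
# visible in TensorBoard while the cluster trains.
# model.fit(train_dataset, epochs=5, callbacks=[tensorboard_callback])
```
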

### **5\. Running Inference**
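
Once training finishes, inference is ordinary Keras usage: load the saved model and call `predict`. The sketch below is deliberately generic; the path is a placeholder, and a real TFQ model would consume tensors of serialized circuits and typically needs its custom layers supplied via `custom_objects` when loading.

```python
import numpy as np
import tensorflow as tf

# Load the model that the training job saved (placeholder path).
model = tf.keras.models.load_model("gs://your-bucket/qcnn-model")

# Placeholder numeric inputs; a TFQ model would instead take batches of
# circuits converted with tfq.convert_to_tensor(...).
test_examples = np.random.rand(4, 8).astype("float32")
predictions = model.predict(test_examples)
print(predictions)
```
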

### **6\. Cleaning Up**
