10x Faster Parallel Python Without Python Multiprocessing
While Python’s multiprocessing library has been used successfully for a wide range of applications, in this blog post, we show that it falls short for several important classes of applications including numerical data processing, stateful computation, and computation with expensive initialization. There are two main reasons: inefficient handling of numerical data, and missing abstractions for stateful computation (i.e., an inability to share variables between separate “tasks”).
Ray is a fast, simple framework for building and running distributed applications that addresses these issues. For an introduction to some of the basic concepts, see this blog post. Ray leverages Apache Arrow for efficient data handling and provides task and actor abstractions for distributed computing.
This blog post benchmarks three workloads that aren’t easily expressed with Python multiprocessing and compares Ray, Python multiprocessing, and serial Python code. Note that it’s important to always compare to optimized single-threaded code.
In these benchmarks, Ray is 10–30x faster than serial Python, 5–25x faster than multiprocessing, and 5–15x faster than the faster of these two on a large machine.
The benchmarks were run on EC2 using the m5 instance types (m5.large for 1 physical core and m5.24xlarge for 48 physical cores). Code for running all of the benchmarks is available here. Abbreviated snippets are included in this post. The main differences are that the full benchmarks include 1) timing and printing code, 2) code for warming up the Ray object store, and 3) code for adapting the benchmark to smaller machines.
Many machine learning, scientific computing, and data analysis workloads make heavy use of large arrays of data. For example, an array may represent a large image or dataset, and an application may wish to have multiple tasks analyze the image. Handling numerical data efficiently is critical.
Each pass through the for loop below takes 0.84s with Ray, 7.5s with Python multiprocessing, and 24s with serial Python (on 48 physical cores). This performance gap explains why it is possible to build libraries like Modin on top of Ray but not on top of other libraries.
The code looks as follows with Ray.
```python
import numpy as np
import psutil
import ray
import scipy.signal

num_cpus = psutil.cpu_count(logical=False)

ray.init(num_cpus=num_cpus)

@ray.remote
def f(image, random_filter):
    # Do some image processing.
    return scipy.signal.convolve2d(image, random_filter)[::5, ::5]

filters = [np.random.normal(size=(4, 4)) for _ in range(num_cpus)]

# Time the code below.
for _ in range(10):
    image = np.zeros((3000, 3000))
    image_id = ray.put(image)
    ray.get([f.remote(image_id, filters[i]) for i in range(num_cpus)])
```
By calling `ray.put(image)`, the large array is stored in shared memory and can be accessed by all of the worker processes without creating copies. This works not just with arrays but also with objects that contain arrays (like lists of arrays). When the workers execute the `f` task, the results are again stored in shared memory. Then when the script calls `ray.get([...])`, it creates numpy arrays backed by shared memory without having to deserialize or copy the values.
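To make the zero-copy behavior concrete, here is a minimal sketch (our own toy example, not part of the benchmarks): because the array returned by `ray.get` is backed by the shared-memory object store rather than a private copy, it comes back read-only.

```python
import numpy as np
import ray

ray.init()

# Store a large array in the shared-memory object store once.
array = np.ones((3000, 3000))
array_id = ray.put(array)

# ray.get returns a numpy array backed by shared memory (no copy is made).
result = ray.get(array_id)

# Shared-memory arrays are read-only, which is one way to see that no
# private copy was created for this process.
assert not result.flags.writeable
```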
The code looks as follows with Python multiprocessing.
```python
from multiprocessing import Pool
import numpy as np
import psutil
import scipy.signal

num_cpus = psutil.cpu_count(logical=False)

def f(args):
    image, random_filter = args
    # Do some image processing.
    return scipy.signal.convolve2d(image, random_filter)[::5, ::5]

pool = Pool(num_cpus)

filters = [np.random.normal(size=(4, 4)) for _ in range(num_cpus)]

# Time the code below.
for _ in range(10):
    image = np.zeros((3000, 3000))
    pool.map(f, zip(num_cpus * [image], filters))
```
The difference here is that Python multiprocessing uses pickle to serialize large objects when passing them between processes. This approach requires each process to create its own copy of the data, which adds substantial memory usage as well as the overhead of expensive deserialization. Ray avoids this by using the Apache Arrow data layout for zero-copy serialization along with the Plasma shared-memory object store.
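To get a rough feel for that overhead in isolation, here is a toy sketch (timings will vary by machine) that round-trips the same 3000x3000 array through pickle, which is essentially what happens once per worker in the multiprocessing version:

```python
import pickle
import time

import numpy as np

image = np.zeros((3000, 3000))  # ~72 MB of float64 data.

start = time.time()
serialized = pickle.dumps(image)         # Copy 1: the byte string.
deserialized = pickle.loads(serialized)  # Copy 2: the reconstructed array.
print('pickle round trip took %.3f seconds' % (time.time() - start))
```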
Workloads that require substantial “state” to be shared between many small units of work are another category that poses a challenge for Python multiprocessing. This pattern is extremely common, and I illustrate it here with a toy stream processing application.
State is often encapsulated in Python classes, and Ray provides an actor abstraction so that classes can be used in the parallel and distributed setting. In contrast, Python multiprocessing doesn’t provide a natural way to parallelize Python classes, and so the user often needs to pass the relevant state around between `map` calls. This strategy can be tricky to implement in practice (many Python variables are not easily serializable) and it can be slow when it does work.
Below is a toy example that uses parallel tasks to process one document at a time, extract the prefixes of each word, and return the most common prefixes at the end. The prefix counts are stored in the actor state and mutated by the different tasks.
This example takes 3.2s with Ray, 21s with Python multiprocessing, and 54s with serial Python (on 48 physical cores).
The Ray version looks as follows.
```python
from collections import defaultdict
import numpy as np
import psutil
import ray

num_cpus = psutil.cpu_count(logical=False)

ray.init(num_cpus=num_cpus)

@ray.remote
class StreamingPrefixCount(object):
    def __init__(self):
        self.prefix_count = defaultdict(int)
        self.popular_prefixes = set()

    def add_document(self, document):
        for word in document:
            for i in range(1, len(word)):
                prefix = word[:i]
                self.prefix_count[prefix] += 1
                if self.prefix_count[prefix] > 3:
                    self.popular_prefixes.add(prefix)

    def get_popular(self):
        return self.popular_prefixes

streaming_actors = [StreamingPrefixCount.remote() for _ in range(num_cpus)]

# Time the code below.
for i in range(num_cpus * 10):
    document = [np.random.bytes(20) for _ in range(10000)]
    streaming_actors[i % num_cpus].add_document.remote(document)

# Aggregate all of the results.
results = ray.get([actor.get_popular.remote() for actor in streaming_actors])
popular_prefixes = set()
for prefixes in results:
    popular_prefixes |= prefixes
```
Ray performs well here because Ray’s abstractions fit the problem at hand. This application needs a way to encapsulate and mutate state in the distributed setting, and actors fit the bill.
The multiprocessing version looks as follows.
```python
from collections import defaultdict
from multiprocessing import Pool
import numpy as np
import psutil

num_cpus = psutil.cpu_count(logical=False)

def accumulate_prefixes(args):
    running_prefix_count, running_popular_prefixes, document = args
    for word in document:
        for i in range(1, len(word)):
            prefix = word[:i]
            running_prefix_count[prefix] += 1
            if running_prefix_count[prefix] > 3:
                running_popular_prefixes.add(prefix)
    return running_prefix_count, running_popular_prefixes

pool = Pool(num_cpus)

running_prefix_counts = [defaultdict(int) for _ in range(num_cpus)]
running_popular_prefixes = [set() for _ in range(num_cpus)]

for i in range(10):
    documents = [[np.random.bytes(20) for _ in range(10000)]
                 for _ in range(num_cpus)]
    results = pool.map(
        accumulate_prefixes,
        zip(running_prefix_counts, running_popular_prefixes, documents))
    running_prefix_counts = [result[0] for result in results]
    running_popular_prefixes = [result[1] for result in results]

popular_prefixes = set()
for prefixes in running_popular_prefixes:
    popular_prefixes |= prefixes
```
The challenge here is that `pool.map` executes stateless functions, meaning that any variables produced in one `pool.map` call that you want to use in another `pool.map` call need to be returned from the first call and passed into the second call. For small objects, this approach is acceptable, but when large intermediate results need to be shared, the cost of passing them around is prohibitive (note that this wouldn’t be true if the variables were being shared between threads, but because they are being shared across process boundaries, the variables must be serialized into a string of bytes using a library like pickle).
Because it has to pass so much state around, the multiprocessing version looks extremely awkward, and in the end only achieves a small speedup over serial Python. In reality, you wouldn’t write code like this because you simply wouldn’t use Python multiprocessing for stream processing. Instead, you’d probably use a dedicated stream-processing framework. This example shows that Ray is well-suited for building such a framework or application.
One caveat is that there are many ways to use Python multiprocessing. In this example, we compare to `Pool.map` because it gives the closest API comparison. It should be possible to achieve better performance in this example by starting distinct processes and setting up multiple multiprocessing queues between them; however, that leads to a complex and brittle design, as sketched below.
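For a flavor of that queue-based design, here is a minimal sketch with a single worker process and a toy word-counting state (all names here are ours); the bookkeeping only grows as you add more workers and more message types:

```python
from multiprocessing import Process, Queue

def worker(in_queue, out_queue):
    # Long-lived state stays inside the worker process.
    counts = {}
    for word in iter(in_queue.get, None):  # None is the shutdown sentinel.
        counts[word] = counts.get(word, 0) + 1
        out_queue.put((word, counts[word]))

if __name__ == '__main__':
    in_queue, out_queue = Queue(), Queue()
    p = Process(target=worker, args=(in_queue, out_queue))
    p.start()

    for word in ['a', 'b', 'a']:
        in_queue.put(word)
    print([out_queue.get() for _ in range(3)])  # [('a', 1), ('b', 1), ('a', 2)]

    in_queue.put(None)  # Ask the worker to exit.
    p.join()
```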
In contrast to the previous example, many parallel computations don’t necessarily require intermediate computation to be shared between tasks, but benefit from it anyway. Even stateless computation can benefit from sharing state when the state is expensive to initialize.
Below is an example in which we want to load a saved neural net from disk and use it to classify a bunch of images in parallel.
This example takes 5s with Ray, 126s with Python multiprocessing, and 64s with serial Python (on 48 physical cores). In this case, the serial Python version uses many cores (via TensorFlow) to parallelize the computation and so it is not actually single threaded.
Suppose we’ve initially created the model by running the following.
```python
import tensorflow as tf

mnist = tf.keras.datasets.mnist.load_data()
x_train, y_train = mnist[0]
x_train = x_train / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

# Train the model.
model.fit(x_train, y_train, epochs=1)

# Save the model to disk.
filename = '/tmp/model'
model.save(filename)
```
Now we wish to load the model and use it to classify a bunch of images. We do this in batches because in the application the images may not all become available simultaneously and the image classification may need to be done in parallel with the data loading.
The Ray version looks as follows.
```python
import psutil
import ray
import sys
import tensorflow as tf

num_cpus = psutil.cpu_count(logical=False)

ray.init(num_cpus=num_cpus)

filename = '/tmp/model'

@ray.remote
class Model(object):
    def __init__(self, i):
        # Pin the actor to a specific core if we are on Linux to prevent
        # contention between the different actors since TensorFlow uses
        # multiple threads.
        if sys.platform == 'linux':
            psutil.Process().cpu_affinity([i])
        # Load the model and some data.
        self.model = tf.keras.models.load_model(filename)
        mnist = tf.keras.datasets.mnist.load_data()
        self.x_test = mnist[1][0] / 255.0

    def evaluate_next_batch(self):
        # Note that we reuse the same data over and over, but in a
        # real application, the data would be different each time.
        return self.model.predict(self.x_test)

actors = [Model.remote(i) for i in range(num_cpus)]

# Time the code below.

# Parallelize the evaluation of some test data.
for j in range(10):
    results = ray.get([actor.evaluate_next_batch.remote()
                       for actor in actors])
```
Loading the model is slow enough that we only want to do it once. The Ray version amortizes this cost by loading the model once in the actor’s constructor. If the model needs to be placed on a GPU, then initialization will be even more expensive.
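As an aside, the same pattern extends to GPUs. The sketch below (our own illustration, assuming a machine with at least one GPU) reserves a GPU for the actor; Ray sets CUDA_VISIBLE_DEVICES for the actor process, so TensorFlow only sees the device assigned to it.

```python
import ray
import tensorflow as tf

@ray.remote(num_gpus=1)
class GPUModel(object):
    def __init__(self):
        # Ray sets CUDA_VISIBLE_DEVICES for this process, so TensorFlow
        # places the model on the GPU reserved for this actor. The one-time
        # loading cost is amortized over many evaluate() calls.
        self.model = tf.keras.models.load_model('/tmp/model')

    def evaluate(self, batch):
        return self.model.predict(batch)
```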
The multiprocessing version is slower because it needs to reload the model in every `map` call, since the mapped functions are assumed to be stateless.
The multiprocessing version looks as follows. Note that in some cases, it is possible to achieve this using the `initializer` argument to `multiprocessing.Pool` (a sketch of that variant follows the code below). However, this is limited to the setting in which the initialization is the same for each process, doesn’t allow for different processes to perform different setup functions (e.g., loading different neural network models), and doesn’t allow for different tasks to be targeted to different workers.
```python
from multiprocessing import Pool
import psutil
import sys
import tensorflow as tf

num_cpus = psutil.cpu_count(logical=False)

filename = '/tmp/model'

def evaluate_next_batch(i):
    # Pin the process to a specific core if we are on Linux to prevent
    # contention between the different processes since TensorFlow uses
    # multiple threads.
    if sys.platform == 'linux':
        psutil.Process().cpu_affinity([i])
    model = tf.keras.models.load_model(filename)
    mnist = tf.keras.datasets.mnist.load_data()
    x_test = mnist[1][0] / 255.0
    return model.predict(x_test)

pool = Pool(num_cpus)

for _ in range(10):
    pool.map(evaluate_next_batch, range(num_cpus))
```
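For comparison, here is a rough sketch of the `initializer` variant mentioned above; it amortizes the model load across `map` calls within one pool, but every worker must run the identical setup:

```python
from multiprocessing import Pool
import psutil
import tensorflow as tf

filename = '/tmp/model'
model = None

def load_model():
    # Runs once in each worker process when the pool starts. Every
    # worker executes this same initializer, so they all load the
    # same model.
    global model
    model = tf.keras.models.load_model(filename)

def evaluate_next_batch(_):
    mnist = tf.keras.datasets.mnist.load_data()
    x_test = mnist[1][0] / 255.0
    return model.predict(x_test)

num_cpus = psutil.cpu_count(logical=False)
pool = Pool(num_cpus, initializer=load_model)

for _ in range(10):
    pool.map(evaluate_next_batch, range(num_cpus))
```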
What we’ve seen in all of these examples is that Ray’s performance comes not just from its performance optimizations but also from having abstractions that are appropriate for the tasks at hand. Stateful computation is important for many applications, and coercing stateful computation into stateless abstractions comes at a cost.
Before running these benchmarks, you will need to install the following.
```bash
pip install numpy psutil ray scipy tensorflow
```
If you have trouble installing `psutil`, then try using Anaconda Python.
The original benchmarks were run on EC2 using the m5 instance types (m5.large for 1 physical core and m5.24xlarge for 48 physical cores).
In order to launch an instance on AWS or GCP with the right configuration, you can use the Ray autoscaler and run the following command.

```bash
ray up config.yaml
```

An example `config.yaml` is provided here (for starting an m5.4xlarge instance).
While this blog post focuses on benchmarks between Ray and Python multiprocessing, an apples-to-apples comparison is challenging because these libraries are not very similar. Differences include the following.

- Ray is designed for scalability: the same code runs on a laptop as well as on a cluster, whereas multiprocessing only runs on a single machine.
- Ray workloads automatically recover from machine and process failures.
- Ray is designed in a language-agnostic manner and has preliminary support for Java.
More relevant links are below.