Multiprocessing Made Easy(ier) with Databricks

I read somewhere that when dealing with computational problems involving billions of operations, it’s a good idea to divide the work amongst multiple processes.

You could also use multiple threads, but I’ll stick to multiprocessing since it more closely reflects how Databricks works (and since CPython’s global interpreter lock keeps threads from speeding up CPU-bound code anyway).

Databricks is, at its heart, a parallel-processing platform. There are problems that at first blush do not appear to be a good fit for Databricks, yet can be a great fit if you think about them differently.

Consider the problem of estimating pi using a Monte-Carlo simulation. You can think of the estimation method this way:

  1. Throw a dart at a random spot on a square dartboard that has a circle inscribed in it.
  2. If the dart lands inside the circle, you get 1 point.
  3. Repeat steps 1 & 2 until you’re sick of it.
  4. Add up your points, multiply by 4, and divide by the number of throws. This gives you an estimate of pi.

The more darts you throw, the better the estimate: the fraction of darts landing inside the circle approaches the ratio of the circle’s area to the square’s, which is π/4.

Serial Implementation

The code below shows a serial implementation:

	from random import random
	from time import time
	import sys

	def throw_dart():
	    # Pick a random point in the unit square; score a hit if it
	    # falls inside the quarter circle of radius 1.
	    x = random()
	    y = random()

	    if (x * x) + (y * y) <= 1:
	        return 1

	    return 0

	def main(iterations: int):
	    hits = 0
	    start = time()

	    for _ in range(iterations):
	        hits += throw_dart()

	    end = time()

	    # Hits / throws approximates pi / 4, so multiply by 4.
	    pi = (4 * hits) / iterations
	    print(pi)
	    print(f"Execution time: {end - start} seconds.")

	if __name__ == "__main__":
	    main(int(sys.argv[1]))
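
If you save this as pi_serial.py (a filename I’m choosing here), you can run it from the command line with an iteration count, e.g. python pi_serial.py 100000000 to throw 100 million darts.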

The execution times on my workstation scale linearly, which is not surprising since the serial algorithm is O(n). When running on one core, the algorithm cries uncle at 1 billion throws.

Parallel Implementation using Multiprocessing

Python ships with a handy multiprocessing module that is built for divide-and-conquer problems like this one. So how would you alter the serial code so that it can run across multiple processes?

Here is what I decided to do (a sketch follows the list):

  1. Divide the number of iterations by the number of processes I want to spawn. This is how many darts each process throws.
  2. Spawn the appropriate number of processes.
  3. Tell each process to throw the darts and keep track of its score.
  4. When all processes have finished, add up the number of hits, multiply by 4, and divide by the total throws to get the estimate of pi.
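
Here is a minimal sketch of that plan using Python’s multiprocessing.Pool. The function names and the choice of one worker per CPU core are mine, so treat this as one way to implement the steps rather than the definitive version:

	from multiprocessing import Pool, cpu_count
	from random import random, seed
	from time import time
	import sys

	def throw_darts(iterations: int) -> int:
	    # Re-seed in each worker; forked processes would otherwise
	    # inherit the parent's random state and throw identical darts.
	    seed()

	    hits = 0
	    for _ in range(iterations):
	        x = random()
	        y = random()
	        if (x * x) + (y * y) <= 1:
	            hits += 1
	    return hits

	def main(iterations: int, processes: int):
	    # Step 1: divide the throws evenly among the processes.
	    per_process = iterations // processes

	    start = time()

	    # Steps 2 & 3: spawn the processes and have each one throw
	    # its share of darts, keeping its own score.
	    with Pool(processes) as pool:
	        results = pool.map(throw_darts, [per_process] * processes)

	    end = time()

	    # Step 4: add up the hits, multiply by 4, and divide by the
	    # total number of throws to get the estimate of pi.
	    total_throws = per_process * processes
	    pi = (4 * sum(results)) / total_throws
	    print(pi)
	    print(f"Execution time: {end - start} seconds.")

	if __name__ == "__main__":
	    main(int(sys.argv[1]), cpu_count())

Because each process keeps its own score, the workers never need to talk to each other; the only coordination happens at the end, when pool.map hands the per-process hit counts back to be summed.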
