I read somewhere that when dealing with computation problems involving billions of things, it’s a good idea to divide the work amongst multiple processes.
You can also use multiple threads but I’ll stick to multiprocessing since this more closely reflects how Databricks works.
Databricks is a multiprocessing platform. Some problems that at first blush do not appear appropriate for Databricks can be a great fit if you think about them differently.
Consider the problem of estimating pi using a Monte Carlo simulation. You can think of the estimation method this way: throw darts at random points in a unit square that contains a quarter circle of radius 1. The fraction of darts that land inside the quarter circle approximates its area, pi/4, so multiplying that fraction by 4 gives an estimate of pi. The more darts you throw, the better the estimate.
The code below shows a serial implementation:

```python
from random import random
from time import time
import sys


def throw_dart() -> int:
    """Throw one dart at the unit square; return 1 if it lands inside the quarter circle."""
    x = random()
    y = random()
    if (x * x) + (y * y) <= 1:
        return 1
    return 0


def main(iterations: int):
    hits = 0
    start = time()
    for _ in range(iterations):
        hits = hits + throw_dart()
    end = time()
    pi = (4 * hits) / iterations
    print(pi)
    print(f"Execution time: {end - start} seconds.")


if __name__ == "__main__":
    main(int(sys.argv[1]))
```
Here are the execution times on my workstation.
The execution times scale linearly, which is not surprising since the serial algorithm is _O(n)_. When running on one core, the algorithm cries uncle at 1 billion throws.
Python has a handy `multiprocessing` module that is built for divide-and-conquer problems like this one. So how would you alter the serial code so that it can run across multiple processes?
Here is what I decided to do:
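The gist is to split the total number of throws into chunks and hand each chunk to a worker process. A minimal sketch of that approach, using `multiprocessing.Pool` (the chunking scheme and function names here are my own assumptions, not necessarily the original implementation):

```python
from multiprocessing import Pool
from random import random
from time import time
import sys


def count_hits(iterations: int) -> int:
    """Throw `iterations` darts; count how many land inside the quarter circle.

    Doing the whole chunk inside the worker avoids per-dart inter-process
    overhead, which would otherwise swamp any parallel speedup.
    """
    hits = 0
    for _ in range(iterations):
        x = random()
        y = random()
        if (x * x) + (y * y) <= 1:
            hits += 1
    return hits


def main(iterations: int, processes: int):
    start = time()
    # Split the total throws evenly; give any remainder to the last worker.
    chunk = iterations // processes
    chunks = [chunk] * processes
    chunks[-1] += iterations - chunk * processes
    with Pool(processes) as pool:
        hits = sum(pool.map(count_hits, chunks))
    end = time()
    pi = (4 * hits) / iterations
    print(pi)
    print(f"Execution time: {end - start} seconds.")


if __name__ == "__main__":
    # e.g. python estimate_pi.py 100000000 8
    main(int(sys.argv[1]), int(sys.argv[2]))
```

Each worker returns only a single integer, so the combine step is just a `sum` over the pool's results.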
#azure-databricks #multiprocessing #azure