Hi all,
I’ve been very busy applying to internships and working on research. However, I wanted to take some time to write a short blog post on multiprocessing in Python. I recently had to re-learn (whoops) how to do it, and wanted to share the info here in case it can help someone else!
First off, what is multiprocessing? And how does it differ from multithreading?
Let’s first think about the difference between a process and a thread. Put simply, a process is a full program and its execution, with its own memory space. A thread is a segment of a process, and it shares memory with the other threads in that process.
So, multithreading runs multiple threads within a single process (on one processor), whereas multiprocessing runs multiple separate processes, each with its own thread (or threads), on multiple, separate processors.
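To make that concrete, here’s a minimal sketch (using the standard library’s multiprocessing module rather than anything from this post’s code) showing that the two APIs look nearly identical, even though threads share one interpreter while processes each get their own:

```python
import math
from multiprocessing import Pool               # separate processes
from multiprocessing.pool import ThreadPool    # threads in one process

def work(n):
    return math.sqrt(n)

if __name__ == '__main__':
    # Threads share one interpreter; for CPU-bound Python code the GIL
    # usually means only one thread runs bytecode at a time.
    with ThreadPool(2) as tp:
        print(tp.map(work, [1, 4, 9]))   # [1.0, 2.0, 3.0]

    # Processes each get their own interpreter and memory, so they can
    # run truly in parallel on separate cores.
    with Pool(2) as pp:
        print(pp.map(work, [1, 4, 9]))   # [1.0, 2.0, 3.0]
```

Same inputs, same outputs; the difference is in how the work is scheduled under the hood.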
For the purposes of this blog post we will work through an example of multiprocessing. Sometimes in Python we write a for loop to run some function a significant number of times. For example, I was recently working with a dataset where, for each row, I wanted to take some data within the row and calculate or return something; in my case, I wanted to run some text data through a classifier. Now, if the function doesn’t require the previous row’s result and can be run independently, this is a prime use case for multiprocessing, which will decrease the total computation time of your program. If you have 10,000 rows, a single processor will be slower than multiple processors, since with multiple processors you can essentially run multiple rows of your dataset through your function at the exact same time.
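As a sketch of that row-by-row use case, here’s what it might look like with the standard library’s multiprocessing module and a hypothetical classify function standing in for a real classifier (the function and labels below are made up for illustration):

```python
from multiprocessing import Pool

# Hypothetical stand-in for a real text classifier: label a piece of
# text by its length. Each row is independent of the others, which is
# exactly what lets the rows be processed in parallel.
def classify(text):
    return "long" if len(text) > 10 else "short"

if __name__ == '__main__':
    rows = ["hi", "a much longer document", "ok", "another fairly long one"]
    with Pool(processes=2) as pool:
        labels = pool.map(classify, rows)   # one result per row, in order
    print(labels)   # ['short', 'long', 'short', 'long']
```

If each call were a real classifier doing real work, two workers could chew through the rows roughly twice as fast as one.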
Below is a very simple example of multiprocessing to provide some set-up code. The code below is broken into three main sections: imports, task, main function.
```python
import pandas as pd
import numpy as np
import multiprocess as mp
from tqdm import tqdm
import math

def task(arg):
    return math.sqrt(arg)

if __name__ == '__main__':
    pool = mp.Pool(processes=2)  # number of cores
    inputs = range(1, 5000)  # range we are iterating over
    results = []
    for result in tqdm(pool.imap(task, inputs), total=len(inputs)):
        results.append(result)
```
So, first the imports (lines 1-5). I imported pandas and numpy out of habit. You’ll need to import multiprocess (or the standard library’s multiprocessing, which it is a drop-in fork of) in order to do multiprocessing, and if you also import tqdm we can display a progress bar of the work. The last import is math so we can compute a square root.
GOAL: Find and store the square root of the numbers 1 through 4999.
What’s our task then? i.e. what is the function we want to run on every number from 1 to 4999? We want the square root, so our task is the square root. So we define our task function to take in an argument, and return the square root of the argument (lines 7 & 8).
Now, what goes in our main function (lines 10-15)? This is where the multiprocessing magic happens! We first want to create our Pool, and put in the number of processes we want to use; in this case my computer has 2 cores (embarrassing), so processes=2. Our inputs are the numbers 1 through 4999. We initialize an empty list to store our results and then write our for loop to save the results. Recall that tqdm isn’t scary, it is just included so a progress bar shows up when we run the code, which is technically unnecessary but for longer projects it is very useful to know how far along you are.
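(If you’re not sure how many cores your machine has, you don’t have to guess; the standard library can tell you, and the result is a sensible default for processes=...)

```python
import multiprocessing as mp

# Number of CPUs visible to this machine; a reasonable default
# for Pool(processes=...).
print(mp.cpu_count())
```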
What about pool.imap()? Well, the documentation tells us it takes in a function and an iterable, so in our case the function we want to complete is our task function, and we are iterating over our input values, range(1,5000). Note, the output of imap will be ordered, so our results list will have the results in order. However, if you mess around with print statements you’ll see that though the results come out ordered, the program does not run our function “in order”, i.e. it doesn’t start with 1 and 2, then 3 and 4. Lastly, we append each result to our results list.
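You can see the ordering guarantee for yourself by comparing imap with its sibling imap_unordered, which yields each result as soon as a worker finishes it (this sketch uses the standard library’s multiprocessing, which has the same Pool API):

```python
import math
from multiprocessing import Pool

def task(arg):
    return math.sqrt(arg)

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        # imap yields results in input order, even though the workers
        # may actually finish them out of order behind the scenes.
        ordered = list(pool.imap(task, range(1, 6)))
        print(ordered)   # [1.0, 1.414..., 1.732..., 2.0, 2.236...]

        # imap_unordered yields each result as soon as it is ready,
        # so the output order can differ from the input order.
        unordered = list(pool.imap_unordered(task, range(1, 6)))
        print(sorted(unordered) == ordered)   # same values either way
```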
note: when I run this with processes=2, I get about 1961.24it/s (iterations per second) and when I run with processes=1 it’s 2044.88it/s. So is it doubly fast with 2 cores versus 1? No. In fact, here it’s slightly slower! Taking a square root is so cheap that the overhead of shipping inputs and results between processes outweighs the parallelism. The speedup shows up when each call does real work, like running a classifier.
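One knob for cutting that overhead on cheap tasks is imap’s chunksize parameter, which sends the inputs to the workers in batches instead of one at a time (the chunk size of 250 below is just an illustrative guess, not a tuned value):

```python
import math
import time
from multiprocessing import Pool

def task(arg):
    return math.sqrt(arg)

if __name__ == '__main__':
    inputs = range(1, 5000)
    with Pool(processes=2) as pool:
        # Default chunksize=1 sends items one at a time: lots of
        # inter-process traffic for a task this cheap.
        start = time.perf_counter()
        list(pool.imap(task, inputs))
        print(f"chunksize=1:   {time.perf_counter() - start:.3f}s")

        # Larger chunks amortize the communication cost across
        # many items per round trip.
        start = time.perf_counter()
        list(pool.imap(task, inputs, chunksize=250))
        print(f"chunksize=250: {time.perf_counter() - start:.3f}s")
```

Either way the results are identical and still come back in order; only how the work is parceled out changes.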
Now, how do we know it worked? Well, if you open your Activity Monitor while the code is running, rather than seeing one python process you’ll see two! (or n, where n is the number of processes you use).
Hopefully this short little tutorial helps you get started with multiprocessing, or makes you realize that more of the code you write *could* use multiprocessing to speed things up.
Additional Resources:
https://superfastpython.com/multiprocessing-for-loop/
https://www.geeksforgeeks.org/multithreading-python-set-1/
https://www.indeed.com/career-advice/career-development/multithreading-vs-multiprocessing
https://towardsdatascience.com/multithreading-and-multiprocessing-in-10-minutes-20d9b3c6a867