[Animations: "Without parallelization" vs. "With parallelization"]
```
$ pip install pandarallel [--upgrade] [--user]
```
On Windows, Pandaral·lel works only if the Python session (`python`, `ipython`, `jupyter notebook`, `jupyter lab`, ...) is executed from Windows Subsystem for Linux (WSL).
On Linux & macOS, nothing special has to be done.
An example of each API is available below.
For some of the examples, here is a comparative benchmark with and without Pandaral·lel.
Computer used for this benchmark:
For the given examples, parallel operations run approximately 4x faster than the standard operations (except for `series.map`, which runs only 3.2x faster).
First, you have to import `pandarallel`:
```python
from pandarallel import pandarallel
```
Then, you have to initialize it:
```python
pandarallel.initialize()
```
This method takes 5 optional parameters:

- `shm_size_mb`: Deprecated.
- `nb_workers`: Number of workers used for parallelization (int). If not set, all available CPUs will be used.
- `progress_bar`: Display progress bars if set to `True` (bool, `False` by default).
- `verbose`: The verbosity level (int, `2` by default).
- `use_memory_fs`: Whether to use the memory file system to transfer data between the main process and the workers (bool, `None` by default). If set to `True`, a `SystemError` is raised if the memory file system is not available.

Using the memory file system reduces data transfer time between the main process and the workers, especially for big data.
The memory file system is considered available only if the directory `/dev/shm` exists and the user has read and write rights on it.
Basically, the memory file system is only available on some Linux distributions (including Ubuntu).
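As a minimal sketch, a customized initialization might look like this. The worker count and the `/dev/shm` check below are illustrative assumptions, not pandarallel's internal logic:

```python
import os

from pandarallel import pandarallel

# Illustrative availability check mirroring the rule described above:
# /dev/shm must exist and be readable and writable.
memory_fs_ok = os.path.isdir("/dev/shm") and os.access("/dev/shm", os.R_OK | os.W_OK)

# Illustrative values: 4 workers, progress bars on, and the memory file
# system only if the check above passed (otherwise data goes through pipes).
pandarallel.initialize(nb_workers=4, progress_bar=True, use_memory_fs=memory_fs_ok)
```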
With `df` a pandas DataFrame, `series` a pandas Series, `func` a function to apply/map, `args`, `args1`, `args2` some arguments, and `col_name` a column name:
| Without parallelization | With parallelization |
|---|---|
| `df.apply(func)` | `df.parallel_apply(func)` |
| `df.applymap(func)` | `df.parallel_applymap(func)` |
| `df.groupby(args).apply(func)` | `df.groupby(args).parallel_apply(func)` |
| `df.groupby(args1).col_name.rolling(args2).apply(func)` | `df.groupby(args1).col_name.rolling(args2).parallel_apply(func)` |
| `df.groupby(args1).col_name.expanding(args2).apply(func)` | `df.groupby(args1).col_name.expanding(args2).parallel_apply(func)` |
| `series.map(func)` | `series.parallel_map(func)` |
| `series.apply(func)` | `series.parallel_apply(func)` |
| `series.rolling(args).apply(func)` | `series.rolling(args).parallel_apply(func)` |
You will find a complete example for each row of this table below.
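For instance, switching the first row of this table from standard to parallel is a one-word change. The DataFrame and function here are illustrative, adapted from the examples further down:

```python
import math

import numpy as np
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df = pd.DataFrame({"a": np.random.rand(100_000),
                   "b": np.random.rand(100_000)})

def func(row):
    return math.sin(row.a ** 2) + math.sin(row.b ** 2)

res = df.apply(func, axis=1)                    # standard, single process
res_parallel = df.parallel_apply(func, axis=1)  # same call, parallelized
assert res.equals(res_parallel)                 # identical result
```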
I have 8 CPUs but `parallel_apply` speeds up computation only about 4x. Why?
Pandarallel can only speed up computation up to roughly the number of physical cores your computer has. Most recent CPUs (like the Intel Core i7) use hyperthreading: a 4-core hyperthreaded CPU will present 8 CPUs to the operating system, but really has only 4 physical computation units.
On Ubuntu, you can get the number of physical cores with:
```
$ grep -m 1 'cpu cores' /proc/cpuinfo
```
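If you prefer to check from Python, `os.cpu_count()` returns the number of logical CPUs; the third-party `psutil` package (an extra dependency, not required by pandarallel) can report physical cores:

```python
import os

import psutil  # third-party: pip install psutil

print("Logical CPUs:  ", os.cpu_count())                   # e.g. 8 with hyperthreading
print("Physical cores:", psutil.cpu_count(logical=False))  # e.g. 4
```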
I use Jupyter Lab, and instead of progress bars I see this kind of thing:
```
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…
```
Run the following three commands, and you should be able to see the progress bars:
```
$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager
```
(You may also have to install `nodejs` if asked.)
Complete examples (run in a Jupyter notebook):

```
%load_ext autoreload
%autoreload 2
```

```python
import pandas as pd
import time
from pandarallel import pandarallel
import math
import numpy as np
```

Initialize pandarallel:

```python
pandarallel.initialize()
```
DataFrame.apply

```python
df_size = int(5e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return math.sin(x.a**2) + math.sin(x.b**2)
```

```python
%%time
res = df.apply(func, axis=1)
```

```python
%%time
res_parallel = df.parallel_apply(func, axis=1)
```

```python
res.equals(res_parallel)
```
DataFrame.applymap

```python
df_size = int(1e7)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return math.sin(x**2) - math.cos(x**2)
```

```python
%%time
res = df.applymap(func)
```

```python
%%time
res_parallel = df.parallel_applymap(func)
```

```python
res.equals(res_parallel)
```
DataFrame.groupby.apply

```python
df_size = int(3e7)
df = pd.DataFrame(dict(a=np.random.randint(1, 1000, df_size),
                       b=np.random.rand(df_size)))

def func(df):
    dum = 0
    for item in df.b:
        dum += math.log10(math.sqrt(math.exp(item**2)))
    return dum / len(df.b)
```

```python
%%time
res = df.groupby("a").apply(func)
```

```python
%%time
res_parallel = df.groupby("a").parallel_apply(func)
```

```python
res.equals(res_parallel)
```
DataFrame.groupby.rolling.apply

```python
df_size = int(1e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 300, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return x.iloc[0] + x.iloc[1] ** 2 + x.iloc[2] ** 3 + x.iloc[3] ** 4
```

```python
%%time
res = df.groupby('a').b.rolling(4).apply(func, raw=False)
```

```python
%%time
res_parallel = df.groupby('a').b.rolling(4).parallel_apply(func, raw=False)
```

```python
res.equals(res_parallel)
```
DataFrame.groupby.expanding.apply

```python
df_size = int(1e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 300, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return x.iloc[0] + x.iloc[1] ** 2 + x.iloc[2] ** 3 + x.iloc[3] ** 4
```

```python
%%time
res = df.groupby('a').b.expanding(4).apply(func, raw=False)
```

```python
%%time
res_parallel = df.groupby('a').b.expanding(4).parallel_apply(func, raw=False)
```

```python
res.equals(res_parallel)
```
Series.map

```python
df_size = int(5e7)
df = pd.DataFrame(dict(a=np.random.rand(df_size) + 1))

def func(x):
    return math.log10(math.sqrt(math.exp(x**2)))
```

```python
%%time
res = df.a.map(func)
```

```python
%%time
res_parallel = df.a.parallel_map(func)
```

```python
res.equals(res_parallel)
```
Series.apply

```python
df_size = int(3.5e7)
df = pd.DataFrame(dict(a=np.random.rand(df_size) + 1))

def func(x, power, bias=0):
    return math.log10(math.sqrt(math.exp(x**power))) + bias
```

```python
%%time
res = df.a.apply(func, args=(2,), bias=3)
```

```python
%%time
res_parallel = df.a.parallel_apply(func, args=(2,), bias=3)
```

```python
res.equals(res_parallel)
```
Series.rolling.apply

```python
df_size = int(1e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=list(range(df_size))))

def func(x):
    return x.iloc[0] + x.iloc[1] ** 2 + x.iloc[2] ** 3 + x.iloc[3] ** 4
```

```python
%%time
res = df.b.rolling(4).apply(func, raw=False)
```

```python
%%time
res_parallel = df.b.rolling(4).parallel_apply(func, raw=False)
```

```python
res.equals(res_parallel)
```
Download Details:
Author: nalepae
Source Code: https://github.com/nalepae/pandarallel
License: BSD-3-Clause