【Parallel processing in Python】Joblib explained
yuto
Posted on June 29, 2024
1. What is Joblib
Joblib is a Python library that provides tools for efficiently saving and loading Python objects, particularly useful for machine learning workflows.
・Install
pip install joblib
2. Key Features
2.1 Parallel Processing
Joblib provides easy-to-use parallel processing through its Parallel class and delayed function. This is useful for tasks that can be parallelized, such as parameter grid searches or data preprocessing.
from joblib import Parallel, delayed
def process_data(data):
    # Simulate a time-consuming data processing step
    import time
    time.sleep(1)
    return data ** 2
data = [1, 2, 3, 4, 5]
# General pattern: Parallel(n_jobs=num_workers)(delayed(func)(arg) for arg in iterable)
results = Parallel(n_jobs=2)(delayed(process_data)(d) for d in data)
print(results)
As shown above, the call reads much like a list comprehension. If you specify n_jobs=-1, all available CPU cores will be used for the parallel computation. This can significantly speed up tasks that are CPU-bound and can be effectively parallelized.
However, using every core may slow down other applications that need CPU or memory, so choose the value with care.
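One way to act on that caution is to leave a core free for the rest of the system. The sketch below is only an illustration: `square` is a hypothetical stand-in task, and it relies on joblib's documented convention that negative `n_jobs` values mean "n_cpus + 1 + n_jobs" workers, so `n_jobs=-2` uses all cores but one.

```python
from joblib import Parallel, cpu_count, delayed


def square(x):
    # Hypothetical stand-in for a real CPU-bound task
    return x ** 2


# n_jobs=-2 asks joblib for "all cores but one"
results = Parallel(n_jobs=-2)(delayed(square)(x) for x in range(8))

print("Cores detected:", cpu_count())
print(results)
```

The results come back in the same order as the inputs, regardless of which worker finished first.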
:::details Speed test
・Test
from joblib import Parallel, delayed
import time
def process_data(data):
    # Simulate a time-consuming data processing step
    time.sleep(1)
    return data ** 2
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Normal calculation
start_time = time.time()
results_normal = [process_data(d) for d in data]
end_time = time.time()
normal_duration = end_time - start_time
print("Normal Calculation Results:", results_normal)
print("Normal Calculation Duration:", normal_duration, "seconds")
# Parallel calculation with n_jobs=2
start_time = time.time()
results_parallel = Parallel(n_jobs=2)(delayed(process_data)(d) for d in data)
end_time = time.time()
parallel_duration = end_time - start_time
print("Parallel Calculation Results:", results_parallel)
print("Parallel Calculation Duration:", parallel_duration, "seconds")
# Parallel calculation with n_jobs=-1
start_time = time.time()
results_parallel = Parallel(n_jobs=-1)(delayed(process_data)(d) for d in data)
end_time = time.time()
parallel_duration = end_time - start_time
print("Parallel Calculation Results:", results_parallel)
print("Parallel Calculation Duration:", parallel_duration, "seconds")
・Result
# Normal Calculation Results: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
# Normal Calculation Duration: 10.011737823486328 seconds
# Parallel Calculation Results: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
# Parallel Calculation Duration: 5.565693616867065 seconds
# Parallel Calculation Results: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
# Parallel Calculation Duration: 3.627182722091675 seconds
:::
As the test results show, parallel processing was about 1.8x faster with n_jobs=2 (10.0 s down to 5.6 s) and about 2.8x faster with n_jobs=-1 (3.6 s) on this machine.
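To see why the timings land where they do, it may help to estimate the best case by hand. The following is a rough sketch, not part of joblib: `ideal_duration` is a hypothetical helper that assumes every task takes exactly the same time and ignores the cost of starting workers.

```python
import math


def ideal_duration(n_tasks, n_workers, task_seconds=1.0):
    # Tasks run in "waves": each wave finishes up to n_workers tasks at once
    return math.ceil(n_tasks / n_workers) * task_seconds


sequential = ideal_duration(10, 1)
two_workers = ideal_duration(10, 2)

print("Sequential estimate:", sequential, "seconds")
print("Two-worker estimate:", two_workers, "seconds")
print("Ideal speedup:", sequential / two_workers)
```

The measured 5.57 seconds for n_jobs=2 sits slightly above the 5-second estimate, and the gap is plausibly the overhead of launching worker processes. The core count of the test machine is not stated, so the n_jobs=-1 figure cannot be checked the same way.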
2.2 Serialization/Compression
Joblib saves and loads Python objects to and from disk in a binary (pickle-based) format, which is particularly efficient for objects containing large NumPy arrays.
Also, it supports various compression methods like zlib, gzip, bz2, and xz, allowing you to reduce the storage size of saved objects.
・Serialization
import joblib
data = [i for i in range(1000000)]
compression = False
if compression:
    joblib.dump(data, 'data.pkl', compress=('gzip', 3))
else:
    joblib.dump(data, 'data.pkl')
data = joblib.load('data.pkl')
print(len(data))
# 1000000
The 3 in ('gzip', 3) specifies the compression level (typically from 1 to 9, where higher numbers give more compression but slower speeds).
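To check what compression buys in practice, the file sizes can be compared directly. This is a minimal sketch under assumed conditions: the file names and the smaller list are illustrative choices, and the files go into a temporary directory so nothing is left behind.

```python
import os
import tempfile

import joblib

data = list(range(100_000))

with tempfile.TemporaryDirectory() as tmpdir:
    raw_path = os.path.join(tmpdir, "data.pkl")
    gz_path = os.path.join(tmpdir, "data.pkl.gz")

    # Same object, saved once without and once with gzip level 3
    joblib.dump(data, raw_path)
    joblib.dump(data, gz_path, compress=("gzip", 3))

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # joblib.load detects the compression on its own
    restored = joblib.load(gz_path)

print(f"uncompressed: {raw_size} bytes, gzip level 3: {gz_size} bytes")
print("round trip OK:", restored == data)
```

The actual ratio depends heavily on the data: regular sequences like this one compress well, while already-compressed or random data may barely shrink.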
2.3 Memory Mapping
For large NumPy arrays, Joblib can use memory mapping to save memory by keeping a reference to the data on disk instead of loading it all into memory.
When you memory-map a file, parts of the file are loaded into RAM as needed, which can result in slower access times compared to having the entire dataset in RAM. However, it allows you to handle datasets larger than your available RAM.
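A memory-mapped load can be sketched as follows. The array size and file name are arbitrary choices for illustration. The example relies on the `mmap_mode` argument of `joblib.load`, which only works when the file was saved without compression.

```python
import os
import tempfile

import joblib
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "array.joblib")
    array = np.arange(1_000_000, dtype=np.float64)

    # Save uncompressed; compressed files cannot be memory-mapped
    joblib.dump(array, path)

    # Open read-only; data is paged in from disk only when accessed
    mapped = joblib.load(path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)
    chunk_sum = float(mapped[:10].sum())  # touches only the first 10 values

    # Drop the mapping before the temporary directory is deleted
    del mapped

print("memory-mapped:", is_memmap)
print("sum of first 10 values:", chunk_sum)
```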
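The data stays on disk and is read from storage only when you actually use it. Joblib's Memory class applies a similar disk-backed idea to function results: it caches return values on disk, so repeated calls with the same arguments are loaded from storage rather than recomputed.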
from joblib import Memory
import math
cachedir = "./memory_cache"
memory = Memory(cachedir, verbose=0)
@memory.cache
def calc(x):
    print("RUNNING......")
    return math.sqrt(x)
print(calc(2))
print(calc(2))
print(calc(5))
# RUNNING......
# 1.4142135623730951
# 1.4142135623730951
# RUNNING......
# 2.23606797749979
As shown, the second call with the same argument returns the cached result without running the function body again.
This is useful when the same calculation is repeated many times, as in a recursive Fibonacci computation.
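The Fibonacci case can be sketched like this. The `calls` list is a hypothetical addition used only to count how often the function body really runs, and the cache lives in a temporary directory so each run starts from an empty cache.

```python
import tempfile

from joblib import Memory

calls = []  # records every n for which the function body actually ran

with tempfile.TemporaryDirectory() as cachedir:
    memory = Memory(cachedir, verbose=0)

    @memory.cache
    def fib(n):
        calls.append(n)
        # The recursive calls also go through the cached version of fib
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    first = fib(20)
    runs_after_first = len(calls)

    second = fib(20)  # served from the on-disk cache
    runs_after_second = len(calls)

print(first, second)
print("body executions:", runs_after_first, "then", runs_after_second)
```

Each value of n is computed only once, so the body runs 21 times instead of thousands. For a function this cheap, an in-memory cache such as functools.lru_cache is usually faster; a disk cache pays off mainly when each result is expensive to compute or has to survive between runs.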
3. Summary
Joblib is a very useful library in Python. In particular, its parallel processing can make a big difference for tasks such as data preprocessing that need to run faster.