Parallel Loading of Input Files into Pandas DataFrames
I have a requirement where I have three input files that need to be loaded into Pandas DataFrames before two of them are merged into a single DataFrame.
The file extension always changes: it could be .txt one time and .xlsx or .csv another time.
How can I run this process in parallel, in order to save the waiting/loading time?
This is my code at the moment:
from time import time  # to measure the time taken to run the code
import pandas as pd    # to work with the data frames

start_time = time()

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

Primary_df = pd.read_excel(Primary_File)
Secondary_1_df = pd.read_csv(Secondary_File_1)
Secondary_2_df = pd.read_csv(Secondary_File_2)

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])

end_time = time()
print(end_time - start_time)
It takes around 20 minutes to load my Primary_df and Secondary_df, so I am looking for an efficient solution, possibly using parallel processing, to save time. I timed my reading operations and they take most of the time, approximately 18 minutes 45 seconds.
Hardware config: Intel i5 processor, 16 GB RAM and 64-bit OS.
Question made eligible for bounty: I am looking for working code with detailed steps, using a package within the Anaconda environment that supports loading my input files in parallel and storing them in separate Pandas DataFrames. This should eventually save time.
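Since the extension changes between runs, any of the solutions below can be paired with a small helper that picks the pandas reader from the extension. A minimal sketch, where load_any is a hypothetical name and .txt files are assumed to be delimited text:

import os
import pandas as pd

def load_any(path):
    # Hypothetical helper: choose the pandas reader based on the file extension.
    ext = os.path.splitext(path)[1].lower()
    if ext in (".xlsx", ".xls"):
        return pd.read_excel(path)
    if ext in (".csv", ".txt"):
        # Assumption: .txt files are delimited text; pass sep=... if not comma-separated.
        return pd.read_csv(path)
    raise ValueError("Unsupported file type: " + ext)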
Solution 1:[1]
Try this:
from time import time
from multiprocessing.pool import ThreadPool

import pandas as pd

start_time = time()
pool = ThreadPool(processes=3)

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

# Define a function for each thread
def import_xlsx(file_name):
    df_xlsx = pd.read_excel(file_name)
    # print(df_xlsx.head())
    return df_xlsx

def import_csv(file_name):
    df_csv = pd.read_csv(file_name)
    # print(df_csv.head())
    return df_csv

# Create three threads as follows
Primary_df = pool.apply_async(import_xlsx, (Primary_File, )).get()
Secondary_1_df = pool.apply_async(import_csv, (Secondary_File_1, )).get()
Secondary_2_df = pool.apply_async(import_csv, (Secondary_File_2, )).get()

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
Solution 2:[2]
Why not use asyncio instead of multiprocessing?
Rather than using multiple threads, you might first want to leverage the I/O level with an async CSV dict reader (which can itself be parallelized with multiprocessing across multiple files). Afterwards, you can either concatenate the dicts and then load them into pandas, or load the individual dicts into pandas and concatenate there.
However, pandas does not support asyncio, so you will incur a performance loss at some point.
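A minimal sketch of that idea, assuming the third-party aiofiles package is installed (pip install aiofiles). Only the raw file reads are overlapped; the pd.read_csv parsing still runs synchronously, which is the performance loss mentioned above:

import asyncio
import io

import aiofiles  # third-party package, assumed installed
import pandas as pd

async def read_csv_async(path):
    # Read the raw text without blocking the event loop ...
    async with aiofiles.open(path, mode="r") as f:
        text = await f.read()
    # ... then parse it with pandas, which runs synchronously.
    return pd.read_csv(io.StringIO(text))

async def load_all(paths):
    # Overlap the file I/O for all paths; the parsing itself is not async.
    return await asyncio.gather(*(read_csv_async(p) for p in paths))

# Example usage (paths are placeholders):
# dfs = asyncio.run(load_all([Secondary_File_1, Secondary_File_2]))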
Solution 3:[3]
Try @Cezary.Sz's code, but delete the calls to .get() so the jobs actually run concurrently; instead:
Primary_df_job = pool.apply_async(import_xlsx, (Primary_File, ))
Secondary_1_df_job = pool.apply_async(import_csv, (Secondary_File_1, ))
Secondary_2_df_job = pool.apply_async(import_csv, (Secondary_File_2, ))
Then
Secondary_1_df = Secondary_1_df_job.get()
Secondary_2_df = Secondary_2_df_job.get()
You can then use these DataFrames while Primary_df_job is still loading:
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
When you need Primary_df in your code, use
Primary_df = Primary_df_job.get()
This will block execution until Primary_df_job has finished.
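The same submit-now, collect-later pattern can also be written with the standard-library concurrent.futures module. A minimal sketch of the same idea, reusing the file paths and the import_xlsx/import_csv helpers from Solution 1:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submit all three reads immediately; none of these calls block.
    primary_job = pool.submit(import_xlsx, Primary_File)
    sec1_job = pool.submit(import_csv, Secondary_File_1)
    sec2_job = pool.submit(import_csv, Secondary_File_2)

    # Merge the secondary files as soon as both are ready,
    # while the (larger) Excel file may still be loading.
    Secondary_df = sec1_job.result().merge(sec2_job.result(), how='inner', on=['ID'])

    # Block only when the primary DataFrame is actually needed.
    Primary_df = primary_job.result()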
Solution 4:[4]
Unfortunately, due to the GIL (Global Interpreter Lock) in Python, multiple threads do not run simultaneously: all threads use the same single CPU core. That means that if you create several threads to load your files, the total time will be equal to (or actually greater than) the time needed to load the files one by one.
More about GIL: https://wiki.python.org/moin/GlobalInterpreterLock
To speed up load time, you can try switching from CSV/Excel to pickle files (or HDF).
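For example, if the same report is read repeatedly, a one-off conversion to a binary format makes every later load much faster. A minimal sketch using pandas' to_pickle/read_pickle (the HDF variants additionally require the PyTables package):

import pandas as pd

# One-off conversion: still pays the slow Excel read once.
df = pd.read_excel("//ServerA/Testing Folder File Open/Report.xlsx")
df.to_pickle("Report.pkl")            # or: df.to_hdf("Report.h5", key="report")

# All later runs load the binary file, which is much faster to parse.
df = pd.read_pickle("Report.pkl")     # or: pd.read_hdf("Report.h5", key="report")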
Solution 5:[5]
You give the hardware details, but you do not give the most interesting part: the number of disks you have, the type of RAID, and the filesystem you are reading from.
If you only have one disk, no RAID, and a regular filesystem (ext4, XFS, etc.), as you typically have on a laptop, you will not be able to increase the bandwidth simply by throwing CPUs (multithreaded or multiprocess) at the problem. Using multiple threads or asynchronous I/O will help mask the latency a bit, but will not increase the bandwidth, because chances are you are already saturating it with a single reader process.
So, using the code suggested by @Cezary.Sz, try moving one of the files to USB 3.0 external storage or to SDXC storage. If you are running on a large workstation, look at the hardware details to see whether several disks are available, and if you run on a large cluster, look for a parallel filesystem (BeeGFS, Lustre, etc.).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Cezary.Sz |
| Solution 2 | tahesse |
| Solution 3 | Cassiano |
| Solution 4 | |
| Solution 5 | damienfrancois |
