'Speeding up derived feature calculation in Pandas dataframe
I have the following workflow in a Python notebook
- Load data into a pandas dataframe from a table (around 200K rows) --> I will call this
orig_DFmoving forward - Manipulate
orig_DFto get into a DF that has columns <Feature1, Feature 2,...,Feature N, Label> --> I will call this derived DF ```ML_input DF`` moving forward. This DF is used to train a ML model - To get
ML_input DF, I need to do some complex processing on each row inorig_DF. In particular, each row inorig_DFgets converted into multiple "rows" (number unknown before processing a row) inML_input DF
Currently, I am doing (code below)
orig_df.iterrows()to loop through each row- Apply a function on each row. This returns a list.
- Accumulate results from multiple rows into one list
- Convert this list into
ML_input DFafter the loop ends
This works but I want speed this up by parallelizing the work on each row and accumulating the results. Would appreciate pointers from Pandas experts on how to do this. An example would be greatly appreciated
Current code is below.
Note: I have looked into using df.apply(). But two issues seem to be
applyin itself does not seem to parallelize things.- I don't how to make apply handle this
one row converted to multiple rowissue (any pointers here will also help)
Current code
def get_training_dataframe(dfin):
X = []
for index, row in dfin.iterrows():
ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
for ts, frame in ts_frame_dict.items():
features = get_features(frame)
if features != None:
X += [features]
return pd.DataFrame(X, columns=FEATURE_NAMES)
Solution 1:[1]
It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.
The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.
In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.
I recommend trying line profiler if you can.
import ast
import pandas as pd
def get_training_dataframe(dfin):
frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
frames = chain(*(d.values() for d in frame_dicts))
features = map(get_features, frames)
features = [f for f in features if f is not None]
return pd.DataFrame(features, columns=FEATURE_NAMES)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stuart Berg |
