Speeding up derived feature calculation in a Pandas DataFrame

I have the following workflow in a Python notebook

  1. Load data into a pandas dataframe from a table (around 200K rows) --> I will call this orig_DF moving forward
  2. Manipulate orig_DF into a DF that has columns <Feature1, Feature2, ..., FeatureN, Label> --> I will call this derived DF `ML_input DF` moving forward. This DF is used to train an ML model
  3. To get ML_input DF, I need to do some complex processing on each row in orig_DF. In particular, each row in orig_DF gets converted into multiple "rows" (number unknown before processing a row) in ML_input DF

Currently, I am doing (code below)

  1. orig_df.iterrows() to loop through each row
  2. Apply a function on each row. This returns a list.
  3. Accumulate results from multiple rows into one list
  4. Convert this list into ML_input DF after the loop ends

This works, but I want to speed it up by parallelizing the work on each row and accumulating the results. I would appreciate pointers from Pandas experts on how to do this; an example would be greatly appreciated.

Current code is below.

Note: I have looked into using df.apply(), but two issues seem to be:

  1. apply by itself does not seem to parallelize the work.
  2. I don't know how to make apply handle the one-row-converted-to-multiple-rows issue (any pointers here will also help).
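One way to handle the one-row-to-many-rows issue with apply is to have it return a list of feature rows per input row, then flatten with Series.explode. A minimal sketch on toy data; the real data and get_features() are not shown in the question, so the versions below are hypothetical stand-ins:

```python
import ast

import pandas as pd

# Hypothetical stand-ins, only here to make the sketch runnable:
df = pd.DataFrame({"sample_dictionary": ["{1: 'a', 2: 'b'}", "{3: 'c'}"]})

def get_features(frame):
    # Pretend feature extractor: one fixed-length list per frame
    return [frame, len(frame)]

# Step 1: apply() returns one *list of feature rows* per input row
rows = df["sample_dictionary"].apply(
    lambda s: [get_features(frame) for frame in ast.literal_eval(s).values()]
)

# Step 2: explode() turns each list element into its own output row,
# which is exactly the one-row-to-many-rows expansion
flat = rows.explode(ignore_index=True)
ml_input = pd.DataFrame(flat.tolist(), columns=["feature", "length"])
```

Note that apply alone still runs serially; explode only solves the shape problem, not the parallelism problem.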

Current code

import ast
import pandas as pd

def get_training_dataframe(dfin):
    X = []
    for index, row in dfin.iterrows():
        # Each row holds a serialized dict: {timestamp: frame, ...}
        ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
        for ts, frame in ts_frame_dict.items():
            features = get_features(frame)
            if features is not None:
                X.append(features)

    return pd.DataFrame(X, columns=FEATURE_NAMES)
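For the parallelization part, the per-row work can be pushed through an executor from concurrent.futures. A sketch under stated assumptions: get_features and FEATURE_NAMES are hypothetical stand-ins, and a ThreadPoolExecutor is used so the example runs anywhere; for a CPU-bound get_features you would likely swap in ProcessPoolExecutor (which requires the worker function to be defined at module level).

```python
import ast
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical stand-ins, only here to make the sketch runnable:
FEATURE_NAMES = ["feature", "length"]

def get_features(frame):
    return [frame, len(frame)]

def process_row(serialized):
    # One row of orig_DF -> a list of feature rows (possibly empty)
    ts_frame_dict = ast.literal_eval(serialized)
    out = []
    for ts, frame in ts_frame_dict.items():
        features = get_features(frame)
        if features is not None:
            out.append(features)
    return out

def get_training_dataframe(dfin, max_workers=4):
    # Each worker processes one serialized dict; results come back in order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process_row, dfin["sample_dictionary"]))
    # Flatten the per-row lists into one list of feature rows
    X = [feat for row_feats in results for feat in row_feats]
    return pd.DataFrame(X, columns=FEATURE_NAMES)

df = pd.DataFrame({"sample_dictionary": ["{1: 'a', 2: 'b'}", "{3: 'c'}"]})
ml_input = get_training_dataframe(df)
```

Because executor.map preserves input order, the output rows come out in the same order as the serial loop.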


Solution 1:[1]

It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.

The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.

In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.

I recommend trying `line_profiler` if you can.

import ast
from itertools import chain

import pandas as pd

def get_training_dataframe(dfin):
    # Parse every row's serialized dict in a single literal_eval call
    frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
    # Flatten all dicts' frames into one iterable
    frames = chain(*(d.values() for d in frame_dicts))
    features = map(get_features, frames)
    features = [f for f in features if f is not None]
    return pd.DataFrame(features, columns=FEATURE_NAMES)
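As a quick sanity check, the function can be exercised on toy data. This self-contained version uses hypothetical stand-ins for get_features and FEATURE_NAMES, which the answer assumes exist elsewhere:

```python
import ast
from itertools import chain

import pandas as pd

# Hypothetical stand-ins, only here to make the check runnable:
FEATURE_NAMES = ["feature", "length"]

def get_features(frame):
    return [frame, len(frame)]

def get_training_dataframe(dfin):
    # One combined literal_eval over all rows, then flatten the frames
    frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
    frames = chain(*(d.values() for d in frame_dicts))
    features = map(get_features, frames)
    features = [f for f in features if f is not None]
    return pd.DataFrame(features, columns=FEATURE_NAMES)

df = pd.DataFrame({"sample_dictionary": ["{1: 'a'}", "{2: 'bb'}"]})
ml_input = get_training_dataframe(df)
```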

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Stuart Berg