'How to make Python code to transform data in power BI more efficient to reduce refresh time?

I am currently using .apply and if statement to create a conditional column returning string values based on other columns in a power BI table. These conditions are to categorize data in a way that considers incorrectly entered data. The code returns exactly the result I want but is slow, the table is about 50,000 rows and the if statement is being run over every row. Here is my code:

# 'dataset' holds the input data for this script

import pandas as pd
import numpy as np

fd = dataset[pd.isnull(dataset['bID'])]["RepackaID"].dropna().tolist()

def conditions(x):

    if pd.notnull(x['bID']) and pd.notnull(x["RepackaID"]):
        return 'b Input'
    elif x['bID'] == 0:
        return 'b Input'
# Below is the line that I struggled to do in power query M
    elif pd.notnull(x['aID']) and x['aID'] in fd:
        return "a Repack Output"
    elif x["Repack"] == True:
        return "b Repack Output"
    elif pd.notnull(x["bID"]):
        return "b Traded"
    elif x['cID'] == 94 or x['cID'] == 41:
        return "b Repack Output"
    elif pd.notnull(x["RepackaID"]):
        return "a Repack Input"
    else:
        return "a Output"

dataset['new column'] = dataset.apply(conditions, axis=1)

The line I struggled to do in power query M is marked with a comment it references the list fd defined above. It is just a case of returning this result "a repack output" to our new column if a value for aID exists in RepackaID butdoing this over 50,000 rows. The rest seems quiet easy to implement in power query m.

Just wondering if I am doing this in the most pythonic way (I suspect not) or if there is a better way to do this in m power query, any ideas? I would actually prefer to do this in m power query as other people working on this don't use python. Any improvements would be much welcomed.



Solution 1:[1]

In Python pandas, always aim for vectorized operations and avoid running rowwise operations with for ... iterrows or .apply (often a hidden loop). Specifically, for multiple conditional statements, consider numpy.select that conditionally assigns elementwise between an equal number list of conditions and list of values all with whole arrays and not individual cells.

fd = dataset[pd.isnull(dataset['bID'])]["RepackaID"].dropna().tolist()

# LIST OF BOOLEAN ARRAYS
conditions = [
    (pd.notnull(dataset['bID'])) & (pd.notnull(dataset["RepackaID"])),
    dataset['bID'] == 0,
    (pd.notnull(dataset['aID'])) & (dataset['aID'].isin(fd)),
    dataset["Repack"] == True,
    pd.notnull(dataset["bID"]),
    (dataset['cID'] == 94) | (dataset['cID'] == 41),
    pd.notnull(dataset["RepackaID"])
]

# LIST OF RETURN VALUES
values = [
    "b Input",
    "b Input",
    "a Repack Output",
    "b Repack Output",
    "b Traded",
    "b Repack Output".
    "a Repack Input"
]

dataset['new column'] = np.select(conditions, values, default="a Output")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Parfait