'Conditional copy and replacement of values in dataframe [Python]

I would greatly appreciate any help. Within a dataframe, I have products that are grouped and scored according to their groups (group A products have the same set of Score_1, Score_2, ..., Score_N, Group B has another set etc). As some products/groups are referencing others, their scores need to be replaced with the scores of those groups that they are referencing. The Dataframe currently looks like this:

      Product     Group    Reference_Group      Score_1      Score_2  ... Score_N
0      XXX0X1       A          NaN            0.007598     0.007538      
1      XXX0X2       A          NaN            0.007598     0.007538      
2      XXX0X3       A          NaN            0.007598     0.007538
3      XXX0X4       B           A             0.003343     0.002696
4      XXX0X5       B           A             0.003343     0.002696
5      XXX0X6       B           A             0.003343     0.002696
6      XXX0X7       C          NaN            0.003399     0.004444
7      XXX0X8       C          NaN            0.003399     0.004444
8      XXX0X9       C          NaN            0.003399     0.004444
9      XXX0X10      D           C             0.006677     0.006262
10     XXX0X11      D           C             0.006677     0.006262
11     XXX0X12      D           C             0.006677     0.006262
...
1569

Where the Reference_Group != NaN, I need to replace the Score_1, Score_2, ..., Score_N of those rows with that of the Group indicated in Reference_Group (Score_1, Score_2, ..., Score_N for Group B products need to be replaced with Score_1, Score_2,..., Score_N of Group A products). The rows with NaN in Reference_Group need not be edited. The final df needs to look like this:

      Product    Group    Reference_Group      Score_1      Score_2  ... Score_N
0      XXX0X1       A          NaN            0.007598     0.007538      
1      XXX0X2       A          NaN            0.007598     0.007538      
2      XXX0X3       A          NaN            0.007598     0.007538
3      XXX0X4       B           A             0.007598     0.007538
4      XXX0X5       B           A             0.007598     0.007538
5      XXX0X6       B           A             0.007598     0.007538
6      XXX0X7       C          NaN            0.003399     0.004444
7      XXX0X8       C          NaN            0.003399     0.004444
8      XXX0X9       C          NaN            0.003399     0.004444
9      XXX0X10      D           C             0.003399     0.004444
10     XXX0X11      D           C             0.003399     0.004444
11     XXX0X12      D           C             0.003399     0.004444
...
1569

As there are too many rows and columns, I cannot hardcode for only the examples above.

Thank you for reading my question!



Solution 1:[1]

The idea is to divide the original dataframe into one set of products whose scores will be referenced ("referenced"), and another set who needs to reference scores ("referencing"). We drop all the Score_* columns from the referencing dataframe, then re-populate those same Score_* columns by copying from referenced in a left-join operation.

Every other step is janitorial (they pre- and post-process the dataframes for the left-join):

referenced = df[df.Reference_Group.isna()]
referencing = df[df.Reference_Group.notna()]

# For left-join, ID column name must match
referencing = referencing.rename(
    columns={"Group": "old.Group", "Reference_Group": "Group"})
referencing = referencing[["Product", "old.Group", "Group"]]

# The left-join operation itself
referencing = referencing.merge(
    referenced, how="left", on="Group", suffixes=("", "_extra"))

# Drop additional columns from the left-join
referencing = referencing.drop(["Product_extra", "Reference_Group"], axis=1)
# Drop additional rows from the left-join
referencing = referencing.drop_duplicates(subset="Product")

# Revert column names
referencing = referencing.rename(
    columns={"old.Group": "Group", "Group": "Reference_Group"})

done = pd.concat([referenced, referencing])

The large chunk in the middle can probably be shortened by carefully adjusting some arguments, but I am not familiar enough with pandas to do so.


Edit: it is possible to conserve the original ordering of the rows, using the "Product" column as a unique ID for each row. Append the following statement:

# Preserve original row order
done = done.sort_values(
    by="Product",
    key=lambda xs: list(map(
        lambda x: df.Product[df.Product == x].index[0], xs))
    ).reset_index(drop=True)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1