'Map names to column values pandas
The Problem
I had a hard time phrasing this question but essentially I have a series of X columns that represent weights at specific points in time. Then another set of X columns that represent the names of those people that were measured.
That table looks like this (there's more than two columns, this is just a toy example):
| a_weight | b_weight | a_name | b_name |
|---|---|---|---|
| 10 | 5 | John | Michael |
| 1 | 2 | Jake | Michelle |
| 21 | 3 | Alice | Bob |
| 2 | 1 | Ashley | Brian |
What I Want
I want to have a two columns with the maximum weight and name at each point in time. I want this to be vectorized because the data is a lot. I can do it using a for loop or an .apply(lambda row: row[col]) but it is very slow.
So the final table would look something like this:
| a_weight | b_weight | a_name | b_name | max_weight | max_name |
|---|---|---|---|---|---|
| 10 | 5 | John | Michael | a_weight | John |
| 1 | 2 | Jake | Michelle | b_weight | Michelle |
| 21 | 3 | Alice | Bob | a_weight | Alice |
| 2 | 1 | Ashley | Brian | a_weight | Ashley |
What I've Tried
I've been able to create a mirror df_subset with just the weights, then use the idxmax function to make a max_weight column:
df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")
This returns a column that is the max_weight column in the section above. Now I run:
df["max_name_col"] = max_weight_col.str.replace("_weight","_name")
and I have this:
| a_weight | b_weight | a_name | b_name | max_weight | max_name_col |
|---|---|---|---|---|---|
| 10 | 5 | John | Michael | a_weight | a_name |
| 1 | 2 | Jake | Michelle | b_weight | b_name |
| 21 | 3 | Alice | Bob | a_weight | a_name |
| 2 | 1 | Ashley | Brian | a_weight | a_name |
I basically want to run a code similar to the one below without a for-loop:
df["max_name"] = [row[row["max_name_col"]] for row in df]
How do I move on from here? I feel like I'm so close but I'm stuck. Any help? I'm also open to throwing away the entire code and doing something else if there's a faster way.
Solution 1:[1]
You can do that for sure just pass to numpy argmax
v1 = df.filter(like='weight').values
v2 = df.filter(like='name').values
df['max_weight'] = v1[df.index, v1.argmax(1)]
df['max_name'] = v2[df.index, v1.argmax(1)]
df
Out[921]:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael 10 John
1 1 2 Jake Michelle 2 Michelle
2 21 3 Alice Bob 21 Alice
3 2 1 Ashley Brian 2 Ashley
Solution 2:[2]
This would do the trick assuming you only have 2 weight columns:
df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]
Solution 3:[3]
We could use idxmax to find the column names; then use factorize + numpy advanced indexing to get the names:
df['max_weight'] = df.loc[:, df.columns.str.contains('weight')].idxmax(axis=1)
df['max_name'] = (df.loc[:, df.columns.str.contains('name')].to_numpy()
[np.arange(len(df)), df['max_weight'].factorize()[0]])
Output:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael a_weight John
1 1 2 Jake Michelle b_weight Michelle
2 21 3 Alice Bob a_weight Alice
3 2 1 Ashley Brian a_weight Ashley
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BENY |
| Solution 2 | Jay Mody |
| Solution 3 |
