'Pandas - select column using other column value as column name
I have a dataframe that contains a column, let's call it "names". "names" has the name of other columns. I would like to add a new column that would have for each row the value based on the column name contained on that "names" column.
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})
a | b | names | --- | --- | ---- | 1 | -1 | 'a' | 2 | -2 | 'b' | 3 | -3 | 'a' | 4 | -4 | 'b' |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})
a | b | names | new_col | --- | --- | ---- | ------ | 1 | -1 | 'a' | 1 | 2 | -2 | 'b' | -2 | 3 | -3 | 'a' | 3 | 4 | -4 | 'b' | -4 |
Solution 1:[1]
Because DataFrame.lookup
is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt
:
df['new_col'] = df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False).query('names == variable').loc[df.index, 'value']
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Solution 2:[2]
Solution using pd.factorize
(from https://github.com/pandas-dev/pandas/issues/39171#issuecomment-773477244):
idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Solution 3:[3]
With the straightforward and easy solution (lookup
) deprecated, another alternative to the pandas-based ones proposed here is to convert df
into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values
is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer
function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index
as the first argument inside the bracket, but this does not work generally.)
Solution 4:[4]
Here's a short solution using df.melt
and df.merge
:
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])
Outputs:
key_0 a b names value
0 0 1 -1 a 1
1 1 2 -2 b -2
2 2 3 -3 a 3
3 3 4 -4 b -4
There's a redundant key_0
column which you need to drop with df.drop
.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | ignoring_gravity |
Solution 3 | Erlend Magnus Viggen |
Solution 4 | Leif Metcalf |