Melting a dataframe based on a flag
I have a dataframe like this:
pd.DataFrame({
    'time': ['01-01-2020', '02-01-2020', '01-01-2020', '02-01-2020'],
    'level': ['q', 'q', 'r', 'r'],
    'a': [1, 2, 3, 4],
    'b': [12, 34, 54, 67],
    'c': [18, 29, 39, 47],
    'a_1': [0.1, 0.2, 0.3, 0.4],
    'a_2': [0, 1, 0, 1],
    'b_1': [0.28, 0.47, 0.02, 0.05],
    'b_2': [1, 1, 0, 1],
    'c_1': [0.18, 0.40, 0.12, 0.01],
    'c_2': [1, 1, 0, 0],
})
>> time level a b c a_1 a_2 b_1 b_2 c_1 c_2
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
I wish to melt the data with time and level as the index and all other columns as rows, keeping only the rows whose flag column equals 1. E.g. I want the values of a and a_1 listed under values and items only where a_2 was 1.
Desired output:
>> time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
I can get all the values irrespective of the flags and then filter for flags == 1, but I'm not sure how to "melt"/"unstack" in this case. I tried a lot of ways, but in vain. Please help me out here.
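For reference, the melt-then-filter approach the question describes could look like this. It is a rough sketch; the three separate melts, the variable names, and the trailing stable sort are my own choices, not from any of the answers below:

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['01-01-2020', '02-01-2020', '01-01-2020', '02-01-2020'],
    'level': ['q', 'q', 'r', 'r'],
    'a': [1, 2, 3, 4], 'b': [12, 34, 54, 67], 'c': [18, 29, 39, 47],
    'a_1': [0.1, 0.2, 0.3, 0.4], 'a_2': [0, 1, 0, 1],
    'b_1': [0.28, 0.47, 0.02, 0.05], 'b_2': [1, 1, 0, 1],
    'c_1': [0.18, 0.40, 0.12, 0.01], 'c_2': [1, 1, 0, 0],
})

idx = ['time', 'level']

# melt the value, item and flag columns separately; all three melts
# keep the rows in the same order, so they line up positionally
vals = df.melt(idx, ['a', 'b', 'c'], var_name='column', value_name='values')
items = df.melt(idx, ['a_1', 'b_1', 'c_1'], value_name='items')['items']
flags = df.melt(idx, ['a_2', 'b_2', 'c_2'], value_name='flag')['flag']

# attach the items, keep only the flagged rows, then sort like the desired output
out = (vals.assign(items=items)[flags.eq(1).values]
           .sort_values(idx, kind='mergesort', ignore_index=True))
print(out)
```

The stable `mergesort` keeps a, b, c in order within each (time, level) group, matching the desired output.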
Solution 1:[1]
Let's try with melt:
i, c = ['time', 'level'], pd.Index(['a', 'b', 'c'])
# mask the values where flag=0
m = df[c + '_1'].mask(df[c + '_2'].eq(0).values)
# melt the dataframe & assign the items column
s = df[[*i, *c]].melt(i, var_name='columns')\
     .assign(items=m.values.T.reshape(-1))
# drop the nan values and sort the dataframe
s = s.dropna(subset=['items']).sort_values(i, ignore_index=True)
Details:
Mask the values in the columns ending with suffix _1 wherever the corresponding flag column (suffix _2) equals 0:
a_1 b_1 c_1
0 NaN 0.28 0.18
1 0.2 0.47 0.40
2 NaN NaN NaN
3 0.4 0.05 NaN
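The masking step relies on DataFrame.mask, which replaces entries with NaN wherever the condition is True. A tiny standalone illustration (the toy frames here are my own, standing in for df[c + '_1'] and the flag condition):

```python
import numpy as np
import pandas as pd

vals = pd.DataFrame({'a_1': [0.1, 0.2], 'b_1': [0.28, 0.47]})
# boolean array standing in for df[c + '_2'].eq(0).values
flag_is_zero = np.array([[True, False], [False, False]])

# mask replaces entries with NaN wherever the condition is True
masked = vals.mask(flag_is_zero)
print(masked)
```

Note the `.values` in the solution above: it strips the flag frame's column labels (a_2, b_2, c_2), so mask aligns positionally instead of failing to align on the mismatched labels.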
Melt the dataframe containing the columns a, b, c, then flatten the masked values column by column (matching melt's output order) and assign them as a new items column in the melted dataframe:
time level columns value items
0 01-01-2020 q a 1 NaN
1 02-01-2020 q a 2 0.20
2 01-01-2020 r a 3 NaN
3 02-01-2020 r a 4 0.40
4 01-01-2020 q b 12 0.28
5 02-01-2020 q b 34 0.47
6 01-01-2020 r b 54 NaN
7 02-01-2020 r b 67 0.05
8 01-01-2020 q c 18 0.18
9 02-01-2020 q c 29 0.40
10 01-01-2020 r c 39 NaN
11 02-01-2020 r c 47 NaN
Lastly, drop the NaN values in items and sort on time and level to get the final result:
time level columns value items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
Solution 2:[2]
There may be a more elegant way, but this works. Extract the data for each column name (a, b, c), keep the rows whose flag is set to 1, and concatenate the results.
df.set_index(['time', 'level'], inplace=True)
parts = []
for name in 'a', 'b', 'c':
    d = df[[name, f'{name}_1', f'{name}_2']] \
        .rename(columns={name: 'values', f'{name}_1': 'items', f'{name}_2': 'flag'})
    d['column'] = name
    parts.append(d[d.flag == 1])
pd.concat(parts)[['column', 'values', 'items']].reset_index()
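For completeness, here is that loop run end-to-end on the sample data. The non-destructive set_index and the trailing stable sort are my additions; without the sort, the rows come out grouped by column (all a rows, then b, then c) rather than ordered by time and level as in the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['01-01-2020', '02-01-2020', '01-01-2020', '02-01-2020'],
    'level': ['q', 'q', 'r', 'r'],
    'a': [1, 2, 3, 4], 'b': [12, 34, 54, 67], 'c': [18, 29, 39, 47],
    'a_1': [0.1, 0.2, 0.3, 0.4], 'a_2': [0, 1, 0, 1],
    'b_1': [0.28, 0.47, 0.02, 0.05], 'b_2': [1, 1, 0, 1],
    'c_1': [0.18, 0.40, 0.12, 0.01], 'c_2': [1, 1, 0, 0],
})

df = df.set_index(['time', 'level'])  # non-destructive variant of the inplace call
parts = []
for name in 'a', 'b', 'c':
    d = df[[name, f'{name}_1', f'{name}_2']] \
        .rename(columns={name: 'values', f'{name}_1': 'items', f'{name}_2': 'flag'})
    d['column'] = name
    parts.append(d[d.flag == 1])

# concat groups the rows by column; sort afterwards to match the desired order
out = (pd.concat(parts)[['column', 'values', 'items']]
         .reset_index()
         .sort_values(['time', 'level'], kind='mergesort', ignore_index=True))
print(out)
```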
Solution 3:[3]
Step 1: reverse the parts of each column label so that the number comes before the letter (a_1 becomes 1_a, and so on):
res = df.copy()
res.columns = ["_".join(entry.split("_")[::-1]) for entry in res]
Step 2: rename the columns again so that "num" is prefixed when the column is one of ("a", "b", "c"):
res.columns = [f"num_{letter}" if letter in ("a", "b", "c")
               else letter
               for letter in res]
res
time level num_a num_b num_c 1_a 2_a 1_b 2_b 1_c 2_c
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
Step 3: use pandas wide_to_long to reshape the data, filter for the rows whose flag equals 1, rename the columns, and finally reset the index:
(
    pd.wide_to_long(
        res,
        stubnames=["num", "1", "2"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    # this is where the filter for rows equal to 1 occurs
    .query("`2` == 1")
    .drop(columns="2")
    .set_axis(["values", "items"], axis="columns")
    .reset_index()
)
time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
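pd.wide_to_long matches columns named stub, separator, suffix; here suffix="." is a one-character regex, so the letters a, b, c become the values of the new column level. A tiny standalone illustration of the mechanics (the toy frame and names are my own):

```python
import pandas as pd

# minimal frame: one id column plus columns named <stub>_<suffix>
wide = pd.DataFrame({'id': [1, 2], 'num_a': [10, 20], 'num_b': [30, 40]})

# suffix='.' is a one-character regex, so 'a' and 'b' become the j values
long = (pd.wide_to_long(wide, stubnames=['num'], i='id', j='col',
                        sep='_', suffix='.')
          .reset_index()
          .sort_values(['col', 'id'], ignore_index=True))
print(long)
```

The default suffix pattern only matches digits, which is why it has to be overridden here.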
This is another way, using the same idea of renaming the columns to make reshaping with wide_to_long easy; note that it relies on the re module:
import re

result = df.rename(
    columns=lambda x: f"values_{x}"
    if x in ("a", "b", "c")
    else f"items_{x[0]}"
    if re.search(".1$", x)
    else f"equals1_{x[0]}"
    if re.search(".2$", x)
    else x
)
(
    pd.wide_to_long(
        result,
        stubnames=["values", "items", "equals1"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    .query("equals1 == 1")
    .iloc[:, :-1]
    .reset_index()
)
Another option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(index=['time', 'level'],
               names_to=["column", ".value"],
               names_pattern=r"(.)_?(.?)",
               sort_by_appearance=True)
 .query('`2` == 1')
 .drop(columns='2')
 .rename(columns={'': 'values', '1': 'items'})
)
time level column values items
1 01-01-2020 q b 12 0.28
2 01-01-2020 q c 18 0.18
3 02-01-2020 q a 2 0.20
4 02-01-2020 q b 34 0.47
5 02-01-2020 q c 29 0.40
9 02-01-2020 r a 4 0.40
10 02-01-2020 r b 67 0.05
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | DYZ |
| Solution 3 | |
