'Melting a dataframe based on a flag

I've a dataframe like this:

pd.DataFrame({'time':['01-01-2020','02-01-2020','01-01-2020','02-01-2020'],'level':['q','q','r','r'],'a':[1,2,3,4],'b':[12,34,54,67],'c':[18,29,39,47],'a_1':[0.1,0.2,0.3,0.4],'a_2':[0,1,0,1],'b_1':[0.28,0.47,0.02,0.05],'b_2':[1,1,0,1],'c_1':[0.18,0.40,0.12,0.01],'c_2':[1,1,0,0]})
>>  time      level a   b   c   a_1 a_2  b_1   b_2  c_1   c_2
0   01-01-2020  q   1   12  18  0.1 0   0.28    1   0.18    1
1   02-01-2020  q   2   34  29  0.2 1   0.47    1   0.40    1
2   01-01-2020  r   3   54  39  0.3 0   0.02    0   0.12    0
3   02-01-2020  r   4   67  47  0.4 1   0.05    1   0.01    0

I wish to melt the data with time and level as index have all other columns as rows which had a flag 1 corresponding to their prefixes. Eg. I wish to have the values of a and a_1 listed as values and items if the value a_2 was 1. Desired output:

>>   time   level column values items
0   01-01-2020  q   b   12  0.28
1   01-01-2020  q   c   18  0.18
2   02-01-2020  q   a   2   0.20
3   02-01-2020  q   b   34  0.47
4   02-01-2020  q   c   29  0.40
5   02-01-2020  r   a   4   0.40
6   02-01-2020  r   b   67  0.05

I can get all the values irrespective of the flags and then filter for flags==1. But,not sure how to "melt" /"unstack" in this case. I tried a lot of ways but in vain. Please help me out here.



Solution 1:[1]

Let's try with melt:

i, c = ['time', 'level'], pd.Index(['a', 'b','c'])

# mask the values where flag=0
m = df[c + '_1'].mask(df[c + '_2'].eq(0).values)

# melt the dataframe & assign the items column
s = df[[*i, *c]].melt(i, var_name='columns')\
                .assign(items=m.values.T.reshape((-1, 1)))

# drop the nan values and sort the dataframe
s = s.dropna(subset=['items']).sort_values(i, ignore_index=True)

Details:

mask the values in columns ending with suffix _1 where the values in the corresponding flag columns equals 0:

   a_1   b_1   c_1
0  NaN  0.28  0.18
1  0.2  0.47  0.40
2  NaN   NaN   NaN
3  0.4  0.05   NaN

melt the dataframe containing the columns a, b, c, then reshape the masked values and assign a new column items in melted dataframe:

          time level columns  value  items
0   01-01-2020     q       a      1    NaN
1   02-01-2020     q       a      2   0.20
2   01-01-2020     r       a      3    NaN
3   02-01-2020     r       a      4   0.40
4   01-01-2020     q       b     12   0.28
5   02-01-2020     q       b     34   0.47
6   01-01-2020     r       b     54    NaN
7   02-01-2020     r       b     67   0.05
8   01-01-2020     q       c     18   0.18
9   02-01-2020     q       c     29   0.40
10  01-01-2020     r       c     39    NaN
11  02-01-2020     r       c     47    NaN

Lastly drop the NaN values in items and sort the values on time and level to get the final result:

         time level columns  value  items
0  01-01-2020     q       b     12   0.28
1  01-01-2020     q       c     18   0.18
2  02-01-2020     q       a      2   0.20
3  02-01-2020     q       b     34   0.47
4  02-01-2020     q       c     29   0.40
5  02-01-2020     r       a      4   0.40
6  02-01-2020     r       b     67   0.05

Solution 2:[2]

There may be a more elegant way, but this works. Extract data for each column name (a, b, c), select those that have flags set to 1 and concatenate the results.

df.set_index(['time', 'level'], inplace=True)
parts = []
for name in 'a','b','c':
    d = df[[name, f'{name}_1', f'{name}_2']]\
         .rename(columns={name: 'values', f'{name}_1': 'items', f'{name}_2': 'flag'})
    d['column'] = name
    parts.append(d[d.flag == 1])
pd.concat(parts)[['column','values','items']].reset_index()

Solution 3:[3]

Step 1: reorder the columns such that the numbers come before the letters:

res = df.copy()
res.columns = ["_".join(entry.split("_")[::-1]) for entry in res]

Step2 : reorder the columns (again) such that "num" is prefixed if the column is in ("a","b","c")

res.columns = [f"num_{letter}" if letter in ("a", "b", "c") 
               else letter 
               for letter in res]
res

time    level   num_a   num_b   num_c   1_a 2_a 1_b 2_b 1_c 2_c
0   01-01-2020  q   1   12  18  0.1 0   0.28    1   0.18    1
1   02-01-2020  q   2   34  29  0.2 1   0.47    1   0.40    1
2   01-01-2020  r   3   54  39  0.3 0   0.02    0   0.12    0
3   02-01-2020  r   4   67  47  0.4 1   0.05    1   0.01    0

Step 3: Use pandas wide to long to reshape the data, filter for rows equal to 1, rename the columns and finally reset the index:

(
    pd.wide_to_long(
        res,
        stubnames=["num", "1", "2"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
     # this is where the filter for rows equal to 1 occur
    .query("`2`==1")
    .drop(columns="2")
    .set_axis(["values", "items"], axis="columns")
    .reset_index()
)


time    level   column  values  items
0   01-01-2020  q   b   12  0.28
1   01-01-2020  q   c   18  0.18
2   02-01-2020  q   a   2   0.20
3   02-01-2020  q   b   34  0.47
4   02-01-2020  q   c   29  0.40
5   02-01-2020  r   a   4   0.40
6   02-01-2020  r   b   67  0.05

This is another way, but same idea of renaming the columns - makes it easy to reshape with wide to long:

result = df.rename(
    columns=lambda x: f"values_{x}"
    if x in ("a", "b", "c")
    else f"items_{x[0]}"
    if re.search(".1$", x)
    else f"equals1_{x[0]}"
    if re.search(".2$", x)
    else x
)


(
    pd.wide_to_long(
        result,
        stubnames=["values", "items", "equals1"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    .query("equals1==1")
    .iloc[:, :-1]
    .reset_index()
)

Another option is the pivot_longer function from pyjanitor, using the .value placeholder:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(index = ['time', 'level'], 
              names_to = ["column", ".value"], 
              names_pattern = r"(.)_?(.?)", 
              sort_by_appearance = True)
.query('`2` == 1')
.drop(columns = '2')
.rename(columns={'':'values', '1':'items'})
)

          time level column  values  items
1   01-01-2020     q      b      12   0.28
2   01-01-2020     q      c      18   0.18
3   02-01-2020     q      a       2   0.20
4   02-01-2020     q      b      34   0.47
5   02-01-2020     q      c      29   0.40
9   02-01-2020     r      a       4   0.40
10  02-01-2020     r      b      67   0.05

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 DYZ
Solution 3