Drop first NaN rows in multiple columns
I have the below df:
      ID  Number  Number 2  Number 3
1  10001     NaN       NaN         5
2  10001      25       NaN        12
3  10001      78         4       NaN
4  10002       3       NaN       NaN
5  10002     234       201       NaN
6  10002     NaN       510        34
7  10003     NaN       765       NaN
8  10003     NaN       422       NaN
9  10003     NaN       753        56
10 10003     231      7587      2345
I want to structure the data so the first NaN rows are deleted by column.
Resulting df:
      ID  Number  Number 2  Number 3
1  10001      25         4         5
2  10001      78       NaN        12
3  10001     NaN       NaN       NaN
4  10002       3       201        34
5  10002     234       510       NaN
6  10002     NaN       NaN       NaN
7  10003     231       765        56
8  10003     NaN       422      2345
9  10003     NaN       753       NaN
10 10003     NaN      7587       NaN
I'm essentially trying to shift each column's data up by n rows, depending on where the data starts for that column, so that in the first rows of each ID there is always data in at least one of the Number columns.
I've tried first_row_index, but that doesn't work per individual column.
I've tried dropna, but I can't find a way to define how many rows to drop for each column.
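For reference, the example frame from the question can be reconstructed as below (a sketch; the original may use a different index, here 1 through 10 is assumed from the table):

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's example frame (index assumed 1..10)
df = pd.DataFrame({
    'ID': [10001] * 3 + [10002] * 3 + [10003] * 4,
    'Number':   [np.nan, 25, 78, 3, 234, np.nan, np.nan, np.nan, np.nan, 231],
    'Number 2': [np.nan, np.nan, 4, np.nan, 201, 510, 765, 422, 753, 7587],
    'Number 3': [5, 12, np.nan, np.nan, np.nan, 34, np.nan, np.nan, 56, 2345],
}, index=range(1, 11))

# first_valid_index() gives the label of the first non-NaN entry per column
print(df['Number 2'].first_valid_index())  # -> 3
```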
Solution 1:[1]
EDIT:
I missed the ID condition. You need to use groupby in that case.
# In one liner
df.groupby('ID').apply(lambda x:x.reset_index().apply(lambda y: y.shift(-y.first_valid_index())).set_index('index'))
Explanation:
- Group the dataframe by ID.
- The first apply (the one with lambda x) receives each grouped dataframe as its parameter; reset_index() makes the group start at index 0 (otherwise it keeps the index from the whole dataframe).
- The second apply (the one with lambda y) receives each column of the grouped dataframe as its parameter.
- Get the first_valid_index and shift the column up by that amount.
- As the reset index is now useless, restore the saved index column as the index.
        ID  Number  Number 2  Number 3
0  10001.0    25.0       4.0       5.0
1  10001.0    78.0       NaN      12.0
2  10001.0     NaN       NaN       NaN
3  10002.0     3.0     201.0      34.0
4  10002.0   234.0     510.0       NaN
5  10002.0     NaN       NaN       NaN
6  10003.0   231.0     765.0      56.0
7  10003.0     NaN     422.0    2345.0
8  10003.0     NaN     753.0       NaN
9  10003.0     NaN    7587.0       NaN
Original answer, before accounting for the ID groups:
df.apply(lambda x: x.shift(-x.first_valid_index()))
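The primitive behind both variants can be checked on a single Series. Note that first_valid_index() returns a label, not a position, so shifting by it only works when the index is the default RangeIndex, which is why the grouped one-liner calls reset_index() first:

```python
import pandas as pd

# Shift a column up by the position of its first valid value
s = pd.Series([None, None, 765.0, 422.0])
k = s.first_valid_index()   # 2: label of the first non-NaN entry
shifted = s.shift(-k)       # values move to the top, NaNs fill the tail
print(shifted.tolist())     # [765.0, 422.0, nan, nan]
```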
Solution 2:[2]
Using a stacked version of the dataframe, then shifting by the number of leading NaNs per group and column:
(df.set_index('ID', append=True).stack(dropna=False)
.groupby(level=[1,2])
.apply(lambda s: s.shift(-(~s.notna().cummax()).sum()))
.unstack(-1)
.reset_index()
)
Output:
       ID  Number  Number 2  Number 3
1   10001    25.0       4.0         5
2   10001    78.0       NaN        12
3   10001     NaN       NaN       NaN
4   10002     3.0     201.0        34
5   10002   234.0     510.0       NaN
6   10002     NaN       NaN       NaN
7   10003   231.0     765.0        56
8   10003     NaN     422.0      2345
9   10003     NaN     753.0       NaN
10  10003     NaN    7587.0       NaN
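The shift amount in this solution comes from the cummax trick, which can be sketched on a single column (sample values assumed): notna().cummax() is False only for the leading NaNs, so negating and summing counts exactly those, ignoring interior NaNs.

```python
import numpy as np
import pandas as pd

# Count leading NaNs only: interior NaNs do not contribute
s = pd.Series([np.nan, np.nan, 765.0, np.nan, 753.0])
mask = s.notna().cummax()        # False, False, True, True, True
leading = int((~mask).sum())     # 2 leading NaNs
print(leading)                   # 2
```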
Solution 3:[3]
Here is another approach: the first column will always have a value, and for any ID only the last row will contain NaNs, if any.
# melt to a single column, so we can drop all the NaN cells/rows
df2 = df.melt('ID').dropna(axis=0)
# running count of values per ID
df2['ID_Count'] = df2.groupby(['ID']).cumcount()
# group the result into sets of 3, since we have three columns: Number, Number 2, Number 3
df2['new_var'] = df2['ID_Count'] // 3
# generate a new column name
df2['new_var_group'] = 'Number_' + df2.groupby(['ID', 'new_var']).cumcount().astype(str)
# finally reverse the melt and generate a table shaped like the original
df2 = df2.pivot_table(index=['ID', 'new_var'], columns='new_var_group', values='value').reset_index().drop(columns='new_var')
df2
new_var_group ID Number_0 Number_1 Number_2
0 10001 25.0 78.0 4.0
1 10001 5.0 12.0 NaN
2 10002 3.0 234.0 201.0
3 10002 510.0 34.0 NaN
4 10003 231.0 765.0 422.0
5 10003 753.0 7587.0 56.0
6 10003 2345.0 NaN NaN
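The melt + cumcount mechanics can be sketched on a tiny hypothetical two-column frame: after melting and dropping NaNs, the running count per ID determines where each surviving value lands (integer division by the column count picks the output row):

```python
import pandas as pd

# Hypothetical minimal frame to illustrate melt + cumcount
df = pd.DataFrame({'ID': [1, 1], 'A': [10.0, None], 'B': [None, 20.0]})
long = df.melt('ID').dropna()                 # one row per non-NaN cell
long['pos'] = long.groupby('ID').cumcount()   # running count of values per ID
# pos // n_cols picks the output row; cumcount within it picks the column
print(long['pos'].tolist())  # [0, 1]
```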
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | mozway |
| Solution 3 | Naveed |
