How to randomly create null values in a pandas DataFrame and store the original values of the replaced cells?
Let's say I have a dataframe df:
| A | B | C |
|---|---|---|
| 1 | 2 | 6 |
| 3 | 4 | 5 |
| 5 | 6 | 2 |
| 2 | 3 | 3 |
and I want to create random null values, say 25% per column, something like this:
| A | B | C |
|---|---|---|
| 1 | null | null |
| null | 4 | 5 |
| 5 | null | 2 |
| null | 3 | null |
Now I want to save the original values of these nulls, maybe as an array or a dict,
so that I have the original values of only the replaced cells.
Original = {'A2': 3, 'A4': 2, 'B1': 2, 'B3': 6}
Replaced = {'A2': null, 'A4': null, 'B1': null, 'B3': null}
Ideally, I want the original values of the replaced cells in an array.
Solution 1:[1]
Starting data: df
   A  B  C
0  1  2  6
1  3  4  5
2  5  6  2
3  2  3  3
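For anyone following along, the starting frame can be rebuilt with a few lines. This is only a convenience sketch: the values are copied from the table in the question, and the default integer index is assumed.

```python
import pandas as pd

# Values copied from the question; the default RangeIndex (0..3) is assumed.
df = pd.DataFrame({"A": [1, 3, 5, 2], "B": [2, 4, 6, 3], "C": [6, 5, 2, 3]})
print(df)
```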
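Get a random sample of indices/columns, reindexed so that the result keeps every row of df:
import numpy as np
df_nulls = df.apply(lambda x: x.sample(frac=0.25)).reindex(df.index)
print(df_nulls)
     A    B    C
0  NaN  2.0  NaN
1  NaN  NaN  NaN
2  5.0  NaN  NaN
3  NaN  NaN  3.0
Note: NaN represents unselected values. The sample is random, so the outputs shown here are from one possible draw. Without the reindex, df_nulls contains only the sampled rows, so the positions returned by np.where would not line up with the rows of df.
Using numpy you can get the coordinates of the non-null values:
rows, cols = np.where(df_nulls.notnull())
print(rows)
# [0 2 3]
print(cols)
# [1 0 2]
Build your Original dict:
Original = {f"{df.columns[c]}{r}": df.iloc[r, c] for r, c in zip(rows, cols)}
print(Original)
# {'B0': 2, 'A2': 5, 'C3': 3}
Build your Replaced dict:
Replaced = {k: float("nan") for k in Original}
print(Replaced)
# {'B0': nan, 'A2': nan, 'C3': nan}
Complete the replacement:
df = df.astype(float)  # NaN is a float, so make the columns float before assigning
for r, c in zip(rows, cols):
    df.iloc[r, c] = float("nan")
print(df)
     A    B    C
0  1.0  NaN  6.0
1  3.0  4.0  5.0
2  NaN  6.0  2.0
3  2.0  3.0  NaN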
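The same idea can also be written with a single boolean mask, which avoids the per-cell loop and keeps row positions aligned with df. The following is a minimal sketch rather than the answer's method: the function name mask_random_cells, its frac and seed parameters, and the (index label, column) tuple keys are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mask_random_cells(df, frac=0.25, seed=None):
    """Null out a fraction of the rows in every column.

    Returns (masked_df, originals); the input frame is left untouched.
    Hypothetical helper, not part of pandas.
    """
    rng = np.random.default_rng(seed)
    n_rows = len(df)
    n_pick = int(round(frac * n_rows))  # cells to blank in each column

    # Boolean mask with the same shape as df; True marks a cell to null out.
    mask = np.zeros(df.shape, dtype=bool)
    for col in range(df.shape[1]):
        picked = rng.choice(n_rows, size=n_pick, replace=False)
        mask[picked, col] = True

    # Record the original values before they are replaced.
    rows, cols = np.nonzero(mask)
    originals = {
        (df.index[r], df.columns[c]): df.iat[r, c] for r, c in zip(rows, cols)
    }

    # DataFrame.mask replaces the True cells with NaN and returns a new frame.
    masked = df.mask(mask)
    return masked, originals


df = pd.DataFrame({"A": [1, 3, 5, 2], "B": [2, 4, 6, 3], "C": [6, 5, 2, 3]})
masked, originals = mask_random_cells(df, frac=0.25, seed=0)

# The original values of the replaced cells as a plain array.
original_values = np.array(list(originals.values()))
print(masked)
print(originals)
```

Passing a seed makes the draw repeatable, and because the originals are keyed by (index label, column), they can be written back later with df.loc if the nulls need to be undone.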
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
