How to fill N/A values with scaling in pandas
I have a pandas DataFrame that contains a lot of NaN values.
The following table shows the data I have:
| | 2ndFlrSF | SalePrice |
|---|---|---|
| 0 | 854 | 208500 |
| 1 | 0 | 181500 |
| 2 | 866 | 223500 |
| 3 | 756 | 140000 |
| 4 | 1053 | 250000 |
| ... | ... | ... |
| 1455 | 694 | 175000 |
| 1456 | 0 | 210000 |
| 1457 | 1152 | 266500 |
| 1458 | 0 | 142125 |
| 1459 | 0 | 147500 |
The next one is what I expect.
I have tried filling the NaN values with the average (mean) and with the most frequent value, but that is not what I want.
Is there a package or method that fills the values scaled by another column?
One thing I would like to point out: I do NOT want to drop these values.
If there is a solution, please let me know. Thanks.
EDITED:
I found this question (dataframe-column-interpolation-weigthed-by-values-of-another-column), but it doesn't work.
I did:
df[['2ndFlrSF', 'SalePrice']].sort_values('SalePrice').set_index('SalePrice')['2ndFlrSF'].interpolate(method='index')
# doesn't work either
# df2 = df[['2ndFlrSF', 'SalePrice']].sort_values('SalePrice')
# df2.interpolate(method='linear')
but got this result:
34900 0
35311 0
37900 368
39300 0
40000 668
...
582933 1088
611657 0
625000 1796
745000 2065
755000 1872
Name: 2ndFlrSF, Length: 1460, dtype: int64
Solution 1:[1]
I think you could use interpolation to estimate the missing data from its neighbors. There are different interpolation methods, such as:
linear: estimates a missing value by connecting the neighboring points with a straight line, treating the values as equally spaced
a.interpolate()
polynomial: fills missing values with a polynomial of the given order fitted through the available data points (this method requires SciPy)
a.interpolate(method="polynomial", order=2)
padding: fills a missing value with the last valid value above it in the dataset (recent pandas versions deprecate this method name in favor of a.ffill(limit=2))
a.interpolate(method="pad", limit=2)
So you can simply replace NaN values as follows (let's suppose a is your data):
import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])
a.interpolate(method="pad", limit=2)
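Applied to the two columns in the question, the same idea could look like the sketch below. Note that interpolate() only fills NaN. The output in the question has dtype int64 and contains zeros, so there was nothing for it to fill. The sketch assumes those zeros mark missing values (an assumption, not something the question confirms). The helper name fill_zeros_by_price and the toy DataFrame are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_zeros_by_price(frame, target="2ndFlrSF", by="SalePrice"):
    """Hypothetical helper: treat 0 in `target` as missing and
    interpolate it linearly along the values of `by`."""
    ordered = frame.sort_values(by)
    # Index the target by price so method="index" uses the price distance.
    s = pd.Series(
        ordered[target].replace(0, np.nan).to_numpy(dtype=float),
        index=ordered[by].to_numpy(dtype=float),
    )
    filled = s.interpolate(method="index")
    # Put the values back on the original row labels and row order.
    return pd.Series(
        filled.to_numpy(), index=ordered.index, name=target
    ).reindex(frame.index)


# Toy data (not from the question): two rows have a "missing" 0.
df = pd.DataFrame({
    "2ndFlrSF": [800, 0, 1000, 0, 1200],
    "SalePrice": [100000, 300000, 200000, 150000, 400000],
})
result = fill_zeros_by_price(df)
print(result)
```

In this toy frame the gap at price 150000 sits halfway between 800 and 1000, and the gap at 300000 halfway between 1000 and 1200, so they should come out as about 900 and 1100. A zero at the lowest or highest price has only one known neighbor and stays NaN unless you also pass limit_direction="both".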
Solution 2:[2]
I still don't know of a built-in method for this in pandas, and I know this approach is unusual.
Anyway, as an answer: I solved it with my own implementation.
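import numpy as np
from copy import deepcopy
# fill src[] by weight from dst[]; a 0 in src is treated as a missing value
def interpolate(dst: list, src: list):
    n = len(src)
    # each row: [dst value, src value, original position, ratio src/dst]
    a = [[dst[i], src[i], i, src[i] / dst[i]] for i in range(n)]
    a.sort(key=lambda x: (x[0], x[2]))
    res = deepcopy(a)
    for i in range(n):
        if res[i][1] != 0: continue
        # find the nearest non-zero entries to the left (l) and right (r)
        l = i
        while l >= 0 and res[l][1] == 0: l -= 1
        r = i
        while r < n and res[r][1] == 0: r += 1
        if l != -1 and r != n:
            # step the ratio from the left neighbor toward the right one
            step = (a[r][3]-a[l][3]) / (r-l+1)
            if step != 0:
                c = np.arange(a[l][3], a[r][3], step)
                for j in range(len(c)-2):
                    res[l+j+1][1] = a[l+j+1][0] * c[j]
        elif l == -1:
            # nothing known on the left: reuse the right neighbor's ratio
            for j in range(r+1):
                res[j][1] = a[j][0] * a[r][3]
        else:
            # nothing known on the right: reuse the left neighbor's ratio
            for j in range(n-l):
                res[l+j][1] = a[l+j][0] * a[l][3]
    # return the filled src values in their original order
    return list(map(lambda p: p[1], sorted(res, key=lambda x: x[2])))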
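For comparison, a shorter pandas sketch of a similar idea: interpolate the ratio 2ndFlrSF / SalePrice between known neighbors in price order, then multiply back by the price. This is not a line-for-line port of the loop above. It uses plain positional linear interpolation of the ratio and reuses the nearest known ratio at the edges. The name fill_by_ratio and the toy data are hypothetical, and zeros are again assumed to mark missing values.

```python
import numpy as np
import pandas as pd


def fill_by_ratio(frame, target="2ndFlrSF", by="SalePrice"):
    """Hypothetical helper: fill zeros in `target` from an
    interpolated target/by ratio, in order of `by`."""
    ordered = frame.sort_values(by)
    known = ordered[target].replace(0, np.nan).astype(float)
    ratio = known / ordered[by]
    # method="linear" ignores the index and treats rows as equally spaced;
    # limit_direction="both" extends the nearest known ratio to the edges.
    ratio = ratio.interpolate(method="linear", limit_direction="both")
    filled = known.fillna(ratio * ordered[by])
    # Restore the original row order.
    return filled.reindex(frame.index)


# Toy data (not from the question's dataset).
df = pd.DataFrame({
    "2ndFlrSF": [280, 0, 40, 0, 0],
    "SalePrice": [400, 300, 100, 200, 500],
})
result = fill_by_ratio(df)
print(result)
```

Here the known ratios are 0.4 (at price 100) and 0.7 (at price 400), so the two rows between them get ratios of about 0.5 and 0.6, and the row at price 500 reuses 0.7.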
Before interpolation
After interpolating values with SalePrice
I uploaded the code and sample data.
You can see the sample data and its result at: https://gist.github.com/joonas-yoon/f5d01db4470ff87e442dc01c99f04c47#file-sample-txt
Thanks for all the comments and replies.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | maryam_k |
| Solution 2 | |






