How to fill N/A values with scaling in pandas

I have a pandas DataFrame that contains a lot of NaN values.

This is the data I have:

      2ndFlrSF  SalePrice
0          854     208500
1            0     181500
2          866     223500
3          756     140000
4         1053     250000
...        ...        ...
1455       694     175000
1456         0     210000
1457      1152     266500
1458         0     142125
1459         0     147500


And this is the result I expected:

[image: expected result]

I have tried filling the NaN values with the mean and with the most frequent value, but neither gives what I want.


Is there a package or method that fills the missing values scaled by another column (here, SalePrice)?

Note that I do NOT want to drop these values.

If there is a solution, please let me know. Thanks.

EDITED:

I found this question (dataframe-column-interpolation-weigthed-by-values-of-another-column), but it doesn't work.

I did:

df[['2ndFlrSF', 'SalePrice']].sort_values('SalePrice').set_index('SalePrice')['2ndFlrSF'].interpolate(method='index')

# doesn't work either
# df2 = df[['2ndFlrSF', 'SalePrice']].sort_values('SalePrice')
# df2.interpolate(method='linear')

but got result:

34900        0
35311        0
37900      368
39300        0
40000      668
          ... 
582933    1088
611657       0
625000    1796
745000    2065
755000    1872
Name: 2ndFlrSF, Length: 1460, dtype: int64
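
One possible reason the call above changes nothing: the result has dtype int64, which cannot hold NaN, so the missing entries appear to be stored as 0 and interpolate() has nothing to fill. The following is a minimal sketch under the assumption that every 0 really is a missing value; the five-row sample is hypothetical and only mirrors the shape of the data shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the data in the question.
df = pd.DataFrame({
    "2ndFlrSF": [854, 0, 866, 756, 1053],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

s = (
    df.sort_values("SalePrice")
      .set_index("SalePrice")["2ndFlrSF"]
      .astype(float)              # NaN needs a float column
      .replace(0.0, np.nan)       # assumption: 0 means "missing"
      .interpolate(method="index")  # weight by distance along SalePrice
)
print(s)
```

With this sample, the row at SalePrice 181500 gets a value between its neighbours 756 and 854 instead of staying at 0.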


Solution 1:[1]

You could use interpolation to estimate the missing data from its neighbors. pandas offers several interpolation methods, for example:

linear: estimates a missing value by drawing a straight line between the neighboring known values, treating the rows as equally spaced

a.interpolate()

polynomial: fills missing values using a polynomial of the given order through the available data points (pandas requires the order argument and SciPy for this method)

a.interpolate(method="polynomial", order=2)

padding: fills a missing value with the last valid value above it in the dataset; limit caps how many consecutive NaN values are filled

a.ffill(limit=2)  # recent pandas versions deprecate a.interpolate(method="pad", limit=2)

So you can replace the NaN values as follows (suppose a is your data):

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])
a.ffill(limit=2)  # forward fill; same effect as the older a.interpolate(method="pad", limit=2)
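
To make the difference between the methods concrete, here is a small sketch that runs linear interpolation and padding on the same series. The variable names linear and padded are only illustrative, and ffill is used for padding on the assumption that a recent pandas version is installed.

```python
import numpy as np
import pandas as pd

a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])

linear = a.interpolate()    # NaN at position 2 becomes the midpoint of 1 and 3
padded = a.ffill(limit=2)   # NaN at position 2 copies the value above it

print(linear[2], padded[2])  # 2.0 1.0
```

Linear interpolation fits this series better because the values rise steadily; padding simply repeats the previous value.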

Solution 2:[2]

I still don't know of a built-in pandas method for this, and I know this approach is unusual.

For the record, I solved it with my own implementation.

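import numpy as np
from copy import deepcopy

# Fill zeros in src[] using the src/dst ratio of the nearest known neighbours,
# after ordering the rows by dst[] (e.g. src = 2ndFlrSF, dst = SalePrice).
# Assumes dst[] contains no zeros.
def interpolate(dst: list, src: list):
    n = len(src)
    # each row: [dst value, src value, original position, src/dst ratio]
    a = [[dst[i], src[i], i, src[i] / dst[i]] for i in range(n)]
    a.sort(key=lambda x: (x[0], x[2]))
    res = deepcopy(a)
    for i in range(n):
        if res[i][1] != 0: continue
        # nearest non-zero rows to the left (l) and right (r) of the gap
        l = i
        while l >= 0 and res[l][1] == 0: l -= 1
        r = i
        while r < n and res[r][1] == 0: r += 1

        if l != -1 and r != n:
            # ratios spaced evenly from row l to row r, endpoints included;
            # linspace also handles equal ratios, where a zero step filled nothing
            c = np.linspace(a[l][3], a[r][3], r - l + 1)
            for j in range(1, r - l):
                res[l+j][1] = a[l+j][0] * c[j]
        elif r != n:
            # gap at the start: reuse the ratio of the first known row
            for j in range(r):
                res[j][1] = a[j][0] * a[r][3]
        elif l != -1:
            # gap at the end: reuse the ratio of the last known row
            for j in range(l + 1, n):
                res[j][1] = a[j][0] * a[l][3]
    # restore the original row order
    return list(map(lambda p: p[1], sorted(res, key=lambda x: x[2])))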
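
The same idea can probably be sketched with pandas alone: compute the ratio 2ndFlrSF / SalePrice, interpolate that ratio over the rows sorted by SalePrice, and multiply back. This is only a sketch, assuming that 0 marks a missing value and that SalePrice is never 0. The helper name fill_by_ratio and the sample data are hypothetical, and the output is not guaranteed to match the function above value for value.

```python
import numpy as np
import pandas as pd

def fill_by_ratio(df: pd.DataFrame, src: str, dst: str) -> pd.Series:
    """Fill zeros in df[src] by interpolating the src/dst ratio in dst order."""
    ordered = df.sort_values(dst, kind="stable")
    # ratio is NaN wherever src is 0 (assumption: 0 means "missing")
    ratio = (ordered[src] / ordered[dst]).where(ordered[src] != 0)
    # linear interpolation by row position; "both" reuses the edge ratios
    ratio = ratio.interpolate(method="linear", limit_direction="both")
    filled = ordered[dst] * ratio
    return filled.reindex(df.index)  # back to the original row order

# Hypothetical sample shaped like the data in the question.
df = pd.DataFrame({
    "2ndFlrSF": [854, 0, 866, 756, 1053],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
filled = fill_by_ratio(df, "2ndFlrSF", "SalePrice")
print(filled)
```

Rows that already had a value keep it (up to floating-point rounding), and each zero is replaced by SalePrice times the interpolated ratio.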

Before interpolation:

[image: Before]

After interpolating the values with SalePrice:

[image: After]

I uploaded the code and sample data.

You can see the sample data and its result here: https://gist.github.com/joonas-yoon/f5d01db4470ff87e442dc01c99f04c47#file-sample-txt

Thanks for all the comments and replies.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 maryam_k
Solution 2