Removing outliers creates nulls in pandas DataFrame
I have a non-null DataFrame df with about 100 columns. I want to remove outliers from each column, so I'm doing the following:
df1 = df[np.abs(df - df.mean()) <= (3*df.std())]
I would expect df1 to contain fewer records than df, but with the above method the shape stays the same. In addition, it is creating a lot of null values.
My understanding is that it's removing the outliers but leaving nulls in their place. Is my understanding correct?
Solution 1:[1]
Your understanding is correct. It is removing outliers and replacing them with NaN:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(0, 1, (100, 10)))

# Boolean mask: True where the value is within 3 standard deviations of the column mean
idx = np.abs(df - df.mean()) <= 3 * df.std()
outlier_locations = np.where(idx == False)
df1 = df[idx]  # same shape as df; outliers are replaced with NaN
print(outlier_locations)
(array([58]), array([9]))
If you expect df1 to contain fewer records than df, then you probably want to drop the rows (or columns) that contain the outliers, rather than just masking individual entries, which would otherwise leave you with ragged data full of NaNs.
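To actually shrink the DataFrame, one way (a sketch using the same synthetic data as above) is to keep only the rows in which every column passes the 3-standard-deviation test:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(0, 1, (100, 10)))

mask = np.abs(df - df.mean()) <= 3 * df.std()

# Option 1: keep only rows where all columns are within bounds
df_rows_dropped = df[mask.all(axis=1)]

# Option 2: mask outliers to NaN first, then drop any row containing a NaN
df_dropna = df[mask].dropna()

print(df_rows_dropped.shape)  # fewer than 100 rows remain
```

Both options produce the same result here; `mask.all(axis=1)` is just more direct, while `dropna()` matches the masking approach from the question.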
Solution 2:[2]
Use neulab (https://pypi.org/project/neulab) to delete outliers.
It lets you pick the detection algorithm you prefer and delete the outliers from your DataFrame.
Example:
import pandas as pd
from neulab.OutlierDetection import Chauvenet

d = {'col1': [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81, 7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71], 'col2': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data=d)
# info=True prints the detected outliers; autorm=True removes them automatically
chvn = Chauvenet(dataframe=df, info=True, autorm=True)
Output: Detected outliers: {'col1': [29.87, 25.71, 20.46, 0.84, 0.81, 3.97, 4.46, 10.38, 7.74, 9.26]}
col1 col2
0 8.02 1
1 8.16 1
3 8.64 1
8 8.78 1
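Neither answer mentions it, but if you want to keep the DataFrame's shape without introducing NaNs or pulling in an extra dependency, clipping (winsorizing) each column at its 3-standard-deviation bounds is a common third option. A minimal sketch with synthetic data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(0, 1, (100, 10)))

# Per-column bounds: mean +/- 3 standard deviations
lower = df.mean() - 3 * df.std()
upper = df.mean() + 3 * df.std()

# Cap extreme values at the bounds instead of deleting them;
# axis=1 aligns the lower/upper Series with the columns
df_clipped = df.clip(lower=lower, upper=upper, axis=1)

print(df_clipped.shape)  # shape unchanged, no NaNs introduced
```

This trades deletion for distortion: the outliers are pulled in to the boundary rather than removed, which can be preferable when dropping rows would lose too much data.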
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nathaniel |
| Solution 2 | kndahl |
