Could you explain outlier filtering?
I have a DataFrame that contains values that are a priori outliers. I would like to remove at least the outliers from the "Rainfall" variable. I do it as follows. It seems to work, but I still have outliers in the second figure. Is this normal?
- Before removing outliers
- Removing outliers
rainfall = df["Rainfall"]
q3 = np.quantile(rainfall, 0.75)
q1 = np.quantile(rainfall, 0.25)
iqr = q3 - q1
upper_bound = q1 + 1.5 * iqr
lower_bound = q3 - 1.5 * iqr
rainfall_wo_outliers = df[(rainfall <= lower_bound) | (rainfall >= upper_bound)]["Rainfall"]
- After removing outliers
PS: I scaled the data beforehand with MinMaxScaler.
Solution 1:[1]
First, your condition for removing the outliers is inverted: it keeps the values outside the bounds and drops everything in between. You should use (rainfall >= lower_bound) & (rainfall <= upper_bound)
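As a minimal sketch of the corrected filter, here it is on a small made-up array. The data and the helper name iqr_filter are assumptions for illustration, not taken from the question. The sketch uses the conventional Tukey fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR; the formulas in the question add to Q1 and subtract from Q3 instead, which gives a narrower interval.

```python
import numpy as np

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # keep the points INSIDE the fences (note the & and the direction of the comparisons)
    mask = (values >= lower_bound) & (values <= upper_bound)
    return values[mask]

# made-up values with one obvious outlier
rainfall = np.array([0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 9.0])
kept = iqr_filter(rainfall)
print(kept)  # the 9.0 entry is dropped, the other seven values are kept
```

With a DataFrame, the same boolean mask could be applied as df[(rainfall >= lower_bound) & (rainfall <= upper_bound)].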
Now back to the fundamental question. Yes, this is normal. Outlier removal (with your method at least) relies on the quartiles of the current distribution to compute the IQR and decide which points to drop.
However, once you drop those points, the remaining population has different quartiles, meaning you will eventually get new outliers relative to the new Q1 and Q3.
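A small deterministic sketch of this effect, using made-up numbers and a hypothetical helper tukey_filter (neither is from the question): each pass recomputes the quartiles on what is left, so points that were inside the fences on the previous pass can fall outside them on the next one, until the sample stops changing.

```python
import numpy as np

def tukey_filter(values, k=1.5):
    """One pass of IQR filtering with the conventional Tukey fences (hypothetical helper)."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return values[(values >= q1 - k * iqr) & (values <= q3 + k * iqr)]

# made-up data with a long right tail
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 14, 16, 50], dtype=float)

sizes = [len(data)]
current = data
for _ in range(4):
    current = tukey_filter(current)  # quartiles are recomputed on the filtered sample
    sizes.append(len(current))

print(sizes)  # [11, 10, 9, 8, 8]
```

Here the first pass drops only 50, the second drops 16, the third drops 14, and the fourth removes nothing, so the sample has become stable.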
This is particularly visible for normal or uniform data:
import numpy as np
import matplotlib.pyplot as plt
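def iqr_outliers_removal(s):
    # quartiles of the current sample
    q1, q3 = np.quantile(s, [0.25, 0.75])
    iqr = q3 - q1
    # same bounds as in the question; the usual Tukey fences would be
    # q1 - 1.5 * iqr and q3 + 1.5 * iqr, which are wider
    upper_bound = q1 + 1.5 * iqr
    lower_bound = q3 - 1.5 * iqr
    # keep only the values inside the bounds
    return s[(s >= lower_bound) & (s <= upper_bound)]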
# generate random data
s = np.random.normal(size=10_000)
# iteratively remove outliers
s2 = s.copy()
n = len(s2)
out = [s2]
for _ in range(100):
    print('.', end='')
    s2 = iqr_outliers_removal(s2)
    out.append(s2)
ax = plt.subplot()
ax.plot(list(map(len, out)), marker='.', ls='')
ax.set_ylabel('data size')
ax.set_xlabel('iteration')
ax.set_yscale('log')
Size of the sample over 100 iterations of outlier removal:
Now, it can happen that you remove outliers and the new population becomes stable. If you use s = np.random.uniform(size=10_000) and run the simulation a few times, you will sometimes get something like:
But this is just by chance ;)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |




