Could you explain outlier filtering?

I have a DataFrame that contains values that are, a priori, outliers. I would like to remove at least the outliers from the "Rainfall" variable. I do it as follows. It seems to work, but I still have outliers in the second figure. Is this normal?

  1. Before removing outliers

[figure: before removing outliers]

  2. Removing outliers

rainfall = df["Rainfall"]
q3 = np.quantile(rainfall, 0.75)
q1 = np.quantile(rainfall, 0.25)

iqr = q3 - q1

upper_bound = q1 + 1.5 * iqr
lower_bound = q3 - 1.5 * iqr

rainfall_wo_outliers = df[(rainfall <= lower_bound) | (rainfall >= upper_bound)]["Rainfall"]

  3. After removing outliers

[figure: after removing outliers]

PS: I scaled the data beforehand with MinMaxScaler.



Solution 1:[1]

First, your condition for removing the outliers is inverted: it keeps the points outside the bounds instead of the points inside them. You should use (rainfall >= lower_bound) & (rainfall <= upper_bound).
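As a minimal sketch of the corrected filter, the snippet below keeps the points inside the bounds. It also uses the conventional Tukey fences, q1 - 1.5 * iqr and q3 + 1.5 * iqr; the definitions in the question give the narrower interval from q1 - 0.5 * iqr to q3 + 0.5 * iqr. The DataFrame here is a made-up stand-in, since the original data is not available.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame (values are made up)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Rainfall": rng.exponential(scale=1.0, size=1_000)})

rainfall = df["Rainfall"]
q1, q3 = np.quantile(rainfall, [0.25, 0.75])
iqr = q3 - q1

# Conventional Tukey fences: extend 1.5 * IQR beyond each quartile
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep the points inside the fences (note & instead of |)
inside = (rainfall >= lower_bound) & (rainfall <= upper_bound)
rainfall_wo_outliers = rainfall[inside]

print(f"kept {len(rainfall_wo_outliers)} of {len(rainfall)} points")
```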

Now back to the fundamental question: yes, this is normal. Outlier removal (with your method, at least) relies on the quartiles of the current distribution to compute the IQR and decide which points to drop.

However, once you drop those points, the remaining population has new quartiles and a new IQR, so you can end up with new outliers relative to the new Q1 and Q3.
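To make the shift concrete, here is a small sketch that filters once, recomputes the bounds on what is left, and counts the points that now fall outside. It assumes the conventional Tukey fences and a right-skewed synthetic sample (an exponential distribution, chosen only for illustration); the helper name tukey_fences is hypothetical.

```python
import numpy as np

def tukey_fences(x):
    """Return the (lower, upper) bounds q1 - 1.5 * IQR and q3 + 1.5 * IQR."""
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Assumed right-skewed sample, not the question's data
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)

# First pass: drop everything outside the fences of the full sample
low, high = tukey_fences(x)
kept = x[(x >= low) & (x <= high)]

# Second pass: the fences are recomputed from the reduced sample
new_low, new_high = tukey_fences(kept)
n_new_outliers = int(np.sum((kept < new_low) | (kept > new_high)))

print(f"first pass : upper bound {high:.3f}, {len(x) - len(kept)} points dropped")
print(f"second pass: upper bound {new_high:.3f}, {n_new_outliers} new outliers")
```

Because the largest values are gone, Q3 and the IQR both shrink, so the upper bound moves down and some previously acceptable points end up beyond it.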

This is particularly visible for normal or uniform data:

import numpy as np
import matplotlib.pyplot as plt

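def iqr_outliers_removal(s):
    q1, q3 = np.quantile(s, [0.25, 0.75])
    iqr = q3 - q1
    # bounds as defined in the question; the conventional Tukey fences
    # (q1 - 1.5 * iqr and q3 + 1.5 * iqr) give a wider interval
    upper_bound = q1 + 1.5 * iqr
    lower_bound = q3 - 1.5 * iqr

    # keep only the points inside the bounds
    return s[(s >= lower_bound) & (s <= upper_bound)]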

# generate random data
s = np.random.normal(size=10_000)

# iteratively remove outliers, recording the sample after each pass
s2 = s.copy()
out = [s2]
for _ in range(100):
    s2 = iqr_outliers_removal(s2)
    out.append(s2)

# plot the sample size after each pass (log scale)
ax = plt.subplot()
ax.plot(list(map(len, out)), marker='.', ls='')
ax.set_ylabel('data size')
ax.set_xlabel('iteration')
ax.set_yscale('log')
plt.show()

Size of the sample over 100 iterations of outlier removal:

[figure: iterative outlier removal]

Now, it can happen that you remove outliers and the new population becomes stable. If you use s = np.random.uniform(size=10_000) and run the simulation a few times, you will sometimes get something like:

[figure: stable population]

But this is just by chance ;)
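If a stable sample is what you want, one option is to repeat the filter until a pass removes nothing. The sketch below does that with the conventional Tukey fences on an assumed exponential sample; the function name trim_until_stable and the max_iter safety cap are hypothetical choices. Whether repeated trimming is appropriate at all depends on the data, since each pass discards more of the distribution's tail.

```python
import numpy as np

def trim_until_stable(x, max_iter=1_000):
    """Apply the IQR filter repeatedly until a pass removes nothing."""
    x = np.asarray(x)
    for n_passes in range(1, max_iter + 1):
        q1, q3 = np.quantile(x, [0.25, 0.75])
        iqr = q3 - q1
        kept = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]
        if len(kept) == len(x):  # nothing dropped: the sample is stable
            return kept, n_passes
        x = kept
    return x, max_iter  # safety cap reached before the sample stabilised

# Assumed right-skewed sample, not the question's data
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)

stable, n_passes = trim_until_stable(sample)
print(f"{len(sample)} -> {len(stable)} points after {n_passes} passes")
```

Points between Q1 and Q3 always lie inside the bounds, so the sample can shrink but never becomes empty, and the loop stops as soon as a pass leaves the size unchanged.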

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mozway