Outlier Detection and Removal

I am trying to remove the outliers from my dataset, which has more than 80,000 rows and more than 20 variables. I am using the lm function and plotting the regression, then removing the outliers displayed in the graphs. However, when I remove the outliers and plot again, new outliers emerge. Is there a way to detect and remove them all at once? Or is there a better way to detect and remove them?



Solution 1:[1]

What is an outlier? The example below shows that detecting outliers and removing those points based on an arbitrary statistic is not sound practice.

I will be using box-and-whisker plots rather than linear regression, since there is no data in the question, but the general principle is the same.
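
For context on the whisker rule: by default, boxplot flags points lying more than 1.5 times the interquartile range beyond the quartiles. The short calculation below is only a sketch, assuming an exactly standard Gaussian population and using theoretical quartiles instead of the sample hinges that boxplot actually computes. It estimates how often the rule flags a point even when nothing is wrong with the data.

```r
# Theoretical quartiles of a standard normal distribution
q <- qnorm(c(0.25, 0.75))
iqr <- diff(q)

# Whisker fences under the default 1.5 * IQR rule
fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)

# Probability that a single standard normal draw falls outside the fences
p_out <- pnorm(fences[1]) + pnorm(fences[2], lower.tail = FALSE)
p_out

# Expected number of flagged points in a sample of 1000
expected_flagged <- 1e3 * p_out
expected_flagged
```

The fences sit at roughly ±2.7, so about 0.7% of perfectly Gaussian draws land outside them. In a sample of 1000 that is around 7 points by expectation, and any given sample will vary around that figure.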

  1. The data vector x is sampled from a standard Gaussian distribution;
  2. Function boxplot is called in order to see which data points are outside the whiskers;
  3. A new vector y is created with those points removed. Why? If they were drawn from a standard Gaussian, why are they outliers? The answer is that they are not. Not everything is average, and data away from measures of central tendency are still data.
  4. Now comes trouble. Plot the new vector y and not all points are inside the whiskers: some points (just one in this case) are now outliers. What a surprise, stats depend on data! Iterate until there's none left and all problems are solved.
knitr::opts_chunk$set(fig.width = 8, fig.height = 6, dpi = 200, warning = FALSE)

set.seed(2022)

x <- rnorm(1e3)
xbp <- boxplot(x, plot = FALSE)
which(x %in% xbp$out)
#>  [1]   6  89  95 459 464 482 693 694 748 756 926 939

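# Logical indexing is safe even when nothing is flagged;
# x[-which(...)] would return an empty vector in that case
y <- x[!(x %in% xbp$out)]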
ybp <- boxplot(y, plot = FALSE)
i_yout <- which(y %in% ybp$out)

old_par <- par(mfrow = c(1, 2))
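boxplot(x, main = "Before outlier removal")
# Highlight the points that become outliers only after the first removal;
# rep() keeps the x and y lengths equal when more than one point is flagged
points(rep(1, length(i_yout)), y[i_yout], col = "red", pch = 20, cex = 2)
boxplot(y, main = "After outlier removal")
points(rep(1, length(i_yout)), y[i_yout], col = "red", pch = 20, cex = 2)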

par(old_par)
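
To see where "iterate until there's none left" leads, the sketch below repeats the removal until boxplot.stats flags nothing. The helper name strip_until_clean is hypothetical and exists only for this illustration; it is not a recommendation to clean data this way.

```r
set.seed(2022)
x <- rnorm(1e3)

# Hypothetical helper: drop everything the 1.5 * IQR rule flags,
# then recompute the rule on what is left, until nothing is flagged
strip_until_clean <- function(v) {
  rounds <- 0
  repeat {
    out <- boxplot.stats(v)$out
    if (length(out) == 0) break
    v <- v[!(v %in% out)]
    rounds <- rounds + 1
  }
  list(kept = v, rounds = rounds)
}

res <- strip_until_clean(x)
res$rounds                    # number of removal passes needed
length(x) - length(res$kept)  # total number of points discarded
range(res$kept)               # the tails of the sample are gone
```

Every pass shrinks the interquartile range a little, which pulls the fences inward and can expose new points. What remains is no longer a sample from the original distribution but a truncated version of it.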

Created on 2022-03-11 by the reprex package (v2.0.1)
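
Since the question uses lm, here is a minimal sketch of the same effect in a regression setting. The data are simulated, not the asker's, and the cutoff of 2 on the absolute standardized residual is an arbitrary choice made only for this illustration.

```r
set.seed(2022)

# Simulated data: a straight line plus standard Gaussian noise, no contamination
n <- 1e3
u <- runif(n)
v <- 1 + 2 * u + rnorm(n)

# Arbitrary cutoff (an assumption for this sketch)
cutoff <- 2

# First fit: flag points with large standardized residuals
fit1 <- lm(v ~ u)
flag1 <- abs(rstandard(fit1)) > cutoff

# Refit without the flagged points and apply the same rule again
fit2 <- lm(v ~ u, subset = !flag1)
flag2 <- abs(rstandard(fit2)) > cutoff

sum(flag1)   # flagged in the first fit
sum(flag2)   # newly flagged after refitting
sigma(fit1)  # residual standard error before removal
sigma(fit2)  # residual standard error after removal
```

Removing the largest residuals lowers the residual standard error, so the same cutoff becomes stricter on the second fit and new points cross it. This is the regression counterpart of the boxplot example: the rule is computed from the data, so changing the data changes the rule.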

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rui Barradas