Replace and delete first and last percentile in dataframe or multiple columns at once

I have this dataset:

A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100) 
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)

As I'm treating outliers, I want to delete the rows that contain values in the 1st or the last percentile, computing the percentiles over the X variables only and treating all X variables as ONE group. That is, the percentiles are computed over X1 to X5 pooled together. These steps occurred to me:

Replace the values of X1 to X5 with 1 to 100 (one label per percentile). Remember, I'm not looking for the percentiles of each X, but for all X's as a whole.
Delete the rows where any of the variables X1 to X5 contains 1 or 100.

My attempt: (based on how to find percentiles, replace outliers with the 5th and 95th percentile, remove data greater than 95th percentile in data frame)

as.data.frame(sapply(select(DF, X1:X5), function (x) {
     qx <- quantile(x, probs = c(1:100)/100)
     cut(x, qx, labels = c(1:100))
}))

But my attempt raises an error saying that the number of breaks differs from the number of labels. I'm also struggling to assign the result to a new dataframe without losing the A and some_number variables (in my real problem there are not two such columns, but nearly 50).

Any suggestions?

Solution 1:^[1]

Using both across and c_across from dplyr, you can also do this.

Steps explained -

c_across is usually used with rowwise(), where it combines the selected columns of a single row into one vector. Here I call it without rowwise(), so instead of one row it combines the selected columns of the whole data into a single vector, as desired.
Two quantiles of this pooled vector are then computed (each is a scalar).
All that remains is to compare these values with every value in the data, so I used across directly.
Inside across I wrote a lambda formula, which starts with a tilde (~) and refers to its argument as a dot (.). The tilde-style formula ~ . is equivalent to function(x) x, and the rest is clear.

library(dplyr)

# Set X values outside the pooled 1st/99th percentiles to NA, then drop those rows
DF %>% mutate(across(starts_with('X'), ~ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                                                 . < quantile(c_across(starts_with('X')), 0.01),
                                               NA, .) 
                     )) %>% na.omit()

#>           A some_number X1  X2  X3  X4  X5
#> 6   event_6          69  6 106 206 306 406
#> 7   event_7         871  7 107 207 307 407
#> 8   event_8         356  8 108 208 308 408
.
.
.
#> 93 event_93         432 93 193 293 393 493
#> 94 event_94         967 94 194 294 394 494
#> 95 event_95         516 95 195 295 395 495
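To see why exactly rows 1 to 5 and 96 to 100 disappear, the pooled cutoffs can be computed on their own. This is a minimal base R sketch, not part of the original answer; the names `X`, `pooled`, `cutoffs`, and `dropped` are chosen here for illustration only.

```r
# Rebuild only the X columns from the question
X <- data.frame(X1 = 1:100, X2 = 101:200, X3 = 201:300,
                X4 = 301:400, X5 = 401:500)

# Pool all X columns into one vector, as c_across does without rowwise()
pooled <- unlist(X, use.names = FALSE)

# 1st and 99th percentiles of the pooled values (about 5.99 and 495.01)
cutoffs <- quantile(pooled, probs = c(0.01, 0.99))

# Rows with at least one X value outside the cutoffs
dropped <- unname(which(rowSums(X < cutoffs[1] | X > cutoffs[2]) > 0))
dropped
```

With this data the lower cutoff only catches X1 values 1 to 5 and the upper cutoff only catches X5 values 496 to 500, which matches the 90 rows kept above.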

Since starts_with works only inside selecting functions such as across or c_across, and to avoid a slower rowwise() here, we can also filter directly:

library(stringr)  # for str_detect(); dplyr is assumed to be loaded already
DF %>% filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
                rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)

This also gives 90 rows in the output, as desired.
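As a side note, cur_data() was deprecated in dplyr 1.1.0 in favor of pick(). The following is a hedged sketch of the same row filter written with pick(), assuming dplyr 1.1.0 or later; the seed and the names `cutoffs` and `trimmed` are additions made here for illustration and are not from the original answer.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, where pick() is available

set.seed(42)  # assumption: seed added only so the sketch is reproducible
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

# Pooled 1st and 99th percentiles over all X columns, computed once
cutoffs <- DF %>%
  select(starts_with("X")) %>%
  unlist() %>%
  quantile(c(0.01, 0.99))

# Keep rows in which no X value falls outside the two cutoffs
trimmed <- DF %>%
  filter(rowSums(pick(starts_with("X")) < cutoffs[1] |
                 pick(starts_with("X")) > cutoffs[2]) == 0)

nrow(trimmed)
```

Computing the cutoffs once up front also avoids calling quantile() twice inside filter(). On the example data this should again leave 90 rows.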

Solution 2:^[2]

You can try the following:

library(dplyr)
vec <- DF %>% select(starts_with('X')) %>% as.matrix() %>% quantile(c(0.01, 0.99))

DF %>% filter(if_all(starts_with('X'), ~. > vec[1] & . < vec[2]))
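The same pooling idea can also repair the cut() attempt from the question. There, 100 probabilities produce 100 break points, which delimit only 99 intervals, so 100 labels cannot fit. A minimal base R sketch under these assumptions: the breaks come from the pooled X values using probabilities 0 to 1 (101 breaks), the pooled quantiles are all distinct (cut() fails on duplicated breaks), and the names `x_cols`, `pct`, and `DF_trimmed` are hypothetical.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

# Select the X columns by name and pool them into one vector
x_cols <- grep("^X", names(DF), value = TRUE)
pooled <- unlist(DF[x_cols], use.names = FALSE)

# 101 break points delimit 100 intervals, one per percentile
breaks <- quantile(pooled, probs = (0:100) / 100)

# Percentile index (1..100) of every X value; labels = FALSE returns integers
pct <- sapply(DF[x_cols], function(x) {
  cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
})

# Drop rows in which any X value sits in percentile 1 or 100
DF_trimmed <- DF[rowSums(pct == 1 | pct == 100) == 0, ]
nrow(DF_trimmed)
```

Because only the row index is used for subsetting, A, some_number, and any other non-X columns are carried along unchanged.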

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	Ronak Shah
