Replace and delete first and last percentile in dataframe or multiple columns at once
I have this dataset:
A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100)
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)
As I'm treating outliers, I want to delete the rows that contain values in the 1st or the last percentile, considering only the X variables for the percentile computation and treating all X variables as ONE group. Hence, the percentiles are computed over X1 to X5 pooled together. These steps occur to me:
- Replace the values of X1 to X5 with 1 to 100 (1 for each percentile). Remember, I'm not looking for the percentiles of each X, but for all X's as a whole.
- Delete the rows where the variables X1 to X5 contain 1 or 100
My attempt: (based on how to find percentiles, replace outliers with the 5th and 95th percentile, remove data greater than 95th percentile in data frame)
as.data.frame(sapply(select(DF, X1:X5), function (x) {
qx <- quantile(x, probs = c(1:100)/100)
cut(x, qx, labels = c(1:100))
}))
However, my attempt raises an error saying that the number of breaks differs from the number of labels. I'm also struggling to build the new data frame without losing the A and some_number variables (in my real problem there are not two such columns, but nearly 50).
Any suggestions?
Solution 1:[1]
Using both across and c_across from dplyr, you can also do this.
Steps explained:
- c_across is usually used with rowwise(), where it collects the columns selected by its inner argument into one vector per row. Here it is used without rowwise(), so instead of one row it returns the selected columns of the whole data as a single vector, which is what we want.
- Two quantiles of this pooled vector are then computed (each is a scalar).
- All that remains is to compare these two values with every X value in the data, so I used across directly.
- Inside across I built a lambda formula, which starts with a tilde (~) and takes . as its only argument. The formula ~ . is equivalent to function(x) x; the rest should be clear.
library(dplyr)
DF %>% mutate(across(starts_with('X'), ~ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                                                . < quantile(c_across(starts_with('X')), 0.01),
                                              NA, .)
)) %>% na.omit()
#> A some_number X1 X2 X3 X4 X5
#> 6 event_6 69 6 106 206 306 406
#> 7 event_7 871 7 107 207 307 407
#> 8 event_8 356 8 108 208 308 408
#> ...
#> 93 event_93 432 93 193 293 393 493
#> 94 event_94 967 94 194 294 394 494
#> 95 event_95 516 95 195 295 395 495
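As a small illustration of the formula shorthand described above (a sketch on a made-up two-column data frame named toy, which is not part of the original answer), the tilde form and an ordinary anonymous function should give the same result inside across:

```r
library(dplyr)

# Hypothetical toy data, used only for this illustration
toy <- data.frame(X1 = 1:3, X2 = 4:6)

# Formula shorthand: `.` stands for the current column
with_formula <- toy %>% mutate(across(starts_with("X"), ~ . * 2))

# The same operation written as a regular anonymous function
with_function <- toy %>% mutate(across(starts_with("X"), function(x) x * 2))

# Compare the two results
identical(with_formula, with_function)
```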
Since starts_with works only inside selecting functions such as across or c_across, and to avoid the slower rowwise here, we can also filter directly:
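library(stringr)  # provides str_detect()
# Note: cur_data() is deprecated since dplyr 1.1.0 in favour of pick()
DF %>% filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
                rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)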
This also returns 90 rows, as desired.
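To see where the 90 rows come from, here is a minimal base R sketch of the same pooled-quantile filter. It assumes the data frame from the question; the names x_cols, pooled, cutoffs, inside and kept are introduced here only for illustration. With X1 to X5 pooled into the values 1 to 500, the two cutoffs fall just below 6 and just above 495, so rows 1 to 5 are dropped because of X1 and rows 96 to 100 because of X5.

```r
# Rebuild the question's data (some_number is random and does not affect the filter)
A <- paste0("event_", 1:100)
some_number <- sample.int(1000, size = 100)
DF <- data.frame(A, some_number,
                 X1 = 1:100, X2 = 101:200, X3 = 201:300,
                 X4 = 301:400, X5 = 401:500)

# Pool every X column into one vector and take the 1st and 99th percentiles
x_cols <- grep("^X", names(DF), value = TRUE)
pooled <- unlist(DF[x_cols], use.names = FALSE)
cutoffs <- quantile(pooled, probs = c(0.01, 0.99))

# Logical matrix: TRUE where a value lies within the two cutoffs
inside <- DF[x_cols] >= cutoffs[1] & DF[x_cols] <= cutoffs[2]

# Keep only rows in which every X value is inside the cutoffs
kept <- DF[rowSums(!inside) == 0, ]
nrow(kept)
```

Because the filter is built from column names, all other columns (A, some_number, and any further ones) are carried along untouched.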
Solution 2:[2]
You can try the following -
library(dplyr)
vec <- DF %>% select(starts_with('X')) %>% as.matrix() %>% quantile(c(0.01, 0.99))
DF %>% filter(if_all(starts_with('X'), ~. > vec[1] & . < vec[2]))
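Coming back to the error in the question: cut() needs one more break point than labels, and c(1:100)/100 yields 100 break points, hence only 99 intervals. Below is a minimal sketch of the binning plan from the question, assuming base R only and percentiles computed over the pooled X1 to X5 values. The names x_cols, pooled, breaks, bins and DF_binned are introduced here for illustration.

```r
# Rebuild the question's data (some_number is random and does not affect the result)
A <- paste0("event_", 1:100)
some_number <- sample.int(1000, size = 100)
DF <- data.frame(A, some_number,
                 X1 = 1:100, X2 = 101:200, X3 = 201:300,
                 X4 = 301:400, X5 = 401:500)

x_cols <- grep("^X", names(DF), value = TRUE)
pooled <- unlist(DF[x_cols], use.names = FALSE)

# 101 break points (0%, 1%, ..., 100%) define exactly 100 intervals
breaks <- quantile(pooled, probs = 0:100 / 100)

# Percentile bin (1 to 100) of every X value, using the pooled breaks;
# include.lowest = TRUE keeps the overall minimum inside bin 1
bins <- sapply(DF[x_cols], function(x) {
  cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
})

# Drop rows in which any X value falls into bin 1 or bin 100
DF_binned <- DF[rowSums(bins == 1 | bins == 100) == 0, ]
nrow(DF_binned)
```

If the real data has many tied values, quantile() may return duplicate break points, which cut() rejects, so this sketch assumes the breaks are distinct.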
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Ronak Shah |
