'If the number of rows in a group exceeds X number of observations, randomly sample X number of rows
I need to reduce the number of rows in a data set. To do this my strategy is to the number of rows in a group exceeds X number of observations, randomly sample X
number of rows from each group if the number of rows in a group exceeds X
rows.
Assume the following data set:
set.seed(123)
n <- 10
df <- data.frame(id = c(1:n),
group = sample(1:3, n, replace = T))
> df
id group
1 1 3
2 2 3
3 3 3
4 4 2
5 5 3
6 6 2
7 7 2
8 8 2
9 9 3
10 10 1
where X == 2
. Let's count the number of rows in each group.
> table(df$group)
1 2 3
1 4 5
This means that in the end result, I want 1 observation in groups one, and 2 in groups 2 and 3. The row that is kept in groups 2 and 3 should be randomly selected. This would reduce the data's size from 10 rows to 5.
How do I do this in an efficient way?
Thanks!
Solution 1:[1]
Here is one way to group by group column and create a condition in slice
to check if the number of rows (n()
) is greater than 'X', sample the sequence of rows (row_number()
) with X
or else return row_number()
(or sample in case X
is different value
library(dplyr)
X <- 2
df %>%
group_by(group) %>%
slice(if(n() >= X) sample(row_number(), X, replace = FALSE) else
sample(row_number())) %>%
ungroup
-output
# A tibble: 5 × 2
id group
<int> <int>
1 10 1
2 8 2
3 4 2
4 1 3
5 9 3
Solution 2:[2]
Probably we can use pmin
when we randomly sample
rows by groups
df %>%
group_by(group) %>%
slice(sample(n(), pmin(n(), 2))) %>%
ungroup()
and below is an data.table
equivalent
setDT(df)[, .SD[sample(.N, pmin(.N, X))], group]
Solution 3:[3]
I'm adding an alternative method using data.table
that seems to work. Good to have alternatives.
library(data.table)
X <- 2
setkey(df, "id")
df[
df[,
if(.N >= X)
.I[sample(.N, X, replace = FALSE)]
else
.I[sample(.N, replace = FALSE)],
by = group
]$V1
]
Which leads to this outcome.
id group
1: 5 3
2: 2 3
3: 6 2
4: 4 2
5: 10 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | ThomasIsCoding |
Solution 3 |