'If the number of rows in a group exceeds X number of observations, randomly sample X number of rows

I need to reduce the number of rows in a data set. To do this my strategy is to the number of rows in a group exceeds X number of observations, randomly sample X number of rows from each group if the number of rows in a group exceeds X rows.

Assume the following data set:

set.seed(123)
n <- 10

df <- data.frame(id = c(1:n),
                 group = sample(1:3, n, replace = T))

> df
   id group
1   1     3
2   2     3
3   3     3
4   4     2
5   5     3
6   6     2
7   7     2
8   8     2
9   9     3
10 10     1

where X == 2. Let's count the number of rows in each group.

> table(df$group)

1 2 3 
1 4 5 

This means that in the end result, I want 1 observation in groups one, and 2 in groups 2 and 3. The row that is kept in groups 2 and 3 should be randomly selected. This would reduce the data's size from 10 rows to 5.

How do I do this in an efficient way?

Thanks!



Solution 1:[1]

Here is one way to group by group column and create a condition in slice to check if the number of rows (n()) is greater than 'X', sample the sequence of rows (row_number()) with X or else return row_number() (or sample in case X is different value

library(dplyr)
X <- 2
df %>% 
  group_by(group) %>% 
  slice(if(n() >= X) sample(row_number(), X, replace = FALSE) else 
     sample(row_number())) %>%
  ungroup

-output

# A tibble: 5 × 2
     id group
  <int> <int>
1    10     1
2     8     2
3     4     2
4     1     3
5     9     3

Solution 2:[2]

Probably we can use pmin when we randomly sample rows by groups

df %>%
    group_by(group) %>%
    slice(sample(n(), pmin(n(), 2))) %>%
    ungroup()

and below is an data.table equivalent

setDT(df)[, .SD[sample(.N, pmin(.N, X))], group]

Solution 3:[3]

I'm adding an alternative method using data.table that seems to work. Good to have alternatives.

library(data.table)

X <- 2

setkey(df, "id") 

df[
  df[,
     if(.N >= X)
       .I[sample(.N, X, replace = FALSE)]
     else
       .I[sample(.N, replace = FALSE)], 
     by = group
  ]$V1
]

Which leads to this outcome.

   id group
1:  5     3
2:  2     3
3:  6     2
4:  4     2
5: 10     1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 ThomasIsCoding
Solution 3