'sample values by group with conditions
I have grouped data and I want to create a new variable value that will take the value 0 or 1.
- Every group needs at least one observation where
value==1. - But groups cannot have more than 2 observations where
value==1. - Ideally I can set it so no more than 25% of groups only have one observation where
value==1.
library(tidyverse)
set.seed(1)
# sample can break the rules
tibble(group = c(rep("A", 3),
rep("B", 6),
rep("C", 4),
rep("D", 5))) %>%
group_by(group) %>%
mutate(value = sample(c(0, 1), n(), replace = TRUE, prob = c(0.8, 0.2)))
Solution 1:[1]
One solution would be to create a listing of your unique group labels and shuffle those (here I get the unique group labels via nest). Then depending on whether the group is in the first 25% of rows of the data frame, you can assign either a) a random number between 1 and 2, or b) always 2. Finally, you can use the assigned number to define how 0s and 1s should be sampled for each group, and then unnest the result.
set.seed(0)
result <- df %>%
nest(data = -group) %>%
.[sample(1:nrow(.), nrow(.)), ] %>% # shuffle the group order
mutate(
value_count = ifelse(row_number() / n() <= 0.25, sample(1:2, n(), replace = T), 2)
) %>%
rowwise() %>%
mutate(
count = nrow(data),
value = list(sample(c(rep(1, value_count), rep(0, count - value_count)), count))
) %>%
unnest(value) %>%
select(-data, -value_count, -count)
group value
<chr> <dbl>
1 B 0
2 B 0
3 B 0
4 B 0
5 B 1
6 B 0
7 A 1
8 A 1
9 A 0
10 D 1
11 D 0
12 D 1
13 D 0
14 D 0
15 C 1
16 C 0
17 C 0
18 C 1
Solution 2:[2]
Looks like I was beat to the punch, but here's another way to do it:
library(tidyverse)
set.seed(1)
# sample can break the rules
x <- tibble(group = c(rep("A", 3),
rep("B", 6),
rep("C", 4),
rep("D", 5)))
# Make all 'var' =1, then set all but first of each group to 0.
xx <- x %>% group_by(group) %>%
mutate(var = row_number()) %>%
mutate(var = ifelse(var == 1, 1, 0))
pct_with_two <- .75 # percentage of groups with two 1's
samp_size <- floor(length(unique(xx$group)) * pct_with_two) #round down to whole number
addl_one <- sample(unique(xx$group), size = samp_size, replace = F)
xx %>%
mutate(var2 = case_when(
group %in% addl_one & row_number() == 2 ~ 1,
TRUE ~0)) %>%
mutate(var = var+var2) %>%
select(-var2)
#> # A tibble: 18 x 2
#> # Groups: group [4]
#> group var
#> <chr> <dbl>
#> 1 A 1
#> 2 A 1
#> 3 A 0
#> 4 B 1
#> 5 B 0
#> 6 B 0
#> 7 B 0
#> 8 B 0
#> 9 B 0
#> 10 C 1
#> 11 C 1
#> 12 C 0
#> 13 C 0
#> 14 D 1
#> 15 D 1
#> 16 D 0
#> 17 D 0
#> 18 D 0
Created on 2022-03-11 by the reprex package (v0.3.0)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | jdobres |
| Solution 2 | mrhellmann |
