Create a column in the original dataset to indicate whether the row was drawn in a random stratified sample

I would like to draw a stratified random sample (n = 375) from a dataset. Based on the stratified random sample, I would like to add a column to the original dataset indicating whether the row is in the stratified random sample (1) or not (0).


iris <- iris

# Get a random stratified sample
library(tidyverse)
stratified <- iris %>%
  group_by(Species) %>%
  sample_n(size=1)

# The final result I would like to get:
iris$sample3 <- 0
iris[21,6] <- 1
iris[65,6] <- 1
iris[106,6] <- 1

After that, I would like to repeat the procedure: draw a second stratified random sample (n = 125) from the first stratified random sample (n = 375) and again add an indicator column to the original dataset.



Solution 1:[1]

You can add a column to your data frame that has the required number of 1s per group (and 0 otherwise).

library(dplyr)

set.seed(1)

samples <- 1

sample1 <- iris %>%
  group_by(Species) %>%
  mutate(sampled = as.numeric(row_number() %in% sample(n(), samples)))

sample1
#> # A tibble: 150 x 6
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa        0
#>  2          4.9         3            1.4         0.2 setosa        0
#>  3          4.7         3.2          1.3         0.2 setosa        0
#>  4          4.6         3.1          1.5         0.2 setosa        1
#>  5          5           3.6          1.4         0.2 setosa        0
#>  6          5.4         3.9          1.7         0.4 setosa        0
#>  7          4.6         3.4          1.4         0.3 setosa        0
#>  8          5           3.4          1.5         0.2 setosa        0
#>  9          4.4         2.9          1.4         0.2 setosa        0
#> 10          4.9         3.1          1.5         0.1 setosa        0
#> # ... with 140 more rows
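Since the printed tibble only shows the first ten rows, it can help to confirm that every group received exactly the requested number of 1s. The following is a minimal sketch of such a check; it rebuilds `sample1` as above so that it runs on its own, and the name `flag_counts` is only illustrative.

```r
library(dplyr)

set.seed(1)
samples <- 1

# Same construction as above: flag `samples` random rows within each species
sample1 <- iris %>%
  group_by(Species) %>%
  mutate(sampled = as.numeric(row_number() %in% sample(n(), samples)))

# One row per species with the number of flagged rows in that group
flag_counts <- sample1 %>%
  summarise(n_sampled = sum(sampled))

flag_counts
```

Each value of `n_sampled` should equal `samples`; if a group has fewer rows than `samples`, `sample()` stops with an error instead of flagging fewer rows.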

To get the sampled values, simply filter to find the 1s:

sample1 %>% filter(sampled == 1)
#> # A tibble: 3 x 6
#> # Groups:   Species [3]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    sampled
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
#> 1          4.6         3.1          1.5         0.2 setosa           1
#> 2          5.6         3            4.1         1.3 versicolor       1
#> 3          6.3         3.3          6           2.5 virginica        1

Created on 2022-05-16 by the reprex package (v2.0.1)
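The question also asks for a second stratified sample drawn from the first one. One possible way to extend the approach is sketched below; it was not part of the reprex above. The helper `flag_sample()`, the column names `sampled1` and `sampled2`, and the per-group sizes `n_first` and `n_second` are all hypothetical choices for illustration, and you would need to pick per-group sizes that add up to your own totals (375 and 125).

```r
library(dplyr)

set.seed(1)

n_first  <- 10  # hypothetical per-group size of the first sample
n_second <- 4   # hypothetical per-group size of the second sample

# Hypothetical helper: return a 0/1 vector that flags `size` randomly
# chosen positions among those where `eligible` is TRUE
flag_sample <- function(eligible, size) {
  idx <- which(eligible)
  chosen <- idx[sample.int(length(idx), size)]
  as.numeric(seq_along(eligible) %in% chosen)
}

two_stage <- iris %>%
  group_by(Species) %>%
  mutate(
    # First stage: every row in the group is eligible
    sampled1 = flag_sample(rep(TRUE, n()), n_first),
    # Second stage: only rows flagged in the first stage are eligible
    sampled2 = flag_sample(sampled1 == 1, n_second)
  ) %>%
  ungroup()

# Counts per species and flag combination
two_stage %>% count(Species, sampled1, sampled2)
```

Because the second stage draws only from rows with `sampled1 == 1`, every row with `sampled2 == 1` is also in the first sample. Using `sample.int()` on the number of eligible positions avoids the behaviour of `sample()` when it is given a single number instead of a vector.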

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
