Stratified sampling
I am still quite new to R and have what is probably an easy question that I hope you can answer.
I am working with the GSS 2010 dataset. I have an id for each respondent and a variable 'region' with 9 numeric levels; in all, there are 2044 observations of 794 variables.
I want to draw a sample of size 100, with each stratum sampled in proportion to its population size (in the full GSS).
I have looked at the packages 'sampling' and 'survey', but unfortunately I haven't been able to draw the sample.
So far my best guess is something like this:
#Stratified subsample of GSS2010; regions as strata
s=strata(GSS2010,c("region"),size=c(100), method="systematic", pik=id$region)
I hope you will be able to help. Thank you very much in advance.
Best, Sofie
Solution 1:[1]
As mentioned in one of the comments, you could use dplyr::sample_frac on the data grouped by the stratum variable, setting the fraction to 100/nrow(gss2010). Below I propose a solution that requires only base R:
#' @title Stratified sampling
#' @description Perform proportional sampling according to specified strata
#' @param x A vector to sample from
#' @param strata A vector denoting the strata to sample by
#' @param size Number of items to sample
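stratified_sample <- function(x, strata, size) {
  if (size >= length(x)) stop("Can't use size >= length(x)")
  if (length(x) != length(strata)) stop("x and strata are of different lengths")
  # Target number of draws per stratum; because of rounding, the total
  # can differ slightly from size
  samples <- round(table(strata) / length(strata) * size)
  # lapply() always returns a list (sapply() may simplify to a matrix);
  # which() ignores NA strata
  unlist(lapply(names(samples), function(y) {
    idx <- which(strata == y)
    # Sample positions, so a one-element stratum is not expanded to 1:n by sample()
    x[idx[sample.int(length(idx), size = samples[[y]])]]
  }), use.names = FALSE)
}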
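Because each stratum's allocation is rounded independently, the per-stratum counts need not add up to exactly `size`. One possible refinement, which is not part of the original answer, is a largest-remainder allocation: take the floor of each exact share, then hand the remaining draws to the strata with the largest fractional parts. The helper name `allocate_proportional` and the region counts below are made up for illustration.

```r
# Hypothetical helper (not from the answer): proportional allocation whose
# per-stratum counts always sum to exactly `size` (largest-remainder rule).
allocate_proportional <- function(strata, size) {
  counts <- table(strata)                          # stratum population sizes
  exact  <- as.numeric(counts) / sum(counts) * size  # exact (fractional) shares
  alloc  <- floor(exact)                           # start from the floors
  leftover <- size - sum(alloc)                    # draws still to hand out
  if (leftover > 0) {
    # Give one extra draw to the strata with the largest fractional parts
    top <- order(exact - alloc, decreasing = TRUE)[seq_len(leftover)]
    alloc[top] <- alloc[top] + 1
  }
  setNames(alloc, names(counts))
}

# Synthetic strata: 2044 rows in 9 regions (made-up counts, not the real GSS)
region <- rep(1:9, times = c(100, 150, 200, 250, 300, 350, 294, 250, 150))
alloc  <- allocate_proportional(region, 100)
alloc
sum(alloc)
```

The resulting vector could replace the `round(...)` line inside a function like `stratified_sample` when the total must match `size` exactly.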
You can try stratified_sample for yourself below (I didn't find a region column in the data I found, so I used degree instead):
library(openintro)
data("gss2010")
gss2010 <- data.frame(gss2010) # convert from tibble back to regular df
gss2010$grass[is.na(gss2010$grass)] <- "LEGAL"
samp <- gss2010[stratified_sample(1:nrow(gss2010), gss2010$degree, size = 100),]
table(samp$degree)/nrow(samp)
table(gss2010$degree)/nrow(gss2010)
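The dplyr route mentioned at the start of this answer can be sketched as follows. This is a minimal sketch on synthetic data: the `gss` data frame and its region counts are made up, and grouping by `region` before calling `sample_frac` is my assumption about how the comment intended it to be used. Note that `sample_frac` is superseded by `slice_sample(prop = ...)` in recent dplyr versions, and that the fraction is applied per group, so the total may differ slightly from 100.

```r
library(dplyr)

# Synthetic stand-in for the GSS data: 2044 rows, 9 regions (made-up counts)
gss <- data.frame(
  id     = seq_len(2044),
  region = rep(1:9, times = c(100, 150, 200, 250, 300, 350, 294, 250, 150))
)

set.seed(42)                  # arbitrary seed, only for reproducibility
frac <- 100 / nrow(gss)       # overall sampling fraction

samp <- gss %>%
  group_by(region) %>%        # apply the fraction within each stratum
  sample_frac(frac) %>%
  ungroup()

# Compare sample and population shares per region
round(table(samp$region) / nrow(samp), 3)
round(table(gss$region) / nrow(gss), 3)
```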
Solution 2:[2]
I think the sample function in base R should be enough:
s <- GSS2010[sample(nrow(GSS2010), 100), ]
This will select one hundred rows of your data frame. Every row is equally likely to be picked, so the expected share of each region in the sample is proportional to its number of rows in the data frame; the realized shares will vary from draw to draw.
If this is not what you want, please edit the sentence "with each stratum sampled in proportion to its population size (the full GSS)" to make it clearer.
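To see how a simple random sample behaves with respect to the strata, here is a small simulation on synthetic data (the `region` vector is made up, not taken from the GSS). It compares the region shares in a single draw of 100 rows with the shares averaged over many draws: the average tracks the population shares, while any single draw can deviate.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Synthetic strata: 2044 rows in 9 regions (made-up counts)
region <- rep(1:9, times = c(100, 150, 200, 250, 300, 350, 294, 250, 150))
population_share <- as.numeric(table(region)) / length(region)

# Region shares in one simple random sample of 100 rows
draw_shares <- function() {
  s <- sample(region, 100)
  as.numeric(table(factor(s, levels = 1:9))) / 100
}

one_draw  <- draw_shares()
avg_share <- rowMeans(replicate(2000, draw_shares()))

round(cbind(population = population_share,
            one_draw   = one_draw,
            average    = avg_share), 3)
```

If the shares must match in every sample rather than only on average, an explicitly stratified draw is the better fit.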
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | cmbarbu |
