Creating a function for automatic optimal stratification
Sample data:
library(data.table)  # setDT() and :=
library(stringi)     # stri_rand_strings()
dat <- as.data.frame(runif(100000, 0, 3000))
names(dat)[1] <- "Y"
dat$type <- sample(LETTERS, 100000, replace = TRUE)
dat$region <- stri_rand_strings(100000, 1, "[A-D]")
setDT(dat)
dat[, `:=` (count_by_type = .N, var_Y = var(Y)), by = type]
I would like to write a function that performs optimal stratification for me. It should essentially check whether the variance decreases by more than 5% when an extra stratum is added.
The requirements are as follows:
The maximum number of groups is 8, which would then have the following cut-off points:
max_stratums <- c(50, 100, 250, 500, 750, 1000, 1500, 3000)
The minimum number of observations per group is 200. I thought of writing something like the function below:
optimal_strat <- function(dat, max_stratums) {
for (i in seq_along(max_stratums)) {
# Loop through possible cut-off points
setDT(dat)[, SO_cat := cut2(Y, max_stratums[i])]
# Calculate variance for new division
dat[, `:=` (sub_type = .N, var_Y2 = var(Y)), by = c("SO_cat", "type")]
# Now I need to compare if the weighted variance of one of the divisions is 5% less than the old and pick the division with the biggest weighted decrease. Something like:
dat[var_Y_2 > var_Y2,
}
}
How should I code this last part?
Ideally, the function would not run just once, but would repeat until the variance per group no longer decreases enough.
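One possible reading of these requirements is a greedy forward search: start with a single stratum, try each remaining cut-off point, keep the one that lowers the size-weighted within-group variance the most, and stop once the best relative decrease falls below 5%. The sketch below is only an illustration under several assumptions that are not in the question: the helper `weighted_var()` and the arguments `candidates`, `min_gain`, `min_n`, `max_strata` and `by_cols` are hypothetical names, the "weighted variance" is taken to be `sum(n * var) / sum(n)` over stratum-by-type cells, the 200-observation minimum is applied to those cells, and base `cut()` stands in for `cut2()`.

```r
library(data.table)

# Size-weighted within-cell variance of Y for a given set of cut points.
# Returns NA when any stratum-by-group cell has fewer than min_n rows.
weighted_var <- function(dat, cuts, min_n = 200, by_cols = "type") {
  tmp <- copy(as.data.table(dat))  # work on a copy, leave the input untouched
  tmp[, stratum := cut(Y, breaks = c(-Inf, sort(cuts), Inf))]
  cells <- tmp[, .(n = .N, v = var(Y)), by = c("stratum", by_cols)]
  if (any(cells$n < min_n)) return(NA_real_)
  cells[, sum(n * v) / sum(n)]
}

# Greedy search: add one cut point at a time while the best candidate
# lowers the weighted variance by at least min_gain (5% by default).
optimal_strat <- function(dat, candidates, min_gain = 0.05, min_n = 200,
                          max_strata = 8, by_cols = "type") {
  chosen  <- numeric(0)
  current <- weighted_var(dat, chosen, min_n, by_cols)
  if (is.na(current)) stop("baseline groups are smaller than min_n")
  repeat {
    remaining <- setdiff(candidates, chosen)
    # k cut points give k + 1 strata, so stop at max_strata - 1 cuts
    if (length(remaining) == 0 || length(chosen) + 1 >= max_strata) break
    trial <- vapply(remaining, function(cp) {
      weighted_var(dat, c(chosen, cp), min_n, by_cols)
    }, numeric(1))
    if (all(is.na(trial))) break          # every candidate violates min_n
    best <- which.min(trial)              # which.min() skips NA entries
    if (trial[best] > (1 - min_gain) * current) break  # gain below threshold
    chosen  <- sort(c(chosen, remaining[best]))
    current <- trial[best]
  }
  list(cuts = chosen, weighted_var = current)
}

# Small demo on simulated data (smaller than the sample data in the question)
set.seed(1)
dat_demo <- data.table(Y = runif(20000, 0, 3000),
                       type = sample(c("A", "B"), 20000, replace = TRUE))
res <- optimal_strat(dat_demo, candidates = c(50, 100, 250, 500, 750, 1000, 1500))
print(res)
```

The last value of `max_stratums` (3000) is left out of `candidates` here because it is the upper bound of `Y` rather than an interior cut point. If the 200-observation minimum is meant per stratum rather than per stratum and type, passing `by_cols = character(0)` should give that behaviour, at the cost of ignoring `type` in the variance as well.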
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
