Data splitting: ordinal grouped data with custom probability of outcomes
The createDataPartition function in caret (under Data Splitting) can sample data while preserving the relative frequency of each outcome. I am looking for something similar that also preserves groups and handles ordinal data.
I also want to specify the target distribution of my outcomes. Keeping groups intact, I want to see which groups I should run a follow-up experiment with in order to reach a target distribution (rather than simply reproducing the current distribution). The code below attempts this in a very blunt way:
# Load data
library(rethinking)
data(Trolley)
d <- Trolley
# Inspect current distribution of ratings
d$response <- factor(d$response)
round(summary(d$response)/dim(d)[1],2)
# Find 5 cases that roughly have my target distribution
targetdist <- c(0.3,0.1,0.1,0.1,0.1,0.1,0.1) # Arbitrary goal
# Unique cases
uniqcase <- unique(d$case)
# Poor method
runs <- 100
difmatrix <- matrix(NA,runs,2)
for (i in 1:runs) {
  # Take subset
  difmatrix[i, 1] <- i
  set.seed(i)
  casetests <- sample(uniqcase, 5)
  datasub <- subset(d, case %in% casetests)
  # Find ratings of subset
  difmatrix[i, 2] <- sum(abs(round(summary(datasub$response) / dim(datasub)[1], 2) - targetdist))
}
difmatrix[which.min(difmatrix[,2]),]
# Look at best distribution
set.seed(which.min(difmatrix[,2]))
casetests<- sample(uniqcase,5)
datasub <- subset(d, case %in% casetests)
round(summary(datasub$response)/dim(datasub)[1],2) # Current best distribution
In this toy example, the overall distribution in the data is:
0.13 0.09 0.11 0.23 0.15 0.15 0.15
I am aiming for a distribution of
0.3 0.1 0.1 0.1 0.1 0.1 0.1
but the best subset this search finds has:
0.21 0.12 0.12 0.22 0.12 0.11 0.10
I cannot help but think there is a better way to do this. In my actual case I want to select about 200 from a pool of 10,000, so it seems unlikely that random sampling will luck into a good choice.
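One direction that might scale better than random restarts is a greedy forward selection: start with no groups and repeatedly add the whole group that brings the pooled rating distribution closest to the target. The sketch below is only an illustration under assumptions. `greedy_groups` and its arguments are hypothetical names (not part of caret), the data are simulated rather than taken from Trolley, and the target is rescaled to sum to 1 because the example target above sums to 0.9. Greedy selection is a heuristic and is not guaranteed to find the best possible set of groups.

```r
# Hypothetical helper: greedily pick k whole groups whose pooled rating
# distribution is close (in L1 distance) to a target distribution.
# counts: matrix of rating counts, one row per group, one column per rating level
greedy_groups <- function(counts, target, k) {
  target <- target / sum(target)          # rescale so the target sums to 1
  chosen <- integer(0)
  total  <- numeric(ncol(counts))         # pooled counts of the chosen groups
  for (step in seq_len(k)) {
    candidates <- setdiff(seq_len(nrow(counts)), chosen)
    # Distance to the target if each remaining group were added next
    dist <- vapply(candidates, function(g) {
      pooled <- total + counts[g, ]
      sum(abs(pooled / sum(pooled) - target))
    }, numeric(1))
    best   <- candidates[which.min(dist)]
    chosen <- c(chosen, best)
    total  <- total + counts[best, ]
  }
  list(groups       = rownames(counts)[chosen],
       distribution = total / sum(total),
       distance     = sum(abs(total / sum(total) - target)))
}

# Simulated stand-in data: 30 groups of 50 ratings on a 1..7 scale,
# each group with its own (randomly drawn) rating probabilities
set.seed(1)
sim <- do.call(rbind, lapply(1:30, function(g) {
  p <- rgamma(7, shape = 1)
  data.frame(case     = paste0("g", g),
             response = sample(1:7, 50, replace = TRUE, prob = p))
}))
sim$response <- factor(sim$response, levels = 1:7)

counts     <- unclass(table(sim$case, sim$response))  # groups x rating levels
targetdist <- c(0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)

fit <- greedy_groups(counts, targetdist, k = 5)
fit$groups                    # selected groups, in the order they were added
round(fit$distribution, 2)    # pooled distribution of the selected groups
```

Each step costs one pass over the remaining groups, so picking 200 groups from 10,000 needs on the order of two million cheap distance evaluations. The L1 distance used here ignores the ordering of the ratings; comparing cumulative proportions instead, for example `sum(abs(cumsum(pooled / sum(pooled)) - cumsum(target)))`, would penalise mass that lands far from where the target puts it, which may suit ordinal data better.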
Thanks for reading. I hope this makes sense; I have been working on it for a while but still struggle to formulate it concisely.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow