Two random unique samples of the same pool
I am trying to get two samples with unique elements in each sample. That is, the strings in the "first" vector cannot be in the "second" vector. Unfortunately, I always get repeated strings and I can't seem to find a way of solving this. I tried to solve it using if-else, but with no success.
edit: the final output should be pairs. The same numbers in first should be in second. The only thing that will vary is the letters. Each letter has to appear exactly three times. The reason I don't want repeated elements is that, when I am creating the pairs, I get pairs such as 1_W and 1_W. That cannot happen.
The output should be something like:
first: 12_U, 23_U, 6_U, 8_T, 24_T, 22_T, 7_S, 10_S, 19_S, 21_W, 14_W, 2_W
second: 12_W, 23_W, 6_W, 8_S, 24_S, 22_S, 7_T, 10_T, 19_T, 21_U, 14_U, 2_U
Edit 2:
I did a terrible job at explaining what I need. This code is going to be used to select headlines for a study in which I'm going to collect data.
Each theme represents a headline about a specific topic, such as global warming. There are 24 themes. Each version (U, T, S, W) represents variations of a true headline (T).
I have a headline bank with a total of 96 headlines that vary in terms of themes and versions. 1_U is the U version of theme 1. I want to check which versions participants will choose for each pair.
What I need is
- to select 12 themes;
- to create pairs within the same theme so participants can choose between two versions of the same headline.
- participants always need to see 12 pairs (2 versions of the same theme).
- I also need to guarantee that they will see equal proportions of each version. That's why I created vector “first” and vector “second”, which meet this criterion.
However, I am getting pairs with repeated versions. For example, some of the pairs I get are 12_S and 12_S, when they should be 12_S and any other version (12_U, 12_T or 12_W), because it does not make sense for a participant to choose between the S version of theme 12 and the S version of theme 12.
By creating two vectors I was able to get exactly what I wanted except for the fact that some pairs contain the same headline.
themes <- c(1:24)
set.seed(1)
twelve <- sample(themes, 12)
versions <- c('U', 'T', 'S', 'W')
set.seed(14)
first <- sample(paste(sample(twelve), rep(versions, 3), sep='_'))
second <- sample(paste(sample(twelve), rep(versions, 3), sep='_'))
repeated <- first[first %in% second]
if (is.null(repeated)) {
print(second) #if there are no elements in the vector "repeated", then print repeated
} else {
x <- sample(paste(sample(twelve), rep(versions, 3), sep='_')) #otherwise, pick another sample
}
Solution 1:[1]
To make sure you get 2 vectors, first and second, where the themes in first do not exist in second, you either need repeated themes within a vector or you must use sampling to split the themes up.
set.seed(1)
themes <- 1:24
versions <- c('U', 'T', 'S', 'W')
split_idx <- sample(length(themes), 0.5*length(themes))
set_1 <- themes[split_idx]
set_2 <- themes[-split_idx]
Which creates 2 unique samples, verified by
set_1 %in% set_2
Which should return a logical vector with only FALSE entries.
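A compact way to run the same check, shown as a minimal sketch that reuses the variable names from above. It collapses the element-wise comparison into single values and also confirms that the two sets together cover every theme.

```r
set.seed(1)
themes <- 1:24

# Split the 24 themes into two halves by sampling positions
split_idx <- sample(length(themes), 0.5 * length(themes))
set_1 <- themes[split_idx]
set_2 <- themes[-split_idx]

# No theme appears in both halves
overlap <- any(set_1 %in% set_2)
n_shared <- length(intersect(set_1, set_2))

# Together the halves contain every theme exactly once
covers_all <- setequal(c(set_1, set_2), themes)

print(overlap)
print(n_shared)
print(covers_all)
```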
If you only want 3 letters in the final 2 vectors I suggest the following:
first <- paste(sample(set_1), sample(versions, 3), sep = "_")
secnd <- paste(sample(set_2), sample(versions, 3), sep = "_")
The use of rep(versions, 3) is unnecessary, as R automatically recycles the shorter vector when the two vectors differ in length.
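The recycling behaviour can be seen in a small sketch. The vector themes_demo is a hypothetical stand-in for twelve sampled themes; with all four versions, paste() gives the same result with or without rep().

```r
versions <- c("U", "T", "S", "W")
themes_demo <- 1:12  # hypothetical stand-in for twelve sampled themes

# paste() recycles the shorter argument to the length of the longer one
recycled <- paste(themes_demo, versions, sep = "_")
explicit <- paste(themes_demo, rep(versions, 3), sep = "_")

same_result <- identical(recycled, explicit)

# Count how often each version letter occurs after recycling
version_counts <- table(sub(".*_", "", recycled))

print(same_result)
print(version_counts)
```

Because 12 is a multiple of 4, recycling is exact and each version ends up on three themes.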
To get new vectors with changing themes that preserve these properties, you must split themes again into 2 sets.
Edit 1: In response to the updated question.
To generate one sample of themes:
set.seed(1)
themes <- 1:24
versions <- c('U', 'T', 'S', 'W')
theme_sample <- sample(themes, 12)
To get the versions to be random and different between the two vectors, the following "hacky" solution came to mind.
first_versions <- sample(versions)
while(sum((second_versions <- sample(versions)) == first_versions) != 0){}
The above creates one sample, then continuously recreates a second sample until versions are no longer repeated elementwise. All that is left is to get the final vectors
first <- paste(theme_sample, first_versions, sep = "_")
second <- paste(theme_sample, second_versions, sep = "_")
As required.
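To confirm that the result meets the requirements from the question, the steps can be put together and checked. This is a sketch only: theme_of and version_of are illustrative helpers that are not part of the original answer, and the while loop is written as an equivalent repeat loop for readability.

```r
set.seed(1)
themes <- 1:24
versions <- c("U", "T", "S", "W")
theme_sample <- sample(themes, 12)

# Shuffle the versions, then reshuffle until no position matches
first_versions <- sample(versions)
repeat {
  second_versions <- sample(versions)
  if (!any(second_versions == first_versions)) break
}

# The four versions are recycled over the twelve themes
first <- paste(theme_sample, first_versions, sep = "_")
second <- paste(theme_sample, second_versions, sep = "_")

# Illustrative helpers: split "12_U" into theme "12" and version "U"
theme_of <- function(x) sub("_.*", "", x)
version_of <- function(x) sub(".*_", "", x)

same_theme <- all(theme_of(first) == theme_of(second))
different_version <- !any(version_of(first) == version_of(second))
balanced_first <- all(table(version_of(first)) == 3)
balanced_second <- all(table(version_of(second)) == 3)

print(c(same_theme, different_version, balanced_first, balanced_second))
```

One consequence of recycling a single pair of version orderings is that each version is always paired with the same partner version, as in the example output in the question.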
Solution 2:[2]
I think you can make your life easier by sampling your pairs of versions (without duplicates) and then pasting them with the theme value. So we first sample 12 themes, then apply over that sample and paste each theme with a pair of versions. You get a matrix with 2 rows that holds your pairs.
set.seed(1)
themes <- 1:24
versions <- c("U", "T", "S", "W")
pairs <- sapply(sample(themes, 12), FUN = function(x) paste(x, sample(versions, 2), sep = "_"))
pairs
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# [1,] "4_T" "7_S" "1_S" "2_U" "11_U" "14_U" "18_T" "22_T" "5_W" "16_U" "10_T" "6_T"
# [2,] "4_W" "7_U" "1_U" "2_W" "11_T" "14_W" "18_W" "22_U" "5_S" "16_S" "10_W" "6_W"
first <- pairs[1, ]
# [1] "4_T" "7_S" "1_S" "2_U" "11_U" "14_U" "18_T" "22_T" "5_W" "16_U" "10_T" "6_T"
second <- pairs[2, ]
# [1] "4_W" "7_U" "1_U" "2_W" "11_T" "14_W" "18_W" "22_U" "5_S" "16_S" "10_W" "6_W"
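Note that this approach guarantees two different versions within each pair, but it does not force each version to appear exactly three times per row, as the output above shows. A minimal sketch for checking both properties follows; version_of is an illustrative helper, not part of the original answer.

```r
set.seed(1)
themes <- 1:24
versions <- c("U", "T", "S", "W")
pairs <- sapply(sample(themes, 12),
                FUN = function(x) paste(x, sample(versions, 2), sep = "_"))

# Illustrative helper: extract the version letter from "12_U"
version_of <- function(x) sub(".*_", "", x)

# Within every pair (column) the two versions differ
pairs_differ <- all(version_of(pairs[1, ]) != version_of(pairs[2, ]))

# Version counts per row; these are not constrained to be 3 each
counts_first <- table(factor(version_of(pairs[1, ]), levels = versions))
counts_second <- table(factor(version_of(pairs[2, ]), levels = versions))

print(pairs_differ)
print(counts_first)
print(counts_second)
```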
Solution 3:[3]
Here is a brute-force approach. Draw two shuffles of the 12 sampled themes and, in the same way, two shuffles of the versions. Repeat until no row of either resulting matrix contains a duplicate, i.e. until the two columns differ at every position. Next, repeat each row of samp_vs three times and paste themes and versions together column by column using Map. All of this is wrapped in a function, samp_fun.
samp_fun <- \(themes, versions) {
  themes_12 <- sample(themes, 12)
  repeat {
    samp_th <- replicate(2, sample(themes_12))
    samp_vs <- replicate(2, sample(versions))
    # stop once no row holds the same theme or the same version twice
    if (!any(apply(samp_th, 1, duplicated)) &&
        !any(apply(samp_vs, 1, duplicated))) break
  }
  # repeat each version row three times to match the 12 themes
  samp_vs <- samp_vs[rep(seq_len(nrow(samp_vs)), each=3), ]
  Map(\(...) paste(..., sep='_'),
      as.data.frame(samp_th), as.data.frame(samp_vs)) |>
    setNames(c('first', 'second'))
}
Usage
themes <- 1:24
versions <- c('U', 'T', 'S', 'W')
set.seed(42)
res <- samp_fun(themes, versions)
Result
Gives a list with the two groups.
res$first
# [1] "4_S" "15_S" "9_S" "18_T" "5_T" "20_T"
# [7] "17_W" "24_W" "8_W" "7_U" "1_U" "10_U"
res$second
# [1] "15_U" "4_U" "10_U" "8_W" "7_W" "24_W"
# [7] "5_S" "18_S" "1_S" "17_T" "9_T" "20_T"
If you want first and second as objects in your workspace, use list2env.
list2env(res, .GlobalEnv)
first
second
Note: R >= 4.1 is required for the \(x) function shorthand and the native pipe |>.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
