Making Combinations of Items
Suppose I have the following lists of factors:
factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")
I made the following data frame that contains all 72 (= 3 * 2 * 4 * 3) combinations of these factors:
data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)
data_exp$id = 1:nrow(data_exp)
> head(data_exp)
Var1 Var2 Var3 Var4 id
1 A1 B1 C1 D1 1
2 A2 B1 C1 D1 2
3 A3 B1 C1 D1 3
4 A1 B2 C1 D1 4
5 A2 B2 C1 D1 5
6 A3 B2 C1 D1 6
I want to randomly split this data (data_exp) into 3 datasets such that each row appears in exactly one of them. The 3 datasets do not have to be the same size. I tried to do this with the following code.
First, I generate 3 random numbers, one for the number of rows in each dataset, such that they add up to 72:
# https://stackoverflow.com/questions/24845909/generate-n-random-integers-that-sum-to-m-in-r
rand_vect <- function(N, M, sd = 1, pos.only = TRUE) {
  vec <- rnorm(N, M/N, sd)
  if (abs(sum(vec)) < 0.01) vec <- vec + 1
  vec <- round(vec / sum(vec) * M)
  deviation <- M - sum(vec)
  for (. in seq_len(abs(deviation))) {
    vec[i] <- vec[i <- sample(N, 1)] + sign(deviation)
  }
  if (pos.only) while (any(vec < 0)) {
    negs <- vec < 0
    pos <- vec > 0
    vec[negs][i] <- vec[negs][i <- sample(sum(negs), 1)] + 1
    vec[pos][i] <- vec[pos ][i <- sample(sum(pos ), 1)] - 1
  }
  vec
}
r = rand_vect(3, 72)
[1] 26 23 23
Next, I tried to create these datasets using these random numbers:
data_1 = data_exp[sample(nrow(data_exp), r[1]), ]
data_2 = data_exp[sample(nrow(data_exp), r[2]), ]
data_3 = data_exp[sample(nrow(data_exp), r[3]), ]
The problem with this approach is that data_1, data_2, data_3 have common rows, and not all the rows from data_exp are necessarily present within data_1, data_2, data_3.
Is there a way to fix this problem?
Thank you!
Solution 1:[1]
Use the following:
n <- 3
split(data_exp, sample(rep(seq(n), rmultinom(1, nrow(data_exp), rep(1, n)))))
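To see what this one-liner does, here is a small self-contained sketch that unpacks it step by step. The names sizes, labels, parts and all_ids are introduced only for illustration, and the seed is arbitrary. rmultinom() draws n random group sizes that sum to the number of rows, rep() turns them into group labels, and sample() shuffles the labels before split() uses them.

```r
# Rebuild the data from the question so the sketch runs on its own.
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

set.seed(42)  # arbitrary seed, only for reproducibility
n <- 3

# Random group sizes that always sum to nrow(data_exp).
sizes <- as.vector(rmultinom(1, nrow(data_exp), rep(1, n)))

# One label per row, shuffled so that group membership is random too.
labels <- sample(rep(seq_len(n), sizes))
parts <- split(data_exp, labels)

# Every row lands in exactly one part.
all_ids <- unlist(lapply(parts, `[[`, "id"), use.names = FALSE)
stopifnot(sum(sizes) == nrow(data_exp),
          anyDuplicated(all_ids) == 0,
          setequal(all_ids, data_exp$id))
```

The final stopifnot() call is silent when the parts form a proper partition of data_exp.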
Solution 2:[2]
You can try the code below (given n <- 3)
split(data_exp, paste0("data_", sample(n, nrow(data_exp), replace = TRUE)))
or
split(data_exp, paste0("data_", sort(sample(n, nrow(data_exp), replace = TRUE))))
If you want to generate variables in the global environment, you should use list2env:
list2env(
  split(data_exp, paste0("data_", sample(n, nrow(data_exp), replace = TRUE))),
  envir = .GlobalEnv
)
or
list2env(
  split(data_exp, paste0("data_", sort(sample(n, nrow(data_exp), replace = TRUE)))),
  envir = .GlobalEnv
)
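The variants with and without sort behave differently, which the following sketch tries to make visible. The names labels, shuffled, blocked and is_run are illustrative and not part of the answer. Without sort, each row gets a random label. With sort, the group sizes are still random, but the labels come in ascending order, so each group is a block of consecutive rows of data_exp.

```r
# Rebuild the data from the question so the sketch runs on its own.
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 3
labels <- sample(n, nrow(data_exp), replace = TRUE)

shuffled <- split(data_exp, paste0("data_", labels))        # random membership
blocked  <- split(data_exp, paste0("data_", sort(labels)))  # consecutive blocks

# Both variants produce the same group sizes ...
stopifnot(identical(sapply(shuffled, nrow), sapply(blocked, nrow)))

# ... but with sort() every group is a run of consecutive ids.
is_run <- function(ids) all(diff(ids) == 1)
stopifnot(all(sapply(blocked, function(d) is_run(d$id))))
```

Which variant is appropriate depends on whether random group membership matters or only random group sizes.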
Solution 3:[3]
A simple way to do it would be
data_list <- split(data_exp,
sample(rep(c(1:3),
diff(c(0, sort(sample(72, 2)), 72)))))
This generates three numbers (say, x, y and z) that add to 72 (third line). The rep generates a sequence of x 1s, y 2s and z 3s, that are interpreted as the factor levels for splitting the dataframe. sample shuffles this sequence of factors.
The output is a list of three dataframes, which you can easily convert to three separate dataframes with data_1 <- data_list[[1]] etc.
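As a worked example of the size-generating step, the sketch below fixes the two cut points at 20 and 50. These values are chosen by hand for illustration; in the answer they come from sample(72, 2). The names cuts, sizes and labels are also introduced here.

```r
cuts <- c(20, 50)              # stand-in for sort(sample(72, 2))
sizes <- diff(c(0, cuts, 72))  # gaps between 0, 20, 50 and 72
sizes
# [1] 20 30 22

labels <- rep(1:3, sizes)      # 20 ones, 30 twos, 22 threes
length(labels)
# [1] 72
```

Because the cut points are drawn from 1 to 72, one of them can be 72 itself, in which case the last group has size 0.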
Note that this method of generating three numbers adding up to 72 produces quite wide variation between runs. Some of the other answers here use a method along the lines of sample(1:3, 72, replace = TRUE), which tends to produce about 24 of each number, roughly normally distributed. You should think about what sort of distribution you need. The following simulations illustrate the difference:
summary(as.vector(sapply(1:1000, function(i) diff(c(0, sort(sample(72, 2)), 72)))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 10 21 24 36 71
summary(as.vector(sapply(1:1000, function(i) tabulate(sample(3, 72, replace = TRUE)))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
11 21 24 24 27 37
Solution 4:[4]
You can use the following code, which splits the data into three data frames whose sizes are given by your r and assigns them the corresponding names:
data <- setNames(split(data_exp, sample(rep(1:3, r))), c("data_1","data_2","data_3"))
When checking the data by number of rows:
> nrow(data$data_1)
[1] 26
> nrow(data$data_2)
[1] 23
> nrow(data$data_3)
[1] 23
Let's check the dataframes:
> data$data_1
Var1 Var2 Var3 Var4 id
3 A3 B1 C1 D1 3
8 A2 B1 C2 D1 8
10 A1 B2 C2 D1 10
11 A2 B2 C2 D1 11
12 A3 B2 C2 D1 12
13 A1 B1 C3 D1 13
17 A2 B2 C3 D1 17
20 A2 B1 C4 D1 20
26 A2 B1 C1 D2 26
27 A3 B1 C1 D2 27
28 A1 B2 C1 D2 28
31 A1 B1 C2 D2 31
35 A2 B2 C2 D2 35
36 A3 B2 C2 D2 36
38 A2 B1 C3 D2 38
39 A3 B1 C3 D2 39
40 A1 B2 C3 D2 40
45 A3 B1 C4 D2 45
48 A3 B2 C4 D2 48
49 A1 B1 C1 D3 49
50 A2 B1 C1 D3 50
52 A1 B2 C1 D3 52
54 A3 B2 C1 D3 54
56 A2 B1 C2 D3 56
68 A2 B1 C4 D3 68
70 A1 B2 C4 D3 70
> data$data_2
Var1 Var2 Var3 Var4 id
1 A1 B1 C1 D1 1
4 A1 B2 C1 D1 4
7 A1 B1 C2 D1 7
9 A3 B1 C2 D1 9
15 A3 B1 C3 D1 15
23 A2 B2 C4 D1 23
25 A1 B1 C1 D2 25
29 A2 B2 C1 D2 29
37 A1 B1 C3 D2 37
41 A2 B2 C3 D2 41
42 A3 B2 C3 D2 42
44 A2 B1 C4 D2 44
51 A3 B1 C1 D3 51
53 A2 B2 C1 D3 53
55 A1 B1 C2 D3 55
57 A3 B1 C2 D3 57
60 A3 B2 C2 D3 60
61 A1 B1 C3 D3 61
63 A3 B1 C3 D3 63
64 A1 B2 C3 D3 64
66 A3 B2 C3 D3 66
69 A3 B1 C4 D3 69
71 A2 B2 C4 D3 71
> data$data_3
Var1 Var2 Var3 Var4 id
2 A2 B1 C1 D1 2
5 A2 B2 C1 D1 5
6 A3 B2 C1 D1 6
14 A2 B1 C3 D1 14
16 A1 B2 C3 D1 16
18 A3 B2 C3 D1 18
19 A1 B1 C4 D1 19
21 A3 B1 C4 D1 21
22 A1 B2 C4 D1 22
24 A3 B2 C4 D1 24
30 A3 B2 C1 D2 30
32 A2 B1 C2 D2 32
33 A3 B1 C2 D2 33
34 A1 B2 C2 D2 34
43 A1 B1 C4 D2 43
46 A1 B2 C4 D2 46
47 A2 B2 C4 D2 47
58 A1 B2 C2 D3 58
59 A2 B2 C2 D3 59
62 A2 B1 C3 D3 62
65 A2 B2 C3 D3 65
67 A1 B1 C4 D3 67
72 A3 B2 C4 D3 72
As you can see, each row appears exactly once across the three data frames, and the number of rows in each data frame comes from your function.
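Rather than checking by eye, the same property can be tested programmatically. The sketch below is self-contained and uses the sizes c(26, 23, 23) from the question's output as a stand-in for r; the name ids is introduced here for illustration.

```r
# Rebuild the data from the question so the sketch runs on its own.
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

r <- c(26, 23, 23)  # stand-in for rand_vect(3, 72)
data <- setNames(split(data_exp, sample(rep(1:3, r))),
                 c("data_1", "data_2", "data_3"))

# Collect the ids of all three data frames into one vector.
ids <- unlist(lapply(data, `[[`, "id"), use.names = FALSE)

stopifnot(all(sapply(data, nrow) == r),    # sizes match r
          anyDuplicated(ids) == 0,         # no row appears twice
          setequal(ids, data_exp$id))      # no row is missing
```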
Edit
Great suggestion by @GKi. You can also use the following code:
setNames(split(data_exp, sample(3, nrow(data_exp), TRUE)), c("data_1","data_2","data_3"))
Solution 5:[5]
To keep your original idea (which is a nice one, although I think using split as proposed by KU99 is the simplest and most efficient approach), the only missing step in your code is to remove the rows you have already sampled, so that you do not sample them again.
Using the code you gave, you could do:
r = rand_vect(3, 72)
rowlist = 1:nrow(data_exp)
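lapply(seq_along(r), function(i) {
  # Index into rowlist so that a single remaining row is handled correctly
  chosen_rows <- rowlist[sample.int(length(rowlist), r[i])]
  # Remove the chosen rows so later subsets cannot reuse them
  rowlist <<- rowlist[!rowlist %in% chosen_rows]
  data_exp[chosen_rows, ]
})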
This returns a list of your 3 subsets.
Here, for each subset, I remove the chosen rows from the rows still available for the next subset (rowlist <<- rowlist[!rowlist %in% chosen_rows]). This way, you do not sample rows that were already sampled.
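One thing to watch for when sampling from a shrinking vector such as rowlist: if only a single row number is left, sample(x, k) interprets x as the range 1:x rather than as a one-element vector (this is documented behaviour of sample). Below is a minimal sketch of the issue and one possible workaround, where safe_sample is a hypothetical helper name and remaining is an illustrative value.

```r
# Hypothetical helper: always sample from the elements of x,
# even when x has length 1.
safe_sample <- function(x, size) x[sample.int(length(x), size)]

remaining <- 57  # pretend only row 57 is still unassigned

set.seed(1)      # arbitrary seed, only for reproducibility
sample(remaining, 1)       # draws from 1:57, so it may not be 57
safe_sample(remaining, 1)  # always 57
```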
Solution 6:[6]
Here is another way of randomly splitting the data without sampling by modifying your code:
- Generate N random numbers, where N equals the number of rows of data_exp, and store their order in an object. This object contains the same numbers as the row numbers of data_exp, but in random order.
set.seed(1) # This is set to enable reproduction of the result.
rand = runif(nrow(data_exp))
orders = order(rand)
orders
# [1] 27 47 10 55 69 56 38 24 12 34 5 11 22 54 66 1 25 62 57 64 71
#[22] 30 2 19 28 14 26 60 40 53 63 48 51 67 31 33 16 58 45 44 3 32
#[43] 9 42 65 23 8 59 36 13 50 17 39 49 68 15 20 43 46 37 41 35 72
#[64] 52 29 70 6 4 61 21 7 18
- Use your rand_vect function to generate three numbers whose sum is 72.
set.seed(1)
r = rand_vect(3,72)
r
#[1] 24 25 23
ind = cumsum(r)
ind
#[1] 24 49 72
- Subset data_exp by using orders and ind this way:
data_1 = data_exp[orders[1:ind[1]], ]
data_2 = data_exp[orders[(ind[1]+1):ind[2]], ]
data_3 = data_exp[orders[(ind[2]+1):ind[3]], ]
Check that the numbers of rows of data_1, data_2, and data_3 match the elements of r, and that there are no common rows among them:
nrow(data_1)
#[1] 24
nrow(data_2)
#[1] 25
nrow(data_3)
#[1] 23
merge(data_1, data_2)
#[1] Var1 Var2 Var3 Var4 id
#<0 rows> (or 0-length row.names)
merge(data_1, data_3)
#[1] Var1 Var2 Var3 Var4 id
#<0 rows> (or 0-length row.names)
merge(data_2, data_3)
#[1] Var1 Var2 Var3 Var4 id
#<0 rows> (or 0-length row.names)
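If using sample() for the shuffle is acceptable, the three steps can also be written more compactly. The sketch below is one possible condensation, not the answer's own code: sample(nrow(data_exp)) yields a random permutation of the row numbers directly, and the sizes c(24, 25, 23) from the output above stand in for rand_vect(3, 72). The names perm, group, parts and ids are introduced here.

```r
# Rebuild the data from the question so the sketch runs on its own.
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1)                     # arbitrary seed, only for reproducibility
r <- c(24, 25, 23)              # stand-in for rand_vect(3, 72)

perm  <- sample(nrow(data_exp)) # random order of the row numbers
group <- rep(seq_along(r), r)   # 24 ones, 25 twos, 23 threes

# The first 24 shuffled rows go to group 1, the next 25 to group 2, and so on.
parts <- split(data_exp[perm, ], group)
names(parts) <- paste0("data_", seq_along(r))

ids <- unlist(lapply(parts, `[[`, "id"), use.names = FALSE)
stopifnot(all(sapply(parts, nrow) == r),
          anyDuplicated(ids) == 0,
          setequal(ids, data_exp$id))
```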
Here is a function that implements the steps mentioned above:
rand_split = function(yourdf, nsplit, seed = 1){
  #1. Generate a random order of the rows
  set.seed(seed)
  rand = runif(nrow(yourdf))
  orders = order(rand)
  #2. Set how many parts yourdf will be split into. Each part has a random number of rows, stored in r.
  r = rand_vect(nsplit, nrow(yourdf))
  #3. Generate the indices used to select from the random row order
  ind = cumsum(r)
  n = length(ind)
  k = c(1, ind[1:(n-1)] +1)
  m = Map(seq, k, ind)
  #4. Split yourdf randomly, store the resulting parts in a list, and name the elements data_1, data_2, and so on.
  #   Index yourdf (not the global data_exp) so the function works for any data frame.
  outdf = lapply(m, function(x) yourdf[orders[x],])
  names(outdf) = paste0("data_", seq_len(n))
  output = c(nrow = r, outdf)
  return(output)
}
This function can accept any nsplit and seed number. Example:
rand_split(data_exp, 5, 2)
$nrow1
[1] 13
$nrow2
[1] 14
$nrow3
[1] 15
$nrow4
[1] 15
$nrow5
[1] 15
$data_1
Var1 Var2 Var3 Var4 id
51 A3 B1 C1 D3 51
31 A1 B1 C2 D2 31
52 A1 B2 C1 D3 52
70 A1 B2 C4 D3 70
20 A2 B1 C4 D1 20
43 A1 B1 C4 D2 43
7 A1 B1 C2 D1 7
30 A3 B2 C1 D2 30
27 A3 B1 C1 D2 27
40 A1 B2 C3 D2 40
24 A3 B2 C4 D1 24
44 A2 B1 C4 D2 44
32 A2 B1 C2 D2 32
# and more...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | onyambu |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | |
| Solution 5 | |
| Solution 6 | |
