Find sample size based on number of rows in data
I have a dataset that looks like this:
| Region | Name |
|---|---|
| Region 1 | Name 14 |
| Region 2 | Name 18 |
| Region 2 | Name 2 |
| Region 2 | Name 21 |
| Region 2 | Name 44 |
| Region 3 | Name 64 |
| Region 3 | Name 24 |
| Region 4 | Name 1 |
| Region 4 | Name 1 |
| Region 4 | Name 98 |
| Region 5 | Name 98 |
| Region 5 | Name 8 |
| Region 5 | Name 8 |
| Region 5 | Name 8 |
| Region 5 | Name 98 |
I need to break up the data by Region and then select a random sample of only 5% of the "Name" values per Region, based on the number of rows in that Region.
So let's say there are 30 Names in Region 2; then I need a random sample of 30*.05 rows. If there are 50 Names in Region 6, then I need a random sample of 50*.05 rows.
So far, I've been able to split() the data using
d = split(data, f = data$Region)
but when I try to run an lapply function, I get an error saying that the elements of the list that split() returned have different numbers of rows:
lapply(data, function(x) {
sample_n(data, nrow(d)*.05)
} )
Any thoughts?
Thank you
Solution 1:[1]
Here's a base R solution.
lapply(split(data, data$Region),
\(x) x[sample(nrow(x), nrow(x) * 0.05),])
You can then convert the resulting list back into a single data frame with do.call(rbind, ...).
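For illustration, here is a minimal end-to-end sketch of that approach in base R. The example data, the helper name `sample_region`, and the rounding rule are assumptions that are not part of the original question. In particular, 5% of a small region can come out below one row, so this sketch rounds to the nearest whole number and keeps at least one row per region. Adjust that rule if you would rather drop small regions.

```r
# Hypothetical example data: five regions with 20, 40, 60, 80 and 100 rows.
set.seed(42)  # only so the illustration is reproducible
data <- data.frame(
  Region = rep(paste("Region", 1:5), times = c(20, 40, 60, 80, 100)),
  Name   = paste("Name", sample(1:99, 300, replace = TRUE))
)

# Sample a proportion of the rows of one region's data frame.
# Assumption: round to the nearest whole row, but never fewer than one.
sample_region <- function(x, prop = 0.05) {
  n <- max(1, round(nrow(x) * prop))
  x[sample(nrow(x), n), , drop = FALSE]
}

# split() gives one data frame per region; lapply() samples each one.
pieces <- lapply(split(data, data$Region), sample_region)

# Stack the per-region samples back into a single data frame.
result <- do.call(rbind, pieces)
rownames(result) <- NULL

# Number of sampled rows per region.
table(result$Region)
```

The key difference from the attempt in the question is that the function works on `x`, the single region passed in by `lapply()`, rather than on the full `data` or the whole list `d`.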
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Aron |
