ks.test without reference sample
I'm trying to check whether two samples follow the same unknown distribution using the ks.test function. I have two datasets:
- Dataset A gives the percentage of time each value was observed in a given environment.
- Dataset B is simply a list of values observed in another environment.
My understanding is that ks.test needs two samples of observed values, so I should (?) build a sample from dataset A in which each value appears in the proportion given by dataset A.
Here is a code snippet to illustrate the idea. Please note that the actual values in set_A and set_B are irrelevant; I'm just trying to have a structure that highlights the problem.
library(data.table)
# one sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))
# a sample set of all observed values in env B
set_B <- data.table(obs = runif(30, 1, 200))
# I want to check whether set_B follows the same distribution as set_A
# I generate a dummy sample in which each value is repeated according to the percentage defined in set_A
#set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C <- data.table(obs = rep(set_A$obs, times = set_A$time_percent))
ks <- ks.test(set_B$obs, set_C$obs)
if (ks$p.value < 0.05) {
  print("the 2 samples don't follow the same distribution, whatever it is")
} else {
  # failing to reject the null hypothesis does not prove the distributions are equal
  print("no evidence that the 2 samples follow different distributions")
}
And now my question: does that make sense?
Solution 1:[1]
For the Kolmogorov–Smirnov test, if we know the proportions in dataset A and the observations in dataset B, and we build a dummy sample from dataset A, the Kolmogorov–Smirnov statistic D is fixed, because it depends only on the two empirical distribution functions. The p-value, however, is not fixed unless we know the sample size: it depends on both the statistic and the number of samples, and the final decision also depends on the chosen significance level.
To verify that the p-value changes with the sample size while D does not, we can run the following and compare the D and p-value of the two tests:
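As a minimal sketch of why D is fixed, the statistic can be computed directly from the percentages, without building a dummy sample at all. The values in obs_A, pct_A and obs_B are made up for illustration, and cdf_A, cdf_B, grid and D_manual are hypothetical names that do not appear in the question.

```r
# Made-up inputs, fixed so the sketch is reproducible
set.seed(1)
obs_A <- c(10, 25, 40, 60, 80)   # values seen in env A (sorted)
pct_A <- c(10, 20, 40, 20, 10)   # percentage of time for each value
obs_B <- runif(30, 1, 100)       # raw observations from env B

# Step CDF implied by the percentages of A (right-continuous by default)
cdf_A <- stepfun(obs_A, c(0, cumsum(pct_A) / sum(pct_A)))
# Empirical CDF of the raw sample B
cdf_B <- ecdf(obs_B)

# Both CDFs are step functions, so the largest gap occurs at a jump point
grid <- sort(unique(c(obs_A, obs_B)))
D_manual <- max(abs(cdf_A(grid) - cdf_B(grid)))

# Same statistic obtained through a dummy sample of 100 points
dummy <- rep(obs_A, times = pct_A)
D_ks <- unname(suppressWarnings(ks.test(obs_B, dummy, exact = FALSE))$statistic)

print(c(manual = D_manual, ks.test = D_ks))
```

The two numbers should agree up to floating-point error, which suggests that the dummy sample only serves to encode the proportions; its length plays no role in D.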
library(data.table)
# one sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))
# a sample set of all observed values in env B
set_B <- data.table(obs = runif(30, 1, 200))
# I want to check whether set_B follows the same distribution as set_A
# I generate a dummy sample in which each value is repeated according to the percentage defined in set_A
#set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C <- data.table(obs = rep(set_A$obs, times = set_A$time_percent))
# ks_1: the test as written in the question
(ks_1 <- ks.test(set_B$obs, set_C$obs))
# ks_2: set_B duplicated, so its ECDF (and therefore D) is unchanged, but its sample size doubles and the p-value changes
(ks_2 <- ks.test(rep(set_B$obs, 2), set_C$obs))
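The same effect appears when the size of the dummy sample built from the percentages is changed, which is the arbitrary choice in the question's approach. The sketch below uses made-up values (obs_A, pct_A, obs_B) and hypothetical names (dummy_100, dummy_1000), and it passes exact = FALSE so that both p-values come from the same asymptotic approximation; that setting is an assumption made for a like-for-like comparison, not something the question uses.

```r
# Made-up inputs, fixed so the sketch is reproducible
set.seed(42)
obs_A <- c(10, 25, 40, 60, 80)   # values seen in env A
pct_A <- c(10, 20, 40, 20, 10)   # percentage of time for each value
obs_B <- runif(30, 1, 100)       # raw observations from env B

# The same proportions expressed as 100 points and as 1000 points
dummy_100  <- rep(obs_A, times = pct_A)
dummy_1000 <- rep(obs_A, times = pct_A * 10)

# Ties are unavoidable in the dummy samples, so silence the related warnings
ks_100  <- suppressWarnings(ks.test(obs_B, dummy_100,  exact = FALSE))
ks_1000 <- suppressWarnings(ks.test(obs_B, dummy_1000, exact = FALSE))

# Both dummy samples have the same ECDF, so D is the same...
print(c(D_100 = unname(ks_100$statistic), D_1000 = unname(ks_1000$statistic)))
# ...but the p-value differs, because it also depends on the sample sizes
print(c(p_100 = ks_100$p.value, p_1000 = ks_1000$p.value))
```

Because the p-value moves with the length of the dummy sample, a threshold such as 0.05 is only meaningful if the number of observations behind dataset A is actually known.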
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow