ks.test without reference sample

I'm trying to check that 2 samples follow the same unknown distribution using the ks.test function. I have two datasets:

Dataset A gives the percentage of time each value has been observed in a given environment.
Dataset B is a plain list of the values observed in another environment.

My understanding is that I need to pass two sample sets of observed values, so I should (?) build a sample set from dataset A in which each value appears in the proportion given by dataset A.

Here is a code snippet to illustrate the idea. Please note the actual values in set_A and set_B are irrelevant, I'm just trying to have a structure that highlights the problem.

library(data.table)


# one sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))

# a sample set of all observed values in env B
set_B = data.table(obs = runif(30, 1, 200))

# I want to check the set_B follows the same distribution as the set_A
# I generate a dummy sample where the number of times a value is present is the same percentage as the one defined in set_A
#set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C <- data.table(obs = rep(set_A$obs, times = set_A$time_percent))


ks <- ks.test(set_B$obs, set_C$obs)

if (ks$p.value < 0.05) {
  print("the 2 samples don't follow the same distribution whatever it is")
} else {
  print("the 2 samples do follow the same distribution whatever it is")
}

And now my question: does that make sense?

Solution 1:^[1]

For the Kolmogorov–Smirnov test, if we know the proportions in dataset A and build a dummy sample from them, we get a fixed Kolmogorov–Smirnov statistic D, because D depends only on the two empirical distribution functions. However, if we don't know the sample size behind dataset A, we can't get a fixed p-value: the p-value depends on the Kolmogorov–Smirnov statistic and on the number of samples, and the decision then depends on the chosen significance level.
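
One way to see the sample-size dependence is the usual large-sample approximation for the two-sample test, which rejects at level alpha when D exceeds sqrt(-log(alpha / 2) / 2) * sqrt((n + m) / (n * m)). The sketch below is only an illustration: `crit_D` is a hypothetical helper name, not part of base R, and the sample sizes are made up.

```r
# Approximate critical value of the two-sample KS statistic.
# Large-sample approximation only; crit_D is a hypothetical helper for this sketch.
crit_D <- function(n, m, alpha = 0.05) {
  sqrt(-log(alpha / 2) / 2) * sqrt((n + m) / (n * m))
}

# 30 real observations compared with a dummy sample of size 10, 100 or 1000
# (the dummy sizes are arbitrary choices, which is exactly the problem)
thresholds <- sapply(c(10, 100, 1000), function(m) crit_D(30, m))
round(thresholds, 3)
```

The threshold shrinks as the assumed size of the dummy sample grows, so the same D can be non-significant for a small dummy sample and significant for a large one.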

To verify this, run the code below and compare the D statistic and the p-value of the two tests:

library(data.table)


# one sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))

# a sample set of all observed values in env B
set_B = data.table(obs = runif(30, 1, 200))

# I want to check the set_B follows the same distribution as the set_A
# I generate a dummy sample where the number of times a value is present is the same percentage as the one defined in set_A
#set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C <- data.table(obs = rep(set_A$obs, times = set_A$time_percent))

(ks_1 <- ks.test(set_B$obs, set_C$obs))
(ks_2 <- ks.test(rep(set_B$obs, 2), set_C$obs))
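
The same effect can be reproduced with fixed numbers, which makes the comparison easier to read than with random draws. This is a minimal sketch under assumed data: `obs_A`, `pct_A` and `obs_B` are made-up values, not taken from the question, and `exact = FALSE` asks `ks.test` for the asymptotic p-value. R may warn about ties, since the expanded sample repeats values.

```r
# Made-up data for illustration only
obs_A <- c(10, 20, 30, 40)       # values seen in env A
pct_A <- c(10, 20, 30, 40)       # percentage of time for each value (sums to 100)
obs_B <- seq(1, 59, by = 2)      # 30 raw observations from env B

# Expand A at two arbitrary scales: 10 and 100 pseudo-observations
dummy_10  <- rep(obs_A, times = pct_A / 10)
dummy_100 <- rep(obs_A, times = pct_A)

ks_10  <- ks.test(obs_B, dummy_10,  exact = FALSE)
ks_100 <- ks.test(obs_B, dummy_100, exact = FALSE)

# Same proportions give the same empirical CDF, hence the same D
all.equal(unname(ks_10$statistic), unname(ks_100$statistic))

# The p-value, however, depends on how large we pretend sample A was
c(p_10 = ks_10$p.value, p_100 = ks_100$p.value)
```

With these numbers D is the same in both calls, while the p-value lies above 0.05 for the 10-element expansion and below 0.05 for the 100-element one, so the conclusion hinges on an arbitrary choice.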

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
