'"Predicting" the Result of a Query Before Running the Query

I have the following dataset (each variable can take values from 1-10):

factor <- c(1,2,3,4,5,6,7,8,9,10)

var_1 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_2 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_3 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_4 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_5 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

my_data = data.frame(var_1, var_2, var_3, var_4, var_5)

head(my_data)
  var_1 var_2 var_3 var_4 var_5
1     2     9     8     3     5
2    10     1     4     7     5
3     4     7     1     7     6
4     9     8     3     4     7
5     6     5     5     9     7
6     2     8    10     2     7

I also have a data frame that contains a series of "conditions" used for selecting rows from this data frame:

conditions = data.frame(var_1 = c("1,2", "2"), var_2 = c("1,2,3,4", "1,2,3,8,9"), var_3 = c("4,6", "5,6,7"))

 conditions
  var_1     var_2 var_3
1   1,2   1,2,3,4   4,6
2     2 1,2,3,8,9 5,6,7

Suppose I want to run the following queries on this dataset:

query_1 <- my_data[Reduce(`&`,
  Map(`%in%`,
      lapply(my_data[,1:3], as.character),
      lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]

query_2 <- my_data[Reduce(`&`,
  Map(`%in%`,
      lapply(my_data[,1:3], as.character),
      lapply(conditions, function(z) strsplit(z, ",")[[2]]))),]

# traditional form for reference 
# query_1 = my_data[my_data$var_1 %in% c("1", "2") & my_data$var_2 %in% c("1", "2", "3", "4") & my_data$var_3 %in% c("4", "6") ,  ]

# query_2 = my_data[my_data$var_1 %in% c("2") & my_data$var_2 %in% c("1", "2", "3", "8", "9") & my_data$var_3 %in% c("5", "6", "7") ,  ]

It is immediately clear that the results of both of these queries are not identical:

> identical(query_1, query_2)
[1] FALSE

However, we can also see that even though the results of both queries are non-identical, there are still common rows between both of these queries:

combined <- rbind(query_1, query_2)

duplicate_rows <- unique(combined[duplicated(combined), ])

ifelse(nrows(duplicate_rows) == 0, "EMPTY", "NOT EMPTY")

[1] "NOT EMPTY"

My Question: Is it possible to somehow add some "random noise" to the "Conditions" data frame such that "query_1" and "query_2" will have no common rows?

I know that a simple solution would be to make sure that no common numbers are present whatsoever in the conditions for "query_1" and "query_2" - but using basic logic, it is possible that some common numbers can be present in the conditions for "query_1" and "query_2", and still result in "query_1" and "query_2" having no common rows.

For example:

  • Query_1: Height = Tall AND Basketball = Yes

  • Query_2: Height = Tall AND Basketball = No

  • Query_3: Height = Not Tall AND Basketball = No

  • Query_4: Height = Tall AND Basketball = Yes

In this case, it is obvious that Query_3 and Query_4 will share no common records, seeing as their conditions for height and basketball are completely opposite.

However, Query_1 and Query_2 share a similar condition for "height" but a different condition for "basketball" - this will also result in Query_1 and Query_2 sharing no common rows. For example, Robert Wadlow (https://en.wikipedia.org/wiki/Robert_Wadlow) was Tall AND did not play Basketball, but Wilt Chamberlain (https://en.wikipedia.org/wiki/Wilt_Chamberlain) was Tall AND did play basketball. Both shared "Tall", but since they had a unique value for "Basketball", they fall in distinct categories. Logically speaking, you could take the entire population of the earth and still no human would be present in the results of both "query_1" and "query_2" due to exclusive nature of both queries.

The same way, I would like to predict and anticipate the result of the earlier queries before hand and make sure that these queries have no common rows - and if they do have common rows, I would like to add some "random noise" to the conditions (data frame) used to create these queries such that no common rows are shared by these queries.

Thus, is there a way to take the "conditions" data frame and add some "random noise" to this data frame such that Query_1 and Query_2 will not share any common rows?



Solution 1:[1]

I think there are three ways you can go about this.

  1. Beforehand, make sure that conditions are disjoint.
  2. Write a rejection sampling funciton (as mentioned by bdecaf) or find a package that implements it
  3. Afterwards, take the intersection of both data sets and randomly assign its elements to either of the two sets.

Since you were already aware of 1. and 2. is more complex, I'm going to implement 3.

Your data

factor <- c(1,2,3,4,5,6,7,8,9,10)

var_1 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_2 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_3 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_4 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_5 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

my_data = data.frame(var_1, var_2, var_3, var_4, var_5)

conditions = data.frame(var_1 = c("1,2", "2"), var_2 = c("1,2,3,4", "1,2,3,8,9"), var_3 = c("4,6", "5,6,7"))

query_1 <- my_data[Reduce(`&`,
  Map(`%in%`,
      lapply(my_data[,1:3], as.character),
      lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]

query_2 <- my_data[Reduce(`&`,
  Map(`%in%`,
      lapply(my_data[,1:3], as.character),
      lapply(conditions, function(z) strsplit(z, ",")[[2]]))),]
# Your code up to here

Randomly assign observations

library(tidyverse)
l <- list(intersect(query_1, query_2),
          setdiff(query_1, query_2),
          setdiff(query_2, query_1)) %>%
  {
    map2(split(.[[1]], sample(1:2, nrow(.[[1]]), T)),
         .[-1],
         bind_rows)
  }

Inspect result

map(l, as_tibble) # use tibble for nicer printing
#> $`1`
#> # A tibble: 123 × 5
#>    var_1 var_2 var_3 var_4 var_5
#>    <fct> <fct> <fct> <fct> <fct>
#>  1 2     3     6     10    8    
#>  2 2     3     6     8     4    
#>  3 2     1     6     1     1    
#>  4 2     2     6     1     3    
#>  5 2     1     6     2     1    
#>  6 2     1     6     10    5    
#>  7 2     2     6     9     1    
#>  8 2     2     6     2     2    
#>  9 2     3     6     6     8    
#> 10 2     3     6     2     3    
#> # … with 113 more rows
#> 
#> $`2`
#> # A tibble: 122 × 5
#>    var_1 var_2 var_3 var_4 var_5
#>    <fct> <fct> <fct> <fct> <fct>
#>  1 2     2     6     1     4    
#>  2 2     3     6     5     10   
#>  3 2     2     6     3     6    
#>  4 2     2     6     2     7    
#>  5 2     1     6     9     1    
#>  6 2     3     6     6     4    
#>  7 2     1     6     7     8    
#>  8 2     2     6     6     2    
#>  9 2     1     6     10    3    
#> 10 2     1     6     7     5    
#> # … with 112 more rows

reduce(l, intersect) # Show frames are disjoint
#> [1] var_1 var_2 var_3 var_4 var_5
#> <0 rows> (or 0-length row.names)
map_int(l, nrow) # length of frames
#>   1   2 
#> 123 122

bind_rows(query_1, query_2) |> # length of joint set -> adds up
  distinct() |> 
  nrow()
#> [1] 245

Created on 2022-04-13 by the reprex package (v2.0.1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 shs