'Check for the presence of ID in n+1 timepoint and create a new presence/absence variable

I'm sorry if I fail to describe this properly.

Basically I have 3 columns -- Pair, Set1 and Set2.

What I want to know is does Pair appear in either Set1 or Set2 when Gen is equal to Time+1.

So I can then have a 5th variable which is essentially 1 or 0 (yes it's present in the next timepoint, no it is not).

For example, in this example dataset Pair F3 (in Time 1) is not present under either Set when Time = 2.

The only way I can think about doing this is to filter by Time and see if it appears in another time:

    time2 <- df %>% 
      filter(Time == 2)
    
    df %>% 
  filter(Time == 1) %>% 
  mutate(Present = ifelse(Pair %in% time2$Set1 | Pair %in% time2$Set2, 1, 0))

This works but I'd like to have a more efficient way of looping through each Time point, checking the subsequent time point and then moving forward ...

I don't want to do a for loop within a for loop -- but maybe this is necessary?


data

structure(list(Time = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L), Pair = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("F1", "F2", "F3", 
"F4", "F5", "N1", "N2", "N3", "N4", "N5"), class = "factor"), 
    Set1 = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
    1L, 2L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 1L, 3L, 4L, 5L, 
    6L, 7L, 8L, 9L, 10L, 11L), .Label = c("F1", "F2", "F3", "F5", 
    "F7", "F9", "N1", "N3", "N5", "N7", "N9"), class = "factor"), 
    Set2 = structure(c(2L, 3L, 4L, 5L, 1L, 7L, 8L, 9L, 10L, 6L, 
    2L, 3L, 4L, 5L, 1L, 7L, 8L, 9L, 10L, 6L, 2L, 3L, 4L, 5L, 
    1L, 7L, 8L, 9L, 10L, 6L), .Label = c("F10", "F2", "F4", "F6", 
    "F8", "N10", "N2", "N4", "N6", "N8"), class = "factor"), 
    Present = c(1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA)), class = "data.frame", row.names = c(NA, -30L
))


Solution 1:[1]

If odf is your original data structure, then you can do the following:

  1. Melt the odf long, adding a new variable which is Time-1, putting the Set1 and Set2 values into a single column, called SetValue
library(data.table)
setDT(odf)
df <- melt(odf[,Present:=NULL], id=c("Time", "Pair"),value.name = "SetValue" )[,Time2:=Time-1][]
  1. Join, this long df on itself, where Pair=Setvalue and Time=Time2, drop is.na() rows, select columns to retain, and ensure uniqueness, then merge back to your original, assigning back to original frame
odf = merge(
  odf,
  unique(df[df[,.(SetValue,Time, Time2)], on=.(Pair=SetValue,Time=Time2)][!is.na(Time2), .(Time, Pair,exists=1)]),
  by=c("Time","Pair"), all.x = T
)
  1. If you like, you can update the NA in your new exists column to zero, if this is not the last timepoint
odf[Time<max(Time)&is.na(exists), exists:=0][]

Output:

    Time Pair Set1 Set2 exists
 1:    1   F1   F1   F2      1
 2:    1   F2   F3   F4      1
 3:    1   F3   F5   F6      0
 4:    1   F4   F7   F8      1
 5:    1   F5   F9  F10      1
 6:    1   N1   N1   N2      1
 7:    1   N2   N3   N4      1
 8:    1   N3   N5   N6      1
 9:    1   N4   N7   N8      1
10:    1   N5   N9  N10      1
11:    2   F1   F1   F2      1
12:    2   F2   F2   F4      1
13:    2   F3   F5   F6      1
14:    2   F4   F7   F8      1
15:    2   F5   F9  F10      1
16:    2   N1   N1   N2      1
17:    2   N2   N3   N4      1
18:    2   N3   N5   N6      1
19:    2   N4   N7   N8      1
20:    2   N5   N9  N10      1
21:    3   F1   F1   F2     NA
22:    3   F2   F3   F4     NA
23:    3   F3   F5   F6     NA
24:    3   F4   F7   F8     NA
25:    3   F5   F9  F10     NA
26:    3   N1   N1   N2     NA
27:    3   N2   N3   N4     NA
28:    3   N3   N5   N6     NA
29:    3   N4   N7   N8     NA
30:    3   N5   N9  N10     NA

Updated - Tidyverse option

df <- odf %>%
  select(!Present) %>%
  pivot_longer(cols = starts_with("Set"),values_to = "SetValue") %>% 
  mutate(Time2= Time-1)

left_join(
  odf %>% select(!Present), 
  inner_join(df, df, by=c("Pair"="SetValue", "Time"="Time2"))  %>% 
    filter(!is.na(Time2)) %>% 
    select(Time, Pair) %>% 
    mutate(exists=1) %>% 
    distinct(),
  by=c("Time", "Pair")
) %>% 
  mutate(exists=if_else(is.na(exists) & Time<max(Time), 0, exists))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1