Compare values present in two data frames using a sliding window function in R

We have two data frames

Data frame 1

sl no. Segment_name Segment
1      Segment1     AACG
2      Segment2     ACTG
3      Segment3     GTCA

Data frame 2

sl no. Dinucleotides Free energy Values
1      AA            -1.0
2      AC            -1.76
3      CG            -1.5
4      CT            -1.23
5      TG            -1.67
6      GT            -1.82
7      TC            -1.43
8      CA            -1.98

We want to match the column 'Segment' of Data frame 1 against the dinucleotides of Data frame 2 and look up their 'Free energy Values'. Sliding a window of two nucleotides along a segment (i.e. AA, AC, CG for Segment1 = AACG) and summing the free energy values of those dinucleotides gives -4.26 for Segment1. We want to repeat the same for the rest of the segments and store the sums of free energy values in a separate column of Data frame 1:
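To make the sliding window concrete, here is a minimal base R sketch for a single segment. The helper name `sliding_windows` and the named vector `fev` are illustrative assumptions, not part of the original question; the values are copied from Data frame 2.

```r
# Hypothetical helper: all overlapping windows of width k from one string
sliding_windows <- function(x, k = 2) {
  starts <- seq_len(max(nchar(x) - k + 1, 0))
  substring(x, starts, starts + k - 1)
}

# Free energy lookup as a named vector (values taken from Data frame 2)
fev <- c(AA = -1.0, AC = -1.76, CG = -1.5, CT = -1.23,
         TG = -1.67, GT = -1.82, TC = -1.43, CA = -1.98)

windows <- sliding_windows("AACG")  # "AA" "AC" "CG"
total <- sum(fev[windows])          # -1.0 + -1.76 + -1.5 = -4.26
print(windows)
print(total)
```

Indexing the named vector by the window strings does the lookup, so no explicit loop over the dinucleotides is needed.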

sl no. Segment_name Segment  Free energy
1      Segment1     AACG     -4.26
2      Segment2     ACTG     -4.66
3      Segment3     GTCA     -5.23


Solution 1:[1]

I used my own sample data (see bottom), since column names with spaces in them are a pain to work with.

The [] at the end of each line prints the intermediate result of that step. You can omit it in your production code.

library(data.table)
# work on data.table copies named dt1/dt2 (the sample data df1/df2 is defined at the bottom)
dt1 <- as.data.table(df1); dt2 <- as.data.table(df2)
# cut Segment into two parts
dt1[, c("from", "to") := tstrsplit(Segment, "(?<=..)(?=..)", perl = TRUE)][]
# find index
dt1[dt2, from.sl := i.sl, on = .(from = Dinucleotides)][]
dt1[dt2, to.sl := i.sl, on = .(to = Dinucleotides)][]
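# now, sum FEV over all rows of dt2 from from.sl to to.sl
# (this relies on the row order of dt2, so rows that are not windows of the segment can be included)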
setkey(dt1, sl)
dt1[dt1, Free_energy := sum(dt2[i.from.sl:i.to.sl, FEV]), by = .EACHI][]
# drop temp columns
dt1[, `:=`(from = NULL, to = NULL, from.sl = NULL, to.sl = NULL)][]
#    sl Segment_name Segment Free_energy
# 1:  1     Segment1    AACG       -4.26
# 2:  2     Segment2    ACTG       -6.16
# 3:  3     Segment3    GTCA       -5.23


#sample data
df1 <- fread("sl Segment_name Segment
1      Segment1     AACG
2      Segment2     ACTG
3      Segment3     GTCA")
df2 <- fread("sl Dinucleotides FEV
1      AA            -1.0
2      AC            -1.76
3      CG            -1.5
4      CT            -1.23
5      TG            -1.67
6      GT            -1.82
7      TC            -1.43
8      CA            -1.98")
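
Note that the range-based sum in Solution 1 returns -6.16 for Segment2, while the question expects -4.66: rows 2 to 5 of the lookup table include CG, which is not a window of ACTG. As an alternative, the following is a minimal base R sketch that sums over the actual overlapping dinucleotides of each segment. It is not the answer author's method; the helper name `segment_energy` is hypothetical, and the simplified column names follow the sample data above.

```r
# Sample data mirroring the question (simplified column names, as in the answer)
df1 <- data.frame(
  sl = 1:3,
  Segment_name = c("Segment1", "Segment2", "Segment3"),
  Segment = c("AACG", "ACTG", "GTCA"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  sl = 1:8,
  Dinucleotides = c("AA", "AC", "CG", "CT", "TG", "GT", "TC", "CA"),
  FEV = c(-1.0, -1.76, -1.5, -1.23, -1.67, -1.82, -1.43, -1.98),
  stringsAsFactors = FALSE
)

# Named lookup vector: dinucleotide -> free energy value
fev <- setNames(df2$FEV, df2$Dinucleotides)

# Hypothetical helper: sum free energies over overlapping windows of width k
segment_energy <- function(seg, lookup, k = 2) {
  starts <- seq_len(max(nchar(seg) - k + 1, 0))
  windows <- substring(seg, starts, starts + k - 1)
  sum(lookup[windows])  # NA if any window is missing from the lookup
}

# Apply to every segment and store the result in a new column
df1$Free_energy <- vapply(df1$Segment, segment_energy, numeric(1),
                          lookup = fev, USE.NAMES = FALSE)
print(df1)
```

Because the lookup is by name rather than by row position, the result does not depend on how the rows of the free energy table are ordered.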

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1