'Count number of common columns between every row of a dataframe to create an all-vs-all matrix

Here is a sample of the data set I have:

rows represent different accessions and columns represent different genes.

    gene1   gene2   gene3
1 PRESENT    LOST PRESENT
2 PRESENT PRESENT    LOST
3    LOST PRESENT PRESENT

I would like to find the number of the common genes between accessions and create a dataframe accession vs accession as such:

> test_self
  1 2 3
1 2 1 1
2 1 2 1
3 1 1 2
 

I used R to create this dataframe with the following code

gene1  = c("PRESENT", "PRESENT", "LOST")
gene2 = c("LOST", "PRESENT", "PRESENT")
gene3 = c("PRESENT","LOST","PRESENT")
test_PAV <- data.frame (gene1,gene2,gene3)

test_self <- data.frame(matrix(ncol = nrow(test_PAV), nrow = nrow(test_PAV)))
colnames(test_self)<-row.names(test_PAV)
row.names(test_self)<-row.names(test_PAV)

for (rowInQuestion in 1:nrow(test_PAV)){
  for (rowBeingComparedTo in 1:nrow(test_PAV)){
    commonGeneCount <- 0
    for (col in 1:ncol(test_PAV)){
      if(test_PAV[rowInQuestion,col]=='PRESENT'&&test_PAV[rowBeingComparedTo,col]=='PRESENT'){
        commonGeneCount <- commonGeneCount + 1
      }
      test_self[rowInQuestion,rowBeingComparedTo]<-commonGeneCount
    } 
  }
}

But I'm well aware that this is a very inefficient solution. First of all, since the upper and the lower triangles are symmetric they don't have to be calculated twice.

I found a similar solution here but there the commonality is across a row but in my case in order to call it a commonality, the same gene (column) has to be 'PRESENT' in both accessions.

Thank you.



Solution 1:[1]

First, let's keep only the PRESENT genes and build a frozenset of present genes per row:

s = (
 df.where(df.eq('PRESENT'))
   .stack()
   .reset_index(level=1)
   .groupby(level=0)['level_1'].agg(frozenset)
)

# 1    (gene3, gene1)
# 2    (gene2, gene1)
# 3    (gene3, gene2)
# Name: level_1, dtype: object

Now we can get the itertools.product (or, to be faster the combinations) to build the expected dataframe from the length of the intersection of genes set:

from itertools import product

out = (
 pd.Series({(i,j): len(a.intersection(b))
           for (i,a),(j,b) in  product(zip(s.index, s), repeat=2)})
   .unstack()
)

#    1  2  3
# 1  2  1  1
# 2  1  2  1
# 3  1  1  2

combinations are faster as only 1 of A/B or B/A is computed, and also not A/A:

from itertools import combinations

out = (
 pd.Series({(i,j): len(a.intersection(b))
           for (i,a),(j,b) in combinations(zip(s.index, s), r=2)})
   .unstack()
)

#      2    3
# 1  1.0  1.0
# 2  NaN  1.0

Solution 2:[2]

So I came up with another solution as well if anyone wants to use it;

Instead of using a dataframe with 'PRESENT' and 'LOST' I have used sed to transfer them into 1's and 0's.

sed -i '' 's/PRESENT/1/g' pav-matrix.binary.txt
sed -i '' 's/LOST/0/g' pav-matrix.binary.txt

Then used this binary matrix to compare every product of each row's and get the sum of the product array.

for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)){
  for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)){
    pavSelfSimilarity[rowInQuestion,rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion,]*pavMatrix_transposeBinary[rowBeingComparedTo,])
  }
}

Solution 3:[3]

This for loop with awk should work.

for i in {2..4}; do 
   for j in {2..4};do 
      tmp=$(awk '{print $'$i',$'$j'}' input |sed 1d|awk '$1==$2 {print}'|grep -c "PRESENT");echo -en "$tmp\t"
   done 
   echo -en "\n" 
done >> test_self

Output:

2       1       1
1       2       1
1       1       2

You might have to add the headers by yourself, the order should be the same.

Thanks!!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mozway
Solution 2 ibozanova
Solution 3 reddy7