'Count number of common columns between every row of a dataframe to create an all-vs-all matrix
Here is a sample of the data set I have:
rows represent different accessions and columns represent different genes.
gene1 gene2 gene3
1 PRESENT LOST PRESENT
2 PRESENT PRESENT LOST
3 LOST PRESENT PRESENT
I would like to find the number of the common genes between accessions and create a dataframe accession vs accession as such:
> test_self
1 2 3
1 2 1 1
2 1 2 1
3 1 1 2
I used R to create this dataframe with the following code
gene1 = c("PRESENT", "PRESENT", "LOST")
gene2 = c("LOST", "PRESENT", "PRESENT")
gene3 = c("PRESENT","LOST","PRESENT")
test_PAV <- data.frame (gene1,gene2,gene3)
test_self <- data.frame(matrix(ncol = nrow(test_PAV), nrow = nrow(test_PAV)))
colnames(test_self)<-row.names(test_PAV)
row.names(test_self)<-row.names(test_PAV)
for (rowInQuestion in 1:nrow(test_PAV)){
for (rowBeingComparedTo in 1:nrow(test_PAV)){
commonGeneCount <- 0
for (col in 1:ncol(test_PAV)){
if(test_PAV[rowInQuestion,col]=='PRESENT'&&test_PAV[rowBeingComparedTo,col]=='PRESENT'){
commonGeneCount <- commonGeneCount + 1
}
test_self[rowInQuestion,rowBeingComparedTo]<-commonGeneCount
}
}
}
But I'm well aware that this is a very inefficient solution. First of all, since the upper and the lower triangles are symmetric they don't have to be calculated twice.
I found a similar solution here but there the commonality is across a row but in my case in order to call it a commonality, the same gene (column) has to be 'PRESENT' in both accessions.
Thank you.
Solution 1:[1]
First, let's keep only the PRESENT genes and build a frozenset of present genes per row:
s = (
df.where(df.eq('PRESENT'))
.stack()
.reset_index(level=1)
.groupby(level=0)['level_1'].agg(frozenset)
)
# 1 (gene3, gene1)
# 2 (gene2, gene1)
# 3 (gene3, gene2)
# Name: level_1, dtype: object
Now we can get the itertools.product (or, to be faster the combinations) to build the expected dataframe from the length of the intersection of genes set:
from itertools import product
out = (
pd.Series({(i,j): len(a.intersection(b))
for (i,a),(j,b) in product(zip(s.index, s), repeat=2)})
.unstack()
)
# 1 2 3
# 1 2 1 1
# 2 1 2 1
# 3 1 1 2
combinations are faster as only 1 of A/B or B/A is computed, and also not A/A:
from itertools import combinations
out = (
pd.Series({(i,j): len(a.intersection(b))
for (i,a),(j,b) in combinations(zip(s.index, s), r=2)})
.unstack()
)
# 2 3
# 1 1.0 1.0
# 2 NaN 1.0
Solution 2:[2]
So I came up with another solution as well if anyone wants to use it;
Instead of using a dataframe with 'PRESENT' and 'LOST' I have used sed to transfer them into 1's and 0's.
sed -i '' 's/PRESENT/1/g' pav-matrix.binary.txt
sed -i '' 's/LOST/0/g' pav-matrix.binary.txt
Then used this binary matrix to compare every product of each row's and get the sum of the product array.
for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)){
for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)){
pavSelfSimilarity[rowInQuestion,rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion,]*pavMatrix_transposeBinary[rowBeingComparedTo,])
}
}
Solution 3:[3]
This for loop with awk should work.
for i in {2..4}; do
for j in {2..4};do
tmp=$(awk '{print $'$i',$'$j'}' input |sed 1d|awk '$1==$2 {print}'|grep -c "PRESENT");echo -en "$tmp\t"
done
echo -en "\n"
done >> test_self
Output:
2 1 1
1 2 1
1 1 2
You might have to add the headers by yourself, the order should be the same.
Thanks!!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |
| Solution 2 | ibozanova |
| Solution 3 | reddy7 |
