Duplicated function returning non-duplicated results on a BLAST hit table
New to Python (3 weeks!) and unsurprisingly having difficulties. I'm working on a BLAST hit table and am trying to identify sequences coming from the same hit by using duplicated on the accession number only. I do not want to discard these results but rather save them to a new file so I can take a look and see if anything interesting is popping up.
A snippet of the table (this is purely an example; the real table has 11 columns, but it seems excessive to print them all here):
| Query | Accession | % identity | mismatches | gaps | s start | s end |
|---|---|---|---|---|---|---|
| Q112 | ABCDEFG111222 | 90.99 | 9 | 3 | 1000 | 2000 |
| Q112 | HIJKLMN222111 | 80 | 14 | 98 | 128 | 900 |
| Q112 | OPQRSTUV33111 | 76 | 2 | 23 | 12 | 900 |
I'm importing the file to make it a data frame using pandas, then using reset_index to replace the query number with an index.
I have then done the following:
- To find out if I have any duplicates in Accession column
print(file.Accession.duplicated().sum())
- To print those results into a new data frame (n is the same)
filedupes = (file.loc[file.Accession.duplicated()])
- Finally, to then write it to a csv for me to look through
filedupes.to_csv('Filedupes.csv', sep='\t', encoding='utf-8')
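For anyone wanting to reproduce this without my file, here is a minimal sketch of the three steps above on a small in-memory table. The rows, the names `hits` and `hit_dupes`, and the in-memory buffer are made up for illustration; my real table is read from a file and has 11 columns.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the real hit table (accessions are invented).
hits = pd.DataFrame(
    {
        "Query": ["Q112", "Q112", "Q113", "Q114"],
        "Accession": ["ABCDEFG111222", "HIJKLMN222111", "ABCDEFG111222", "OPQRSTUV33111"],
        "% identity": [90.99, 80.0, 76.0, 88.5],
    }
)

# Step 1: count rows whose Accession already appeared earlier in the table.
n_dupes = hits["Accession"].duplicated().sum()

# Step 2: keep those flagged rows in a new data frame.
hit_dupes = hits.loc[hits["Accession"].duplicated()]

# Step 3: write tab-separated text. A buffer is used here so the sketch
# leaves no file behind; pass a path such as 'Filedupes.csv' for a real file.
buffer = io.StringIO()
hit_dupes.to_csv(buffer, sep="\t")
tsv_text = buffer.getvalue()

print(n_dupes)
print(tsv_text)
```

Note that the same variable name has to be used in steps 2 and 3, since Python names are case-sensitive.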
This half works: my CSV does contain entries duplicated on Accession number only, but it also contains some entries whose Accession number appears to be unique. These entries only seem to share their first 2 letters with other entries; the rest of the accession is unique,
i.e. I have XM_JK1234343 and XM_983HSJAN and XM_83QMZBDH1 included despite there being no other matching entry present (I have checked using find/replace). The other columns are also unique to these strange entries.
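One possible explanation, which is an assumption and not something confirmed for my file: `duplicated()` defaults to `keep='first'`, so it flags only the second and later occurrences of each value. An accession that occurs exactly twice in the full table therefore shows up once in the output, and looks unique if the search is done in the output file. Passing `keep=False` flags every copy. The sketch below uses invented accessions to show the difference.

```python
import pandas as pd

# Invented accessions: XM_A occurs three times, XM_B twice, XM_C once.
accessions = pd.Series(
    ["XM_A", "XM_B", "XM_A", "XM_C", "XM_B", "XM_A"], name="Accession"
)

# Default keep='first': the first occurrence of each value is NOT flagged.
later_only = accessions[accessions.duplicated()]

# keep=False: every row whose value occurs more than once is flagged.
all_copies = accessions[accessions.duplicated(keep=False)]

# Values that appear only once in the default output, and so look unique there.
counts_in_output = later_only.value_counts()
looks_unique = sorted(counts_in_output[counts_in_output == 1].index)

print(later_only.tolist())
print(all_copies.tolist())
print(looks_unique)
```

If the entries still look unique after switching to `keep=False`, the default behaviour is not the cause and something else is going on.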
I am stumped. Is it my code? Have I not specified enough, and so allowed the above examples to be chucked in with the other legitimate duplicates? I have tried to find whether someone else has asked a similar question, but no luck. Thank you kindly in advance for any insight, and apologies in advance if this is a silly mistake!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
