How to find the locations of the substring given by the recognition sequence in the genome
I want the locations of the substring given by the recognition sequence in the genome. I wrote a function that passes through the sequence and stores the indices where the recognition sequence matches, but it returns an empty container. The sequence looks like this:
ATCGGCGCGCGCGCGTATATATATATATATAGGCGCGCGCGCGCGTATATATATATATAGCGGCGCGCGCGCG
def restriction_sites(seq, recog_seq):
    """Find the indices of all restriction sites in a sequence."""
    # here variable "seq" contains fasta sequence and "recog_seq" contains recognition sequence
    # Initialize list of restriction sites
    sites = []
    # Check every substring for a match
    for i in range(len(seq) - len(recog_seq)):
        if seq[i:i+len(recog_seq)] == recog_seq:
            sites.append(i)
    return sites
When I run it like this:
print('HindIII:', restriction_sites(seq, 'AAGCTT'))
it shows me an empty list:
[]
I want output like this:
HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
Solution 1:[1]
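Your original function works, with one caveat. It returned an empty list because your example seq does not contain 'AAGCTT'. For instance, it returned [15, 17, 19, 21, 23, 25, 27, 45, 47, 49, 51, 53, 55] for me when searching for "TATA". The caveat is that range(len(seq) - len(recog_seq)) stops one position early, so a site at the very end of the sequence would be missed; range(len(seq) - len(recog_seq) + 1) covers that case.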
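To check the loop-based approach on a case that is easy to verify by hand, here is a minimal sketch using a short made-up sequence that does contain 'AAGCTT'. The test sequence is an assumption for illustration and is not taken from the question's genome. The sketch also uses `+ 1` in the range so that the final window of the sequence is checked.

```python
def restriction_sites(seq, recog_seq):
    """Find the start indices of all occurrences of recog_seq in seq."""
    n = len(recog_seq)
    sites = []
    # + 1 so the last possible window (ending at the final base) is also checked
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == recog_seq:
            sites.append(i)
    return sites


# Hypothetical test sequence, not from the question
demo = "GGAAGCTTCCAAGCTT"
print("HindIII:", restriction_sites(demo, "AAGCTT"))  # HindIII: [2, 10]
```

The second site ends exactly at the last base of `demo`, so it is only reported because of the `+ 1`.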
I would recommend using re to find the sites. That is more explicit:
import re
from typing import List

def restriction_sites(seq: str, recog_seq: str) -> List[int]:
    # The lookahead (?=...) matches without consuming characters,
    # so overlapping sites are reported as well.
    return [match.start() for match in re.finditer(fr"(?={recog_seq})", seq)]
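A short usage sketch for the regex approach follows. It assumes the recognition sequence is a plain string of bases. The call to `re.escape` is an addition here, not part of the answer above, so that any regex metacharacters in the input are treated literally. The example sequences are made up for illustration.

```python
import re
from typing import List


def restriction_sites(seq: str, recog_seq: str) -> List[int]:
    # Zero-width lookahead: every match is empty, so overlapping sites are all found
    pattern = re.compile(f"(?={re.escape(recog_seq)})")
    return [m.start() for m in pattern.finditer(seq)]


# Hypothetical sequences, not from the question
print(restriction_sites("TATATA", "TATA"))              # [0, 2]
print(restriction_sites("GGAAGCTTCCAAGCTT", "AAGCTT"))  # [2, 10]
```

The first call shows why the lookahead matters: a plain `re.finditer("TATA", ...)` would consume the first match and report only index 0.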
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
