Chunking the list for efficient logical comparison
I have the following piece of code that I want to optimize. It produces the correct result quickly, but only for lists of up to 10^5 elements. My real list contains 2*10^8 elements, and I have to process over 24 similar lists, which takes an enormous amount of time. Could anyone suggest a more efficient solution that improves performance without changing the desired output?
m = df2['first.start'].tolist()
n = df2['first.end'].tolist()
# these following lists will get changed
c = df3['first.seqnames'].tolist()
temp_c = df3['first.seqnames'].tolist()
c2 = df3['second.seqnames'].tolist()
temp_c2 = df3['second.seqnames'].tolist()
x = df3['first.start'].tolist()
y = df3['first.end'].tolist()
a = df3['second.start'].tolist()
b = df3['second.end'].tolist()
for idx1,i in enumerate(x): # working with the first start and end only rn
    for idx2,j in enumerate(m): # [m,n] -> df2[start,end] ##### [x,y] -> df1[start,end] ### [a,b] -> df1[start2,end2]
        if (m[idx2]<=x[idx1]):
            if (x[idx1]<=n[idx2]):
                #(start, end) = (n+1,y)
                temp = x[idx1]
                x[idx1] = n[idx2]+1
                a[idx1] = a[idx1] + (x[idx1]-temp)
            else:
                continue
        else:
            if(y[idx1]>=n[idx2]):
                #(start, end) = (x,m-1)
                #(start, end) = (n-1,y)
                temp1 = x[idx1]
                temp2 = y[idx1]
                temp3 = b[idx1]
                y[idx1] = m[idx2] - 1
                x.insert(idx1+1, n[idx2]-1)
                y.insert(idx1+1, temp2)
                b[idx1] = a[idx1] + (y[idx1]-x[idx1])
                a.insert(idx1+1, temp3-(y[idx1+1]-x[idx1+1]))
                b.insert(idx1+1, temp3)
                temp_c.insert(idx1+1, temp_c[idx1])
                temp_c2.insert(idx1+1, temp_c2[idx1])
            elif (y[idx1]>=m[idx2]):
                #(start, end) = (x,m-1)
                y[idx1] = m[idx2]-1
                b[idx1] = a[idx1] + (y[idx1]-x[idx1])
            else:
                continue
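The comments in the loop, `(n+1, y)` and `(x, m-1)`, suggest that the intended operation is to cut each df2 interval out of a df3 interval and shift the paired second interval by the same offset. Below is a minimal sketch of that operation for a single interval. It assumes closed integer coordinates and df2 intervals that are sorted and non-overlapping. The function name `subtract_blocks` and the toy values are hypothetical, and the sketch follows the `(n+1, y)` comment rather than the `n[idx2]-1` start used in the split branch, so it may not match the loop's output exactly.

```python
def subtract_blocks(start, end, second_start, blocks):
    """Return the parts of the closed interval [start, end] not covered by blocks.

    blocks: iterable of (block_start, block_end), assumed sorted and non-overlapping.
    Each result is (piece_start, piece_end, second_piece_start, second_piece_end),
    where the paired second interval is shifted by the same offset as the first.
    """
    pieces = []
    cur = start  # leftmost position of [start, end] not yet handled
    for block_start, block_end in blocks:
        if block_end < cur:
            continue  # block lies entirely to the left of what remains
        if block_start > end:
            break  # sorted input: no later block can overlap
        if block_start > cur:
            pieces.append((cur, block_start - 1))  # uncovered part before the block
        cur = max(cur, block_end + 1)
        if cur > end:
            break
    if cur <= end:
        pieces.append((cur, end))  # uncovered tail after the last block
    offset = second_start - start  # constant shift between first and second coordinates
    return [(s, e, s + offset, e + offset) for s, e in pieces]


# Toy example (hypothetical values, not taken from df2 or df3)
print(subtract_blocks(0, 200, 1000, [(10, 20), (50, 60), (100, 110)]))
```

Because each interval is handled in a single pass and returns a fresh list of pieces, nothing is inserted into `x`, `y`, `a`, or `b` while they are being iterated, which is one of the costly parts of the loop above (`list.insert` is linear in the list length).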
The df3 dataframe looks like this:
         first.seqnames  first.start  first.end  first.width  first.strand  second.seqnames  second.start  second.end  second.width  second.strand
      0            chr1        11462      11468            7             *             chr1         10882       10888             7              *
      1            chr1        11470      11471            2             *             chr1         10890       10891             2              *
      2            chr1        11473      11484           12             *             chr1         10893       10904            12              *
      3            chr1        11676      11677            2             *             chr1         11096       11097             2              *
      4            chr1        11782      11849           68             *             chr1         11202       11269            68              *
    ...             ...          ...        ...          ...           ...              ...           ...         ...           ...            ...
1929046            chr1    249235900  249235941           42             *            chr2B     131613429   131613470            42              *
1929047            chr1    249235943  249235949            7             *            chr2B     131613472   131613478             7              *
1929048            chr1    249236698  249236700            3             *            chr2B     131614226   131614228             3              *
1929049            chr1    249236702  249236708            7             *            chr2B     131614230   131614236             7              *
1929050            chr1    249237320  249237335           16             *            chr2B     131614842   131614857            16              *
The df2 dataframe looks like this:
      first.seqnames  first.start  first.end    3    4    5
3503            chr1       346213     346984    .    0    .
3504            chr1      3135466    3136202    .    0    .
3505            chr1      3190760    3191377    .    0    .
3506            chr1      3354604    3355258    .    0    .
3507            chr1      5388136    5388749    .    0    .
 ...             ...          ...        ...  ...  ...  ...
4530            chr1    245026995  245027904    .    0    .
4531            chr1    246492153  246492971    .    0    .
4532            chr1    246882492  246883154    .    0    .
4533            chr1    247887347  247888175    .    0    .
4534            chr1    249151889  249152623    .    0    .
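The df2 rows shown above appear to be sorted by `first.start` and non-overlapping. If that holds for the whole table (an assumption, not something the sample proves), the inner loop over every df2 row could be replaced by a binary search that returns only the df2 rows that can overlap a given df3 interval. A minimal sketch using the standard-library `bisect` module follows. The helper name `overlapping_range` is hypothetical, the block values are the first three df2 rows, and two of the query intervals are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def overlapping_range(block_starts, block_ends, start, end):
    """Return (lo, hi) so that blocks lo..hi-1 overlap the closed interval [start, end].

    Assumes block_starts and block_ends are both sorted ascending, which holds
    when the blocks are sorted by start and do not overlap each other.
    """
    lo = bisect_left(block_ends, start)   # first block whose end >= start
    hi = bisect_right(block_starts, end)  # first block whose start > end
    return lo, hi


# First three df2 rows from the sample above
block_starts = [346213, 3135466, 3190760]
block_ends = [346984, 3136202, 3191377]

# df3 row 0: lies before every block, so the range is empty
print(overlapping_range(block_starts, block_ends, 11462, 11468))
# Hypothetical query that touches only the first block
print(overlapping_range(block_starts, block_ends, 346900, 347100))
# Hypothetical query that spans the second and third blocks
print(overlapping_range(block_starts, block_ends, 3136000, 3191000))
```

Each lookup costs O(log M) for M df2 rows instead of O(M). For lists with 2*10^8 entries, `numpy.searchsorted` with `side="left"` and `side="right"` should compute the same `lo` and `hi` bounds for all intervals in one vectorized call, which avoids the Python-level loop entirely.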
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
