'How do I search through a very large csv file?
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:
- Scan each row of the second file (two integers each), go to the position of those two integers in a 2D array (so if the integers are 2 and 3, I'll go to position [2,3]) and assign a value of 1.
- Go through each row of the first file, check if the position of the two integers of each row has a value of 1 in the array, and then print the according output to a third csv file.
Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?
I am using Jupyter Notebook. My code:
import csv
import numpy
def SNP():
thelines = numpy.ndarray((6639,524525))
tempint = 0
tempint2 = 0
with open("SL05_AO_RO.tab") as tsv:
for line in csv.reader(tsv, dialect="excel-tab"):
tempint = int(line[0])
tempint2 = int(line[1])
thelines[tempint,tempint2] = 1
return thelines
def common_sites():
tempint = 0
tempint2 = 0
temparray = SNP()
print('Checkpoint.')
with open('output_SL05.csv', 'w', newline='') as fp:
with open("covbreadth_common_sites.csv") as tsv:
for line in csv.reader(tsv, dialect="excel-tab"):
tempint = int(line[0])
tempint2 = int(line[1])
if temparray[tempint,tempint2] == 1:
a = csv.writer(fp, delimiter=',')
data = [['','']]
a.writerows(data)
else:
a = csv.writer(fp, delimiter=',')
data = [['R','R']]
a.writerows(data)
print('Done.')
return
common_sites()
Files: https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing
Solution 1:[1]
You're dataset really isn't that big, but it is relatively sparse. You aren't using a sparse structure to store the data which is causing the problem.
Just use a set of tuples to store the seen data, and then the lookup on that set is O(1), e.g:
In [1]:
import csv
with open("SL05_AO_RO.tab") as tsv:
seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
with open("covbreadth_common_sites.csv") as tsv:
common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
common[:10]
Out[1]:
[['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]
10 loops, best of 3: 150 ms per loop
In [2]:
len(common), len(seen)
Out[2]:
(190, 138205)
Solution 2:[2]
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.
import numpy as np
f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')
f1.sort(axis=0)
f2.sort(axis=0)
i, j = 0, 0
while i < f1.shape[0]:
while j < f2.shape[0] and f1[i][0] > f2[j][0]:
j += 1
while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
j += 1
if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
print()
else:
print('R,R')
i += 1
- Load data to
ndarrayto optimize memory usage - Sort data
- Find matches in sorted arrays
Total complexity is O(n*log(n) + m*log(m)), where n and m are sizes of input files.
Using of set() will not reduce memory usage per unique entry so I do not recommend to use it with large datasets.
Solution 3:[3]
Since CSV is just a DB dump, import it to any SQL DB, then do query on it. This is very efficient way.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | outoftime |
| Solution 3 | Marian Pa?dzioch |
