How do I search through a very large csv file?

I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:

  1. Scan each row of the second file (two integers each), go to the position of those two integers in a 2D array (so if the integers are 2 and 3, I'll go to position [2,3]) and assign a value of 1.
  2. Go through each row of the first file, check whether the position given by its two integers has a value of 1 in the array, and then print the corresponding output to a third csv file.

Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?

I am using Jupyter Notebook. My code:

import csv
import numpy

def SNP():
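    # This allocates a dense 6639 x 524525 float64 array (roughly 26 GiB)
    # and leaves it uninitialized, which is the likely source of the MemoryError.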
    thelines = numpy.ndarray((6639,524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint,tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint,tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['','']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R','R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()

Files: https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing



Solution 1:[1]

Your dataset really isn't that big, but it is relatively sparse. You aren't using a sparse structure to store the data, which is what is causing the problem.
Just use a set of tuples to store the seen data; lookup in a set is O(1), e.g.:

In [1]:
  import csv
  with open("SL05_AO_RO.tab") as tsv:
      seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
  with open("covbreadth_common_sites.csv") as tsv:
      common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
  common[:10]
Out[1]:
  [['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
   ['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]

Timing: 10 loops, best of 3: 150 ms per loop

In [2]:
  len(common), len(seen)
Out[2]:
  (190, 138205)
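
To produce the output file the question describes, the same set lookup can feed a csv.writer directly. A minimal sketch, reusing the filenames from the question's code:

import csv

# Build the lookup set once from the .tab file.
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))

# Stream the other file row by row and write the corresponding output.
# Rows are compared as tuples of strings, which works here because both
# files contain plain integer columns formatted the same way.
with open("covbreadth_common_sites.csv") as tsv, \
        open("output_SL05.csv", "w", newline="") as fp:
    writer = csv.writer(fp, delimiter=",")
    for line in csv.reader(tsv, dialect="excel-tab"):
        writer.writerow(["", ""] if tuple(line) in seen else ["R", "R"])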

Solution 2:[2]

I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.

import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')

# Sort rows lexicographically while keeping each pair intact
# (sort(axis=0) would sort the two columns independently and break the
# row pairing that the merge below relies on).
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]

i, j = 0, 0
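# Merge pass over the two sorted arrays: advance j through f2 until it
# catches up with f1[i], then compare the current rows.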
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1

The approach:

  1. Load the data into ndarrays to keep memory usage low.
  2. Sort both arrays by row.
  3. Find matches with a single merge pass over the sorted arrays.

Total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.

Using set() will not reduce memory usage per unique entry, so I do not recommend it for large datasets.

Solution 3:[3]

Since a CSV file is essentially a database dump, you can import it into any SQL database and then query it. This is a very efficient approach.
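
For example, with Python's built-in sqlite3 module. This is only a sketch, with hypothetical table names a and b; an in-memory database is used for brevity, but a file-backed one works the same way:

import csv
import sqlite3

# Load both files into SQLite, index the lookup table, and let SQL decide
# which rows of the first file get a blank row and which get 'R,R'.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (c1 INTEGER, c2 INTEGER)")
con.execute("CREATE TABLE b (c1 INTEGER, c2 INTEGER)")

with open("covbreadth_common_sites.csv") as tsv:
    con.executemany("INSERT INTO a VALUES (?, ?)",
                    csv.reader(tsv, dialect="excel-tab"))
with open("SL05_AO_RO.tab") as tsv:
    con.executemany("INSERT INTO b VALUES (?, ?)",
                    csv.reader(tsv, dialect="excel-tab"))
con.execute("CREATE INDEX idx_b ON b (c1, c2)")  # makes the EXISTS lookup fast

with open("output_SL05.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    query = ("SELECT EXISTS(SELECT 1 FROM b "
             "WHERE b.c1 = a.c1 AND b.c2 = a.c2) FROM a ORDER BY a.rowid")
    for (match,) in con.execute(query):
        writer.writerow(["", ""] if match else ["R", "R"])
con.close()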

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 outoftime
Solution 3 Marian Paździoch