Filter large CSV files (10GB+) based on column value in Python
EDITED: Added complexity
I have a large CSV file, and I want to filter out rows based on column values. For example, consider the following CSV format:
Col1,Col2,Nation,State,Col4...
a1,b1,Germany,state1,d1...
a2,b2,Germany,state2,d2...
a3,b3,USA,AL,d3...
a3,b3,USA,AL,d4...
a3,b3,USA,AK,d5...
a3,b3,USA,AK,d6...
I want to filter out all rows with Nation == 'USA', and then split those rows by each of the 50 states. What's the most efficient way of doing this? I'm using Python. Thanks
Also, is R better than Python for such tasks?
Solution 1:[1]
Use boolean indexing or DataFrame.query:
df1 = df[df['Nation'] == "USA"]
Or:
df1 = df.query('Nation == "USA"')
The second should be faster; see the pandas documentation on the performance of query.
If that is still not possible (not enough RAM), try dask, as Jon Clements suggested in the comments (thank you).
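Since a 10GB+ file is unlikely to fit in memory in one piece, here is a minimal sketch of the same filter using pandas' chunked reading; the file name, chunk size, and the per-state split are illustrative assumptions, not part of the original answer:
import pandas as pd

# Stream the CSV in pieces; the chunk size is a tuning assumption.
chunks = pd.read_csv('yourfile.csv', chunksize=100_000)

# Keep only the USA rows from each chunk, then combine them.
usa = pd.concat(chunk[chunk['Nation'] == 'USA'] for chunk in chunks)

# Split by state, e.g. into a dict of per-state DataFrames.
by_state = {state: grp for state, grp in usa.groupby('State')}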
Solution 2:[2]
Given the size of the data, one way would be to filter the CSV first and then load it:
import csv

with open('yourfile.csv', 'r', newline='') as f_in, \
     open('yourfile_edit.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)            # keep the header row in the output
    writer.writerow(header)
    nation = header.index('Nation')
    for row in reader:
        # Match on the Nation column only, not a substring of the whole line
        if row[nation] == 'USA':
            writer.writerow(row)
Now load the filtered CSV:
import pandas as pd

df = pd.read_csv('yourfile_edit.csv')
You get
  Col1 Col2 Nation State Col4
0   a3   b3    USA    AL   d3
1   a3   b3    USA    AL   d4
2   a3   b3    USA    AK   d5
3   a3   b3    USA    AK   d6
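From here, one way to handle the per-state split the question asks about is a groupby over the filtered frame; the output file naming is an illustrative assumption:
# Write one CSV per state; the naming scheme is an assumption.
for state, group in df.groupby('State'):
    group.to_csv(f'usa_{state}.csv', index=False)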
Solution 3:[3]
You could open the file, find the position of the Nation column in the header, then iterate over a csv.reader():
import csv

temp = r'C:\path\to\file'

with open(temp, 'r', newline='') as f:
    cr = csv.reader(f, delimiter=',')
    # The first row from the reader is the header; locate the Nation column
    i = next(cr).index('Nation')
    # List comprehension over the remaining rows
    filtered = [row for row in cr if row[i] == 'USA']
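If you also need the rows grouped by state without using pandas, here is a minimal sketch building on the same reader; the defaultdict grouping is an assumption added for illustration:
import csv
from collections import defaultdict

temp = r'C:\path\to\file'

# Collect matching rows per state while streaming the file once.
by_state = defaultdict(list)

with open(temp, 'r', newline='') as f:
    cr = csv.reader(f)
    header = next(cr)
    nation = header.index('Nation')
    state = header.index('State')
    for row in cr:
        if row[nation] == 'USA':
            by_state[row[state]].append(row)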
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Vaishali |
| Solution 3 | pstatix |
