How to read a CSV in Python to get a DataFrame, but keeping only one row out of every 3?

I have a very large CSV file. I would like to get only one row out of every 3 rows into a DataFrame. It is more or less like downsampling the CSV.

Let's say I have a CSV file like this:

4  5
9  2
3  7
1  5
2  4
9  10

And I want my DataFrame to be:

4  5
1  5

If I read the whole CSV and only then keep one row out of every 3, it doesn't help, because the reading itself takes too much time. Does anyone have an idea? :) (By the way, I am using Python.)

Cheers



Solution 1:[1]

You need to create a csv reader object first, then build a generator that yields only every nth item from that iterator, and finally use the generator as the DataFrame source. Done this way, you avoid excessive memory usage.

import csv
import pandas as pd

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    # keep only every 3rd row (row indices 0, 3, 6, ...)
    data = (x for i, x in enumerate(reader) if i % 3 == 0)
    # build the DataFrame while the file is still open,
    # because the generator reads the file lazily
    df = pd.DataFrame(data)

It looks like there is also a simpler way: passing a lambda to the skiprows argument of read_csv:

import pandas as pd

# skiprows: the callable receives a row index and returns True if that
# row should be skipped, so this keeps rows 0, 3, 6, ...
fn = lambda x: x % 3 != 0
df = pd.read_csv('file.csv', skiprows=fn)
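For the sample data in the question, note that the file has no header row and looks whitespace-separated rather than comma-separated. Under those assumptions (the filename file.csv and the separator are guesses), a sketch that reproduces the expected output would be:

import pandas as pd

# header=None because the sample has no header row;
# sep=r'\s+' assumes whitespace-separated columns
df = pd.read_csv('file.csv',
                 sep=r'\s+',
                 header=None,
                 skiprows=lambda x: x % 3 != 0)
print(df)  # rows 0 and 3 of the sample: [4, 5] and [1, 5]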

Solution 2:[2]

If I understood correctly, you want to cut your read time to (at most) 1/3 of the total. Pandas has many functions to customize how a CSV is read, but none of them will avoid reading (and then discarding) your whole file, since it is stored in contiguous blocks on your disk.

What I think is that if your constraint is time (and not memory), a 1/3 reduction in time is not going to be enough in any case, whatever the size of your file. What you can do is (see the sketch after this list):

  • read the whole CSV once
  • filter it, keeping just 1 row out of every 3
  • store the result in another file
  • on subsequent runs, read the filtered CSV instead
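A minimal sketch of that preprocessing step, assuming the original file is file.csv with no header and the filtered copy is cached as file_filtered.csv (all of these names are placeholders):

import pandas as pd

# one-off preprocessing: read everything once, keep every 3rd row, cache it
df = pd.read_csv('file.csv', header=None)
df.iloc[::3].to_csv('file_filtered.csv', index=False, header=False)

# on later runs, read only the much smaller filtered file
df = pd.read_csv('file_filtered.csv', header=None)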

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1
Solution 2   rikyeah