Shortcut: use pandas' duplicated() with a large CSV
I can't read the whole 5 GB CSV file in one go, but using pandas' read_csv() with chunksize set seems like a fast and easy way:
import pandas as panda

def run_pand(csv_db):
    reader = panda.read_csv(csv_db, chunksize=5000)
    dup = reader.duplicated(subset=["Region", "Country", "Ship Date"])
    # and afterwards I will write the duplicates to a new CSV
As I understand it, reading in chunks won't let me find duplicates if the matching rows land in different chunks, or will it still work?
Is there a way to search for matches using a Pandas method?
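One possible approach (a minimal sketch, not a verified solution, and not from the original post): because each chunk's duplicated() only sees that one chunk, duplicates that straddle chunk boundaries need some state that carries over between chunks, for example a set of keys already seen. The sketch below assumes the duplicate key really is the ["Region", "Country", "Ship Date"] columns and that the set of distinct keys fits in memory; the function name, output path and chunk size are placeholders.

import pandas as pd

def write_duplicates(csv_db, out_csv, key_cols=("Region", "Country", "Ship Date")):
    # Keys seen in all earlier chunks, so duplicates that straddle
    # chunk boundaries are still detected.
    seen = set()
    first_write = True
    for chunk in pd.read_csv(csv_db, chunksize=5000):
        # Build one hashable key per row from the subset columns.
        keys = pd.Series(list(zip(*(chunk[c] for c in key_cols))), index=chunk.index)
        # Duplicate = key already seen in an earlier chunk, or repeated
        # earlier within this same chunk (mirrors duplicated(keep="first")).
        mask = keys.isin(seen) | chunk.duplicated(subset=list(key_cols))
        # Append the flagged rows to the output CSV.
        chunk[mask].to_csv(out_csv, mode="w" if first_write else "a",
                           header=first_write, index=False)
        first_write = False
        seen.update(keys)

Because the seen set carries over between iterations, a repeated key is flagged even when its first occurrence was in an earlier chunk, which matches duplicated()'s default keep="first" behaviour.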
