Python: How to resolve the "'ZipExtFile' object has no attribute 'startswith'" error while reading a large CSV file in a zip?

I have a 10 GB CSV file compressed in a zip archive. Below is the code I am using to read the CSV file. However, it throws an AttributeError. Please also recommend other fast ways to read a large CSV file.

import zipfile as zp
import dask.dataframe as dd

file_dir = 'W:\\XYZ\\salaryofemployees.CSV.ZIP'
csv_file = "salaryofemployees.CSV"

with zp.ZipFile(file_dir) as z: 
   with z.open(csv_file) as f:
      dask_df = dd.read_csv(f)


Error: AttributeError: 'ZipExtFile' object has no attribute 'startswith'


Solution 1:[1]

You are passing a file-like object to read_csv, but the docstring says it must be a path or a list of paths:

urlpath : string or list
    Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
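To see why the error mentions 'startswith', here is a minimal sketch using only the standard library. It builds a throwaway in-memory zip (the member name demo.csv and its contents are made up for illustration) and shows that what ZipFile.open returns is a binary file-like object, not a string. The error message suggests that dask calls a string method on whatever it is given as urlpath; that is an inference from the message, not something checked against dask's source.

```python
import io
import zipfile

# Build a tiny zip in memory so the example needs no files on disk.
# "demo.csv" and its contents are hypothetical stand-ins for the real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("demo.csv", "name,salary\nalice,100\n")

with zipfile.ZipFile(buffer) as z:
    with z.open("demo.csv") as f:
        handle_type = type(f).__name__              # class of the opened member
        has_startswith = hasattr(f, "startswith")   # str method, absent on file objects
        first_line = f.readline()                   # bytes, since the handle is binary

print(handle_type, has_startswith, first_line)
```

The handle can be read like any binary file, but it has none of the string methods a path would have, so anything that expects a path string fails on it.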

Fortunately, you can use fsspec-style compound URLs for this kind of thing:

df = dd.read_csv("zip://salaryofemployees.CSV::W:\\XYZ\\salaryofemployees.CSV.ZIP")

HOWEVER, this is a single compressed file, so you get no chunking or parallelism here. Are you sure you want dask rather than pandas alone?
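If pandas alone is enough, one possible approach is to hand the opened zip member to pandas.read_csv and read it in chunks, so the whole 10 GB never has to sit in memory at once. This is a sketch, not a benchmarked recommendation: the helper name iter_csv_chunks and the default chunk size are assumptions made for illustration, and the right chunk size depends on the row width and available memory.

```python
import zipfile

import pandas as pd


def iter_csv_chunks(zip_path, member, chunksize=100_000):
    """Yield DataFrames of at most `chunksize` rows from one CSV inside a zip.

    `iter_csv_chunks` is a hypothetical helper, not part of pandas or dask.
    """
    with zipfile.ZipFile(zip_path) as z:
        with z.open(member) as f:
            # Unlike dd.read_csv, pandas.read_csv accepts a binary
            # file-like object; chunksize makes it return an iterator.
            yield from pd.read_csv(f, chunksize=chunksize)


# Example use with the paths from the question (not executed here):
# total_rows = 0
# for chunk in iter_csv_chunks('W:\\XYZ\\salaryofemployees.CSV.ZIP',
#                              'salaryofemployees.CSV'):
#     total_rows += len(chunk)
```

Each chunk is an ordinary DataFrame, so per-chunk aggregation or filtering can happen inside the loop. The archive is still decompressed sequentially, so this bounds memory use but does not add parallelism.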

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mdurant