Opening a zipfile with an unsupported compression type silently returns an empty filestream instead of throwing an exception
I seem to be banging my head against a newbie error, and I am not a newbie.
I have a 1.2 GB known-good zipfile 'train.zip' containing a 3.5 GB file 'train.csv'.
I open the zipfile and the file inside it without any exceptions (no LargeZipFile), but the resulting filestream appears to be empty. (UNIX 'unzip -c ...' confirms the archive is good.)
The file objects returned by Python's ZipFile.open() are not seekable or tellable, so I can't check it that way.
The Python distribution is 2.7.3 EPD-free 7.3-1 (32-bit), which should be fine for large zips. The OS is Mac OS X 10.6.6.
import csv
import os
import zipfile as zf

zip_pathname = os.path.join('/my/data/path/.../', 'train.zip')
#with zf.ZipFile(zip_pathname).open('train.csv') as z:
z = zf.ZipFile(zip_pathname, 'r', zf.ZIP_DEFLATED, allowZip64=True)  # I tried all permutations
z.debug = 1
z.testzip()  # zipfile integrity is ok
z1 = z.open('train.csv', 'r')  # our file keeps coming up empty?

# Check the info to confirm z1 is indeed a valid 3.5 GB file...
z1i = z.getinfo('train.csv')
for att in ('filename', 'file_size', 'compress_size', 'compress_type', 'date_time', 'CRC', 'comment'):
    print '%s:\t' % att, getattr(z1i, att)

# ... and it looks ok. compress_type = 9 ok?
#filename:      train.csv
#file_size:     3729150126
#compress_size: 1284613649
#compress_type: 9
#date_time:     (2012, 8, 20, 15, 30, 4)
#CRC:           1679210291

# All attempts to read z1 come up empty?!
# z1.readline()  gives ''
# z1.readlines() gives []
# z1.read()      takes ~60 sec but also returns ''

# The code I would want to run is:
reader = csv.reader(z1)
header = reader.next()
return reader
Solution 1:[1]
My solution for handling compression types that aren't supported by Python's ZipFile was to rely on a call to 7zip when ZipFile.extractall fails.
from zipfile import ZipFile
import subprocess

def Unzip(zipFile, destinationDirectory):
    try:
        with ZipFile(zipFile, 'r') as zipObj:
            # Extract all the contents of the zip file into the destination directory
            zipObj.extractall(destinationDirectory)
    except Exception:
        print("An exception occurred extracting with Python ZipFile library.")
        print("Attempting to extract using 7zip")
        subprocess.Popen(["7z", "e", f"{zipFile}", f"-o{destinationDirectory}", "-y"])
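Note that subprocess.Popen returns immediately, so the function above does not wait for 7zip to finish or notice a failed extraction. A minimal variant that blocks until 7zip completes and raises on a non-zero exit code could look like the sketch below (unzip_with_7z is a hypothetical helper name, and it assumes the 7z binary is on the PATH):

import subprocess

def unzip_with_7z(zip_file, destination_directory):
    # "x" preserves the archive's directory structure ("e", as above, flattens it);
    # check=True raises CalledProcessError if 7z exits with a non-zero status.
    subprocess.run(
        ["7z", "x", zip_file, f"-o{destination_directory}", "-y"],
        check=True,
    )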
Solution 2:[2]
Compression type 9 is Deflate64/Enhanced Deflate, which Python's zipfile module doesn't support (essentially because zipfile delegates decompression to zlib, and zlib doesn't support Deflate64).
If smaller files work fine, I suspect this zipfile was created by Windows Explorer: for larger files Windows Explorer can decide to use Deflate64.
(Note that Zip64 is different from Deflate64. Zip64 is supported by Python's zipfile module, and only changes how some metadata is stored in the zipfile, but still uses regular Deflate for the compressed data.)
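To confirm that this is what you are hitting, you can inspect the compression method recorded for each member without trying to decompress anything. A minimal sketch, assuming Python 3 (report_compression and the method-name mapping are illustrative; method 9 has no named constant in zipfile):

import zipfile

METHOD_NAMES = {
    zipfile.ZIP_STORED: 'stored',
    zipfile.ZIP_DEFLATED: 'deflate',
    9: 'deflate64 (not supported by zipfile)',
    zipfile.ZIP_BZIP2: 'bzip2',
    zipfile.ZIP_LZMA: 'lzma',
}

def report_compression(zip_pathname):
    # Reads only the central directory; no member data is decompressed.
    with zipfile.ZipFile(zip_pathname) as z:
        for info in z.infolist():
            method = METHOD_NAMES.get(info.compress_type, 'method %d' % info.compress_type)
            print(info.filename, method)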
However, stream-unzip now supports Deflate64. Modifying its example to read from the local disk, and to read a CSV file as in your example:
import csv
from io import IOBase, TextIOWrapper
import os

from stream_unzip import stream_unzip

def get_zipped_chunks(zip_pathname):
    with open(zip_pathname, 'rb') as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            yield chunk

def get_unzipped_chunks(zipped_chunks, filename):
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
        if file_name != filename:
            # Discard the chunks of members we are not interested in
            for chunk in unzipped_chunks:
                pass
            continue
        yield from unzipped_chunks

def to_str_lines(iterable):
    # Based on the answer at https://stackoverflow.com/a/70639580/1319998
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj(IOBase):
        def readable(self):
            return True

        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    yield from TextIOWrapper(FileLikeObj(), encoding='utf-8', newline='')

zipped_chunks = get_zipped_chunks(os.path.join('/my/data/path/.../', 'train.zip'))
unzipped_chunks = get_unzipped_chunks(zipped_chunks, b'train.csv')
str_lines = to_str_lines(unzipped_chunks)
csv_reader = csv.reader(str_lines)

for row in csv_reader:
    print(row)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Hugo |
| Solution 2 | |
