Python remove entry from zipfile

I'm currently writing an open source library for a container format, which involves modifying zip archives, so I use Python's built-in zipfile module. Because of some limitations I decided to modify the module and ship it with my library. These modifications include a patch from the Python issue tracker for removing entries from a zip file: https://bugs.python.org/issue6818 More specifically, I included zipfile.remove.2.patch from ubershmekel. After some modifications for Python 2.7, the patch works fine according to the shipped unit tests.

Nevertheless, I'm running into problems when I remove, add, or remove and then re-add files without closing the zip file in between.

Error
Traceback (most recent call last):
  File "/home/martin/git/pyCombineArchive/tests/test_zipfile.py", line 1590, in test_delete_add_no_close
    self.assertEqual(zf.read(fname), data)
  File "/home/martin/git/pyCombineArchive/combinearchive/custom_zip.py", line 948, in read
    with self.open(name, "r", pwd) as fp:
  File "/home/martin/git/pyCombineArchive/combinearchive/custom_zip.py", line 1003, in open
    % (zinfo.orig_filename, fname))
BadZipFile: File name in directory 'foo.txt' and header 'bar.txt' differ.

This means the zip file itself is OK, but somehow the central directory or the entry's local file header gets messed up. This unit test reproduces the error:

def test_delete_add_no_close(self):
    fname_list = ["foo.txt", "bar.txt", "blu.bla", "sup.bro", "rollah"]
    data_list = [''.join([chr(randint(0, 255)) for i in range(100)]) for i in range(len(fname_list))]

    # add some files to the zip
    with zipfile.ZipFile(TESTFN, "w") as zf:
        for fname, data in zip(fname_list, data_list):
            zf.writestr(fname, data)

    for no in range(0, 2):
        with zipfile.ZipFile(TESTFN, "a") as zf:
            zf.remove(fname_list[no])
            zf.writestr(fname_list[no], data_list[no])
            zf.remove(fname_list[no+1])
            zf.writestr(fname_list[no+1], data_list[no+1])

            # read back the removed and re-added files and the former last file (which was moved during removal)
            for fname, data in zip(fname_list, data_list):
                self.assertEqual(zf.read(fname), data)

My modified zipfile module and the complete unittest file can be found in this gist: https://gist.github.com/FreakyBytes/30a6f9866154d82f1c3863f2e4969cc4
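
The mismatch reported in the traceback can also be checked independently of the patched module. The following is a minimal sketch using only the standard library (Python 3 is assumed here; the helper name find_header_mismatches is hypothetical and not part of zipfile). It reads each local file header at the offset recorded in the central directory and compares the two names:

```python
import struct
import zipfile

# Fixed 30-byte part of a local file header: signature, version, flags,
# method, time, date, CRC, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")
LOCAL_SIGNATURE = b"PK\x03\x04"


def find_header_mismatches(fileobj):
    """Return (central_name, local_name, offset) for each inconsistent entry.

    `fileobj` is a seekable binary file object containing a zip archive.
    Hypothetical helper for illustration only.
    """
    # Opening in read mode only parses the central directory.
    with zipfile.ZipFile(fileobj, "r") as zf:
        infos = list(zf.infolist())

    mismatches = []
    for info in infos:
        fileobj.seek(info.header_offset)
        fixed = fileobj.read(LOCAL_HEADER.size)
        if len(fixed) != LOCAL_HEADER.size:
            raise zipfile.BadZipFile("Truncated local file header")
        fields = LOCAL_HEADER.unpack(fixed)
        if fields[0] != LOCAL_SIGNATURE:
            raise zipfile.BadZipFile("Bad magic number for file header")
        flags, name_length = fields[2], fields[9]
        raw_name = fileobj.read(name_length)
        # Bit 11 (0x800) of the flags marks a UTF-8 encoded file name.
        encoding = "utf-8" if flags & 0x800 else "cp437"
        local_name = raw_name.decode(encoding)
        if local_name != info.orig_filename:
            mismatches.append((info.orig_filename, local_name, info.header_offset))
    return mismatches
```

Calling it on the archive right after the failing remove/add sequence (for example with open(TESTFN, "rb") as fh: find_header_mismatches(fh)) should list the entries whose recorded offset points at another member's header.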



Solution 1:[1]

After some intensive debugging, I'm quite sure something went wrong when moving the remaining chunks (the ones stored after the removed file). So I rewrote that part of the code so that it copies these files/chunks one at a time. I also rewrite the file header for each of them (to make sure it is valid) and the central directory at the end of the zip file. My remove function now looks like this:

def remove(self, member):
    """Remove a file from the archive. Only works if the ZipFile was opened
    with mode 'a'."""

    if "a" not in self.mode:
        raise RuntimeError('remove() requires mode "a"')
    if not self.fp:
        raise RuntimeError(
              "Attempt to modify ZIP archive that was already closed")
    fp = self.fp

    # Make sure we have an info object
    if isinstance(member, ZipInfo):
        # 'member' is already an info object
        zinfo = member
    else:
        # Get info object for member
        zinfo = self.getinfo(member)

    # start at the pos of the first member (smallest offset)
    position = min([info.header_offset for info in self.filelist])  # start at the beginning of first file
    for info in self.filelist:
        fileheader = info.FileHeader()
        # is member after delete one?
        if info.header_offset > zinfo.header_offset and info != zinfo:
            # rewrite FileHeader and copy compressed data
            # Read and validate the existing file header:
            fp.seek(info.header_offset)
            fheader = fp.read(sizeFileHeader)
            if fheader[0:4] != stringFileHeader:
                raise BadZipFile("Bad magic number for file header")

            fheader = struct.unpack(structFileHeader, fheader)
            fname = fp.read(fheader[_FH_FILENAME_LENGTH])
            if fheader[_FH_EXTRA_FIELD_LENGTH]:
                fp.read(fheader[_FH_EXTRA_FIELD_LENGTH])

            if info.flag_bits & 0x800:
                # UTF-8 filename
                fname_str = fname.decode("utf-8")
            else:
                fname_str = fname.decode("cp437")

            if fname_str != info.orig_filename:
                if not self._filePassed:
                    fp.close()
                raise BadZipFile(
                      'File name in directory %r and header %r differ.'
                      % (info.orig_filename, fname))

            # read the actual data; use the size from the central directory,
            # because the header field is 0 when a data descriptor is used
            data = fp.read(info.compress_size)

            # modify info obj
            info.header_offset = position
            # jump to new position
            fp.seek(info.header_offset, 0)
            # write fileheader and data
            fp.write(fileheader)
            fp.write(data)
            if info.flag_bits & _FHF_HAS_DATA_DESCRIPTOR:
                # Write CRC and file sizes after the file data
                fp.write(struct.pack("<LLL", info.CRC, info.compress_size,
                        info.file_size))
            # update position
            fp.flush()
            position = fp.tell()

        elif info != zinfo:
            # move to next position
            position = position + info.compress_size + len(fileheader) + self._get_data_descriptor_size(info)

    # Fix class members with state
    self.start_dir = position
    self._didModify = True
    self.filelist.remove(zinfo)
    del self.NameToInfo[zinfo.filename]

    # write new central directory (includes truncate)
    fp.seek(position, 0)
    self._write_central_dir()
    fp.seek(self.start_dir, 0)  # jump back to the start of the central directory, so it gets overwritten at close()

You can find the complete code in the latest revision of the gist: https://gist.github.com/FreakyBytes/30a6f9866154d82f1c3863f2e4969cc4

or in the repo of the library I'm writing: https://github.com/FreakyBytes/pyCombineArchive
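
If in-place modification is not a hard requirement, a different and simpler technique is to copy every entry except the unwanted ones into a new archive and then swap the files. This is only a sketch under a few assumptions: Python 3 (for os.replace), an archive stored on disk under a path, and a hypothetical function name remove_by_rewriting. Unlike the remove() method above, it does not work on a ZipFile that is still open and it recompresses every remaining member.

```python
import os
import tempfile
import zipfile


def remove_by_rewriting(archive_path, names_to_remove):
    """Rewrite the archive at `archive_path` without the given member names.

    Hypothetical helper for illustration; raises KeyError if a name is
    not present in the archive.
    """
    names_to_remove = set(names_to_remove)
    # Create the temporary file next to the archive so that os.replace()
    # stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(archive_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(archive_path, "r") as src, \
                zipfile.ZipFile(tmp_path, "w") as dst:
            missing = names_to_remove - set(src.namelist())
            if missing:
                raise KeyError("Not in archive: %s" % sorted(missing))
            for info in src.infolist():
                if info.filename in names_to_remove:
                    continue
                # Read first: writestr() updates the ZipInfo object in place
                # (offset, sizes), while keeping name, date and compression type.
                data = src.read(info)
                dst.writestr(info, data)
        # Both archives are closed here; swap the new file into place.
        os.replace(tmp_path, archive_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

For example, remove_by_rewriting(TESTFN, ["foo.txt"]) followed by reopening the file in mode "a" and calling writestr("foo.txt", data) gives the remove-then-add behaviour, at the cost of rewriting the whole archive for each removal.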

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1