'Can zlib compressed output avoid using certain byte value?

It seems that the output of zlib.compress uses all possible byte values. Is this possible to use 255 of 256 byte values (for example avoid using \n)?

Note that I just use the python manual as a reference, but the question is not specific to python (i.e. any other languages that has a zlib library).



Solution 1:[1]

No, this is not possible. Apart from the compressed data itself, there is standardized control structures which contain integers. Those integers may accidentially lead to any 8-bit character ending up in the bytestream.

Your only chance would be to encode the zlib bytestream into another format, e.g. base64.

Solution 2:[2]

The whole point of compression is to reduce the size as much as possible. If zlib or any compressor only used 255 of the 256 byte values, the size of the output would be increased by at least 0.07%.

That may be perfectly fine for you, so you can simply post-process the compressed output, or any data at all, to remove one particular byte value at the expense of some expansion. The simplest approach would be to replace that byte when it occurs with a two-byte escape sequence. You would also then need to replace the escape prefix with a different two-byte escape sequence. That would expand the data on average by 0.8%. That is exactly what Hans provided in another answer here.

If that cost is too high, you can do something more sophisticated, which is to decode a fixed Huffman code that encodes 255 symbols of equal probability. To decode you then encode that Huffman code. The input is a sequence of bits, not bytes, and most of the time you will need to pad the input with some zero bits to encode the last symbol. The Huffman code turns one symbol into seven bits and the other 254 symbols into eight bits. So going the other way, it will expand the input by a little less than 0.1%. For short messages it will be a little more, since often less than seven bits at the very end will be encoded into a symbol.

Implementation in C:

// Placed in the public domain by Mark Adler, 26 June 2020.

// Encode an arbitrary stream of bytes into a stream of symbols limited to 255
// values. In particular, avoid the \n (10) byte value. With -d, decode back to
// the original byte stream. Take input from stdin, and write output to stdout.

#include <stdio.h>
#include <string.h>

// Encode arbitrary bytes to a sequence of 255 symbols, which are written out
// as bytes that exclude the value '\n' (10). This encoding is actually a
// decoding of a fixed Huffman code of 255 symbols of equal probability. The
// output will be on average a little less than 0.1% larger than the input,
// plus one byte, assuming random input. This is intended to be used on
// compressed data, which will appear random. An input of all zero bits will
// have the maximum possible expansion, which is 14.3%, plus one byte.
int nolf_encode(FILE *in, FILE *out) {
    unsigned buf = 0;
    int bits = 0, ch;
    do {
        if (bits < 8) {
            ch = getc(in);
            if (ch != EOF) {
                buf |= (unsigned)ch << bits;
                bits += 8;
            }
            else if (bits == 0)
                break;
        }
        if ((buf & 0x7f) == 0) {
            buf >>= 7;
            bits -= 7;
            putc(0, out);
            continue;
        }
        int sym = buf & 0xff;
        buf >>= 8;
        bits -= 8;
        if (sym >= '\n' && sym < 128)
            sym++;
        putc(sym, out);
    } while (ch != EOF);
    return 0;
}

// Decode a sequence of symbols from a set of 255 that was encoded by
// nolf_encode(). The input is read as bytes that exclude the value '\n' (10).
// Any such values in the input are ignored and flagged in an error message.
// The sequence is decoded to the original sequence of arbitrary bytes. The
// decoding is actually an encoding of a fixed Huffman code of 255 symbols of
// equal probability.
int nolf_decode(FILE *in, FILE *out) {
    unsigned long lfs = 0;
    unsigned buf = 0;
    int bits = 0, ch;
    while ((ch = getc(in)) != EOF) {
        if (ch == '\n') {
            lfs++;
            continue;
        }
        if (ch == 0) {
            if (bits == 0) {
                bits = 7;
                continue;
            }
            bits--;
        }
        else {
            if (ch > '\n' && ch <= 128)
                ch--;
            buf |= (unsigned)ch << bits;
        }
        putc(buf, out);
        buf >>= 8;
    }
    if (lfs)
        fprintf(stderr, "nolf: %lu unexpected line feeds ignored\n", lfs);
    return lfs != 0;
}

// Encode (no arguments) or decode (-d) from stdin to stdout.
int main(int argc, char **argv) {
    if (argc == 1)
        return nolf_encode(stdin, stdout);
    else if (argc == 2 && strcmp(argv[1], "-d") == 0)
        return nolf_decode(stdin, stdout);
    fputs("nolf: unknown options (use -d to decode)\n", stderr);
    return 1;
}

Solution 3:[3]

As @ypnos says, this isn't possible within zlib itself. You mentioned that base64 encoding is too inefficient, but it's pretty easy to use an escape character to encode a character you want to avoid (like newlines).

This isn't the most efficient code in the world (and you might want to do something like finding the least used bytes to save a tiny bit more space), but it's readable enough and demonstrates the idea. You can losslessly encode/decode, and the encoded stream won't have any newlines.

def encode(data):
    # order matters
    return data.replace(b'a', b'aa').replace(b'\n', b'ab')

def decode(data):
    def _foo():
        pair = False
        for b in data:
            if pair:
                # yield b'a' if b==b'a' else b'\n'
                yield 97 if b==97 else 10
                pair = False
            elif b==97:  # b'a'
                pair = True
            else:
                yield b
    return bytes(_foo())

As some measure of confidence you can check this exhaustively on small bytestrings:

from itertools import *

all(
    bytes(p) == decode(encode(bytes(p)))
        for c in combinations_with_replacement(b'ab\nc', r=6)
        for p in permutations(c)
)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ypnos
Solution 2
Solution 3