'How to compress an array of random positive integers in a certain range?

I want to compress an array consisting of about 10^5 random integers in range 0 to 2^15. The integers are unsorted and I need to compress them lossless.

I don't care much about the amount of computation and time needed to run the algorithm, just want to have better compression ratio.

Are there any suggested algorithms for this?



Solution 1:[1]

Assuming you don“t need to preserve original order, instead of passing the numbers themselves, pass the count. If they have a normal distribution, you can expect each number to be repeated 3 or 4 times. With 3 bits per number, we can count up to 7. You can make an array of 2^15 * 3 bits and every 3 bits set the count of that number. To handle extreme cases that have more than 7, we can also send a list of numbers and their counts for these cases. Then you can read the 3 bits array and overwrite with the additional info for count higher than 7.

Solution 2:[2]

For your exact example: just encode each number as a 15-bit unsigned int and apply bit packing. This is optimal since you have stated each integer in uniformly random in [0, 2^15), and the Shannon Entropy of this distribution is 15 bits.

For a more general solution, apply Quantile Compression (https://github.com/mwlon/quantile-compression/). It takes advantage of any smooth-ish data and compresses near optimally on shuffled data. It works by encoding each integer with a Huffman code for it coarse range in the distribution, then an exact offset within that range.

These approaches are both computationally cheap, but more compute won't get you further in this case.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 mwlon