How should I choose the best data compression method for my data?
I've done a bit of research but I think I can say I'm a complete beginner when it comes to data compression.
I need to compress data from a GNSS receiver. The data consist of a series of parameters measured over time -- more specifically, over X seconds at 1 Hz -- as follows:
X uint8 parameters, X uint8 parameters, X double parameters, X double parameters, X single parameters, X single parameters.
The data is stored in this sequence as a binary file.
Using general-purpose LZ77 compression tools, I've managed to achieve a compression ratio of 1.4 (with zlib DEFLATE), and I was wondering if it's possible to compress the data further. I am aware that this depends heavily on the data itself, so what I'm asking is which algorithms or software are better suited to the structure of the data I'm trying to compress. Arranging the data differently is also something I can change. In fact, I even tried converting all the data to double precision and then using a compressor designed specifically for streams of doubles, but to no avail: the compression ratio was even lower than 1.4.
In other words, how would you approach compressing this data? Given my lack of knowledge about data compression, I'm afraid I'm not presenting the data in the most suitable form for the compressor, or that I should be using a different compression algorithm altogether. If you could help, I would be grateful. Thank you!
Solution 1:[1]
Use delta coding. Subtract from each value the corresponding value in the previous sample, and compress the differences instead. At the other end, add the deltas back cumulatively to restore the original data. The delta-coded data should be more compressible, since consecutive GNSS measurements tend to change slowly.
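The round trip above can be sketched as follows. This is a minimal standalone illustration of first-order delta coding (the helper names are my own, not from any library):

```rust
// Delta coding: store differences between consecutive values.
fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(values.len());
    let mut prev = 0i64;
    for &v in values {
        out.push(v - prev); // difference from the previous value
        prev = v;
    }
    out
}

// Decoding is a running sum over the deltas.
fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(deltas.len());
    let mut acc = 0i64;
    for &d in deltas {
        acc += d; // accumulate to restore the original value
        out.push(acc);
    }
    out
}

fn main() {
    // Slowly varying readings, like a GNSS parameter sampled at 1 Hz.
    let readings = vec![1000, 1003, 1007, 1010, 1012];
    let deltas = delta_encode(&readings);
    assert_eq!(deltas, vec![1000, 3, 4, 3, 2]); // small deltas compress better
    assert_eq!(delta_decode(&deltas), readings); // round-trips losslessly
    println!("{:?}", deltas);
}
```

The small, repetitive deltas are what a general-purpose compressor like DEFLATE can then exploit.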
Solution 2:[2]
The current state of the art for time series compression is Quantile Compression. It compresses numerical sequences (e.g., integers, floats, timestamps) and typically achieves about a 35% higher compression ratio than other approaches. Delta encoding is a built-in feature.
CLI example:
cargo run --release compress \
--csv my.csv \
--col-name my_col \
--level 6 \
--delta-order 1 \
out.qco
Rust API example:
let my_nums: Vec<i64> = ...
let compressor = Compressor::<i64>::from_config(CompressorConfig {
compression_level: 6,
delta_encoding_order: 1,
});
let bytes: Vec<u8> = compressor.simple_compress(&my_nums);
println!("compressed down to {} bytes", bytes.len());
It does this by describing each number with a Huffman code for a range (a [lower, upper] bound), followed by an exact offset into that range. By choosing the ranges strategically based on your data, it comes close to the Shannon entropy of the data distribution.
For sequences of your data that are very smooth, you may even consider delta orders higher than 1 (e.g. delta order 2 is "delta-of-deltas").
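To see why higher delta orders help on smooth data, here is a simplified sketch (my own helper, not part of Quantile Compression; it drops the leading element, so it only illustrates the smoothness idea, not a full lossless codec):

```rust
// First-order differencing; applying it twice gives "delta-of-deltas".
fn diff(values: &[i64]) -> Vec<i64> {
    values.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    // A perfectly linear ramp: first-order deltas are constant,
    // second-order deltas are all zero, which compresses extremely well.
    let ramp = vec![10, 20, 30, 40, 50];
    let d1 = diff(&ramp);
    let d2 = diff(&d1);
    println!("{:?}", d1); // [10, 10, 10, 10]
    println!("{:?}", d2); // [0, 0, 0]
}
```

Data that changes at a roughly constant rate (e.g., a steadily drifting GNSS parameter) behaves like the ramp above, which is when delta order 2 pays off.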
Solution 3:[3]
The problem was that I wasn't writing the data to the file sequentially. I'm reading data from several satellites, and I was writing the file one parameter at a time, which meant genuinely sequential readings were interrupted by data from other satellites. The key was to write all the data for one satellite sequentially, then the next satellite, and so on, making each data stream as continuous as possible and, I believe, producing smaller deltas.
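The reordering described above can be sketched like this. The `Sample` struct and its fields are illustrative assumptions about the receiver's output, not the asker's actual code:

```rust
use std::collections::BTreeMap;

// A hypothetical per-epoch measurement from one satellite.
#[derive(Clone, Debug, PartialEq)]
struct Sample {
    sat_id: u8,
    pseudorange: f64, // one measured parameter, as an example
}

// Group each satellite's samples contiguously before writing, instead of
// interleaving satellites epoch by epoch. Consecutive values then come from
// the same satellite's time series, so the deltas stay small.
fn reorder_by_satellite(interleaved: &[Sample]) -> Vec<Sample> {
    let mut by_sat: BTreeMap<u8, Vec<Sample>> = BTreeMap::new();
    for s in interleaved {
        by_sat.entry(s.sat_id).or_default().push(s.clone());
    }
    by_sat.into_values().flatten().collect()
}

fn main() {
    let interleaved = vec![
        Sample { sat_id: 2, pseudorange: 20_000_100.0 },
        Sample { sat_id: 1, pseudorange: 21_500_000.0 },
        Sample { sat_id: 2, pseudorange: 20_000_103.0 },
    ];
    for s in reorder_by_satellite(&interleaved) {
        println!("sat {}: {}", s.sat_id, s.pseudorange);
    }
}
```

After this reordering, delta coding (Solution 1) sees one continuous series per satellite rather than jumps between unrelated satellites.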
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mark Adler |
| Solution 2 | mwlon |
| Solution 3 | Hugo Pontes |
