'Integer Time Series compression

Is there a well known documented algorithm for (positive) integer streams / time series compression, that would:

  • have variable bit length
  • work on deltas

My input data is a stream of temperature measurements from a sensor (more specifically a TMP36 read out by an Arduino). It is physically impossible that big jumps occur between measurements (time constant of sensor). I therefore think my compression algorithm should work on deltas (set a base on stream start and then only difference to next value). Because gaps are limited, I want variable bit length, because differences lower than 4 fit on 2 bits, lower than 8 on 3 bits and so on... But there is a dilemma between telling in stream the bit size of the next delta and just working on, say, 3 bit deltas and telling size only when bigger for instance.

Any idea what algorithm solves than one?



Solution 1:[1]

Use variable-length integers to code the deltas between values, and feed that to zlib to do the compression.

Solution 2:[2]

First of all there are different formats in existent. One thing I would do first is getting rid of the sign. A sign is usually a distraction when thinking about compression. I usually use the scheme where every positive is 2*v and every negative value is just 2*(-v)-1. So 0 = 0, -1 = 1, 1 = 2, -2 = 3, 2 = 4... .

Since with that scheme you have nothing like 0b11111111 = -1 the leading bits are gone. Now you can think about how to compress those symbols / numbers. One thing you can do is create a representive sample and use it to train a static huffman code. This should be possible within your on chip constraints. Another more simple aproach is using huffman codes for bit lengths and write the bits to stream. So 0 = bitlength 0, -1 = bitlength 1, 2,3 = bitlength length 2, ... . By using huffman codes to describe this bitlength you become quite compact literals.

I usually use a mixture. I use the most frequent symbols / values as raw values and use not so frequent numbers by using bit lengths + bit pattern of the actual value. This way you stay compact and do not have to deal with excessive tables (there are only 64 symbols for 64 bits lengths possible).

Also there are other schemes like leading bit where for example of every byte the first bit (or the highest) marks the last byte of the value so as long as the bit is set there will be another byte for the integer. If it is zero its the last byte of the value.

I usually train a static huffman code for such purposes. Its easy and you can even do the encoding and decoding becoming source code / generate source code out from your code (simply create ifs/switch statements and write your tables as arrays in your code).

Solution 3:[3]

You can use Integer compression methods with delta or delta of delta encoding like used in TurboPFor Integer Compression. Gamma coding can be also used if the deltas have very small values.

Solution 4:[4]

The current state of the art for this problem is Quantile Compression. It compresses numerical sequences such as integers and typically achieves 35% higher compression ratio than other approaches. It has delta encoding as a built-in feature.

CLI example:

cargo run --release compress \
  --csv my.csv \
  --col-name my_col \
  --level 6 \
  --delta-order 1 \
  out.qco

Rust API example:

let my_nums: Vec<i64> = ...
let compressor = Compressor::<i64>::from_config(CompressorConfig {
  compression_level: 6,
  delta_encoding_order: 1,
});
let bytes: Vec<u8> = compressor.simple_compress(&my_nums);
println!("compressed down to {} bytes", bytes.len());

It does this by describing each number with a Huffman code for a range (a [lower, upper] bound) followed by an exact offset into that range. By strategically choosing the ranges based on your data, it comes close the Shannon entropy of the data distribution.

Since your data comes from a temperature sensor, your data should be very smooth, and you may even consider delta orders higher than 1 (e.g. delta order 2 is "delta-of-deltas").

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mark Adler
Solution 2 Martin Kersten
Solution 3 powturbo
Solution 4 mwlon