About Lucene index postings lists: why are all deltas between 0 and 255 during FOR encoding?

From this blog, it says that postings lists are split into blocks of 256 docs and then each block is compressed separately. But what if a term's postings list is [72, 373]? Does Lucene do anything to avoid deltas greater than 255, like altering the doc sequence so the docs have appropriate doc ids?



Solution 1:[1]

The article doesn't say there is a limit of 256 on deltas; that limit only appears in the article's example.

Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits.

For example, if a postings list contains doc ids 1 to 256, i.e. [1, 2, 3, ..., 256], the delta-encoded block would be [1, 1, 1, ..., 1], which means the block needs only 1 bit per doc id.

Taking the example in your question, [72, 373, ...], the delta-encoded block would be [72, 301, ...], which means the block will need 9 bits per doc id (assuming that 301 is the largest delta in the block).

Hope this clears it up.


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jasir