vu128: Efficient variable-length integers (john-millikin.com)
93 points by jmillikin 5 months ago | 33 comments



As with many other variable-length integer encodings, its efficiency really depends on the expected input distribution. One glaring issue in this regard is that its 2-byte encoding is rather wasted: only 6.25% of all possible 2-byte encodings are useful. This could be fixed by making the 2-byte encoding represent 240..495 instead of 0..255. You can even make use of some bits from the first byte, so that for example the sequence goes like: 0=00 1=01 2=02 .. BF=BF C0=C0C0 C1=C0C1 .. 30FF=F0FF 3100=F13100 .. (all in hex).
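
A minimal Rust sketch of that biased layout (hypothetical name; only the 1- and 2-byte cases are spelled out):

  // 1 byte:  00..BF      encodes 0x00..=0xBF directly
  // 2 bytes: C0C0..F0FF  encodes v as v + 0xC000, covering 0xC0..=0x30FF
  // 3 bytes: F1 xx xx    continues from 0x3100 (longer forms omitted)
  fn encode_biased(v: u32, out: &mut Vec<u8>) {
      if v <= 0xBF {
          out.push(v as u8);
      } else if v <= 0x30FF {
          // the bias folds the otherwise-wasted code points back in
          out.extend_from_slice(&((v + 0xC000) as u16).to_be_bytes());
      } else {
          debug_assert!(v <= 0xFFFF); // wider values need F2.., not shown
          out.push(0xF1);
          out.extend_from_slice(&(v as u16).to_be_bytes());
      }
  }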

A radically different design, used for example by 7z, is also possible if you put all the continuation bits into the first byte. For example, instead of `1xxxxxxx 1yyyyyyy 0zzzzzzz` for the 3-byte encoding, use `110xxxxx yyyyyyyy zzzzzzzz`. This is essentially as compact as the more common VLQ and LEB128 encodings, but you can determine the length of the whole sequence from the first byte alone! Unfortunately this is less suitable for integers larger than 64 bits, because you run out of bits in the first byte. But it ends up with a rather elegant encoding for 64-bit integers, where `11111110` indicates 7 following bytes and `11111111` indicates 8 following bytes, with no additional 0 bit needed.
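
A rough Rust sketch of the decode path (hypothetical name; assumes a complete, valid buffer):

  fn decode_prefix_varint(buf: &[u8]) -> (u64, usize) {
      let ones = buf[0].leading_ones() as usize; // 0..=8 continuation bytes
      let len = ones + 1;                        // total length, known up front
      // payload bits still living in the first byte (none once ones >= 7)
      let mut v = if ones >= 7 { 0 } else { (buf[0] & (0x7F >> ones)) as u64 };
      for &b in &buf[1..len] {
          v = (v << 8) | b as u64;
      }
      (v, len)
  }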

> An afternoon of web searching has failed to uncover a prior name for this particular encoding, so I'm going to name it vu128 and describe it below. If any reader knows of prior art, please send it my way.

I don't know if it has other names, but a similar encoding can be seen in X.690 [1] and the "additional information" bit field of CBOR [2].

[1] https://en.wikipedia.org/wiki/X.690#Length_octets

[2] https://www.rfc-editor.org/rfc/rfc8949.html#section-3


> Unfortunately this is less suitable for integers larger than 64 bits, because then you run out of bits in the first byte.

Just spitballing here, but is it necessary to support every increment of octets? Do we really need a 7-octet/56-bit width, for instance? Instead, couldn’t we just allow widths of log2(n_octets)? An int64 would need 2 bits of length info (8, 16, 32, 64). Or maybe I’m thinking about it wrong, because not all bits can be used.

> its efficiency really depends on the expected input distribution

Right. But everything else in computers is rounded to the nearest power of 2, so maybe it works here too haha.


> Just spitballing here but is it necessary to support every increment of octets?

Indeed, CBOR does this, and that's another reason you should put the continuation bits into the first byte: it allows for much greater control.
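
For reference, a small Rust sketch of CBOR's "additional information" field (RFC 8949, section 3), which steps through exactly those power-of-two widths:

  fn cbor_arg_len(initial: u8) -> Option<usize> {
      match initial & 0x1F {     // low 5 bits of the initial byte
          0..=23 => Some(0),     // the value is the field itself
          24 => Some(1),         // 1 following byte
          25 => Some(2),         // 2 following bytes
          26 => Some(4),         // 4 following bytes
          27 => Some(8),         // 8 following bytes
          _ => None,             // 28..=30 reserved, 31 = indefinite length
      }
  }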


If you don’t need the full 64-bit space and can go 61-bit, you can actually encode the variable length in 3 bits and have at most an 8-byte value vs. the 9-byte value here (i.e. 0 overhead). Three bits indicate whether the length is 1 to 8 bytes, and the remainder holds the value. This would still be 5 bits more efficient even if you used the full 64 bits, but unless you can make use of those extra 5 bits somehow (e.g. tightly packing another value that’s also partially sliced), it’s not buying you anything.
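
One possible Rust sketch of that layout (hypothetical names; assumes the 3 tag bits sit at the top of the first byte, big-endian payload):

  fn encode_u61(v: u64, out: &mut Vec<u8>) {
      assert!(v < 1u64 << 61);
      // smallest len in 1..=8 such that v fits in 8*len - 3 payload bits
      let mut len = 1;
      while len < 8 && v >= 1u64 << (8 * len - 3) { len += 1; }
      let tagged = ((len as u64 - 1) << (8 * len - 3)) | v;
      out.extend_from_slice(&tagged.to_be_bytes()[8 - len..]);
  }

  fn decode_u61(buf: &[u8]) -> (u64, usize) {
      let len = (buf[0] >> 5) as usize + 1;  // length from the 3 tag bits
      let mut v = (buf[0] & 0x1F) as u64;    // 5 payload bits up front
      for &b in &buf[1..len] { v = (v << 8) | b as u64; }
      (v, len)
  }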


Clever. But...

If your worst case is 8 bytes for 61 bits, that's not 0 overhead, but 3 bits of overhead.

Another detail is that vu128 is double that (128 bits), i.e. 17 bytes worst case, or 8 bits of overhead. It would presumably be 4 bits in your example, to represent a 124-bit payload.


0 overhead in that it takes 8 bytes in memory and up to 8 bytes serialized. You can’t slice a byte partially unless you’re doing packing, so I think it’s fair to characterize it as 0 overhead, but sure, it’s 3 bits of overhead vs. 8. It came up because I had a counter and just arbitrarily cut off the range because I couldn’t possibly ever have it count that high. But yeah, the same idea could be extended to 128 bits.


Fair enough. In case this was a counter, best make sure you get the overflow logic right. The normal 64-bit wrapping won't kick in if you're only able to (de)serialize 61 bits.

Or (depending on the context of this counter) you may never want to overflow at all - in which case you're best off with a variable-length representation that is open ended (like R or Python arbitrary-precision integers)


It would take your absolute top-of-the-line CPU doing nothing but incrementing the counter for more than 12 years to worry about overflowing a 61-bit number (2^61 ≈ 2.3e18 increments; at 6 GHz that's ≈ 3.8e8 seconds, about 12 years). And what this is counting is much more expensive than incrementing a counter, so you can’t run this at 6 GHz. Maybe ~500 MHz, which gets you to > 100 years if that’s all you’re doing, but this counter is a side operation that won’t see much traffic even at scale, so you’re talking about millennia in practice, I suspect.

So it’s just not a practical concern - I just panic if it hits the limit. I think you may be failing to comprehend just how large numbers like this get and how long it takes to count that high, because overflow is such a very real concern with 32-bit counters. Addition of course can always overflow, but this is a straight-up strictly monotonic counter that counts from 0 to 2^61.


:-) fair enough.

And panicking is exactly the kind of mitigation I had in mind initially; just don't silently do the unexpected.

> I think you may be failing to comprehend just how large numbers like this get

On the other hand, we programmers are also notorious for failing to comprehend just how long our software will live on...


Honestly if any code I’ve written survives 100 years to cause problems I’ll be impressed. 1000 years I’ll be ecstatic.


I implemented a slight variation of this a while back, which worked as described up to 11110xxx, but then a prefix byte of 11111xxx meant a payload of 64*2^xxx bits. So 11111000 is followed by a 64-bit value, 11111001 by a 128-bit one, up to 11111111, which is 8192 bits.
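
Roughly, in Rust (hypothetical name; the low 3 bits of the 11111xxx prefix select a power-of-two payload width):

  // payload length in bytes for a prefix byte of the form 11111xxx
  fn big_payload_len(prefix: u8) -> usize {
      debug_assert!(prefix >= 0b1111_1000);
      8usize << (prefix & 0b111) // 64*2^xxx bits = 8*2^xxx bytes, 8..=1024
  }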


Also, real-world numbers tend to cluster around powers of 10 and, in computers, powers of 2. It might make sense to encode those in a special way, e.g. <pow2-prefix><var-len-power><0/1 - to subtract one> and <pow10-prefix><var-len-power><0/1 - to subtract one>. E.g. assuming the 10 prefix is reserved for pow2 and 11 for pow10, 255 would be encoded as 10_1000_1, 65535 would be 10_10000_1, 65536 would be 10_10000_0, and 1000 would be 11_11_0.


That should probably be resolved outside of the serialization. For example, you can easily map -4n-1 to 2^(8+n), -4n-2 to 2^(8+n)-1, -4n-3 to 10^(3+n), and -4n-4 to 10^(3+n)-1 for all n >= 0.
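
A tiny Rust sketch of the decode side of that mapping (hypothetical name; negative codes expand back into the powers they stand for):

  // caller must keep n small enough that the power fits in a u64
  fn from_code(c: i64) -> Option<u64> {
      if c >= 0 { return None; }      // non-negative codes aren't special
      let k = (-c - 1) as u32;
      let (n, r) = (k / 4, k % 4);
      Some(match r {
          0 => 1u64 << (8 + n),       // -4n-1 -> 2^(8+n)
          1 => (1u64 << (8 + n)) - 1, // -4n-2 -> 2^(8+n)-1
          2 => 10u64.pow(3 + n),      // -4n-3 -> 10^(3+n)
          _ => 10u64.pow(3 + n) - 1,  // -4n-4 -> 10^(3+n)-1
      })
  }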


> its efficiency really depends on the expected input distribution. One glaring issue in this regard is that its 2-byte encoding is rather wasted

Yeah, and it's worth noting that this encoding is very close to just length then value. It only saves space for numbers up to about ±100. In a lot of situations it wouldn't actually do better than tag-length-value.


Variable-length integer schemes are great when you are encoding one integer. But when you are encoding a list of integers, you should really consider a different scheme.

Variable-length integer schemes generally interleave the control bits & data bits. This means you don't know where the next integer starts until you at least partially decode the current integer.

When encoding a list of integers, you would rather put all the control information together, and all the data together. This generally allows for significantly more efficient decoding using SIMD.
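
As a sketch of the separated layout (Stream VByte-style, four u32s per control byte; hypothetical names):

  fn encode_quad(vals: [u32; 4], ctrl: &mut Vec<u8>, data: &mut Vec<u8>) {
      let mut key = 0u8;
      for (i, &v) in vals.iter().enumerate() {
          let len = (4 - v.leading_zeros() / 8).max(1) as usize; // 1..=4 bytes
          key |= ((len - 1) as u8) << (2 * i); // 2-bit length code per value
          data.extend_from_slice(&v.to_le_bytes()[..len]);
      }
      ctrl.push(key); // all control info together, all payload bytes together
  }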


Don’t forget column-stores. Lists of floats and ints are generally “full size”, but then each value is XORed with the preceding one and the result is compressed in chunks with a high-speed compressor like LZ4. We see 20x or higher compression ratios on many of our column-stores, which contain data such as invoice amounts as metrics. Decompression overhead is below zero (much faster than operating on the raw uncompressed data, which is bound by memory bandwidth or IO).
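
A minimal Rust sketch of the XOR step (hypothetical name; floats would go through f64::to_bits first):

  fn xor_delta(col: &[u64]) -> Vec<u8> {
      let mut out = Vec::with_capacity(col.len() * 8);
      let mut prev = 0u64;
      for &v in col {
          // similar neighbors leave long runs of zero bytes for LZ4
          out.extend_from_slice(&(v ^ prev).to_le_bytes());
          prev = v;
      }
      out
  }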


Yeah, see Stream VByte [1] for more information.

[1] https://arxiv.org/abs/1709.08990


When I need them I usually implement the codec used in MKV format: https://www.rfc-editor.org/rfc/rfc8794.html

Similar performance characteristics, i.e. just the first byte tells you the encoded length, no need to test every byte. Luckily, most programming languages have intrinsic functions for the BSWAP instruction.
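
A small Rust sketch of the VINT decode (hypothetical name; assumes a valid VINT, i.e. a nonzero first byte):

  fn decode_vint(buf: &[u8]) -> (u64, usize) {
      let len = buf[0].leading_zeros() as usize + 1; // total length, 1..=8
      let mut v = (buf[0] as u64) & (0xFF >> len);   // strip the length marker
      for &b in &buf[1..len] {
          v = (v << 8) | b as u64; // big-endian load, hence BSWAP on x86
      }
      (v, len)
  }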


MKV (and AV1, VP8, VP9) use the LEB128 format, which the article mentions is inefficient to parse on off-the-shelf CPUs.

LEB128 is relatively efficient in the bitstream and sidesteps potential MPEG-LA patent issues (H.264, maybe ISO/IEC 14496 part 14).


I don't think they do. These formats are based on EBML, see the RFC link in my previous comment. That VINT encoding is not LEB128, it's better.

BTW I once implemented my own parser in C#, see that source file: https://github.com/Const-me/Vrmac/blob/master/VrmacVideo/Con...


There is also SQLite varint, https://sqlite.org/src4/doc/trunk/www/varint.wiki

IIRC it's different from both LEB128 and VLQ, and quite neat. It's for 64-bit (u)ints, though.


If I’m following, this presents an alternative format and implementation to LEB128 which encodes and decodes substantially faster. Notably, the implementation is quite simple. Cool! And agreed that modern CPUs really suffer from branches.

Should I interpret the plot to mean the average elapsed wall-clock time per integer decoded/encoded? And can I conclude the throughput is the reciprocal? So about 100,000 integers per second, or around 1 MB/s of decompressed data.

I know this is a bit unfair because the implementation is much more complex, but my first thought is why I would use vu128 instead of Lemire’s Stream VByte: https://arxiv.org/abs/1709.08990

A slight tangent but I stumbled on this library which stores floats XOR’ed with the previous float in the stream: https://github.com/velvia/compressed-vec it seems really clever to me! They reference “Gorilla: A Fast, Scalable, In-Memory Time Series Database” which in turn references two 2006 papers: “Fast Lossless Compression of Scientific Floating-Point Data” and “Fast and Efficient Compression of Floating-Point Data”. Frustratingly, the FB paper doesn’t benchmark their XOR-based floating point encoding but the earlier two papers do.


The plot is the time to encode/decode a list of 10,000 random integers in a uniform distribution. The throughput is 2-10 GiB/s[0] on my desktop, with portable scalar operations (no SIMD). Distributions skewed toward small values are slightly faster[1].

I believe a decoder that used SIMD would perform even faster, since the format is friendly to the BMI instructions found on newer CPUs. You could also store the data as concat(prefix_bytes) + concat(payloads), which would really get things going.

For scientific data with lots of multi-byte f64 values, the single branch for < 0xF0 might be unnecessary and undesirable. There are other formats that can be decoded by completely branchless SIMD, at the cost of larger sizes for very small values. For example, Google has a format that uses the first byte as a bitmask: https://static.googleusercontent.com/media/research.google.c...

[0] vu128 is less affected by value size than VLQ/LEB128, so on fast hardware it decodes 10,000 8-bit values in about the same time as 10,000 64-bit values.

[1] In the discussion on lobste.rs[2] a user wondered what the results would look like with a distribution containing lots of small values. A Zipf distribution's results look like <https://i.imgur.com/WGFtz6f.png>.

[2] https://lobste.rs/s/qvoe7a/vu128_efficient_variable_length#c...


Thanks for the reply and clarification! That’s an impressive throughput. Am I interpreting the behavior correctly to think that your code has one branch per int, and that branch will be predicted well when your integers are all of similar magnitude?

I never fully grokked the Zipf distribution, but am I correct to conclude the worst case for vu128 would be alternating integers on either side of the branch condition?

Thanks for the link to the slides, also clever!


  > Am I interpreting the behavior correctly to think that your code has
  > one branch per int, and that branch will be predicted well when your
  > integers are all of similar magnitude?
Yep, that's generally correct. The branch seems to be harmless either way -- interleaving values on either side of it doesn't seem to affect throughput much on my machine (results may vary).

  > I never fully grokked the Zipf distribution, but am I correct to conclude
  > the worst case for vu128 would be alternating integers on either side of
  > the branch condition?
It depends on whether you're measuring throughput by values/sec or bytes/sec.

In terms of bytes/sec, the worst case is a set of integers exclusively in the range [0xF0, 0xFF] -- big enough to skip the fast path, small enough that each byte has to go through the full bit-masking.

In terms of values/sec, the worst is really big integers because writing 8 bytes (or 16, for `encode_u128`) to a buffer takes longer than writing one byte.

However, regardless of measurement, the worst case for vu128 is faster than VLQ/LEB128 -- except for distributions containing only [0x00, 0x7F] in which case they're all equivalent.

---

Also, after some comments on Lobsters about cases where space efficiency in the u16 range really matters, I'm experimenting with a version that tries to match or exceed VLQ/LEB128 for size efficiency at all bit lengths.

It has three more branches, but in practice this hasn't turned out to matter much -- it seems that indefinite loops are slower than a fixed set of branches, so it's still significantly faster than VLQ/LEB128 on most sets of integers.

  // like LEB128 but continuation bits are packed into the
  // leading byte -- sometimes called "prefixed varint"
  
  0xxxxxxx                            //  7 bits of payload
  10xxxxxx xxxxxxxx                   // 14 bits of payload
  110xxxxx xxxxxxxx xxxxxxxx          // 21 bits of payload
  1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // 28 bits of payload

  // For payloads > 28 bits, use a (0xF0 | length) prefix,
  // same as in the original post.
  1111____  xxxxxxxx xxxxxxxx [...]
It seems to be more fussy about compiler optimizations, though: https://github.com/rust-lang/rust/issues/125543


FWIW the lobste.rs thread surfaced this 2016 comment thread about PrefixVarint vs. LEB128 varint in protobufs, which has lots of good info, and makes the UTF-8 analogy:

https://news.ycombinator.com/item?id=11263378

Also sqlite varint:

https://sqlite.org/src4/doc/trunk/www/varint.wiki


> Also sqlite varint:

That's not actually in use. The SQLite3 format is far more basic:

"A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier byte of the varint are more significant than bits taken from the later bytes."

From: https://www.sqlite.org/fileformat2.html Section 1.6
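
A short Rust sketch of the decode described in that quote (hypothetical name; assumes up to 9 readable bytes):

  fn decode_sqlite3_varint(buf: &[u8]) -> (u64, usize) {
      let mut v = 0u64;
      for i in 0..8 {
          v = (v << 7) | (buf[i] & 0x7F) as u64; // 7 bits per byte, big-endian
          if buf[i] & 0x80 == 0 {
              return (v, i + 1); // a clear high bit terminates
          }
      }
      ((v << 8) | buf[8] as u64, 9) // ninth byte contributes all 8 bits
  }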


In terms of serializing JSON to a binary representation that includes variable-length integers, check out CBOR:

https://datatracker.ietf.org/doc/html/rfc8949


How does it compare to vint64[0] performance-wise? For values that fit the 64-bit space.

[0] https://crates.io/crates/vint64


This is another variation on putting the control bits into the first byte, where the number of following bytes is indicated by the number of trailing 1 bits instead of leading 1 bits. I'm not sure which one is more efficient, though.


Zig-zag encoding is brilliant. Basically, you can determine positive or negative based on odd/even, and divide by 2 to get the value. I tried to come up with a similar encoding and never thought of this.
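
A quick Rust sketch of both directions (hypothetical names):

  fn zigzag(n: i64) -> u64 {
      ((n << 1) ^ (n >> 63)) as u64 // arithmetic shift smears the sign bit
  }

  fn unzigzag(u: u64) -> i64 {
      ((u >> 1) as i64) ^ -((u & 1) as i64) // odd = negative, then halve
  }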


It's the same as enumerating 2D points in a spiral, but restricted to 1D.


Yeah, basically it's one's complement (or two's complement) but with the sign bit on the right rather than the left.



