Wednesday, March 14, 2018

xxHash for small keys: the impressive power of modern compilers

 Several years ago, xxHash was created as a companion error detector for LZ4 frame format. The initial candidate for this role was CRC32, but it turned out being several times slower than LZ4 decompression, nullifying one of its key strength.

After some research, I found the great murmurhash, by Austin Appleby, alongside its validation tool SMHasher . It nearly fitted the bill, running faster than LZ4, but still a bit too close to my taste. Hence started a game to see how much speed could be extracted from a custom hash formula while preserving good distribution properties.

A lot of things happened since that starting point, but the main take away from this story is : xxHash was created mostly as a checksum companion, digesting long inputs.

Fast forward nowadays, and xxHash is being used in more places than it was originally expected. The design has been expanded to create a second variant, XXH64, which is successful in video content and data bases.
But for several of these uses cases, hashed keys are no longer necessarily "large".

In some cases, the need to run xxHash on small keys resulted in the creation of dedicated variants, that cut drastically through the decision tree to extract just the right suite of operations for the desired key. And it works quite well.

That pushes the hash algorithm into territories it was not explicitly optimized for. Thankfully, one of SMHasher's test module was dedicated for speed on small keys, so it helped to pay attention to the topic during design phase. Hence the performance on small key is correct, but the dedicated function push it to another level.

Let's analyse the 4-byte hash example.
Invoking the regular XXH32() function on 4-bytes samples, and running it on my Mac OS-X 10.13 laptop (with compilation done by llvm9), I measure 233 MH/s (Millions of hashes per second).
Not bad, but running the dedicated 4-bytes function, it jumps to 780 MH/s. That's a stark difference !

Let's investigate further.
xxHash offers an obscure build flag named XXH_PRIVATE_API. The initial intention is to make all XXH_* symbols static, so that they do not get exposed on the public side of a library interface. This is useful when several libraries use xxHash as an embedded source file. In such a case, an application linking to both libraries will encounter multiple XXH_* symbols, resulting in naming collisions.

A side effect of this strategy is that function bodies are now available during compilation, which makes it possible to inline them. Surely, for small keys, inlining the hash function might help compared to invoking a function from another module ?
Well, yes, it does help, but there is nothing magical. Using the same setup as previously, the speed improves to 272 MH/s . That's better, but still far from the dedicated function.

That's where the power of inlining can really kick in. In the specific case that the key has a predictable small length, it's possible to pass as length argument a compile-time constant, like sizeof(key), instead of a variable storing the same value. This, in turn, will allow the compiler to make some drastic simplification during binary generation, through dead code removal optimization, throwing away branches which are known to be useless.
Using this trick on the now inlined XXH32(), speed increases to 780 MH/s, aka the same speed as dedicated function.

I haven't checked but I wouldn't be surprised if both the dedicated function and the inlined one resulted in the same assembly sequence.
But the inlining strategy seems more powerful : no need to create, and then maintain, a dedicated piece of code. Plus, it's possible to generate multiple variants, by changing the "length" argument to some other compile-time constant.

object XXH32() XXH32 inlined XXH32 inlined +
length constant
dedicated XXH32 function
4-bytes field
233 MH/s
272 MH/s
780 MH/s
780 MH/s


Another learning is that inlining is quite powerful for small keys, but the XXH_PRIVATE_API build macro makes a poor job at underlining its effectiveness.

As a consequence, next release of xxHash will introduce a new build macro, named XXH_INLINE_ALL. It does exactly the same thing, but its performance impact is better documented, and I suspect the name itself will make it easier for developers to anticipate its implications.

Friday, February 16, 2018

When to use Dictionary Compression


 On the Zstandard website, there is a small chapter dedicated to Dictionary compression. In a nutshell, it explains that it can dramatically improve compression ratio for small files. Which is correct, but doesn’t nearly capture the impact of this feature. In this blog post, I want to address this topic, and present a few scenarios in which dictionary offers more than just “improved compression”.

Database example

Let’s start by a simple case. Since dictionary compression is good for small data, we need an application which handles small data. For example a log record storage, which can be simplified as an append-only database.
Writes are append-only, but reads can be random. A single query may require retrieving multiple individual records scattered throughout the database, in no particular order.
As usual, the solution needs to be storage efficient, so compression plays an important role. And records tend to be repetitive, so compression ratio will be good.
However, this setup offers a classic example of the “block-size trade-off” : in order to get any compression done, it’s necessary to group records together into “blocks”, then compress each block as a single entity. The larger the block, the better the compression ratio. But now, in order to retrieve a single record, it’s necessary to decompress the whole block, which translates into extra-work during read operations.

Impact of Block Size on Compression ratio (level 1)
Last data point compresses records individually ([250-550] bytes)

The “optimal block size” is application dependent. It highly depends on data (how repetitive it is), and on usage scenarios (how many random reads). Typical sizes may vary between 4 KB and 128 KB.

Enter dictionary compression. It is now possible to train the algorithm to become specifically good at compressing a selected type of data.
(For simplicity purpose, in this example, we’ll assume that the log storage server stores a limited variety of record types, small enough to fit into a single dictionary. If not, it is necessary to separate record types into “categories”, generate a dictionary per category, then route each record into a separate pool based on category. More complex, but it doesn’t change the principles.)
As a first step, we can now just use the dictionary to grab additional compression (see graph), and it's an instant win. But that wouldn’t capture all the value from this operation.

The other benefit is that now, the optimal block size can be adjusted to the new conditions, since dictionary compression preserves better compression ratio (even 1 KB block size becomes practicable!). What we gain in exchange is improved performance for random read extraction.

Impact of Block Size on single record extraction (Random Reads)

At its extreme, it might even be possible to compress each record individually, so that no more energy is lost to decompress additional records which aren’t used by the query. But that's not even required.
This change produces huge wins for random reads (collecting “single records” scattered throughout the database). In this example, the log storage can now extract up to 2.5M random records / second, while without dictionary, it was limited to 400K random records / second, and at the cost of a pitiful compression ratio.

In fact, the new situation unlocks new possibilities for the database, which can now accept queries that used to be considered too prohibitive, and might previously have required another dedicated database to serve them.

Normal Vs Dictionary Compression (level 1)
Comparing Compression Ratio and Random Reads per second

That’s when the benefits of dictionary compression really kick in : beyond “better compression”, it unlocks new capabilities that were previously rightly considered “out of reach”.

Massive connections

Another interesting scenario is when a server maintains a large number of connections (multiple thousands) with many clients. Let’s assume the connection is used to send / receive requests respecting a certain protocol, so there tend to be a lot of inter-messages repetition. But within a single message, compression perspective is low, because each message is rather small.

The solution to this topic is known : use streaming compression. Streaming will keep in memory the last N KB (configurable) of messages and use that knowledge to better compress future messages. It reaches excellent performance, as same tags and sequences repeated across messages get squashed in subsequent messages.

The problem is, preserving such a “context” costs memory. And a lot of memory that is. To provide some perspective, a Zstandard streaming context with an history of 32 KB requires 263 KB of memory (about the same as zlib). Of course, increasing history depth improves compression ratio, but also increases memory budget.
That doesn’t look like a large budget in isolation, or even for a few connections, but when applied to 10 thousands client connections, we are talking about > 2.5 GB of RAM. Situation worsen with more connections, or when trying to improve ratio through larger history and higher compression levels.

In such a situation, dictionary compression can help. Training a dictionary to be good at compressing a set of protocols, and then ignite each new connection with this dictionary, will produce instant benefits during the early stage of the connection lifetime. Fine, but then, streaming history takes over, so gains are limited, especially when connection lifetime is long.

A bigger benefit can be realised when understanding that the dictionary can be used to completely eliminate streaming. Dictionary will then partially offset compression ratio reduction, so mileage can vary, but in many cases, ratio just takes a small hit, as generic redundant information is already included in the dictionary anyway.
What we get in exchange are 2 things :
  • huge RAM savings : it’s now longer necessary to preserve a context per client, a single context per active thread is now enough, which is typically several order of magnitude less than the nb of clients.
  • vastly simplified communication framework, as each request is now technically “context-less”, eliminating an important logic framework dedicated to keeping contexts synchronised between sender and receiver (with all the funny parts related to time out, missed or disordered packets, etc.).
Memory budget comparison
Context-less strategy requires much less memory (and is barely visible)


Eliminating the RAM requirement, which evolves from dominant to negligible, and reducing complexity open in turn new possibilities, such as hosting even more connections per server, or improved performance for other local applications. Or simply reduce the number of servers required for the same task.

This kind of scenarios is where Dictionary Compression can give its best : beyond “better compression for small data”, it makes it possible to build target application differently, with second-order effects being as important if not more than compression savings alone.

These are just 2 examples. And I’m sure there are more. But that’s enough to illustrate the topic, and probably enough words for a simple blog post.

Friday, January 12, 2018

Zstandard Overview

 I recently realised that, while there is a specification for Zstandard, which describes in great details what is encoded where, there is no “overview” of the format, which would be neither too detailed nor too vague for programmers with a casual interest in data compression to understand its inner working. This blog post is an attempt to correct that.

Introduction

Zstandard is an LZ77-class compressor, which primarily achieves compression by referencing from past data some segment of bytes identical to following bytes. Zstandard features a few other additional capabilities, but it doesn’t change the core formula. This construction offers several advantages, primarily speed related, especially on the decoder side, since a memory copy operation is all it takes to regenerate a bunch of bytes. Moreover, simple pointer arithmetic is enough to locate the reference to copy, which is as frugal as it gets both cpu and memory wise.

Blocks

Zstandard format is block-oriented. It can only start decoding data when a first full block arrives (with the minor exception of uncompressed blocks). But it’s nonetheless stream-capable : any large input is automatically cut into smaller blocks, and decoding starts as soon as the first block arrives.
A block can have any size, up to a maximum of 128 KB. There are multiple reasons for such a limit to exist.
It’s not a concern related to initial latency for the first block, since the format allows this block to have any size up to maximum, so it can be made explicitly small whenever necessary.
The maximum block size puts an upper limit to the amount of data a decoder must handle in a single operation. The limit makes it possible to allocate a number of resources which are guaranteed to be enough for whatever data will follow.
There are also other concerns into the mix, such as the relative weight of headers and descriptors, time spent to build tables, local adaptation to source entropy, etc. 128 KB felt like a good middle ground providing a reasonable answer to all these topics.
It follows that a small source can be compressed into a single block, while larger ones will need multiple blocks.
The organisation of all these blocks into a single content is called a frame.

Frame

A frame will add a number of properties shared by all blocks in the current frame.
To begin with, it can restrict the maximum block size even further : largest maximum is 128 KB, but for a given frame, it can be defined to a value as small as 1 KB.
The frame can tell upfront how much data will be regenerated from its content, which can be useful to pre-allocate a destination buffer.
Most importantly, it can tell how much “past data” must be preserved in memory to decode next block. This is the “window size”, which has direct consequences on buffer sizes.
There are other properties stored there, but it’s not in the scope of this article to describe all of them. Should you wish to know more, feel free to consult the specification.
Once these properties are extracted, the decoding process is fairly straightforward : decompress data block after block.

Literals

A compressed block consists of 2 sections : literals and sequences.
Literals are the “left over” from LZ77 mechanism : remember, LZ77 compress next bytes by referencing an identical suite of bytes in the past. Sometimes, there is simply no such thing. Trivially, this is necessarily the case at the beginning of each source.
In such a case, the algorithm outputs some “raw bytes”. These bytes are not compressed by LZ77, but they generally can be compressed using another technique : Huffman compression.
The principle behind Huffman is quite different : it transforms bytes into prefix codes using variable number of bits, and assign a low number of bits to frequent characters, while sacrificing infrequent characters with more bits.
The Huffman algorithm makes it possible to select the most efficient repartition of prefix codes.
When all bytes are equally present, it’s not possible to compress anything. But it’s generally not the case.
Consider a standard text file using ASCII character set, a whole set of byte values will not be present (>128), and some characters (like ‘e’) are expected to be more common than others (like ‘q’).
This is the kind of irregularity that Huffman can exploit to provide some compression for these left-over bytes. Typical gains range between 20% and 40%.
The literal section can be uncompressed (mostly when it’s very small, since describing a Huffman table cost multiple bytes), or compressed as a single stream of bits, or using multiple (4) streams of bits.
The multi streams strategy has been explained in another post, and is primarily designed for improved decoding speed.
All literals are decompressed into their own buffer. The buffer size is primarily limited by the block size, since in worst case circumstances, LZ77 will fail completely, leaving only Huffman to do the job.

Sequences

Obviously, we expect LZ77 to be useful. Its outcome is described in the second section, called “sequences”.
A block is rebuilt by a succession of sequences.
A “sequence” describes a number of bytes to copy from literals buffer, and then a number of byte to copy from past data, with an associated offset to locate its reference.
These values are of different nature, so they are encoded using 3 separated alphabets.
Each alphabet must be described, and there is a small header for each of them at the beginning of the section.
The compression technique used here is Finite State Entropy, a tANS variant, which offers better compression for dominant symbols.
Dominant symbols lose a lot of precision with Huffman, resulting in a loss of compression. They are unlikely to be present in “left over” literals, but for sequence symbols, the situation is less favourable.
FSE solves this issue, by being able to encode symbols using a fractional number of bits.
If you are interested in how FSE works, there is a series of articles which tries to describe it, but be aware that it’s fairly complex.
All sequence symbols are interleaved in a single bitstream, which must be read backward, due to ANS property of inverting directions for encoding vs decoding.
On 64-bits CPU, a single read operation is generally enough to grab all bits necessary to decode the 3 symbols forming the sequence. All it takes now is to apply the sequence : copy some bytes from the literals buffer, then copy some bytes from the past.
Decode next sequence. Rince, repeat. Stop when there is no more sequence left in the bitstream.
At this stage, whatever remains in the literals buffer is simply copied to complete the block.
And the decoder can move on to the next block.

Window

While decoding literals and sequence is a block-oriented job, that could be achieved in parallel within multiple blocks (expect a multi-threaded version in the future), the LZ copy operation is not.
It depends on previous blocks being already decoded, and is therefore serial in nature.
That’s where the frame header comes into play : it specifies how much past data the decoder must keep around to be able to decode next block.
The specification recommends to keep this value <= 8 MB, though it’s only a recommendation. --ultra levels for example go beyond this limit.
In most cases though, the decoder will not need that much. All levels <= 10 for example, which tend to be preferred due to their speed, require a memory budget <= 2 MB.
As could be guessed, using less memory is also good for speed.

Wrap up

That’s basically it. All these operations form the core of Zstandard compression format. There are a few more little details involved, such as the repeat offset symbols, shortened header with repeat tables and so on, which are described in the specification, but this description should be enough to grab the essence of the decoder.
The encoder is a bit more complex, not least because there are, in fact, multiple encoders.
The format doesn’t impose a single way to find or select references into the past. At every position into the file, there are always multiple possibilities to encode what’s next.
That’s why different strategies exist, providing different speed / compression trade off. Lower level are mapped onto LZ4, being very fast. Upper levels can be very complex, on top of very slow, offering improved compression ratio.
But the decoder remains always the same, preserving its speed properties.