HTTP/2 Huffman Decoding
07/19/2025
6 minutes
Introduction
HTTP/2 headers can contain string literals that are Huffman encoded for compression. Before processing these headers, servers must first decode them according to RFC 7541.
A string literal in a header consists of three components (a minimal parsing sketch follows the list below):
A single bit flag indicating Huffman encoding
The length of the encoded data
The encoded data itself
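For concreteness, here is a minimal C# sketch of parsing this framing, including the multi-byte length continuation defined in RFC 7541 Section 5.1. The type and method names are illustrative, not an actual ASP.NET Core API:

```csharp
using System;

// A sketch of parsing the string-literal framing per RFC 7541: the H flag
// (1 bit), the length as a 7-bit prefix integer (Section 5.1), then the
// payload bytes. Names are illustrative.
static class StringLiteralSketch
{
    static (bool Huffman, int Length, int PayloadOffset) Parse(byte[] buf, int offset)
    {
        bool huffman = (buf[offset] & 0x80) != 0;   // H bit: 1 = Huffman encoded
        int length = buf[offset] & 0x7F;            // 7-bit length prefix
        offset++;
        if (length == 0x7F)                         // prefix saturated: continuation bytes follow
        {
            int shift = 0;
            byte b;
            do
            {
                b = buf[offset++];
                length += (b & 0x7F) << shift;      // 7 payload bits per continuation byte
                shift += 7;
            } while ((b & 0x80) != 0);              // MSB set means another byte follows
        }
        return (huffman, length, offset);           // payload starts at 'offset'
    }

    static void Main()
    {
        // 0x82: H = 1, length = 2; followed by two bytes of Huffman-encoded data.
        var (huffman, length, payloadOffset) = Parse(new byte[] { 0x82, 0x20, 0xFF }, 0);
        Console.WriteLine($"huffman={huffman} length={length} offset={payloadOffset}");
    }
}
```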
Understanding Huffman Encoding
Huffman encoding is a compression technique that uses variable-length codes based on symbol frequency. In HTTP/2:
Common ASCII characters get shorter codes.
Less frequent characters get longer codes.
The goal is overall data compression.
In some cases, the encoded output might be longer than the input.
The encoding uses a prefix code system, meaning no code can be a prefix of another code. This is achieved through a binary tree structure where:
Each leaf represents an input symbol.
Tree edges are labeled with 0 or 1.
A symbol's code is the path from root to leaf.
HTTP/2 Implementation
HPACK defines specific Huffman codes for ASCII characters, optimized for typical HTTP headers.
Decoding Process
The decoding process involves:
Reconstructing the Huffman tree from RFC 7541's code-symbol pairs.
Walking the tree from root to leaf for each symbol:
Read one bit at a time.
Follow the corresponding path (0 or 1).
Reach a leaf node to obtain the decoded symbol.
Repeating until all data is decoded.
The predefined codes vary from 5 to 30 bits in length. This means an encoded input may grow to ~4 times its length in the worst case or shrink to 5/8 of its length in the best case. Because the concatenated codes may not align to a byte boundary, the last byte is padded with 1 bits (the most significant bits of the EOS code).
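To make the tree walk concrete, here is a minimal C# sketch that builds a tree from a few 5-bit codes of RFC 7541 Appendix B and decodes the bytes 0x20 0xFF, which encode "ca" (00100 00011) followed by six padding 1 bits. The Node type and the padding handling are illustrative:

```csharp
using System;

// Bit-by-bit tree decoding over a small subset of the HPACK table.
class HuffmanTreeSketch
{
    class Node { public Node Zero, One; public char Symbol; } // '\0' marks an internal node

    static Node Build((char Symbol, uint Code, int Length)[] codes)
    {
        var root = new Node();
        foreach (var (symbol, code, length) in codes)
        {
            var node = root;
            for (int i = length - 1; i >= 0; i--)            // walk from the code's MSB
                node = ((code >> i) & 1) != 0
                    ? (node.One ??= new Node())
                    : (node.Zero ??= new Node());
            node.Symbol = symbol;                            // leaf stores the symbol
        }
        return root;
    }

    static void Main()
    {
        // A few 5-bit codes from RFC 7541 Appendix B.
        var root = Build(new[] { ('a', 0b00011u, 5), ('c', 0b00100u, 5), ('e', 0b00101u, 5) });
        byte[] input = { 0x20, 0xFF };                       // "ca" = 00100 00011 + 1-padding
        var node = root;
        for (int bit = 0; bit < input.Length * 8; bit++)
        {
            bool one = (input[bit / 8] & (0x80 >> (bit % 8))) != 0;
            node = one ? node.One : node.Zero;
            if (node == null) break;                         // fell off the subset: all-ones padding
            if (node.Symbol != '\0') { Console.Write(node.Symbol); node = root; }
        }
        // prints: ca
    }
}
```

The tree for the full 257-symbol table (256 octets plus EOS) is larger, and a real decoder validates the padding, but the walk itself is identical.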
ASP.NET Core Solution
ASP.NET Core uses Huffman.cs from System.Net.Http to decode incoming headers. At the time of writing this blog post (.NET 9), ASP.NET Core does not support returning Huffman-encoded headers in the response.
The Huffman implementation generates a lookup table of ushort (16-bit) values representing the Huffman tree. The first bit indicates whether an entry is an internal node or a leaf node. For internal nodes, the next 7 bits are an index pointing to another lookup table. For leaf nodes, the next 7 bits give the number of bits consumed by the code, and the final 8 bits hold the decoded symbol.
A comment in Huffman.cs describes it as:
```
// -----------------------------------------------------------------
//  15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
// +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
// | 1 |  next_lookup_table_index  |           not_used            |
// +---+---------------------------+-------------------------------+
// or
// +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
// | 0 |    number_of_used_bits    |             octet             |
// +---+---------------------------+-------------------------------+
```
Overall, the idea for decoding a value is the following (with some edge cases omitted for brevity): the source is read byte by byte. The algorithm bit-shifts the source value so that it matches the byte boundaries required by the lookup table, then looks up the table entry. If the entry is a leaf, the symbol is returned. Otherwise, it reads further bytes and continues in the lookup table referenced by the previous entry until a symbol is found.
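A hypothetical sketch of such a nested table walk is shown below. The entry format follows the comment above, but the table contents and the exact bookkeeping are illustrative and differ from the real Huffman.cs:

```csharp
// Hypothetical walk over nested 8-bit lookup tables; not the real implementation.
static class TableWalkSketch
{
    // 'tables' is assumed to be a pregenerated array of 256-entry tables, with
    // tables[0] as the root; 'window' holds input bits, MSB-aligned.
    static int DecodeSymbol(ushort[][] tables, ulong window, out int bitsConsumed)
    {
        int table = 0;
        bitsConsumed = 0;
        while (true)
        {
            // Index the current table with the next 8 bits of input.
            int index = (int)((window >> (56 - bitsConsumed)) & 0xFF);
            ushort entry = tables[table][index];
            if ((entry & 0x8000) != 0)
            {
                // Internal node: bits 8..14 select the next lookup table.
                table = (entry >> 8) & 0x7F;
                bitsConsumed += 8;
            }
            else
            {
                // Leaf: bits 8..14 give the bits consumed at this final level,
                // the low 8 bits are the decoded octet.
                bitsConsumed += (entry >> 8) & 0x7F;
                return entry & 0xFF;
            }
        }
    }
}
```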
Using LZCNT
It might seem reasonable to read the source byte by byte, because in the optimistic case every symbol is encoded in one byte or less. However, this does not avoid bit shifting and dealing with two bytes at the same time. For example, for the input "ca" the Huffman code would be 00100 00011 plus padding. When decoding the second symbol 'a', not only are some bits of the first byte needed, but also a portion of the second byte, excluding the padding.
A different approach is to count the number of consecutive 1 bits until the first 0 bit is reached. Modern processors are equipped with the POPCNT and LZCNT instructions: POPCNT returns the number of 1 bits, while LZCNT returns the number of leading 0 bits. Neither is exactly what is needed. However, the input can be bitwise negated, so that all 0 bits become 1 and all 1 bits become 0; applying LZCNT to the negated value then yields the number of consecutive leading 1 bits of the original. All bits must be loaded into a type that supports the LZCNT instruction, and ulong qualifies. It stores 64 bits, which is more than twice the length of the longest Huffman code, so it always holds at least one complete (valid) code.
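As a quick illustration, .NET exposes this through BitOperations.LeadingZeroCount, which compiles down to LZCNT on x64 hardware that supports it (the sample bit pattern below is arbitrary):

```csharp
using System;
using System.Numerics;

// Count leading 1 bits by negating the value first: the leading 0 bits of
// ~window are exactly the leading 1 bits of window.
ulong window = 0b11111111_11011000UL << 48;                  // code bits MSB-aligned
int leadingOnes = BitOperations.LeadingZeroCount(~window);
Console.WriteLine(leadingOnes);                              // prints 10
```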
Finally, a lookup table similar to the one used by .NET 9 must be generated. Notice that after a given number of consecutive leading 1 bits followed by a 0 bit, the remaining bits of the Huffman codes again form prefix codes. Hence a table of tables can be generated, where the outer index is the number of consecutive leading 1 bits, and the remaining bits of the code index into the referenced inner table, revealing the symbol and its code length.
The algorithm reads the next 8 bytes of the input (when at least 8 bytes are available) into a ulong value, reversing the endianness if necessary. It then shifts the value so that the MSB of the current code matches the byte boundary, and applies LZCNT to the bitwise-negated ulong. This yields a tableIndex value, and the following byte indexes into one of the 256 entries of the selected table.
However, the remaining bits of a code might not fill a full byte. For example, for the code 11111111 11011 followed by the next code, tableIndex would be 10 and the remainder byte would be 011 plus some arbitrary bits belonging to the start of the next code. The pregenerated table therefore returns a valid symbol and code length at every such index (some edge cases set aside), which means the same symbol may appear at multiple indexes of a table.
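Putting the pieces together, the following is a minimal, illustrative C# sketch of this scheme. It generates the table of tables from a small subset of 5-bit HPACK codes and decodes with LZCNT, indexing each inner table with the 8 bits that begin at the terminating 0 bit; the table layout, names, and end-of-input handling are assumptions, and the edge cases mentioned above (very long codes, malformed input) are omitted:

```csharp
using System;
using System.Numerics;
using System.Text;

// LZCNT-based sparse decoding over a small subset of the HPACK table.
static class SparseHuffmanSketch
{
    // (symbol, code, total bit length) for a few codes from RFC 7541 Appendix B.
    static readonly (char Symbol, uint Code, int Length)[] Codes =
    {
        ('0', 0b00000, 5), ('1', 0b00001, 5), ('2', 0b00010, 5),
        ('a', 0b00011, 5), ('c', 0b00100, 5), ('e', 0b00101, 5),
    };

    // Tables[k] serves codes starting with exactly k one bits; the inner index
    // is the next 8 bits of input, beginning at the terminating 0 bit.
    static readonly (char Symbol, int Length)?[][] Tables = BuildTables();

    static (char Symbol, int Length)?[][] BuildTables()
    {
        var tables = new (char Symbol, int Length)?[65][];
        foreach (var (symbol, code, length) in Codes)
        {
            int k = 0; // number of leading 1 bits in the code
            while (k < length && ((code >> (length - 1 - k)) & 1) == 1) k++;
            tables[k] ??= new (char Symbol, int Length)?[256];
            int restLen = length - k;              // bits from the 0 bit onwards
            uint rest = code & ((1u << restLen) - 1);
            // Fill every index whose top restLen bits equal 'rest', so the
            // trailing bits (the start of the next code) do not matter.
            // Assumes restLen <= 8, which holds for this subset.
            for (uint fill = 0; fill < (1u << (8 - restLen)); fill++)
                tables[k][(rest << (8 - restLen)) | fill] = (symbol, length);
        }
        return tables;
    }

    static string Decode(byte[] src)
    {
        var result = new StringBuilder();
        int bitOffset = 0, totalBits = src.Length * 8;
        while (bitOffset < totalBits)
        {
            ulong window = LoadWindow(src, bitOffset);
            int lead = BitOperations.LeadingZeroCount(~window); // leading 1 bits
            if (lead >= totalBits - bitOffset) break;           // all-ones padding
            int idx = (int)((window >> (56 - lead)) & 0xFF);    // 8 bits after the ones
            var entry = Tables[lead]?[idx]
                ?? throw new InvalidOperationException("invalid code");
            result.Append(entry.Symbol);
            bitOffset += entry.Length;
        }
        return result.ToString();
    }

    // Loads 8 bytes starting at the byte containing bitOffset (padding past
    // the end with 1 bits) and shifts so the bit at bitOffset becomes the MSB.
    static ulong LoadWindow(byte[] src, int bitOffset)
    {
        ulong w = 0;
        int byteIndex = bitOffset / 8;
        for (int i = 0; i < 8; i++)
            w = (w << 8) | (byteIndex + i < src.Length ? src[byteIndex + i] : (byte)0xFF);
        return w << (bitOffset % 8);
    }

    static void Main()
    {
        // "ca" = 00100 00011 + six 1 bits of padding = 0x20 0xFF.
        Console.WriteLine(Decode(new byte[] { 0x20, 0xFF })); // prints "ca"
    }
}
```

Filling each inner table at every index that shares the same leading bits is what allows a single 256-entry lookup per symbol: whatever bits of the next code happen to follow, they resolve to the same entry.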
The benefit of this approach is that LZCNT is a fast operation on most CPUs: on a modern x64 Intel CPU it executes with a latency in the range [0;3] cycles and a throughput of 1.00 instructions per cycle.
Based on my measurements, a ~15% performance improvement can be achieved with this approach on a test machine, using inputs drawn from the ASCII symbol range 32-150:
```
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
12th Gen Intel Core i7-1255U, 1 CPU, 12 logical and 10 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method            | Mean     | Error     | StdDev    |
|------------------ |---------:|----------:|----------:|
| HuffmanOriginal   | 6.561 us | 0.1254 us | 0.1231 us |
| HuffmanSparseFull | 5.634 us | 0.0315 us | 0.0263 us |
```
Conclusion
Decoding Huffman codes does not seem to lend itself well to vectorization, because the codes do not have a fixed, well-known length. If there is a reasonable way to use SIMD here, I would be happy to learn about it. Algorithms can balance memory and CPU utilization, trading memory space for time complexity or vice versa. In this post, I have shown two approaches: the one used by ASP.NET Core and a custom solution.