Saturday, February 2, 2013

Varints

Variable-length encoding of integers uses less space for small integers and more for big ones. This comes in handy in a few places. For instance, if you are encoding a string, you have the choice of using a null-terminated string or a length-prefixed string. In the former case, you only use 1 extra byte, but end up with the restriction of not being able to include the 0 byte in your string. If you go the length-prefixed way, then you can use 4 bytes to encode the string length. In this alternative, the 0 byte can appear in the string without any issue. However, for short strings, using 4 bytes for the length is very wasteful. To solve this, one can use variable-length encoding for the string length, which makes it possible to spend only 1 byte on the length of a short string.

Another useful scenario is when you have a single integer type in your system, say a 64-bit signed integer, but most of the time the integers you represent fall in a much smaller range. Again, variable-length encoding will minimize the size of the encoding, in case you want to write things to disk or send data over the network. In fact, Google Protocol Buffers uses a variable-length encoding for a similar purpose. Wikipedia has an article on variable-length encoding of integers, under the name variable-length quantity. Apparently the scheme was popularized by the MIDI file format. IBM's SPL language uses a simpler version of variable-length encoding for serializing its string types.

I want to open a parenthesis here and talk briefly about type names in programming languages. Until C99, there was a lot of confusion in C about the sizes of the types. How big is a long? Well, it depends on the system. Historically, not all systems supported all sizes, so the sizes of the different types were not fixed. With C99, stdint.h defines types such as int64_t and uint64_t, which remove the ambiguity (there is also the craziness of the int_leastX_t and int_fastX_t stuff, and in fact uintX_t is optional, but let's not get into that). What surprises me is that Java has decided to call its types byte, short, int, long, etc. I much prefer int8, int16, int32, int64. If short is defined as a signed integer with a size of 16 bits (which means the size is not an implementation detail), why not call it int16, which makes the size easier to remember as well?

Going back to variable-length encoding, the idea is to divide the integer into 7-bit blocks and use 1 byte (8 bits) to encode each block. The all-zero most significant blocks need not be encoded, which saves space for small-valued integers. When encoding, we go from the least significant block to the most significant block, and set the most significant bit of each block to 1, except for the last one, which gets 0 as its most significant bit. As a result, a most significant bit of 1 marks a block as a continuation block, whereas a 0 marks the termination block. Here is an example:

89657 = 10101111000111001

Divide into 7-bit blocks:

 101 - 0111100 - 0111001

Set the most significant bit to 1 for all but the leftmost block, which is padded with 0's instead:

00000101 - 10111100 - 10111001

Recall that we encode starting from the least significant block, so we get:

10111001 - 10111100 - 00000101

To decode, we apply the following procedure. If the first block starts with a 0, we have a single block. Otherwise, we collect all the consecutive blocks that start with 1, until we reach the block that starts with a 0 - the termination block. We then strip the most significant bits from each block and concatenate. In the running example, we have:

Strip the most significant bit:

0111001 - 0111100 - 0000101

Reverse block order:

0000101 - 0111100 - 0111001

And concatenate:

10101111000111001 = 89657

This works for unsigned types. For signed types, the negative numbers will have their most significant bit set, so -1 will always take the maximum number of bytes. If you have an application where you want the encoding to take up less space when the absolute value of the number is small, then you can apply a mapping like the following:

0 -> 0
-1 -> 1
1 -> 2
-2 -> 3
2 -> 4
...
-2^(n-1)+1 -> 2^n-3
2^(n-1)-1 -> 2^n-2 
-2^(n-1) -> 2^n-1 

Code that is not reusable is not worth writing. So, below is the C++ code for this, which I hope you will find reusable:

#include <cstdint>
#include <cstdlib>


class VarintUtils {
private:
    template <typename T>
    static char * encodeUnsigned(char * buf, T value) {
        do {
            // take the least significant 7-bit block
            uint8_t byte = static_cast<uint8_t>(value) & 0x7F;
            value >>= 7;
            if (value)
                byte |= 0x80; // continuation bit: more blocks follow
            *(buf++) = byte;
        } while (value!=0);
        return buf;
    }

    template <typename T>
    static char * encodeSigned(char * buf, T value) {
        size_t const size = sizeof(T)<<3; // size of T in bits
        // zigzag mapping: 0,-1,1,-2,... -> 0,1,2,3,...
        // (shift in the unsigned domain to avoid overflow on negative values)
        T newValue = (toUnsigned(value) << 1) ^ toUnsigned(value >> (size-1));
        return encodeUnsigned(buf, toUnsigned(newValue));
    }

    template <typename T>
    static char * decodeUnsigned(char * buf, T & value) {
        value = 0;
        size_t shift = 0;
        uint8_t byte;
        do {
            byte = *(buf++);
            // strip the continuation bit and place the 7-bit block
            value |= (static_cast<T>(byte & 0x7F) << shift);
            shift += 7;
        } while ((byte&0x80)!=0); // stop at the termination block
        return buf;
    }

    template <typename T>
    static char * decodeSigned(char * buf, T & value) {
        buf = decodeUnsigned(buf, toUnsignedRef(value));
        // invert the zigzag mapping
        T half = toUnsigned(value) >> 1;
        value = (value & 1) ? ~half : half;
        return buf;
    }

    // conversion from signed to unsigned
    inline static uint32_t toUnsigned(int32_t value) { return value; }
    inline static uint64_t toUnsigned(int64_t value) { return value; }
    inline static uint32_t & toUnsignedRef(int32_t & value)
      { return reinterpret_cast<uint32_t &>(value); }
    inline static uint64_t & toUnsignedRef(int64_t & value)
      { return reinterpret_cast<uint64_t &>(value); }

public:
    inline static char * encode(char * buf, uint32_t value) {
        return encodeUnsigned(buf, value);
    }
    inline static char * encode(char * buf, uint64_t value) {
        return encodeUnsigned(buf, value);
    }
    inline static char * encode(char * buf, int32_t value) {
        return encodeSigned(buf, value);
    }
    inline static char * encode(char * buf, int64_t value) {
        return encodeSigned(buf, value);
    }
    inline static char * decode(char * buf, uint32_t & value) {
        return decodeUnsigned(buf, value);
    }
    inline static char * decode(char * buf, uint64_t & value) {
        return decodeUnsigned(buf, value);
    }
    inline static char * decode(char * buf, int32_t & value) {
        return decodeSigned(buf, value);
    }
    inline static char * decode(char * buf, int64_t & value) {
        return decodeSigned(buf, value);
    }
};

Interestingly enough, you can perform binary search on a variable-length encoded buffer. You might wonder why this is useful. Well, imagine you have a large graph stored using the adjacency list layout. Further assume that you have remapped the vertex ids so that the ones with the highest degree have the lowest ids. Now you can get significant compression if you encode adjacency lists using the variable-length encoding of vertex ids. However, once you do that, how are you going to find out whether a given vertex is a neighbor of another vertex? You would have to resort to a linear search. Well, it is relatively easy to implement binary search on a variable-length encoded list (assuming, of course, that it is sorted).

The idea is to readjust the location of the mid byte so that it aligns with the start of a variable length encoded value. Here is the code:


class VarintUtils {
    ...
private:
    template <typename T>
    static char * binarySearchGeneric(char * start, char * end, T value) {
        char * origin = start;
        T refval;
        while (start < end) {
            // move mid back to the start of the varint it falls into
            char * mid = adjust(start + (end-start) / 2, origin);
            char * midNext = decode(mid, refval);
            if (refval==value)
                return mid;
            if (refval>value)
                end = mid;
            else
                start = midNext;
        }
        return NULL;
    }

    static char * adjust(char * bufCurr, char * bufStart) {
        if (isFinalByte(*bufCurr)) {
            // already at a value boundary if the previous byte ends a value
            if (bufCurr==bufStart || isFinalByte(*(bufCurr-1)))
                return bufCurr;
            bufCurr--; // we are at the last byte of a multi-byte value
        }
        // walk back to the first byte of the current value
        while (bufCurr!=bufStart && !isFinalByte(*(bufCurr-1)))
            bufCurr--;
        return bufCurr;
    }

    inline static bool isFinalByte(char c) {
        return (0x80 & c)==0;
    }
public:
    inline static char * binarySearch(char * bufferStart, char * bufferEnd, uint32_t value) {
        return binarySearchGeneric(bufferStart, bufferEnd, value);
    }
    inline static char * binarySearch(char * bufferStart, char * bufferEnd, uint64_t value) {
        return binarySearchGeneric(bufferStart, bufferEnd, value);
    }
    inline static char * binarySearch(char * bufferStart, char * bufferEnd, int32_t value) {
        return binarySearchGeneric(bufferStart, bufferEnd, value);
    }
    inline static char * binarySearch(char * bufferStart, char * bufferEnd, int64_t value) {
        return binarySearchGeneric(bufferStart, bufferEnd, value);
    }
};




Well, one weakness of the above implementation is that the midpoint is determined by the number of bytes between the start and the end. However, towards the end of the list the numbers are larger and thus may take more bytes, so the midpoint in bytes is not the midpoint in values. Still, for long lists, this provides a good speedup compared to a linear scan.

That's all for today...








