This is the first post on what is supposed to be an small series about writing compression algoritms in Rust, please take into account that I am no expert in compression so this is the perspective of a learner, if you think there is some improvements or corrections to be done, let me know.

You might be wondering why bit vectors are relevant for compression and so was I until I started reading a little bit about it. So, let's talk about compression.

What is compression?

Data compression is the process of encoding data in a certain representation using fewer symbols than the original representation, or at least that is what Wikipedia says.

There are several ways to classify the existing compression algorithms (and in a more general sense coding algorithms) but I am going to divide them into two categories: The algoritms that encode the original symbols into other symbols with the same size and the algorithms that have a variable size for the encoded symbols. The first ones are known as fixed length coding algorithms and the second ones are known as variable length coding algorithms. Each type of coding algorithm is better suited in certain applications. But when I started reading about compression, I realized that several popular coding algorithms are actually variable-length algorithms.

The general idea behind variable-length coding is to use the length of the encoded symbols to (hopefully) reduce the space needed to represent a chunk of information. For example, if we are compressing texts in english, "e" is the most frequent letter in the english language, so we should try to assig a short symbol to "e" and in contrast assign a longer symbol to "x" because "x" is the least common letter in english.

Because bytes are the basic unit of information, most of the usual encodings such as UTF-8, UTF-16 and ISO-8859-1 use sequences of bytes to represent characters [1].

However, given that we want to be really size efficient when compressing, it is useful to work at a finer level and handle bits instead. Going back to our example, we could encode the letter "e" into a sequence of bits smaller than a byte to save some space. If your encoding is done correctly, this will actually compress the text even if the letter "x" gets encoded into a sequence of bits larger than a byte.

The problem is that almost every language works at a byte level. Integers, floats, and even booleans are usually represented by one or more bytes. So if you use a vector of booleans to store the bits of your encoding you are using one whole byte for every bit you want to represent.

Introducing Bit vectors

Basically what we want to do here is to have a vector (like the Vec<T> collection in Rust), with the additional feature of using an small amount of space to store a single bit.

There are several bit vector implementations on https://crates.io. I am completely sure that those are better than mine. But, I just wanted to give it a try and learn the inner workings of this structure instead of just using it.

In order to have a basic vector we need to be able to:

  • Push one bit into the vector.
  • Pop one bit from the vector.
  • Concatenate two vectors.

There are more operations that might be useful but we will go with those first. In order to be space-efficient, we will use the bits of an integer to store information. I am going to work with u8 integers here but any integer size should do the trick. We will refer to these integers as blocks, i.e., in this post, one block is the same as one u8.

The push operation

Now let's assume that we want to push a 1 into an empty bit vector. Given that we need to store that one somewhere and we can only use a whole block to do the trick we should push a new block into the vector with a 1 somewhere.

              push(1)
         [ ] --------> [ 10000000 ]

We are wasting space because we are not using the other 7 bits inside the 10000000 block. However, if we need to push another one into the vector, we are not going to add a whole new block to the vector, we are going to flip one of the bits of the last block instead.

              push(1)
[ 10000000 ] --------> [ 11000000 ]

That is great because now we can store 8 bits inside each block without wasting any space, and just add more blocks when the last block is full. But there is a catch, what happens if we push a 0?

              push(0)
[ 11000000 ] --------> [ 11000000 ]

As you can see, pushing a zero at the end does not change the current block, so we need a way to know what is the position of the last bit used inside a block, we are going to call this the offset. A offset of 0 means we are just using the first bit of our block, an offset of 7 means that we are using the last bit of a block and, as a consequence, it means that the current block is full [2]. Let's repeat the same examples as before including the offset as an small ^ below the last used bit:

              push(1)
         [ ] --------> [ 10000000 ]
                         ^
              push(1)
[ 10000000 ] --------> [ 11000000 ]
  ^                       ^
              push(0)
[ 11000000 ] --------> [ 11000000 ]
   ^                       ^

Now we have a way to know if we have trailing zeros in our blocks.

The pop operation

Popping elements from a vector is just the inverse of the push operation, i.e., it changes the last bit to zero and decreases the offset by one. If the offset is zero, it means that the current block will be empty and we can actually remove the whole block from the vector

               pop()
[ 10000000 ] --------> [ ]
  ^                     
               pop()
[ 11000000 ] --------> [ 10000000 ]
   ^                     ^
               pop()
[ 11000000 ] --------> [ 11000000 ]
    ^                     ^

The concat operation

We can concatenate two vectors by using the push and pop operations that we just defined, but that sounds really inefficient because we are doing 16 operations to move a block. We can do better by moving whole blocks from one vector to the other. Let's do an small example first, in order to do a concatenation like this one

[ 10101100 ] ++ [ 01101100 11010000 ] ---> [ 10101101 10110011 01000000 ]
       ^                       ^                                 ^

we can move one block at a time to make it easier

[ 10101100 ] ++ [ 01101100 11010000 ] ---> [ 10101101 10110000 ] ++ [ 11010000 ] 
       ^                       ^                           ^              ^
[ 10101101 10110000 ] ++ [ 11010000 ] ---> [ 10101101 10110011 01000000 ]
                ^              ^                                 ^

This should be slower than using a memcpy to move the contents of a vector next to the other one but it is faster than doing it bit by bit, if you think there is a better way of doing this please let me know.

The implementation

Now that we have the bigger picture, let's get into the actual implementation details. First we will declare a new struct to store the blocks and the offset

pub struct BitVec {
    blocks: Vec<u8>,
    offset: u8,
}

we will also import and declare a couple of constants to make our code easier to mantain.

use std::u8::MAX;

const BLOCK_SIZE: u8 = 8;
const MAX_OFFSET: u8 = BLOCK_SIZE - 1;
const LEADING_ONE: u8 = 1.rotate_right(1);

Next, we are going to write a method to create an empty BitVec

impl BitVec {
    pub fn new() -> Self {
        BitVec {
            blocks: Vec::new(),
            offset: MAX_OFFSET,
        }
    }
    ...

The offset field is initialized as the maximum possible offset in order to be consistent when doing push. This is, if we push something to an empty vector, we will do the exact same operation as if the vector had it last block full: Add a new block and set the offset to zero. [3] We are adding some utility methods to avoid repeating the same conditionals all over our code.

The first method, is_aligned, is used to check if our vector is aligned with BLOCK_SIZE. We consider that a vector is aligned if all its blocks are full. However, given that the only block that has a chance of not being full is the last one, we have just to check if self.offset is maximal or not

pub fn is_aligned(&self) -> bool {
    self.offset == MAX_OFFSET
}

Next, we have a method to compute the length or number of bits of our vector. If all the blocks were full, the length would be just the number of blocks, multiplied by BLOCK_SIZE. So we just need to substract from that the number of unused bits in the last block.

pub fn len(&self) -> usize {
    self.blocks.len() * (BLOCK_SIZE as usize) - (MAX_OFFSET - self.offset) as usize
}

Now, we will write a method to check if a vector is empty. This could be done by computing its length, but it is easier to just say that a vector is empty if self.blocks is empty,

pub fn is_empty(&self) -> bool {
    self.blocks.is_empty() && self.is_aligned()
}

the self.is_aligned() inside is_empty is not necessary but its a nice way to debug some possible errors. If we do everything correctly, self.offset will be maximal if self.blocks is empty.

Before writing the actual push, pop and concat methods, it is important to define the representation of a bit outside the vector. One reasonable choice would be to use booleans given that they only have two values that correspond to having a one-bit or a zero-bit. So we will use booleans as the type of the element to push and as the return type of pop. This is just an API convention, we are not actually storing whole booleans inside our vector.

Now, it is time to write the push method. The first thing to consider is if we need to push a new block or not. This should only happen if the last block of the vector is full, so

pub fn push(&mut self, bit: bool) {
    if self.is_aligned() {
        // We push a new block
    } else {
        // We add the bit to the last block
    }
}

If we actually need to push a new block into self.blocks, the self.offset field must be updated to zero. Next, we need to decide the value for the new block. If we want to push a one, the block should have a one in the first bit only (which is the value that LEADING_ONE stores). In the other case, we should just let the block be completely zero.

pub fn push(&mut self, bit: bool) {
    if self.is_aligned() {
        self.offset = 0;
        if bit {
            self.blocks.push(LEADING_ONE);
        } else {
            self.blocks.push(0)
        }
    } else {
        // We add the bit to the last block
    }
}

If we are in the case when the vector is not full, we must update the last block and increase self.offset. If we want to push a zero, we just increase the offset given that the unused bits are zero by default. If we want to push a one, we need to take the current block and change the bit in the position self.offset + 1 to one. This is easily done by taking LEADING_ONE, shifting its bytes to the right self.offset + 1 positions and adding it to the last block [4].

pub fn push(&mut self, bit: bool) {
    if self.is_aligned() {
        self.offset = 0;
        if bit {
            self.blocks.push(LEADING_ONE);
        } else {
            self.blocks.push(0);
        }
    } else {
        self.offset += 1;
        if bit {
            let block = self.blocks.last_mut().unwrap();
            *block += LEADING_ONE >> self.offset;
        }
    }
}

This last step is a masking operation. What we did is building a block that has a one in the exact position we want to update , this is LEADING_ONE >> self.offset, and then we did a bitwise operation with the last block. We can use the same technique to change a bit to zero when doing the pop method. [5]

In order to pop a bit from our vector we need to check if the last block is empty. If it is we return None, otherwise we need to update this last block. We will do this check using the ? operator

pub fn pop(&mut self) -> Option<bool> {
    let block = self.blocks.last_mut()?;
    // Update the block
}

If the vector is not empty, we need to read the last bit of block to decide if we have to return true or false. In order to do this, we will build a mask with a one in the position of the last byte and then we will do a bitwise and operation between the mask and the block. This operation returns zero if and only if the last used bit is a zero.

pub fn pop(&mut self) -> Option<bool> {
    let block = self.blocks.last_mut()?;

    let mask = LEADING_ONE >> self.offset;

    let bit = if (*block & mask) != 0 {
        true
    } else {
        false
    };
}

Now, if the last bit is a one, we need to remove it by substracting the mask from the block

pub fn pop(&mut self) -> Option<bool> {
    let block = self.blocks.last_mut()?;

    let mask = LEADING_ONE >> self.offset;

    let bit = if (*block & mask) != 0 {
        *block -= mask;
        true
    } else {
        false
    };
    
    Some(bit)
}

Finally, we need to update self.offset. If the offset is zero, we should pop the whole block, because we removed the only bit it had, and update the offset to be maximal. In the other case we just reduce the offset by one.

pub fn pop(&mut self) -> Option<bool> {
    let block = self.blocks.last_mut()?;

    let mask = LEADING_ONE >> self.offset;

    let bit = if (*block & mask) != 0 {
        *block -= mask;
        true/
    } else {
        false
    };

    if self.offset == 0 {
        self.blocks.pop().unwrap();
        self.offset = MAX_OFFSET;
    } else {
        self.offset -= 1;
    }

    Some(bit)
}

I believe we have covered enough here, so I will let the implementation of concat for the next post.

Discussion

I had a hard time implementing this, in fact this is my second time trying to do it, the first time was almost a year ago, but with enough drawings everything can be done. I hope that someone will find this post useful or interesting, I am more than happy to discuss this.

Someone could argue that is easier to just use Vec<bool> to store everything in memory and then write to disk by grouping booleans and transform those groups into whole bytes. But I believe this is more efficient nonetheless, you are wasting less memory and when you need to write to disk you can just dump the raw blocks.

I do not know how important is the block size for this. It is more or less the traditional trade-off between speed and space. If you can afford large blocks, your operations will be faster because you need to remove and add blocks less often. However, every time you add a new block, you are wasting a lot of space until you actually are close to filling it. On the other hand, it should be easy to change the block size using unsafe and transmuting the blocks field, because all the values related to the size of the block are abstracted into constants.

When I finish writing about concat I will start with the Shannon-Fano coding, which is one of the first variable-length encoding compression algorithms.

Footnotes

  • [1] In fact all of these are variable-length encodings, but they work at the byte level.
  • [2] Remember that here blocks are u8 so 7 is the largest index that a bit can have inside a block.
  • [3] A similar statement is true when doing pop in a vector that is only using one bit of its last block.
  • [4] We are using unwrap when extracting block because we know that is_aligned is always true for an empty vector.
  • [5] LEADING_ONE >> self.offset never overflows because self.offset is always smaller BLOCK_SIZE.