# Bit vectors and variable-length encoding

This is the first post on what is supposed to be an small series about writing compression algoritms in Rust, please take into account that I am no expert in compression so this is the perspective of a learner, if you think there is some improvements or corrections to be done, let me know.

You might be wondering why bit vectors are relevant for compression and so was I until I started reading a little bit about it. So, let's talk about compression.

# What is compression?

Data compression is the process of encoding data in a certain representation using fewer symbols than the original representation, or at least that is what Wikipedia says.

There are several ways to classify the existing compression algorithms (and in a more general sense coding algorithms) but I am going to divide them into two categories: The algoritms that encode the original symbols into other symbols with the same size and the algorithms that have a variable size for the encoded symbols. The first ones are known as fixed length coding algorithms and the second ones are known as variable length coding algorithms. Each type of coding algorithm is better suited in certain applications. But when I started reading about compression, I realized that several popular coding algorithms are actually variable-length algorithms.

The general idea behind variable-length coding is to use the length of the encoded symbols to (hopefully) reduce the space needed to represent a chunk of information. For example, if we are compressing texts in english, "e" is the most frequent letter in the english language, so we should try to assig a short symbol to "e" and in contrast assign a longer symbol to "x" because "x" is the least common letter in english.

Because bytes are the basic unit of information, most of the usual encodings such as UTF-8, UTF-16 and ISO-8859-1 use sequences of bytes to represent characters [1].

However, given that we want to be really size efficient when compressing, it is useful to work at a finer level and handle bits instead. Going back to our example, we could encode the letter "e" into a sequence of bits smaller than a byte to save some space. If your encoding is done correctly, this will actually compress the text even if the letter "x" gets encoded into a sequence of bits larger than a byte.

The problem is that almost every language works at a byte level. Integers, floats, and even booleans are usually represented by one or more bytes. So if you use a vector of booleans to store the bits of your encoding you are using one whole byte for every bit you want to represent.

# Introducing Bit vectors

Basically what we want to do here is to have a vector (like the
`Vec<T>`

collection in
Rust), with the additional feature of using an small amount of space to store a
single bit.

There are several bit vector implementations on https://crates.io. I am completely sure that those are better than mine. But, I just wanted to give it a try and learn the inner workings of this structure instead of just using it.

In order to have a basic vector we need to be able to:

- Push one bit into the vector.
- Pop one bit from the vector.
- Concatenate two vectors.

There are more operations that might be useful but we will go with those first.
In order to be space-efficient, we will use the bits of an integer to store
information. I am going to work with `u8`

integers here but any integer size
should do the trick. We will refer to these integers as *blocks*, i.e., in
this post, one block is the same as one `u8`

.

## The push operation

Now let's assume that we want to push a `1`

into an empty bit vector. Given
that we need to store that one somewhere and we can only use a whole block to
do the trick we should push a new block into the vector with a `1`

somewhere.

```
push(1)
[ ] --------> [ 10000000 ]
```

We are wasting space because we are not using the other 7 bits inside the
`10000000`

block. However, if we need to push another one into the vector, we
are not going to add a whole new block to the vector, we are going to flip one of
the bits of the last block instead.

```
push(1)
[ 10000000 ] --------> [ 11000000 ]
```

That is great because now we can store 8 bits inside each block without wasting
any space, and just add more blocks when the last block is full. But there is a
catch, what happens if we push a `0`

?

```
push(0)
[ 11000000 ] --------> [ 11000000 ]
```

As you can see, pushing a zero at the end does not change the current block, so
we need a way to know what is the position of the last bit used inside a
block, we are going to call this the *offset*. A offset of `0`

means we are
just using the first bit of our block, an offset of `7`

means that we are using
the last bit of a block and, as a consequence, it means that the current block
is full [2]. Let's repeat the same examples as before including the offset as
an small `^`

below the last used bit:

```
push(1)
[ ] --------> [ 10000000 ]
^
push(1)
[ 10000000 ] --------> [ 11000000 ]
^ ^
push(0)
[ 11000000 ] --------> [ 11000000 ]
^ ^
```

Now we have a way to know if we have trailing zeros in our blocks.

## The pop operation

Popping elements from a vector is just the inverse of the push operation, i.e., it changes the last bit to zero and decreases the offset by one. If the offset is zero, it means that the current block will be empty and we can actually remove the whole block from the vector

```
pop()
[ 10000000 ] --------> [ ]
^
pop()
[ 11000000 ] --------> [ 10000000 ]
^ ^
pop()
[ 11000000 ] --------> [ 11000000 ]
^ ^
```

## The concat operation

We can concatenate two vectors by using the push and pop operations that we just defined, but that sounds really inefficient because we are doing 16 operations to move a block. We can do better by moving whole blocks from one vector to the other. Let's do an small example first, in order to do a concatenation like this one

```
[ 10101100 ] ++ [ 01101100 11010000 ] ---> [ 10101101 10110011 01000000 ]
^ ^ ^
```

we can move one block at a time to make it easier

```
[ 10101100 ] ++ [ 01101100 11010000 ] ---> [ 10101101 10110000 ] ++ [ 11010000 ]
^ ^ ^ ^
[ 10101101 10110000 ] ++ [ 11010000 ] ---> [ 10101101 10110011 01000000 ]
^ ^ ^
```

This should be slower than using a `memcpy`

to move the contents of a vector
next to the other one but it is faster than doing it bit by bit, if you think
there is a better way of doing this please let me know.

# The implementation

Now that we have the bigger picture, let's get into the actual implementation details. First we will declare a new struct to store the blocks and the offset

pub struct BitVec { blocks: Vec<u8>, offset: u8, }

we will also import and declare a couple of constants to make our code easier to mantain.

use std::u8::MAX; const BLOCK_SIZE: u8 = 8; const MAX_OFFSET: u8 = BLOCK_SIZE - 1; const LEADING_ONE: u8 = 1.rotate_right(1);

Next, we are going to write a method to create an empty `BitVec`

impl BitVec { pub fn new() -> Self { BitVec { blocks: Vec::new(), offset: MAX_OFFSET, } } ...

The `offset`

field is initialized as the maximum possible offset in order to be
consistent when doing `push`

. This is, if we push something to an empty vector,
we will do the exact same operation as if the vector had it last block full:
Add a new block and set the offset to zero. [3] We are adding some utility
methods to avoid repeating the same conditionals all over our code.

The first method, `is_aligned`

, is used to check if our vector is aligned with
`BLOCK_SIZE`

. We consider that a vector is aligned if all its blocks are full.
However, given that the only block that has a chance of not being full is the
last one, we have just to check if `self.offset`

is maximal or not

pub fn is_aligned(&self) -> bool { self.offset == MAX_OFFSET }

Next, we have a method to compute the length or number of bits of our vector.
If all the blocks were full, the length would be just the number of blocks,
multiplied by `BLOCK_SIZE`

. So we just need to substract from that the number
of unused bits in the last block.

pub fn len(&self) -> usize { self.blocks.len() * (BLOCK_SIZE as usize) - (MAX_OFFSET - self.offset) as usize }

Now, we will write a method to check if a vector is empty. This could be done
by computing its length, but it is easier to just say that a vector is empty if
`self.blocks`

is empty,

pub fn is_empty(&self) -> bool { self.blocks.is_empty() && self.is_aligned() }

the `self.is_aligned()`

inside `is_empty`

is not necessary but its a nice way
to debug some possible errors. If we do everything correctly, `self.offset`

will be maximal if `self.blocks`

is empty.

Before writing the actual push, pop and concat methods, it is important to define the representation of a bit outside the vector. One reasonable choice would be to use booleans given that they only have two values that correspond to having a one-bit or a zero-bit. So we will use booleans as the type of the element to push and as the return type of pop. This is just an API convention, we are not actually storing whole booleans inside our vector.

Now, it is time to write the `push`

method. The first thing to consider is if
we need to push a new block or not. This should only happen if the last block
of the vector is full, so

pub fn push(&mut self, bit: bool) { if self.is_aligned() { // We push a new block } else { // We add the bit to the last block } }

If we actually need to push a new block into `self.blocks`

, the `self.offset`

field must be updated to zero. Next, we need to decide the value for the new
block. If we want to push a one, the block should have a one in the first bit
only (which is the value that `LEADING_ONE`

stores). In the other case, we
should just let the block be completely zero.

pub fn push(&mut self, bit: bool) { if self.is_aligned() { self.offset = 0; if bit { self.blocks.push(LEADING_ONE); } else { self.blocks.push(0) } } else { // We add the bit to the last block } }

If we are in the case when the vector is not full, we must update the last
block and increase `self.offset`

. If we want to push a zero, we just increase
the offset given that the unused bits are zero by default. If we want to push a
one, we need to take the current block and change the bit in the position
`self.offset + 1`

to one. This is easily done by taking `LEADING_ONE`

, shifting
its bytes to the right `self.offset + 1`

positions and adding it to the last
block [4].

pub fn push(&mut self, bit: bool) { if self.is_aligned() { self.offset = 0; if bit { self.blocks.push(LEADING_ONE); } else { self.blocks.push(0); } } else { self.offset += 1; if bit { let block = self.blocks.last_mut().unwrap(); *block += LEADING_ONE >> self.offset; } } }

This last step is a masking
operation. What we did is
building a block that has a one in the exact position we want to update , this
is `LEADING_ONE >> self.offset`

, and then we did a bitwise operation with the
last block. We can use the same technique to change a bit to zero when doing
the pop method. [5]

In order to pop a bit from our vector we need to check if the last block is
empty. If it is we return `None`

, otherwise we need to update this last block.
We will do this check using the `?`

operator

pub fn pop(&mut self) -> Option<bool> { let block = self.blocks.last_mut()?; // Update the block }

If the vector is not empty, we need to read the last bit of `block`

to decide
if we have to return `true`

or `false`

. In order to do this, we will build a
mask with a one in the position of the last byte and then we will do a bitwise
and operation between the mask and the block. This operation returns zero if
and only if the last used bit is a zero.

pub fn pop(&mut self) -> Option<bool> { let block = self.blocks.last_mut()?; let mask = LEADING_ONE >> self.offset; let bit = if (*block & mask) != 0 { true } else { false }; }

Now, if the last bit is a one, we need to remove it by substracting the mask from the block

pub fn pop(&mut self) -> Option<bool> { let block = self.blocks.last_mut()?; let mask = LEADING_ONE >> self.offset; let bit = if (*block & mask) != 0 { *block -= mask; true } else { false }; Some(bit) }

Finally, we need to update `self.offset`

. If the offset is zero, we should pop
the whole block, because we removed the only bit it had, and update the offset
to be maximal. In the other case we just reduce the offset by one.

pub fn pop(&mut self) -> Option<bool> { let block = self.blocks.last_mut()?; let mask = LEADING_ONE >> self.offset; let bit = if (*block & mask) != 0 { *block -= mask; true/ } else { false }; if self.offset == 0 { self.blocks.pop().unwrap(); self.offset = MAX_OFFSET; } else { self.offset -= 1; } Some(bit) }

I believe we have covered enough here, so I will let the implementation of concat for the next post.

# Discussion

I had a hard time implementing this, in fact this is my second time trying to do it, the first time was almost a year ago, but with enough drawings everything can be done. I hope that someone will find this post useful or interesting, I am more than happy to discuss this.

Someone could argue that is easier to just use `Vec<bool>`

to store everything
in memory and then write to disk by grouping booleans and transform those
groups into whole bytes. But I believe this is more efficient nonetheless, you
are wasting less memory and when you need to write to disk you can just dump
the raw blocks.

I do not know how important is the block size for this. It is more or less the
traditional trade-off between speed and space. If you can afford large blocks,
your operations will be faster because you need to remove and add blocks less
often. However, every time you add a new block, you are wasting a lot of space
until you actually are close to filling it. On the other hand, it should be
easy to change the block size using unsafe and transmuting the `blocks`

field,
because all the values related to the size of the block are abstracted into
constants.

When I finish writing about concat I will start with the Shannon-Fano coding, which is one of the first variable-length encoding compression algorithms.

# Footnotes

- [1] In fact all of these are variable-length encodings, but they work at the byte level.
- [2] Remember that here blocks are
`u8`

so`7`

is the largest index that a bit can have inside a block. - [3] A similar statement is true when doing
`pop`

in a vector that is only using one bit of its last block. - [4] We are using
`unwrap`

when extracting`block`

because we know that`is_aligned`

is always true for an empty vector. - [5]
`LEADING_ONE >> self.offset`

never overflows because`self.offset`

is always smaller`BLOCK_SIZE`

.