CS3851 -- Lab 3: Burrows-Wheeler Data Compression

Outcomes Addressed

list at least three engineering applications of algorithmic design and analysis.

Team Assignment

You must work with your assigned partner.

Overview

In this assignment you will implement the Burrows-Wheeler data (de)compression algorithm. This algorithm is used by the Unix bzip2 compression utility and often achieves higher compression ratios than gzip and PKZIP. Implementation of the compression algorithm involves three sequential steps:

Burrows-Wheeler Transform (BWT): Reorders the symbols in a file to form long sequences of identical symbols. For example, www.t-a-y-l-o-r.com! becomes tyloarw-.-o-c-.ww!-m where each symbol is a character and ! is the EOF character.
Move To Front (MTF) transform: Rewrite the sequence of symbols as a sequence of indices of a dynamic list. For example, tyloarw-.-o-c-.ww!-m becomes 8a6869a7815191340b3b (in hex) where -.aclmortw! is used as the ordering of the characters in the input stream.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols. See below for a detailed example.

The uncompression algorithm applies the inverse operations in the opposite order. I.e., do Huffman decoding, then inverse MTF transform, and finally inverse BWT.

Details

Burrows-Wheeler Transform

The BWT rearranges the order of the symbols in the input stream such that

Identical symbols tend to be clustered together
The transformation is invertible

The goal is to transform the stream into a sequence of symbols that are more easily compressed by entropy encoding. Suppose the input stream consists of English text. Intuitively, the letters etter are more likely to be preceded by a b or l than most other characters. The BWT attempts to group the likely preceding characters (or, more generally, symbols) together.

The BWT is implemented as follows:

Create a list of all possible rotations of the input stream
Let each rotation be one row in a large table
Sort the rows of the table
The BWT'd stream is the rightmost column of the table

For example, consider the following input stream: www.t-a-y-l-o-r.com! where ! indicates the end of the stream. The twenty rotations of the input stream are:

www.t-a-y-l-o-r.com!
!www.t-a-y-l-o-r.com
m!www.t-a-y-l-o-r.co
om!www.t-a-y-l-o-r.c
com!www.t-a-y-l-o-r.
.com!www.t-a-y-l-o-r
r.com!www.t-a-y-l-o-
-r.com!www.t-a-y-l-o
o-r.com!www.t-a-y-l-
-o-r.com!www.t-a-y-l
l-o-r.com!www.t-a-y-
-l-o-r.com!www.t-a-y
y-l-o-r.com!www.t-a-
-y-l-o-r.com!www.t-a
a-y-l-o-r.com!www.t-
-a-y-l-o-r.com!www.t
t-a-y-l-o-r.com!www.
.t-a-y-l-o-r.com!www
w.t-a-y-l-o-r.com!ww
ww.t-a-y-l-o-r.com!w

Sorting them (using the following sorting order: -.aclmortwy!) results in the following table:

-a-y-l-o-r.com!www.t
-l-o-r.com!www.t-a-y
-o-r.com!www.t-a-y-l
-r.com!www.t-a-y-l-o
-y-l-o-r.com!www.t-a
.com!www.t-a-y-l-o-r
.t-a-y-l-o-r.com!www
a-y-l-o-r.com!www.t-
com!www.t-a-y-l-o-r.
l-o-r.com!www.t-a-y-
m!www.t-a-y-l-o-r.co
o-r.com!www.t-a-y-l-
om!www.t-a-y-l-o-r.c
r.com!www.t-a-y-l-o-
t-a-y-l-o-r.com!www.
w.t-a-y-l-o-r.com!ww
ww.t-a-y-l-o-r.com!w
www.t-a-y-l-o-r.com!
y-l-o-r.com!www.t-a-
!www.t-a-y-l-o-r.com

As a result, the BWT of www.t-a-y-l-o-r.com! is tyloarw-.-o-c-.ww!-m (the last column of the above table).

Move-To-Front Transform

The MTF transform has a similar purpose to the BWT: make an output stream that is likely to be more compressible than the input stream. The common groupings from the BWT are transformed once more to produce a symbols stream that is more likely to compress well in the final step.

We start with an ordered list of all the active symbols (all of the unique symbols in the stream) in sorted order.

The MTF transform is implemented as follows:

Begin with an ordered list of all of the unique symbols in the input stream
For each symbol in the input stream:
1. Send the position in which the symbol appears in the list.
2. Move the symbol to the front of the list.

Continuing our example from above, our input stream is a set of characters, tyloarw-.-o-c-.ww!-m. The ordered list described in step 1 is -.aclmortwy!. Step 2 proceeds as follows:

Input Output List

t 8 t-.aclmorwy!

y a yt-.aclmorw!

l 6 lyt-.acmorw!

o 8 olyt-.acmrw!

a 6 aolyt-.cmrw!

r 9 raolyt-.cmw!

w a wraolyt-.cm!

- 7 -wraolyt.cm!

. 8 .-wraolytcm!

- 1 -.wraolytcm!

o 5 o-.wralytcm!

- 1 -o.wralytcm!

c 9 c-o.wralytm!

- 1 -co.wralytm!

. 3 .-cowralytm!

w 4 w.-coralytm!

w 0 w.-coralytm!

! b !w.-coralytm

- 3 -!w.coralytm

m b m-!w.coralyt

Therefore, the resulting stream from the MTF transform is 8a6869a7815191340b3b (written in hexadecimal).

Huffman Encoding

Huffman encoding is an entropy encoding algorithm that attempts to represent the input stream in an alternate way that requires fewer bytes. The basic idea is to determine the frequency in which the various symbols appear in the input stream and assign shorter length codewords to symbols that appear more frequently and longer length codewords to symbols that appear infrequently.

Consider the following input stream: bacadaeafabbaaagah. We could represent each symbol (letter) with three bit codewords: a as 000, b as 001, c as 010, ... h as 111. Representing the 18 symbol input with the three bit codewords requires 54 bits. Huffman encoding this input stream could result in the following variable length codewords: a as 0, b as 100, c as 1010, d as 1011, e as 1100, f as 1101, g as 1110, ... h as 1111. Since a is found so frequently in the input stream, we give it just one bit, even though it means that most of the other symbols now require four bits. The input stream now requires only 42 bits to represent.

The optimal set of codewords are determined by creating a binary tree whose path to each symbol defines the codeword. Continuing our example from the MTF transform section could result in the following binary tree:

              0                 |                1
              |-----------------+----------------|
              |                 20               |
     0        |        1                0        |        1
     |--------+--------|                |--------+--------|
     |        8        |                |        12       |
0    |    1       0    |    1      0    |    1       0    |    1
|----+----|       |----+----|      |----+----|       |----+----|
|    4    |       |    4    |      |    5    |       |    7    |
3         6       9         a      b         1       8         |
                                                          0    |    1
                                                          |----+----|
                                                          |    4    |
                                                          |         |   
                                                       0  |  1   0  |  1
                                                       |--+--|   |--+--|
                                                       |  2  |   |  2  |
                                                       0     4   5     7

In order to construct the tree, we can make use of a heap data structure where each element of the heap is either

A symbol with priority equal to the number of occurences of that symbol in the input stream, or
A binary sub-tree of symbols with priority equal to the sum of priorities of all symbols in the sub-tree.

Here are the steps for creating the tree:

Begin by pushing each symbol into a heap where the root element of the heap is the lowest priority node
As long as there is more than one node in the heap:
1. Remove two nodes with the lowest priority from the heap*
2. Create a new sub-tree node consisting of the two removed nodes from the above step as children and having a priority equal to the sum of priorities of the two children
3. Update the parent links in the two child nodes to point to the newly created parent node
4. Push the new node into the heap
The remaining node is the root node (containing the entire tree)

When ties are encountered, newly generated nodes should be selected first.

The codewords for our example are as follows:

Symbol	Codeword	Times in Input Stream
0	11100	1
1	101	3
2	—	0
3	000	2
4	11101	1
5	11110	1
6	001	2
7	11111	1
8	110	3
9	010	2
a	011	2
b	100	2

For our example, the input stream from the MTF transform, 8a6869a7815191340b3b is encoded into the following bitstream: 11001100111000101001111111110101111101010101010001110111100100000100.

Huffman Decoding

Given the symbol table (mapping symbols to codewords) decoding the input stream is a simple matter of matching the bit pattern in the input stream to the codewords in the symbol table.

Inverse Move-To-Front Transform

Left as an exercise. Feel free to following the links provided above for more information.

Inverse Burrows-Wheeler Transform

Left as an exercise. Feel free to following the links provided above for more information.

Problem Statement

You should write one program to compress file(s) and one to uncompress file(s).

Implementation Notes

Your solution does not need to actually convert the ones and zeros to a byte stream (although it would be nice, and you are free to use someone elses classes to do this, e.g., BitI/OStream). You can just send a sequence of characters (0/1) out instead.
You will need to store the huffman codeword table in the compressed file.

Design Decisions

While the concepts implemented in this assignment are not difficult, the details involved in implementing them are not trivial. Be sure to spend adequate time in the design stages before moving on. I have intentionally left many of those design decisions for you to discover.

Lab deliverables (due 11:00pm, Monday of week 8)

You should prepare an email message (plain text) that indicates how much time each of you spent on this assignment and addresses the following questions:

Would sorting the words.txt file before compression improve the results of either compression scheme?
What is the largest codeword needed using Huffman compression (using the words.txt file)?

Each team must prepare a video that is 9 to 14 minutes in length. The video must address the following:

An analysis of appropriate patterns (think about how the program may be extended by others in the future, identify logical patterns to consider even if you decide not to make use of them in your solution to this assignment). Show how these were used (or could have been used) in your code (show code or class diagrams to illustrate).
An analysis of the compression rates (do they make sense?)
A discussion of any problems you encountered
A code walk-through of key pieces of your solution
A demonstration of your compression/uncompression programs on this text file. Be sure to show the file size of the compressed file and the contents (you don't need to scroll through the entire file) of the uncompressed file.

You may use Microsoft Expression Encoder 4 (10 minute time limit), Jing (5 minute time limit), or CamStudio to record your video. I plan to have you copy your video to a USB drive during the Wednesday lecture of week 8.

There is no additional required submission; however, if I have questions after watching your video submission, I may ask for a copy of your code.

If you have any questions, consult the instructor.

Acknowledgment

This laboratory was developed by Dr. Chris Taylor and was motivated by a similar assignment by Dr. Kevin Wayne.

Office: L-343 •
Phone: 277-7339 •
npǝ˙ǝosɯ@ɹolʎɐʇ