Outcomes Addressed
Team AssignmentYou must work with your assigned partner. OverviewIn this assignment you will implement the Burrows-Wheeler data (de)compression algorithm. This algorithm is used by the Unix bzip2 compression utility and often achieves higher compression ratios than gzip and PKZIP. Implementation of the compression algorithm involves three sequential steps:
The uncompression algorithm applies the inverse operations in the opposite order. I.e., do Huffman decoding, then inverse MTF transform, and finally inverse BWT. DetailsBurrows-Wheeler TransformThe BWT rearranges the order of the symbols in the input stream such that
The goal is to transform the stream into a sequence of symbols that are more easily compressed by entropy encoding. Suppose the input stream consists of English text. Intuitively, the letters etter are more likely to be preceded by a b or l than most other characters. The BWT attempts to group the likely preceding characters (or, more generally, symbols) together. The BWT is implemented as follows:
For example, consider the following input stream: www.t-a-y-l-o-r.com! where ! indicates the end of the stream. The twenty rotations of the input stream are: www.t-a-y-l-o-r.com! !www.t-a-y-l-o-r.com m!www.t-a-y-l-o-r.co om!www.t-a-y-l-o-r.c com!www.t-a-y-l-o-r. .com!www.t-a-y-l-o-r r.com!www.t-a-y-l-o- -r.com!www.t-a-y-l-o o-r.com!www.t-a-y-l- -o-r.com!www.t-a-y-l l-o-r.com!www.t-a-y- -l-o-r.com!www.t-a-y y-l-o-r.com!www.t-a- -y-l-o-r.com!www.t-a a-y-l-o-r.com!www.t- -a-y-l-o-r.com!www.t t-a-y-l-o-r.com!www. .t-a-y-l-o-r.com!www w.t-a-y-l-o-r.com!ww ww.t-a-y-l-o-r.com!w Sorting them (using the following sorting order: -.aclmortwy!) results in the following table: -a-y-l-o-r.com!www.t -l-o-r.com!www.t-a-y -o-r.com!www.t-a-y-l -r.com!www.t-a-y-l-o -y-l-o-r.com!www.t-a .com!www.t-a-y-l-o-r .t-a-y-l-o-r.com!www a-y-l-o-r.com!www.t- com!www.t-a-y-l-o-r. l-o-r.com!www.t-a-y- m!www.t-a-y-l-o-r.co o-r.com!www.t-a-y-l- om!www.t-a-y-l-o-r.c r.com!www.t-a-y-l-o- t-a-y-l-o-r.com!www. w.t-a-y-l-o-r.com!ww ww.t-a-y-l-o-r.com!w www.t-a-y-l-o-r.com! y-l-o-r.com!www.t-a- !www.t-a-y-l-o-r.com As a result, the BWT of www.t-a-y-l-o-r.com! is tyloarw-.-o-c-.ww!-m (the last column of the above table). Move-To-Front TransformThe MTF transform has a similar purpose to the BWT: make an output stream that is likely to be more compressible than the input stream. The common groupings from the BWT are transformed once more to produce a symbols stream that is more likely to compress well in the final step. We start with an ordered list of all the active symbols (all of the unique symbols in the stream) in sorted order. The MTF transform is implemented as follows:
Continuing our example from above, our input stream is a set of characters, tyloarw-.-o-c-.ww!-m. The ordered list described in step 1 is -.aclmortwy!. Step 2 proceeds as follows:
Therefore, the resulting stream from the MTF transform is 8a6869a7815191340b3b (written in hexadecimal). Huffman EncodingHuffman encoding is an entropy encoding algorithm that attempts to represent the input stream in an alternate way that requires fewer bytes. The basic idea is to determine the frequency in which the various symbols appear in the input stream and assign shorter length codewords to symbols that appear more frequently and longer length codewords to symbols that appear infrequently. Consider the following input stream: bacadaeafabbaaagah. We could represent each symbol (letter) with three bit codewords: a as 000, b as 001, c as 010, ... h as 111. Representing the 18 symbol input with the three bit codewords requires 54 bits. Huffman encoding this input stream could result in the following variable length codewords: a as 0, b as 100, c as 1010, d as 1011, e as 1100, f as 1101, g as 1110, ... h as 1111. Since a is found so frequently in the input stream, we give it just one bit, even though it means that most of the other symbols now require four bits. The input stream now requires only 42 bits to represent. The optimal set of codewords are determined by creating a binary tree whose path to each symbol defines the codeword. Continuing our example from the MTF transform section could result in the following binary tree: 0 | 1 |-----------------+----------------| | 20 | 0 | 1 0 | 1 |--------+--------| |--------+--------| | 8 | | 12 | 0 | 1 0 | 1 0 | 1 0 | 1 |----+----| |----+----| |----+----| |----+----| | 4 | | 4 | | 5 | | 7 | 3 6 9 a b 1 8 | 0 | 1 |----+----| | 4 | | | 0 | 1 0 | 1 |--+--| |--+--| | 2 | | 2 | 0 4 5 7 In order to construct the tree, we can make use of a heap data structure where each element of the heap is either
Here are the steps for creating the tree:
When ties are encountered, newly generated nodes should be selected first. The codewords for our example are as follows:
For our example, the input stream from the MTF transform, 8a6869a7815191340b3b is encoded into the following bitstream: 11001100111000101001111111110101111101010101010001110111100100000100. Huffman DecodingGiven the symbol table (mapping symbols to codewords) decoding the input stream is a simple matter of matching the bit pattern in the input stream to the codewords in the symbol table. Inverse Move-To-Front TransformLeft as an exercise. Feel free to following the links provided above for more information. Inverse Burrows-Wheeler TransformLeft as an exercise. Feel free to following the links provided above for more information. Problem StatementYou should write one program to compress file(s) and one to uncompress file(s). Implementation Notes
Design DecisionsWhile the concepts implemented in this assignment are not difficult, the details involved in implementing them are not trivial. Be sure to spend adequate time in the design stages before moving on. I have intentionally left many of those design decisions for you to discover. Lab deliverables (due 11:00pm, Monday of week 8)You should prepare an email message (plain text) that indicates how much time each of you spent on this assignment and addresses the following questions:
Each team must prepare a video that is 9 to 14 minutes in length. The video must address the following:
You may use Microsoft Expression Encoder 4 (10 minute time limit), Jing (5 minute time limit), or CamStudio to record your video. I plan to have you copy your video to a USB drive during the Wednesday lecture of week 8. There is no additional required submission; however, if I have questions after watching your video submission, I may ask for a copy of your code. If you have any questions, consult the instructor. AcknowledgmentThis laboratory was developed by Dr. Chris Taylor and was motivated by a similar assignment by Dr. Kevin Wayne. |