Stonehill
College
Data Structures and Discrete Mathematics Learning Community
Professors Ralph Bravaco and Shai Simonson
Text Compression and Huffman Trees
Consider a simple “message” (or file) which is stored as a string of
characters using the ASCII code. Each
character in the message requires eight bits of storage.
For example character “A,”
(ASCII code 65) would be stored as
010000001, character “B” (ASCII 66) would
bet stored as 010000010 etc. In
fact, this very paragraph would
require 3472 bits of storage (434 characters and eight bits for each character).
Data compression is a technique which reduces the size of a file or data
collection without affecting the information contained in the file.
A compressed file, obviously, requires less storage space.
Moreover, electronic transfer of a compressed file is faster than the
transfer of an uncompressed file and a smaller file is consequently less
susceptible to transmission errors.
One of the simplest but efficient methods of data compression was introduced by
David Huffman in his paper ``A method for the construction of minimum redundancy
codes.'' Huffman’s method is very
intuitive. Unlike the ASCII code
which uses exactly eight bits to encode each character, a Huffman code
uses a variable number of bits for each character. With Huffman’s
scheme, a common character such as “a” or “e” might be encoded with
perhaps just two bits and less frequently used characters like “q” or
“#” would be encoded with a longer bit string.
Consequently, the most frequently occurring characters are encoded with a
minimal number of bits. (Morse code uses the same technique, “E” is encodes as a
single dot, “T”
just one dash, but “Q” is dash-dash-dot-dash). Because ASCII
encodes every character with eight bits we say that ASCII code is a fixed
length code. Huffman’s code
is a variable length code.
We’ll begin with a trivial example.
Suppose we wish to encode the string
“SHE-SELLS-SEA-SHELLS”
The ASCII code would require 160 bits (20 characters each requiring 8 bits).
Notice, however, that the characters
which appear in the string have the following frequencies ( and relative
frequencies)
A 1 .05
H 2 .10
-
3 .15
E
4 .20
L
4 .20
S
6 .30
A possible variable length code which uses just 2 bits to encode frequently used
characters like “S” or “L “ might be
:
Character
Codeword Frequency of Character
A
0000
1
H
0001
2
-
001
3
E
10
4
L
11
4
S
01
6
We will refer to this code as Code 1.
We will also use the term “codeword” to denote an individual
character code. Thus, 0000 and 001
are codewords.
Notice that “S” which occurs 6 times (or 30% of the time) is encoded using
just 2 bits and “A” which occurs only once is encoded with 4 bits.
With this code, the string, “SHE-SELLS-SEA-SHELLS,”
can be encoded with just 49 bits:
0100011000101101111010010110000000101000110111101
You may wonder why we didn’t use a “more efficient” code like:
S
0
L
1
E
00
- 11
H 01
A 10
( Code 2)
Using Code 2 we can encode the string “SHE-SELLS-SEA-SHELLS” as
001001100011011000101101000110
which uses a mere 30 bits. Certainly,
compressing the string to 30 bits is more efficient than storing it as 49 bits.
So, why not use Code 2? It
does the job with fewer bits.
To answer this question, consider just the simple substring, “SHE.”
Using the ASCII coding
scheme, “SHE” is encoded as
010100110100100001000101
Now, if we wish to decode (or translate this bit string back to ordinary,
readable text), we simply look at
each successive group of exactly 8 bits:
01010011 - (83)
– “S”
01001000 - (72)
– “H”
01000101 - (69)
– “E”
That’s all there is to it.
Now, let’s encode “SHE” using our “more efficient” Code 2:
00100.
Wow! Just 5 bits!
Now, given this five bit string, how do we decode it? Do we decode 00100 as
“EAS” (00
10 0) or
as “ SSLE” ( 0 0
1 00 ) or perhaps as “SSLSS” (0
0 1 0 0) ? You really cannot tell without putting some marker
between the codewords for each letter.
On the other hand, using Code 1 we encode “SHE” as
01000110.
To decode this bit string , simply scan
the string from left to right until
you find a complete codeword. You
have just decoded the first character. Continue
the process until you have scanned the entire bit string.
You will see there is no ambiguity.
01 0001 10
S H E
Exercise 1.
a. Decode each of the following bit strings which are encoded with Code
1:
01000011001011011110100110101101
00010000110010001000001001000110101101
b. Now, using the following bit string, show that Code 2 is ambiguous.
In other words, find two possible translations of the following bit
string:
01011100011011001000
c. Explain why Code 1 is not ambiguous. What
property does Code 1 have which ensures that each bit string has a unique
translation?
Prefix Codes
For arbitrary variable length codes, we may not be able determine the end of
one codeword . For example, using
Code 2, do we interpret the bit string 0101 as 0.10.1 (SAL) or 01.01 (HH)
or 0.1.0.1 (SLSL)?
Is the initial letter encoded as 0 (S)
or 01 (H)? Without some kind of marker between codewords, it is
impossible to determine where each codeword ends.
Code 1 however, does not have this drawback.
The bit string 00010000011 can have only one interpretation:
0001.0000.001.
Let’s look at this string.
As we scan from left to right we see:
0 is not a valid code. Continue
scanning
00 is not a valid code. Continue
000 is not a valid code. Continue
0001 is the code for H. Done.
OK, we have unearthed an “H.” But how do we know that there is no code that starts
with 0001 ? How do we know that
00011 is not the code for some other letter?
If you look closely at the codewords of Code 1, you will notice that no codeword
is a prefix of another codeword. The
codeword for “H” is 0001. No
other codeword starts with 0001. The
codeword for “S” is 01. Nothing
else begins with 01. Thus, using a
left-right scan, once we read 0001 we recognize the codeword for “H.”
Further, we know that 0001 is not the start of the codeword for some
other character, since no other character code begins with 0001.
Definition.
A code is a prefix code (
or has the prefix property) if
no codeword is the prefix of another codeword.
Again, Code 1 has the prefix property but Code 2 does not.
The codes which we will develop will have the prefix property.
A Huffman code turns out to be an “optimal” prefix code.
We will be more precise about the meaning of “optimal” a bit later.
First, let’s see how the Huffman code is constructed.
Once we look at the mechanics, we will take a closer look underneath the hood.
Codes and Binary Trees:
Let’s look at Code 1 again:
Character
Codeword Frequency
Relative Frequency
A
0000
1
.05
H
0001
2
.10
-
001
3
.15
E
10
4
.20
L
11
4
.20
S
01
6
.30
20
1.00
We can represent this code with a binary tree.
On such a tree,
each left branch is labeled 0,
each right branch is labeled 1
the encoded characters are placed in the leaves
Notice that the concatenation of the branch labels on the path from the root to
any leaf gives the code word for the symbol stored in the leaf.
For example, the path from the root to the leaf labeled “H’ is
left-left-left-right and the codeword for “H” is 0001.
The path to “L” is
right-right , so the codeword for “L” is 11.
Huffman’s Algorithm constructs such a binary tree in a bottom up manner. The
Huffman tree yields an optimal prefix code
for the symbols contained in the leaves.
Building the Huffman Tree.
Given the frequencies of each character in a file, the Huffman tree can be
constructed as follows:
For each symbol to be encoded, make the binary tree consisting of a single root.
Each of these root nodes
should hold the frequency (or
relative frequency) of one symbol.
(Ultimately, these nodes will be the leaves of
our Huffman tree)
2. Choose the two trees with the smallest root values and merge these
into a new tree, as in the following diagram. Here the trees labeled A
and H are chosen. The root of the new tree holds the value 3, obtained as the
sum of the values stored in the roots of its left and right subtrees.
3. Continue the process, always
forming a new tree from the two trees with the lowest root values. Here we merge
the two trees each with 3 stored in the root are Four trees remain with root
values 6,4,4,and 6.
4. Continue this process, only a
single tree remains. This final
tree is a representation of the Huffman code.
Remember always merge the two subtrees which have the smallest root
values
Now merge the two trees with root values 6. Two trees remain
Here is the final tree, with each branch labeled either 0 or 1.
The following is a general algorithm:
For each character c to be encoded
create a binary tree, consisting of a single root node r such that
content (r) = f(c), the frequency of c
for i = 1 to n-1
{
From the set of
binary trees ,S, remove the
two trees, x and y,
with
minimal values, f1 and f2 in the root
create a tree with root z such that
x is the left child of z
y is the right child
of z
content(z) = f1 +
f2
add
this new tree with root z to S.
}
Using the Huffman tree to determine the codewords.
Once the Huffman tree is created, it is easy to determine the codeword for
each character.
Method1: Bottom Up
As we’ve seen, each leaf represents one character.
If we begin at a leaf and proceed up ttree to the root, we can “trace
out” (in reverse) the codeword
for a given symbol:
For each leaf a
while a is not the root of the tree
if a is a left subtree
output ‘0’
else
output ‘1’
a = parent(a)
Notice
that the algorithm will output codewards in reverse order.
Example:
Let a be leaf “ -.”
a is a right subtree; output ‘1’
a = parent(a)
a is a left subtree; output ‘0’
a
= parent(a)
a is a left subtree; output ‘0’
a = parent(a)
a is the root; done
Output: 100; codeword for “-“
is 001
Method 2 : Topdown
An ordinary, recursive tree traversal with an additional parameter can also
be used to determine the codewords:
void traverse(string s, tree root)
Each time that the function is called recursively, pass it string
s concatenated with either ‘0’ or ‘1.’ When visiting a leaf,
simply output string s –
s is the codeword for
character associated with the leaf.
Encoding the file:
Simply translate each character using a table of codewords:
Example:
Character
Codeword
A
0000
H
0001
-
001
E
10
L
11
S
01
Translate: “SHE-SELLS.......”
010001100010110111101
S H
E - S E L
L S
Decoding a compressed file:
When decoding a compressed file, you will once again use the Huffman tree.
Consequently, the frequency table must once again be available.
When the tree is constructed, the text characters are stored in the
leaves. Each codeword, determines a
unique path from the root of the tree to to the leaf holding the translation of
the code word. Once we reach a leaf, we begin the decoding process again at
the root.
Example: Decode : 0100011.....
Begin
at the root:
0100011 –
move left
0100011
– move right
A leaf—“S” (01 is the
codeword for S)
Go back to the root and continue the process.
0100011
etc.
Exercise 2:
a. Construct a Huffman tree for the alphabet A B C D E F where the relative
frequencies of each character are:
A .22
B .13
C .33
D .10
E .20
F .02
b. Use the tree to determine the code word for each of the character of this
short alphabet
c. Encode the string “AECBCAF”
Exercise 3:
In terms of n, the number of
characters to be encoded, determine the running time of Huffman’s algorithm.
When the set of trees is stored as a linked list
When the set of trees is implemented as a priority queue keyed on the frequency
of a character. Assume the priority
queue is implemented as a heap.
Exercise 4. Encoding a text
file.
Write a program which will encode a text file using the Huffman algorithm.
The user interface should be very simple.
Just provide the user with two prompts:
Input file:
Output file:
so that the user may supply the
name of the input (text) file as well as a name for the compressed file.
Your output should actually be two files : the compressed file,
and also a file with the frequency table .
This second file should have the same name as the compressed file and an
extention of feq. For
example if the compressed file is called something.zip then the second
file should be something.feq
Note:
Although C++ has the facility to manipulate bits, for convenience,
you will simply use the characters ‘0’ and ‘1’.
Thus, the number of 0’s and 1’s in your encoded file will represent
the number of bits necessary for compression.
To encode text using a Huffman code you must
First build a frequency chart for all characters in your text
Build a Huffman tree
Make a list of all codewords
Using the list of codewords, encode your text.
Here are a few suggestions to help you with the implementation.
Since ASCII values fall in the range 0..127,
use a simple array (int freq[128])to store frequencies
For example, if ‘A’ has frequency 13, then freq[65] will have the value 12. Of course, not all ASCII characters
will appear in your text, so many entries in the frequency array will be 0.
When you build the Huffman tree, you must make a number of design decisions.
As you know, the algorithm builds the tree incrementally.
At each step, the subtrees with minimum roots must be chosen.
The most efficient method for determining the these two minimum valued
trees would be to use a heap based priority
queue. Unless you are really stuck,
use a priority queue.
If you plan to build your table of codewords
in a bottom up fashion, you will need to maintain a table which holds the
address of each leaf on the tree. Also,
each node must contain the address of its parent
Simply keep the codewords in an array index from 0 to 127 so that you may access
the a character’s codeword directly i.e. without a search.
Output your frequency table to a file – simply list the 128 frequencies
in the array.
Just use the table from step three.
Exercise 5: Decoding a compressed file.
Write a program which will decode a compressed file. Your program should
prompt the user for two
filenames:
Compressed file:
Text file:
If the name of the compressed file is something.zip, your program should use that file and also
the file something.feq as input.
Exercise 6:
Use the programs in the previous exercises to compress( and decode)
a. The Declaration of Independence
b. The US Constitution
c. Shakespeare’s Hamlet
( The text of these files can be downloaded at .....)
Use Winzip to compress each of the files in part 1.
Compete the following chart:
SIZE IN BYTES
|
|
Text
file |
winzip |
Huffman |
|
Declaration
of Independence |
|
|
|
|
Constitution |
|
|
|
|
Hamlet |
|
|
|
Remember:
You are using the characters ‘0’ and ‘1’ in your program.
These are not single bits. Make
sure you take that into account. You
do not need to write any programs to find these values.
You can get all this information via Windows and Word.
Huffman Codes—mathematical questions.
Now that we have looked at the
Huffman algorithm, let’s take a closer look at some of the mathematical
properties of a Huffman tree.
Prefix property:
Excercise 6: We have stated that the Huffman’s Algorithm produces a code
with the prefix property. Explain
why this is the case. You do not
have to give a rigorous mathematical proof—just think about how the tree is
constructed and give a plausible explanation.
A few sentences will suffice.
Optimal codes.
We have stated the the Huffman code is
“optimal” so perhaps it is now time to give a more precise definition
of this term.
We first begin by defining the cost of the code or equivalently the tree.
Defininition.
Suppose T is a tree representing a prefix code for a file.
The cost of T, C(T), is
the number of bits necessary to encode the file.
Thus
C(T) = S
frequency(c)*length(c), for each character c
where length(c) is the length of the path from the root of T to the leaf
representing c.
Note: length(c) is also the number of bits in the codeword for character c.
Example:
Consider the following Huffman tree, T which we built to encode the message
“SHE-SELLS-SEA-SHELLS”
C(T) = frequency(‘A’)length(‘A’)+ frequency(‘H’)length(‘H’)
+frequency(‘-’)length(‘-’)+
frequency(‘S’)length(‘S’)+
frequency(‘E’)length(‘E’)+ frequency(‘L’)length(‘L’) =
1*4 + 2*4+3*3 + 6*2 + 4*2 + 4*2 = 4
+ 8 + 9 + 12 + 8 +8 = 49
Definition.
Suppose T is a tree representing a prefix code for a file.
The code is optimal
if C(T) is minimal.
Note an optimal code is not unique – for a given file more than one optimal
code may exist. (Trivially, just change all 0’s to 1’s and 1’s to 0’s in
any optimal code and you have another optimal code).
With a precise definition of optimality in hand, we will prove (in the next
several exercises) that a Huffman tree is optimal. is optimal.
Excersise 7:
A full binary tree is a binary tree in which ever nonleaf node has two
children.
Notice that the Huffman trees in our examples have been full trees.
Prove: An optimal prefix code for a
file is always represented by a full binary tree
Hint: Suppose there is a non-full
tree that did represent an optimal code. Find
a contradiction by producing a better code.
Exercise 8:
Show by means of a counter example that a full binary tree may represent a
prefix code which is not optimal, even if the letters with lower frequencies
have the longer codewords.
Suggestion: Build another full tree for “SHE-SELLS-SEA-SHELLS” which has a
higher cost than the tree of our previous examples.
Exercise 9: Let x and y be
two characters with the lowest frequencies. (Arbitrarily, assume frequency(x)
<= frequency(y)). Prove that
there is an optimal tree where x and y are
siblings at the maximum depth of the tree.
Hint:
Assume that you have an optimal prefix tree, T,h that a and b are at the maximum depth.
(You can assume frequency (a) <=
frequency(b)). Here is a partial
view:
Now, because x and y have the two smallest frequencies freqency(x) <=
frequency(a) and frequency(y) <= frequency(b). Also, length(a) >= length(x) and length(b) >=
length(y).
Now switch a and x, resulting in a new tree T’. Show C(T’) <= C(T)?. Now, in T’, switch y and b.
Show C(T’’) <= C(T’)?
Exercise 10:
Prove that Huffman’s algorithm produces an optimal prefix code tree.
Hint: Use mathematical induction on n, the number of characters/leaves.
For n = 1, Huffman’s algorithm produces a tree with one node, obviously
optimal.
Now assume that for n leaves, Huffman’s algorithm produces an optimal tree.
Show that Huffman’s algorithm produces an optimal tree for n+1 leaves.
Let Tn+1 be an optimal tree with n+1 leaves. We may assume that in an
optimal tree the two characters, say x and y, with the lowest frequencies are
siblings at the lowest level of the tree. (Why may we assume that?).
Now remove x and y and replace them with a “new character z with
frequency(z) = frequency(x) + frequency(y).
Call the new tree Tn.
Let T’n be
the Huffman tree constructed from these n characters
Now finish the proof.
Exercise 11:
Suppose that the frequencies
for the characters a,b,c,d,e,f,,g, and h are the first 8 Fibonacci
numbers – 1,1,2,3,5,8,13,and 21.
Construct the Huffman tree.
What is the cost of the tree?
Now, determine the cost of the tree
for n characters when the frequencies are the first n Fibonacci numbers.