Transmission bandwidth is always limited (hey... if you're reading
this over a 28.8kb link, you get the picture!). Data compression
can help a lot.
Senator Joe Biden found himself in hot water for allegedly plagiarizing
material during college. The military academies regularly erupt
in cheating scandals. I'm trying to teach my six year old to do
creative work in school without peering over someone else's shoulder.
Yet, my engineering motto is to steal constantly. I surreptitiously
glance around and furtively rip pages from the latest issues of
ESP, EDN, or any of a hundred other sources. All go in my algorithm
collection. (Jack Crenshaw's articles in this magazine are great
- just last week I pulled his bit on CRCs to help a friend). I
buy books, and now CD ROMs, about obscure topics, tossing them
on a shelf in case someday I'll need to refer to them. Once or
twice a year I browse the University of Maryland's Computer Science
and Math libraries, looking for neat techniques, ideas, and methods.
As a result my algorithm collection grows a little every month;
perhaps haphazardly, perhaps with little clear direction; but
always it expands. Someday I'll have to do a decent job of categorizing
the material, but in the meantime it is the source of many of
the algorithms I use in my work. Why re-invent a FFT or floating
point routine when others have developed proven, reliable methods?
Steal shamelessly!
While recently looking for a compression routine I was shocked
to find that almost none of the dozens of articles in the collection
contained any flowcharts, pseudo-code, or code snippets. All plunge
quickly into a discussion of entropy, Shannon's information theories,
and the like. The closest most come to providing something useful
is a deep discussion of creating binary trees which depict coding
schemes. Hey - I don't about you, but I don't have the time or
patience to figure out some academician's theories. I need code
now!
Since many embedded systems transmit and store data, compression
is an important issue that many designers ignore, perhaps because
of the paucity of useful information. Here I'll present a couple
of quick and dirty techniques that may not provide the best possible
results, but that can be used immediately. Next month I'll follow
up with more discussion and code details.
Until recently computer science had interest only in lossless
compression. A lossless scheme transmits the data perfectly; there
is no degradation due to the compression/decompression process.
Some sorts of data can tolerate quite a bit of corruption, though.
Video, in particular, just does not require the same sort of bit-by-bit
perfection a computer program does. Lossy compression reduces
the data's bandwidth in part by allowing redundant or non-essential
information to be lost.
Under the lossy category the most fascinating approach (to me)
is to reduce the image to a series of fractals - floating point
numbers that represent the complexity of a shape, exploiting hidden
regularities in the image. Often compression ratios of thousands
to one result. Though there have been some attempts to commercialize
this, the process of extracting the fractal nature of a picture
is so complex that compression times are still very high.
Lossy compression is even today being used by Bell Atlantic to
transmit full motion video over conventional copper phone lines
in Virginia. I haven't been able to find out how they are making
this happen, but the thought of a wide bandwidth link into the
homes, without rewiring the entire United States for fiber optics,
is pretty cool.
Lossy compression is an entire field in itself that is now being
studied heavily. However, my interest lies more in lossless techniques
that let me transmit a text stream or program code intact.
All data compression schemes exploit some pattern or characteristic
of the data, so no one method is ideal for all situations. For
example, fax machines transmit mostly white images; the bit stream
is therefore compressed by sending only the differences between
scan lines.
Lots of data sources are very repetitive in nature. Most pictures,
for instance, contain long strings of identical data. Often, computer-transmitted
text has vast amounts of white space (sequences of 0x20 characters).
Run Length Encoding (RLE) compresses data streams by simply noting
the presence of identical character sequences and replacing each
sequence with the repeated character and a number indicating the
required number of repetitions.
For example, the string "sssttttooo0o" could be compressed
to something like "3s4t5o", where the numbers indicate
the number of repetitions for each character.
- If a character(i) is not the same as character (i+1), set
bit 7 and transmit it.
- Otherwise, count the number of identical characters (to a
limit of 127).
- Drop the repeat count followed by the character into the transmission
stream.
Pseudo-code for the encoding process is:
i=0;
WHILE (i < total number of character to transmit)
{IF(data(i)!=data(i+1) DO ; if this point is not like the next point
{transmit(data(i)+128) ; set bit 7 and transmit
i=i+1 ; increment pointer
}ENDDO
ELSE
{count=i+1 ; count-i is the repeat count
WHILE ((data(i)=data(count) AND (count-i<127)) DO
{count=count+1 ; increment counter
} ENDDO
count=count-i ; repeat count
transmit(count) ; send repeat count with bit7=0
transmit(data(i)) ; send character
i=count+1 ; point to next character
}ENDIF
}ENDWHILE
Decoding is even easier:
WHILE (characters still coming)
{IF((data BITWISE AND 2**7)=2**7) DO
{save(data-128) ; save char without bit 7
}ENDDO
ELSE
{count=data ; repeat count
save (count) copies of (next data)
}ENDIF
}ENDWHILE
RLE encoding is easy to implement and, in many cases, very efficient.
See "Run-Length Encoding Slashes Image-File Storage Requirements"
by Richard Quinn, Personal Engineering & Instrumentation New,
September 1990 for C code and more discussion of the subject.
EDN ran a more comprehensive article in the June 21, 1990 issue:
"Compression Algorithms Reduce Digitized Images to Manageable
Size", by William Warner. Highly recommended for dealing
with more complex image and transmission types.
Huffman Coding
As I mentioned, all data compression exploits some known characteristic
of the data stream. In English text, for example, the letter "e"
occurs much more often than "z", yet ASCII assigns exactly
the same number of bits (7) to the two characters.
In 1952 D.A. Huffman proposed using varying length codes to express
characters. The resulting code is used by most text compression
systems, including the Windows Help compiler. Again, since "e"
is so common, an ideal Huffman code would assign it just a couple
of bits, giving the rare letter "z" many more.
Huffman codes allocate the shortest codes to the most likely characters,
and the longest to those that are infrequent. The bad news: data
transmitted so encoded will not be aligned on byte boundaries,
since most characters will have far fewer than the standard 8
bits. To avoid ambiguity, each code must have the "prefix"
property. That is, no code can occur as the first part of another,
longer code. Thus, if the letter "e" were encoded as
"01", then no other code could start with this sequence.
This does allow the decoder to properly align on a character,
but also somewhat lengthens the minimum code length.
Figure 1 gives a set of Huffman codes for the alphabet. Note that
commonly occurring letters have fewer bits than those less common.
The compression process uses a so-called "dictionary lookup",
nothing more than getting the character's code and size from a
lookup table, and then dropping them into the transmission stream.
Pseudo code might look like the following, assuming the codes
exist in a structure <huffman>, which has elements <code>
(left justified code as in figure 1), and <size>, which
is the number of bits in the code.
WHILE (there are more characters)
{
code=huffman[character].code ; get huffman code
size=huffman[character].size ; get code size
i=0
WHILE(i<size)
{sendbit(code) ; send MSB of code
shift code left 1 bit ; select next bit
i=i+1 ; increment # bits sent
} ENDWHILE
} ENDWHILE
Decoding the compressed transmission is a bit more complex, simply
because the character's size isn't known until a match occurs.
Most approaches break the Huffman code into a binary tree, depicted
in a table with forward links to speed the searching. While the
code is not terribly complex, building the table for efficient
searching requires a bit of explanation, which I'll provide in
next month's installment.
For information about Huffman coding, consult "An Introduction
to Data Compression" by Harold Corbin (April, 1981 Byte),
or "Data Compression", by Gilbert Held, John Wiley
& Sons, New York, 1983.
There are many Huffman codes. Just because the average English-language
tome contains some particular distribution of letters does not
mean that all transmitted English data will be the same. Vast
quantities of C code, for example, will show an unusually high
preponderance of curly braces. Though many Huffman users ignore
this and encode their data with the standard English-language
probability distributions, it is sometimes possible to greatly
improve the compression efficiency by computing new codes for
characters based on the text block being transmitted. This is
called Adaptive Compression. It requires more compute time to
figure the codes, and uses more overhead in that the code table
must be transmitted along with the compressed data, but can result
in substantially better performance.
Finally, it's fascinating to watch a child learn to read. A young
one starts by clearly recognizing each letter, only laboriously
assembling each into a word. The same sort of high-entropy effect
occurs in Huffman encoding when we assign short codes to characters,
but ignore the fact that English abounds with frequently-used
phrases and words. In English text, "the", "to",
and a number of other words are so common it makes sense to extend
the compression beyond traditional character boundaries and assign
a short code to entire common words. The decoder will recognize
the word as a single entity with relatively few bits, just as
a skilled reader recognizes words, ignoring details of the letters.
In fact, few programmers bother as the increased computation time
may not justify the results.
Data Compression - Part 2
Last month I introduced several schemes that compress text and
binary data transmitted over a communications channel. All of
the methods I covered were for lossless compression, which preserves
every little nuance of the raw data. Lossy compression is a whole
'nother field, which it's hard to say much about unless one is
willing to define exactly how much lack of fidelity is tolerable.
Huffman codes are an easy and commonly used method to convert
a string of data to tokens, each of whose length is inversely
proportional to the frequency of the encoded character. For example,
to transmit Huffman-encoded standard English text, the token corresponding
to the letter "e" uses very few bits, as "e"
is the most common character in the alphabet. The letter "z"
will surely take many more bits to send, though will not impact
the communications bandwidth much as it occurs so infrequently.
Although I presented a frequency distribution table for standard
English text last month, most compression schemes dynamically
compute the frequency of characters from each data set. Surely
you can see that transmitted C code, for instance, will have an
unusually high number of curly brackets compared to the epic Beowulf.
Even short segments of standard English text may have skewed distributions
due to the material's subject matter or the writer's skills.
It's easiest to visualize how the Huffman algorithm translates
each character to a set of bits by using a binary tree. Indeed,
most compression techniques use trees to simplify the code and
to aid in building recursive algorithms.
The figure is a portion of a typical binary tree used to build
the compression dictionary (the list of codes for each character).
Each character is depicted by the branches one descends to reach
the character at the bottom. In this simplified tree the letter
"e" uses only a single bit, a zero, for its encoding.
"t" is 10; "m" is 11.
root node
|
|
__0__________1_________________
| |
| ____0_________1__________________
E | |
| |
T M
Huffman compression consists of the following steps:
1) Scan the input data and develop an array of the frequency of
occurrence of each character. If you are compressing ASCII or
other byte-wide data, the code looks something like:
Repeat Till All Data Scanned
{
lCount[input_char]++
}
The lCount array will have 256 elements, one per possible entry
in the input character set. Be sure to use a data structure that
will tolerate the maximum number of possible counts per character
in the data stream. I like longs, as it's awfully hard to overflow
them on any realistic data stream.
It's a good idea to then reduce the array of longs to an array
of bytes. Most likely you'll be handling lots of input data; it's
foolish to do massive numbers of long compares when byte versions
will suffice. Since there are only 256 possible characters, you
can simply convert the array of counts to an array of relative
frequency, 0 meaning the character was never seen and 255 meaning
it is the most frequency.
2) Build the dictionary tree. I could use tree terminology here
and most likely confuse at least half of you, Gentle Readers,
so will try more of an illustrative approach so you can really
see how the tree is built. The steps are as follows, and
result in the tree shown in figure 2.
2.1) Arrange the character set in an array ordered by decreasing
frequency. Include the frequency (i.e., the count of the number
of times each character occurred).
2.2) Starting at the bottom, "connect" the two characters
with the lowest frequency. Add up the counts for each and place
these at the intersecting lines.
2.3) Continue combining groups of two until all have merged into
one.
2.4) Assign a zero to one side of each nodal point and a one to
the other (i.e., zero for left branches and one for right branches).
The tree is all we need to compress input data, but it is cumbersome
to use. Worse, we've built an upside-down tree, with the "input
data" (the characters) at the leaf, or bottom, nodes. Since
we'll presumably compress an input array that is much bigger than
this small tree, it makes sense to add another step to build a
simpler data structure. We really want an array of Huffman codes
(with the associate number of bits per code), with one entry per
character. In other words, we can traverse the tree once per possible
character and log the resulting codes to an array HuffCode. HuffCode
has 256 entries, and is a structure that includes two elements
per entry: the code itself, and an integer indicating the number
of bits used in each code.
3) Send the tree to the receiver so that it can be used to decompress
the data. Though you can simply transmit the entire tree structure,
some people reduce it to bare elements to conserve transmission
bandwidth. Much of the tree could be empty, depending on the input
stream, in which case it's silly to add the extra null nodes when
the whole point of the exercise is to reduce data!
4) Compress each data byte. Now that the dictionary array has
been built (HuffCode), transmit data a bit at a time as follows:
WHILE (there are more characters)
{
code=HuffCode[character].code ; get huffman code
size=HuffCode[character].size ; get code size
i=size
WHILE(i)
{sendbit(code) ; send MSB of code
I<<1 ; select next bit
i-- ; increment # bits sent
} ENDWHILE
} ENDWHILE
Huffman Expansion
The receiver must have a copy of the tree compression tree so
it can decode the bit stream properly. I suggested above that
you transmit more or less the entire tree. Some folks send the
linear dictionary array (HuffCode) instead, but the code needed
to decompress a byte using only HuffCode is slow and bulky.
Expansion using a tree is a more natural approach, particularly
since you have no idea where the character boundaries are. A stream
of rapid fire bits, perhaps millions of them, will come shooting
down the transmission line with no clear demarcation between characters.
The tree captures both the characaters codes and the number of
bits needed to represent each one.
This makes decompressions just a matter of traversing the tree.
For each input bit descend the tree, branching to the left if
the input bit is a zero, or to the right branch for a one. You'll
know the expansion is complete when you hit a leaf node of the
tree. The code to decompress a single character looks like this:
TreeNode=RootNode
Repeat Until at a leaf
{
TreeNode=Tree.right ; assume right child branch
if (inputbit==0) TreeNode=Tree.left
; if 0, traverse left child branch
}
Output(TreeNode);
It's a good idea to add code to print the intermediate data structures
(e.g., the tree) for debugging purposes.
Lempel/Ziv
Why compress single characters when often text and binary data
will contain long identical sequences of phrases? In English,
the word "and" crops up constantly, yet always burns
up four ASCII characters (including the leading space): 32 bits
for a word rarely unused in a paragraph.
Most of the familiar compression utilities like PKZIP and LHARC
attempt to find common phrases rather than single characters.
All of these algorithms trace their roots to work done by Lempel
and Ziv in 1977 and later.
The LZ77 approach, Lempel and Ziv's first shot at a new method
of data compression, tries to compress a file by building a dictionary
of entire phrases used before in the data being compressed. It
maintains a sliding window, a long string (several thousand bytes)
of data that has been recently processed, as well as a short "look
ahead buffer". Every time it finds a phrase in the look-ahead
buffer that matches a portion of the text in the sliding window,
it emits a pointer to the corresponding phrase in the sliding
window, as well as a number indicating how many characters matched,
and what the first non-matching character was. The Lempel/Ziv
philosophy is to assume that the current data string most likely
looks a lot like the what was most recently processed.
The sliding window becomes a dictionary whose contents are always
changing as it slides over the input data stream. The compressed
output is a series of numbers telling the decoder where to look
in the sliding window for the compressed phrase. The fact that
the window slides, and is not stationary, keeps search times reasonable.
The LZ77 algorithm and its more recent versions can sometimes
produce tremendous compression ratios for embedded instrumentation
transmitting more or less the same sorts of data over and over.
Given some a priori knowledge of your instrument's output, it
may be possible to construct an LZ sliding window and look-ahead
buffer size that greatly exploits the repeating nature of the
data.
Be wary! Several variants of the LZ algorithms have been patented,
and commercial use of these may require the payment of non-trivial
royalties. It seems appalling to have to say "consult your
attorney before writing code", but perhaps there is no other
choice. For this reason alone I stay with the standard Huffman
techniques, even though not as efficient as the LZ algorithms.
Resources
You can play with all of these techniques on a PC for non-commercial
purposes using the code in "The Data Compression Book",
by Mark Nelson (M&T Books, San Mateo, CA, 1992 800-533-4372
(in CA 800-356-2002)). It comes with a pair of disks that contain
C source code for the Huffman, LZ77, and many other compression/expansion
schemes. I highly recommend it.
Also, refer to the May, 1986 Byte for more information about handling
the Huffman trees. See "Data Compression with Huffman Coding",
by Jonathan Amsterdam, pgs 99-108.
Figure 1: A Set of Huffman Codes
e 100
t 001
a 1111
o 1110
n 1100
r 1011
i 1010
s 0110
h 0101
d 11011
l 01111
f 01001
c 01000
m 00011
u 00010
g 00001
y 00000
p 110101
w 011101
b 011100
v 1101001
k 110100011
x 110100001
j 110100000
q 1101000101
z 1101000100