
Data Compression

Transmission bandwidth is always limited (hey... if you're reading this over a 28.8kb link, you get the picture!). Data compression can help a lot.

Published in Embedded Systems Programming in two installments, January/February 1994.


By Jack Ganssle

Senator Joe Biden found himself in hot water for allegedly plagiarizing material during college. The military academies regularly erupt in cheating scandals. I'm trying to teach my six year old to do creative work in school without peering over someone else's shoulder.

Yet, my engineering motto is to steal constantly. I surreptitiously glance around and furtively rip pages from the latest issues of ESP, EDN, or any of a hundred other sources. All go in my algorithm collection. (Jack Crenshaw's articles in this magazine are great - just last week I pulled his bit on CRCs to help a friend). I buy books, and now CD ROMs, about obscure topics, tossing them on a shelf in case someday I'll need to refer to them. Once or twice a year I browse the University of Maryland's Computer Science and Math libraries, looking for neat techniques, ideas, and methods.

As a result my algorithm collection grows a little every month; perhaps haphazardly, perhaps with little clear direction; but always it expands. Someday I'll have to do a decent job of categorizing the material, but in the meantime it is the source of many of the algorithms I use in my work. Why re-invent a FFT or floating point routine when others have developed proven, reliable methods? Steal shamelessly!

While recently looking for a compression routine I was shocked to find that almost none of the dozens of articles in the collection contained any flowcharts, pseudo-code, or code snippets. All plunge quickly into a discussion of entropy, Shannon's information theories, and the like. The closest most come to providing something useful is a deep discussion of creating binary trees which depict coding schemes. Hey - I don't know about you, but I don't have the time or patience to figure out some academician's theories. I need code now!

Since many embedded systems transmit and store data, compression is an important issue that many designers ignore, perhaps because of the paucity of useful information. Here I'll present a couple of quick and dirty techniques that may not provide the best possible results, but that can be used immediately. Next month I'll follow up with more discussion and code details.

Options

Until recently computer science was interested only in lossless compression. A lossless scheme transmits the data perfectly; there is no degradation due to the compression/decompression process.

Some sorts of data can tolerate quite a bit of corruption, though. Video, in particular, just does not require the same sort of bit-by-bit perfection a computer program does. Lossy compression reduces the data's bandwidth in part by allowing redundant or non-essential information to be lost.

Under the lossy category the most fascinating approach (to me) is to reduce the image to a series of fractals - floating point numbers that represent the complexity of a shape, exploiting hidden regularities in the image. Often compression ratios of thousands to one result. Though there have been some attempts to commercialize this, the process of extracting the fractal nature of a picture is so complex that compression times are still very high.

Lossy compression is even today being used by Bell Atlantic to transmit full motion video over conventional copper phone lines in Virginia. I haven't been able to find out how they are making this happen, but the thought of a wide bandwidth link into the homes, without rewiring the entire United States for fiber optics, is pretty cool.

Lossy compression is an entire field in itself that is now being studied heavily. However, my interest lies more in lossless techniques that let me transmit a text stream or program code intact.

All data compression schemes exploit some pattern or characteristic of the data, so no one method is ideal for all situations. For example, fax machines transmit mostly white images; the bit stream is therefore compressed by sending only the differences between scan lines.

Lots of data sources are very repetitive in nature. Most pictures, for instance, contain long strings of identical data. Often, computer-transmitted text has vast amounts of white space (sequences of 0x20 characters). Run Length Encoding (RLE) compresses data streams by simply noting the presence of identical character sequences and replacing each sequence with the repeated character and a number indicating the required number of repetitions.

For example, the string "sssttttooooo" could be compressed to something like "3s4t5o", where the numbers indicate the number of repetitions for each character.

RLE can actually reduce the information-carrying capacity of a channel when the data is very busy. A picture of static may have few repeating sequences. One solution, in a byte-based system, is to reserve bit 7 of each byte as a flag bit indicating "send this character through without a repeat count". Any character with bit 7=0 will be a repeat count, always followed by the single character to be repeated. Of course, this means that the repeat count cannot exceed 127.

The rules for encoding this sort of RLE code are thus:

    1. If a character(i) is not the same as character (i+1), set bit 7 and transmit it.
    2. Otherwise, count the number of identical characters (to a limit of 127).
    3. Drop the repeat count followed by the character into the transmission stream.

    Pseudo-code for the encoding process is:

    i=0;
    WHILE (i < total number of characters to transmit)
    {IF (data(i)!=data(i+1)) DO	; if this point is not like the next point
         {transmit(data(i)+128)	; set bit 7 and transmit
           i=i+1			; increment pointer
         }ENDDO
      ELSE
        {count=i+1			; count-i is the repeat count
          WHILE ((data(i)=data(count)) AND (count-i<127)) DO
             {count=count+1		; increment counter
              } ENDDO
         count=count-i	            ; repeat count
         transmit(count)		; send repeat count with bit7=0
         transmit(data(i))		; send character
         i=i+count			; point past the run
        }ENDIF
    }ENDWHILE
    

    Decoding is even easier:

    WHILE (characters still coming)
    {IF((data BITWISE AND 2**7)=2**7) DO
        {save(data-128)		; save char without bit 7
        }ENDDO
      ELSE
        {count=data		; repeat count
          save (count) copies of (next data)
        }ENDIF
    }ENDWHILE
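
    The pseudo-code above translates directly into C. Here is a minimal sketch of both directions under the same rules (bit 7 flags a literal, repeat counts limited to 127, 7-bit input data assumed); the names rle_encode and rle_decode are mine, and the caller must supply large-enough output buffers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode per the rules above: a lone character goes out with bit 7
   set; a run goes out as a count byte (bit 7 clear, 127 maximum)
   followed by the character. Returns the number of encoded bytes. */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        if (i + 1 == n || in[i] != in[i + 1]) {
            out[o++] = (unsigned char)(in[i] | 0x80);   /* literal */
            i++;
        } else {
            size_t count = 1;                /* measure the run, max 127 */
            while (i + count < n && in[i + count] == in[i] && count < 127)
                count++;
            out[o++] = (unsigned char)count; /* repeat count, bit 7 = 0 */
            out[o++] = in[i];                /* character to repeat */
            i += count;
        }
    }
    return o;
}

size_t rle_decode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        if (in[i] & 0x80) {                  /* literal: strip the flag */
            out[o++] = in[i++] & 0x7F;
        } else {                             /* count, then character */
            unsigned char count = in[i++];
            unsigned char c = in[i++];
            while (count--)
                out[o++] = c;
        }
    }
    return o;
}
```

    Encoding "sssttttooooo" this way yields the six bytes 3,'s',4,'t',5,'o', matching the earlier example.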
    

    RLE encoding is easy to implement and, in many cases, very efficient. See "Run-Length Encoding Slashes Image-File Storage Requirements" by Richard Quinn, Personal Engineering & Instrumentation News, September 1990 for C code and more discussion of the subject.

    EDN ran a more comprehensive article in the June 21, 1990 issue: "Compression Algorithms Reduce Digitized Images to Manageable Size", by William Warner. Highly recommended for dealing with more complex image and transmission types.

    Huffman Coding

    As I mentioned, all data compression exploits some known characteristic of the data stream. In English text, for example, the letter "e" occurs much more often than "z", yet ASCII assigns exactly the same number of bits (7) to the two characters.

    In 1952 D.A. Huffman proposed using varying length codes to express characters. The resulting code is used by most text compression systems, including the Windows Help compiler. Again, since "e" is so common, an ideal Huffman code would assign it just a couple of bits, giving the rare letter "z" many more.

    Huffman codes allocate the shortest codes to the most likely characters, and the longest to those that are infrequent. The bad news: data transmitted so encoded will not be aligned on byte boundaries, since most characters will have far fewer than the standard 8 bits. To avoid ambiguity, each code must have the "prefix" property. That is, no code can occur as the first part of another, longer code. Thus, if the letter "e" were encoded as "01", then no other code could start with this sequence. This does allow the decoder to properly align on a character, but also somewhat lengthens the minimum code length.

    Figure 1 gives a set of Huffman codes for the alphabet. Note that commonly occurring letters have fewer bits than those less common.

    The compression process uses a so-called "dictionary lookup", nothing more than getting the character's code and size from a lookup table, and then dropping them into the transmission stream. Pseudo code might look like the following, assuming the codes exist in a structure <huffman>, which has elements <code> (left justified code as in figure 1), and <size>, which is the number of bits in the code.

    WHILE (there are more characters)
    {
       code=huffman[character].code ; get huffman code
       size=huffman[character].size    ; get code size
       i=0
       WHILE(i<size)
           {sendbit(code)		; send MSB of code
             shift code left 1 bit	; select next bit
             i=i+1			; increment # bits sent
           } ENDWHILE
    } ENDWHILE
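
    In C, that inner loop amounts to packing the left-justified code bits into an output buffer. This sketch (send_code is a hypothetical helper name; codes are stored left-justified in a byte, as in figure 1) assumes the output buffer starts zeroed:

```c
#include <assert.h>
#include <stddef.h>

/* Pack one left-justified Huffman code, MSB first, into a bit stream
   held in 'out' (which must start zeroed). 'bitpos' counts the bits
   written so far. */
void send_code(unsigned char *out, size_t *bitpos,
               unsigned char code, int size)
{
    int i;
    for (i = 0; i < size; i++) {
        if (code & 0x80)                              /* MSB of the code */
            out[*bitpos / 8] |= (unsigned char)(0x80 >> (*bitpos % 8));
        code <<= 1;                                   /* select next bit */
        (*bitpos)++;
    }
}
```

    Sending "set" with the figure 1 codes (s=0110, e=100, t=001, stored left-justified as 0x60, 0x80, 0x20) packs ten bits into two bytes. Note the output is not byte-aligned; the last byte is simply padded with zero bits.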
    

    Decoding the compressed transmission is a bit more complex, simply because the character's size isn't known until a match occurs. Most approaches break the Huffman code into a binary tree, depicted in a table with forward links to speed the searching. While the code is not terribly complex, building the table for efficient searching requires a bit of explanation, which I'll provide in next month's installment.

    For information about Huffman coding, consult "An Introduction to Data Compression" by Harold Corbin (April, 1981 Byte), or "Data Compression", by Gilbert Held, John Wiley & Sons, New York, 1983.

    There are many Huffman codes. Just because the average English-language tome contains some particular distribution of letters does not mean that all transmitted English data will be the same. Vast quantities of C code, for example, will show an unusually high preponderance of curly braces. Though many Huffman users ignore this and encode their data with the standard English-language probability distributions, it is sometimes possible to greatly improve the compression efficiency by computing new codes for characters based on the text block being transmitted. This is called Adaptive Compression. It requires more compute time to figure the codes, and uses more overhead in that the code table must be transmitted along with the compressed data, but can result in substantially better performance.

    Finally, it's fascinating to watch a child learn to read. A young one starts by clearly recognizing each letter, only laboriously assembling each into a word. The same sort of high-entropy effect occurs in Huffman encoding when we assign short codes to characters, but ignore the fact that English abounds with frequently-used phrases and words. In English text, "the", "to", and a number of other words are so common it makes sense to extend the compression beyond traditional character boundaries and assign a short code to entire common words. The decoder will recognize the word as a single entity with relatively few bits, just as a skilled reader recognizes words, ignoring details of the letters. In fact, few programmers bother as the increased computation time may not justify the results.

    Data Compression - Part 2

    Last month I introduced several schemes that compress text and binary data transmitted over a communications channel. All of the methods I covered were for lossless compression, which preserves every little nuance of the raw data. Lossy compression is a whole 'nother field, which it's hard to say much about unless one is willing to define exactly how much lack of fidelity is tolerable.

    Huffman codes are an easy and commonly used method to convert a string of data to tokens, each of whose length is inversely proportional to the frequency of the encoded character. For example, to transmit Huffman-encoded standard English text, the token corresponding to the letter "e" uses very few bits, as "e" is the most common character in the alphabet. The letter "z" will surely take many more bits to send, though will not impact the communications bandwidth much as it occurs so infrequently.

    Although I presented a frequency distribution table for standard English text last month, most compression schemes dynamically compute the frequency of characters from each data set. Surely you can see that transmitted C code, for instance, will have an unusually high number of curly brackets compared to the epic Beowulf. Even short segments of standard English text may have skewed distributions due to the material's subject matter or the writer's skills.

    It's easiest to visualize how the Huffman algorithm translates each character to a set of bits by using a binary tree. Indeed, most compression techniques use trees to simplify the code and to aid in building recursive algorithms.

    The figure is a portion of a typical binary tree used to build the compression dictionary (the list of codes for each character). Each character is depicted by the branches one descends to reach the character at the bottom. In this simplified tree the letter "e" uses only a single bit, a zero, for its encoding. "t" is 10; "m" is 11.

                 root node
                     |
             ___0____|____1___
             |               |
             E        ___0___|___1___
                      |             |
                      T             M
    

    Huffman compression consists of the following steps:

    1) Scan the input data and develop an array of the frequency of occurrence of each character. If you are compressing ASCII or other byte-wide data, the code looks something like:

     Repeat Till All Data Scanned
     { 
      lCount[input_char]++
     }
    
    

    The lCount array will have 256 elements, one per possible entry in the input character set. Be sure to use a data structure that will tolerate the maximum number of possible counts per character in the data stream. I like longs, as it's awfully hard to overflow them on any realistic data stream.

    It's a good idea to then reduce the array of longs to an array of bytes. Most likely you'll be handling lots of input data; it's foolish to do massive numbers of long compares when byte versions will suffice. Since there are only 256 possible characters, you can simply convert the array of counts to an array of relative frequencies, with 0 meaning the character was never seen and 255 meaning it is the most frequent.
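
    A sketch of that reduction step (scale_counts is a hypothetical name; counts near the top of the unsigned long range would overflow the multiply, so a production version should guard against that):

```c
#include <assert.h>

/* Reduce an array of long counts to byte-wide relative frequencies:
   0 means never seen, 255 marks the most frequent character, and any
   character actually seen is forced to at least 1. Beware: counts
   above ULONG_MAX/255 would overflow the multiply below. */
void scale_counts(const unsigned long lCount[256], unsigned char freq[256])
{
    unsigned long max = 1;
    int i;
    for (i = 0; i < 256; i++)
        if (lCount[i] > max)
            max = lCount[i];
    for (i = 0; i < 256; i++) {
        freq[i] = (unsigned char)((lCount[i] * 255UL) / max);
        if (lCount[i] != 0 && freq[i] == 0)
            freq[i] = 1;          /* seen characters must not vanish */
    }
}
```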

    2) Build the dictionary tree. I could use tree terminology here and most likely confuse at least half of you, Gentle Readers, so will try more of an illustrative approach so you can really see how the tree is built. The steps are as follows, and result in the tree shown in figure 2.

    2.1) Arrange the character set in an array ordered by decreasing frequency. Include the frequency (i.e., the count of the number of times each character occurred).

    2.2) Starting at the bottom, "connect" the two characters with the lowest frequency. Add up the counts for each and place these at the intersecting lines.

    2.3) Continue combining groups of two until all have merged into one.

    2.4) Assign a zero to one side of each nodal point and a one to the other (i.e., zero for left branches and one for right branches).

    The tree is all we need to compress input data, but it is cumbersome to use. Worse, we've built an upside-down tree, with the "input data" (the characters) at the leaf, or bottom, nodes. Since we'll presumably compress an input array that is much bigger than this small tree, it makes sense to add another step to build a simpler data structure. We really want an array of Huffman codes (with the associated number of bits per code), with one entry per character. In other words, we can traverse the tree once per possible character and log the resulting codes to an array HuffCode. HuffCode has 256 entries, and is a structure that includes two elements per entry: the code itself, and an integer indicating the number of bits used in each code.
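
    A compact C sketch of steps 2.1 through 2.4, plus the final traversal that fills the code array, might look like the following. Rather than literally sorting the characters, it repeatedly merges the two lowest-count roots, which amounts to the same thing; the function and structure names are mine, and the codes come out right-justified here (unlike the left-justified layout of figure 1):

```c
#include <assert.h>
#include <string.h>

#define NSYM 256

struct node { unsigned long count; int left, right, symbol; };
struct code { unsigned long code; int size; };

/* Node pool: up to NSYM leaves plus NSYM-1 internal nodes. */
static struct node nodes[2 * NSYM];

/* Walk the finished tree, accumulating bits (0 = left, 1 = right)
   until a leaf is reached, and record each symbol's code. */
static void walk(int n, unsigned long bits, int depth, struct code codes[NSYM])
{
    if (nodes[n].left < 0) {                  /* leaf node */
        codes[nodes[n].symbol].code = bits;
        codes[nodes[n].symbol].size = depth ? depth : 1;
        return;
    }
    walk(nodes[n].left,  bits << 1,       depth + 1, codes);
    walk(nodes[n].right, (bits << 1) | 1, depth + 1, codes);
}

/* Steps 2.1-2.4: build the tree by repeatedly merging the two
   lowest-frequency roots. Returns the root's node index, or -1. */
int huffman_build(const unsigned char freq[NSYM], struct code codes[NSYM])
{
    int active[NSYM];                         /* current tree roots */
    int i, n = 0, next = 0;

    for (i = 0; i < NSYM; i++) {
        if (freq[i] == 0) continue;           /* unseen: no leaf */
        nodes[next].count = freq[i];
        nodes[next].left = nodes[next].right = -1;
        nodes[next].symbol = i;
        active[n++] = next++;
    }
    if (n == 0) return -1;

    while (n > 1) {
        /* find the two lowest-count roots (step 2.2) */
        int lo1 = 0, lo2 = 1, j;
        if (nodes[active[lo2]].count < nodes[active[lo1]].count) {
            lo1 = 1; lo2 = 0;
        }
        for (j = 2; j < n; j++) {
            if (nodes[active[j]].count < nodes[active[lo1]].count) {
                lo2 = lo1; lo1 = j;
            } else if (nodes[active[j]].count < nodes[active[lo2]].count) {
                lo2 = j;
            }
        }
        /* merge them under a node holding their summed count (2.3) */
        nodes[next].count = nodes[active[lo1]].count
                          + nodes[active[lo2]].count;
        nodes[next].left = active[lo1];
        nodes[next].right = active[lo2];
        nodes[next].symbol = -1;
        active[lo1] = next++;                 /* new node replaces pair */
        active[lo2] = active[--n];
    }

    memset(codes, 0, NSYM * sizeof codes[0]);
    walk(active[0], 0, 0, codes);             /* step 2.4 + extraction */
    return active[0];
}
```

    With a toy distribution of e:50, t:30, m:20 this produces a one-bit code for "e" and two-bit codes for "t" and "m", as in the simplified tree above.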

    3) Send the tree to the receiver so that it can be used to decompress the data. Though you can simply transmit the entire tree structure, some people reduce it to bare elements to conserve transmission bandwidth. Much of the tree could be empty, depending on the input stream, in which case it's silly to add the extra null nodes when the whole point of the exercise is to reduce data!

    4) Compress each data byte. Now that the dictionary array has been built (HuffCode), transmit data a bit at a time as follows:

    WHILE (there are more characters)
    {
       code=HuffCode[character].code ; get huffman code
       size=HuffCode[character].size    ; get code size
       i=size
       WHILE(i)
           {sendbit(code)		; send MSB of code
             code=code<<1		; select next bit
             i=i-1			; decrement # bits left to send
           } ENDWHILE
    } ENDWHILE
    

    Huffman Expansion

    The receiver must have a copy of the compression tree so it can decode the bit stream properly. I suggested above that you transmit more or less the entire tree. Some folks send the linear dictionary array (HuffCode) instead, but the code needed to decompress a byte using only HuffCode is slow and bulky.

    Expansion using a tree is a more natural approach, particularly since you have no idea where the character boundaries are. A stream of rapid-fire bits, perhaps millions of them, will come shooting down the transmission line with no clear demarcation between characters. The tree captures both the characters' codes and the number of bits needed to represent each one.

    This makes decompression just a matter of traversing the tree. For each input bit descend the tree, branching to the left if the input bit is a zero, or to the right for a one. You'll know the expansion is complete when you hit a leaf node of the tree. The code to decompress a single character looks like this:

    TreeNode=RootNode
    Repeat Until TreeNode is a leaf
       {
         if (inputbit==0)
            TreeNode=TreeNode.left     ; 0: traverse left child branch
         else
            TreeNode=TreeNode.right    ; 1: traverse right child branch
       }
    Output(TreeNode)
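
    Here is the same traversal in C, using a small table-based tree (left/right of -1 marks a leaf) and pulling bits MSB-first from a packed buffer; the node layout and names are illustrative, not canonical:

```c
#include <assert.h>
#include <stddef.h>

/* One node of the decode tree; left/right of -1 marks a leaf. */
struct tnode { int left, right; int symbol; };

/* Walk the tree one bit at a time; *bitpos advances past the bits
   consumed, so successive calls decode successive characters. */
int decode_char(const struct tnode tree[], int root,
                const unsigned char *bits, size_t *bitpos)
{
    int n = root;
    while (tree[n].left >= 0) {       /* not yet at a leaf */
        int bit = (bits[*bitpos / 8] >> (7 - *bitpos % 8)) & 1;
        n = bit ? tree[n].right : tree[n].left;
        (*bitpos)++;
    }
    return tree[n].symbol;
}
```

    With the tiny tree of the figure (e=0, t=10, m=11) laid out as five table entries, the packed byte 0x58 (bits 0,10,11,0) decodes to "etme" after six bits.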
    

    It's a good idea to add code to print the intermediate data structures (e.g., the tree) for debugging purposes.

    Lempel/Ziv

    Why compress single characters when often text and binary data will contain long identical sequences of phrases? In English, the word "and" crops up constantly, yet always burns up four ASCII characters (including the leading space): 32 bits for a word rarely absent from a paragraph.

    Most of the familiar compression utilities like PKZIP and LHARC attempt to find common phrases rather than single characters. All of these algorithms trace their roots to work done by Lempel and Ziv in 1977 and later.

    The LZ77 approach, Lempel and Ziv's first shot at a new method of data compression, tries to compress a file by building a dictionary of entire phrases used before in the data being compressed. It maintains a sliding window, a long string (several thousand bytes) of data that has been recently processed, as well as a short "look ahead buffer". Every time it finds a phrase in the look-ahead buffer that matches a portion of the text in the sliding window, it emits a pointer to the corresponding phrase in the sliding window, as well as a number indicating how many characters matched, and what the first non-matching character was. The Lempel/Ziv philosophy is to assume that the current data string most likely looks a lot like what was most recently processed.

    The sliding window becomes a dictionary whose contents are always changing as it slides over the input data stream. The compressed output is a series of numbers telling the decoder where to look in the sliding window for the compressed phrase. The fact that the window slides, and is not stationary, keeps search times reasonable.
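
    To make the idea concrete, here's a toy C sketch of LZ77 tokenizing: it emits (offset back into the window, match length, next character) triples, using a 255-byte window and a 15-byte maximum match. All of these sizes, and the brute-force window search, are arbitrary simplifications; real implementations use hashing or trees to keep the search fast.

```c
#include <assert.h>
#include <string.h>

#define WINDOW 255   /* sliding-window size (a toy value) */
#define MAXLEN 15    /* maximum match length (look-ahead size) */

struct token { unsigned char offset, length, next; };

/* Emit (offset, length, first non-matching character) triples, per
   the LZ77 scheme described above. Returns the number of tokens. */
size_t lz77_encode(const unsigned char *in, size_t n, struct token *out)
{
    size_t i = 0, t = 0;
    while (i < n) {
        size_t best_len = 0, best_off = 0;
        size_t start = (i > WINDOW) ? i - WINDOW : 0, j;
        for (j = start; j < i; j++) {             /* scan the window */
            size_t len = 0;
            while (len < MAXLEN && i + len + 1 < n &&
                   in[j + len] == in[i + len])
                len++;
            if (len > best_len) { best_len = len; best_off = i - j; }
        }
        out[t].offset = (unsigned char)best_off;
        out[t].length = (unsigned char)best_len;
        out[t].next   = in[i + best_len];         /* literal that follows */
        t++;
        i += best_len + 1;
    }
    return t;
}

/* Rebuild the data by copying matched phrases back out of the
   already-decoded output, then appending each token's literal. */
size_t lz77_decode(const struct token *tok, size_t ntok, unsigned char *out)
{
    size_t i, o = 0;
    for (i = 0; i < ntok; i++) {
        size_t k, src = o - tok[i].offset;
        for (k = 0; k < tok[i].length; k++)       /* overlap-safe copy */
            out[o++] = out[src + k];
        out[o++] = tok[i].next;
    }
    return o;
}
```

    Note that a match may legally extend into the look-ahead itself (the byte-at-a-time copy in the decoder handles the overlap), which is how LZ77 compresses long runs as a side effect.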

    The LZ77 algorithm and its more recent versions can sometimes produce tremendous compression ratios for embedded instrumentation transmitting more or less the same sorts of data over and over. Given some a priori knowledge of your instrument's output, it may be possible to choose an LZ sliding window and look-ahead buffer size that greatly exploits the repeating nature of the data.

    Be wary! Several variants of the LZ algorithms have been patented, and commercial use of these may require the payment of non-trivial royalties. It seems appalling to have to say "consult your attorney before writing code", but perhaps there is no other choice. For this reason alone I stay with the standard Huffman techniques, even though not as efficient as the LZ algorithms.

    Resources

    You can play with all of these techniques on a PC for non-commercial purposes using the code in "The Data Compression Book", by Mark Nelson (M&T Books, San Mateo, CA, 1992 800-533-4372 (in CA 800-356-2002)). It comes with a pair of disks that contain C source code for the Huffman, LZ77, and many other compression/expansion schemes. I highly recommend it.

    Also, refer to the May, 1986 Byte for more information about handling the Huffman trees. See "Data Compression with Huffman Coding", by Jonathan Amsterdam, pgs 99-108.

    Figure 1: A Set of Huffman Codes

    
    e 100
    t 001
    a 1111
    o 1110
    n 1100
    r 1011
    i 1010
    s 0110
    h 0101
    d 11011
    l 01111
    f 01001
    c 01000
    m 00011
    u 00010
    g 00001
    y 00000
    p 110101
    w 011101
    b 011100
    v 1101001
    k 110100011
    x 110100001
    j 110100000
    q 1101000101
    z 1101000100