Computing CRCs in Parallel

The CRC is the best way to guage data comm. Here's how to build one in an FPGA.

Published in Circuit Cellar Ink, September, 1995.

For novel ideas about building embedded systems (both hardware and firmware), join the 40,000+ engineers who subscribe to The Embedded Muse, a free biweekly newsletter. The Muse has no hype and no vendor PR. Click here to subscribe.

By Jack Ganssle

The CRC (Cyclic Redundancy Check) is a sophisticated checksum that has long been the most common means of testing data for correctness. Every sector on a disk has a CRC of the data stored, so the operating system will be aware of data dropouts. A CRC doesn't identify which byte is in error, but it pretty much guarantees that you'll be alerted to at least single bit errors.

All CRCs are binary polynomials that are divided into the data stream. Although the definition and selection of the CRC is quite complex, its use is not.

The most common CRC polynomial is the CCITT form used by IBM's SDLC protocol. It is of the form:

X**16 + X**12 + X**5 + 1

This means that the input data stream (say, the data being written to the disk drive) is exclusive ORed into a 16 bit shift register that has feedback terms at the bit locations with coefficients in the formula, one bit at a time. Each register bit is a function of the CRC to that point, the input data, and the feedback taps.

In most systems data is transferred as a serial bit stream. Floppy and hard disks, as well as the newer optical disks, all write a single bit at a time to the medium. Modem applications are also bit oriented. This is fortunate, since the CRC is particularly well suited to serial data transfers. Serial shift registers, even with the feedback terms, are easy to implement in silicon. Fairchild's 74F401 chip is a 14 pin package that will compute CRCs on serial data using any of eight different polynomials. Many floppy disk controllers automatically append a CRC to the data stream. With this technology CRCs are almost trivial to add to a system.

If the peripheral is a parallel device, the problem becomes much more complex. Think about it: the CRC is computed by eight shifts per byte, each shift resulting in the exclusive ORing of a number of terms within the byte. After 8 of these operations the output is far from obvious. How do you compute a CRC on 8 bits at once?

The quest for speed is bringing more parallel devices into the mainstream. An obvious example is a large RAM disk. Most implementations make the hardware look just like a hard disk, so the operating system can handle the device without special drivers. Other peripherals may be connected using SCSI, which almost always is designed to emulate a disk. Whenever the operating expects to see a disk, it will also expect a CRC.

How do you implement a CRC in a purely parallel interface? An obvious approach is to convert the data to serial, compute the CRC, and convert it back to parallel. Although conceptually easy, fast data transfers will require a mind-boggling clock rate (at least 8 times the data rate), and a lot of hardware.

The new CRC (after a byte is shifted in) is a function of the input data and the old CRC. Why not derive formulas for each of the 16 bits of the CRC after the 8 new bits are shifted in? Then all 16 CRC bits can be computed in a single clock cycle.

Deriving the formulas is conceptually easy but rather tedious. If the input data is b0, b1, ... b7, and the current CRC is a0, a1, ... a15, then compute the new CRC after one bit arrives. Iterate seven more times, using new values of a0, a1, ... a15 each time. You'll get 16 new equations each time a new bit is shifted in. The last set of 16 is the result; these define the values of each bit after an entire byte is CRCed. Apply these 16 equations to a byte of data to compute a new CRC in a single cycle.

It turns out that a number of terms are common to many of the 16 equations. To simplify the algorithm the common terms are identified as I, J, K, L, M, N, O, and P. Figure 1 shows a program written in Turbo-C to compute the CRC using the derived formulas.

This program looks horrible! It is far more cumbersome and complicated than a loop to do the normal serial CRC calculation. Why go through all of this grief for so little reward?

The form of the program closely resembles that of the equations needed to define a Programmable Logic Device (PLD). In other words, there is a very close equivalence between the code and a hardware implementation of the algorithm. Fortunately, the PLD version can operate in a single clock cycle, and is much more ascetically appealing than its awkward software cousin.

The PLD

For the uninitiated, a PLD is a sort of "super PAL". Based on EPROM technology, the PLD is a device that can be programmed with a user's equations. Like an EPROM, a fairly simple device programmer is used to load the formulas.

PLDs are defined in terms of "macrocells". Every macrocell is a multiple-input OR gate optionally connected to a flip-flop. Each of the OR gate inputs is an extremely wide AND gate; any or all of the PLD pins (and their inversions) can be connected to the AND gates.

You can specify the type of flip-flop; typically D, T, JK, and SR versions are all available. Or, you can disable the register altogether, so the macrocell becomes a big combinatorial gate structure.

A PLD is therefore a very large collection of AND and OR gates with some internal registers also thrown in. The connection of these components is up to the user; you write a set of boolean equations that makes the PLD implement the a circuit you need.

Intel's 5C090 (equivalent to Altera's EP900) is a 40 pin PLD with 24 macrocells. It contains enough logic to completely implement the CRC algorithm in parallel.

Putting the CRC in a PLD

Examining the program, it quickly becomes apparent that the intermediate I through P terms must be computed before any of the a0 - a15 outputs, but once I-P are known, then all 16 a0 - a15 terms could be computed simultaneously. We should therefore assign the I - P terms to combinatorial outputs in the PLD. The a0 to a15 terms (the CRC itself) are assigned to registered outputs. In this case the current CRC is needed as part of the new one; therefore the flip-flops' output will be fed back into the equation matrix.

It's pretty easy to translate the Turbo-C program into PLD equations. This was my intention when writing the code; the real purpose of the program is to provide a simulation of the CRC function. Figure 2 shows the file that defines the PLD.

For those unfamiliar with PLD design, the NETWORK section of the PLD file defines the nature of each of the device's pins. b0 to b7 are input bits. I to P are combinatorial outputs. The a0 to a15 (CRC) terms are defined as RORF (Registered Output, Registered Feedback). The terms are latched in D flip-flops, but the current value (before the device is clocked) goes back into the equation matrix.

Figure 3 shows the connection of the PLD to a computer's data bus. Most data communications processors manipulate 8 bit data, so an 8 bit data bus is shown. The input data is a byte, but the output CRC is 16 bits. This PLD is designed to let us multiplex the CRC onto the bus as two individual bytes. The input bits come from the same data bus. b0 is thus tied to a0 and a8, b1 to a1 and a9, etc.

The LOCRC and HICRC inputs are used to dump the upper or lower CRC byte onto the bus. Once the calculation is complete, the processor asserts these inputs low (one at a time) and reads the two bytes. Typically these inputs are connected as input port selects.

The PRESET input is used to initialize the CRC before any data is transferred. The CCITT specification requires that the CRC be initialized to all ones. When asserted, PRESET will load a one into all of the a0-a15 terms after the next clock cycle.

Be warned! Your interface circuit must assert clock when preset is also low. Clock is high-edge triggered.

To use the device, first initialize the CRC by asserting PRESET low and driving clock high. Then transfer data one byte at a time. For each byte, drive clock high once the input data is stable, and after the data has been present long enough to let the I to P terms propagate (typically 100 nsec). The PLD will compute a new value for the CRC and load it in the a0-a15 registers when clock goes high. After a block is transferred, the CRC can then be read by the computer.

This is a pretty complex series of equations. How can you be sure the circuit works correctly? One way is to compare the CRC to known good values after specific data is transferred. Table 1 shows CRC values for an input data stream of successive zeroes. You can check the CRC after each clock against this table.

The CCITT polynomial has a self-checking property. Once a stream of data has gone through the CRC calculator, you can supply the one's complement of the resulting CRC to the PLD and always get F0B8 as the new value. Always. This serves as a sanity check when developing hardware, and as a quick way of verifying data against a previously computed CRC during read of a device. Be sure to transfer the low part of the CRC first, and then the high byte when making this test.

The PLD lets you compute a CRC in parallel using only a single IC package. It can support a data rate of at least 10 Mhz, even using relatively slow devices. Although the equations are messy and tedious, you can use the ones presented here as a "cookbook" solution.

-----------------------------------------------------------
/*	compute a parallel CRC */
#include <stdio.h>
int I,J,K,L,M,N,O,P;
int b0,b1,b2,b3,b4,b5,b6,b7;
int a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15;
int iter;
main()
{
int k0,k1,k2,k3,k4,k5,k6,k7;
/* Preset CRC to all ones - CCITT rules */
	a0=a1=a2=a3=a4=a5=a6=a7=a8=a9=a10=a11=a12=a13=a14=a15=1;
/* Compute a number of CRCs with input data of all zeroes */
	b0=b1=b2=b3=b4=b5=b6=b7=0;
	for(iter=0; iter<10; ++iter)crc();
/* Now feed ones complement of the CRC back into calculation;
   result should be F0B8 */
	b0=!a0; b1=!a1; b2=!a2; b3=!a3; b4=!a4; b5=!a5; b6=!a6; b7=!a7;
	k0=!a8; k1=!a9; k2=!a10; k3=!a11; k4=!a12; k5=!a13; k6=!a14; k7=!a15;
	crc();
	b0=k0; b1=k1; b2=k2; b3=k3; b4=k4; b5=k5; b6=k6; b7=k7;
	crc();
}
crc()
{
	I=(b3 & !a3) | (!b3 & a3);
	J=(b2 & !a2) | (!b2 & a2);
	K=(b1 & !a1) | (!b1 & a1);
	L=(b0 & !a0) | (!b0 & a0);
	M=  b7 & !a7 & !I
	  | !b7 & !a7 &  I
	  | !b7 &  a7 & !I
	  |  b7 &  a7 &  I;
	N=  b6 & !a6 & !J
	  | !b6 & !a6 &  J
	  | !b6 &  a6 & !J
	  |  b6 &  a6 &  J;
	O=  b5 & !a5 & !K
	  | !b5 & !a5 &  K
	  | !b5 &  a5 & !K
	  |  b5 &  a5 &  K;
	P=  b4 & !a4 & !L
	  | !b4 & !a4 &  L
	  | !b4 &  a4 & !L
	  |  b4 &  a4 &  L;
	a7=(a15 & !P) | (!a15 & P);
	a6=(a14 & !I) | (!a14 & I);
	a5=(a13 & !J) | (!a13 & J);
	a4=(a12 & !K) | (!a12 & K);
	a3=  a11 & !L & !M
	  | !a11 & !L &  M
	  | !a11 &  L & !M
	  |  a11 &  L &  M;
	a2=(a10 & !N) | (!a10 & N);
	a1=(a9  & !O) | (!a9  & O);
	a0=(a8  & !P) | (!a8  & P);
	a15=M;
	a14=N;
	a13=O;
	a12=P;
	a11=I;
	a10=(J & !M) | (!J & M);
	a9= (K & !N) | (!K & N);
	a8= (L & !O) | (!L & O);
	printf("\nCRC=   %x%x%x%x %x%x%x%x %x%x%x%x %x%x%x%x",
		 a15,a14,a13,a12,a11,a10,a9,a8,a7,a6,a5,a4,a3,a2,a1,a0);
}
FIGURE 1: PROGRAM TO COMPUTE CRC IN PARALLEL
-----------------------------------------------------------
Designer: Jack G. Ganssle
Company: Softaid, Inc.
Date: March 17, 1989
Number: 1
Revision:
EPLD: 5C090
Comments: This file computes a CCITT-SDLC CRC in parallel
OPTIONS: TURBO=ON
PART: 5C090
INPUTS: b0@2,b1@3,b2@4,b3@17,b4@18,b5@19,b6@22,b7@23,preset@38,
        hicrc@37,locrc@24,clk1@1,clk2@21
OUTPUTS:a0@5,a1@7,a2@9,a3@11,a4@13,a5@15,a6@25,a7@27,
        a8@6,a9@8,a10@10,a11@12,a12@14,a13@16,a14@26,a15@28,
        I@32,J@31,K@30,L@29,M@36,N@33,O@34,P@35
NETWORK:
b0=INP(b0)
b1=INP(b1)
b2=INP(b2)
b3=INP(b3)
b4=INP(b4)
b5=INP(b5)
b6=INP(b6)
b7=INP(b7)
preset=INP(preset)
losel=INP(locrc)
hisel=INP(hicrc)
clk1=INP(clk1)
clk2=INP(clk2)
a0,a0=RORF(a0d,clk1,gnd,gnd,lo)
a1,a1=RORF(a1d,clk1,gnd,gnd,lo)
a2,a2=RORF(a2d,clk1,gnd,gnd,lo)
a3,a3=RORF(a3d,clk1,gnd,gnd,lo)
a4,a4=RORF(a4d,clk1,gnd,gnd,lo)
a5,a5=RORF(a5d,clk1,gnd,gnd,lo)
a6,a6=RORF(a6d,clk2,gnd,gnd,lo)
a7,a7=RORF(a7d,clk2,gnd,gnd,lo)
a8,a8=RORF(a8d,clk1,gnd,gnd,hi)
a9,a9=RORF(a9d,clk1,gnd,gnd,hi)
a10,a10=RORF(a10d,clk1,gnd,gnd,hi)
a11,a11=RORF(a11d,clk1,gnd,gnd,hi)
a12,a12=RORF(a12d,clk1,gnd,gnd,hi)
a13,a13=RORF(a13d,clk1,gnd,gnd,hi)
a14,a14=RORF(a14d,clk2,gnd,gnd,hi)
a15,a15=RORF(a15d,clk2,gnd,gnd,hi)
L,L=COIF(Ld,)
K,K=COIF(Kd,)
J,J=COIF(Jd,)
I,I=COIF(Id,)
P,P=COIF(Pd,)
O,O=COIF(Od,)
N,N=COIF(Nd,)
M,M=COIF(Md,)
EQUATIONS:
lo=/losel;
hi=/hisel;
Id=(b3 * /a3) + (/b3 * a3);
Jd=(b2 * /a2) + (/b2 * a2);
Kd=(b1 * /a1) + (/b1 * a1);
Ld=(b0 * /a0) + (/b0 * a0);
Md=  b7 * /a7 * /I
  + /b7 * /a7 *  I
  + /b7 *  a7 * /I
  +  b7 *  a7 *  I;
Nd=  b6 * /a6 * /J
  + /b6 * /a6 *  J
  + /b6 *  a6 * /J
  +  b6 *  a6 *  J;
Od=  b5 * /a5 * /K
  + /b5 * /a5 *  K
  + /b5 *  a5 * /K
  +  b5 *  a5 *  K;
Pd=  b4 * /a4 * /L
  + /b4 * /a4 *  L
  + /b4 *  a4 * /L
  +  b4 *  a4 *  L;
a7d=(a15 * /P) + (/a15 * P)
  + /preset;
a6d=(a14 * /I) + (/a14 * I)
  + /preset;
a5d=(a13 * /J) + (/a13 * J)
  + /preset;
a4d=(a12 * /K) + (/a12 * K)
  + /preset;
a3d=  a11 * /L * /M
  + /a11 * /L *  M
  + /a11 *  L * /M
  +  a11 *  L *  M
  + /preset;
a2d=(a10 * /N) + (/a10 * N)
  + /preset;
a1d=(a9  * /O) + (/a9  * O)
  + /preset;
a0d=(a8  * /P) + (/a8  * P)
  + /preset;
a15d=M
  + /preset;
a14d=N
  + /preset;
a13d=O
  + /preset;
a12d=P
  + /preset;
a11d=I
  + /preset;
a10d=(J * /M) + (/J * M)
  + /preset;
a9d= (K * /N) + (/K * N)
  + /preset;
a8d= (L * /O) + (/L * O)
  + /preset;
END$
FIGURE 2: PLD FILE FOR A PARALLEL CRC
----------------------------------------------------------
Input data	Output CRC
preset		FFFF
0			0F87
0			F0B8
0			3933
0			0321
0			3088
77			0F48
CF			F0B8
(Note that feeding one's complement of CRC into the PLD yields
F0B8)
TABLE 1: SAMPLE CRC VALUES